Debug a Pacemaker cluster easily

Debugging a Pacemaker cluster through its logs will be much easier after reading this article. It will help you with troubleshooting as well as root cause analysis, and it needs no additional tools beyond your attention. I have reproduced various scenarios and collected the resulting logs to present here as examples. The testing was done on SUSE Linux Enterprise 11 with the High Availability Extension (HAE).

The first thing to do is to look in the system log for the terms ERROR and WARN:

#grep -e ERROR -e WARN /var/log/messages

If nothing looks relevant there, check the logs from crmd:

#grep -e 'crmd\[' -e 'crmd:' /var/log/messages

As you know, “crmd” is the master brain behind the cluster. Reviewing the logs emitted by “crmd” is usually sufficient on its own. Detailed information about the structure of Pacemaker log entries can be found here.
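
For quick reference, every crmd entry quoted below follows the same shape once the syslog timestamp and hostname prefix are stripped (this breakdown is based on the examples in this article):

daemon[pid]:   loglevel: function: message

For instance, in “crmd[1811]: notice: te_fence_node: Executing reboot fencing operation …”, crmd is the daemon, 1811 is its PID, notice is the log level, te_fence_node is the internal function that emitted the entry, and the rest is the message itself.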

Stop & Note
  • For node failures, you’ll always want the logs from the DC (or the node that became the DC); a quick way to find the current DC is shown right after this list.
  • For resource failures, you’ll want the logs from the DC and the node on which the resource failed.
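
Since both notes point you at the DC, here is one way to find out which node currently holds that role (a minimal sketch; crmadmin and crm_mon ship with Pacemaker, and the output wording varies slightly between versions):

#crmadmin -D
#crm_mon -1 | grep -i "current dc"

The first command asks the cluster directly which node is the DC; the second pulls the “Current DC” line out of the status output.
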
Op: Node Left/Join

Log entries like,

crmd[1811]:   notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now lost (was member)

-indicates a node is no longer part of the cluster (either because it failed or was shut down)

crmd[1811]:   notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now member (was lost)

-indicates a node has (re)joined the cluster
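
To pull every membership change out of the log in one pass, the same grep approach works (a minimal sketch; the function name is the one from the entries above):

#grep -e crm_update_peer_state /var/log/messages

Each matching line tells you which node changed state, what the new state is, and what it was before.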

Op: Internal recovery
crmd[1811]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE ...
crmd[1811]:   notice: run_graph: Transition 2 (... Source=/var/lib/pacemaker/pengine/pe-input-473.bz2): Complete
crmd[1811]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE ...

-indicates recovery was attempted
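
The run_graph line above also names the policy engine input file the transition was computed from. If that file is still on disk, you can replay the decision offline with crm_simulate (a sketch; use the pe-input file named in your own log, the path below is just the example from this article):

#crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-473.bz2

This shows what the cluster intended to do in that transition without touching the live cluster.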

Op: Monitoring resource
crmd[1811]:   notice: te_rsc_command: Initiating action 36: monitor www_monitor_0 on corosync-host-5
crmd[1811]:   notice: te_rsc_command: Initiating action 54: monitor mysql_monitor_10000 on corosync-host-4

-indicates that a resource action was initiated. In the lines above, the cluster is probing the current status of the www resource on corosync-host-5 (the one-off monitor_0 action) and starting a recurring health check for mysql on corosync-host-4 (monitor_10000, i.e. every 10000 ms / 10 seconds).
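
If you are chasing one particular resource, filtering the transition engine messages for its name narrows things down quickly (a sketch; replace mysql with your own resource id):

#grep -e te_rsc_command /var/log/messages | grep mysql

This lists every action the cluster initiated for that resource, including the one-off probes (monitor_0) and the recurring monitors (monitor_10000) shown above.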

Op: Fence given node
crmd[1811]:   notice: te_fence_node: Executing reboot fencing operation (83) on corosync-host-8 (timeout=60000)

-indicates that we are attempting to fence corosync-host-8.

crmd[1811]:   notice: tengine_stonith_notify: Peer corosync-host-8 was terminated (st_notify_fence) by corosync-host-1 for corosync-host-1: OK

-indicates that corosync-host-1 successfully fenced corosync-host-8.
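
To collect the whole fencing story for a node from the DC’s log, grep for both the initiating and the result messages (a sketch; the function names are taken from the lines above, and the hostname is just the example node):

#grep -e te_fence_node -e tengine_stonith_notify /var/log/messages | grep corosync-host-8

On reasonably recent Pacemaker versions, “stonith_admin -H corosync-host-8” can also be used to query the fencing history for that node directly.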

– Please leave a comment if you have any questions. Have a great day!
