Debug a Pacemaker cluster easily

Debugging a Pacemaker cluster through its logs will be much easier after reading this article. It will help you with troubleshooting as well as root cause analysis, and it needs no additional tools beyond your attention. I have reproduced various scenarios and collected the resulting logs to present here as examples. The testing was done on SUSE Linux Enterprise 11 with the High Availability Extension (HAE).

The first thing to do is to look in the system log for the terms ERROR and WARN:

#grep -e ERROR -e WARN /var/log/messages

If nothing looks relevant there, check the logs from crmd:

#grep -e 'crmd\[' -e 'crmd:' /var/log/messages

As you know, “crmd” is the master brain behind the cluster. Reviewing the logs emitted by “crmd” is usually sufficient on its own. Detailed information about the structure of Pacemaker log entries can be found here.
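
For quick reference, every crmd entry quoted below follows the same shape once the syslog timestamp and hostname prefix are stripped (this breakdown is based on the examples in this article):

daemon[pid]:   loglevel: function: message

For instance, in “crmd[1811]: notice: te_fence_node: Executing reboot fencing operation …”, crmd is the daemon, 1811 is its PID, notice is the log level, te_fence_node is the internal function that emitted the entry, and the rest is the message itself.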

Stop & Note
  • For node failures, you’ll always want the logs from the DC (or the node that became the DC); a quick way to find the current DC is shown right after this list.
  • For resource failures, you’ll want the logs from the DC and the node on which the resource failed.
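
Since both notes point you at the DC, here is one way to find out which node currently holds that role (a minimal sketch; crmadmin and crm_mon ship with Pacemaker, and the output wording varies slightly between versions):

#crmadmin -D
#crm_mon -1 | grep -i "current dc"

The first command asks the cluster directly which node is the DC; the second pulls the “Current DC” line out of the status output.
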
Op: Node Left/Join

Log entries like,

crmd[1811]:   notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now lost (was member)

-indicates a node is no longer part of the cluster (either because it failed or was shut down)

crmd[1811]:   notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now member (was lost)

-indicates a node has (re)joined the cluster
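
To pull every membership change out of the log in one pass, the same grep approach works (a minimal sketch; the function name is the one from the entries above):

#grep -e crm_update_peer_state /var/log/messages

Each matching line tells you which node changed state, what the new state is, and what it was before.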

Op: Internal recovery
crmd[1811]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE ...
crmd[1811]:   notice: run_graph: Transition 2 (... Source=/var/lib/pacemaker/pengine/pe-input-473.bz2): Complete
crmd[1811]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE ...

-indicates recovery was attempted
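
The run_graph line above also names the policy engine input file the transition was computed from. If that file is still on disk, you can replay the decision offline with crm_simulate (a sketch; use the pe-input file named in your own log, the path below is just the example from this article):

#crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-473.bz2

This shows what the cluster intended to do in that transition without touching the live cluster.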

Op: Monitoring resource
crmd[1811]:   notice: te_rsc_command: Initiating action 36: monitor www_monitor_0 on corosync-host-5
crmd[1811]:   notice: te_rsc_command: Initiating action 54: monitor mysql_monitor_10000 on corosync-host-4

-indicates that a resource action was initiated. In the lines above, the cluster is probing the current status of the www resource on corosync-host-5 (the one-off monitor_0 action) and starting a recurring health check for mysql on corosync-host-4 (monitor_10000, i.e. every 10000 ms / 10 seconds).
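
If you are chasing one particular resource, filtering the transition engine messages for its name narrows things down quickly (a sketch; replace mysql with your own resource id):

#grep -e te_rsc_command /var/log/messages | grep mysql

This lists every action the cluster initiated for that resource, including the one-off probes (monitor_0) and the recurring monitors (monitor_10000) shown above.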

Op: Fence given node
crmd[1811]:   notice: te_fence_node: Executing reboot fencing operation (83) on corosync-host-8 (timeout=60000)

-indicates that we are attempting to fence corosync-host-8.

crmd[1811]:   notice: tengine_stonith_notify: Peer corosync-host-8 was terminated (st_notify_fence) by corosync-host-1 for corosync-host-1: OK

-indicates that corosync-host-1 successfully fenced corosync-host-8.
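
To collect the whole fencing story for a node from the DC’s log, grep for both the initiating and the result messages (a sketch; the function names are taken from the lines above, and the hostname is just the example node):

#grep -e te_fence_node -e tengine_stonith_notify /var/log/messages | grep corosync-host-8

On reasonably recent Pacemaker versions, “stonith_admin -H corosync-host-8” can also be used to query the fencing history for that node directly.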

– Please leave a comment if you have any questions. Have a great day!
