The Linux kernel remounts a filesystem in read-only mode when it can no longer process I/O to it. This can happen for various reasons, such as disk failure, a SAN connectivity issue, or a disk with bad blocks. In virtual machine and SAN-based storage environments, even high latency can cause I/O to hang and result in read-only mode.
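A quick way to confirm whether a filesystem has ended up read-only is to look at the mount options the kernel reports. The following is a minimal sketch, assuming the usual /proc/mounts layout (device, mount point, type, options); the function name list_ro_mounts is only illustrative.

```shell
#!/bin/sh
# list_ro_mounts [FILE]
# Print "device mountpoint fstype" for every entry whose option list
# contains the standalone flag "ro". FILE defaults to /proc/mounts.
list_ro_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $1, $2, $3; break }
    }' "${1:-/proc/mounts}"
}

# Report the read-only mounts of the running system, if available.
if [ -r /proc/mounts ]; then
    list_ro_mounts
fi
```

Options are compared one by one so that a setting such as errors=remount-ro on a healthy read-write mount is not mistaken for a read-only filesystem.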
Debugging a Pacemaker cluster through its logs will be much easier after reading this article. It should help with both troubleshooting and root cause analysis, and it needs no additional tools, only your attention. I have reproduced various scenarios and collected the logs to present here as examples. I used SUSE 11 with HAE for testing.
The first thing to do is look in the system log for the terms ERROR and WARN.
#grep -e ERROR -e WARN /var/log/messages
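Before reading individual lines, it can help to see how many such messages there are. A minimal sketch, assuming the log lives at /var/log/messages as in the command above; the function name count_cluster_alerts is only illustrative.

```shell
#!/bin/sh
# count_cluster_alerts [LOGFILE]
# Print how many lines of LOGFILE contain ERROR and how many contain WARN.
# LOGFILE defaults to /var/log/messages (the path used in this article).
count_cluster_alerts() {
    logfile=${1:-/var/log/messages}
    for level in ERROR WARN; do
        # grep -c prints the number of matching lines (0 when none match)
        printf '%s %s\n' "$level" "$(grep -c -e "$level" "$logfile")"
    done
}
```

Like the original grep, the WARN pattern also matches lines that say WARNING.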
How does a Pacemaker Corosync cluster operate? What is the relationship between Pacemaker and Corosync? The functionality and concepts are explained here. SUSE Linux high availability and Red Hat high availability mainly use a Pacemaker Corosync cluster. Other distributions such as Ubuntu and Debian also use Pacemaker with Corosync as their high availability solution.
The SUSE cluster log is a single file that holds the messages of the cluster engine and all of its components. Every operation in the cluster writes plenty of information to this log file. While the cluster runs smoothly, no one cares about it. But what if you run into an issue? A Linux cluster executes a number of processes every second, and a simple resource monitoring command cannot trace the whole operation history.
Knowing the various parts of the messages written to the SUSE cluster log file helps a lot during troubleshooting. The messages written to the log, along with their error codes, are predefined. You may also see output of
Many of us confuse Corosync with Heartbeat. This article describes the functionality of both.
It helps you compare Corosync vs Heartbeat.
Where should you use Corosync, and where Heartbeat?
Which cluster engines are supported by Corosync and Heartbeat?
The Corosync Cluster Engine is a Group Communication System with additional features for implementing high availability within applications. The project provides four C Application Programming Interface features:
• A closed process group communication model with virtual synchrony guarantees for creating replicated state machines.
• A simple availability manager that restarts the application process when it has failed.
• A configuration and statistics in-memory database that provide the ability to set, retrieve, and receive change notifications of information.
• A quorum system that notifies applications when quorum is achieved or lost.
Corosync is used as a High Availability framework by projects such as Apache Qpid and Pacemaker.
Heartbeat is a daemon that provides cluster infrastructure (communication and membership) services to its clients. This allows clients to know about the presence (or disappearance!) of peer processes on other machines and to easily exchange messages with them.
In order to be useful to users, the Heartbeat daemon needs to be combined with a cluster resource manager (CRM), which has the task of starting and stopping the services (IP addresses, web servers, etc.) that the cluster will make highly available. Pacemaker is the preferred cluster resource manager for clusters based on Heartbeat.
Sometimes cib.conf (the Pacemaker cluster configuration file) accumulates stray white space. The running live copy shows no impact, but the real headache starts when cib.conf is closed and reopened, that is, when the cluster service is completely stopped on all nodes and started again.
During cluster service startup, an md5 checksum of the file is calculated and compared with the one stored on the system. The two no longer match, so a mismatch error is reported and the service fails to start.
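Why a few blank characters are enough to break startup can be seen with a generic digest check. This is only an illustration using the md5sum utility, not the way Pacemaker computes its digest internally; the helper names file_md5 and digest_matches are hypothetical.

```shell
#!/bin/sh
# file_md5 FILE: print only the md5 digest of FILE.
file_md5() {
    md5sum < "$1" | awk '{print $1}'
}

# digest_matches FILE EXPECTED: succeed only if FILE still hashes to EXPECTED.
digest_matches() {
    [ "$(file_md5 "$1")" = "$2" ]
}
```

Record a digest with saved=$(file_md5 somefile), append a single space to the file, and digest_matches somefile "$saved" starts failing, even though the configuration means the same thing.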
As a temporary fix, remove the white space in cib.conf using the command below.
#cibadmin -Q -o configuration | sed 's/^\s*//' | sed 's/\s*$//' | tr -d '\n' | sed 's/ /\\n/g'| xmllint --copy - | cibadmin -R -o configuration -p
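The first two sed stages of that pipeline only trim the edges of each line. To see their effect on sample text before touching the live configuration, they can be tried in isolation. A minimal sketch; the function name strip_edge_whitespace is only illustrative.

```shell
#!/bin/sh
# strip_edge_whitespace: read text on stdin and remove leading and trailing
# white space from every line. [[:space:]] is the portable spelling of the
# GNU sed shorthand \s used in the command above.
strip_edge_whitespace() {
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}
```

For example, printf '   <nvpair id="a"/>  \n' | strip_edge_whitespace prints the element with nothing before or after it on the line.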
You must then recreate the md5 checksum value to ensure safe cluster operation.