Hi,
In this blog post, we will discuss cluster parameters, create a service, and test failover. Fencing is an important part of setting up a production HA cluster; for simplicity, it is disabled in this example. We will discuss fencing in the next blog post.
First, let me disable the fencing feature by running the following command:
sudo pcs property set stonith-enabled=false
Fencing helps protect your data from being corrupted by nodes that might be failing or unavailable. Pacemaker uses the term STONITH (shoot the other node in the head) to describe fencing options. The configuration depends on your particular hardware and a deeper understanding of the fencing process; for that reason, fencing is disabled in this example.
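To confirm that the property took effect, you can list the cluster properties; depending on your pcs version, the command is one of the following, and the output should include stonith-enabled: false:
sudo pcs property config
sudo pcs property list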
Optionally, configure the cluster to ignore the quorum state by running the following command:
sudo pcs property set no-quorum-policy=ignore
Because this example uses a two-node cluster, setting the no-quorum policy to ignore makes the most sense, as quorum technically requires a minimum of three nodes to be a viable configuration. Quorum is only achieved when more than half of the nodes agree on the status of the cluster. In the current release of Corosync, two-node clusters are treated as a special case: the quorum value is artificially set to 1 so that the primary node is always considered in quorum. If a network outage causes the two nodes to lose contact with each other for a period, the nodes race to fence each other and the first to succeed wins quorum.
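For reference, this special two-node handling corresponds to the two_node directive in the quorum section of /etc/corosync/corosync.conf. A minimal sketch of what that section typically looks like (your generated file may differ):
quorum {
    provider: corosync_votequorum
    two_node: 1
}
You can also check the live quorum state at any time with sudo corosync-quorumtool -s.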
Next, let me configure a migration policy by running the following command:
sudo pcs resource defaults update migration-threshold=1
Running this command configures the cluster to move the service to a new node after a single failure.
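To verify the defaults that are now in place, you can print them back out; on newer pcs releases the command is the first one below, while older releases use the second:
sudo pcs resource defaults config
sudo pcs resource defaults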
To create a service and test failover:
Services are created and usually configured to run a resource agent that is responsible for starting and stopping processes. Most resource agents are created according to the OCF (Open Cluster Framework) specification, which is defined as an extension for the Linux Standard Base (LSB). There are many handy resource agents for commonly used processes that are included in the resource-agents packages, including a variety of heartbeat agents that track whether commonly used daemons or services are still running. In the following example, a service is set up that uses a DBMASTER resource agent created precisely for the purpose of testing Pacemaker. This agent is used because it requires a very basic configuration and does not make any assumptions about your environment or the types of services that you intend to run with Pacemaker.
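If you are curious about which resource agents are available on your systems, pcs can list them per standard and provider; for example:
sudo pcs resource agents ocf:heartbeat
sudo pcs resource agents ocf:pacemaker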
1- Add the service as a resource by using the pcs resource create command:
sudo pcs resource create dbmaster_service ocf:pacemaker:DBMASTER op monitor interval=120s
dbmaster_service is the name that is provided for the service for this resource. To invoke the DBMASTER resource agent, a notation (ocf:pacemaker:DBMASTER) is used to specify that it conforms to the OCF standard, that it runs in the pacemaker namespace, and that the DBMASTER script should be used. If you were configuring a heartbeat monitor service for an Oracle Database, you might use the ocf:heartbeat:oracle resource agent. The resource is configured to use the monitor operation in the agent and an interval is set to check the health of the service. In this example, the interval is set to 120s to give the service sufficient time to fail while you are demonstrating failover. By default, this interval is usually set to 20 seconds, but it can be modified depending on the type of service and your particular environment. When you create a service, the cluster attempts to start the resource on a node by using the resource agent’s start command.
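Before moving on to the next step, you can inspect how the resource was recorded in the cluster configuration. Depending on your pcs version, run one of the following:
sudo pcs resource config dbmaster_service
sudo pcs resource show dbmaster_service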
2- Show that the resource has started and is running:
sudo pcs status
3- Run the crm_resource command to simulate service failure by force stopping the service directly:
sudo crm_resource --resource dbmaster_service --force-stop
4- Run the crm_mon command in interactive mode so that you can wait for the failure to be detected and view the Failed Actions message:
sudo crm_mon
You should see the service restart on the alternate node. Note that the monitor interval was set to 120 seconds, so you may have to wait up to the full interval before the failure is reported and the service is restarted.
Note that you can use the Ctrl-C key combination to exit out of crm_mon.
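Once you have finished observing the failover, you may want to clear the recorded failure so that it no longer appears in the status output:
sudo pcs resource cleanup dbmaster_service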
If necessary, reboot the node where the service is running to determine whether failover also occurs in the case of node failure. Note that if you did not previously enable the corosync and pacemaker services to start on boot, you might need to manually start the services on the node that you rebooted by running the following command:
sudo pcs cluster start node1
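If you would rather have the cluster services come up automatically after a reboot, you can enable them on all nodes:
sudo pcs cluster enable --all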