Pacemaker is a cluster resource manager that uses Corosync to manage the heartbeat. Pacemaker is the continuation of the Cluster Resource Manager (CRM) project originally developed for the Heartbeat project.
Pacemaker features
- Failure detection and recovery at the host and application level
- Supports practically any redundancy configuration
- Supports multiple cluster configuration modes at the same time:
  - Active/Active
  - Active/Passive
  - N+1
  - N+M, etc.
- Supports application startup/shutdown ordering (see the sketch after this list)
- Supports applications with multiple modes (e.g. master/slave)
- Ability to test any failure or cluster state
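Startup/shutdown ordering is expressed with ordering constraints once the resources exist. A minimal sketch, assuming two hypothetical resources named vip and web (the latter is not part of this article's setup):

# start vip before web; Pacemaker stops them in the reverse order
pcs constraint order start vip then start web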
Installation
Environment
172.20.0.21 vm1
172.20.0.22 vm2
172.20.0.23 vm3
Configuration on all machines (a sketch follows the list):
- Add hostname resolution entries to /etc/hosts
- Disable the firewall
- Disable SELinux
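A minimal sketch of that preparation on CentOS/RHEL 7, assuming firewalld is the active firewall; adapt to your environment:

cat >> /etc/hosts <<'EOF'
172.20.0.21 vm1
172.20.0.22 vm2
172.20.0.23 vm3
EOF
systemctl stop firewalld
systemctl disable firewalld
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config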
Install
Run the following on vm1~vm3:
# install pcs and the fence agents
yum install -y pcs fence-agents-all
systemctl enable pcsd
systemctl start pcsd
# set the password of the hacluster user (used by pcs cluster auth)
echo xiexianbin.cn | passwd --stdin hacluster
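Before authenticating the nodes it is worth confirming pcsd is actually listening; a quick check, assuming the default pcsd TCP port 2224:

systemctl status pcsd
ss -tlnp | grep 2224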
- Configure cluster auth on vm1
pcs cluster auth vm1 vm2 vm3 -u hacluster -p xiexianbin.cn [--force]
$ pcs cluster setup --start --name mycluster vm1 vm2 vm3 # use --force to force cluster creation
Destroying cluster on nodes: vm1, vm2, vm3...
vm1: Stopping Cluster (pacemaker)...
vm2: Stopping Cluster (pacemaker)...
vm3: Stopping Cluster (pacemaker)...
vm1: Successfully destroyed cluster
vm2: Successfully destroyed cluster
vm3: Successfully destroyed cluster
Sending 'pacemaker_remote authkey' to 'vm1', 'vm2', 'vm3'
vm1: successful distribution of the file 'pacemaker_remote authkey'
vm2: successful distribution of the file 'pacemaker_remote authkey'
vm3: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
vm1: Succeeded
vm2: Succeeded
vm3: Succeeded
Starting cluster on nodes: vm1, vm2, vm3...
vm1: Starting Cluster (corosync)...
vm2: Starting Cluster (corosync)...
vm3: Starting Cluster (corosync)...
vm2: Starting Cluster (pacemaker)...
vm3: Starting Cluster (pacemaker)...
vm1: Starting Cluster (pacemaker)...
Synchronizing pcsd certificates on nodes vm1, vm2, vm3...
vm2: Success
vm3: Success
vm1: Success
Restarting pcsd on the nodes in order to reload the certificates...
vm2: Success
vm3: Success
vm1: Success
pcs cluster start --all
$ pcs cluster enable --all
vm1: Cluster Enabled
vm2: Cluster Enabled
vm3: Cluster Enabled
Usage
help
$ pcs --help
Usage: pcs [-f file] [-h] [commands]...
Control and configure pacemaker and corosync.
Options:
-h, --help Display usage and exit.
-f file Perform actions on file instead of active CIB.
--debug Print all network traffic and external commands run.
--version Print pcs version information. List pcs capabilities if
--full is specified.
--request-timeout Timeout for each outgoing request to another node in
seconds. Default is 60s.
--force Override checks and errors, the exact behavior depends on
the command. WARNING: Using the --force option is
strongly discouraged unless you know what you are doing.
Commands:
cluster Configure cluster options and nodes.
resource Manage cluster resources.
stonith Manage fence devices.
constraint Manage resource constraints.
property Manage pacemaker properties.
acl Manage pacemaker access control lists.
qdevice Manage quorum device provider on the local host.
quorum Manage cluster quorum settings.
booth Manage booth (cluster ticket manager).
status View cluster status.
config View and manage cluster configuration.
pcsd Manage pcs daemon.
node Manage cluster nodes.
alert Manage pacemaker alerts.
client Manage pcsd client configuration.
Check pcs status
$ pcs status
$ pcs status cluster
$ pcs status corosync
# cluster status
$ pcs cluster status
Cluster Status:
Stack: corosync
Current DC: vm1 (version 1.1.19-8.el7-c3c624ea3d) - partition WITHOUT quorum
Last updated: Sun Aug  4 10:21:40 2019
Last change: Sun Aug  4 10:19:14 2019 by hacluster via crmd on vm1
3 nodes configured
0 resource instances configured
PCSD Status:
vm1: Online
vm2: Online
vm3: Online
Check corosync status
crm_mon -1
pcs cluster configuration
# configuration check
$ crm_verify -L -V
$ pcs property --help
# Run 'man pengine' and 'man crmd' to get a description of the properties.
# When there is no fencing device, disable stonith (important). This is a cluster-wide property and only needs to be set once.
# WARNING: no stonith devices and stonith-enabled is not false
$ pcs property set stonith-enabled=false
$ pcs property set pe-warn-series-max=1000 \
pe-input-series-max=1000 \
pe-error-series-max=1000 \
cluster-recheck-interval=3min
$ pcs property list
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: mycluster
cluster-recheck-interval: 3min
dc-version: 1.1.23-1.el7_9.1-9acf116022
have-watchdog: false
pe-error-series-max: 1000
pe-input-series-max: 1000
pe-warn-series-max: 1000
stonith-enabled: false
symmetric-cluster: false
# Set the default resource stickiness (prevents resources from failing back)
pcs resource defaults resource-stickiness=100
# Set the default resource operation timeout
pcs resource op defaults timeout=10s
pcs resource op defaults
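pcs resource op defaults above lists the operation defaults; similarly, pcs resource defaults with no arguments lists the resource defaults (e.g. the stickiness just set):

pcs resource defaults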
Configure the VIP
$ pcs resource create vip ocf:heartbeat:IPaddr2 \
ip=172.20.0.20 cidr_netmask=24 nic=ens33 \
op monitor interval=30s
[root@vm1 ~]# ip a show ens33
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:50:56:38:88:62 brd ff:ff:ff:ff:ff:ff
inet 172.20.0.21/24 brd 172.20.0.255 scope global ens33
valid_lft forever preferred_lft forever
inet 172.20.0.20/24 brd 172.20.0.255 scope global secondary ens33
valid_lft forever preferred_lft forever
# simulate a failure: manually remove the VIP from the node
ip addr del 172.20.0.20/24 dev ens33
You can see that the VIP is re-added on vm1 or fails over to another node.
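A failover can also be triggered administratively. A minimal sketch: pcs resource move places a temporary location constraint, which should be cleared afterwards or the VIP stays pinned to vm2:

pcs resource move vip vm2
pcs status resources
pcs resource clear vip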
View the configuration
[root@vm1 ~]# pcs resource
vip (ocf::heartbeat:IPaddr2): Stopped
[root@vm1 ~]# pcs resource show
vip (ocf::heartbeat:IPaddr2): Stopped
[root@vm1 ~]# pcs resource show vip
Resource: vip (class=ocf provider=heartbeat type=IPaddr2)
Attributes: cidr_netmask=24 ip=172.20.0.20 nic=ens33
Operations: monitor interval=30s (vip-monitor-interval-30s)
start interval=0s timeout=20s (vip-start-interval-0s)
stop interval=0s timeout=20s (vip-stop-interval-0s)
# show the full cluster configuration
pcs config
# update a resource parameter
pcs resource update vip ip=172.20.0.24
# delete a resource
pcs resource delete vip
# reset resource state and fail count
pcs resource cleanup
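Failure counters can also be inspected or reset per resource; a sketch, assuming the vip resource defined above:

pcs resource failcount show vip
pcs resource failcount reset vip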
# Add and remove nodes
pcs cluster node add <new server>
pcs cluster node remove [node]
# Control node state
pcs cluster standby <server>
pcs cluster standby --all
pcs cluster unstandby <server>
pcs cluster unstandby --all
# configuration check
crm_verify -L -V
# View membership information
corosync-cmapctl | grep members
corosync-cfgtool -s
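For example, to drain one node for maintenance and bring it back afterwards (a sketch using vm2 from this environment):

pcs cluster standby vm2
pcs status nodes
pcs cluster unstandby vm2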
Related files
- /etc/corosync/corosync.conf, synced to the other nodes with pcs cluster sync
- /var/log/pacemaker.log
- /var/log/pcsd/pcsd.log
FAQ
Handling a failed node
# on the failed node
systemctl stop pcsd pacemaker corosync
systemctl start pcsd
# run on a healthy node
pcs cluster sync
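Once the configuration has been synced, the repaired node can be started again and the cluster state checked; a sketch, assuming vm3 was the failed node:

pcs cluster start vm3
pcs status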
Restart the cluster
pcs cluster stop --all
pcs cluster sync
pcs cluster start --all
VIP configuration does not take effect
vm3 crmd[9263]: notice: Result of probe operation for vip on vm3: 7 (not running)
Fewer than 3 cluster nodes had joined; the problem was resolved after the missing node was repaired.
partition WITHOUT quorum
If no more than half of the total number of nodes are online, the cluster has no quorum and will not run resources.
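In a test environment (or a two-node cluster) the quorum requirement can be relaxed; a hedged sketch, not something to apply to a production cluster without understanding the consequences:

pcs property set no-quorum-policy=ignore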
Node vm1: UNCLEAN (offline)
pcs property set stonith-enabled=false