RabbitMQ cluster configuration
Installation
Nodes
vim /etc/hosts
192.168.128.101 rabbitmq-101
192.168.128.102 rabbitmq-102
192.168.128.103 rabbitmq-103
systemctl stop firewalld.service
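Before moving on, it can be worth a quick sanity check that the hostnames resolve on every node and that the firewall really is stopped; a minimal check using the three hostnames above:
# expect one line per host, and "inactive" (or "unknown") from firewalld
getent hosts rabbitmq-101 rabbitmq-102 rabbitmq-103
systemctl is-active firewalld.service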
RPM installation
yum install rabbitmq-server -y
Optimization
systemctl enable rabbitmq-server
vim /usr/lib/systemd/system/rabbitmq-server.service
[Service]
LimitNOFILE = 300000
systemctl daemon-reload
Edit /etc/rabbitmq/rabbitmq.config:
[
{rabbit,
...
{cluster_partition_handling, autoheal}
...
},
{rabbitmq_management,
...
{rates_mode, none} or
{rates_mode, basic} or
{rates_mode, detailed}
...
}
]
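A minimal sketch of how those two sections can be combined into a complete /etc/rabbitmq/rabbitmq.config (rates_mode none is used here purely as an example; pick whichever mode you need):
[
  {rabbit, [
    %% heal partitions automatically, see the split-brain section below
    {cluster_partition_handling, autoheal}
  ]},
  {rabbitmq_management, [
    %% none disables message rate statistics and keeps the management DB small
    {rates_mode, none}
  ]}
].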
Starting the cluster
Preparation
On the first node, rabbitmq-101:
systemctl start rabbitmq-server
cp -rp /var/lib/rabbitmq/.erlang.cookie /tmp/.erlang.cookie
chown rabbitmq. /tmp/.erlang.cookie -R
chown rabbitmq. /var/lib/rabbitmq/.erlang.cookie -R
chmod 0400 /var/lib/rabbitmq/.erlang.cookie
Distribute the .erlang.cookie file above to the other nodes:
scp /var/lib/rabbitmq/.erlang.cookie rabbitmq-102:~
scp /var/lib/rabbitmq/.erlang.cookie rabbitmq-103:~
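The scp above only drops the cookie into root's home directory on rabbitmq-102 and rabbitmq-103; it still has to be moved into place there with the same owner and mode before the service starts. A sketch to run on each of those two nodes:
# move the copied cookie into the rabbitmq home directory
mv ~/.erlang.cookie /var/lib/rabbitmq/.erlang.cookie
chown rabbitmq. /var/lib/rabbitmq/.erlang.cookie
chmod 0400 /var/lib/rabbitmq/.erlang.cookie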
Start the RabbitMQ service on all nodes:
systemctl start rabbitmq-server
Forming the cluster
On rabbitmq-102 and rabbitmq-103, run:
rabbitmqctl stop_app;
rabbitmqctl join_cluster rabbit@rabbitmq-101 --ram;
rabbitmqctl start_app;
[root@rabbitmq-102 ~]# rabbitmqctl stop_app;
Stopping node 'rabbit@rabbitmq-102' ...
[root@rabbitmq-102 ~]# rabbitmqctl join_cluster rabbit@rabbitmq-101 --ram;
Clustering node 'rabbit@rabbitmq-102' with 'rabbit@rabbitmq-101' ...
[root@rabbitmq-102 ~]# rabbitmqctl start_app
Starting node 'rabbit@rabbitmq-102' ...
Setting up HA
rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'
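To confirm that the mirroring policy is in place on the default vhost:
rabbitmqctl list_policies -p /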
Enabling the management plugin
Run on each node:
rabbitmq-plugins enable rabbitmq_management
[root@rabbitmq-101 ~]# rabbitmq-plugins enable rabbitmq_management
The following plugins have been enabled:
mochiweb
webmachine
rabbitmq_web_dispatch
amqp_client
rabbitmq_management_agent
rabbitmq_management
Applying plugin configuration to rabbit@rabbitmq-101... started 6 plugins.
[root@rabbitmq-101 ~]#
Adding a monitor user
rabbitmqctl add_user monitor xiexianbin.cn
rabbitmqctl set_user_tags monitor monitoring
rabbitmqctl set_permissions -p / monitor ".*" ".*" ".*"
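With the management plugin enabled and the monitor user in place, the HTTP API offers a quick health check; a sketch using curl against the default management port 15672 (adjust host and credentials as needed):
# returns a JSON overview of the node and cluster
curl -u monitor:xiexianbin.cn http://127.0.0.1:15672/api/overview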
Changing a password
rabbitmqctl change_password user_admin passwd_admin
Setting tags
rabbitmqctl set_user_tags user_admin administrator
# rabbitmqctl set_user_tags user_admin monitoring policymaker
RabbitMQ user tags and what they grant (a quick verification command follows this list):
- (None): no access to the management plugin
- management: anything the user could do over the messaging protocols, plus:
  - list the virtual hosts they can log into via AMQP
  - view all queues, exchanges and bindings in "their" virtual hosts
  - view and close their own channels and connections
  - view "global" statistics covering all their virtual hosts, including activity by other users within them
- policymaker: view, create and delete policies and parameters for the virtual hosts they can log into via AMQP
- monitoring: everything management can do, plus:
  - list all virtual hosts, including ones they cannot access over the messaging protocols
  - view other users' connections and channels
  - view node-level data such as memory use and clustering
  - view true global statistics for all virtual hosts
- administrator: everything policymaker and management can do, plus:
  - create and delete virtual hosts
  - view, create and delete users
  - view, create and delete permissions
  - close other users' connections
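To check which tags each user currently carries:
rabbitmqctl list_users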
Setting permissions
rabbitmqctl set_permissions [-p <vhost>] <user> <conf> <write> <read>
# full configure/write/read permissions
rabbitmqctl set_permissions -p / user_admin ".*" ".*" ".*"
# configure and read everything, but no write permission
rabbitmqctl set_permissions -p / user_admin ".*" "^$" ".*"
rabbitmqctl list_permissions -p /
Checking cluster status
[root@rabbitmq-101 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@rabbitmq-101' ...
[{nodes,[{disc,['rabbit@rabbitmq-101']},{ram,['rabbit@rabbitmq-102']}]},
{running_nodes,['rabbit@rabbitmq-102','rabbit@rabbitmq-101']},
{cluster_name,<<"rabbit@rabbitmq-101">>},
{partitions,[]},
{alarms,[{'rabbit@rabbitmq-102',[]},{'rabbit@rabbitmq-101',[]}]}]
[root@rabbitmq-101 ~]#
By default, the master node is a disc node.
rabbitmqctl status
Notes
Memory usage
Analysis of RabbitMQ memory usage (on 64-bit machines):
By default, RabbitMQ limits itself to 40% of total memory at startup. This can be configured in /etc/rabbitmq/rabbitmq.config:
%% Resource Limits & Flow Control
%% ==============================
%%
%% See http://www.rabbitmq.com/memory.html for full details.
%% Memory-based Flow Control threshold.
%%
%% {vm_memory_high_watermark, 0.4}, %% set the limit as a fraction of total memory
%% Alternatively, we can set a limit (in bytes) of RAM used by the node.
%%
%% {vm_memory_high_watermark, {absolute, 1073741824}}, %% set the limit as an absolute number of bytes
%%
%% Or you can set absolute value using memory units.
%%
%% {vm_memory_high_watermark, {absolute, "1024M"}}, %% set the limit using a memory unit suffix
%%
%% Supported units suffixes:
%%
%% k, kiB: kibibytes (2^10 bytes)
%% M, MiB: mebibytes (2^20)
%% G, GiB: gibibytes (2^30)
%% kB: kilobytes (10^3)
%% MB: megabytes (10^6)
%% GB: gigabytes (10^9)
Example: on a machine with 377 GB of physical memory, RabbitMQ limits itself to 40% of total memory at startup, i.e. roughly 150 GB. The startup log shows:
[root@xiexianbin_cn rabbitmq]# grep "Memory limit set" ./*
./rabbit@xiexianbin_cn.log-20170618:Memory limit set to 154706MB of 386766MB total.
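To change the limit, set the watermark in /etc/rabbitmq/rabbitmq.config and restart the node; a minimal sketch lowering it to 30% of total memory (0.3 is only an example value):
[
  {rabbit, [
    %% allow RabbitMQ to use at most 30% of total memory
    {vm_memory_high_watermark, 0.3}
  ]}
].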
Memory usage of the RabbitMQ process itself:
[root@xiexianbin_cn ~]# cat /proc/28579/status
Name: beam.smp
State: S (sleeping)
Tgid: 28579
Ngid: 0
Pid: 28579
PPid: 1
...
VmPeak: 17934548 kB # peak virtual memory size of the process
VmSize: 17883064 kB # current virtual memory size of the process
...
VmHWM: 506508 kB # peak physical (resident) memory used by the process
VmRSS: 487876 kB # current physical (resident) memory used by the process (matches RSS in procrank)
VmData: 17715576 kB # size of the process data segment
...
Checking memory usage through RabbitMQ itself:
[root@xiexianbin_cn ~]# rabbitmqctl status
Status of node 'rabbit@xiexianbin_cn'
[{pid,28579},
...
{memory,
[{total,276131584}, # memory actually in use; this is the value shown in the management UI
{connection_readers,764504},
{connection_writers,37440},
{connection_channels,7569392},
{connection_other,2074696},
{queue_procs,2507696},
{queue_slave_procs,57342936},
{plugins,36284088},
{other_proc,378664},
{mnesia,808528},
{metrics,304040},
{mgmt_db,33125952},
{msg_index,76312},
{other_ets,2747784},
{binary,48278352},
{code,27598779},
{atom,992409},
{other_system,55541244}]},
...
{vm_memory_high_watermark,0.4}, # memory limit as a fraction of total memory
{vm_memory_limit,162221613056}, # memory limit in bytes
...
LimitNOFILE tuning
After installing RabbitMQ with the default configuration, its File descriptors and Socket descriptors limits are very low, 924 and 829 respectively. Clients (consumers) holding long-lived connections can easily exhaust the sockets.
Tune it as follows:
- First locate the rabbitmq-server.service file with the following command:
systemctl cat rabbitmq-server.service
Then open the rabbitmq-server.service file and add the following setting under the [Service] section:
[Service]
LimitNOFILE=300000
systemctl daemon-reload
systemctl restart rabbitmq-server.service
The new limit can be checked with:
rabbitmqctl status
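Besides rabbitmqctl status, the limit the Erlang VM actually received can be read from the process itself; a sketch, assuming beam.smp is the RabbitMQ VM process:
# find the beam.smp PID and read its open-files limit
pid=$(pgrep -f beam.smp | head -n1)
grep "Max open files" /proc/$pid/limits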
How to reproduce a split-brain (network partition)
Bring down the NIC of any node with ifdown and bring it back up with ifup about one minute later; that is enough to trigger a split-brain (a sketch of the steps follows), after which cluster_status on each node reports the partition, as the transcripts below show.
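A minimal sketch of the reproduction, assuming the node's interface is named eth0 (a placeholder; substitute the real NIC name):
# run on one node, e.g. rbtnode2; eth0 is a placeholder interface name
ifdown eth0
sleep 60
ifup eth0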
# after disconnecting rbtnode2
# rabbitmqctl cluster_status
Cluster status of node rabbit@rbtnode2 ...
[{nodes,[{disc,[rabbit@rbtnode1,rabbit@rbtnode2]},{ram,[rabbit@rbtnode3]}]},
{running_nodes,[rabbit@rbtnode2]},
{cluster_name,<<"rabbit@rbtnode1">>},
{partitions,[{rabbit@rbtnode2,[rabbit@rbtnode1,rabbit@rbtnode3]}]},
{alarms,[{rabbit@rbtnode2,[]}]}]
# after disconnecting rbtnode3
# rabbitmqctl cluster_status
Cluster status of node rabbit@rbtnode1 ...
[{nodes,[{disc,[rabbit@rbtnode1,rabbit@rbtnode2]},{ram,[rabbit@rbtnode3]}]},
{running_nodes,[rabbit@rbtnode2,rabbit@rbtnode1]},
{cluster_name,<<"rabbit@rbtnode1">>},
{partitions,[{rabbit@rbtnode2,[rabbit@rbtnode3]},
{rabbit@rbtnode1,[rabbit@rbtnode3]}]},
{alarms,[{rabbit@rbtnode2,[]},{rabbit@rbtnode1,[]}]}]
Split-brain mitigation
Error log on the flapping node:
2020-12-11 13:55:48.388 [info] <0.325.0> only running disc node went down~n
2020-12-11 13:55:48.388 [info] <0.325.0> rabbit on node rabbit@rbtnode1 down
2020-12-11 13:55:48.396 [info] <0.325.0> Node rabbit@rbtnode1 is down, deleting its listeners
2020-12-11 13:55:48.405 [error] <0.156.0> Mnesia(rabbit@rbtnode3): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@rbtnode1}
2020-12-11 13:55:48.406 [error] <0.156.0> Mnesia(rabbit@rbtnode3): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@rbtnode2}
2020-12-11 13:55:48.406 [info] <0.325.0> only running disc node went down~n
2020-12-11 13:55:48.407 [info] <0.325.0> node rabbit@rbtnode2 down: net_tick_timeout
2020-12-11 13:55:48.407 [info] <0.325.0> node rabbit@rbtnode1 down: net_tick_timeout
2020-12-11 13:55:48.407 [info] <0.325.0> node rabbit@rbtnode1 up
2020-12-11 13:55:48.407 [info] <0.325.0> node rabbit@rbtnode2 up
- cluster_partition_handling tuning
Since RabbitMQ 3.1, the cluster_partition_handling setting controls how the cluster handles a split-brain. It has three possible values:
- ignore: the default; do nothing when a partition occurs. Choose this when the network is reliable.
- pause_minority: after a partition, each node checks whether its partition holds more than half of the cluster nodes; if not, those nodes pause themselves (favors CP; use an odd number of nodes).
- autoheal: choose this when service continuity matters more than data consistency across nodes (the partition with fewer connections is restarted, so some data may be lost).
Configure /etc/rabbitmq/rabbitmq.config as follows:
[
{rabbit, [
{default_user_tags, []},
{cluster_partition_handling, autoheal} %% added
]},
{kernel, [{net_ticktime, 150}]} %% added
].
vim /etc/rabbitmq/rabbitmq-env.conf
and change the second line to:
CONFIG_FILE=/etc/rabbitmq/rabbitmq.config
Restart the RabbitMQ nodes one at a time:
systemctl restart rabbitmq-server.service
Use rabbitmqctl status | grep net_ticktime to check that the change took effect.
PS: you can use ps -ef | grep erlang to find the Erlang process ID and kill it if necessary.
Common problems
Force-deleting NaN queues
After a cluster failure, a large number of NaN queues may appear that cannot be deleted from the management UI. In such cases, or whenever a queue otherwise refuses to be deleted, the following commands can force-delete it:
rabbitmqctl eval 'rabbit_amqqueue:internal_delete({resource,<<"/">>,queue,<<"cinder.info">>}).'
rabbitmqctl eval "Q = rabbit_misc:r(<<\"$vhost\">>, queue, <<\"$queue\">>), rabbit_amqqueue:internal_delete(Q, <<\"cli\">>)."
rabbitmqctl list_queues -p xie-queue-name | awk -F " " '{print $1}' |grep -v "Timeout" | grep -v "Listing" | grep -Ev "^$" | xargs -p -I abcdef rabbitmqctl eval "Q = rabbit_misc:r(<<\"xie-queue-name\">>, queue, <<\"abcdef\">>), rabbit_amqqueue:internal_delete(Q, <<\"cli\">>)."
ref https://community.pivotal.io/s/article/Investigating-Ghost-queues-on-RabbitMQ?language=en_US
A large number of NaN queues can also be cleaned up by stopping all nodes in the cluster.
Split-brain problem
Symptom: the RabbitMQ management UI shows:
Network partition detected
Mnesia reports that this RabbitMQ cluster has experienced a network partition. There is a risk of losing data. Please read RabbitMQ documentation about network partitions and the possible solutions.
Cause: a network problem caused the cluster to partition (split-brain).
Fix: pick the partition you trust less, and on each node in that partition run:
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl start_app
In some cases it is enough to simply restart the RabbitMQ service:
systemctl restart rabbitmq-server
Failure to join the cluster
[root@rabbitmq-102 rabbitmq]# rabbitmqctl join_cluster rabbit@rabbitmq-101 --ram
Clustering node 'rabbit@rabbitmq-102' with 'rabbit@rabbitmq-101'
Error: {inconsistent_cluster,"Node 'rabbit@rabbitmq-101' thinks it's clustered with node 'rabbit@rabbitmq-102', but 'rabbit@rabbitmq-102' disagrees"}
Fix procedure:
# on rabbitmq-102, stop the app
rabbitmqctl stop_app
# on rabbitmq-102, reset the node
rabbitmqctl reset
# on rabbitmq-101, remove the broken node from the cluster
rabbitmqctl forget_cluster_node rabbit@rabbitmq-102
# on rabbitmq-102, rejoin the cluster
rabbitmqctl join_cluster rabbit@rabbitmq-101 --ram  # or --disc
# start the app
rabbitmqctl start_app
In some cases the last node to go offline cannot be brought back up. It can be removed from the cluster using the forget_cluster_node rabbitmqctl command.
Data corruption problem
The management UI shows: Virtual host / experienced an error on node rabbit@<hostname> and may be inaccessible
Check the RabbitMQ logs; they contain many errors referencing a path like:
/var/lib/rabbitmq/mnesia/rabbit@<hostname>/msg_stores/vhosts/9Q355W65BH1CKHLV81SV8HAB4/recovery.dets
Fix (a shell sketch follows this list):
- stop the RabbitMQ cluster
- delete the broken vhosts/9Q355W65BH1CKHLV81SV8HAB4 directory
- start the RabbitMQ cluster
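A minimal shell sketch of that procedure, reusing the vhost hash directory 9Q355W65BH1CKHLV81SV8HAB4 from the log above (the hash differs per vhost and installation, and persistent messages stored under it are lost):
# stop the whole cluster first (run on every node)
systemctl stop rabbitmq-server
# remove the broken vhost message store; adjust <hostname> and the hash directory
rm -rf /var/lib/rabbitmq/mnesia/rabbit@<hostname>/msg_stores/vhosts/9Q355W65BH1CKHLV81SV8HAB4
# start the nodes again
systemctl start rabbitmq-server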