RabbitMQ Cluster Configuration

Series
  1. RabbitMQ Cluster Configuration (this article)
  2. RabbitMQ Shovel Configuration -- forwarding messages between clusters
  3. The simplest way to monitor and troubleshoot RabbitMQ

How to configure a RabbitMQ cluster.

Installation

Nodes

vim /etc/hosts

192.168.128.101    rabbitmq-101
192.168.128.102    rabbitmq-102
192.168.128.103    rabbitmq-103
systemctl stop firewalld.service
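
Before clustering, it is worth confirming that every node resolves its peers; a minimal check (a sketch, assuming the hostnames configured above):

# verify name resolution for each cluster member
for h in rabbitmq-101 rabbitmq-102 rabbitmq-103; do
  getent hosts "$h" || echo "cannot resolve $h"
done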

RPM installation

yum install rabbitmq-server -y

Tuning

systemctl enable rabbitmq-server
vim /usr/lib/systemd/system/rabbitmq-server.service
[Service]
LimitNOFILE=300000
systemctl daemon-reload
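
To confirm systemd picked up the new limit (it only applies to the process after the service is started or restarted):

systemctl show rabbitmq-server -p LimitNOFILE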

Edit /etc/rabbitmq/rabbitmq.config:

[
 {rabbit,
   ...
   {cluster_partition_handling, autoheal}
   ...
  },
 {rabbitmq_management,
   ...
   %% rates_mode is one of: none | basic | detailed
   {rates_mode, basic}
   ...
  }
]
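
For reference, a minimal complete file in this classic Erlang-term format might look like the sketch below (the rates_mode value here is an example choice, not mandated above):

[
 {rabbit, [
   {cluster_partition_handling, autoheal}
 ]},
 {rabbitmq_management, [
   {rates_mode, basic}
 ]}
].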

Starting the cluster

Preparation

On the first node, rabbitmq-101:

systemctl start rabbitmq-server
# keep a spare copy of the Erlang cookie
cp -rp /var/lib/rabbitmq/.erlang.cookie /tmp/.erlang.cookie

chown rabbitmq. /tmp/.erlang.cookie -R
chown rabbitmq. /var/lib/rabbitmq/.erlang.cookie -R

# the cookie must be readable only by its owner
chmod 0400 /var/lib/rabbitmq/.erlang.cookie

Distribute the .erlang.cookie file above to the other nodes:

scp /var/lib/rabbitmq/.erlang.cookie rabbitmq-102:~
scp /var/lib/rabbitmq/.erlang.cookie rabbitmq-103:~
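
On each receiving node the cookie still has to be moved into place with the right owner and mode before the service starts; a sketch assuming the default rpm paths:

# run on rabbitmq-102 and rabbitmq-103
mv ~/.erlang.cookie /var/lib/rabbitmq/.erlang.cookie
chown rabbitmq. /var/lib/rabbitmq/.erlang.cookie
chmod 0400 /var/lib/rabbitmq/.erlang.cookie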

Start the RabbitMQ service on all nodes:

systemctl start rabbitmq-server

Forming the cluster

On each joining node (rabbitmq-102 and rabbitmq-103), run:

rabbitmqctl stop_app;
rabbitmqctl join_cluster rabbit@rabbitmq-101 --ram;
rabbitmqctl start_app;
[root@rabbitmq-102 ~]# rabbitmqctl stop_app;
Stopping node 'rabbit@rabbitmq-102' ...
[root@rabbitmq-102 ~]# rabbitmqctl join_cluster rabbit@rabbitmq-101 --ram;
Clustering node 'rabbit@rabbitmq-102' with 'rabbit@rabbitmq-101' ...
[root@rabbitmq-102 ~]# rabbitmqctl start_app
Starting node 'rabbit@rabbitmq-102' ...

Setting up HA

rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'
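
The regex excludes the built-in amq.* entities; everything else is mirrored to all nodes. To verify the policy, and, if automatic synchronisation of new mirrors is also wanted, a commonly used variant (an addition, not part of the original setup):

rabbitmqctl list_policies -p /
rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all", "ha-sync-mode": "automatic"}'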

Enabling the management plugin

Run on each node:

rabbitmq-plugins enable rabbitmq_management
[root@rabbitmq-101 ~]# rabbitmq-plugins enable rabbitmq_management
The following plugins have been enabled:
  mochiweb
  webmachine
  rabbitmq_web_dispatch
  amqp_client
  rabbitmq_management_agent
  rabbitmq_management

Applying plugin configuration to rabbit@rabbitmq-101... started 6 plugins.
[root@rabbitmq-101 ~]#

Adding a monitor user

rabbitmqctl add_user monitor xiexianbin.cn
rabbitmqctl set_user_tags monitor monitoring
rabbitmqctl set_permissions -p / monitor ".*" ".*" ".*"
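
With the monitoring tag, this user can read the management HTTP API; a quick smoke test against the default port 15672 (assumed default; adjust host and credentials to your setup):

curl -s -u monitor:xiexianbin.cn http://rabbitmq-101:15672/api/overview
curl -s -u monitor:xiexianbin.cn http://rabbitmq-101:15672/api/nodes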

Changing a password

rabbitmqctl change_password user_admin passwd_admin

Setting user tags

rabbitmqctl set_user_tags user_admin administrator
# rabbitmqctl set_user_tags user_admin monitoring policymaker

RabbitMQ tags and what they grant:

  • (None): no access to the management plugin
  • management: anything the user could do over the messaging protocols, plus:
    • list virtual hosts they can log in to via AMQP
    • view all queues, exchanges, and bindings in "their" virtual hosts
    • view and close their own channels and connections
    • view "global" statistics covering all their virtual hosts, including activity by other users within them
  • policymaker: view, create, and delete policies and parameters for the virtual hosts they can log in to via AMQP
  • monitoring: everything management grants, plus:
    • list all virtual hosts, including ones they cannot access via the messaging protocols
    • view other users' connections and channels
    • view node-level data such as memory use and clustering
    • view true global statistics for all virtual hosts
  • administrator: everything policymaker and monitoring grant, plus:
    • create and delete virtual hosts
    • view, create, and delete users
    • view, create, and delete permissions
    • close other users' connections

Setting permissions

rabbitmqctl set_permissions [-p <vhost>] <user> <conf> <write> <read>
# <conf>, <write> and <read> are regular expressions matched against resource names
rabbitmqctl set_permissions -p / user_admin ".*" ".*" ".*"  # full configure/write/read access
rabbitmqctl set_permissions -p / user_admin ".*" "^$" ".*"  # deny write ("^$" matches nothing)
rabbitmqctl list_permissions -p /

Checking cluster status

[root@rabbitmq-101 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@rabbitmq-101' ...
[{nodes,[{disc,['rabbit@rabbitmq-101']},{ram,['rabbit@rabbitmq-102']}]},
 {running_nodes,['rabbit@rabbitmq-102','rabbit@rabbitmq-101']},
 {cluster_name,<<"rabbit@rabbitmq-101">>},
 {partitions,[]},
 {alarms,[{'rabbit@rabbitmq-102',[]},{'rabbit@rabbitmq-101',[]}]}]
[root@rabbitmq-101 ~]#

By default, the master node is the disc node.

rabbitmqctl status
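
A RAM node can later be converted to a disc node (or back) with the standard change_cluster_node_type subcommand; the app must be stopped first:

# run on the node being converted, e.g. rabbitmq-102
rabbitmqctl stop_app
rabbitmqctl change_cluster_node_type disc
rabbitmqctl start_app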

Notes

Memory usage

An analysis of RabbitMQ memory usage (64-bit machines):

By default, RabbitMQ limits itself to 40% of total memory at startup. This can be configured in /etc/rabbitmq/rabbitmq.config:

   %% Resource Limits & Flow Control
   %% ==============================
   %%
   %% See http://www.rabbitmq.com/memory.html for full details.

   %% Memory-based Flow Control threshold.
   %%
   %% {vm_memory_high_watermark, 0.4},  %% fraction of total memory

   %% Alternatively, we can set a limit (in bytes) of RAM used by the node.
   %%
   %% {vm_memory_high_watermark, {absolute, 1073741824}},  %% absolute limit in bytes
   %%
   %% Or you can set absolute value using memory units.
   %%
   %% {vm_memory_high_watermark, {absolute, "1024M"}},  %% absolute limit with a unit suffix
   %%
   %% Supported units suffixes:
   %%
   %% k, kiB: kibibytes (2^10 bytes)
   %% M, MiB: mebibytes (2^20)
   %% G, GiB: gibibytes (2^30)
   %% kB: kilobytes (10^3)
   %% MB: megabytes (10^6)
   %% GB: gigabytes (10^9)

A worked example:

The machine has 377 GB of physical memory. With the limit at 40% of the total, RabbitMQ caps itself at roughly 150 GB, as the startup log shows:

[root@xiexianbin_cn rabbitmq]# grep "Memory limit set" ./*
./rabbit@xiexianbin_cn.log-20170618:Memory limit set to 154706MB of 386766MB total.

What the kernel reports for the process:

[root@xiexianbin_cn ~]# cat /proc/28579/status
Name:   beam.smp
State:  S (sleeping)
Tgid:   28579
Ngid:   0
Pid:    28579
PPid:   1
...
VmPeak: 17934548 kB  # peak virtual memory size
VmSize: 17883064 kB  # current virtual memory size
...
VmHWM:    506508 kB  # peak physical (resident) memory usage
VmRSS:    487876 kB  # current physical memory usage (the RSS in procrank)
VmData: 17715576 kB  # size of the data segment
...
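
The same fields can be pulled in one line once the beam.smp PID is known (a sketch; pgrep -o returns the oldest matching process):

grep -E 'VmPeak|VmSize|VmHWM|VmRSS|VmData' /proc/"$(pgrep -o beam.smp)"/status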

Viewing memory usage through RabbitMQ itself:

[root@xiexianbin_cn ~]# rabbitmqctl status
Status of node 'rabbit@xiexianbin_cn'
[{pid,28579},
...
 {memory,
      [{total,276131584},  # memory actually in use; the value shown on the management UI
      {connection_readers,764504},
      {connection_writers,37440},
      {connection_channels,7569392},
      {connection_other,2074696},
      {queue_procs,2507696},
      {queue_slave_procs,57342936},
      {plugins,36284088},
      {other_proc,378664},
      {mnesia,808528},
      {metrics,304040},
      {mgmt_db,33125952},
      {msg_index,76312},
      {other_ets,2747784},
      {binary,48278352},
      {code,27598779},
      {atom,992409},
      {other_system,55541244}]},
...
 {vm_memory_high_watermark,0.4},  # memory limit as a fraction
 {vm_memory_limit,162221613056},  # memory limit in bytes
 ...

Tuning LimitNOFILE

With the default installation, RabbitMQ's File descriptors and Socket descriptors limits are very low: 924 and 829 respectively. Clients (consumers) holding long-lived connections can easily exhaust the sockets.

The fix:

  • Locate the rabbitmq-server.service file:

systemctl cat rabbitmq-server.service

  • Open the file and add the following under the [Service] section:

[Service]
LimitNOFILE=300000

  • Reload the modified unit:

systemctl daemon-reload

  • Restart the RabbitMQ service:

systemctl restart rabbitmq-server.service

Check the updated values with:

rabbitmqctl status
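
To look at the descriptor limits specifically (the exact layout of the status output varies between RabbitMQ versions):

rabbitmqctl status | grep -A 4 file_descriptors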

Reproducing split-brain

Run ifdown on the NIC of any node, wait about a minute, then run ifup; the cluster will partition:

# cut off rbtnode2
# rabbitmqctl cluster_status
Cluster status of node rabbit@rbtnode2 ...
[{nodes,[{disc,[rabbit@rbtnode1,rabbit@rbtnode2]},{ram,[rabbit@rbtnode3]}]},
 {running_nodes,[rabbit@rbtnode2]},
 {cluster_name,<<"rabbit@rbtnode1">>},
 {partitions,[{rabbit@rbtnode2,[rabbit@rbtnode1,rabbit@rbtnode3]}]},
 {alarms,[{rabbit@rbtnode2,[]}]}]
# cut off rbtnode3
# rabbitmqctl cluster_status
Cluster status of node rabbit@rbtnode1 ...
[{nodes,[{disc,[rabbit@rbtnode1,rabbit@rbtnode2]},{ram,[rabbit@rbtnode3]}]},
 {running_nodes,[rabbit@rbtnode2,rabbit@rbtnode1]},
 {cluster_name,<<"rabbit@rbtnode1">>},
 {partitions,[{rabbit@rbtnode2,[rabbit@rbtnode3]},
              {rabbit@rbtnode1,[rabbit@rbtnode3]}]},
 {alarms,[{rabbit@rbtnode2,[]},{rabbit@rbtnode1,[]}]}]

Mitigating split-brain

Error log from the flapping node:

2020-12-11 13:55:48.388 [info] <0.325.0> only running disc node went down~n
2020-12-11 13:55:48.388 [info] <0.325.0> rabbit on node rabbit@rbtnode1 down
2020-12-11 13:55:48.396 [info] <0.325.0> Node rabbit@rbtnode1 is down, deleting its listeners
2020-12-11 13:55:48.405 [error] <0.156.0> Mnesia(rabbit@rbtnode3): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@rbtnode1}
2020-12-11 13:55:48.406 [error] <0.156.0> Mnesia(rabbit@rbtnode3): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@rbtnode2}
2020-12-11 13:55:48.406 [info] <0.325.0> only running disc node went down~n
2020-12-11 13:55:48.407 [info] <0.325.0> node rabbit@rbtnode2 down: net_tick_timeout
2020-12-11 13:55:48.407 [info] <0.325.0> node rabbit@rbtnode1 down: net_tick_timeout
2020-12-11 13:55:48.407 [info] <0.325.0> node rabbit@rbtnode1 up
2020-12-11 13:55:48.407 [info] <0.325.0> node rabbit@rbtnode2 up
  • cluster_partition_handling tuning

RabbitMQ 3.1 and later provide the cluster_partition_handling setting for dealing with partitions. It takes one of three values:

  • ignore: the default; do nothing when a partition occurs. Suitable when the network is reliable.
  • pause_minority: after a partition, each node checks whether its side holds more than half of the cluster's nodes; if not, those nodes pause themselves (favours CP; use an odd number of nodes)
  • autoheal: suitable when service continuity matters more than data consistency across nodes (the partition with fewer client connections is restarted, so some data may be lost)

Configure /etc/rabbitmq/rabbitmq.config as follows:

[
    {rabbit, [
        {default_user_tags, []},
        {cluster_partition_handling, autoheal}  %% added
    ]},
    {kernel, [{net_ticktime, 150}]}  %% added
].

Edit the second line of /etc/rabbitmq/rabbitmq-env.conf:

CONFIG_FILE=/etc/rabbitmq/rabbitmq.config

Restart the RabbitMQ nodes one at a time:

systemctl restart rabbitmq-server.service

Use rabbitmqctl status | grep net_ticktime to check whether the change took effect.
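
If your version's status output does not include net_ticktime, the live kernel setting can also be read with an eval (net_kernel:get_net_ticktime/0 is a standard Erlang/OTP function):

rabbitmqctl eval 'net_kernel:get_net_ticktime().'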

PS: you can find the Erlang process ID with ps -ef | grep erlang and kill it if necessary.


Common problems

Force-deleting NaN queues

After a cluster failure, large numbers of queues may show up as NaN and cannot be deleted from the management UI. They can be removed with the methods below.

  • Method 1

When a queue cannot be deleted normally, force-delete it with:

rabbitmqctl eval 'rabbit_amqqueue:internal_delete({resource,<<"/">>,queue,<<"cinder.info">>}).'

  • Method 2

rabbitmqctl eval "Q = rabbit_misc:r(<<\"$vhost\">>, queue, <<\"$queue\">>), rabbit_amqqueue:internal_delete(Q, <<\"cli\">>)."
rabbitmqctl list_queues -p xie-queue-name | awk -F " " '{print $1}' | grep -v "Timeout" | grep -v "Listing" | grep -Ev "^$" | xargs -p -I abcdef rabbitmqctl eval "Q = rabbit_misc:r(<<\"xie-queue-name\">>, queue, <<\"abcdef\">>), rabbit_amqqueue:internal_delete(Q, <<\"cli\">>)."

Ref: https://community.pivotal.io/s/article/Investigating-Ghost-queues-on-RabbitMQ?language=en_US

  • Method 3

When NaN queues appear in bulk, stopping every node in the cluster can clear them out.

Split-brain problem

The symptom is that the RabbitMQ GUI shows:

Network partition detected
Mnesia reports that this RabbitMQ cluster has experienced a network partition. There is a risk of losing data. Please read RabbitMQ documentation about network partitions and the possible solutions.

Cause:

A network problem caused the cluster to partition.

Fix:

Pick the partition you trust less, and on each node in that partition run:

rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl start_app

In some cases, simply restarting the RabbitMQ service is enough:

systemctl restart rabbitmq-server

Failure to join the cluster

[root@rabbitmq-102 rabbitmq]# rabbitmqctl join_cluster rabbit@rabbitmq-101 --ram
Clustering node 'rabbit@rabbitmq-102' with 'rabbit@rabbitmq-101'
Error: {inconsistent_cluster,"Node 'rabbit@rabbitmq-101' thinks it's clustered with node 'rabbit@rabbitmq-102', but 'rabbit@rabbitmq-102' disagrees"}

Run the repair steps:

# on rabbitmq-102: stop the app
rabbitmqctl stop_app
# on rabbitmq-102: reset the node
rabbitmqctl reset
# on rabbitmq-101: remove the broken node from the cluster
rabbitmqctl forget_cluster_node rabbit@rabbitmq-102
# on rabbitmq-102: rejoin the cluster (as a RAM or disc node)
rabbitmqctl join_cluster rabbit@rabbitmq-101 --ram
# start the app
rabbitmqctl start_app

Note: in some cases the last node to go offline cannot be brought back up. It can be removed from the cluster using the forget_cluster_node rabbitmqctl command.

Data corruption

The management UI shows Virtual host / experienced an error on node rabbit@<hostname> and may be inaccessible.

The RabbitMQ logs contain many errors referencing a file such as:

/var/lib/rabbitmq/mnesia/rabbit@<hostname>/msg_stores/vhosts/9Q355W65BH1CKHLV81SV8HAB4/recovery.dets

Fix:

  1. Stop the RabbitMQ cluster
  2. Delete the corrupted vhosts/9Q355W65BH1CKHLV81SV8HAB4 file
  3. Start the RabbitMQ cluster

References

  1. http://www.rabbitmq.com/clustering.html
  2. https://www.rabbitmq.com/memory.html
  3. https://www.rabbitmq.com/memory-use.html
  4. http://tryrabbitmq.com/