Redis Sentinel Mode


Why use Redis Sentinel:
————————————————————————————-
Sentinel mainly exists to handle node failures in a master-replica (master-slave) replication setup. There are two cases:

1. A replica goes down

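When a replica restarts, it automatically rejoins the replication setup and resynchronizes its data. Since Redis 2.8, a replica that reconnects after losing its link to the master can catch up with a partial (incremental) resynchronization instead of a full copy.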

2. The master goes down

Recovery takes the following 2 manual steps:

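Step 1: Run SLAVEOF NO ONE on a replica to break the replication link and promote it to master so that it keeps serving requests.
Step 2: After the old master is restarted, run SLAVEOF on it to make it a replica of the new master; the data is then synchronized back to it.
Doing this recovery by hand is tedious and error-prone, so Redis provides Sentinel to automate it.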
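
The two manual steps above can be sketched as the redis-cli invocations an operator would run. This is only an illustration: the function name and the host/port values are assumptions, and only SLAVEOF NO ONE and SLAVEOF come from the text.

```python
def manual_failover_commands(promote, old_master):
    """Return the redis-cli invocations for a manual failover.

    promote and old_master are (host, port) tuples; both are hypothetical
    values chosen by the operator.
    """
    p_host, p_port = promote
    o_host, o_port = old_master
    return [
        # Step 1: detach the chosen replica and promote it to master.
        f"redis-cli -h {p_host} -p {p_port} SLAVEOF NO ONE",
        # Step 2: once the old master is back up, attach it to the new master.
        f"redis-cli -h {o_host} -p {o_port} SLAVEOF {p_host} {p_port}",
    ]


for line in manual_failover_commands(("127.0.0.1", 6382), ("127.0.0.1", 6380)):
    print(line)
```

Any remaining replicas would also need a SLAVEOF pointing at the new master, which is part of what makes the manual procedure error-prone.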

————————————————————————————-

What is Redis Sentinel:

Redis Sentinel is a system for managing a Redis master-replica deployment. It performs the following three tasks:
————————————————————————————-
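1. Monitoring: Sentinel continuously checks whether your master and replica servers are working properly.
2. Notification: when a monitored Redis server has a problem, Sentinel can notify the administrator or other applications through an API.
3. Automatic failover: when a master stops working properly, Sentinel starts an automatic failover. It promotes one of the failed master's replicas to be the new master and reconfigures the other replicas to replicate from it. Clients that try to reach the failed master are given the new master's address, so the new master takes over from the failed one.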
————————————————————————————-

1. Single-Sentinel mode

Go to the Redis installation directory and edit the Sentinel configuration file sentinel.conf:

vim sentinel.conf
sentinel monitor mymaster 127.0.0.1 6380 1

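Explanation: sentinel monitor sets the name and address of the master to watch. The trailing 1 is the quorum, the minimum number of Sentinels that must agree that the master is down.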
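
To make the role of the quorum more concrete, here is a minimal sketch of the two thresholds involved. The function names are hypothetical. The rule that a failover also needs the votes of a majority of the known Sentinels comes from the Redis Sentinel documentation, not from this article.

```python
def is_objectively_down(sdown_votes, quorum):
    """ODOWN is reached once at least `quorum` Sentinels report SDOWN."""
    return sdown_votes >= quorum


def can_authorize_failover(leader_votes, total_sentinels, quorum):
    """A failover leader needs a majority of all Sentinels and at least `quorum` votes."""
    majority = total_sentinels // 2 + 1
    return leader_votes >= max(majority, quorum)


# Single Sentinel with quorum 1, as configured above: its own vote is enough.
print(is_objectively_down(1, 1), can_authorize_failover(1, 1, 1))
# Three Sentinels with quorum 1: one vote marks ODOWN, but a failover needs two.
print(is_objectively_down(1, 1), can_authorize_failover(1, 3, 1))
```

A single Sentinel is itself a single point of failure, which is why the later section runs three of them.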

Open a new SSH session to start Sentinel and watch its status.

Start Sentinel:

./src/redis-sentinel sentinel.conf

Kill the master process (a plain kill sends SIGTERM, so Redis saves and shuts down cleanly):

[root@test3 ~]# kill 50668
[root@test3 ~]# 50668:signal-handler (1513352001) Received SIGTERM scheduling shutdown...
50668:M 15 Dec 23:33:21.984 # User requested shutdown...
50668:M 15 Dec 23:33:21.984 * Saving the final RDB snapshot before exiting.
50668:M 15 Dec 23:33:21.987 * DB saved on disk
50668:M 15 Dec 23:33:21.987 * Removing the pid file.
50668:M 15 Dec 23:33:21.987 # Redis is now ready to exit, bye bye...

Wait about half a minute, then start 6380 again.

The Sentinel output is as follows:

[root@test3 src]# redis-sentinel  ../sentinel.conf 
50718:X 15 Dec 22:41:36.711 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
50718:X 15 Dec 22:41:36.711 # Redis version=4.0.2, bits=64, commit=00000000, modified=0, pid=50718, just started
50718:X 15 Dec 22:41:36.711 # Configuration loaded
50718:X 15 Dec 22:41:36.712 * Increased maximum number of open files to 10032 (it was originally set to 1024).
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 4.0.2 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in sentinel mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 26379
 |    `-._   `._    /     _.-'    |     PID: 50718
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

# Sentinel ID
50718:X 15 Dec 22:41:36.715 # Sentinel ID is 777d65d1a6d22405689907e55518dcb00cf52df5
# A monitor named mymaster was added for the master
50718:X 15 Dec 22:41:36.715 # +monitor master mymaster 127.0.0.1 6380 quorum 1
# The following 2 slave nodes were discovered
50718:X 15 Dec 22:41:36.716 * +slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6380
50718:X 15 Dec 22:41:36.717 * +slave slave 127.0.0.1:6382 127.0.0.1 6382 @ mymaster 127.0.0.1 6380

# SDOWN: subjectively down, meaning this Sentinel instance considers the Redis server unavailable. ODOWN (objectively down) follows once the quorum is reached.
50718:X 15 Dec 22:44:14.693 # +sdown master mymaster 127.0.0.1 6380
50718:X 15 Dec 22:44:14.693 # +odown master mymaster 127.0.0.1 6380 #quorum 1/1
50718:X 15 Dec 22:44:14.693 # +new-epoch 1
# try-failover: attempt a failover
50718:X 15 Dec 22:44:14.693 # +try-failover master mymaster 127.0.0.1 6380
# Vote to elect the Sentinel leader; the current leader ID is 777d65d1a6d22405689907e55518dcb00cf52df5
50718:X 15 Dec 22:44:14.696 # +vote-for-leader 777d65d1a6d22405689907e55518dcb00cf52df5 1
# Leader elected
50718:X 15 Dec 22:44:14.696 # +elected-leader master mymaster 127.0.0.1 6380
# Select one of the slaves to become the master
50718:X 15 Dec 22:44:14.696 # +failover-state-select-slave master mymaster 127.0.0.1 6380
# 6382 is selected as the new master
50718:X 15 Dec 22:44:14.780 # +selected-slave slave 127.0.0.1:6382 127.0.0.1 6382 @ mymaster 127.0.0.1 6380
# Send the SLAVEOF NO ONE command
50718:X 15 Dec 22:44:14.780 * +failover-state-send-slaveof-noone slave 127.0.0.1:6382 127.0.0.1 6382 @ mymaster 127.0.0.1 6380
# Wait for the promotion to master
50718:X 15 Dec 22:44:14.871 * +failover-state-wait-promotion slave 127.0.0.1:6382 127.0.0.1 6382 @ mymaster 127.0.0.1 6380
# 6382 is promoted to master
50718:X 15 Dec 22:44:15.256 # +promoted-slave slave 127.0.0.1:6382 127.0.0.1 6382 @ mymaster 127.0.0.1 6380
# The failover state switches to reconf-slaves
50718:X 15 Dec 22:44:15.256 # +failover-state-reconf-slaves master mymaster 127.0.0.1 6380
# A reconfiguration command is sent to 6381 so that it replicates from the new master
50718:X 15 Dec 22:44:15.324 * +slave-reconf-sent slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6380
# 6381 switches its master from the original 6380 to the new one
50718:X 15 Dec 22:44:16.298 * +slave-reconf-inprog slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6380
50718:X 15 Dec 22:44:16.298 * +slave-reconf-done slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6380
# Failover complete
50718:X 15 Dec 22:44:16.381 # +failover-end master mymaster 127.0.0.1 6380
# The master switches from 6380 to 6382
50718:X 15 Dec 22:44:16.381 # +switch-master mymaster 127.0.0.1 6380 127.0.0.1 6382
# The next 2 lines add 6381 and 6380 as slaves of 6382
50718:X 15 Dec 22:44:16.381 * +slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6382
50718:X 15 Dec 22:44:16.381 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6382
50718:X 15 Dec 22:44:41.158 * +convert-to-slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6382
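
The event lines in the log above follow a regular shape, so they can be picked apart with a short script. The following is a rough sketch, assuming the Redis 4.0 line format shown in this article; the regular expression and function name are illustrative and not part of Redis.

```python
import re

# pid:X, timestamp, a "#" or "*" marker, then the +event/-event name and its details.
EVENT_RE = re.compile(r"^(\d+):X (\d+ \w+ [\d:.]+) [#*] ([+-][\w-]+) (.*)$")


def parse_sentinel_events(log_text):
    """Yield (timestamp, event, details) for each Sentinel event line."""
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:  # banner, comment, and startup lines do not match
            _pid, timestamp, event, details = match.groups()
            yield timestamp, event, details


sample = """\
50718:X 15 Dec 22:44:14.693 # +sdown master mymaster 127.0.0.1 6380
50718:X 15 Dec 22:44:14.693 # +odown master mymaster 127.0.0.1 6380 #quorum 1/1
50718:X 15 Dec 22:44:16.381 # +switch-master mymaster 127.0.0.1 6380 127.0.0.1 6382
"""

for timestamp, event, details in parse_sentinel_events(sample):
    if event == "+switch-master":
        name, old_ip, old_port, new_ip, new_port = details.split()
        print(f"{timestamp}: {name} moved from {old_ip}:{old_port} to {new_ip}:{new_port}")
```

Filtering on +switch-master is a simple way to spot completed failovers, since that event carries both the old and the new master address.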

Check the Redis status:

127.0.0.1:6382> info replication
# Replication
role:master
connected_slaves:2
slave0:ip=127.0.0.1,port=6381,state=online,offset=605905,lag=1
slave1:ip=127.0.0.1,port=6380,state=online,offset=605905,lag=1
master_replid:ef51779d1bbd69b9d24f78b843f0e19d5b41b2de
master_replid2:f265eeea184220f5030b262217ec0c0f85655a6f
master_repl_offset:605905
second_repl_offset:10510
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:561
repl_backlog_histlen:605345

6382 has become the new master, and 6380 and 6381 are now slaves.
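
The same check can be scripted by parsing the INFO replication text. This is a minimal sketch that assumes the key:value layout shown above; the function name is hypothetical, and values are kept as strings.

```python
def parse_info_replication(text):
    """Parse `INFO replication` output into a dict; slaveN lines become nested dicts."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and the "# Replication" section header
        key, _, value = line.partition(":")
        if key.startswith("slave") and "=" in value:
            # e.g. slave0:ip=127.0.0.1,port=6381,state=online,offset=605905,lag=1
            value = dict(field.split("=", 1) for field in value.split(","))
        info[key] = value
    return info


sample = """\
# Replication
role:master
connected_slaves:2
slave0:ip=127.0.0.1,port=6381,state=online,offset=605905,lag=1
slave1:ip=127.0.0.1,port=6380,state=online,offset=605905,lag=1
"""

info = parse_info_replication(sample)
replica_ports = sorted(v["port"] for v in info.values() if isinstance(v, dict))
print(info["role"], replica_ports)
```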

Looking at the Redis configuration files of 6380 and 6381 shows that the slaveof setting has changed:

slaveof 127.0.0.1 6382

Both now point to 6382 as their master.

2. Sentinel cluster

Create 3 separate Sentinel configuration files:

sentinel.conf sentinel/sentinel26380.conf sentinel/sentinel26381.conf

A sample configuration file looks like this:

port 26379
daemonize yes
protected-mode no
dir "/tmp/redis26379"
loglevel debug
logfile "/tmp/sentinel26379/sentinel26379.log"
sentinel monitor mymaster 127.0.0.1 6382 1
sentinel down-after-milliseconds mymaster 3000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 10000

Configuration explained:

———————————————————————————————————-
port 26379

The port Sentinel listens on.

daemonize yes

By default Redis does not run as a daemon. Set this option to yes to run it as a daemon.

dir "/tmp/redis26379"

The working directory.

sentinel monitor mymaster 127.0.0.1 6382 1

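The name, address, and port of the master that Sentinel monitors; the name mymaster is user-defined. The trailing 1 is the quorum: the number of Sentinels that must consider the master down before it is treated as really down.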

sentinel down-after-milliseconds mymaster 3000

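How long the master must be unreachable before this Sentinel considers it down, in milliseconds. It is set to 3 seconds here; the Redis default is 30 seconds.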

sentinel failover-timeout mymaster 10000

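The failover timeout in milliseconds, 10 seconds here. This is a time limit on the failover process rather than a timeout for connecting to the master.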

sentinel parallel-syncs mymaster 1

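The maximum number of slaves that may resynchronize with the new master at the same time during a failover. The smaller the number, the longer the failover takes to complete; the larger the number, the more slaves are unavailable at once because of replication. Setting it to 1 ensures that only one slave at a time is unable to serve commands.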

————————————————————————————————————-
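
Since the three files differ only in their port-dependent values, they can be generated from one template. The sketch below follows the sample above; the helper name is hypothetical, and it assumes the dir and logfile paths follow the same per-port naming pattern as the 26379 sample.

```python
TEMPLATE = """\
port {port}
daemonize yes
protected-mode no
dir "/tmp/redis{port}"
loglevel debug
logfile "/tmp/sentinel{port}/sentinel{port}.log"
sentinel monitor mymaster {master_host} {master_port} {quorum}
sentinel down-after-milliseconds mymaster 3000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 10000
"""


def render_sentinel_conf(port, master_host="127.0.0.1", master_port=6382, quorum=1):
    """Fill the template for one Sentinel; the defaults mirror the sample config."""
    return TEMPLATE.format(
        port=port, master_host=master_host, master_port=master_port, quorum=quorum
    )


configs = {port: render_sentinel_conf(port) for port in (26379, 26380, 26381)}
for port, text in configs.items():
    print(f"--- sentinel on port {port} ---")
    print(text)
```

The directories named in dir and logfile need to exist before Sentinel starts. With three Sentinels, a quorum of 2 is a common choice, though the sample in this article keeps it at 1.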

Start the Sentinels:

[root@test3 redis-4.0.2]# ./src/redis-sentinel sentinel.conf 
52935:X 18 Dec 19:21:03.794 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
52935:X 18 Dec 19:21:03.794 # Redis version=4.0.2, bits=64, commit=00000000, modified=0, pid=52935, just started
52935:X 18 Dec 19:21:03.794 # Configuration loaded

[root@test3 redis-4.0.2]# ./src/redis-sentinel sentinel/sentinel26380.conf 
52940:X 18 Dec 19:22:33.536 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
52940:X 18 Dec 19:22:33.536 # Redis version=4.0.2, bits=64, commit=00000000, modified=0, pid=52940, just started
52940:X 18 Dec 19:22:33.536 # Configuration loaded

[root@test3 redis-4.0.2]# ./src/redis-sentinel sentinel/sentinel26381.conf 
52945:X 18 Dec 19:22:36.728 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
52945:X 18 Dec 19:22:36.728 # Redis version=4.0.2, bits=64, commit=00000000, modified=0, pid=52945, just started
52945:X 18 Dec 19:22:36.728 # Configuration loaded

Verification:

Kill the process of the current master (the master is 6382), then watch the log output:

# Find the process ID
ps aux |grep redis
root      50551  0.1  0.2 149352  4068 pts/0    Sl   12月15   7:40 /usr/local/src/redis-4.0.2/src/redis-server 127.0.0.1:6381
root      50683  0.1  0.2 147304  4164 pts/1    Sl   12月15   7:47 /usr/local/src/redis-4.0.2/src/redis-server 127.0.0.1:6382
root      50725  0.1  0.2 147304  4140 pts/1    Sl   12月15   7:30 /usr/local/src/redis-4.0.2/src/redis-server 127.0.0.1:6380
root      52936  0.3  0.1 145256  2388 ?        Ssl  19:21   0:00 ./src/redis-sentinel *:26379 [sentinel]
root      52941  0.3  0.1 145256  2384 ?        Ssl  19:22   0:00 ./src/redis-sentinel *:26380 [sentinel]
root      52946  0.3  0.1 145256  2320 ?        Ssl  19:22   0:00 ./src/redis-sentinel *:26381 [sentinel]
root      52951  0.0  0.0 112676   984 pts/0    R+   19:22   0:00 grep --color=auto redis

# Kill the master
[root@test3 redis-4.0.2]# kill 50683

# Log output during the Sentinel failover (these lines come from the Redis server with pid 50551, port 6381)
[root@test3 redis-4.0.2]# 50551:S 18 Dec 19:23:43.583 # Connection with master lost.
50551:S 18 Dec 19:23:43.583 * Caching the disconnected master state.
50551:S 18 Dec 19:23:44.384 * Connecting to MASTER 127.0.0.1:6382
50551:S 18 Dec 19:23:44.385 * MASTER <-> SLAVE sync started
50551:S 18 Dec 19:23:44.385 # Error condition on socket for SYNC: Connection refused
50551:S 18 Dec 19:23:45.393 * Connecting to MASTER 127.0.0.1:6382
50551:S 18 Dec 19:23:45.393 * MASTER <-> SLAVE sync started
50551:S 18 Dec 19:23:45.393 # Error condition on socket for SYNC: Connection refused
50551:S 18 Dec 19:23:46.400 * Connecting to MASTER 127.0.0.1:6382
50551:S 18 Dec 19:23:46.401 * MASTER <-> SLAVE sync started
50551:S 18 Dec 19:23:46.401 # Error condition on socket for SYNC: Connection refused
50551:M 18 Dec 19:23:46.892 # Setting secondary replication ID to ef51779d1bbd69b9d24f78b843f0e19d5b41b2de, valid up to offset: 45804038. New replication ID is 58f257d0ff3b4dd1aa6c507c1d0760213205db97
50551:M 18 Dec 19:23:46.892 * Discarding previously cached master state.
50551:M 18 Dec 19:23:46.893 * MASTER MODE enabled (user request from 'id=18 addr=127.0.0.1:48044 fd=10 name=sentinel-d15ec52e-cmd age=73 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')
50551:M 18 Dec 19:23:46.897 # CONFIG REWRITE executed with success.
50551:M 18 Dec 19:23:47.412 * Replication backlog freed after 3600 seconds without connected slaves.
50551:M 18 Dec 19:23:48.071 * Slave 127.0.0.1:6380 asks for synchronization
50551:M 18 Dec 19:23:48.071 * Unable to partial resync with slave 127.0.0.1:6380 for lack of backlog (Slave request was: 45804038).
50551:M 18 Dec 19:23:48.071 * Starting BGSAVE for SYNC with target: disk
50551:M 18 Dec 19:23:48.072 * Background saving started by pid 52953
52953:C 18 Dec 19:23:48.077 * DB saved on disk
52953:C 18 Dec 19:23:48.078 * RDB: 0 MB of memory used by copy-on-write
50551:M 18 Dec 19:23:48.116 * Background saving terminated with success
50551:M 18 Dec 19:23:48.116 * Synchronization with slave 127.0.0.1:6380 succeeded

# A master is elected again and the slaves are pointed at it
[root@test3 redis-4.0.2]# 52985:S 18 Dec 19:31:45.500 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
52985:S 18 Dec 19:31:45.500 * SLAVE OF 127.0.0.1:6382 enabled (user request from 'id=2 addr=127.0.0.1:49012 fd=7 name=sentinel-de5fe35f-cmd age=10 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')
52985:S 18 Dec 19:31:45.507 # CONFIG REWRITE executed with success.
52985:S 18 Dec 19:31:45.526 * Connecting to MASTER 127.0.0.1:6382
52985:S 18 Dec 19:31:45.527 * MASTER <-> SLAVE sync started
52985:S 18 Dec 19:31:45.527 * Non blocking connect for SYNC fired the event.
52985:S 18 Dec 19:31:45.529 * Master replied to PING, replication can continue...
52985:S 18 Dec 19:31:45.530 * Trying a partial resynchronization (request bd99ddc9c78acc08fd33fa3ad85b47b77a7313a6:1).
52956:M 18 Dec 19:31:45.530 * Slave 127.0.0.1:6381 asks for synchronization
52956:M 18 Dec 19:31:45.530 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for 'bd99ddc9c78acc08fd33fa3ad85b47b77a7313a6', my replication IDs are '9d0c14e8fb55532f12c861e982f75390d4b35164' and 'd4399da2538dc650b92fe019c3f1e2292c1493b3')
52956:M 18 Dec 19:31:45.530 * Starting BGSAVE for SYNC with target: disk
52956:M 18 Dec 19:31:45.531 * Background saving started by pid 52992
52985:S 18 Dec 19:31:45.532 * Full resync from master: 9d0c14e8fb55532f12c861e982f75390d4b35164:45896676
52985:S 18 Dec 19:31:45.532 * Discarding previously cached master state.
52992:C 18 Dec 19:31:45.537 * DB saved on disk
52992:C 18 Dec 19:31:45.539 * RDB: 0 MB of memory used by copy-on-write
52956:M 18 Dec 19:31:45.616 * Background saving terminated with success
52956:M 18 Dec 19:31:45.616 * Synchronization with slave 127.0.0.1:6381 succeeded
52985:S 18 Dec 19:31:45.616 * MASTER <-> SLAVE sync: receiving 178 bytes from master
52985:S 18 Dec 19:31:45.616 * MASTER <-> SLAVE sync: Flushing old data
52985:S 18 Dec 19:31:45.616 * MASTER <-> SLAVE sync: Loading DB in memory
52985:S 18 Dec 19:31:45.616 * MASTER <-> SLAVE sync: Finished with success

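The first log above shows 6381 being promoted to the new master. After 6382 was started again, the later log entries show the slaves being pointed at 6382 once more. INFO on 6382 reports: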

127.0.0.1:6382> INFO replication
# Replication
role:master
connected_slaves:2
slave0:ip=127.0.0.1,port=6380,state=online,offset=45916803,lag=0
slave1:ip=127.0.0.1,port=6381,state=online,offset=45916803,lag=0
master_replid:9d0c14e8fb55532f12c861e982f75390d4b35164
master_replid2:d4399da2538dc650b92fe019c3f1e2292c1493b3
master_repl_offset:45916803
second_repl_offset:45889031
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:45813331
repl_backlog_histlen:103473 
Tags: Redis. Published: 2019-10-31 17:27:34