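MHA online switching is a second mode that MHA offers alongside automatic monitoring and failover. It is mostly used for planned work such as hardware upgrades or MySQL database migration. It switches quickly and blocks writes gracefully, with no need to shut down the original server. The whole switch takes roughly 0.5-2 seconds, which greatly reduces downtime. An online master switch starts only when all of the following conditions are met: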
1. IO threads on all slaves are running.
2. SQL threads on all slaves are running.
3. Seconds_Behind_Master on all slaves is less than or equal to --running_updates_limit seconds.
4. On the master, no update query in the SHOW PROCESSLIST output has been running for more than --running_updates_limit seconds.
These restrictions exist for safety, and so that the switch to the new master completes as quickly as possible.
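Condition 4 can be checked by hand before attempting the switch. The following is a minimal sketch, not MHA's own implementation. It assumes the tab-separated output of `mysql -B -e 'SHOW PROCESSLIST'` (with its header row) on stdin. The function name `long_running_updates` and the rule "Command is Query and the statement does not start with SELECT or SHOW" are assumptions made for illustration.

```shell
# Sketch: list statements on the master that have run longer than a limit.
# Input (stdin): tab-separated output of `mysql -B -e 'SHOW PROCESSLIST'`,
# columns: Id User Host db Command Time State Info.
# The function name and the filtering rule are illustrative assumptions.
long_running_updates() {
  local limit="$1"
  awk -F'\t' -v limit="$limit" '
    NR == 1 { next }                          # skip the header row
    $5 != "Query" { next }                    # ignore Sleep, Binlog Dump, etc.
    tolower($8) ~ /^(select|show)/ { next }   # ignore read-only statements
    ($6 + 0) > limit { print $1 "\t" $6 "\t" $8 }   # Id, Time, Info
  '
}
```

Any line it prints is a statement that would likely block the switch for the given limit; empty output suggests the master is quiet enough.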
1. Check whether masterha_manager is currently running (stopping it is recommended)
[root@DBproxy app2]# masterha_check_status --conf=/data/masterha/app1/app1.cnf
app1 (pid:6769) is running(0:PING_OK), master:192.168.0.50
[root@DBproxy app2]#
2. Check the slave's IO thread, SQL thread, and Seconds_Behind_Master
[mysql@MyDB02 masterha]$ mysql -uroot -p123456 -h192.168.0.60 -e 'show slave status \G'|grep -E "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master"
Warning: Using a password on the command line interface can be insecure.
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
[mysql@MyDB02 masterha]$
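The grep above also matches Slave_SQL_Running_State and leaves the comparison against the limit to the reader. As a rough sketch, the three slave-side conditions could be evaluated automatically from the same `show slave status \G` output. The function name `check_slave_status` and its messages are hypothetical; the limit argument plays the role of --running_updates_limit.

```shell
# Sketch: verify conditions 1-3 from `SHOW SLAVE STATUS\G` output on stdin.
# Prints one line per violated condition; returns non-zero if any was found.
# check_slave_status is a hypothetical helper, not part of MHA.
check_slave_status() {
  local limit="$1"
  awk -v limit="$limit" '
    $1 == "Slave_IO_Running:"      { io  = $2 }
    $1 == "Slave_SQL_Running:"     { sql = $2 }   # exact match, so *_State is skipped
    $1 == "Seconds_Behind_Master:" { lag = $2 }
    END {
      bad = 0
      if (io != "Yes")  { print "IO thread not running";  bad = 1 }
      if (sql != "Yes") { print "SQL thread not running"; bad = 1 }
      if (lag == "" || lag == "NULL" || lag + 0 > limit) {
        print "lag too high or unknown: " lag; bad = 1
      }
      exit bad
    }
  '
}

# Hypothetical usage:
#   mysql -h192.168.0.60 -e 'show slave status \G' | check_slave_status 10000
```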
3. Perform the online switch
[root@DBproxy masterha]# masterha_master_switch --conf=/data/masterha/app1/app1.cnf --master_state=alive --new_master_host=192.168.0.60 --orig_master_is_new_slave --running_updates_limit=10000 --interactive=0
Sat Jul 16 09:11:00 2016 - [info] MHA::MasterRotate version 0.56.
Sat Jul 16 09:11:00 2016 - [info] Starting online master switch..
Sat Jul 16 09:11:00 2016 - [info]
Sat Jul 16 09:11:00 2016 - [info] * Phase 1: Configuration Check Phase..
Sat Jul 16 09:11:00 2016 - [info]
Sat Jul 16 09:11:00 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Jul 16 09:11:00 2016 - [info] Reading application default configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 09:11:00 2016 - [info] Reading server configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 09:11:00 2016 - [info] GTID failover mode = 0
Sat Jul 16 09:11:00 2016 - [info] Current Alive Master: 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 09:11:00 2016 - [info] Alive Slaves:
Sat Jul 16 09:11:00 2016 - [info] 192.168.0.60(192.168.0.60:3306) Version=5.6.29-log (oldest major version between slaves) log-bin:enabled
Sat Jul 16 09:11:00 2016 - [info] Replicating from 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 09:11:00 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Jul 16 09:11:00 2016 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Sat Jul 16 09:11:00 2016 - [info] ok.
Sat Jul 16 09:11:00 2016 - [info] Checking MHA is not monitoring or doing failover..
Sat Jul 16 09:11:00 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln142] Getting advisory lock failed on the current master. MHA Monitor runs on the current master. Stop MHA Manager/Monitor and try again.
Sat Jul 16 09:11:00 2016 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/bin/masterha_master_switch line 53
[root@DBproxy masterha]#
The switch is refused because MHA Manager is still monitoring the master. Stop MHA and test again:
[root@DBproxy masterha]# masterha_stop --conf=/data/masterha/app1/app1.cnf
Stopped app1 successfully.
[2]- Exit 1 nohup masterha_manager --conf=/data/masterha/app1/app1.cnf 2>&1 (wd: /data/masterha/app2)
(wd now: /data/masterha)
[root@DBproxy masterha]#
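The advisory-lock error comes from attempting the switch while masterha_manager is still monitoring the master. A small guard can make that check explicit before every switch. This is only a sketch: `manager_is_running` is a hypothetical helper, and it relies solely on the "is running(" wording that masterha_check_status printed in the session above.

```shell
# Sketch: decide from masterha_check_status output whether the manager is up.
# Reads the status text on stdin; succeeds (exit 0) if it reports "is running(".
# manager_is_running is a hypothetical helper, not an MHA command.
manager_is_running() {
  grep -q 'is running('
}

# Hypothetical usage (paths as in the session above):
#   if masterha_check_status --conf=/data/masterha/app1/app1.cnf | manager_is_running; then
#     masterha_stop --conf=/data/masterha/app1/app1.cnf
#   fi
```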
4. Perform the online switch again
[root@DBproxy masterha]# masterha_master_switch --conf=/data/masterha/app1/app1.cnf --master_state=alive --new_master_host=192.168.0.60 --orig_master_is_new_slave --running_updates_limit=10000 --interactive=0
Sat Jul 16 09:15:03 2016 - [info] MHA::MasterRotate version 0.56.
Sat Jul 16 09:15:03 2016 - [info] Starting online master switch..
Sat Jul 16 09:15:03 2016 - [info]
Sat Jul 16 09:15:03 2016 - [info] * Phase 1: Configuration Check Phase..
Sat Jul 16 09:15:03 2016 - [info]
Sat Jul 16 09:15:03 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Jul 16 09:15:03 2016 - [info] Reading application default configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 09:15:03 2016 - [info] Reading server configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 09:15:03 2016 - [info] GTID failover mode = 0
Sat Jul 16 09:15:03 2016 - [info] Current Alive Master: 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 09:15:03 2016 - [info] Alive Slaves:
Sat Jul 16 09:15:03 2016 - [info] 192.168.0.60(192.168.0.60:3306) Version=5.6.29-log (oldest major version between slaves) log-bin:enabled
Sat Jul 16 09:15:03 2016 - [info] Replicating from 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 09:15:03 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Jul 16 09:15:03 2016 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Sat Jul 16 09:15:03 2016 - [info] ok.
Sat Jul 16 09:15:03 2016 - [info] Checking MHA is not monitoring or doing failover..
Sat Jul 16 09:15:03 2016 - [info] Checking replication health on 192.168.0.60..
Sat Jul 16 09:15:03 2016 - [info] ok.
Sat Jul 16 09:15:03 2016 - [info] 192.168.0.60 can be new master.
Sat Jul 16 09:15:03 2016 - [info]
From:
192.168.0.50(192.168.0.50:3306) (current master)
 +--192.168.0.60(192.168.0.60:3306)
To:
192.168.0.60(192.168.0.60:3306) (new master)
 +--192.168.0.50(192.168.0.50:3306)
Sat Jul 16 09:15:03 2016 - [info] Checking whether 192.168.0.60(192.168.0.60:3306) is ok for the new master..
Sat Jul 16 09:15:03 2016 - [info] ok.
Sat Jul 16 09:15:03 2016 - [info] 192.168.0.50(192.168.0.50:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.
Sat Jul 16 09:15:03 2016 - [info] 192.168.0.50(192.168.0.50:3306): Resetting slave pointing to the dummy host.
Sat Jul 16 09:15:03 2016 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Jul 16 09:15:03 2016 - [info]
Sat Jul 16 09:15:03 2016 - [info] * Phase 2: Rejecting updates Phase..
Sat Jul 16 09:15:03 2016 - [info]
Sat Jul 16 09:15:03 2016 - [warning] master_ip_online_change_script is not defined. Skipping disabling writes on the current master.
Sat Jul 16 09:15:03 2016 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
Sat Jul 16 09:15:03 2016 - [info] Executing FLUSH TABLES WITH READ LOCK..
Sat Jul 16 09:15:03 2016 - [info] ok.
Sat Jul 16 09:15:03 2016 - [info] Orig master binlog:pos is mysql-bin.000009:40355591.
Sat Jul 16 09:15:03 2016 - [info] Waiting to execute all relay logs on 192.168.0.60(192.168.0.60:3306)..
Sat Jul 16 09:15:03 2016 - [info] master_pos_wait(mysql-bin.000009:40355591) completed on 192.168.0.60(192.168.0.60:3306). Executed 0 events.
Sat Jul 16 09:15:03 2016 - [info] done.
Sat Jul 16 09:15:03 2016 - [info] Getting new master's binlog name and position..
Sat Jul 16 09:15:03 2016 - [info] mysql-bin.000006:120
Sat Jul 16 09:15:03 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.0.60', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000006', MASTER_LOG_POS=120, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Sat Jul 16 09:15:03 2016 - [info]
Sat Jul 16 09:15:03 2016 - [info] * Switching slaves in parallel..
Sat Jul 16 09:15:03 2016 - [info]
Sat Jul 16 09:15:03 2016 - [info] Unlocking all tables on the orig master:
Sat Jul 16 09:15:03 2016 - [info] Executing UNLOCK TABLES..
Sat Jul 16 09:15:03 2016 - [info] ok.
Sat Jul 16 09:15:03 2016 - [info] Starting orig master as a new slave..
Sat Jul 16 09:15:03 2016 - [info] Resetting slave 192.168.0.50(192.168.0.50:3306) and starting replication from the new master 192.168.0.60(192.168.0.60:3306)..
Sat Jul 16 09:15:03 2016 - [info] Executed CHANGE MASTER.
Sat Jul 16 09:15:14 2016 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln784] Slave could not be started on 192.168.0.50(192.168.0.50:3306)! Check slave status.
Sat Jul 16 09:15:14 2016 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln862] Starting slave IO/SQL thread on 192.168.0.50(192.168.0.50:3306) failed!
Sat Jul 16 09:15:14 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln573] Failed!
Sat Jul 16 09:15:14 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln602] Switching master to 192.168.0.60(192.168.0.60:3306) done, but switching slaves partially failed.
[root@DBproxy masterha]#
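In a log this long the failure is easy to miss among the [info] entries. One possible way to surface it is to filter the output down to the [error] entries. The sketch below assumes the log is available with one entry per line (as masterha_master_switch prints it on the terminal); `mha_errors` is a hypothetical helper name.

```shell
# Sketch: print only the [error] entries of an MHA log read on stdin,
# stripping the timestamp and the "[file, lnNNN]" source-location prefix.
# mha_errors is a hypothetical helper, not part of MHA.
mha_errors() {
  grep -F -- '- [error]' | sed -E 's/^.* - \[error\]\[[^]]*\] //'
}

# Hypothetical usage:
#   masterha_master_switch --conf=/data/masterha/app1/app1.cnf ... 2>&1 | mha_errors
```

Applied to the run above, this would leave only the four failure messages, starting with "Slave could not be started on 192.168.0.50(192.168.0.50:3306)! Check slave status."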
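Judging from the logs on the master and slave machines themselves, the failure was probably caused by missing IP-to-hostname mappings on the two hosts. Fix /etc/hosts: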
/etc/hosts on the master (before):
127.0.0.1 MyDB01
/etc/hosts on the slave (before):
127.0.0.1 MyDB02
/etc/hosts on both machines after the change:
[root@MyDB02 ~]# more /etc/hosts
192.168.0.60 MyDB02
192.168.0.50 MyDB01
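A quick way to catch this kind of mapping problem ahead of time is to confirm that each database hostname resolves to a routable address rather than to loopback. The sketch below works on hosts-file text from stdin; `maps_to_real_ip` is a hypothetical helper, and treating any 127.x.x.x mapping as a problem is an assumption based on the fix described above.

```shell
# Sketch: succeed only if a hosts file (stdin) maps the given hostname
# to an address outside 127.0.0.0/8. Lines starting with # are ignored.
# maps_to_real_ip is a hypothetical helper name.
maps_to_real_ip() {
  local name="$1"
  awk -v name="$name" '
    /^[[:space:]]*#/ { next }
    {
      for (i = 2; i <= NF; i++)
        if ($i == name) { if ($1 ~ /^127\./) bad = 1; else ok = 1 }
    }
    END { exit !(ok && !bad) }   # exit 0 only if mapped and never to loopback
  '
}

# Hypothetical usage on each database host:
#   maps_to_real_ip MyDB01 < /etc/hosts && maps_to_real_ip MyDB02 < /etc/hosts
```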
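Because the previous operation did not fully succeed, the two machines were left in a dual-master configuration. After a manual switch, the setup was restored to the original one-master, one-slave architecture. Run a check before the online switch: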
[root@DBproxy app1]# masterha_check_repl --conf=/data/masterha/app1/app1.cnf
Sat Jul 16 10:24:49 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Jul 16 10:24:49 2016 - [info] Reading application default configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 10:24:49 2016 - [info] Reading server configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 10:24:49 2016 - [info] MHA::MasterMonitor version 0.56.
Sat Jul 16 10:24:49 2016 - [info] GTID failover mode = 0
Sat Jul 16 10:24:49 2016 - [info] Dead Servers:
Sat Jul 16 10:24:49 2016 - [info] Alive Servers:
Sat Jul 16 10:24:49 2016 - [info] 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 10:24:49 2016 - [info] 192.168.0.60(192.168.0.60:3306)
Sat Jul 16 10:24:49 2016 - [info] Alive Slaves:
Sat Jul 16 10:24:49 2016 - [info] 192.168.0.60(192.168.0.60:3306) Version=5.6.29-log (oldest major version between slaves) log-bin:enabled
Sat Jul 16 10:24:49 2016 - [info] Replicating from 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 10:24:49 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Jul 16 10:24:49 2016 - [info] Current Alive Master: 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 10:24:49 2016 - [info] Checking slave configurations..
Sat Jul 16 10:24:49 2016 - [info] read_only=1 is not set on slave 192.168.0.60(192.168.0.60:3306).
Sat Jul 16 10:24:49 2016 - [info] Checking replication filtering settings..
Sat Jul 16 10:24:49 2016 - [info] binlog_do_db= , binlog_ignore_db=
Sat Jul 16 10:24:49 2016 - [info] Replication filtering check ok.
Sat Jul 16 10:24:49 2016 - [info] GTID (with auto-pos) is not supported
Sat Jul 16 10:24:49 2016 - [info] Starting SSH connection tests..
Sat Jul 16 10:24:50 2016 - [info] All SSH connection tests passed successfully.
Sat Jul 16 10:24:50 2016 - [info] Checking MHA Node version..
Sat Jul 16 10:24:51 2016 - [info] Version check ok.
Sat Jul 16 10:24:51 2016 - [info] Checking SSH publickey authentication settings on the current master..
Sat Jul 16 10:24:51 2016 - [info] HealthCheck: SSH to 192.168.0.50 is reachable.
Sat Jul 16 10:24:51 2016 - [info] Master MHA Node version is 0.56.
Sat Jul 16 10:24:51 2016 - [info] Checking recovery script configurations on 192.168.0.50(192.168.0.50:3306)..
Sat Jul 16 10:24:51 2016 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql/3306/binlog --output_file=/data/masterha/app1/save_binary_logs_test --manager_version=0.56 --start_file=mysql-bin.000010
Sat Jul 16 10:24:51 2016 - [info] Connecting to root@192.168.0.50(192.168.0.50:22)..
  Creating /data/masterha/app1 if not exists.. ok.
  Checking output directory is accessible or not.. ok.
  Binlog found at /data/mysql/3306/binlog, up to mysql-bin.000010
Sat Jul 16 10:24:52 2016 - [info] Binlog setting check done.
Sat Jul 16 10:24:52 2016 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Sat Jul 16 10:24:52 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=192.168.0.60 --slave_ip=192.168.0.60 --slave_port=3306 --workdir=/data/masterha/app1 --target_version=5.6.29-log --manager_version=0.56 --relay_log_info=/data/mysql/3306/data/relay-log.info --relay_dir=/data/mysql/3306/data/ --slave_pass=xxx
Sat Jul 16 10:24:52 2016 - [info] Connecting to root@192.168.0.60(192.168.0.60:22)..
  Checking slave recovery environment settings..
    Opening /data/mysql/3306/data/relay-log.info ... ok.
    Relay log found at /data/mysql/3306/binlog, up to relay-bin.000002
    Temporary relay log file is /data/mysql/3306/binlog/relay-bin.000002
    Testing mysql connection and privileges.. done.
    Testing mysqlbinlog output.. done.
    Cleaning up test file(s).. done.
Sat Jul 16 10:24:53 2016 - [info] Slaves settings check done.
Sat Jul 16 10:24:53 2016 - [info]
192.168.0.50(192.168.0.50:3306) (current master)
 +--192.168.0.60(192.168.0.60:3306)
Sat Jul 16 10:24:53 2016 - [info] Checking replication health on 192.168.0.60..
Sat Jul 16 10:24:53 2016 - [info] ok.
Sat Jul 16 10:24:53 2016 - [warning] master_ip_failover_script is not defined.
Sat Jul 16 10:24:53 2016 - [warning] shutdown_script is not defined.
Sat Jul 16 10:24:53 2016 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
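The pre-flight check can also be used as a gate, so the switch is attempted only when replication is reported healthy. A minimal sketch, assuming the summary line "MySQL Replication Health is OK." appears on its own line in the masterha_check_repl output as it does in this session; `repl_health_ok` is a hypothetical helper.

```shell
# Sketch: succeed only if masterha_check_repl output (stdin) contains the
# exact summary line "MySQL Replication Health is OK.".
# repl_health_ok is a hypothetical helper, not part of MHA.
repl_health_ok() {
  grep -qxF 'MySQL Replication Health is OK.'
}

# Hypothetical usage:
#   masterha_check_repl --conf=/data/masterha/app1/app1.cnf 2>&1 | repl_health_ok \
#     || { echo "replication check failed, not switching" >&2; exit 1; }
```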
5. Perform the switch
[root@DBproxy app1]# masterha_master_switch --conf=/data/masterha/app1/app1.cnf --master_state=alive --new_master_host=192.168.0.60 --orig_master_is_new_slave --running_updates_limit=10000 --interactive=0
Sat Jul 16 10:26:59 2016 - [info] MHA::MasterRotate version 0.56.
Sat Jul 16 10:26:59 2016 - [info] Starting online master switch..
Sat Jul 16 10:26:59 2016 - [info]
Sat Jul 16 10:26:59 2016 - [info] * Phase 1: Configuration Check Phase..
Sat Jul 16 10:26:59 2016 - [info]
Sat Jul 16 10:26:59 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Jul 16 10:26:59 2016 - [info] Reading application default configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 10:26:59 2016 - [info] Reading server configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 10:26:59 2016 - [info] GTID failover mode = 0
Sat Jul 16 10:26:59 2016 - [info] Current Alive Master: 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 10:26:59 2016 - [info] Alive Slaves:
Sat Jul 16 10:26:59 2016 - [info] 192.168.0.60(192.168.0.60:3306) Version=5.6.29-log (oldest major version between slaves) log-bin:enabled
Sat Jul 16 10:26:59 2016 - [info] Replicating from 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 10:26:59 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Jul 16 10:26:59 2016 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Sat Jul 16 10:26:59 2016 - [info] ok.
Sat Jul 16 10:26:59 2016 - [info] Checking MHA is not monitoring or doing failover..
Sat Jul 16 10:26:59 2016 - [info] Checking replication health on 192.168.0.60..
Sat Jul 16 10:26:59 2016 - [info] ok.
Sat Jul 16 10:26:59 2016 - [info] 192.168.0.60 can be new master.
Sat Jul 16 10:26:59 2016 - [info]
From:
192.168.0.50(192.168.0.50:3306) (current master)
 +--192.168.0.60(192.168.0.60:3306)
To:
192.168.0.60(192.168.0.60:3306) (new master)
 +--192.168.0.50(192.168.0.50:3306)
Sat Jul 16 10:26:59 2016 - [info] Checking whether 192.168.0.60(192.168.0.60:3306) is ok for the new master..
Sat Jul 16 10:26:59 2016 - [info] ok.
Sat Jul 16 10:26:59 2016 - [info] 192.168.0.50(192.168.0.50:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.
Sat Jul 16 10:26:59 2016 - [info] 192.168.0.50(192.168.0.50:3306): Resetting slave pointing to the dummy host.
Sat Jul 16 10:26:59 2016 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Jul 16 10:26:59 2016 - [info]
Sat Jul 16 10:26:59 2016 - [info] * Phase 2: Rejecting updates Phase..
Sat Jul 16 10:26:59 2016 - [info]
Sat Jul 16 10:26:59 2016 - [warning] master_ip_online_change_script is not defined. Skipping disabling writes on the current master.
Sat Jul 16 10:26:59 2016 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
Sat Jul 16 10:26:59 2016 - [info] Executing FLUSH TABLES WITH READ LOCK..
Sat Jul 16 10:26:59 2016 - [info] ok.
Sat Jul 16 10:26:59 2016 - [info] Orig master binlog:pos is mysql-bin.000010:120.
Sat Jul 16 10:26:59 2016 - [info] Waiting to execute all relay logs on 192.168.0.60(192.168.0.60:3306)..
Sat Jul 16 10:27:00 2016 - [info] master_pos_wait(mysql-bin.000010:120) completed on 192.168.0.60(192.168.0.60:3306). Executed 0 events.
Sat Jul 16 10:27:00 2016 - [info] done.
Sat Jul 16 10:27:00 2016 - [info] Getting new master's binlog name and position..
Sat Jul 16 10:27:00 2016 - [info] mysql-bin.000008:239
Sat Jul 16 10:27:00 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.0.60', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000008', MASTER_LOG_POS=239, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Sat Jul 16 10:27:00 2016 - [info]
Sat Jul 16 10:27:00 2016 - [info] * Switching slaves in parallel..
Sat Jul 16 10:27:00 2016 - [info]
Sat Jul 16 10:27:00 2016 - [info] Unlocking all tables on the orig master:
Sat Jul 16 10:27:00 2016 - [info] Executing UNLOCK TABLES..
Sat Jul 16 10:27:00 2016 - [info] ok.
Sat Jul 16 10:27:00 2016 - [info] Starting orig master as a new slave..
Sat Jul 16 10:27:00 2016 - [info] Resetting slave 192.168.0.50(192.168.0.50:3306) and starting replication from the new master 192.168.0.60(192.168.0.60:3306)..
Sat Jul 16 10:27:00 2016 - [info] Executed CHANGE MASTER.
Sat Jul 16 10:27:00 2016 - [info] Slave started.
Sat Jul 16 10:27:00 2016 - [info] All new slave servers switched successfully.
Sat Jul 16 10:27:00 2016 - [info]
Sat Jul 16 10:27:00 2016 - [info] * Phase 5: New master cleanup phase..
Sat Jul 16 10:27:00 2016 - [info]
Sat Jul 16 10:27:00 2016 - [info] 192.168.0.60: Resetting slave info succeeded.
Sat Jul 16 10:27:00 2016 - [info] Switching master to 192.168.0.60(192.168.0.60:3306) completed successfully.
[root@DBproxy app1]#
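The log records the exact binlog coordinates on the new master in its "Statement should be: CHANGE MASTER TO ..." entry. That is useful if another slave later has to be pointed at 192.168.0.60 by hand. As a sketch, the coordinates could be pulled out of a saved log like this, assuming one log entry per line; `new_master_coords` is a hypothetical helper.

```shell
# Sketch: extract the new master's binlog file and position from the
# "CHANGE MASTER TO ..." entry of a masterha_master_switch log (stdin).
# Prints "file:pos" for the first matching line.
# new_master_coords is a hypothetical helper, not part of MHA.
new_master_coords() {
  sed -nE "s/.*MASTER_LOG_FILE='([^']+)', MASTER_LOG_POS=([0-9]+).*/\1:\2/p" | head -n 1
}
```

For the successful run above this yields mysql-bin.000008:239, matching the "Getting new master's binlog name and position" entry.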