MySQL Replication Troubleshooting

Sometimes things don’t work out with replication. When I first started experimenting with it I thought this was a “setup and forget about it” kind of job.

Experience has shown though that you have to regularly triple check and see if things may have broken (despite a good and once working setup).

Let’s take a look at what you can do when your Slave isn’t replicating anymore. If you want to know more about how to setup replication, have a look at my previous article in which I explain how this works.

Master Troubleshooting

Replication problems on your Slave do not usually indicate a fault on the Master; as long as it’s creating a log, all is well. You can check this by issuing a couple of “show master status” commands in 30 second intervals and see the log position change.

Provided data is being created or changed, the log positions should be different. If no log status shows up, then you’re being told that “binary loggin is disabled” – in which case you need to update your my.cnf file (see my other article about this).

Slave Troubleshooting

More often replication fails because of something which isn’t working well on the Slave. Common scenarios include:

login problems: wrong user/password/host combination
data import/synchronization problems
problematic statements
corrupt tables or databases

Let’s go through these one by one. I intend to update this article if I find other problems (and solutions). Feel free to share your experiences in a comment at the bottom.

How to check what’s bothering your Slave

Your best place to see where something’s gone wrong is the MySQL log. On CentOS and many other Linux distributions that’s in /var/log/mysqld.log and you can check the last 10 lines with

tail /var/log/mysqld.log

Not the easiest thing to read, but you get the basic idea of what the problem is.

Slave can’t login to the Master

Any issues pertaining to log data not being read, or connections not being possible are most certainly due to login problems.

See if you yourself can login to your Master as the replication user if you’ve set one up. Log in like this:

mysql -h domain+or+ip -u replicationuser -p

If this isn’t possible then you need to look at your credentials on the Master. Perhaps the replication password is wrong, perhaps the user doesn’t have the “replication slave” privilege, or is not allowed to connect from your host.

You can always change your slave to use your root user momentarily to see if connections in general are possible. Note that this is not meant as a long term solution.

Master and Slave are out of sync

If you CAN login and your Slave still throws up errors then it’s usually because the Slave is trying to execute a command from the Master log and fails. For example, he may try to delete a database that doesn’t exist. Such things can happen if you’ve (accidentally) changed data on the Slave. That’s a big no-no.

It’s “read only”, all else is just a replica of the Master. Even though you can introduce new data onto the Slave, changing data that has been replicated from the Master will cause problems.

Sync problems can also occur when you didn’t restart MySQL on the Slave just after the big data import.

Lucky for us this is not a biggie though: simply stop the Slave, restart MySQL and give the your Slave the Master Log Coordinates manually (that’s why it’s handy to make a note of those).

You do it like this, obviously replacing the values with your own:

stop slave;

change master to 
MASTER_LOG_FILE = 'mysql-bin.000037', 
MASTER_LOG_POS = 14462;

start slave;

Corrupt Data on the Slave

Sometimes tables or databases can get corrupt when they’re not properly closed. These things can happen on the Master – and as a result find their way onto the Slave, or something bad can happen on the Slave.

In either case, the Slave may not be able to process a statement and as a result will also stop and complain in his log.

To overcome this, you have two options:

skip the bad statement and carry on
repairing the corrupt tables and carry on

Let’s discuss how to tackle both.

Skip a Statement or two (not recommended)

This is the quick and dirty way of making the Slave ignore one (or several) Master log statements. This is not a problem if we’re dealing with disposable or temporary data – but it is a problem if you’re not (and frankly, who can really tell).

Be aware that skipping statements on the Slave will cause data inconsistencies – if these are noticeable or not depends entirely on your data and applications.

Here’s how you tell the Slave to skip the next 1 statement:

stop slave;
set global sql_slave_skip_counter = 1;
start slave;

You can skip as many events as you like by increasing the value. Once the Slave is running again, go back to the log and see if processing goes well. Use this approach with extreme caution!

There’s an interesting little script here by Vito Botta on how to do this automatically.

Repairing your Tables and Databases

Ignoring a problem and hoping it doesn’t bite you later is never a good idea – repairing what’s broken on the other hand sounds like a good plan.

You can repair single tables or databases with the “repair table” and “repair database” commands respectively – but what happens when you have hundreds of databases containing several thousand tables? You’d sit there all night issuing commands.

Lucky for us there’s a handy tool that does it for us:

mysqlcheck – check it out in the MySQL Manual.

From the OS command prompt, issue this:

mysqlcheck -p -A --auto-repair

This can take some time, but it will check and repair every table in every database you have. You can even perform this on a live server without taking it offline, how cool is that.

Once done, start your Slave again, give it some time to catch up and will hopefully resume its replication duties.

Last Resort

If none of the above works for you, then you can really only do one thing: lock your master again, take a new snapshot and re-import it to the Slave.