Wednesday, October 19, 2016

PXC crash (All nodes)

At work, we had an incident where every node in our PXC cluster crashed in a domino effect, about one second apart. Luckily for us, we always keep slaves, both as backups and as active servers for reads. So when our entire PXC cluster crashed, we kept our music streaming services running on the slave databases, even though nothing could be written to the database during that time. It's also important to note that, for HA, we never keep nodes belonging to the same cluster on the same rack, so this was not a case of power going off on one rack in the data center.

This is my theory of what has been causing our PXC nodes to crash in domino style. I suspect the cause is corrupted data files or indexes, which themselves get corrupted when the MySQL server or the host system is killed in the middle of an update. The death of mysqld with a “signal 11” might be the result of a hardware failure, like the RAID failure we had before when the entire kdb cluster crashed. But Galera propagates updates to all PXC nodes synchronously [1], with no logs or indicators from the system to tell the other nodes whether the updates were successful. I refer to this approach here as lazy replication in the synchronous replication world.

So with lazy replication, we can conclude that updates are propagated before they are committed, leaving a very small window in which the nodes in the cluster are not in sync. Within a single transaction, the nodes are in sync again only once the transaction commits and all nodes hold the same value [2].

So let me try to explain in more depth why this theory holds water.

According to the Galera documentation [1], "The wsrep API uses a replication model that considers the database server to have a state. The state refers to the contents of the database. When a database is in use, clients modify the database content, thus changing its state. The wsrep API represents the changes in the database state as a series of atomic changes, or transactions." These changes in the database state, represented by the wsrep API as a series of transactions, are what is propagated to the other nodes in a Galera-based cluster. Changes on a single node are propagated to the other nodes synchronously. So when something goes wrong in the middle of an update and ends up corrupting some transactions (which might end in a rollback) [2], there are no logs or indicators from the system to tell whether the updates were successful. The corrupt transactions are therefore propagated to the other nodes in the cluster. In some cases, these corrupted transactions might confuse mysqld and cause it to die [2], and the same scenario then plays out on all the other nodes, because nodes in the cluster always have the same state.
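To make the argument concrete, here is a toy Python sketch of the theory. It is not Galera's or the wsrep API's real interface: the `Node` class, the `replicate` function, and the `"corrupt"` flag are all hypothetical names invented for illustration, and the rule that a corrupt write set kills the node applying it is simply the assumption my theory makes. It only shows that if every node applies the same stream of changes, one bad change can take down every node in turn.

```python
from dataclasses import dataclass, field


class NodeCrashed(Exception):
    """Stands in for mysqld dying, e.g. with signal 11."""


@dataclass
class Node:
    name: str
    state: dict = field(default_factory=dict)
    alive: bool = True

    def apply(self, writeset: dict) -> None:
        # Assumption of the theory: applying a corrupt write set kills the node.
        if writeset.get("corrupt"):
            self.alive = False
            raise NodeCrashed(self.name)
        self.state[writeset["key"]] = writeset["value"]


def replicate(nodes, writeset):
    """Deliver the same write set to every live node; return the names that crashed."""
    crashed = []
    for node in nodes:
        if not node.alive:
            continue
        try:
            node.apply(writeset)
        except NodeCrashed:
            crashed.append(node.name)
    return crashed


cluster = [Node("node1"), Node("node2"), Node("node3")]

# A healthy write set leaves every node with the same state.
replicate(cluster, {"key": "track:1", "value": "play_count=10"})

# A corrupt write set is delivered to every node, so every node goes down.
crashed = replicate(cluster, {"key": "track:1", "value": "???", "corrupt": True})
print(crashed)                          # ['node1', 'node2', 'node3']
print([node.alive for node in cluster])  # [False, False, False]
```

In this model nothing checks the write set before it is handed to the next node, which is exactly the gap the theory points at.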

Conclusion:

So with this theory, and looking at our last crash of the entire kdb cluster, when the disk on one of our servers failed, it's possible this corrupted some data files or indexes while MySQL was in the middle of processing an update, confusing mysqld and causing it to die. But since PXC (which uses Galera) is synchronous, this corrupt data was propagated almost instantly while the other servers were still in the middle of processing complex updates, which confused those mysqld instances too and caused them to die with SIGSEGV.

A crash of the entire cluster like this is very rare and very difficult to reproduce. I tried to reproduce it in order to have practical support for my theory, but it was hard to do. So this might be a corner case, i.e., the crash only manifests itself when multiple environmental variables or conditions are simultaneously at extreme levels, e.g., CPU, memory, connections, load. With so many potential causes, the crash is harder and more expensive to reproduce, test, and optimize for, because doing so requires maximal configurations in multiple dimensions. A further complication is that the combination of factors might be different each time we experience the domino effect crash.

REFERENCES

[1] http://galeracluster.com/documentation-webpages/architecture.html
[2] http://galeracluster.com/documentation-webpages/introduction.html