The Day the Replication Died

On Thursday, March 7th, we scrambled. Most of Kickstarter's traffic is served from replicated copies of the MySQL database, and those replicas had quit updating. The problem wasn't on one replica that we could easily replace; it was on all the replicas. And MySQL was telling us something new:

mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Last_Error: Could not execute Update_rows event on table
kickstarter.backings; Duplicate entry '123456-789' for key
'index_backings_on_project_id_and_sequence', Error_code: 1062;
handler error HA_ERR_FOUND_DUPP_KEY; the event's master log
mysql-bin-changelog.169933, end_log_pos 12969124

We immediately set to work. Over the next few hours we kept the site stable, minimized the effects of stale replicas, communicated the issue to users, recovered, sounded the all clear, and then watched as the whole cycle repeated itself.

But that's a different story. This is about discovering a MySQL bug. Let's talk shop.

Background: Replication

To understand the problem we first had to dig into MySQL's replication modes. We rely on Amazon RDS for managed MySQL, and their default replication mode is MIXED. According to MySQL's docs this is a best-of-both-worlds hybrid between statement- and row-based replication.
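For reference, you can check which mode a server is using by inspecting the binlog_format variable (the result below is illustrative, not a capture from our servers):

mysql> SHOW VARIABLES LIKE 'binlog_format';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| binlog_format | MIXED |
+---------------+-------+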

To summarize:

Statement-Based Replication

This is the most efficient replication. In this mode, MySQL replicates the query itself, along with any additional context it needs, such as the current time or the next insert id. This keeps the master's binlog writes small and lets each replica efficiently replay the same query.

The downside is that some queries may not be deterministic: they may not replay the same on each replica.
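For example (a contrived query, not one of ours), an UPDATE with a LIMIT but no ORDER BY depends on whichever row each server happens to visit first:

-- Replayed verbatim under statement-based replication, so the master
-- and each replica may update a different row.
master> UPDATE tbl SET foo = 'qux' LIMIT 1;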

Row-Based Replication

This is the most accurate replication. Instead of replicating the query, it replicates the new version of each row in its entirety. The replicas simply replace their version with the new version.
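As a rough sketch (illustrative output, not a capture from our binlogs), a row event decoded with mysqlbinlog --verbose carries the full before and after image of the changed row rather than the statement that changed it:

### UPDATE `kickstarter`.`tbl`
### WHERE
###   @1=13
###   @2='bar'
### SET
###   @1=13
###   @2='qux'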

Mixed-Mode Replication

In this mode, MySQL favors efficient statement-based replication until it recognizes an unsafe query. Then it temporarily switches to row-based replication.
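For instance (a contrived pair of queries, not ours), the first statement below is deterministic and would be logged as the statement itself, while the second is flagged unsafe and would be logged as row images instead:

-- Deterministic: logged as a statement.
master> UPDATE tbl SET foo = 'qux' WHERE id = 12;

-- Nondeterministic (UUID()): logged as the resulting row image.
master> INSERT INTO tbl (foo, uniq_col) VALUES (UUID(), 42);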

Breaking Down the Problem

Once the replication error told us where to look, we were able to easily spot our inconsistent data: a range of rows where the replicas were out of sync with the master. But this particular data had been inconsistent for days, and when we expanded our search, we found some inconsistent data over a month old. Why had it waited to break?

master> select * from tbl;  replica> select * from tbl;
+----+------+----------+    +----+------+----------+
| id | foo  | uniq_col |    | id | foo  | uniq_col |
+----+------+----------+    +----+------+----------+
| .. | ...  | ...      |    | .. | ...  | ...      |
| 12 | bar  | 10       |    | 12 | bar  | 4        |
| 13 | baz  | 4        |    | 13 | baz  | 10       |
| .. | ...  | ...      |    | .. | ...  | ...      |
+----+------+----------+    +----+------+----------+

Inconsistent data is bad enough on its own, but it was only half of our issue. It wasn't until a later unsafe query triggered row-based replication that replication broke.

-- An example unsafe query
master> UPDATE tbl SET foo = 'qux' ORDER BY rand() LIMIT 1;

-- Would replicate like (decoded as SQL):
replica> UPDATE tbl SET id = 13, foo = 'qux', uniq_col = 4 WHERE id = 13;
ERROR 1062: Duplicate entry '4' for key 'index_foo_on_uniq_col'

The Affected Feature

The inconsistent data was a handful of backer sequences. We aim to assign each backer a unique and incremental number for that project when they complete the pledge process. This is pretty helpful for reports that we give to creators.

In an effort to avoid race conditions and unnecessary rollbacks/retries from duplicate keys, we opted for a background job that assigns sequences to a project's recent backings using a user-defined variable as a counter. The order comes from the pledged_at timestamp, with id as a tiebreaker.

SELECT COALESCE(MAX(sequence), 0)
FROM backings
WHERE project_id = ?
INTO @sequence;

UPDATE backings
SET sequence = @sequence := @sequence + 1
WHERE project_id = ?
  AND sequence IS NULL
  AND pledged_at IS NOT NULL
ORDER BY pledged_at ASC, id ASC;

Somehow that query had given backings a different sequence on the replicas. But it has a well-specified ORDER BY; why didn't it work?

Back In Time

We found evidence that in the timeframe when the data became inconsistent, a set of transactions hung while waiting for locks after writing to the backings table. InnoDB's transaction engine is optimized for COMMIT: it writes each statement's changes to disk as it executes, so that COMMIT has nothing left to do but mark the transaction as complete and release locks.

Then the transactions finished, but out of order. Since MySQL flushes a transaction's statements to the binlog on COMMIT, the order in which records were written to disk on the master was different from the order in which the replicas wrote them when replaying the binlog.
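As a contrived illustration (not our actual transactions), two sessions can execute in one order but commit in the other, and the binlog records them in commit order:

master> BEGIN;                                    -- session A
master> INSERT INTO tbl (foo) VALUES ('first');   -- session A
master> BEGIN;                                    -- session B
master> INSERT INTO tbl (foo) VALUES ('second');  -- session B
master> COMMIT;                                   -- session B: hits the binlog first
master> COMMIT;                                   -- session A: hits the binlog second

Session A's insert executed first on the master, but a replica replaying the binlog applies session B's insert first, so the two servers write the same rows to disk in different orders.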

But commit order should only matter when there's no explicit ORDER BY clause, and we had one. Puzzling over this led us to the final piece: a bug where MySQL will sometimes ignore the ORDER BY clause on an UPDATE. Without that clause in effect, the master and the replicas each fell back on their own implicit row order, ran the sequencing, and drifted out of sync.
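One way to sidestep that class of bug (a sketch of the general idea, not necessarily the exact rewrite we shipped) is to resolve the order on the master first and then issue plain single-row updates, which replicate deterministically in any mode:

-- Read the ids in the desired order...
SELECT id
FROM backings
WHERE project_id = ?
  AND sequence IS NULL
  AND pledged_at IS NOT NULL
ORDER BY pledged_at ASC, id ASC;

-- ...then, for each id in that order, assign the next sequence value
-- computed by the application:
UPDATE backings SET sequence = ? WHERE id = ?;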

Lessons Learned

Databases are intricate. In the Rails community we sometimes treat them as simple data stores, hidden behind ActiveRecord and ActiveRelation. But it's important to find opportunities to better understand how they do what we ask them to do. The answers are illuminating!

Comments

      Creator Rene Romero Benavides on April 10, 2013

      One thing I don't understand is the rationale behind the statement "This is the most efficient replication" given that it poses so many inconsistency issues. My guess is that with pure binary replication this wouldn't have happened. I usually work with PostgreSQL, where binary replication is extremely efficient; you might give it a try someday. Regards!

      Creator Mike DeSanctis on April 10, 2013

      This blog will be fantastic to follow. Given the impending death of Google Reader, I would love to see a dedicated Facebook or Twitter page to make it easy to stay up to date.

      Creator Lance Ivy on April 10, 2013

      @Rene: It's more efficient because it's sparser. That saves I/O at many steps: writing to the binlog, reading from the binlog for each replica, writing to the replay log on each replica, and ultimately replaying the actual change against the replica.

      You're right that SBR has consistency issues though, which is why mixed mode recognizes the key moments to switch to RBR (binary) for accuracy. In this case that was all working as intended! Unfortunately it couldn't account for a bug with some UPDATE/ORDER queries -- that's the true culprit as I see it.

      Creator Chris Williams on April 10, 2013

      Great first post! I'm curious, since the problem surfaced as a unique key collision, how long did it take before it was obvious that MySQL was the source of the issue? Any tips on finding the root of esoteric bugs like this one?

      Creator Carson Baker on April 10, 2013

      That is so over my head, but the code formatting, and all the other typography on this page, looks killer. I think *code* is my favorite. Looking forward to more posts!

      Creator Lance Ivy on April 10, 2013

      @Chris: We knew that all of the clues were in MySQL, but it took a few days of searching to bring all the pieces together. It wasn't until the end that we happened on the bug itself and everything clicked. I still think it was a somewhat lucky find -- the reduction is fairly picky, and depending on how I approached it I may not have even noticed.

      Creator Be Good Today Inc. on April 11, 2013

      wow awesome that you guys shared this. we will pass along this info to our team. thanks.

      Creator peter murphy on April 11, 2013

      When I need a unique id, I always opt for using guids. They can be created from any tier in the app, not needing to be centrally assigned from a db, which removes many of these design problems. I would think a db in 4th normal form would have separate fields for the unique id, and project id displayed to the user. just my 2 cents. And, thanks for sharing. It is courageous to be transparent.

      Creator Brian R Cavanagh on April 24, 2013

      Updates using session variables aren't reliable, particularly when your server comes under load. Unless you have a really great reason, only use row-based replication. Don't worry about efficiency; consistency is the only thing that matters.

      Creator Lance Ivy on April 24, 2013

      @Peter We considered using an existing unique key, but wanted to try and provide something meaningful to our users. Thanks though!

      @Brian We suspected the session variable as well, and quickly rewrote the query another way just in case. Now that I've seen how MySQL handles them in the binary log though, I don't see them as unreliable.

      In retrospect I could have communicated this more clearly in the post, but the mixed mode replication here worked just as promised. The ultimate issue seems to have been MySQL's inconsistent handling of an UPDATE...ORDER statement, which would've been a problem even with pure row-based replication.

      Since the bug affects MySQL 5.5, 5.6, and 5.7, our takeaway is to avoid that type of query altogether for the foreseeable future.