Elasticsearch at Kickstarter

Back in December 2012, we developed a new version of our project search tool on Kickstarter using Elasticsearch. We're really happy with the results and have since found Elasticsearch's filtering and faceting features useful in tools for project creators, our message inbox, and other areas of the site. I'd like to write a little about how we gradually rolled out Elasticsearch, as it might be useful for others looking at adding secondary data stores to their stack.

Ramping up

We liked Elasticsearch's features but were initially cautious about how it would behave in production, so we deployed a change that let us divert a percentage of project search requests to a new version built with Elasticsearch:

# Read the rollout ratio from a config file; fall back to 0.0 (no
# Elasticsearch traffic) if the file is missing or unreadable.
ratio = File.read(ELASTICSEARCH_RATIO_PATH).to_f rescue 0.0
# rand returns a float in [0, 1), so this request is routed to the
# Elasticsearch implementation roughly `ratio` of the time.
experimental = (rand <= ratio)

Over time we ramped up the percentage of traffic sent to our Elasticsearch implementation, keeping a close eye on internal metrics to evaluate its performance. Fortunately we didn't hit any major snags, so before long we were sending 100% of our project search traffic to Elasticsearch. For further reading on the topic, Etsy and Flickr have written some great posts that go into more detail on config flags and rolling out features gradually.

An index primer

An index in Elasticsearch is a logical namespace for data and can store multiple types of documents. Types roughly correspond to business models, and each index has a mapping that defines how it stores its types. At Kickstarter, each index defines the mapping for just one type, so an index for projects only defines the mapping for a project type. A very simple mapping for a project type with a name and goal might look like this:

$ curl -XGET 'http://localhost:9200/projects/project/_mapping'
{
  "project": {
    "properties": {
      "name": {
        "type": "string"
      },
      "goal": {
        "type": "double",
        "null_value": 0.0
      }
    }
  }
}

Keeping indices up to date

MySQL is our canonical data store. When an index in Elasticsearch is first created, it contains no documents, so a full index must be performed from MySQL to populate it. Once the index has been populated, it's ready to respond to search requests. However, the data in MySQL changes over time. New projects are created and existing projects are updated. These changes need to be sent to Elasticsearch, or the projects index will serve stale data.
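
A full index like this can be done efficiently with Elasticsearch's bulk API, which takes alternating action and document lines. Here's a minimal sketch of how the bulk payload might be built; the `rows` and project data are illustrative stand-ins for batches of records fetched from MySQL, not our actual indexing code:

```ruby
require 'json'

# Build a newline-delimited _bulk payload from a batch of MySQL rows.
# Each document is indexed under its MySQL primary key.
def bulk_index_body(rows, index: 'projects', type: 'project')
  rows.flat_map do |row|
    [
      # Action line: index this document with the given id.
      { index: { _index: index, _type: type, _id: row[:id] } }.to_json,
      # Source line: the document itself, minus the id column.
      row.reject { |k, _| k == :id }.to_json
    ]
  end.join("\n") + "\n" # the bulk API requires a trailing newline
end

rows = [
  { id: 1, name: 'Pebble', goal: 100_000.0 },
  { id: 2, name: 'OUYA',   goal: 950_000.0 }
]
puts bulk_index_body(rows)
```

A payload like this would be POSTed to /_bulk, batch by batch, until the new index holds every project from MySQL.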

Each document has an ID in Elasticsearch, and a document can be updated by performing an index operation using that ID. Each project document in Elasticsearch has the same ID as its corresponding record in MySQL. When the project changes in MySQL, we're able to reindex just that project document in Elasticsearch so that our search index is only a few seconds behind MySQL.
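
As a sketch, reindexing a single changed project boils down to one index request addressed by its MySQL primary key; indexing to an ID that already exists simply overwrites the stored document. The record and ID below are illustrative:

```ruby
require 'json'

# Build the index request for one changed project. Since the
# Elasticsearch document id is the project's MySQL primary key,
# this overwrites the stale copy of just that document.
def reindex_request(record)
  {
    method: 'PUT',
    path:   "/projects/project/#{record[:id]}",
    body:   record.reject { |k, _| k == :id }.to_json
  }
end

req = reindex_request(id: 42, name: 'Updated project name', goal: 5000.0)
# req[:path] => "/projects/project/42"
```

PUTting `req[:body]` to that path replaces the old version of the document in place.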

The need for new indices

Performing updates to documents in Elasticsearch to keep them in sync with MySQL has taken us some time to get right (a topic for another blog post!), but one way we've mitigated problems with stale data is by making it really easy to create a new index and fully populate it with the latest data from MySQL. This is also useful when the mapping for a type needs to change. Rather than updating the mapping for an existing index, we create an index with the new mapping and populate it, and any old indices are left as is. This avoids having to deal with mapping merge conflicts or inconsistencies with documents having been indexed using different mappings.

This process of creating and populating new indices started off with a cron task to fully index projects every 20 minutes. As we improved our ability to keep Elasticsearch in sync with MySQL, we reduced the frequency of the cron task so that now the full index is only performed nightly.

The full indexing nitty-gritty

Each time we create a new index, it is given a name based on the type and time, e.g. projects_2013_05_19_13_33_27. It takes some time to fully populate a new index, so while it is building, all our reads continue to go to the existing projects index. Elasticsearch has a nifty aliasing feature that allows us to associate indices with an alias. Search requests are sent to the alias, which directs the requests to any indices associated with it. Our application code directs all project read requests to an alias named projects. When full indexing is complete, the projects alias is atomically switched from the old index to the new index, so we never need to hardcode index names like projects_2013_05_19_13_33_27 into our application.
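
The naming scheme and the swap can be sketched as follows. The _aliases endpoint accepts a list of add/remove actions and applies them atomically, which is what makes the switch safe; the index names below are illustrative:

```ruby
require 'json'

# A new index gets a timestamped name based on its type.
def timestamped_index_name(type, now = Time.now.utc)
  "#{type}_#{now.strftime('%Y_%m_%d_%H_%M_%S')}"
end

# Once the new index is fully populated, move the alias from the old
# index to the new one in a single atomic _aliases request.
def alias_swap_body(old_index, new_index, alias_name = 'projects')
  {
    actions: [
      { remove: { index: old_index, alias: alias_name } },
      { add:    { index: new_index, alias: alias_name } }
    ]
  }.to_json
end

new_index = timestamped_index_name('projects')
puts alias_swap_body('projects_2013_05_19_13_33_27', new_index)
```

POSTing this body to /_aliases applies both actions at once, so searches against the projects alias never see a half-built or missing index.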

Some of our more complex indices take several hours to build, so we also had to figure out what to do with records that are updated while a full index is in progress. Both the new and existing indices need to be updated, otherwise one would end up with stale data.

When a new index for projects is being populated, we associate it with the projects_new alias. We tried sending a bulk request to index changes in both projects and projects_new, but if a full index isn't taking place then this request would 404 since no index would be associated with the projects_new alias. Instead, we query Elasticsearch before each write to retrieve the indices aliased to projects and projects_new, and perform a single bulk indexing request directly against those indices.
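
A rough sketch of that write path, assuming the response shape of GET /_aliases (a map from index name to its aliases) and illustrative index names:

```ruby
require 'json'

# Resolve which concrete indices currently sit behind the `projects`
# and `projects_new` aliases, given a parsed GET /_aliases response.
def indices_for(aliases_response, wanted = %w[projects projects_new])
  aliases_response.select { |_index, meta|
    (meta['aliases'].keys & wanted).any?
  }.keys
end

# Build one bulk request that writes the changed record to every
# resolved index directly, so both old and new stay in sync.
def bulk_update_body(indices, record)
  indices.flat_map { |idx|
    [
      { index: { _index: idx, _type: 'project', _id: record[:id] } }.to_json,
      record.reject { |k, _| k == :id }.to_json
    ]
  }.join("\n") + "\n"
end

aliases_response = {
  'projects_2013_05_18_02_00_00' => { 'aliases' => { 'projects'     => {} } },
  'projects_2013_05_19_13_33_27' => { 'aliases' => { 'projects_new' => {} } }
}
indices = indices_for(aliases_response)
puts bulk_update_body(indices, id: 7, name: 'Updated name', goal: 1000.0)
```

When no full index is running, only one index resolves and the changed project is written just there; during a full index, the same bulk request covers both.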

The nice thing about our setup is that performing a full index has no user impact. The existing index is kept up to date, and the new index is only switched once it's completely ready.

Comments

      Creator Jeremy Taylor on August 9, 2013

      Awesome stuff! Thanks for posting this!