A Bug-Free, Downtime-Free, Major-Version Upgrade of Elasticsearch

Some parts of your software stack can be tricky to upgrade. In our case, we had last upgraded Elasticsearch to 0.9 over two years ago. Since then, that version had become unsupported and had a CVE announced that affected developer machines, and our Java 6 runtime had accumulated several CVEs of its own. On top of all that, search is a complicated feature and difficult to test.

The Experiment

We decided to bite the bullet. But what was the upgrade path? We approached the upgrade as an experiment, with the following hypotheses:

  • ES 1.7 searches would be faster and more stable/reliable than on ES 0.9
  • A Java 8 runtime would also give us a performance boost over Java 6

As part of our philosophy of continuous delivery, we also required there be zero downtime during the switch.

The Method

  • Launch a new ES 1.7 cluster with the same settings and number of nodes
  • Index data into both 0.9 and 1.7 clusters
  • Switch our search features to 1.7, one by one
  • Test our hypotheses by comparing response times and mismatches, using GitHub's scientist gem

The scientist gem calls itself a "Ruby library for carefully refactoring critical paths." It's similar to feature flags, but adds metrics and can run multiple code paths in the same context.

An experiment with scientist for our ES upgrade looked like this:


def response
  experiment = EsSearch::UpgradeExperiment.new name: "es-search-upgrade-faqs"

  # Control: ES 0.9
  experiment.use { request(...) }

  # Candidate: ES 1.7
  experiment.try { with_new_elasticsearch_client { request(...) } }

  # Store search term from mismatches in Redis  
  experiment.context(search_term: @search_term)

  # Clean the mismatched results that we store in Redis
  experiment.clean { |results| extract_ids_from_results(results) }

  # Tell scientist how to compare the results
  experiment.compare do |control, candidate|
    extract_ids_from_results(control) == extract_ids_from_results(candidate)
  end

  experiment.run
end

The Results

We were able to switch some search features over to ES 1.7 very quickly. Our FAQ search is infrequent, but the experiment results were enough to show that ES 1.7 was slightly slower on average:

[Chart: Response times in milliseconds, candidate minus control (negative is better)]

But on the bright side, we didn't see any mismatches between ES 0.9 and ES 1.7 results!

We found more issues with other features, such as our project search tool. Performance was often slightly better:

[Chart: Response times in milliseconds, candidate minus control (negative is better)]

But we saw mismatches in about 15% of the results:

[Chart: Number of mismatched results in ES 1.7 against ES 0.9]

As it turned out, when we looked into the mismatches, both clusters had returned the same documents, just occasionally in a slightly different order! The change in sorting was an acceptable difference for us.
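One way to express that tolerance is to compare sorted document IDs instead of ordered lists. The sketch below is illustrative only, not the code we ran: `ids_from`, `same_documents?`, and the sample responses are hypothetical, and it assumes results shaped like a standard Elasticsearch search response (`hits.hits` entries with an `_id`).

```ruby
# Hypothetical helper: pull document IDs out of a search response hash.
def ids_from(results)
  results.fetch("hits").fetch("hits").map { |hit| hit.fetch("_id") }
end

# Order-insensitive comparison: the same documents count as a match,
# whatever order they come back in.
def same_documents?(control, candidate)
  ids_from(control).sort == ids_from(candidate).sort
end

# Made-up responses for illustration only.
control   = { "hits" => { "hits" => [{ "_id" => "1" }, { "_id" => "2" }, { "_id" => "3" }] } }
candidate = { "hits" => { "hits" => [{ "_id" => "2" }, { "_id" => "1" }, { "_id" => "3" }] } }

puts same_documents?(control, candidate)
```

A block like `same_documents?` could be handed to `experiment.compare` in place of the strict equality check, at the cost of no longer noticing ordering changes at all.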

Another search feature's experiment mysteriously showed occasional mismatches. After investigating, we found that it stemmed from some missing documents in the ES 1.7 cluster. These documents had been rejected during our bulk indexing because of a limit on the bulk index threadpool size in ES 1.7. Ironically, that limit had been added just one patch version above the old ES 0.9 version we were running. :D
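Rejections like these are reported per item inside the bulk response rather than as a failed request, which makes them easy to miss. Here is a rough sketch of checking for them. It assumes the bulk response has already been parsed into a Hash with an `items` array, where each entry is keyed by its action and carries a `status` (and an `error` on failure). The `failed_bulk_items` helper and the sample response are hypothetical and made up for illustration.

```ruby
# Hypothetical helper: return the bulk items that were not indexed.
def failed_bulk_items(bulk_response)
  # Each entry looks like { "index" => { "_id" => ..., "status" => ... } },
  # so unwrap the action key before inspecting the item.
  bulk_response.fetch("items", []).flat_map(&:values).select do |item|
    item.key?("error") || item.fetch("status", 200) >= 400
  end
end

# Made-up response for illustration only.
sample_response = {
  "items" => [
    { "index" => { "_id" => "1", "status" => 201 } },
    { "index" => { "_id" => "2", "status" => 429, "error" => "rejected execution" } }
  ]
}

failed_bulk_items(sample_response).each do |item|
  warn "document #{item["_id"]} was not indexed (status #{item["status"]})"
end
```

Logging or retrying whatever this returns after every bulk request would surface missing documents at indexing time instead of through search mismatches.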

Lessons Learned

After we completely switched our search features over to ES 1.7, we found that our two hypotheses were wrong: ES 1.7 running on Java 8 didn't perform better than ES 0.9 on Java 6. The difference was marginal though, so being on the latest supported version was worth the upgrade. 

If we use the scientist gem again in the future, it’ll probably be with a smaller set of changes, since correctly analyzing the results of an experiment can take time. If you need to do something similar, this gem is worth checking out. We're very happy that this upgrade was done with no disruption and we're now on a current version of Elasticsearch.

Our SQL Style Guide

From beginners working towards their first commits to experts trying to ease into a new codebase, style guides represent valuable investments in helping your team work together.

Since much of our Data Team's day-to-day work involves querying Redshift using SQL, we've put time into refining a query style guide. Many of the recommendations in our guide have been unapologetically lifted from guides we've encountered at past jobs, but many also stem from things we've discovered while collaborating on thousands of queries.

Here's a sample on how to format SELECT clauses:

SELECT

Align all columns to the first column on their own line:


SELECT
  projects.name,
  users.email,
  projects.country,
  COUNT(backings.id) AS backings_count
FROM ...

We've got other sections on FROM, JOIN, WHERE, CASE, and how to write well-formatted Common Table Expressions.

Check out the full guide here.

This is the story of analytics at Kickstarter

If you’ve built a product of any size, chances are you’ve evaluated and deployed at least one analytics service. We have too, and that is why we wanted to share with you the story of analytics at Kickstarter. From Google Analytics, to Mixpanel, to our own infrastructure, this post will detail the decisions we’ve made (technical and otherwise) and the path we’ve taken over the last 6 years. It will culminate with a survey of our current custom analytics stack that we’ve built on top of AWS Kinesis, Redshift, and Looker. 

Early Days 

In late 2009, the early days of Kickstarter, one of the first services we used was Google Analytics. We were small enough that we weren’t going to hit any data caps, it was free, and the limitations of researching user behavior by analyzing page views weren’t yet clear to us.

But users play videos. Their browsers send multiple asynchronous JavaScript requests related to one action. They trigger back-end events that aren’t easily tracked in JavaScript. So to get the best possible understanding of user behavior on Kickstarter, we knew we would have to go deeper and look beyond which URLs were requested.

While GA provided some basic tools for tracking events, the amount of metadata about an event (e.g., properties like a project name or category) that we could attach was limited, and the GA Measurement Protocol didn’t exist yet, so we couldn’t send events from outside the browser.

Finally, the GA UI became increasingly sluggish as it struggled to cope with our growing traffic, and soon our data was being aggressively sampled, resulting in reports based on extrapolated trends. This was particularly problematic for reports that had dimensions with many unique values (i.e., high cardinality), which effectively prevented us from analyzing specific trends in a fine-grained way. For example, we’d frequently run into the dreaded (other) row in GA reports: this meant that there was a long tail of data which GA sampling could detect but couldn’t report on. Without knowing a particular URL to investigate, GA prevented us from truly exploring our data and diving deep.

Enter Mixpanel

In early 2012, we heard word of a service called Mixpanel. Instead of tracking page views, Mixpanel was designed to track individual events. While this required manually instrumenting those events (effectively whitelisting which behavior we wanted to track), this approach was touted as being particularly useful for mobile devices where the page view metaphor made even less sense.

Mixpanel’s event-driven model provided a solution to the problems we were encountering with Google’s page views: we could track video plays, signups, password changes, etc., and those events could be aggregated and split in exactly the same way page views could be.

Even better, we wouldn’t have to wait 24-48 hours to analyze the data and access all our reports — Mixpanel would deliver data in real time to their polished web UI. They also allowed us to use an API to export the raw data in bulk every night, which was a huge selling point when deciding to invest in the service.

In May of that year we deployed Mixpanel, and focused on instrumenting our flow from project page to checkout. This enabled us for the first time to calculate such things as conversion rates across project categories, but also to tie individual events to particular projects, so we could spot trends and accurately correlate them with particular subsets of users or projects.

Pax Mixpanela

For many years, Mixpanel served us incredibly well. The data team, engineers, product managers, designers, members of our community support and integrity teams, and even our CEO used it daily to dive deep on trends and analyze product features and engagement.

As our desire to better analyze the increasing volume of data we were sending the service grew, we found their bulk export API to be invaluable – we built a data pipeline to ingest our Mixpanel events into a Redshift cluster. We were subsequently able to conduct even finer-grained analysis using SQL and R.

The flexibility of Mixpanel’s event model also allowed us to build our own custom A/B testing framework without much additional overhead. By using event properties to send experiment names and variant identifiers, we didn’t have to create new events for A/B tests. We could choose to investigate which behaviors a test might affect after the fact, without having to hardcode what a conversion “was” into the test beforehand. This overcame a frequent limitation of other A/B testing frameworks that we had evaluated.
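To make the idea concrete, here is a rough sketch of carrying experiment data as properties on an ordinary event. Everything in it is an assumption for illustration: the property names, the `variant_for` bucketing scheme, and the event shape are hypothetical, and none of it is our framework's code or Mixpanel's API.

```ruby
require "digest"

VARIANTS = %w[control experiment].freeze

# Hypothetical bucketing: hash the experiment name and user ID so that a
# user always lands in the same variant of a given experiment.
def variant_for(user_id, experiment_name)
  digest = Digest::MD5.hexdigest("#{experiment_name}:#{user_id}")
  VARIANTS[digest.to_i(16) % VARIANTS.size]
end

# Attach experiment data to an existing event as extra properties,
# without creating a new event type for the test.
def with_experiment(event, user_id, experiment_name)
  properties = event.fetch(:properties, {}).merge(
    experiment_name: experiment_name,
    experiment_variant: variant_for(user_id, experiment_name)
  )
  event.merge(properties: properties)
end

event = { name: "Video Played", properties: { project_category: "Games" } }
p with_experiment(event, 42, "new-project-page")
```

Because the variant rides along on every event, any event can later be split by `experiment_variant`, which is what makes it possible to pick the conversion metric after the test has run.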

Build vs. Buy

As Kickstarter grew, we wanted more and more from our event data. Mixpanel’s real-time dashboards were nice, but programmatically accessing the raw data in real time was impossible. Additionally, we wanted to send more data to Mixpanel without worrying about a ballooning monthly bill.

By 2014, granular event data became mission-critical for Kickstarter’s day-to-day existence. Whereas previously event level data was considered a nice-to-have complement to the transactional data generated by our application database, we began depending on it for analyzing product launches, supplying stats for board meetings, and for other essential projects.

At this point we started reconsidering the Build vs. Buy tradeoff. Mixpanel had provided incredible value by allowing us to get a first-class analytics service running overnight, but it was time to do the hard work of moving things in-house.

A Way Forward

As we loaded more and more data into our cluster thanks to Mixpanel’s export API, Redshift had become our go-to tool for our serious analytics work. We had invested significant time and effort into building and maintaining our data warehouse – we were shoving as much data as we possibly could into it and had many analysts and data scientists using it full time. Redshift itself had barely broken a sweat, so it felt natural to use it to anchor our in-house analytics.

With Redshift as our starting point, we had to figure out how to get data into it in close-to-real-time. We have a modest volume of data – tens of millions of events a day – but our events are rich, and ever-changing. We had to make sure that engineers, product managers, and analysts had the freedom to add new events and add or change properties on existing events, all while getting feedback in real time.

Since the majority of our analytics needs are ad-hoc, reaching for a streaming framework like Storm didn’t make sense. However, using some kind of streaming infrastructure would let us get access to our data in real time. For all of the reasons that distributed logs are awesome, we ended up building around AWS Kinesis, Kafka’s hosted cousin.

Our current stack ingests events through an HTTPS Collector and sends them to a Kinesis stream. Streams act as our source of truth for event data, and are continuously written to S3. As data arrives in S3, we use SQS to notify services that transcode the data and load it into Redshift. It takes seconds to see an event appear in a Kinesis stream, and under 10 minutes to see it appear in Redshift.

Here’s a rough sketch:

This architecture has helped us realize our goal of real-time access to our data. Having event data in Kinesis means that any analyst or engineer can get at a real-time feed of their data programmatically or visually inspect it with a command-line tool we whipped up.

Looker

While work began on our backbone infrastructure, we also began seriously investigating Looker as a tool to enable even greater data access across Kickstarter. Looker is a business intelligence tool that was appealing to us because it allows people across the company to query data, create visualizations, and combine them into dashboards.

Once we got comfortable with Looker, it dawned on us that we could use it to replicate much of Mixpanel’s reporting functionality. Looker’s DSL for building dashboards, called LookML, and their templated filters provided a powerful way to make virtually any dashboard imaginable.

This made it just as easy to access our data in Looker as it was in Mixpanel: anyone can still pull and visualize data without having to understand SQL or R.

As we became more advanced in our Looker development we were able to build dashboards similar to Mixpanel's event segmentation report:

Most significantly, we were able to take advantage of Kickstarter-specific knowledge and practices to create even more complex dashboards. One that we’re most proud of is a dashboard that visualizes the results of A/B tests:

The Future

Owning your own analytics infrastructure isn’t merely about replicating services you’re already comfortable with. It is about opening up a field of opportunities for new products and insights beyond your team’s current roadmap and imagination.

Replacing a best-in-class service like Mixpanel isn’t for the faint of heart, and requires serious engineering, staffing, and infrastructure investments. However, given the maturity and scale of our application and community, the benefits were clear and worth it.

If this post was helpful to you, or you’ve built something similar, let us know!

The Kickstarter Engineering and Data Team Ladder

Over the last year, we've doubled the size of the Engineering and Data teams at Kickstarter. Prior to that growth, our teams’ structure was very flat, and titles were very generic. Now that we've got folks with differing levels of skill and experience, we need a structure to help us organize ourselves. We decided we should build an engineering ladder and define roles across the teams.

Deciding to design and implement an engineering ladder can be tricky. It needs to start right, stay flexible as we evolve, and scale as we grow, and the process needs to be as consultative and inclusive as possible. Thankfully, earlier in the year, Camille Fournier, then CTO at Rent the Runway, shared her team's Engineering Ladder. It was enormously influential in guiding our thinking around how Engineering should be leveled and structured. (We should also thank Harry Heymann, Jason Liszka, and Andrew Hogue from Foursquare, who inspired Rent the Runway in the first place.)

We took the material and ideas we found in Fournier’s work and modified them to suit our requirements. We then shared the document with the team and asked for feedback and review. After lots of discussion and editing, we ended up with roles that people understood and were excited to grow into. We've now deployed the roles — and in the spirit of giving back to the community that inspired us to do this work, we wanted to share the ladder we created.


Technical                   Data             People
Junior Software Engineer    -                -
Software Engineer           Data Analyst     -
Senior Software Engineer    Data Scientist   Engineering Manager
Staff Engineer              VP of Data       Engineering Director
Principal Engineer          -                CTO

You can see the full details here.

If you’re in the process of thinking through how you organize your team, we hope this can be of some help. And if you use this as a starting point for building your own ladder, and tailoring it to your own needs, we’d love to hear about it!

Kickstarter Data-Driven University

The Kickstarter Data team’s mission is to support our community, staff, and organization with data-driven research and infrastructure. Core to that, we’ve made it our goal to cultivate increased data literacy throughout the entire company. Whether it’s knowing when to use a line chart or a bar plot, or explaining why correlation does not equal causation, we strongly believe that basic data skills benefit everyone: designers and engineers, product managers and community support staff, all the way up to our senior team.

During my time working at LinkedIn on their Insights team, our leadership helped establish a program called Data-Driven University (DDU). DDU was a two-day bootcamp of best practices on working with data: tips on how to communicate effectively using data, how to use data to crack business problems, and how to match a visualization with the right story to tell. It was a transformative experience for me; I witnessed leaders of some of the largest business units discover techniques to help their teams make better decisions with data.

When I joined Kickstarter’s Data team last year, I saw an opportunity to use the same approach with our own staff. Our intention was to create a series of courses that was open to everyone, not just a select few; hence, Kickstarter Data-Driven University (KDDU) was born.

First, we surveyed the company on a number of voluntary data-related sessions taught by our team. Analyzing the themes in our survey response data led us to settle on offering three sessions: Data Skepticism (how to think critically using data), Data Visualization (how to effectively present data visually), and Data Storytelling (how to communicate compellingly with data).

After several weeks of prep work, we held four classes (including an additional workshop on conducting A/B tests). The results were encouraging: more than 50% of the company attended at least one class, and our final Net Promoter Score was 73 (taken by survey after KDDU wrapped up), on par with the Apple MacBook. Not bad! We also heard positive feedback directly from our staff, such as the following:


“Broke down complicated terms/jargon and offered real-use cases to help the audience better grasp how data is analyzed/presented.”
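
For reference, a Net Promoter Score is the percentage of promoters (scores of 9 or 10) minus the percentage of detractors (scores of 0 to 6) on a 0-10 "how likely are you to recommend this" question. Here is a small sketch of the arithmetic, using made-up responses rather than our survey data:

```ruby
# Net Promoter Score: % promoters (9-10) minus % detractors (0-6).
# Scores of 7 and 8 are passives; they only count toward the total.
def net_promoter_score(scores)
  raise ArgumentError, "no responses" if scores.empty?

  promoters  = scores.count { |s| s >= 9 }
  detractors = scores.count { |s| s <= 6 }
  ((promoters - detractors) * 100.0 / scores.size).round
end

# Made-up responses for illustration only:
# 7 promoters and 1 detractor out of 10 responses.
responses = [10, 9, 9, 10, 8, 7, 9, 10, 6, 9]
puts net_promoter_score(responses) # prints 60
```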


The Data team had such a good time presenting KDDU internally that we volunteered to give the seminar two more times. So in July, we partnered with New York Tech Talent Pipeline (NYTTP) for their Beyond Coding program and gave a slightly modified version of KDDU to their new grads and students looking to build skills before entering the workforce.

Today, we’re making those slides available for you to leverage with your own teams to help increase data skills and literacy:

Here are some of our takeaways from teaching data skills to our colleagues:

Keep it simple

We could have talked about our favorite Data team subjects: our infrastructure, the nuances of Postgres 8.0.2, or our favorite R packages … but we knew we had to keep data approachable for a broader audience. We decided to focus on giving our audience a set of simple rules and principles that would help them work with data more effectively in their day-to-day.

Know your audience

We sent out a brief survey to see what topics our coworkers wanted to learn about most. This both made it easier to decide on which topics to present, and also meant we knew the topics we chose would be interesting to our audience.

Within the individual presentations we focused on selecting examples that would resonate with our audience, highlighting trends from actual Kickstarter data, insights into past A/B tests we’ve run, and other familiar and relevant stats.

Always be measuring

As an old boss used to say, if you can’t measure it, you can’t manage it. So after we completed KDDU, we sent out a second brief survey, this one to collect feedback on the overall selection of courses and the individual lessons. This data has helped refine our approach for a second round of KDDU sessions that we’re considering offering as our company grows.

We couldn’t be more excited to share our experience with you, and hope you find it valuable to increasing data-driven decision-making and skills at your organization!

Introducing mail-x_smtpapi: a structured email header for Ruby

At Kickstarter we use SendGrid to deliver transactional and campaign emails, and use SendGrid's X-SMTPAPI header for advanced features like batch deliveries and email event notifications. Developing the first of these features went well — but the second and third features became entangled when we tried to share an unstructured hash that was unceremoniously encoded into an email header at a less-than-ideal time.

Custom Mail Header

Our solution was to add first-class header support to Ruby's Mail gem. This gave us a structured value object that we could write to from any location with access to the mail object, allowing our mailing infrastructure to remain focused and decoupled.

Today we’re announcing our open source Mail extension, appropriately titled mail-x_smtpapi. With this gem you can write to a structured mail.smtpapi value object from anywhere in your mailing pipeline, including a Rails mailer, template helper, or custom mail interceptor.

Example

Here's a basic example from the gem's README to get you started. This variation on an example from the Rails Guide gives you extra detail in SendGrid's email event notifications:


class UserMailer < ActionMailer::Base

  def welcome(user)
    @user = user

    mail.smtpapi.category = 'welcome'
    mail.smtpapi.unique_args['user_id'] = @user.id

    mail(to: @user.email, subject: 'Welcome to My Awesome Site')
  end

end
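
SendGrid reads the X-SMTPAPI header as a JSON object, so what matters is that everything written along the way ends up in one well-formed value at delivery time. To illustrate why a structured value object helps with that, here is a simplified stand-in. It is not the gem's implementation: `SmtpapiValue` and `to_header` are hypothetical names, and only the `category` and `unique_args` fields from the example above are modeled.

```ruby
require "json"

# Hypothetical stand-in for a structured header value; not the gem's code.
class SmtpapiValue
  attr_accessor :category
  attr_reader :unique_args

  def initialize
    @unique_args = {}
  end

  # Serialize only the fields that were set, once, at delivery time.
  def to_header
    data = {}
    data["category"] = category if category
    data["unique_args"] = unique_args unless unique_args.empty?
    JSON.generate(data)
  end
end

value = SmtpapiValue.new
value.category = "welcome"          # written by the mailer
value.unique_args["user_id"] = 123  # written later, e.g. by an interceptor
puts value.to_header
```

Because each writer touches only its own field and encoding happens once at the end, separate features no longer need to share and re-encode a single unstructured hash.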

Enjoy

We hope you find this as useful as we did, or find inspiration here to develop header classes for your own custom uses. As always, we love feedback, especially in the form of pull requests or bug reports.

If you take delight in discovering simple solutions to stubborn code, why not browse our jobs page? We're hiring!

Introducing cfn-flow: a practical workflow for AWS CloudFormation

If you’re looking for a simple, reliable way to develop with AWS CloudFormation, check out cfn-flow on GitHub.

As an Ops Engineer, I’m always seeking better ways to manage Kickstarter’s server infrastructure. It can never be too easy, secure, or resilient.

I’ve been excited about AWS CloudFormation as a way to make our infrastructure provisioning simpler and replicable. Some recent greenfield projects provided a great opportunity to try it out.

We quickly found we wanted tooling to consistently launch and manage CloudFormation stacks. And each project presented the same workflow decisions, like how to organize resources in templates, where to store templates, and when to update existing stacks or launch new ones.

I built cfn-flow to reflect Kickstarter’s best practices for using CloudFormation and give developers a consistent, productive deploy process. Two especially helpful constraints of the workflow are worth highlighting:

Red/black deploys

cfn-flow embraces the red/black deployment pattern to gracefully switch between two immutable application versions. For each deployment, we launch a new CloudFormation stack then delete the old stack once we’ve verified that the new one works well. This is preferable to modifying long-running stacks because rollbacks are trivial (just delete the new stack), and deployment errors won’t leave stacks in unpredictable states. 
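
The control flow of a red/black deploy can be sketched in a few lines. This is not cfn-flow's implementation or its command-line interface: `red_black_deploy` and the `launch`, `healthy`, and `delete` callables are hypothetical stand-ins for whatever creates, verifies, and removes a stack.

```ruby
# Hypothetical red/black deploy. The callables are injected so the flow
# can be exercised without touching AWS.
def red_black_deploy(old_stack:, launch:, healthy:, delete:)
  new_stack = launch.call
  if healthy.call(new_stack)
    delete.call(old_stack) if old_stack  # cut over: retire the old version
    new_stack
  else
    delete.call(new_stack)               # roll back: old stack is untouched
    old_stack
  end
end

deleted = []
live = red_black_deploy(
  old_stack: "app-v1",
  launch:  -> { "app-v2" },
  healthy: ->(stack) { stack == "app-v2" },
  delete:  ->(stack) { deleted << stack }
)
puts "live: #{live}, deleted: #{deleted.inspect}"
```

Note that the old stack is never modified in either branch, which is why a failed deploy cannot leave it in a half-updated state.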

Separate ephemeral resources from backing resources

Since deployments launch and delete stacks, templates can only include ephemeral resources that can safely be destroyed. For our apps, that usually means a LaunchConfig, an AutoScalingGroup, and, optionally, an ELB with a Route53 weighted DNS record and an InstanceProfile. 

Resources that are part of your service that do not change in each deployment are considered backing resources. These include RDS databases, security groups that let both new and old EC2 servers communicate, SQS queues, etc. We extract backing resources to a separate template that’s deployed less frequently. Backing resources are then passed as parameters to our app stack via our cfn-flow.yml configuration.

cfn-flow is a command-line tool distributed as a RubyGem. You track your CloudFormation templates in the same directory as your application code, and use the cfn-flow.yml configuration file to tie it all together. Check out the cfn-flow README for details and examples.

We’ve been using it for a few months with great success. It gives developers good, easy affordances to build robust services in AWS.

I encourage anyone else interested in CloudFormation to give cfn-flow a try. If it’s not making your job easier, please file a GitHub Issue.

Introducing Telekinesis: A Kinesis Client for JRuby

Kickstarter exists to help make it easier for people to create new things. And when it comes to code, there’s one very simple way to help others create — by sharing the things we’ve already built. That’s why, over the past month, we’ve been open-sourcing a new library each week. Today’s is called Telekinesis, and, well … we’ll let Ben explain it.

At Kickstarter we use a variety of AWS services to help us build infrastructure quickly, easily, and affordably. Last winter, we started experimenting with Kinesis, Amazon’s hosted Kafka equivalent, as the backbone for some of our data infrastructure. After deciding that we needed a distributed log, we settled on using Kinesis based on cost and ease of operation.

Kickstarter is all about Ruby, so it made sense for us to do our prototyping in Ruby. Since the Kinesis Client Library (KCL) is primarily built for Java, we quickly decided that building on top of JRuby was our best option. We already have some Java expertise in-house, so we also knew that running and deploying the JVM would be relatively straightforward. It’s been going so well that we haven’t looked back — despite Amazon’s announcement that they officially support Ruby through the Multilang Daemon.

As part of open source month, we’re releasing Telekinesis, the library we’ve built up around the KCL. It includes some helpers to make using a Consumer from Ruby a little more idiomatic.


require 'telekinesis/consumer'

class MyProcessor
  def init(shard_id)
    # Remember the shard so the log messages here and in shutdown can name it
    @shard_id = shard_id
    $stderr.puts "Started processing #{@shard_id}"
  end

  def process_records(records, checkpointer)
    records.each do |r|
      puts String.from_java_bytes(r.data.array)
    end
  end

  def shutdown
    $stderr.puts "Shutting down #{@shard_id}"
  end
end

Telekinesis::Consumer::DistributedConsumer.new(stream: 'a_stream', app: 'my_app') do
  MyProcessor.new
end

It also includes a multi-threaded producer that we’ve been using in production for a couple of months. Head on over to GitHub for a closer look.

Looking for more of our tools? Just poke around Backing & Hacking, or see our open-source projects on GitHub! And if you're excited by what you see, you might be even more excited to know that we're hiring...