If you’ve built a product of any size, chances are you’ve evaluated and deployed at least one analytics service. We have too, and that is why we wanted to share with you the story of analytics at Kickstarter.
From Google Analytics, to Mixpanel, to our own infrastructure, this post details the decisions we’ve made (technical and otherwise) and the path we’ve taken over the last six years. It culminates in a survey of the custom analytics stack we’ve built on top of AWS Kinesis, Redshift, and Looker.
In late 2009, the early days of Kickstarter, one of the first services we used was Google Analytics. We were small enough that we weren’t going to hit any data caps, it was free, and the limitations of researching user behavior by analyzing page views weren’t yet clear to us.
While GA provided some basic tools for tracking events, the amount of metadata (i.e., properties such as a project name or category) that we could attach to an event was limited. The GA Measurement Protocol also didn’t exist yet, so we couldn’t send events from outside the browser.
Finally, the GA UI became increasingly sluggish as it struggled to cope with our growing traffic, and soon our data was being aggressively sampled, resulting in reports based on extrapolated trends. This was particularly problematic for reports that had dimensions with many unique values (i.e., high cardinality), which effectively prevented us from analyzing specific trends in a fine-grained way. For example, we’d frequently run into the dreaded (other) row in GA reports: this meant that there was a long tail of data which GA sampling could detect but couldn’t report on. Without knowing a particular URL to investigate, GA prevented us from truly exploring our data and diving deep.
In early 2012, we heard word of a service called Mixpanel. Instead of tracking page views, Mixpanel was designed to track individual events. While this required manually instrumenting those events (effectively whitelisting which behavior we wanted to track), this approach was touted as being particularly useful for mobile devices where the page view metaphor made even less sense.
Mixpanel’s event-driven model provided a solution to the problems we were encountering with Google’s page views: we could track video plays, signups, password changes, etc., and those events could be aggregated and split in exactly the same way page views could be.
Even better, we wouldn’t have to wait 24-48 hours to analyze the data and access all our reports — Mixpanel would deliver data in real time to their polished web UI. They also allowed us to use an API to export the raw data in bulk every night, which was a huge selling point when deciding to invest in the service.
In May of that year we deployed Mixpanel and focused on instrumenting our flow from project page to checkout. For the first time, this enabled us not only to calculate things like conversion rates across project categories, but also to tie individual events to particular projects, so we could spot trends and accurately correlate them with particular subsets of users or projects.
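To make the event model concrete, here is a minimal Python sketch of splitting one event by a property. It is an illustration only, not Mixpanel’s API: the event names, the property names, and the `segment` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical events: a name plus free-form properties.
events = [
    {"event": "Video Play", "properties": {"category": "Games", "platform": "web"}},
    {"event": "Video Play", "properties": {"category": "Film", "platform": "ios"}},
    {"event": "Signup", "properties": {"platform": "web"}},
    {"event": "Video Play", "properties": {"category": "Games", "platform": "ios"}},
]


def segment(events, event_name, prop):
    """Count occurrences of one event, split by the value of one property."""
    counts = Counter()
    for e in events:
        if e["event"] == event_name:
            # Events that lack the property are grouped under a placeholder.
            counts[e["properties"].get(prop, "(not set)")] += 1
    return dict(counts)


by_category = segment(events, "Video Play", "category")
by_platform = segment(events, "Video Play", "platform")
```

Because every event carries its own properties, the same events can be split by category, platform, or any other property without changing how they were recorded.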
For many years, Mixpanel served us incredibly well. The data team, engineers, product managers, designers, members of our community support and integrity teams, and even our CEO used it daily to dive deep on trends and analyze product features and engagement.
As the volume of data we were sending the service grew, so did our desire to analyze it more deeply, and we found the bulk export API invaluable: we built a data pipeline to ingest our Mixpanel events into a Redshift cluster. That let us conduct even finer-grained analysis using SQL and R.
The flexibility of Mixpanel’s event model also allowed us to build our own custom A/B testing framework without much additional overhead. By using event properties to send experiment names and variant identifiers, we didn’t have to create new events for A/B tests. We could choose to investigate which behaviors a test might affect after the fact, without having to hardcode what a conversion “was” into the test beforehand. This overcame a frequent limitation of other A/B testing frameworks that we had evaluated.
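The idea of choosing the conversion after the fact can be sketched in a few lines of Python. This is a simplified illustration under assumed names, not the actual framework: the `experiment` and `variant` property names, the event names, and `conversion_by_variant` are hypothetical.

```python
from collections import defaultdict

# Hypothetical events: ordinary events tagged with experiment properties.
tag_control = {"experiment": "new_cta", "variant": "control"}
tag_b = {"experiment": "new_cta", "variant": "variant_b"}
ab_events = [
    {"event": "Project Page View", "user_id": 1, "properties": tag_control},
    {"event": "Pledge", "user_id": 1, "properties": tag_control},
    {"event": "Project Page View", "user_id": 2, "properties": tag_control},
    {"event": "Project Page View", "user_id": 3, "properties": tag_b},
    {"event": "Pledge", "user_id": 3, "properties": tag_b},
    {"event": "Project Page View", "user_id": 4, "properties": tag_b},
    {"event": "Video Play", "user_id": 4, "properties": tag_b},
    {"event": "Pledge", "user_id": 4, "properties": tag_b},
]


def conversion_by_variant(events, experiment, conversion_event):
    """Share of exposed users per variant who sent `conversion_event`."""
    exposed = defaultdict(set)    # variant -> users seen in the experiment
    converted = defaultdict(set)  # variant -> users who sent the conversion event
    for e in events:
        props = e["properties"]
        if props.get("experiment") != experiment:
            continue
        exposed[props["variant"]].add(e["user_id"])
        if e["event"] == conversion_event:
            converted[props["variant"]].add(e["user_id"])
    return {v: len(converted[v]) / len(users) for v, users in exposed.items()}


# The conversion event is chosen at analysis time, not when the test ships.
pledge_rates = conversion_by_variant(ab_events, "new_cta", "Pledge")
video_rates = conversion_by_variant(ab_events, "new_cta", "Video Play")
```

The same tagged events answer both questions. Nothing about the test itself has to name "Pledge" or "Video Play" as the goal up front.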
Build vs. Buy
As Kickstarter grew, we wanted more and more from our event data. Mixpanel’s real-time dashboards were nice, but programmatically accessing the raw data in real time was impossible. Additionally, we wanted to send more data to Mixpanel without worrying about a ballooning monthly bill.
By 2014, granular event data had become mission-critical for Kickstarter’s day-to-day existence. Whereas event-level data had previously been considered a nice-to-have complement to the transactional data generated by our application database, we began depending on it for analyzing product launches, supplying stats for board meetings, and other essential projects.
At this point we started reconsidering the Build vs. Buy tradeoff. Mixpanel had provided incredible value by allowing us to get a first-class analytics service running overnight, but it was time to do the hard work of moving things in-house.
A Way Forward
As we loaded more and more data into our cluster thanks to Mixpanel’s export API, Redshift had become our go-to tool for our serious analytics work. We had invested significant time and effort into building and maintaining our data warehouse – we were shoving as much data as we possibly could into it and had many analysts and data scientists using it full time. Redshift itself had barely broken a sweat, so it felt natural to use it to anchor our in-house analytics.
With Redshift as our starting point, we had to figure out how to get data into it in close-to-real-time. We have a modest volume of data – tens of millions of events a day – but our events are rich, and ever-changing. We had to make sure that engineers, product managers, and analysts had the freedom to add new events and add or change properties on existing events, all while getting feedback in real time.
Since the majority of our analytics needs are ad-hoc, reaching for a streaming framework like Storm didn’t make sense. However, using some kind of streaming infrastructure would let us get access to our data in real time. For all of the reasons that distributed logs are awesome, we ended up building around AWS Kinesis, Kafka’s hosted cousin.
Our current stack ingests events through an HTTPS Collector and sends them to a Kinesis stream. Streams act as our source of truth for event data, and are continuously written to S3. As data arrives in S3, we use SQS to notify services that transcode the data and load it into Redshift. It takes seconds to see an event appear in a Kinesis stream, and under 10 minutes to see it appear in Redshift.
Here’s a rough sketch:
This architecture has helped us realize our goal of real-time access to our data. Having event data in Kinesis means that any analyst or engineer can get at a real-time feed of their data programmatically or visually inspect it with a command-line tool we whipped up.
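The two ends of that flow (a collector writing to a stream, and a transcoder preparing rows for the warehouse) might look roughly like the Python sketch below. It is a simplification under stated assumptions, not the production code: the S3 and SQS steps are omitted, `InMemoryStream` stands in for a real Kinesis client, and the stream name, payload fields, and the choice to keep properties as a single JSON column are all hypothetical.

```python
import json
import time
import uuid


class InMemoryStream:
    """Stand-in for a Kinesis client so the sketch runs without AWS access."""

    def __init__(self):
        self.records = []

    def put_record(self, StreamName, Data, PartitionKey):
        self.records.append(
            {"StreamName": StreamName, "Data": Data, "PartitionKey": PartitionKey}
        )


def collect(client, stream_name, event_name, properties, distinct_id):
    """Serialize one event and append it to the stream."""
    payload = {
        "event_id": str(uuid.uuid4()),
        "event": event_name,
        "distinct_id": distinct_id,
        "received_at": int(time.time()),
        "properties": properties,
    }
    client.put_record(
        StreamName=stream_name,
        Data=json.dumps(payload).encode("utf-8"),
        PartitionKey=str(distinct_id),
    )
    return payload


def transcode(raw_records, columns):
    """Turn raw JSON records into rows with a fixed set of leading columns.

    Assumption: properties are kept as one JSON string per row, so a new
    property on an event needs no schema change in the warehouse.
    """
    rows = []
    for raw in raw_records:
        event = json.loads(raw)
        row = [event.get(c) for c in columns]
        row.append(json.dumps(event.get("properties", {}), sort_keys=True))
        rows.append(row)
    return rows


stream = InMemoryStream()
collect(stream, "events", "Video Play", {"category": "Games"}, distinct_id=42)
rows = transcode([r["Data"] for r in stream.records], ["event", "distinct_id"])
```

The stand-in mirrors the keyword arguments of boto3’s Kinesis `put_record` call (`StreamName`, `Data`, `PartitionKey`), so a real client could be passed to `collect` in its place.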
While work began on our backbone infrastructure, we also began seriously investigating Looker as a tool to enable even greater data access across Kickstarter. Looker is a business intelligence tool that was appealing to us because it allows people across the company to query data, create visualizations, and combine them into dashboards.
Once we got comfortable with Looker, it dawned on us that we could use it to replicate much of Mixpanel’s reporting functionality. Looker’s DSL for building dashboards, called LookML, and its templated filters provided a powerful way to build virtually any dashboard imaginable.
This made it just as easy to access our data in Looker as it was in Mixpanel: anyone can still pull and visualize data without having to understand SQL or R.
As we became more advanced in our Looker development, we were able to build dashboards similar to Mixpanel’s event segmentation report:
Most significantly, we were able to take advantage of Kickstarter-specific knowledge and practices to create even more complex dashboards. One we’re most proud of is a dashboard that visualizes the results of A/B tests:
Owning your own analytics infrastructure isn’t merely about replicating services you’re already comfortable with. It is about opening up a field of opportunities for new products and insights beyond your team’s current roadmap and imagination.
Replacing a best-in-class service like Mixpanel isn’t for the faint of heart, and requires serious engineering, staffing, and infrastructure investments. However, given the maturity and scale of our application and community, the benefits were clear and worth it.
If this post was helpful to you, or you’ve built something similar, let us know!