Backing & Hacking

    • GitHub for Poets


      po·et ˈpōət,ˈpōit noun 1. You're not a programmer (yet). But you want to have a hand in crafting the code that runs our site. That's awesome and you're awesome.


      Recently, our operations engineer Aaron Suggs had an amazing idea that would pull back the curtain a bit on the engineering team's workflow. What if we gave a workshop to interested non-dev Kickstarter employees (“poets”) on how to engage with our code via GitHub's web flow? This GitHub tool eases the burden of performing the various Terminal.app commands that can be intimidating to beginners. Teaching employees how to make code changes this way could have several benefits, including but not limited to:

      • Providing the tools for interested employees to make copy edits 
      • Empowering members of Kickstarter's Community and Operations teams to participate in the world of code 
      • Eliminating developer unavailability as an obstacle to making safe changes

      I co-taught the course because my own entrance into the world of programming had similar ideological roots. I originally worked on the Community team at Kickstarter, but after months of enthusiastic code consumption and a few features deployed to production, I moved to our engineering team full-time. An accessible, inviting, and compassionate engineering team culture can change lives!

      After putting together a rough syllabus, Aaron and I held 4 separate “GitHub for Poets” sessions, each attended by 5-8 employees from various teams in the company. The sessions had individual flavors; one was fueled by a need to make a copy edit on our Support navigation bar that was eventually merged. Another was led by the desire to make silly changes to the homepage for ultimate "Wow!" factor, which I personally supported because the changes were so visible and exciting (though they remained unmerged).

      In all cases, the structure of each hour-long course unfolded as follows: 

      1. Healthy programming attitudes (scratch your own itch, don't repeat yourself, etc.)
      2. Introduction to our tech stack (HAML, Sass, JavaScript, Ruby)
      3. Rails directory structure
      4. Creating feature branches via the GitHub UI
      5. Finding files, editing, and committing changes with helpful commit messages
      6. Opening a pull request for the feature branch

      We taught each of these sections to the enthusiastic maximum, and each employee was encouraged to add commits to the branch. The mock pull requests went out to the whole dev team, who responded enthusiastically with their comments, suggestions, and emoji. Good vibes abounded!

      Since the first round of GitHub for Poets courses ended, multiple employees who aren't on the engineering team have made commits that were ultimately merged, including changes to our jobs page, policy pages, and support resources. One of these changes even touched some Ruby code. We require each change to be made on a feature branch, submitted as a pull request, and merged by a developer, but these little bits of process are no hindrance to the passion of poets.

        Working for a tech company involves relying on code that fuels our jobs, our users, and our community, but it often happens that only 20% of the company can touch that same code! By encouraging personal responsibility, a willingness to ask for help, and constructive feedback between engineers and poets, any startup can help open the doors to more inclusive development.

    • Hierarchy of DevOps needs

      Operations teams have many responsibilities: monitoring, provisioning, building internal tools, and so on.

      To better understand these responsibilities, I organize them as a hierarchy of needs.

      Hierarchy of DevOps needs

      Low levels of the pyramid are basic needs, and prerequisites for higher levels. Higher levels are more personally fulfilling and valuable to your organization.

      The goal is to deal with low level issues effectively so you can spend most of your attention at higher levels.

      For example, if you’re one server crash away from losing your data (backups), it hardly matters if you make deployments faster (team efficacy). Fixing your backups is the priority.

      At each level of the pyramid, there’s lots of work to do and successful careers to be had. As your organization evolves, you'll inevitably need to attend to each tier. A high-functioning team solves issues thoroughly and effectively, so they can focus their attention up the pyramid.

      Even if your job is focused on a single tier, say, security, there are ways to move your attention up the pyramid. You can use frameworks, static analysis, and linting tools to prevent many types of security vulnerabilities (team efficacy). You can use blameless post mortems for training and remediation (team happiness).

      This hierarchy was inspired by Maslow’s hierarchy of needs about human motivation, and Heinemeier Hansson’s levels of aspiration for programmers.

      It's a useful model for ranking work to be done and gauging the effectiveness of your operations team.

    • Reporting iOS crashes to StatsD

      Here at Kickstarter we use Crashlytics for our crash reporting in the iOS app. It serves us well and does a lot of work that we don’t want to have to worry about. However, it also introduces more fragmentation into our toolset for monitoring our services. We already have an extensive set of StatsD metrics that we monitor, so it would be nice to see graphs of crashes right next to graphs of our HTTP 200s and 500s.

      Crashlytics provides a delegate method -crashlytics:didDetectCrashDuringPreviousExecution: that is called when the library has detected that the last session of the app ended in a crash. We use this method to hit an endpoint on our servers that will increment a StatsD bucket. The request contains info about the build number and iOS version at the time of the crash, and we include that in the StatsD bucket:


      STATSD.increment "ios.crash.#{build}.#{ios_version}"
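
      On the server side, the endpoint is tiny. A minimal sketch of what it might look like (the controller and parameter handling here are illustrative, not our exact code):

      # Hypothetical Rails endpoint the app hits when Crashlytics reports
      # that the previous session ended in a crash.
      class CrashReportsController < ApplicationController
        def create
          build       = params[:build].to_s[/\A[\w.]+\z/]
          ios_version = params[:ios_version].to_s[/\A[\w.]+\z/]

          # Same bucket naming as above, namespaced by build and iOS version.
          STATSD.increment "ios.crash.#{build}.#{ios_version}" if build && ios_version

          head :ok
        end
      end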
      

      Now we can build a graph that shows the total number of crashes over time, or break out into many graphs for crashes per iOS version or app version. The graphs help us to tell a story about the overall stability of our releases, or what fixes have been effective.

      For example, when looking at the total number of crashes over time, we can clearly see an increase in mid September:

      Total # of crashes

      This is right around the time iOS 7 came out. In order to confirm this we can look at a graph that breaks out crashes per iOS version:

      Crashes by iOS Version

      Now it is very clear that iOS 7 has elevated crash rates, though each minor release has been slightly better. For example, the first release of iOS 7 seemed to have a bug in CFNetwork causing hundreds of crashes a day in a function named mmapFileDeallocate. This crash has not happened since 7.0.2, which is reflected in the red and purple curves.

      Since the crash rates are still higher than what we saw in iOS 6, we looked for other ways to work around the most common crashes. One of the more perplexing iOS 7-specific crashes we see has to do with data detectors in UITextView instances. It happens seemingly randomly, and has occurred in every version of iOS 7. In our most recent update we wrote a UITextView subclass that provided a custom implementation of data detectors in hopes of getting around this crash. The benefits of this work can now be seen by looking at a graph of crashes by build #:

      Crash by build #

      Build #510 (the blue curve) is the first build with the fix, and it has the lowest number of crashes we’ve seen in a while.

      This form of crash tracking has been very useful to us. In fact, it’s become so important that we put a version of these graphs on our deploy dashboard so that we can immediately see if an API or mobile web change affects the crash rate of the app. By leveraging the tools and infrastructure that we are already comfortable with from our web app we allow every engineer to take part in the monitoring and diagnosing of iOS app problems.

    • Drew Conway on Large-Scale Non-Expert Text Coding

      As part of our series of informal engineering and data focused talks at Kickstarter, we hosted Drew Conway to present his PhD thesis on large-scale non-expert text coding.

      Drew's a leader in the New York data scene, so we were already excited to have him, but the fact that his work focused on Amazon's Mechanical Turk service got me really excited. 

      In his talk, Drew goes into detail on how he used the platform to determine whether non-experts could properly identify bias in political text. Drew's a great speaker and this presentation was no exception; the video is a must-watch if you're interested in pursuing large-scale research on the web.

      Thanks again, Drew!

    • Kickstarter meetup at RubyConf


      Join Kickstarter engineers @tiegz, @emilyreese, and @ktheory at RubyConf in Miami Beach.

      We're hosting a meetup and drinks at the Segafredo l'Originale bar on Friday, Nov 8 at 8pm. We'd love to hang out and talk shop, whether that's crafting mature web applications, inspiring OSS projects, or your latest DIY 3D printer hacks.


    • Unit Testing for Rails Views

      Sometimes, view logic happens to good people.

      It might be checking for empty result sets. Or rendering optional related data. Sometimes it's state- or role-based logic. It happens.

      A typical Rails app tests the view layer during functional controller tests. But this coverage feels accidental at best and inefficient at worst. We wanted better coverage with faster tests that people would actually want to write and run.

      So we split apart our functional tests. Here's how.

      A View Testing Pattern

      First we experimented to find a good pattern. We extended ActionView::TestCase and worked out a 3-part syntax based on shoulda.


      class ProjectsViewTest < Kickstarter::ViewTest
      
        context "show" do
          setup { @project = FactoryGirl.build(:project) }
          subject { render 'projects/show', formats: [:html] }
          should "render" do
            assert_select '*'
          end
        end
      
      end
      

      To make this work, we needed to ensure that subject was triggered:


      class Kickstarter::ViewTest < ActionView::TestCase
      
        # shoulda lazily evaluates subject and then memoizes it.
        # so we just need to reference it. the result of any
        # `render` in the subject block will be available for
        # assert_select thanks to ActionView::TestCase.
        def assert_select_with_subject(*args, &block)
          subject
          assert_select_without_subject(*args, &block)
        end
        alias_method_chain :assert_select, :subject
      
        # ensure a default subject_block
        def self.subject_block; proc {} end
      end
      

      One nice part about using subject this way is that we could easily test variations with nested contexts:


      class ProjectsViewTest < Kickstarter::ViewTest
      
        context "show" do
          subject { render 'projects/show', formats: [:html] }
      
          context "for a live project" do
            setup { @project = FactoryGirl.build(:project, :live) }
            should "render" do
              assert_select '*'
            end
          end
      
          context "for a successful project" do
            setup { @project = FactoryGirl.build(:project, :successful) }
            should "render" do
              assert_select '*'
            end
          end
        end
      
      end
      

      Transitioning

      Once we had a syntax we liked, we presented it to the team and talked about how to proceed. We wanted to write new tests this way, but back-filling the suite was daunting.

      So we set our sights on a realistic goal. The current controller tests had some existing accidental coverage. Why not just move it?

      First, we identified the existing coverage using loggers patched into the controller's render method. This generated a list of templates that looked lengthy but doable.
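
      One way to collect that list is a temporary patch like this (a rough sketch; the log file path and method names are illustrative). We hook template rendering here because @virtual_path maps directly to a template path:

      # Illustrative: while the controller suite runs, append every template
      # that actually renders to a file we can sort and dedupe afterwards.
      class ActionView::Template
        def render_with_coverage_log(*args, &block)
          File.open(Rails.root.join('tmp/rendered_templates.log'), 'a') do |f|
            f.puts @virtual_path
          end
          render_without_coverage_log(*args, &block)
        end
        alias_method_chain :render, :coverage_log
      end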

      Then we created an ERB template to auto-generate test stubs for all of these templates. Some of the auto-generated test stubs worked right away, so our list immediately got shorter.
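
      The generator doesn't need to be fancy. A hedged sketch of the idea, assuming the logged template list from above (file paths and naming are illustrative, and it runs inside the Rails environment so camelize is available):

      require 'erb'

      # One skeleton view test per logged template.
      stub = ERB.new(<<-RUBY)
      class <%= class_name %>ViewTest < Kickstarter::ViewTest

        context "<%= template %>" do
          subject { render '<%= template %>', formats: [:html] }
          should "render" do
            assert_select '*'
          end
        end

      end
      RUBY

      File.readlines('tmp/rendered_templates.log').map(&:strip).uniq.each do |template|
        class_name = template.split('/').first.camelize
        File.write("test/unit/views/#{template.tr('/', '_')}_test.rb", stub.result(binding))
      end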

      It took a few days to finish out the test suite. This gave us time to think about our patterns and set precedents for the rest of the team to follow. In some cases we even got to refactor a bit of code so the tests would be easier. Bonus!

      Disabling Controller Rendering

      Once we had a suite of view tests, we decided to disable controller rendering. The trick to accomplishing this was in supporting assert_response and assert_template without messing up the template compilation that we needed for the view tests. Here's the simplest patch we came up with:


      class ActionController::TestCase
        setup do
          ActionView::Template.class_eval{ alias_method :render, :stubbed_render }
        end
        teardown do
          ActionView::Template.class_eval{ alias_method :render, :unstubbed_render }
        end
      end
      
      class ActionView::Template
        alias_method :unstubbed_render, :render
        def stubbed_render(*)
          ActiveSupport::Notifications.instrument("!render_template.action_view", :virtual_path => @virtual_path) do
            # nothing but the instrumentation
          end
        end
      end
      

      The payoff was worth it. We cut about 35% off our controller suite time, and since we used build and build_stubbed strategies for the view test factories, only a fraction of that time was spent running the new view tests. Nice!

      In Review

      Our test suite feels more focused, more performant, and easier to extend. We have more confidence in our ability to make certain refactors in the views layer. And we can still rely on integration tests to put all the pieces together for critical paths.

      We're continuing to explore the pattern, but so far this just feels right.

    • HTML5 Video First

      Video is an integral part of Kickstarter, and ever since our launch, we’ve depended on Adobe’s Flash to serve videos in a proprietary player. Today, thanks to HTML5’s <video> tag, that changes.

      Previously, we had been serving our project videos to desktop machines using Flash. If Flash was not detected, our players would fall back to an HTML5 <video> element.

      Starting now, however, we are inverting this logic: we will only serve Flash video if a user's browser doesn't support the <video> tag with H.264 playback (which includes some versions of Firefox).

      Why

      Most mobile devices do not support Flash. It’s simpler to use the exact same software on both desktop and mobile.

      Some computers do not ship with Flash. We don’t want to require users to install software to use our website. We still use Flash for other features (users have to use it to upload media), and we will work to remove those Flash requirements.

      We have never had an in-house Flash developer, because while our video player is important, most of the client-side code is written in JavaScript. Every time we have wanted to redesign our Flash player or add a new feature we have had to ask an outside consultant, which took time and made quick turnarounds difficult.

      Flash is insecure. Due to ExternalInterface.call, it’s extremely easy to accidentally allow the execution of JavaScript sent via a param.

      How

      For a couple weeks now, we’ve been building up our own HTML5 player. We started serving our new player to people who didn’t have Flash installed, which allowed us to work out the UX and bugs with a small set of users. A few days ago, we enabled the HTML5 player for employees of Kickstarter, widening the user base even more.

      We haven’t found the need to use a larger library because the video element specification is so clear and simple to use. Our designers are very skilled in CSS, and we used those skills to design the look and feel of the video player. It’s a pleasure to apply our core abilities to something so central to our site.

      Who

      Design by Zack Sears and Brent Jackson.

      Interaction by Samuel Cole.

    • Elasticsearch at Kickstarter

      Back in December 2012, we developed a new version of our project search tool on Kickstarter using Elasticsearch. We're really happy with the results and have since found Elasticsearch's filtering and faceting features useful in tools for project creators, our message inbox, and other areas of the site. I'd like to write a little on how we gradually rolled out Elasticsearch, as it might be useful for others looking at adding secondary data stores to their stack.

      Ramping up

      We liked Elasticsearch's features but were initially cautious about how it would behave in production, so we deployed a change that let us divert a percentage of project search requests to a new version built with Elasticsearch:

      ratio = File.read(ELASTICSEARCH_RATIO_PATH).to_f rescue 0.0
      experimental = (rand <= ratio)
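
      The experimental flag then just picks which backend serves the request. A hedged sketch (the class names are illustrative; only the ratio check above is our actual code):

      results =
        if experimental
          ElasticsearchProjectSearch.new(params).results  # new Elasticsearch-backed search
        else
          LegacyProjectSearch.new(params).results         # existing implementation
        end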

      Over time we ramped up the percentage of traffic sent to our Elasticsearch implementation, keeping a close eye on internal metrics to evaluate its performance. Fortunately we didn't hit any major snags, so before long we were sending 100% of our project search traffic to Elasticsearch. For more reading on the topic, there are some great posts by Etsy and Flickr that go into more detail on config flags and rolling out features gradually.

      An index primer

      An index in Elasticsearch is a logical namespace for data and can store multiple types of documents. Types roughly correspond to business models, and each index has a mapping that defines how it stores its types. At Kickstarter, each index defines the mapping for just one type, so an index for projects only defines the mapping for a project type. A very simplistic mapping for a project type with a name and goal might look like this:

      $ curl -XGET 'http://localhost:9200/projects/project/_mapping'
      {
        "project": {
          "properties": {
            "name": {
              "type": "string"
            },
            "goal": {
              "type": "double",
              "null_value": 0.0
            }
          }
        }
      }

      Keeping indices up to date

      MySQL is our canonical data store. When an index in Elasticsearch is first created, it contains no documents, so a full index must be performed from MySQL to populate it. Once the index has been populated, it's ready to respond to search requests. However, the data in MySQL changes over time. New projects are created, existing projects are updated. These changes need to be sent to Elasticsearch or the projects index will have stale/outdated data.
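
      The initial population is just a batched walk over MySQL. A rough sketch, assuming an elasticsearch-ruby-style client in a constant we'll call ELASTICSEARCH (the batch size and field list are illustrative):

      # Bulk-index every project from MySQL into the freshly created, empty index.
      new_index = 'projects'  # in practice, the name of the new index being populated

      Project.find_in_batches(batch_size: 1000) do |projects|
        actions = projects.map do |project|
          { index: { _index: new_index, _type: 'project', _id: project.id,
                     data: { name: project.name, goal: project.goal } } }
        end
        ELASTICSEARCH.bulk(body: actions)
      end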

      Each document has an ID in Elasticsearch, and a document can be updated by performing an index operation using that ID. Each project document in Elasticsearch has the same ID as its corresponding record in MySQL. When the project changes in MySQL, we're able to reindex just that project document in Elasticsearch so that our search index is only a few seconds delayed behind MySQL.
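
      In its simplest form, that per-document update is just an index call keyed by the MySQL id (same illustrative client as above; the wiring that triggers it from our models is omitted here):

      # Reindex a single project after its MySQL row changes.
      ELASTICSEARCH.index(index: 'projects', type: 'project', id: project.id,
                          body: { name: project.name, goal: project.goal })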

      The need for new indices

      Performing updates to documents in Elasticsearch to keep them in sync with MySQL has taken us some time to get right (a topic for another blog post!), but one way we've mitigated problems with stale data is by making it really easy to create a new index and fully populate it with the latest data from MySQL. This is also useful when the mapping for a type needs to change. Rather than updating the mapping for an existing index, we create an index with the new mapping and populate it, and any old indices are left as is. This avoids having to deal with mapping merge conflicts or inconsistencies with documents having been indexed using different mappings.

      This process of creating and populating new indices started off with a cron task to fully index projects every 20 minutes. As we improved our ability to keep Elasticsearch in sync with MySQL, we reduced the frequency of the cron task so that now the full index is only performed nightly.

      The full indexing nitty-gritty

      Each time we create a new index, it is given a name based on the type and time, e.g. projects_2013_05_19_13_33_27. It takes some time to fully populate a new index, so while it is building, all our reads continue to go to the existing projects index. Elasticsearch has a nifty aliasing feature that allows us to associate indices with an alias. Search requests are sent to the alias, which directs the requests to any indices that it has been associated with. Our application code directs all project read requests to an alias named projects. When full indexing is complete, the projects alias is atomically switched from the old index to the new index, so we never need to hardcode index names like projects_2013_05_19_13_33_27 into our application.
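
      The swap itself is one atomic aliases call. A sketch using the same illustrative client (the old index name here is just an example):

      ELASTICSEARCH.indices.update_aliases(body: {
        actions: [
          { remove: { index: 'projects_2013_05_18_02_10_05', alias: 'projects' } },
          { add:    { index: 'projects_2013_05_19_13_33_27', alias: 'projects' } }
        ]
      })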

      Some of our more complex indices take several hours to build, so we also had to figure out what to do with records that updated while performing a full index. Both the new and existing indices need to be updated, otherwise one would have stale data.

      When a new index for projects is being populated, we associate it with the projects_new alias. We tried sending a bulk request to index changes in both projects and projects_new, but if a full index isn't taking place then this request would 404 since no index would be associated with the projects_new alias. Instead, we query Elasticsearch before each write to retrieve the indices aliased to projects and projects_new, and perform a single bulk indexing request directly against those indices.
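
      A sketch of that write path, again with illustrative client calls and helper names:

      # Resolve the concrete indices behind an alias; an alias with nothing
      # behind it (e.g. projects_new outside a full index) just yields [].
      def indices_for(alias_name)
        ELASTICSEARCH.indices.get_alias(name: alias_name).keys
      rescue Elasticsearch::Transport::Transport::Errors::NotFound
        []
      end

      def index_project(project)
        actions = (indices_for('projects') + indices_for('projects_new')).map do |index|
          { index: { _index: index, _type: 'project', _id: project.id,
                     data: { name: project.name, goal: project.goal } } }
        end
        ELASTICSEARCH.bulk(body: actions)
      end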

      The nice thing about our setup is that performing a full index has no user impact. The existing index is kept up to date, and the new index is only switched once it's completely ready.

    • Rack::Attack: protection from abusive clients

      I'm excited to introduce Rack::Attack, a Ruby Rack middleware for throttling abusive requests. We depend on it to keep Kickstarter fast and reliable.

      If you've looked at web server logs, you know there are some weird clients out there. Malicious scripts probe for exploits. Scrapers download the same page dozens of times each second, or request the 10,000th page of comments for a post with only 2 comments.

      Tackling each curious anomaly that threatens your site's reliability saps developer productivity and happiness. Rack::Attack lets you throttle abusive requests with just a few lines of code. Check out the README for more details about how it works. Seriously, the README does a great job explaining how to use it. Okay, I'm going to assume you've skimmed the README. Moving on.

      What kind of requests do we throttle?

      We limit the number of requests that can be made per IP address in a short time period like this:

      Rack::Attack.throttle('ip', limit: x, period: y) do |req|
        req.ip
      end

      Pro tip: to allow occasional bursts, set the limit and period to a higher multiple. Instead of limit: 1, period: 1 (1 req/s), do limit: 10, period: 10. The long-term average still can't exceed 1 req/s.

      Typical visitors never come close to our limit. But aggressive scrapers often do. Of course we graph it.

      Throttled requests

      Those shark fin-shaped spikes are our database thanking us.

      For the security of our users, we have a stricter throttle for login attempts. This makes it very time-consuming for attackers to guess users' passwords.

      # Throttle logins per ip
      Rack::Attack.throttle("login_ip", limit: x, period: y) { |req|
        req.ip if req.post? && req.path == "/session"
      }
      # Throttle logins per email param (regardless of ip)
      Rack::Attack.throttle("login_email", limit: x, period: y) { |req|
        req.params['email'].presence if req.post? && req.path == "/session"
      }

      We also use the IPCat ruby library to detect requests from well-known datacenters. You could block login attempts from datacenters with this:

      Rack::Attack.blacklist('bad_login_ip') { |req|
        req.post? && req.path == "/session" && IPCat.datacenter?(req.ip)
      }

      Easily graph requests

      Rack::Attack can also track requests without blocking them. On Feb 14, we launched our iPhone app, and wanted an easy way to monitor the HTTP requests it generates. Since the app uses a special header, it was simple to track with Rack::Attack:

      Rack::Attack.track("ios_app") { |req|
        req.env.key?("HTTP_OUR_CUSTOM_HEADER")
      }

      We are very happy with how it went:

      iPhone app launch

      We rely on Rack::Attack to let developers quickly track and throttle requests. It helps keep our site reliable, so we can spend more energy building better features. We're glad to make it publicly available to the open source community.

    • An Engineering Talk with Kickstarter Creator Dan Shiffman

      We've been hosting a series of informal engineering-focused talks at Kickstarter and a couple of weeks ago we invited NYU professor, prolific backer, and Kickstarter creator Dan Shiffman to come talk about his book, "The Nature of Code", which raised 631% of its goal thanks to 1,189 backers.

      In Dan's highly entertaining talk, he took us through some great examples of what you can do with genetic algorithms. If you're at all interested in evolutionary science, simulations, or just want a great crash course in genetic algorithms, we highly recommend you take a peek:

      Also worth checking out is the site for his book, which hosts a number of interactive examples alongside the entire text, but we really recommend you just pick up a hard copy.

      Dan's also done a great job open sourcing the book — he's licensed it under a Creative Commons Attribution-NonCommercial license and, in addition, has made its full source material available on GitHub. Even better, Dan's been merging in pull requests as readers suggest edits and additional examples.

      We couldn't be happier to have him as part of the community!

    • The Day the Replication Died

      On Thursday, March 7th, we scrambled. Most of Kickstarter's traffic is served from replicated copies of the MySQL database, and those replicas had quit updating. The problem wasn't on one replica that we could easily replace; it was on all the replicas. And MySQL was telling us something new:

      mysql> SHOW SLAVE STATUS\G
      *************************** 1. row ***************************
      Last_Error: Could not execute Update_rows event on table
      kickstarter.backings; Duplicate entry '123456-789' for key
      'index_backings_on_project_id_and_sequence', Error_code: 1062;
      handler error HA_ERR_FOUND_DUPP_KEY; the event's master log
      mysql-bin-changelog.169933, end_log_pos 12969124
      

      We immediately set to work. Over the next few hours we kept the site stable, minimized the effects of stale replicas, communicated the issue to users, recovered, sounded the all clear, and then watched as the whole cycle repeated itself.

      But that's a different story. This is about discovering a MySQL bug. Let's talk shop.

      Background: Replication

      To understand the problem we first had to dig into MySQL's replication modes. We rely on Amazon RDS for managed MySQL, and their default replication mode is MIXED. According to MySQL's docs this is a best-of-both-worlds hybrid between statement- and row-based replication.

      To summarize:

      Statement-Based Replication

      This is the most efficient replication. In this mode, MySQL replicates the query itself, with additional context such as the current time or the next insert id. It minimizes how much the master must write to its binlog, and efficiently replays the same query on each replica.

      The downside is that some queries may not be deterministic: they may not replay the same on each replica.

      Row-Based Replication

      This is the most accurate replication. Instead of replicating the query, it replicates the new version of each row in its entirety. The replicas simply replace their version with the new version.

      Mixed-Mode Replication

      In this mode, MySQL favors efficient statement-based replication until it recognizes an unsafe query. Then it temporarily switches to row-based replication.

      Breaking Down the Problem

      Once the replication error told us where to look, we were able to easily spot our inconsistent data: a range of rows where the replicas were out of sync with the master. But this particular data had been inconsistent for days, and when we expanded our search, we found some inconsistent data over a month old. Why had it waited to break?

      master> select * from tbl;  replica> select * from tbl;
      +----+------+----------+    +----+------+----------+
      | id | foo  | uniq_col |    | id | foo  | uniq_col |
      +----+------+----------+    +----+------+----------+
      | .. | ...  | ...      |    | .. | ...  | ...      |
      | 12 | bar  | 10       |    | 12 | bar  | 4        |
      | 13 | baz  | 4        |    | 13 | baz  | 10       |
      | .. | ...  | ...      |    | .. | ...  | ...      |
      +----+------+----------+    +----+------+----------+
      

      Inconsistent data is bad enough on its own, but it was only half of our issue. It wasn't until a later unsafe query triggered row-based replication that replication broke.

      -- An example unsafe query
      master> UPDATE tbl SET foo = 'qux' ORDER BY rand() LIMIT 1;
      
      -- Would replicate like (decoded as SQL):
      replica> UPDATE tbl SET id = 13, foo = 'qux', uniq_col = 4 WHERE id = 13;
      ERROR 1062: Duplicate entry '4' for key 'index_foo_on_uniq_col'
      

      The Affected Feature

      The inconsistent data was a handful of backer sequences. We aim to assign each backer a unique and incremental number for that project when they complete the pledge process. This is pretty helpful for reports that we give to creators.

      In an effort to avoid race conditions and unnecessary rollbacks/retries from duplicate keys, we opted for a background job that updates recent sequences for a given project using a user-defined counter variable. The order is maintained through an appropriate timestamp column.

      SELECT COALESCE(MAX(sequence), 0)
      FROM backings
      WHERE project_id = ?
      INTO @sequence;
      
      UPDATE backings
      SET sequence = @sequence := @sequence + 1
      WHERE project_id = ?
        AND sequence IS NULL
        AND pledged_at IS NOT NULL
      ORDER BY pledged_at ASC, id ASC;
      

      Somehow that query had given backings a different sequence on the replicas. But it has a well-specified ORDER BY; why didn't it work?

      Back In Time

      We found evidence that in the timeframe when the data became inconsistent, a set of transactions hung while waiting for locks after writing to the backings table. InnoDB's transaction engine is optimized for COMMIT, which means it writes each query to disk such that COMMIT has nothing to do but mark the transaction as complete and release locks.

      Then, the transactions finished, but out of order. Since MySQL flushes a transaction's queries to the binlog on COMMIT, this means that the order in which records were written to disk on the master was different than the order in which the replicas wrote to disk when replaying the binlog.

      But this only matters if there's no explicit ORDER BY clause, and we had one. It just didn't match up. Puzzling on this led us to discover the final piece of the puzzle: a bug where MySQL will sometimes ignore the ORDER BY clause. Without that clause, the master and the replicas relied on their own versions of the implicit order, ran the sequencing, and fell out of sync.

      Lessons Learned

      Databases are intricate. In the Rails community we sometimes treat them as simple data stores, hidden behind ActiveRecord and ActiveRelation. But it's important to find opportunities to better understand how they do what we ask them to do. The answers are illuminating!

    • Welcome to Backing & Hacking

      Kickstarter is a platform for creative projects. But what about the platform that runs the platform? Backing & Hacking is our new blog for us to tell stories, share open source contributions and generally geek out about what it takes to build, run and maintain Kickstarter.

      We've all learned a lot from other engineering/tech blogs. Our aim here is to join the fun and give a peek behind the site.

      Creativity

      The creativity and ingenuity of the project creators on Kickstarter amaze us every day. Everyone behind the scenes also approaches their work with creativity — even if it's just changing a few lines of CSS. Not only that, but it's fun!

      Open Source

      We have released a few things on GitHub. However, there's not a lot of context there — and we have more coming. This blog is a place for us to talk about the Why behind the code.

      Making mistakes, so you don't have to

      One of the greatest things about working at Kickstarter is the freedom to take risks and explore interesting avenues — in design, implementation and even "boring" things like deployment. Sometimes things break or don't go quite as planned. Every once in a while something will break so magnificently that we think other people can learn from it. Our goal isn't to be perfect (or to look perfect, either); our goal is to get better every day, on a personal, product, and team level. There's no shame in screwing up — only in hiding from mistakes.

      Buzzword compliance

      As Kickstarter grows, so does our technology stack. As we try out new things, swap out the old, and work on sweeping changes, we now have an outlet to share. For those who love TLAs, here's some of our current stack:
      AWS, RoR, RDS, EC2, ES, DJ, Sass.
