# A/B Test Reporting in Looker

One of the Data team’s priorities this year has been improving the Kickstarter A/B testing process. To this end, I’ve been focused on making it easier to set up, run, and analyze experiments. This will make it more likely we'll use data from experiments to inform product design.

Until recently, we monitored A/B tests in an ad hoc way. We log A/B test data through our event tracking infrastructure, so while a test was running, a Product Manager or Data Analyst watched the number of users in the experiment's event stream until it reached the required sample size. At that point, we ran the numbers through an R script and sent out the results.

# Enter Looker

Kickstarter recently adopted a business intelligence tool called Looker to support data reporting and ad hoc analysis. Looker connects directly to our Redshift cluster, which is where we store the raw event data from our A/B tests. This made me wonder whether we could use Looker to monitor experiments and report results.

One feature of Looker we like is the ability to save and schedule queries, with the results delivered via email. If we could find a way to analyze A/B tests via SQL, then Looker could handle the rest.

# Back to School

How can we do statistics in SQL without access to probability distributions? There are methods for generating normally distributed data in SQL, but this approach seems like overkill. We don't need to recreate a standard normal distribution on the fly. The values don't change.

My aha moment was remembering my old statistics textbooks with look-up tables in the back. By adding look-up tables for probability distributions to Redshift, we can get good approximations of power, p-values, and confidence intervals for the typical A/B tests we run. Although this means simulating a continuous distribution with a discrete one, we don't rely exclusively on p-values to interpret our tests, so an error of a few thousandths of a point won't change our conclusions.
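To get a feel for how coarse this approximation is, here's a Python sketch of the same idea (our table was generated in R; the step size and z-range below are my own assumptions for illustration):

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Build a discrete (z_score, probability) table, as in the Redshift table.
step = 0.0001
table = [(i * step, normal_cdf(i * step)) for i in range(60001)]  # z in [0, 6]

# Compare the p-value from the nearest table row to the exact one.
z_exact = 1.95996  # falls between two table rows
z_nearest, prob_nearest = min(table, key=lambda row: abs(row[0] - z_exact))
p_approx = 2 * (1 - prob_nearest)
p_exact = 2 * (1 - normal_cdf(z_exact))
```

With a step this small, the nearest-row p-value differs from the exact one by well under a ten-thousandth of a point.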

# The Nitty Gritty

As an example, I'm going to use a common type of test we run — a hypothesis test of the difference of two proportions. (If you'd like to learn more about the statistics behind this test, this is a good place to start).

To make this concrete, let's say we're testing a new design of the Discover page, and we want to know whether it affects the number of users clicking through to project pages.

To generate a test statistic for this type of test, we need a standard normal distribution. I generated a set of z-scores and their probabilities in R and loaded this into Redshift as `standard_normal_distribution`.

The table looks something like this:

| z_score | probability |
| --- | --- |
| 0 | 0.5 |
| 0.0000009999999992516 | 0.50000039894228 |
| 0.00000199999999939138 | 0.500000797884561 |
| 0.00000299999999953116 | 0.500001196826841 |
| 0.00000399999999967093 | 0.500001595769122 |
| 0.00000499999999981071 | 0.500001994711402 |
| 0.00000599999999995049 | 0.500002393653682 |
| 0.00000700000000009027 | 0.500002792595963 |
| 0.00000799999999934187 | 0.500003191538243 |
| 0.00000899999999948164 | 0.500003590480523 |
| 0.00000999999999962142 | 0.500003989422804 |
| 0.0000109999999997612 | 0.500004388365084 |
| 0.000011999999999901 | 0.500004787307365 |
| 0.0000130000000000408 | 0.500005186249645 |
| 0.0000139999999992924 | 0.500005585191925 |

In the sections below, I'll use the output of that query to calculate the sample proportion, standard error, and other sample statistics using subqueries called common table expressions (CTEs). If you aren't familiar with this flavor of SQL syntax, you can think of CTEs as forming temporary tables that can be used in subsequent parts of the query.
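As a minimal, self-contained illustration of how CTEs chain together (using SQLite from Python, with made-up event data and column names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (variant TEXT, converted INTEGER);
    INSERT INTO events VALUES
        ('control', 1), ('control', 0), ('control', 0), ('control', 1),
        ('experimental', 1), ('experimental', 1), ('experimental', 0), ('experimental', 1);
""")

# Each CTE acts like a temporary table visible to later parts of the query.
row = conn.execute("""
    WITH control AS (
        SELECT SUM(converted) * 1.0 AS x, COUNT(*) * 1.0 AS n
        FROM events WHERE variant = 'control'
    ), experimental AS (
        SELECT SUM(converted) * 1.0 AS x, COUNT(*) * 1.0 AS n
        FROM events WHERE variant = 'experimental'
    )
    SELECT control.x / control.n, experimental.x / experimental.n
    FROM control, experimental
""").fetchone()
print(row)  # (0.5, 0.75)
```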

Using a CTE, we calculate $\hat{p}$, the pooled proportion under the null hypothesis:

```sql
...
), p_hat AS (
    SELECT (control.x + experimental.x) / (control.n + experimental.n) AS p
    FROM control, experimental
),
...
```

Next we calculate the pooled standard error under the null hypothesis:

```sql
...
), se_pooled AS (
    SELECT SQRT((p_hat.p * (1 - p_hat.p)) * (1 / control.n + 1 / experimental.n)) AS se
    FROM control, experimental, p_hat
),
...
```

This allows us to calculate an exact z-score from the data:

```sql
...
), z_exact AS (
    SELECT ABS((control.p - experimental.p) / se_pooled.se) AS z
    FROM control, experimental, se_pooled
),
...
```

Then we find the nearest z-score in our standard normal look-up table and use that to calculate a p-value:

```sql
...
), z_nearest AS (
    SELECT standard_normal_distribution.z_score AS z_score
    FROM standard_normal_distribution, z_exact
    ORDER BY ABS(standard_normal_distribution.z_score - z_exact.z) ASC
    LIMIT 1
), p_value AS (
    SELECT (1 - standard_normal_distribution.probability) * 2 AS p
    FROM z_nearest
    INNER JOIN standard_normal_distribution
        ON z_nearest.z_score = standard_normal_distribution.z_score
),
...
```
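As a sanity check on the SQL, the same pooled z-test can be written in straight Python (the counts below are made up, and this uses the exact normal CDF rather than the look-up table):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value for a pooled two-proportion z-test,
    mirroring the p_hat, se_pooled, z_exact, and p_value CTEs."""
    p1, p2 = x1 / n1, x2 / n2
    p_hat = (x1 + x2) / (n1 + n2)                            # pooled proportion
    se = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))  # pooled standard error
    z = abs((p1 - p2) / se)                                  # exact z-score
    return 2 * (1 - normal_cdf(z))

# Made-up counts: 5.2% vs. 5.7% click-through on 10,000 users each.
p = two_proportion_p_value(x1=520, n1=10000, x2=570, n2=10000)
```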

Having a p-value is a good start, but we also want to generate confidence intervals for the test. While we're at it, we'd also like to conduct a power analysis so the test results only display when we've reached the minimum sample size.

To do that properly, we need some details about the test design: the significance level, the power, and the minimum change to detect between the two variants. These are all added to the query using Looker's templated filters, which take user input and insert it into the query as parameters.

Unfortunately, Looker can't simply splice an arbitrary value (e.g. 0.05) into any part of a query. To get around this, I filter a table column with the user input and then use the resulting value.

For example, in the following section, the query takes user input as `significance_level`, matches it against the `probability` column of the `standard_normal_distribution` table (after some rounding to ensure a match), and saves that value as `alpha`:

```sql
...
), significance_level AS (
    SELECT ROUND(probability, 3) AS alpha
    FROM standard_normal_distribution
    WHERE {% condition significance_level %} ROUND(probability, 3) {% endcondition %}
    LIMIT 1
),
...
```

Note Looker's syntax for what it calls templated filters:

```sql
WHERE {% condition significance_level %} ROUND(probability, 3) {% endcondition %}
```

If the user input is 0.05 for the `significance_level` filter, Looker converts this to:

```sql
WHERE ROUND(probability, 3) = 0.05
```

See the appendix below for the entire query.

Admittedly, doing all this in SQL is kind of preposterous, but it means that we can add it to Looker as the engine of an A/B Test Dashboard. The dashboard abstracts away all the calculations and presents a clean UI for taking user input on the parameters of the test design, allowing people without any special engineering or data expertise to use it. Now that it's built into Looker, it's part of our larger data reporting infrastructure.

After taking input about the test design, the dashboard calculates the number of users in each variant, their conversion rates, and the minimum sample size. If the sample size has been met, the dashboard also outputs a p-value and confidence interval for the test. The dashboard can be scheduled to run daily, and we can even set it up to email only when there are results to report.
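The minimum sample size arithmetic the dashboard performs (see the appendix query) can be sketched in Python like this. The hard-coded z-scores stand in for the Redshift look-up table, and the parameter values are illustrative:

```python
import math

def minimum_sample_size(p1, relative_change, alpha=0.05, power=0.8):
    """Per-variant minimum sample size for a two-proportion test,
    following the formula in the appendix query. The z-scores are
    hard-coded here instead of read from the look-up table."""
    z_table = {0.975: 1.95996, 0.8: 0.84162}    # probability -> z_score
    z_alpha = z_table[round(1 - alpha / 2, 3)]  # two-sided significance
    z_beta = z_table[round(power, 3)]
    p2 = p1 * (1 + relative_change)             # proportion we want to detect
    p_bar = (p1 + p2) / 2
    se1 = math.sqrt(2 * p_bar * (1 - p_bar))
    se2 = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return ((z_alpha * se1) + (z_beta * se2)) ** 2 / (p2 - p1) ** 2

# Detecting a 10% relative lift on a 5% baseline rate at alpha = 0.05
# and power = 0.8 needs on the order of 31,000 users per variant.
n = minimum_sample_size(0.05, 0.10)
```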

Now when we implement a new A/B test, we add it to Looker so we can get daily status emails, including statistical results when the test is complete. This can be done by someone on the Product or Engineering teams, freeing up Data team resources to focus on designing experiments well and running more complex tests.

# The Results

This kind of dashboard pushes Looker to its limits, so naturally there are some drawbacks to doing A/B test reporting this way. It separates the implementation of the test from the implementation of the reporting, so there is some duplicated effort. Furthermore, it only works for specific types of tests where the math can be handled by a SQL query and a static probability distribution.

On the other hand, we're happy that Looker is flexible enough to allow us to prototype internal data tools. The A/B Test Dashboard has automated what was a very manual process before, and it has reduced the dependency on the Data team for monitoring and reporting the results of common types of tests. This all means we can run more experiments to create a better experience for our users.

# Find this interesting?

If this kind of thing gets your data juices flowing, you should know we're hiring for a **Data Scientist**! Head over to the job description to learn more.

# Appendix

Our query in full:

```sql
WITH control AS (
    -- count x and n as floats
), experimental AS (
    -- count x and n as floats
), p_hat AS (
    SELECT (control.x + experimental.x) / (control.n + experimental.n) AS p
    FROM control, experimental
), se_pooled AS (
    SELECT SQRT((p_hat.p * (1 - p_hat.p)) * (1 / control.n + 1 / experimental.n)) AS se
    FROM control, experimental, p_hat
), z_exact AS (
    SELECT ABS((control.p - experimental.p) / se_pooled.se) AS z
    FROM control, experimental, se_pooled
), z_nearest AS (
    SELECT standard_normal_distribution.z_score AS z_score
    FROM standard_normal_distribution, z_exact
    ORDER BY ABS(standard_normal_distribution.z_score - z_exact.z) ASC
    LIMIT 1
), p_value AS (
    SELECT (1 - standard_normal_distribution.probability) * 2 AS p
    FROM z_nearest
    INNER JOIN standard_normal_distribution
        ON z_nearest.z_score = standard_normal_distribution.z_score
), se_unpooled AS (
    SELECT SQRT(((control.p * (1 - control.p)) / control.n) + ((experimental.p * (1 - experimental.p)) / experimental.n)) AS se
    FROM control, experimental
), significance_level AS (
    SELECT ROUND(probability, 3) AS alpha
    FROM standard_normal_distribution
    WHERE {% condition significance_level %} ROUND(probability, 3) {% endcondition %}
    LIMIT 1
), power AS (
    SELECT ROUND(probability, 3) AS beta
    FROM standard_normal_distribution
    WHERE {% condition power %} ROUND(probability, 3) {% endcondition %}
    LIMIT 1
), change_to_detect AS (
    SELECT ROUND(probability, 3) AS change_in_proportion
    FROM standard_normal_distribution
    WHERE {% condition minimum_change_to_detect %} ROUND(probability, 3) {% endcondition %}
    LIMIT 1
), z_alpha AS (
    SELECT standard_normal_distribution.z_score AS z
    FROM standard_normal_distribution, significance_level
    WHERE ROUND(standard_normal_distribution.probability, 3) = ROUND(1 - alpha / 2, 3)
    ORDER BY ABS(standard_normal_distribution.probability - (1 - alpha / 2)) ASC
    LIMIT 1
), z_beta AS (
    SELECT standard_normal_distribution.z_score AS z
    FROM standard_normal_distribution, power
    WHERE ROUND(standard_normal_distribution.probability, 3) = ROUND(beta, 3)
    ORDER BY ABS(standard_normal_distribution.probability - beta) ASC
    LIMIT 1
), confidence_interval AS (
    SELECT (experimental.p - control.p) - (z_alpha.z * se_unpooled.se) AS lower,
           (experimental.p - control.p) + (z_alpha.z * se_unpooled.se) AS upper
    FROM control, experimental, se_unpooled, z_alpha
), proportions AS (
    SELECT control.p AS p1,
           (control.p * change_in_proportion) + control.p AS p2
    FROM control, change_to_detect
), standard_errors AS (
    SELECT SQRT(2 * ((p1 + p2) / 2.0) * (1 - ((p1 + p2) / 2.0))) AS se1,
           SQRT((p1 * (1 - p1)) + (p2 * (1 - p2))) AS se2
    FROM proportions
), minimum_sample_size AS (
    SELECT (((z_alpha.z * se1) + (z_beta.z * se2))^2) / (p2 - p1)^2 AS n
    FROM z_alpha, z_beta, proportions, standard_errors
)
SELECT control.n AS step1_control,
       control.x AS step2_control,
       control.p AS rate_control,
       experimental.n AS step1_experimental,
       experimental.x AS step2_experimental,
       experimental.p AS rate_experimental,
       CASE WHEN control.n >= minimum_sample_size.n AND experimental.n >= minimum_sample_size.n
            THEN confidence_interval.lower / control.p ELSE NULL END AS lower_confidence_interval,
       CASE WHEN control.n >= minimum_sample_size.n AND experimental.n >= minimum_sample_size.n
            THEN (experimental.p - control.p) / control.p ELSE NULL END AS relative_change_in_rates,
       CASE WHEN control.n >= minimum_sample_size.n AND experimental.n >= minimum_sample_size.n
            THEN confidence_interval.upper / control.p ELSE NULL END AS upper_confidence_interval,
       CASE WHEN control.n >= minimum_sample_size.n AND experimental.n >= minimum_sample_size.n
            THEN p_value.p ELSE NULL END AS p_value,
       minimum_sample_size.n AS minimum_sample_size
FROM control, experimental, p_value, confidence_interval, minimum_sample_size;
```