A/B Testing
Bayesian A/B Testing in 5 Minutes https://towardsdatascience.com/bayesian-a-b-testing-and-its-benefits-a7bbe5cb5103
Frequentist vs. Bayesian approach in A/B testing https://www.dynamicyield.com/lesson/bayesian-testing/
The Bayesian Approach to A/B Testing https://www.dynamicyield.com/lesson/bayesian-approach-to-ab-testing/
Frequentism and Bayesianism: A Practical Introduction http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/
Very good article that explains common issues with A/B Testing https://towardsdatascience.com/the-joy-of-a-b-testing-theory-practice-and-pitfalls-de58acbdb04a
Good explanation of Hypothesis A/B Testing (Type 1, Type 2 error) https://towardsdatascience.com/a-quick-start-guide-to-a-b-testing-da71de09b61d
A/B testing — Is there a better way? An exploration of multi-armed bandits. A MAB algorithm is an optimisation process that dynamically shifts the allocation of impressions towards the best-performing version, addressing the explore vs. exploit dilemma. https://towardsdatascience.com/a-b-testing-is-there-a-better-way-an-exploration-of-multi-armed-bandits-98ca927b357d
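The MAB idea can be sketched with Thompson sampling, one popular bandit algorithm: each version keeps a Beta posterior over its conversion rate, and each impression goes to the version whose posterior sample is highest. A minimal stdlib-only sketch; the click-through rates below are made up for illustration.

```python
import random

# Hypothetical click-through rates per version; unknown to the algorithm.
TRUE_RATES = [0.02, 0.05, 0.10]

def thompson_sampling(n_rounds=10_000, seed=42):
    """Allocate impressions via Thompson sampling with Beta(1, 1) priors."""
    rng = random.Random(seed)
    successes = [0] * len(TRUE_RATES)
    failures = [0] * len(TRUE_RATES)
    for _ in range(n_rounds):
        # Sample each arm's Beta posterior and show the most promising version.
        draws = [rng.betavariate(1 + s, 1 + f)
                 for s, f in zip(successes, failures)]
        arm = draws.index(max(draws))
        if rng.random() < TRUE_RATES[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    # Impressions served per arm; most should flow to the best version.
    return [s + f for s, f in zip(successes, failures)]

pulls = thompson_sampling()
```

Unlike a fixed 50/50 split, the allocation adapts as evidence accumulates, which is exactly the explore/exploit trade-off the article discusses.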
In a context of very large volumes and no time constraints, the frequentist approach is easy to implement and gives satisfying results. However, the p-value concept is often misunderstood: stakeholders may believe the test answers the question "what is the probability that version B is better than version A?", when in fact it does not. Bayesian A/B testing addresses that specific question directly.
The Bayesian approach, however, involves more complex calculations and numerical integration methods. It may be best reserved for small volumes, or for situations where we need to conclude an A/B test quickly. By iterating faster we can also stack marginal gains on top of each other.
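In practice the numerical integration can be sidestepped with Monte Carlo sampling: with Beta(1, 1) priors, each version's conversion rate has a Beta posterior, and P(B > A) is estimated by comparing posterior draws. A minimal sketch; the conversion counts are made up for illustration.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Estimate P(rate_B > rate_A) by sampling Beta(1 + conversions,
    1 + non-conversions) posteriors for each version."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws

# Illustrative counts: 120/10,000 conversions for A vs 150/10,000 for B.
p = prob_b_beats_a(120, 10_000, 150, 10_000)
```

The returned number is the direct answer to "what is the probability that version B is better than version A?" — the question the p-value does not answer.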
If the p-value is less than your alpha, 0.05 in this case, you reject the null hypothesis at the 5% significance level: a difference as large as the one observed between version A and version B would be unlikely to arise by chance alone if the two versions performed identically. Note that this is not the same as a 95% probability that B truly differs from A.
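For two conversion rates, this p-value typically comes from a two-proportion z-test with a pooled standard error. A stdlib-only sketch; the conversion counts are illustrative.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B are identical.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Illustrative counts: 120/10,000 conversions for A vs 180/10,000 for B.
z, p = two_proportion_z_test(120, 10_000, 180, 10_000)
```

If `p` comes out below your chosen alpha, you reject the null hypothesis of no difference.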
Finally, you have to choose a significance level, usually called alpha, which represents the probability of making a Type I error (adopting the new version when it is in fact performing no better). Commonly used levels are 5% or 1%, which roughly means that if you run 100 tests a year you should expect, on average, to make the wrong decision on 5 (or 1, depending on your alpha) of them and implement a change that does not actually improve things.
A Type II error is failing to reject the null hypothesis when we actually should. In our example, imagine version B performs better than version A, but our test doesn't have enough power to reject the null hypothesis: we keep version A and lose an opportunity to improve performance.
The formula to calculate a large enough sample relies on the concept of effect size: what is the minimum difference you want to be able to detect (and that makes business sense)? It could be, for instance, a difference of at least 0.05 percentage points, bringing an existing conversion rate of 0.1% up to 0.15%.
We need a sufficiently large sample size to ensure our hypothesis test has sufficient power. The formula to calculate the sample size uses the significance level (alpha), the desired power (1 − beta) and the effect size (ES). An online calculator is available here: https://www.evanmiller.org/ab-testing/sample-size.html
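A rough stdlib version of the standard two-proportion sample-size formula, n ≈ (z_alpha + z_beta)² · (p1(1−p1) + p2(1−p2)) / (p2 − p1)². This is a sketch of the textbook approximation; the online calculator above uses a close variant, so expect similar but not identical numbers. The z quantiles are hard-coded for alpha = 0.05 (two-sided) and 80% power.

```python
def sample_size_per_group(p1, p2, z_alpha=1.96, z_beta=0.8416):
    """Approximate sample size per variant for a two-sided test.

    Defaults: z_alpha = 1.96 (alpha = 0.05, two-sided) and
    z_beta = 0.8416 (80% power).
    """
    # Sum of the two binomial variances for the baseline and target rates.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return int(n) + 1  # round up: sample sizes are whole users

# The example above: detect a lift from 0.1% to 0.15%.
n = sample_size_per_group(0.001, 0.0015)
```

Note how a tiny effect size on a rare conversion event drives the required sample into the tens of thousands per variant.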
What is A/A testing and why should you care? A/A testing is running the same version of a website, banner ad or email copy twice within a test design, in order to assess the randomisation process and the tools you are using. The users in each group see the same thing, but you compare the results to check whether any difference in performance can be attributed to chance. If the differences are statistically significant, it could reflect issues with your test design or processes that you'll have to investigate. It is also worth noting that, on average, you should expect false positives in roughly 5% of such tests if your significance level (alpha) is 5%.
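That ~5% false-positive expectation is easy to check by simulation: run many A/A experiments where both groups truly have the same rate, and count how often a two-proportion z-test flags significance. A sketch with illustrative parameters (self-contained, stdlib only).

```python
import random
from math import erf, sqrt

def simulate_aa_tests(true_rate=0.05, n=2_000, n_tests=1_000,
                      alpha=0.05, seed=1):
    """Run many A/A experiments and count how often a two-proportion
    z-test (falsely) reports a significant difference."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        # Both groups are drawn from the SAME true conversion rate.
        conv_a = sum(rng.random() < true_rate for _ in range(n))
        conv_b = sum(rng.random() < true_rate for _ in range(n))
        p_pool = (conv_a + conv_b) / (2 * n)
        se = sqrt(p_pool * (1 - p_pool) * (2 / n))
        z = abs(conv_b / n - conv_a / n) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
        if p_value < alpha:
            false_positives += 1
    # Should land close to alpha (roughly 5% of tests).
    return false_positives / n_tests

rate = simulate_aa_tests()
```

A false-positive rate far above alpha in a real A/A test is the warning sign: it suggests broken randomisation or tooling rather than bad luck.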