📰 Summary (use your own words)

✍️ Notes

First ghost - its either significant or noise

p value tries to answer the question - "what is the likelihood that the observed differences due to the null hypothesis?"
- There is no such thing as "the test is nearly significant" or "work towards significance"
- There is only significant or uncertain

Second ghost - the fallacy of session based metrics

If there are users creating multiple sessions, then it is difficult to assert the required assumption of all sessions are independent
This happens when you are measuring a rate metric which is defined by something other than what you randomized on, for example:
- click through rate (total clicks / total impression) will be skewed if you are randomizing on users

Third ghost - multiple comparisons

The more comparisons you make the hgiher your chances of seeing a false positive
The standard practice of 95% confidence level means that we expect a false positive rate of 5% on a single metric
- So, as soon as you look at more than one metric you increase the chance
- So for a AA test, you would expect an increase in at least one false positive in any metric with an increase in the metrics you are tracking
This also applies when testing multiple different treatments and also applies when you break down the results into segments
To correct for this
- You can adjust the significance threshold with Bonferroni correction

Fourth ghost - Peeking

For any test, there is a chance of false positives (seeing a significant result randomly when there isn't). So if we peeked and took action to stop the test we would inflate our results artificially
The AB testing statistics are only valid when you make one comparison only - they are based on the assumption that you will only make an inference using a snapshot of data at one particular, predetermined, point in time - peeking invalidates that
To avoid this
- peeking safely by not actioning
- Use sequential hypothesis testing which doesn't require a fixed sample size or predetermined run time
- Bayesian methods to not come to the wrong conclusion - refer to [[Is Bayesian AB testing immune to peeking]]