Your Sample Size Is Never Big Enough
I want to talk about hypothesis testing. Over the past decade, nearly every mature organization has become rightfully obsessed, sometimes religiously, with hypothesis-driven development.
Picture yourself in 1983. Corporate hypothesis testing was done only by hot-shot statisticians with long hair and beards using SPSS. They sat there, pondering in front of their IBM 3270 terminals, smoking doobies with such low THC concentration that they would practically be microdosing by today's standards. We've come a long way from my revisionist image of Bay Area statistics turbonerds working in the software industry. Today, testing has been democratized; the rise of both open-source tooling and enterprise-grade experimentation platforms has made it easier than ever for product professionals and analysts alike to go from zero to researcher in a matter of months. Testing is embedded in day-to-day workflows, product decisions, marketing campaigns, and UI tweaks. The mere ability to run A/B tests is responsible for the employment of tens of thousands of data professionals whose job is to tell executives whether “B is better than A” (or, more often, “it depends”).
In fact, it was my own experience at a previous employer, rapidly scaling up A/B testing infrastructure, that pushed me from data engineering into data science. When your company suddenly decides to test everything, someone has to build the pipelines, make sure the sample sizes aren’t garbage, and explain to leadership why “p = 0.049” does not mean we should bet the farm on Variant B.
But here’s the catch: Scaling up A/B testing isn’t just about running more tests; it's about running them well. Once you start slicing results by segment (DMA, device, user cohort, time of day, etc.), there is a lot to account for to get an accurate picture. The more granular you get, the easier it is to trick yourself into believing noise is a signal. In other words, you might accidentally build a false positive factory.
So before you declare victory because Cleveland's p-value is 0.051, let’s take a step back and talk about where hypothesis testing can start to fall apart at scale and how hierarchical models can save you from statistical regret.
Why Real-Life A/B Tests Aren’t Perfect RCTs
To illustrate what I'm talking about, we'll examine the ever-dependable fictional company, Acme Widgets, Inc. (AWI). In an effort to sell more Premier Widgets, AWI will put some users into the beloved Variant B checkout experience when a Premier Widget is added to the cart, to see if cart-to-checkout rates can be improved. The data team, without long hair or IBM PCs, has decided to sample users from 10 Designated Market Areas (DMAs) and split them evenly 50/50 between a Control (A) experience and a Variant B experience. Using a significance level (alpha) of 0.05, power (1 - beta) of 0.80, a current average checkout rate of 12%, and a minimum detectable effect of +1.5 percentage points in checkout rate, the team determines they will need ~8,000 users in each group to detect that effect (LaTeX not included).
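If you want to sanity-check that number yourself, here's a minimal sketch of the same calculation using statsmodels, assuming a two-sided, two-proportion z-test with the inputs above:

```python
# Sketch: sample size for a two-sided, two-proportion z-test,
# using the inputs from the AWI scenario above.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.12   # current cart-to-checkout rate
mde = 0.015            # minimum detectable effect: +1.5 percentage points

# Cohen's h, a standardized effect size for comparing two proportions
effect_size = proportion_effectsize(baseline_rate + mde, baseline_rate)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    alternative="two-sided",
)
print(f"Users needed per group: {n_per_group:,.0f}")  # lands just shy of 8,000 per group
```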
The team decides that number isn’t so bad! AWI gets plenty of traffic, and ~16,000 users adding the Premier Widget to the cart is totally realistic in a 14-day testing period.
“But wait!” you may interject. “They are rolling this test out in 10 DMAs. Do they all have similar traffic patterns?”
That’s a great question, and no, they do not. After doing some additional pre-test analysis, the team finds major differences in the populations of the different DMAs across their control and variant groups:
Users per test group
Equipped with this knowledge, our data team has two options. They could pretend the problem doesn’t exist, run the test at an aggregate level, and hope the differences between DMAs don’t introduce any hidden bias.
Or, they could acknowledge the reality that sample sizes aren’t evenly distributed across DMAs, meaning some regions will have plenty of statistical power, while others will be embarrassingly underpowered.
When we consider all the complexities of real human behavior, and all the ways we can slice the data, it becomes clear that for many subgroups, your sample size is never big enough.
Of course, they choose the good path and decide to account for these differences in their test design.
Why Can't We Run A Bunch of Tests?
The first thing they realize is that not all DMAs are created equal.
Their initial power calculation made sense at the total population level (16,000 users split between A and B). But once they break things down by DMA, the sample sizes for some regions drop below what’s needed for a reliable test. New York, Los Angeles, and Chicago might still be well-powered, but Miami and Seattle? Not even close.
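To put a number on “not even close,” the team can flip the earlier power calculation around and ask how much power a small DMA actually has. Here's a quick sketch, assuming a hypothetical DMA that contributes only about 600 users per arm:

```python
# Sketch: realized power for an underpowered DMA
# (600 users per arm is a hypothetical figure, for illustration only)
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect_size = proportion_effectsize(0.135, 0.12)  # the same +1.5 pp lift as before

power = NormalIndPower().power(
    effect_size=effect_size,
    nobs1=600,                 # users in Control for this DMA
    alpha=0.05,
    ratio=1.0,                 # even split, so ~600 users in Variant B as well
    alternative="two-sided",
)
print(f"Power for this DMA: {power:.0%}")  # roughly 12%, nowhere near the 80% target
```

At roughly 12% power, a real +1.5 point lift in that DMA would go undetected almost nine times out of ten.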
If they ignore this issue and run independent hypothesis tests for each DMA, there are a few things to consider:
Statistical power: Small DMAs will have wide confidence intervals and will often fail to reach significance, even when an effect truly exists.
Multiple Comparisons Problem: Running 10 independent hypothesis tests dramatically raises the odds of seeing at least one false positive (even if there’s no real effect).
Why does this happen? It’s due to a concept called the Family-Wise Error Rate (FWER).
I’ll spare you the math, but here’s the key idea:
When you run one hypothesis test at α = 0.05, you have a 5% chance of getting a false positive (Type I error).
When you run multiple independent tests, this probability compounds across tests, increasing the chance that at least one of them is a false positive.
With 10 DMAs, the probability of at least one false positive rises to ~40%, even if the null hypothesis is true for all DMAs.
The probability of at least one Type I error reaches roughly 40% with 10 independent DMA-level tests, meaning there is a very real chance that at least one DMA shows a “significant” lift purely by chance.
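If you do want a peek at the math, the whole idea fits in a couple of lines (assuming the 10 DMA-level tests are independent):

```python
# Family-wise error rate: the chance of at least one false positive
# across m independent tests, each run at significance level alpha
alpha, m = 0.05, 10
fwer = 1 - (1 - alpha) ** m
print(f"Chance of at least one false positive across {m} DMAs: {fwer:.1%}")  # ~40.1%
```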
So what can be done about this? This is a critical experiment for AWI to run, but the team doesn't want to give the business misleading results. Fortunately, much smarter people than me have spent the last 250 years trying to solve hard problems just like this, so there are plenty of tools in our garage to work with.
From a frequentist perspective, there are several post hoc methodologies that can give a more rigorous understanding of the data, including:
Tukey's Honestly Significant Difference (HSD) Test
Bonferroni Correction
Welch's ANOVA (now with sugar-free options!)
Games-Howell Test
But each of these comes with a trade-off that can impact how we interpret the results. Some are too conservative, some don’t account for variance differences, and some make it nearly impossible to detect real effects in smaller DMAs. I’ll detail these drawbacks in a follow-up article, but for now, just keep in mind that the AWI team wants to incorporate subgroup information into their model without making overcorrections that downplay the true impact of their highest-value DMAs.
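To make those trade-offs concrete, here's a minimal sketch of what one of these corrections looks like in practice, using statsmodels' multipletests on a set of per-DMA p-values I've made up purely for illustration:

```python
# Sketch: Bonferroni correction across 10 per-DMA tests.
# These p-values are invented for illustration; they are not AWI's results.
import numpy as np
from statsmodels.stats.multitest import multipletests

dma_pvalues = np.array([0.004, 0.03, 0.04, 0.06, 0.11,
                        0.18, 0.25, 0.41, 0.62, 0.83])

reject, p_adjusted, _, _ = multipletests(dma_pvalues, alpha=0.05, method="bonferroni")

print(p_adjusted.round(2))  # each raw p-value is multiplied by 10 (capped at 1.0)
print(reject)               # only the single strongest DMA survives the correction
```

Three of the raw p-values sit below 0.05, but only one survives the correction; that conservatism is exactly what can bury a real effect in a smaller DMA.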
Bayesian Hierarchical Models
AWI’s data team is looking for a solution that accounts for subgroup differences without overcorrecting and accidentally downplaying real effects. They need a model that borrows strength where needed, avoids false positives, and doesn’t collapse under sample size imbalance.
The solution: Bayesian Hierarchical Models. But why?
Unlike some approaches that would treat each DMA as an independent test, Bayesian hierarchical models pool information across DMAs, allowing them to:
Stabilize estimates for small DMAs (without discarding them entirely).
Prevent noise from driving false conclusions (no more Miami looking like an anomaly just because it had a lucky week).
Avoid arbitrary multiple-comparison corrections that make real effects hard to detect.
Instead of treating every DMA as an isolated test, Bayesian hierarchical models assume they share an underlying distribution while still allowing for local differences. Big DMAs (New York, LA) get to stand mostly on their own, because they have enough data. Small DMAs (Seattle, Miami) borrow strength from the overall trend instead of giving us wild, misleading swings in estimated effect.
In other words, Bayesian hierarchical modeling automatically adjusts for sample size differences in a way that a stack of independent frequentist tests can’t.
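To make that less abstract, here's a minimal sketch of what such a model could look like in PyMC, with simulated data and illustrative priors (this is not AWI's production model):

```python
# Sketch: a partially pooled A/B model across DMAs
# (simulated data and illustrative priors, not a production model)
import numpy as np
import pymc as pm

n_dmas = 10
rng = np.random.default_rng(42)
trials_a = rng.integers(200, 3000, size=n_dmas)  # Control users per DMA
trials_b = rng.integers(200, 3000, size=n_dmas)  # Variant B users per DMA
conv_a = rng.binomial(trials_a, 0.12)            # simulated checkouts, Control
conv_b = rng.binomial(trials_b, 0.135)           # simulated checkouts, Variant B

with pm.Model() as hierarchical_ab:
    # Population-level distribution of the per-DMA lift (log-odds scale)
    mu_lift = pm.Normal("mu_lift", mu=0.0, sigma=0.5)
    sigma_lift = pm.HalfNormal("sigma_lift", sigma=0.5)

    # DMA-level parameters: a baseline checkout rate for each DMA, and a
    # per-DMA lift that is partially pooled toward the shared mean mu_lift
    baseline = pm.Normal("baseline", mu=-2.0, sigma=1.0, shape=n_dmas)
    lift = pm.Normal("lift", mu=mu_lift, sigma=sigma_lift, shape=n_dmas)

    # Likelihoods: Control uses the baseline, Variant B adds the DMA-specific lift
    pm.Binomial("obs_a", n=trials_a, p=pm.math.invlogit(baseline), observed=conv_a)
    pm.Binomial("obs_b", n=trials_b, p=pm.math.invlogit(baseline + lift), observed=conv_b)

    trace = pm.sample(1000, tune=1000, target_accept=0.9)
```

The per-DMA lift posteriors are what drive the kind of chart shown below: DMAs with plenty of data barely move, while small DMAs get pulled toward the shared mean instead of swinging on a handful of checkouts.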
But what does this look like? The chart below shows the estimated lift that Variant B has over Control in each DMA after the test concludes:
Frequentist estimates show large swings in smaller DMAs due to limited sample sizes, while Bayesian hierarchical models apply partial pooling, reducing noise and stabilizing estimates while still preserving real effects.
Notice how in the traditional model (red bars), smaller DMAs like Seattle and Philadelphia have wildly different estimated effects, when in reality, it’s just noise. Additionally, in the frequentist model, a DMA like Miami shows a strong effect. But in isolation, it’s difficult to know whether this is a true effect or just a false positive driven by low sample size and multiple comparisons.
The Bayesian model (blue bars), on the other hand, smooths out uncertainty, ensuring that low-sample-size DMAs don’t dominate the narrative with extreme values.
I'll go over exactly how Bayesian hierarchical modeling works in a later article, so stay tuned!
Wrapping It All Up
So, what has AWI learned?
Traditional A/B testing methods require aggressive corrections for multiple comparisons, which often make it harder to detect real effects.
Slicing results into many subgroups introduces massive variance problems, potentially leading to misleading results.
Hierarchical Models naturally balance subgroup effects, stabilizing noisy estimates while letting strong signals shine.
By using a multi-level approach, AWI can get a more accurate picture of what’s really happening across DMAs and provide stakeholders with the right recommendations.
So next time you’re running an A/B test across multiple segments, ask yourself: Do you want to fight with p-values all day, or do you want a model that actually respects your data?
Thanks for reading!