This blog post is taken from notes of a write-up I did at a previous job a few years ago when I was conducting and designing A/B tests for marketing teams. It is a condensed summary of my understanding of power analysis by simulation inspired by Nick Huntington-Klein’s excellent lecture notes on the same topic. Writing things up is a useful way of structuring my thoughts and ensuring I understand what I’m reading. The original write-up I did also had a Bayesian implementation (or interpretation) of power analysis but I’ve omitted it since I’m not 100% convinced it is the correct approach for pre-experimental design or whether other simulation frameworks are more appropriate. So I’ve stuck with the standard frequentist approach. Anyway, hope this is useful!
Introduction
A common task when designing experiments is to determine whether an experiment is sufficiently “powered”. That is, conditional on our parameter constraints and available data, how likely is our experiment (and the models or tests we run on it) to reject the null hypothesis given that the null is false? This requires assumptions about effect sizes, the type of underlying data (continuous, proportions), equal or unequal sample sizes, equal or unequal variances, etc.
I’ve been asked many times throughout my career to size up an experiment using standard frequentist tools. However, every time I return to the exercise I get swamped by the variety of calculators and statistical tests that can be used - most are only valid under very specific conditions and for relatively simple experiments. Most commercial experiment designs also come with restrictions that affect these calculations but cannot be adjusted for directly in those tools.
So I turned to simulating my power analysis. Simulation is a great tool for flexibly estimating the impact of different parameters and assumptions. This is done by fitting many models (or running tests) on synthetic data generated from a known data generating process (DGP). The DGP can be adjusted based on our assumptions and constraints, which is especially useful when greater complexity is built into the experimental design (multiple treatments, additional covariates, generalising the impact of multiple experiments, etc.).
Setup
The setup below comes from an old experiment I worked on that measured the impact of a marketing campaign on product uptake for a treatment group compared to a representative control group. The specified parameters were:
Base control uptake: \(p_c = 0.35\)
Treatment uptake at the minimum detectable effect: \(p_t = 0.36\)
Confidence: \((1 - \alpha) = 0.95\)
Power: \((1 - \beta) = 0.8\)
Type: One sided
Sample size: n
So assuming a base product uptake of \(p_c = 35\%\), the minimum effect to be detected at 80% power is 1 percentage point (\(p_t - p_c = 0.01\)), with a 5% false positive rate under the null hypothesis.
There are two approaches to this analysis:
Calculate power for a known n (since this size can be restricted at the very start)
Calculate n required for a given power (more typical question that is asked)
For my use case, I was given a restricted n and had to think of different ways of determining power for asymmetric control and treatment group sizes, so I will demonstrate 1).
Typical workflow
Set parameters
Simulate synthetic data given these parameters
Fit model on data / run statistical test
Extract desired estimates (p values, confidence intervals, test statistics)
Repeat 1)-4) multiple times
Determine what proportion of tests were “statistically significant” (this is the power estimate)
Benchmarking
Before running a simulation, let’s benchmark our results against a typical power analysis. Based on our parameter settings we need ~28,312 samples in each of the treatment and control groups. The goal is to replicate this result by simulation.
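The R call that produced the output below isn’t shown in my original notes, but something along these lines with power.prop.test (my reconstruction, under the parameter settings above) matches it:

power.prop.test(
  p1 = 0.35,
  p2 = 0.36,
  sig.level = 0.05,
  power = 0.8,
  alternative = "one.sided"
)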
Two-sample comparison of proportions power calculation
n = 28311.97
p1 = 0.35
p2 = 0.36
sig.level = 0.05
power = 0.8
alternative = one.sided
NOTE: n is number in *each* group
import statsmodels.stats.api as sms

effect_size = sms.proportion_effectsize(0.36, 0.35)
sms.NormalIndPower().solve_power(effect_size, power=0.8, alpha=0.05, ratio=1, alternative="larger")
28311.706512937755
Synthetic data simulation based on parameter settings
First, set parameter values. These can be changed and re-run for other use cases.
# Parameters
n <- 28312 * 2           # Total sample size
pc <- 0.35               # Success probability in control group (inferred from domain knowledge)
pt <- 0.36               # Success probability in treatment group (minimum 'practical' effect size)
n_sim <- 1e3             # Number of simulations to run
treatment_prop <- 0.5    # Proportion in treatment
one_sided <- TRUE        # TRUE is one sided test
side <- "right"          # Defaults to right sided test if one sided test. Input 'left' for left sided test

# Sample size of each group
control_prop <- 1 - treatment_prop  # Proportion in control
nt <- n * treatment_prop            # Treatment group size
nc <- n * control_prop              # Control group size
# Parameters
n = 28312 * 2           # Total sample size
pc = 0.35               # Success probability in control group (inferred from domain knowledge)
pt = 0.36               # Success probability in treatment group (minimum 'practical' effect size)
n_sim = 1e3             # Number of simulations to run
treatment_prop = 0.5    # Proportion in treatment
one_sided = True        # True is one sided test
side = "right"          # Right sided test

# Sample size of each group
control_prop = 1 - treatment_prop   # Proportion in control
nt = int(n * treatment_prop)        # Treatment group size
nc = int(n * control_prop)          # Control group size
Next, simulate Bernoulli outcomes for the treatment and control groups based on the parameter settings above. \(y\) is the outcome variable (1 for product uptake, 0 for no uptake) and \(x\) is a binary indicator variable (0 for control, 1 for treatment).
set.seed(2025)

# Control group bernoulli outcomes with probability pc
yc <- rbinom(n = nc, size = 1, prob = pc)

# Treatment group bernoulli outcomes with probability pt
yt <- rbinom(n = nt, size = 1, prob = pt)

# Dummy variable 1 = treatment, 0 = control.
# Coefficient is the relative change in log odds of success if in treatment group
xc <- rep(0, nc)
xt <- rep(1, nt)

# Bring together in a data frame
df <- data.frame(y = c(yc, yt), x = c(xc, xt))
head(df)
y x
1 1 0
2 0 0
3 0 0
4 0 0
5 1 0
6 0 0
import numpy as np
import pandas as pd

np.random.seed(2025)

# Control group bernoulli outcomes with probability pc
yc = np.random.binomial(n=1, p=pc, size=nc)

# Treatment group bernoulli outcomes with probability pt
yt = np.random.binomial(n=1, p=pt, size=nt)

# Dummy variable 1 = treatment, 0 = control.
# Coefficient is the relative change in log odds of success if in treatment group
xc = np.repeat(0, nc)
xt = np.repeat(1, nt)

# Bring together in a data frame
df = pd.DataFrame({
    "y": np.concatenate([yc, yt]),
    "const": np.repeat(1, (nc + nt)),
    "x": np.concatenate([xc, xt])
})
Binomial GLM (Logistic regression)
Next, fit a logistic regression on the synthetic data and run a hypothesis test on the coefficient of the treatment/control dummy variable \(x\). This is the regression analogue of the two-sample test of proportions used in the benchmark above.
But why regression? Well, most common statistical tests and estimation procedures (t-tests, ANOVA, etc.) are special cases of the general linear model and can be estimated with regression. This gives us the most flexibility when considering more complicated experimental designs (e.g. adding other x variables for pre-treatment adjustment of certain demographic characteristics, stratification across multiple cohorts, etc.). It also gives a clear estimation strategy instead of fumbling through the documentation of different statistical tests.
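For example, pre-treatment adjustment just means extending the model formula. The covariates here (age_band, region) are purely hypothetical and are not columns in the simulated data frame above; this is only a sketch of the idea:

# Hypothetical pre-treatment adjustment: add covariates alongside the treatment dummy
model_adj <- glm(y ~ x + age_band + region, data = df, family = 'binomial')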
So the hypothesis to test is whether \(\beta_1\) is greater than 0, at the 5% significance level. If we were to run this experiment many times under the same parameter settings, we should expect to correctly reject the null in favour of the alternative hypothesis 80% of the time (although this tells us nothing about the uncertainty of the effect size estimate).
Fitting the model below gives a significant result. The next step is to run this exercise multiple times to get a power estimate.
# Estimate/fit logistic regression
model <- glm(y ~ x, data = df, family = 'binomial')

# Model results
summary(model)
Call:
glm(formula = y ~ x, family = "binomial", data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.61767 0.01246 -49.582 < 2e-16 ***
x 0.05068 0.01755 2.887 0.00389 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 73742 on 56623 degrees of freedom
Residual deviance: 73734 on 56622 degrees of freedom
AIC: 73738
Number of Fisher Scoring iterations: 4
# Extract p value and divide by 2 for one sided test (Wald test statistic is asymptotically z distributed)
coef(summary(model))[2, 4] / 2
[1] 0.001943298
import statsmodels.api as sm

# Estimate/fit logistic regression
model = sm.GLM(df[["y"]], df[["const", "x"]], family=sm.families.Binomial())
fit = model.fit()

# Model results
fit.summary()
Generalized Linear Model Regression Results
Dep. Variable:            y   No. Observations:        56624
Model:                  GLM   Df Residuals:            56622
Model Family:      Binomial   Df Model:                    1
Link Function:        Logit   Scale:                  1.0000
Method:                IRLS   Log-Likelihood:        -36849.
Date:      Tue, 13 May 2025   Deviance:               73697.
Time:              21:08:30   Pearson chi2:         5.66e+04
No. Iterations:           4   Pseudo R-squ. (CS):  0.0001342
Covariance Type:  nonrobust

          coef   std err         z     P>|z|    [0.025    0.975]
const  -0.6189     0.012   -49.672     0.000    -0.643    -0.594
x       0.0484     0.018     2.757     0.006     0.014     0.083
# Extract p value and divide by 2 for one sided test (Wald test statistic is asymptotically z distributed)
fit.pvalues["x"] / 2
0.0029207985359095984
Simulate multiple times and calculate power
The code below wraps the same steps as above into a function and runs it 1,000 times in parallel.
power_sim <- function(seed,
                      ncontrol = nc,
                      propcontrol = pc,
                      ntreatment = nt,
                      proptreatment = pt,
                      onesided = one_sided,
                      left_right_side = side) {
  # Binomial GLM (logistic regression)
  set.seed(seed)

  # Generate synthetic data
  # Data generating process governed by parameters pc and pt
  # Control group bernoulli outcomes with probability pc
  yc <- rbinom(n = ncontrol, size = 1, prob = propcontrol)
  # Treatment group bernoulli outcomes with probability pt
  yt <- rbinom(n = ntreatment, size = 1, prob = proptreatment)

  # Dummy variable treatment = 1, control = 0
  # Coefficient is the relative change in log odds of success if in treatment group
  xc <- rep(0, ncontrol)
  xt <- rep(1, ntreatment)

  # Bring together in a dataframe
  df <- data.frame(y = c(yc, yt), x = c(xc, xt))

  # Fit model
  model <- glm(y ~ x, data = df, family = 'binomial')

  results <- data.frame(
    test_result = NA,
    pvalues = NA
  )

  if (onesided == FALSE) {
    # Extract p value, returns TRUE if less than 0.05, FALSE otherwise
    results["pvalues"] <- coef(summary(model))[2, 4]
    results["test_result"] <- results[1, "pvalues"] < 0.05
  } else if (onesided == TRUE) {
    # One sided test, halve the p-value
    results["pvalues"] <- coef(summary(model))[2, 4] / 2
    if (left_right_side == 'right') {
      # Ensure coefficient is greater than zero for right sided test
      results["test_result"] <- results[1, "pvalues"] < 0.05 & coef(summary(model))[2, 1] > 0
    } else if (left_right_side == 'left') {
      # Coefficient less than zero for left sided test
      results["test_result"] <- results[1, "pvalues"] < 0.05 & coef(summary(model))[2, 1] < 0
    } else {
      stop("Error: side must be 'right' or 'left'")
    }
  }

  return(results)
}

# Random seed grid
set.seed(456)
random_grid <- sample(1e6, n_sim)

# Set up parallelism
available_cores <- parallel::detectCores() - 1
future::plan(future::multisession, workers = available_cores)

sim_list <- future.apply::future_Map(function(x) power_sim(seed = x), x = random_grid, future.seed = TRUE)
sim_df <- do.call(rbind, sim_list)
Finally, calculating the power:
mean(sim_df[["test_result"]])
[1] 0.791
import multiprocess as mp

def power_sim(seed,
              ncontrol=nc,
              propcontrol=pc,
              ntreatment=nt,
              proptreatment=pt,
              onesided=one_sided,
              left_right_side=side):
    np.random.seed(seed)

    # Generate synthetic data
    # Data generating process governed by parameters pc and pt
    # Control group bernoulli outcomes with probability pc
    yc = np.random.binomial(n=1, p=propcontrol, size=ncontrol)
    # Treatment group bernoulli outcomes with probability pt
    yt = np.random.binomial(n=1, p=proptreatment, size=ntreatment)

    # Dummy variable treatment = 1, control = 0
    # Coefficient is the relative change in log odds of success if in treatment group
    xc = np.repeat(0, ncontrol)
    xt = np.repeat(1, ntreatment)

    # Bring together in a dataframe
    df = pd.DataFrame({
        "y": np.concatenate([yc, yt]),
        "const": np.repeat(1, (ncontrol + ntreatment)),
        "x": np.concatenate([xc, xt])
    })

    # Fit model
    model = sm.GLM(df[["y"]], df[["const", "x"]], family=sm.families.Binomial())
    fit = model.fit()

    if not onesided:
        # Extract p value, returns True if less than 0.05, False otherwise
        pvalue = fit.pvalues["x"]
        test_result = pvalue < 0.05
    else:
        # One sided test, halve the p-value
        pvalue = fit.pvalues["x"] / 2
        if left_right_side == 'right':
            # Ensure coefficient is greater than zero for right sided test
            test_result = pvalue < 0.05 and fit.params["x"] > 0
        elif left_right_side == 'left':
            # Coefficient less than zero for left sided test
            test_result = pvalue < 0.05 and fit.params["x"] < 0
        else:
            raise Exception("Error: side must be 'right' or 'left'")

    return tuple([pvalue, test_result])

# Random seed grid
np.random.seed(2025)
random_grid = list(np.random.randint(0, int(1e6), int(n_sim)))

# Set up parallelism
nprocs = mp.cpu_count() - 1
pool = mp.Pool(processes=nprocs)
result = pool.map_async(power_sim, random_grid)
result_ls = result.get()

result_df = pd.DataFrame(result_ls)
result_df.columns = ["pvalues", "test_result"]
Finally, calculating the power:
result_df["test_result"].mean()
0.81
Great! The null was rejected in around 80% of simulations, which is in line with the benchmark power analysis. Increasing the number of simulations should bring this estimate closer to 80%.
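As a rough gauge of that simulation noise (my addition, not part of the original benchmark), the power estimate is itself a proportion over n_sim simulations, so its Monte Carlo standard error can be approximated directly:

# Approximate Monte Carlo standard error of the simulated power estimate
p_hat <- mean(sim_df[["test_result"]])
se_hat <- sqrt(p_hat * (1 - p_hat) / n_sim)
se_hat  # roughly 0.013 with 1,000 simulations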
The original use case for this experiment involved unequal treatment and control group sizes, whereas the benchmark power calculation assumes equal, independent samples in each group. Instead of searching for an appropriate off-the-shelf test, I can simply reset the parameters above to a 70% treatment and 30% control split and re-run the simulation, as sketched below. This flexibility is the main advantage of simulation for pre-analysis design: I can generate data with whatever properties the use case at hand requires.
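A minimal sketch of that re-run, reusing the power_sim function and random_grid defined above with the same total sample size:

# Re-run the simulation with a 70/30 treatment/control split (same total n)
treatment_prop <- 0.7
nt <- round(n * treatment_prop)  # Treatment group size
nc <- n - nt                     # Control group size

sim_list_70_30 <- future.apply::future_Map(
  function(x) power_sim(seed = x, ncontrol = nc, ntreatment = nt),
  x = random_grid,
  future.seed = TRUE
)
sim_df_70_30 <- do.call(rbind, sim_list_70_30)
mean(sim_df_70_30[["test_result"]])  # Power estimate under the unequal split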