Show code
library(tibble)
library(ggplot2)
library(dplyr)
library(tidyr)
library(latex2exp)
library(scales)
library(knitr)
Over the past few years working in marketing measurement, I’ve noticed that power analysis is one of the most poorly understood testing and measurement topics. Sometimes it’s misunderstood and sometimes it’s not applied whatsoever despite its foundational role in test design. This article and the series that follow are my attempts to alleviate this.
In this segment, I will cover:
- What is statistical power?
- How do we compute it?
- What can influence power?
Power analysis is a statistical topic and as a consequence, there will be math and statistics (crazy right?) but I will try to tie those technical details back to real world problems or basic intuition whenever possible.
Without further ado, let’s get to it.
Error types in testing: Type I vs. Type II
In testing, there are two types of error:

- Type I:
  - Technical Definition: We erroneously reject the null hypothesis when the null hypothesis is true
  - Layman’s Definition: We say there was an effect when there really wasn’t
  - Example: A/B testing a new creative and concluding that it performs better than the old design when in reality, both designs perform the same
- Type II:
  - Technical Definition: We fail to reject the null hypothesis when the null hypothesis is false
  - Layman’s Definition: We say there was no effect when there really was
  - Example: A/B testing a new creative and concluding that it performs the same as the old design when in reality, the new design performs better
What is statistical power?
Most people are familiar with Type I error. It’s the error that we control by setting a significance level. Power relates to Type II error. More specifically, power is the probability of correctly rejecting the null hypothesis when it is false. It is the complement of Type II error (i.e., 1 – Type II error). In other words, power is the probability of detecting a true effect if one exists. It should be clear why this is important:
- Underpowered tests are likely to miss true effects, leading to missed opportunities for improvement
- Underpowered tests can lead to false confidence in the results, as we may conclude that there is no effect when there actually is one
- … and most simply, underpowered tests waste money and resources
The role of α and β
If both are important, why are Type II error and power so misunderstood and ignored while Type I is always considered? It’s because we can easily pick our Type I error rate. In fact, that’s exactly what we are doing when we set the significance level α (typically α = 0.05) for our tests. We are stating that we are comfortable with a certain percentage of Type I error. During test setup, we make a statement, “we are comfortable with an X % false positive rate,” and then set α = X %. After the test, if our p-value falls below α, we reject the null hypothesis (i.e., “the results are significant”), and if the p-value falls above α, we fail to reject the null hypothesis (i.e., “the results are not significant”).
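As a minimal illustration of that decision rule (the p-value below is a hypothetical placeholder, not a computed result):

```r
# Hypothetical illustration of the significance-level decision rule
alpha <- 0.05     # acceptable false positive rate
p_value <- 0.03   # placeholder p-value from some test
if (p_value < alpha) {
  "Significant: reject the null hypothesis"
} else {
  "Not significant: fail to reject the null hypothesis"
}
```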
Determining Type II error, β (typically β = 0.20), and thus power, is not as simple. It requires us to make assumptions and perform analysis, called “power analysis.” To understand the process, it’s best to first walk through the process of testing and then backtrack to figure out how power can be computed and influenced. Let’s use a simple A/B creative test as an example.
| Concept | Symbol | Typical Value(s) | Technical Definition | Plain-Language Definition |
|---|---|---|---|---|
| Type I Error | α | 0.05 (5%) | Probability of rejecting the null hypothesis when the null is actually true | Saying there is an effect when in reality there is no difference |
| Type II Error | β | 0.20 (20%) | Probability of failing to reject the null hypothesis when the null is actually false | Saying there is no effect when in reality there is one |
| Power | 1 − β | 0.80 (80%) | Probability of correctly rejecting the null hypothesis when the alternative is true | The chance we detect a true effect if there is one |
Quick Reference: Error Types and Power
Computing power: step by step
A couple notes before we get started:

- I made a few assumptions and approximations to simplify the example. If you can spot them, great. If not, don’t worry about it. The goal is to understand the concepts and process, not the nitty-gritty details.
- I refer to the decision threshold in the z-score space as the critical value. Critical value typically refers to the threshold in the original space (e.g., conversion rates), but I will use it interchangeably so I don’t need to introduce a new term.
- There are code snippets throughout tied to the text and concepts. If you copy the code yourself, you can play around with the parameters to see how things change. Some of the code snippets are hidden to keep the article readable. Click “Show the code” to see the code.
- Try this: Edit the sample size in the test setup so that the test statistic is just below the critical value and then run the power analysis. Are the results what you expected?
Test setup and the test statistic
As stated above, it’s best to walk through the testing process first and then backtrack to identify how power can be computed. Let’s do just that.
# Set parameters for the A/B test
N_a <- 1000 # Sample size for creative A
N_b <- 1000 # Sample size for creative B
alpha <- 0.05 # Significance level
# Function to compute the critical z-value (one-sided by default, two-sided if requested)
critical_z <- function(alpha, two_sided = FALSE) {
if (two_sided) qnorm(1 - alpha/2) else qnorm(1 - alpha)
}
Our test setup:
- Null hypothesis: The conversion rate of A equals the conversion rate of B.
- Alternative hypothesis: The conversion rate of B is greater than the conversion rate of A.
- Sample size:
- Na = 1,000 — Number of people who receive creative A
- Nb = 1,000 — Number of people who receive creative B
- Significance level: α = 0.05
- Critical value: The critical value is the z-score that corresponds to the significance level α. We call this Z1−α. For a one-tailed test with α = 0.05, this is approximately 1.645 (a quick check follows this list).
- Test type: Two-proportion z-test
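As a quick check of the critical value noted above, calling the critical_z() helper defined earlier reproduces it:

```r
# One-sided critical z-value at alpha = 0.05; should print approximately 1.645
critical_z(alpha)
```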
x_a <- 100 # Number of conversions for creative A
x_b <- 150 # Number of conversions for creative B
p_a <- x_a / N_a # Conversion rate for creative A
p_b <- x_b / N_b # Conversion rate for creative B
Our results:
- xa = 100 — Number of conversions from creative A
- xb = 150 — Number of conversions from creative B
- pa = xa / Na = 0.10 — Conversion rate of creative A
- pb = xb / Nb = 0.15 — Conversion rate of creative B
Under the null hypothesis, the difference in conversion rates follows an approximately normal distribution with:
- Mean: μ = 0 (no difference in conversion rates)
- Standard deviation: σ = √[ pa(1 − pa)/Na + pb(1 − pb)/Nb ] ≈ 0.015
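Plugging the measured values into that expression gives the standard deviation directly; here is a quick check using the quantities already defined:

```r
# Standard deviation of the difference in conversion rates, using the measured values
sigma_hat <- sqrt(p_a * (1 - p_a) / N_a + p_b * (1 - p_b) / N_b)
sigma_hat  # approximately 0.015
```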
z_score <- function(p_a, p_b, N_a, N_b) {
(p_b - p_a) / sqrt((p_a * (1 - p_a) / N_a) + (p_b * (1 - p_b) / N_b))
}
From these values, we can compute the test statistic:
$$ z = \frac{p_b - p_a}{\sqrt{\frac{p_a (1 - p_a)}{N_a} + \frac{p_b (1 - p_b)}{N_b}}} \approx 3.39 $$
If our test statistic, z, is greater than the critical value, we reject the null hypothesis and conclude that Creative B performs better than Creative A. If z is less than or equal to the critical value, we fail to reject the null hypothesis and conclude that there is no significant difference between the two creatives.
In other words, if our results are unlikely to be observed when the conversion rates of A and B are truly the same, we reject the null hypothesis and state that Creative B performs better than Creative A. Otherwise, we fail to reject the null hypothesis and state that there is no significant difference between the two creatives.
Given our test results, we reject the null hypothesis and conclude that Creative B performs better than Creative A.
z <- z_score(p_a, p_b, N_a, N_b)
critical_value <- critical_z(alpha)
if (z > critical_value) {
result <- "Reject null hypothesis: Creative B performs better than Creative A"
} else {
result <- "Fail to reject null hypothesis: No significant difference between creatives"
}
result
#> [1] "Reject null hypothesis: Creative B performs better than Creative A"
The intuition behind power
Now that we have walked through the testing process, where does power come into play? In the process above, we record sample conversion rates, pa and pb, and then compute the test statistic, z. However, if we repeated the test many times, we would get different sample conversion rates and different test statistics, all centering around the true conversion rates of the creatives.
Assume the true conversion rate of Creative B is higher than that of Creative A. Some of these tests will still fail to reject the null hypothesis due to natural variance. Power is the percentage of these tests that reject the null hypothesis. This is the underlying mechanism behind all power analysis and hints at the missing ingredient: the true conversion rates—or more generally, the true effect size.
Intuitively, if the true effect size is higher, our measured effect would typically be higher and we would reject the null hypothesis more often, increasing power.
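To make this concrete, here is a minimal simulation sketch (the true rates of 10% and 15% are assumptions that mirror the measured values above): it repeats the A/B test many times and reports the fraction of runs that reject the null, which is an empirical estimate of power.

```r
# Empirical power via simulation: repeat the test many times under assumed true rates
set.seed(42)
n_sims <- 5000
true_r_a <- 0.10  # assumed true conversion rate for creative A
true_r_b <- 0.15  # assumed true conversion rate for creative B

rejections <- replicate(n_sims, {
  x_a_sim <- rbinom(1, N_a, true_r_a)        # simulated conversions for A
  x_b_sim <- rbinom(1, N_b, true_r_b)        # simulated conversions for B
  z_sim <- z_score(x_a_sim / N_a, x_b_sim / N_b, N_a, N_b)
  z_sim > critical_z(alpha)                  # does this run reject the null?
})

mean(rejections)  # fraction of runs that reject; should be close to the power computed below
```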
Choosing the true effect size
If we need true conversion rates to compute power, how do we get them? If we had them, we wouldn’t need to perform testing. Therefore, we need to make an assumption. Broadly, there are two approaches:
- Choose the meaningful effect size: In this approach, we assign the true effect size (or true difference in conversion rates) to a level that would be meaningful. If Creative B only increased conversion rates by 0.01%, would we actually care and take action on those results? Probably not. So why would we care about being able to detect that small of an effect? On the other hand, if Creative B increased conversion rates by 50%, we certainly would care. In practice, the meaningful effect size likely falls between these two points.
  - Note: This is often referred to as the minimal detectable effect. However, the minimal detectable effect of the study and the minimal detectable effect that we care about (for example, we may only care about 5% or greater effects, but the study is designed to detect 1% or greater effects) may differ. For that reason, I prefer to use the term meaningful effect when referring to this strategy.
- Use prior studies: If we have data from prior studies or models that measure the efficiency of this creative or similar creatives, we can use those values to assign the true effect size.
Both of the above approaches are valid.
If you only care to see meaningful effects and don’t mind if you miss out on detecting smaller effects, go with the first option. If you must see “statistical significance”, go with the second option and be conservative with the values you use (more on that in another article).
Technical Note
Because we don’t have true conversion rates, we are technically assigning a specific expected distribution to the alternative hypothesis and then computing power based on that. The true mean in the following passages is technically the expected mean under the alternative hypothesis. I will use the term true to keep the language simple and concise.
Computing and visualizing power
Now that we have the missing ingredients, true conversion rates, we can compute power. Instead of the measured pa and pb, we now have true conversion rates ra and rb.
We measure power as:
$$ 1 - \beta = 1 - P\left(z < Z_{1-\alpha} \mid N_a, N_b, r_a, r_b\right) $$
This may be confusing at first glance, so let’s break it down. We are stating that power (1 − β) is computed by subtracting the Type II error rate from one. The Type II error rate is the likelihood that a test results in a z-score below our significance threshold, given our sample size and true conversion rates ra and rb. How do we compute that last part?
In a two-proportion z-score test, we know that:
- Mean: μ = rb − ra
- Standard deviation: σ = √[ ra(1 − ra)/Na + rb(1 − rb)/Nb ]
Now we need to compute:
$$ P(X > Z_{1-\alpha}), \quad X \sim N\!\left(\frac{\mu}{\sigma},\, 1\right) $$
This is the area under the above distribution that lies to the right of Z1−α and is equivalent to computing:
$$ P\!\left(X < \frac{\mu}{\sigma} - Z_{1-\alpha}\right), \quad X \sim N(0,1) $$
If we had a textbook with a z-score table, we could simply look up the cumulative probability (lower-tail area) of (μ / σ − Z1−α), and that would give us the power.
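As a quick sanity check of that closed form (reusing the measured rates as the assumed true rates, as the next code chunk also does):

```r
# Closed-form one-sided power, using pnorm() in place of a z-table lookup
mu_alt <- p_b - p_a                                            # assumed true difference
sd_alt <- sqrt(p_a * (1 - p_a) / N_a + p_b * (1 - p_b) / N_b)  # sd of the difference
pnorm(mu_alt / sd_alt - qnorm(1 - alpha))                      # should print approximately 0.96
```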
Let’s show this visually:
Show the code
r_a <- p_a # true baseline conversion rate; we are reusing the measured value
r_b <- p_b # true treatment conversion rate; we are reusing the measured value
alpha <- 0.05
two_sided <- FALSE # set TRUE for two-sided test
mu_diff <- function(r_a, r_b) r_b - r_a
sigma_diff <- function(r_a, r_b, N_a, N_b) {
sqrt(r_a*(1 - r_a)/N_a + r_b*(1 - r_b)/N_b)
}
power_value <- function(r_a, r_b, N_a, N_b, alpha, two_sided = FALSE) {
mu <- mu_diff(r_a, r_b)
sd1 <- sigma_diff(r_a, r_b, N_a, N_b)
zc <- critical_z(alpha, two_sided)
thr <- zc * sigma_diff(r_a, r_b, N_a, N_b)
if (!two_sided) {
1 - pnorm(thr, mean = mu, sd = sd1)
} else {
pnorm(-thr, mean = mu, sd = sd1) + (1 - pnorm(thr, mean = mu, sd = sd1))
}
}
# Build plot data
mu <- mu_diff(r_a, r_b)
sd1 <- sigma_diff(r_a, r_b, N_a, N_b)
zc <- critical_z(alpha, two_sided)
thr <- zc * sigma_diff(r_a, r_b, N_a, N_b)
# x-range covering both curves and thresholds
x_min <- min(-4*sd1, mu - 4*sd1, -thr) - 0.1*sd1
x_max <- max( 4*sd1, mu + 4*sd1, thr) + 0.1*sd1
xx <- seq(x_min, x_max, length.out = 2000)
df <- tibble(
x = xx,
H0 = dnorm(xx, mean = 0, sd = sd1), # null (H0) distribution of the difference
H1 = dnorm(xx, mean = mu, sd = sd1) # true (alternative) distribution
)
# Regions to shade for power
if (!two_sided) {
shade <- df %>% filter(x >= thr)
} else {
shade <- bind_rows(
df %>% filter(x >= thr),
df %>% filter(x <= -thr)
)
}
# Numeric power for subtitle
pow <- power_value(r_a, r_b, N_a, N_b, alpha, two_sided)
# Plot
ggplot(df, aes(x = x)) +
# H1 shaded power region
geom_area(
data = shade, aes(y = H1), alpha = 0.25
) +
# Curves
geom_line(aes(y = H0), linewidth = 1) +
geom_line(aes(y = H1), linewidth = 1, linetype = "dashed") +
# Critical line(s)
geom_vline(xintercept = thr, linetype = "dotted", linewidth = 0.8) +
{ if (two_sided) geom_vline(xintercept = -thr, linetype = "dotted", linewidth = 0.8) } +
# Mean markers
geom_vline(xintercept = 0, alpha = 0.3) +
geom_vline(xintercept = mu, alpha = 0.3, linetype = "dashed") +
# Labels
labs(
title = "Power as shaded area under H1 beyond critical threshold",
subtitle = TeX(sprintf(r"($1 - \beta$ = %.1f%% | $\mu$ = %.4f, $\sigma$ = %.4f, $z^*$ = %.3f, threshold = %.4f)",
100*pow, mu, sd1, zc, thr)),
x = TeX(r"(Difference in conversion rates ($D = p_b - p_a$))"),
y = "Density"
) +
annotate("text", x = mu, y = max(df$H1)*0.95, label = TeX(r"(H1: $N(\mu, \sigma^2)$)"), hjust = -0.05) +
annotate("text", x = 0, y = max(df$H0)*0.95, label = TeX(r"(H0: $N(0, \sigma^2)$)"), hjust = 1.05) +
theme_minimal(base_size = 13)

In the plot above, power is the area under the alternative distribution (H1) (where we assume the alternative is distributed according to our true conversion rates) that is beyond the critical threshold (i.e., the area where we reject the null hypothesis). With the parameters we set, the power is 0.96. This means that if we repeated this test many times with the same parameters, we would expect to reject the null hypothesis approximately 96% of the time.
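For reference, base R’s power.prop.test() arrives at a very similar number. It uses a slightly different (pooled-variance) approximation, so the result may not match exactly:

```r
# Cross-check with base R's power calculator for two proportions
power.prop.test(
  n = 1000,               # per-group sample size
  p1 = 0.10, p2 = 0.15,   # assumed true conversion rates
  sig.level = 0.05,
  alternative = "one.sided"
)
```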
Power curves
Now that we have intuition and math behind power, we can explore how power changes based on different parameters. The plots generated from such analysis are called power curves.
Note
Throughout the plots, you’ll notice that 80% power is highlighted. This is a common target for power in testing, as it balances the risk of Type II error with the cost of increasing sample size or adjusting other parameters. You’ll see this value highlighted in many software packages as a consequence.
Relationship with effect size
Earlier, I stated that the larger the effect size, the higher the power. Intuitively, this makes sense. We are essentially shifting the right bell curve in the plot above further to the right, so the area beyond the critical threshold increases. Let’s test that theory.
Show the code
# Function to compute power for varying effect sizes
power_curve <- function(effect_sizes, N_a, N_b, alpha, two_sided = FALSE) {
sapply(effect_sizes, function(e) {
r_a <- p_a
r_b <- p_a + e # Adjust r_b based on effect size
power_value(r_a, r_b, N_a, N_b, alpha, two_sided)
})
}
# Generate effect sizes
effect_sizes <- seq(0, 0.1, length.out = 100) # Effect sizes from 0 to 10%
# Compute power for each effect size
power_values <- power_curve(effect_sizes, N_a, N_b, alpha)
# Create a data frame for plotting
power_df <- tibble(
effect_size = effect_sizes,
power = power_values
)
# Plot the power curve
ggplot(power_df, aes(x = effect_size, y = power)) +
geom_line(color = "blue", linewidth = 1) +
geom_hline(yintercept = 0.80, linetype = "dashed", alpha = 0.6) + # target power guide
labs(
title = "Power vs. Effect Size",
x = TeX(r"(Effect Size ($r_b - r_a$))"),
y = TeX(r'(Power ($1 - \beta $))')
) +
scale_x_continuous(labels = scales::percent_format(accuracy = 0.01)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(NA,1)) +
theme_minimal(base_size = 13)

Theory confirmed: as the effect size increases, power increases, approaching 100% as the decision threshold moves further down the lower tail of the alternative distribution.
Relationship with sample size
Unfortunately, we cannot control effect size. It is either the meaningful effect size you wish to detect or based on prior studies. It is what it is. What we can control is sample size. The larger the sample size, the smaller the standard deviation of the distribution and the larger the area under the curve beyond the critical threshold (imagine squeezing the sides to compress the bell curves in the plot earlier). In other words, larger sample sizes should lead to higher power. Let’s test this theory as well.
Show the code
power_sample_size <- function(N_a, N_b, r_a, r_b, alpha, two_sided = FALSE) {
power_value(r_a, r_b, N_a, N_b, alpha, two_sided)
}
# Generate sample sizes
sample_sizes <- seq(100, 5000, by = 100) # Sample sizes from 100 to 5000
# Compute power for each sample size
power_values_sample <- sapply(sample_sizes, function(N) {
power_sample_size(N, N, r_a, r_b, alpha)
})
# Create a data frame for plotting
power_sample_df <- tibble(
sample_size = sample_sizes,
power = power_values_sample
)
# Plot the power curve for varying sample sizes
ggplot(power_sample_df, aes(x = sample_size, y = power)) +
geom_line(color = "blue", linewidth = 1) +
geom_hline(yintercept = 0.80, linetype = "dashed", alpha = 0.6) + # target power guide
labs(
title = "Power vs. Sample Size",
x = TeX(r"(Sample Size ($N$))"),
y = TeX(r"(Power (1 - $\beta$))")
) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(NA,1)) +
theme_minimal(base_size = 13)

We again see the expected relationship: as sample size increases, power increases.
Note
In this specific setup, we can increase power by increasing sample size. More generally, this is an increase in precision. In other test setups, precision—and thus power—can be increased through other means. For example, in Geo-testing, we can increase precision by selecting predictable markets or through the inclusion of exogenous features (more on this in a future article).
Relationship with significance level
Does the significance level α influence power? Intuitively, if we are more willing to accept Type I error, we are more likely to reject the null hypothesis and thus (1 − β) should be higher. Let’s test this theory.
Show the code
power_of_alpha <- function(alpha_vec, r_a, r_b, N_a, N_b, two_sided = FALSE) {
sapply(alpha_vec, function(a)
power_value(r_a, r_b, N_a, N_b, a, two_sided)
)
}
alpha_grid <- seq(0.001, 0.20, length.out = 400)
power_grid <- power_of_alpha(alpha_grid, r_a, r_b, N_a, N_b, two_sided)
# Current point
power_now <- power_value(r_a, r_b, N_a, N_b, alpha, two_sided)
df_alpha_power <- tibble(alpha = alpha_grid, power = power_grid)
ggplot(df_alpha_power, aes(x = alpha, y = power)) +
geom_line(color = "blue", linewidth = 1) +
geom_hline(yintercept = 0.80, linetype = "dashed", alpha = 0.6) + # target power guide
geom_vline(xintercept = alpha, linetype = "dashed", alpha = 0.6) + # your alpha
scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(NA,1)) +
labs(
title = TeX(r"(Power vs. Significance Level)"),
subtitle = TeX(sprintf(r"(At $\alpha$ = %.1f%%, $1 - \beta$ = %.1f%%)",
100*alpha, 100*power_now)),
x = TeX(r"(Significance Level ($\alpha$))"),
y = TeX(r"(Power (1 - $\beta$))")
) +
theme_minimal(base_size = 13)

Yet again, the results match our intuition. There is no free lunch in statistics. All else equal, if we want to decrease our Type II error rate (β), we must be willing to accept a higher Type I error rate (α).
Power analysis
So what is power analysis? Power analysis is the process of computing power given the parameters of the test. In power analysis, we fix parameters we cannot control and then optimize the parameters we can control to achieve a desired power level. For example, we can fix the true effect size and then compute the sample size needed to achieve a desired power level. Power curves are often used to assist with this decision-making process. Later in the series, I will walk through power analysis in detail with a real-world example.
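To make that concrete, here is a minimal sketch that fixes the assumed true rates and solves for the equal per-group sample size needed to reach 80% power, reusing the power_value() helper defined earlier (with these inputs it should land around 540 per group):

```r
# Solve for the smallest equal per-group sample size that reaches the target power,
# holding the assumed true rates and significance level fixed
target_power <- 0.80
n_required <- uniroot(
  function(n) power_value(r_a, r_b, n, n, alpha) - target_power,
  interval = c(10, 1e6)
)$root
ceiling(n_required)
```

Base R’s power.prop.test() can solve the same problem by supplying power = 0.80 and leaving n unspecified, though its pooled-variance approximation may return a slightly different sample size.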
What’s next in the Series?
I haven’t fully decided but I definitely want to cover the following topics:
- Power analysis in Geo Testing
- Detailed guide on setting the true effect size in various contexts
- Real world end-to-end examples
Happy to hear ideas. Feel free to reach out. My contact info is below:
- Email: [email protected]
- LinkedIn: Sam Arrington