be a sensitive topic. Perhaps best avoided on a first encounter with a Statistician. The disposition toward the topic has led to a tacit agreement that α = 0.05 is the gold standard, when in truth it is a ‘convenient convention’, a rule of thumb set by Ronald Fisher himself.
Who?? Don’t know him? Don’t worry.
He was the first to introduce Maximum Likelihood Estimation (MLE), ANOVA, and Fisher Information (the latter, you may have guessed, bears his name). Fisher was more than a relevant figure in the community; he was the father of modern statistics. He had a deep interest in Mendelian genetics and evolutionary biology, fields to which he made several key contributions. Unfortunately, Fisher also had a thorny past: he was involved with the Eugenics Society and its policy of voluntary sterilization for the “feeble-minded.”
Yes, there is no such thing as a famous Statistician.
But a rule of thumb set by the father of statistics can sometimes be mistaken for a law, and law it is not.
A portrait of the young Ronald Aylmer Fisher (1890–1962). Source: Wikimedia Commons (file page). Public Domain.
There is one key instance when you are not only compelled to but *must* change this alpha level, and it all comes down to multiple hypothesis testing.
To run multiple tests without using the Bonferroni Correction or the Benjamini-Hochberg procedure is more than problematic. Without these corrections, we could prove any hypothesis:
**H₁: The sun is blue**
By simply re-running our experiment until luck strikes. But how do these corrections work? And which one should you use? They are not interchangeable!
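To see how quickly luck strikes, assume each test has a false-positive probability of α = 0.05 and the tests are independent. The chance of at least one false positive in m runs is then
[ \Pr(\text{at least one false positive in } m \text{ runs}) = 1 - (1 - \alpha)^m ]
With m = 100 runs this is already 1 − 0.95¹⁰⁰ ≈ 0.994, so a ‘significant’ result is all but guaranteed.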
P-values and a problem
To understand why, we need to look at what exactly our p-value is telling us, and to understand it more deeply than ‘small is good, big is bad’. But to do this, we will need an experiment, and nothing is as exciting, or as contested, as discovering superheavy elements.
These elements are *incredibly* unstable and are created in particle accelerators, one atom at a time. Pound for pound, they are the most expensive thing ever produced. Outside the lab they would exist only in cosmic events like supernovae, and they last only *thousandths* or *millionths* of a second.
But their instability becomes an advantage for detection, as a new superheavy element would exhibit a distinct radioactive decay. The decay sequence captured by the detectors can tell us whether a new element is present.
Scientist working inside the ORMAK fusion device at ORNL. Credit: Oak Ridge National Laboratory, licensed under CC BY 2.0
As our null hypothesis, we state:
H₀ = The sequence is background noise decay. (No new element)
Now we need to gather evidence that H₀ is not true if we want to prove we have created a new element. This is done through our test statistic T(X). In general terms, this captures the difference between what the sensors observe and what is expected from background radiation. *All* test statistics are a measure of ‘surprise’ between what we expect to observe if H₀ is true and what our sample data actually says. The larger T(X), the more evidence we have that H₀ is false.
This is precisely what the Schmidt test statistic does on the sequence of radioactive decay times.
[ \sigma_{\text{obs}} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(\ln t_i - \overline{\ln t}\right)^2} ]
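For concreteness, here is a minimal sketch of this statistic in Python; the decay times are hypothetical and only meant to show the formula in action.
import numpy as np

def schmidt_sigma(decay_times):
    # Standard deviation of the log decay times, matching the formula above
    log_t = np.log(np.asarray(decay_times))
    return float(np.sqrt(np.sum((log_t - log_t.mean()) ** 2) / (len(log_t) - 1)))

# Hypothetical decay times (in seconds) of a candidate chain
print(schmidt_sigma([0.012, 0.035, 0.009, 0.051, 0.022]))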
The Schmidt test statistic was used in the discovery of:
- Hassium (108) and Meitnerium (109) in 1984
- Darmstadtium (110), Roentgenium (111), and Copernicium (112) from 1994 to 1996
- Moscovium (115) and Tennessine (117) from 2003 to 2016
It is essential to specify a distribution for H₀ so that we can calculate the probability of obtaining a test statistic at least as extreme as the one computed from the observed data.
We assume noise decays follow an exponential distribution. There are a million reasons why this is a good assumption, but let’s not get bogged down here. If we don’t have a distribution for H₀, computing our probability value would be impossible!
[ H_0^{(\text{Schmidt})}: t_1, \dots, t_n \overset{\text{i.i.d.}}{\sim} \mathrm{Exp}(\lambda) ]
The p-value is then the probability, under the null model, of obtaining a test statistic at least as extreme as that computed from the sample data. The smaller this probability, the stronger the evidence against H₀.
[ p \;=\; \Pr_{H_0}\!\big( T(X) \ge T(x_{\mathrm{obs}}) \big) ]
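As a rough sketch of how this probability could be estimated by simulation (not the procedure used in the actual discoveries), the snippet below draws many samples from the exponential null, computes the Schmidt statistic for each, and counts how often it is at least as extreme as the observed one. The decay times are hypothetical, and the rate λ can be set to 1 because the standard deviation of log-times does not depend on it.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed decay times (seconds) and their Schmidt statistic
t_obs = np.array([0.012, 0.035, 0.009, 0.051, 0.022])
T_obs = np.std(np.log(t_obs), ddof=1)

# Simulate the null model: i.i.d. Exp(1) decay times, same sample size
n, n_sim = len(t_obs), 100_000
T_null = np.std(np.log(rng.exponential(1.0, size=(n_sim, n))), axis=1, ddof=1)

# p-value: fraction of simulated statistics at least as extreme as the observed one
p_value = np.mean(T_null >= T_obs)
print(f"T_obs = {T_obs:.3f}, Monte Carlo p-value ≈ {p_value:.4f}")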
Edvard Munch, At the Roulette Table in Monte Carlo (1892). Source: Wikimedia Commons (file page). Public Domain (CC PDM 1.0)
Of course, this brings up an interesting issue. What if we observe a rare background decay sequence, one that simply resembles the decay of an undiscovered element? What if our sensors record an unlikely, though perfectly possible, sequence that yields a large test statistic? Each time we run the test there is a small chance of getting such an outlier purely by chance. The outlier gives a large test statistic because it looks quite different from what we expect when H₀ is true; the large T(x) lands in the tails of the null distribution and produces a small p-value, a small probability of observing anything more extreme. But no new element exists! We just hit 31 red by playing roulette a million times.
It seems unlikely, but when you keep in mind that ion beams are fired at target atoms for months at a time, the possibility stands. So how do we account for it?
There are two ways: a conservative and a less conservative method. Your choice depends on the experiment. We can use the:
- Family Wise Error Rate (FWER) and the Bonferroni correction
- False Discovery Rate (FDR) and the Benjamini-Hochberg procedure
These are not interchangeable! You need to carefully consider your study and pick the right one.
If you’re interested in the physics of it:
New elements are created by accelerating lighter ions to roughly 10% of the speed of light. These ion beams bombard heavier target atoms. The incredible speeds and kinetic energy are required to overcome the Coulomb barrier (the immense repulsive force between two positively charged nuclei).
| New Element | Beam (proton number) | Target (proton number) |
| --- | --- | --- |
| Nihonium (113) | Zinc-70 (30) | Bismuth-209 (83) |
| Moscovium (115) | Calcium-48 (20) | Americium-243 (95) |
| Tennessine (117) | Calcium-48 (20) | Berkelium-249 (97) |
| Oganesson (118) | Calcium-48 (20) | Californium-249 (98) |
Computer simulation showing the collision and fusion of two atomic nuclei to form a superheavy element. (Credit: Lawrence Berkeley National Laboratory) Public Domain
Family Wise Error Rate (Bonferroni)
This is our conservative approach, and it is what should be used if we cannot tolerate any false positives. It keeps the probability of committing at least one Type I error below our alpha level.
[ \Pr(\text{at least one Type I error in the family}) \leq \alpha ]
This is also a simple correction: divide the alpha level by the number of tests, m. Then, for each test, you reject the null hypothesis if and only if:
[ p_i \leq \frac{\alpha}{m} ]
Equivalently, you can adjust your p-values. If you run ***m*** tests, take:
[ p_i^{\text{adj}} = \min(1, m p_i) ]
And reject the null hypothesis if:
[ p_i^{\text{adj}} \le \alpha ]
All we did here was multiply both sides of the inequality by m.
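As a minimal sketch in Python (with made-up p-values), here is how the Bonferroni rule could be applied, either through the per-test threshold α/m or through the adjusted p-values; both give exactly the same rejections.
import numpy as np

alpha = 0.05
p_values = np.array([0.001, 0.009, 0.020, 0.041, 0.300])  # hypothetical results of m = 5 tests
m = len(p_values)

reject_threshold = p_values <= alpha / m          # compare each p-value to alpha / m
p_adjusted = np.minimum(1.0, m * p_values)        # or, equivalently, adjust the p-values
reject_adjusted = p_adjusted <= alpha

print(reject_threshold)   # [ True  True False False False]
print(reject_adjusted)    # same decisions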
The proof of this is also a slim one-liner. Let Aᵢ be the event that test i produces a false positive. Then the probability of having at least one false positive is the probability of the union of all these events.
[ \Pr(\text{at least one false positive}) = \Pr\left(\bigcup_{i=1}^{m} A_i\right) \le \sum_{i=1}^{m} \Pr(A_i) \le m \cdot \frac{\alpha}{m} = \alpha ]
Here we make use of the union bound, a fundamental result in probability which states that the probability of A₁, or A₂, …, or Aₖ happening must be less than or equal to the sum of the probabilities of the individual events.
[ \text{Pr}(A_1 \cup A_2 \cup \cdots \cup A_k) \le \sum_{i=1}^{k} \text{Pr}(A_i) ]
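To see the guarantee in action, here is a small simulation sketch. It assumes m independent tests with every null hypothesis true, so each p-value is Uniform(0, 1), and it estimates the family-wise error rate with and without the Bonferroni threshold.
import numpy as np

rng = np.random.default_rng(1)
m, alpha, n_sim = 20, 0.05, 50_000

# Under a true null, each p-value is Uniform(0, 1)
p = rng.uniform(size=(n_sim, m))

fwer_uncorrected = np.mean((p <= alpha).any(axis=1))     # ~ 1 - 0.95**20 ≈ 0.64
fwer_bonferroni = np.mean((p <= alpha / m).any(axis=1))  # stays at or below 0.05

print(f"FWER without correction: {fwer_uncorrected:.3f}")
print(f"FWER with Bonferroni:    {fwer_bonferroni:.3f}")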
False Discovery Rate (Benjamini-Hochberg)
The Benjamini-Hochberg procedure also isn’t too complicated. Simply:
- Sort your p-values: p₁ ≤ … ≤ pₘ.
- Find the largest k such that pₖ ≤ (k/m)·q, and reject the null hypothesis for the k smallest p-values.
In this approach, the goal is to control the false discovery rate (FDR).
[ \text{FDR} = E\left[ \frac{V}{\max(R, 1)} \right] ]
Where ***R*** is the number of times we reject the null hypothesis, and ***V*** is the number of rejections that are (unfortunately) false positives (Type I errors). The goal is to keep this metric below a chosen threshold, typically q = 0.05.
The BH thresholds are:
[ \frac{1}{m}q, \frac{2}{m}q, \dots, \frac{m}{m}q = q ]
And we reject the null hypotheses corresponding to the k smallest p-values, where k is the largest index satisfying:
[ p_{(k)} \leq \frac{k}{m} q ]
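A minimal sketch of the procedure in Python (with made-up p-values), returning which hypotheses are rejected at FDR level q:
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of rejected hypotheses at FDR level q."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                       # indices of p-values, smallest first
    thresholds = (np.arange(1, m + 1) / m) * q  # (k/m) * q for k = 1..m
    below = p[order] <= thresholds
    if not below.any():
        return np.zeros(m, dtype=bool)          # nothing is rejected
    k = np.max(np.nonzero(below)[0]) + 1        # largest k with p_(k) <= (k/m) q
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                    # reject the k smallest p-values
    return reject

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_vals, q=0.05))
With these ten hypothetical p-values, only the two smallest fall below their thresholds, so only those two hypotheses are rejected.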
Use this when you are okay with some false positives, and when your primary concern is keeping the Type II error rate low; that is, you want as few false negatives as possible, cases where we fail to reject H₀ even though H₀ is in fact false.
Think of this as a genomics study where you aim to identify everyone who has a specific gene that makes them more susceptible to a particular cancer. It would be less harmful if we treated some people who did not have the gene than risk letting someone who did have it walk away with no treatment.
Quick side-by-side
Bonferroni:
- Controls family-wise error rate (FWER).
- Guarantees that the probability of at least one false positive is ≤ α
- Higher rate of false negatives ⇒ Lower statistical power
- Zero risk tolerance
Benjamini-Hochberg:
- Controls False Discovery Rate (FDR)
- Guarantees that the expected proportion of false positives among all discoveries is ≤ q
- Fewer false negatives ⇒ Higher statistical power
- Some risk tolerance
A super-tiny p for a super-heavy atom
We can’t have any nonexistent elements in the periodic table, so when it comes to finding a new element, the Bonferroni correction is the right approach. But when it comes to decay chain data collected by position-sensitive silicon detectors, picking an m isn’t so simple.
Physicists tend to use the expected number of random chains, n_b, produced by the entire search over the entire dataset:
[ \Pr(\ge 1 \text{ random chain}) \approx 1 - e^{-n_b} ]
[ 1 - e^{-n_b} \leq \alpha_{\text{family}} \;\Longleftrightarrow\; n_b \leq -\ln(1 - \alpha_{\text{family}}) \approx \alpha_{\text{family}} \quad (\text{for rare events}) ]
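As a quick numerical check (taking α_family = 0.05 for illustration), the exact bound on the expected number of random chains is barely different from α_family itself:
import numpy as np

alpha_family = 0.05
n_b_max = -np.log(1 - alpha_family)  # exact bound on the expected number of random chains
print(n_b_max)                       # ≈ 0.0513, essentially alpha_family for rare events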
The expected number of random chains comes from observing background data when no experiment is taking place. From this data we can build the null distribution for H₀ through Monte Carlo simulation.
We estimate the number of random chains by modelling the background event rates and resampling the observed background events. Under H₀ (no heavy element decay chain), we use Monte Carlo to simulate many null realizations and compute how often the search algorithm produces a chain as extreme as the observed chain.
More precisely:
H₀: background events arrive as a Poisson process with rate λ ⇒ inter-arrival times are Exponential.
Then an accidental chain is k consecutive hits within a time window τ. We scan the data using our test statistic to determine whether such an extreme cluster exists.
import numpy as np

lambda_rate = 0.2   # background events per second
T_total = 2_000.0   # seconds of data-taking (mean number of events ~ 400)
k = 4               # chain length
tau_obs = 0.20      # "observed extreme": 4 events within 0.20 s
Nmc = 20_000        # number of Monte Carlo realizations
rng = np.random.default_rng(0)

def dmin_and_count(times, k, tau):
    """Smallest span covering k consecutive events, and how many such spans are <= tau."""
    if times.size < k:
        return np.inf, 0
    spans = times[k-1:] - times[:-(k-1)]
    return float(np.min(spans)), int(np.sum(spans <= tau))

...
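For completeness, here is a sketch of how the Monte Carlo loop might continue from here, reusing the definitions above. This is a simplified reconstruction rather than the full script (which is linked below): simulate background-only event times under the Poisson null and count how often a cluster of k events at least as tight as the observed one appears.
# Sketch of the Monte Carlo loop (simplified reconstruction)
count_extreme = 0
for _ in range(Nmc):
    n_events = rng.poisson(lambda_rate * T_total)               # number of background events
    times = np.sort(rng.uniform(0.0, T_total, size=n_events))   # Poisson process: uniform arrival times
    dmin, _ = dmin_and_count(times, k, tau_obs)
    if dmin <= tau_obs:                                         # a chain at least as tight as observed
        count_extreme += 1

p_value = count_extreme / Nmc
print(f"Estimated p-value under H0: {p_value:.4f}")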
Monte-Carlo Simulation on GitHub
Illustration by Author
If you’re interested in the numbers, in the discovery of element 117, **Tennessine** (Ts), a p-value of 5×10⁻¹⁶ was used. I imagine that if no corrections were ever used, our periodic table would, unfortunately, not be poster-sized, and chemistry would be in shambles.
Conclusion
This whole concept of searching for something in lots of places, then treating a seemingly significant blip as if it came from a single observation, is commonly referred to as the Look-Elsewhere Effect, and there are two primary ways we can adjust for it:
- Bonferroni Correction
- Benjamini-Hochberg Procedure
Our choice entirely depends on how conservative we want to be.
But even with a p-value of 5×10⁻¹⁶, you might be wondering whether a p-value of 10⁻⁹⁹ could ever still be discarded. That all comes down to Victor Ninov, a physicist at Lawrence Berkeley National Laboratory who was, for a brief moment, the man who discovered element 118.
However, an internal investigation found that he had fabricated the alpha-decay chain. When the data themselves are falsified, as in this case of research misconduct, even a p-value of 10⁻⁹⁹ does not justify rejecting the null hypothesis.
Yuri Oganessian, the leader of the team at the Joint Institute for Nuclear Research in Dubna and Lawrence Livermore National Laboratory, who discovered Element 118. Wikimedia Commons, CC BY 4.0.
References
Bodmer, W., Bailey, R. A., Charlesworth, B., Eyre-Walker, A., Farewell, V., Mead, A., & Senn, S. (2021). The outstanding scientist, RA Fisher: his views on eugenics and race. Heredity, 126(4), 565-576.
Khuyagbaatar, J., Yakushev, A., Düllmann, C. E., Ackermann, D., Andersson, L. L., Asai, M., … & Yakusheva, V. (2014). ⁴⁸Ca + ²⁴⁹Bk fusion reaction leading to element Z = 117: Long-lived α-decaying ²⁷⁰Db and discovery of ²⁶⁶Lr. Physical Review Letters, 112(17), 172501.
Multiple Comparisons: Bonferroni Corrections and False Discovery Rates.