In 2001, the legendary statistician Leo Breiman published the paper “Statistical Modeling: The Two Cultures”, which foresaw the fundamental battle that would define the era of Artificial Intelligence.
This battle, as he prophetically described it, was waged between two philosophies, or “cultures,” as he called them. On one side stood the Data Modeling Culture, which dominated statistical thinking. This culture operates on the assumption that data is generated by a specific stochastic model, such as a linear regression, and the goal is to understand that model’s parameters. On the other side was the Algorithmic Modeling Culture, a pragmatic approach that treats nature’s mechanism as an unknown black box and focuses on finding an algorithm with the highest possible predictive accuracy, without being tied to assumptions about how the data was generated.
Breiman’s own journey, having spent over a decade as a consultant before returning to academia, placed him at the very heart of this clash. He recounts a telling anecdote from the late 1970s, when he described his work on “decision trees” — an early algorithmic model — to a prominent statistician colleague. After the explanation, his colleague’s first question was:
“What’s the model for the data?”.
This seemingly simple question, in fact, is the key to the entire conflict. It reveals a worldview where the validity of a method lies not in its performance or predictive accuracy, but in its adherence to a preconceived theoretical model. It is the clash between the quest to interpret a model and the quest for the result of an algorithm.
We insist on forcing the tree’s untamable complexity into the pot’s rigid, limiting form, believing our assumptions can contain reality. But the truth always emerges: the roots of the real world are stronger than the walls of any preconceived model.
Part 1: The Reason for Data Modeling (The Respected Past)
Linear regression, as we know it, is not a simple method but the culmination of a series of scientific advances aimed at solving measurement problems riddled with noisy data. In the late 18th and early 19th centuries, the challenge was not “Big Data” but “Imperfect Data”: small errors in the measurement of celestial bodies’ positions propagated into errors in the prediction of their orbits.
1.1 Before computers, there was computation. Before algorithms, there were models.
In 1805, Adrien-Marie Legendre published the “method of least squares,” elegantly simple yet revolutionary: of all the possible curves that could describe the data, the “best” is the one that minimizes the sum of the squares of the vertical distances (the errors or residuals) between the observed points and the curve itself.
Legendre begins his article by describing a problem common to all the experimental sciences: we have more equations (observations) than unknowns. In modern parlance, we have an “overdetermined” system of linear equations. For example, with 50 observations we may need to determine only the two parameters m and b of the equation of a straight line, y = m·x + b. No single line can pass through all the points, so each equation carries its own error, Eᵢ = (m·xᵢ + b) − yᵢ, which is generally nonzero.
The French mathematician’s genius was to attack the problem with the most versatile tool of his time, differential calculus. In his words:
“Of all the principles that can be proposed for that purpose, I think there is none more general, more exact, and more easy of application than that which we made use of in the preceding researches, and which consists of rendering the sum of squares of the errors a minimum.” — Adrien-Marie Legendre, 1805
In other words, instead of minimizing the raw sum of the errors (which would not work, since positive and negative errors cancel each other out, summing toward zero), he minimizes the sum of the squares of the errors, a criterion that is both more elegant and more tractable.
To find the minimum value of a function f, we need to determine its derivative f’ and then set it equal to zero: f’(x) = 0. In this case, we need to do this for a function of multiple variables (x, y, z, …) in R^n.
If we consider m equations (observations) and n unknowns, with m > n, then for each observation i (from 1 to m) we have, in Legendre’s notation, an equation of the form

Eᵢ = aᵢ + bᵢ·x + cᵢ·y + fᵢ·z + …

where Eᵢ represents the error, which ideally should be zero, aᵢ is a known constant term, and bᵢ, cᵢ, fᵢ, … are the known coefficients that multiply the unknowns x, y, z, …

We can rewrite this last equation by grouping the constant term separately from the linear combination,

Eᵢ = (bᵢ·x + cᵢ·y + fᵢ·z + …) + aᵢ,

where −aᵢ is the observed value and the linear combination of the variables is the value predicted by the model. It is advantageous to represent the problem in vector notation to facilitate computation, so we define the parameter vector β = (β₁, β₂, …, βₙ)ᵀ, collecting the unknowns x, y, z, …, and the design matrix A of dimension m x n. The coefficients bᵢ, cᵢ, fᵢ, … of the i-th equation form the i-th row of A, and the term aᵢ in each equation is simply the negative of our observation yᵢ. In vector notation, the equation for the error Eᵢ becomes

Eᵢ = Σⱼ Aᵢⱼ·βⱼ − yᵢ,

where Aᵢⱼ is the coefficient of the j-th variable in the i-th equation. Stacking all m equations/observations, we have

e = Aβ − y,

where e = (E₁, E₂, …, Eₘ)ᵀ is the error vector of dimension m x 1, A is the matrix of known coefficients of dimension m x n, β is the vector of variables to be determined of dimension n x 1, and y is the vector of known observations of dimension m x 1.
Calling the sum of the squares of the errors S,

S(β) = E₁² + E₂² + … + Eₘ² = Σᵢ Eᵢ²,

we want to minimize this cost (or loss) function S(β), which is the square of the Euclidean norm of the vector e:

S(β) = ‖e‖² = ‖Aβ − y‖².
The important point to emphasize here is that the function S is quadratic, continuous, and differentiable over the entire domain, and can be treated with the tools of differential calculus.
To find the minimum of S, the necessary condition is that the derivative of S with respect to β is zero. Since we must differentiate with respect to each βⱼ, we use the partial derivatives, collected in the gradient:

∂S/∂βⱼ = 0 for every j = 1, …, n, that is, ∇S(β) = 0.
We first expand S(β) to

S(β) = Σᵢ (Σⱼ Aᵢⱼ·βⱼ − yᵢ)²,

where the inner sum inside the parentheses represents the model’s prediction for the i-th observation. Simply put, it is the dot product between the i-th row of the matrix A and the parameter vector β.
Now, differentiating with respect to a specific βⱼ (applying the chain rule), we have:

∂S/∂βⱼ = Σᵢ 2·(Σₖ Aᵢₖ·βₖ − yᵢ)·Aᵢⱼ.

The partial derivative of the inner term with respect to βⱼ is simply Aᵢⱼ, since all the other βₖ terms, with k ≠ j, and yᵢ are treated as constants. Thus, setting the partial derivative with respect to βⱼ to 0 and performing the algebraic manipulations (dividing by 2, distributing Aᵢⱼ, and regrouping the sums), we obtain, for each j:

Σᵢ Aᵢⱼ·(Σₖ Aᵢₖ·βₖ) = Σᵢ Aᵢⱼ·yᵢ

or, in matrix form, the system of normal equations AᵀAβ = Aᵀy.
Despite the apparent complexity of this modern notation, the above result is, in essence, the exact formalization of Legendre’s result. Indeed, this formula is the precise translation of the method Legendre himself described: to form the equation for one of the unknowns, multiply all the terms of each original equation by the coefficient of that unknown and then add all the resulting products. The terms we see here as explicit sums (Σ), such as Σᵢ Aᵢⱼ·yᵢ and Σᵢ Aᵢⱼ·Aᵢₖ,
are precisely the quantities that Legendre aggregated using the notation of his time (similar to the integral symbol). Therefore, what we see is not a different result, but rather the same seminal reasoning as Legendre’s, expressed with the precision and generality of contemporary mathematical language.
What seems complex is actually an elegant recipe: for each parameter, multiply each original equation by that parameter’s coefficient and add them all together. The result is a system of equations that, when solved, gives us the values that minimize the total error. This is the root of it all.
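To make the recipe concrete, here is a minimal NumPy sketch of those normal equations, fitting a line to a small synthetic dataset (the data, seed, and variable names are illustrative assumptions):

```python
import numpy as np

# Illustrative synthetic data: points scattered around the line y = 3x + 2
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 50)
y = 3.0 * x + 2.0 + rng.normal(0.0, 0.1, 50)

# Design matrix A: one row per observation, one column per parameter (m and b)
A = np.column_stack([x, np.ones_like(x)])

# Normal equations: (A^T A) beta = A^T y, i.e., Legendre's recipe in matrix form
beta = np.linalg.solve(A.T @ A, A.T @ y)
print(beta)  # approximately [3.0, 2.0]
```

In practice, np.linalg.lstsq solves the same problem with better numerical behavior when AᵀA is ill-conditioned, but the underlying criterion is identical.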
1.2 From the Heavens of the 19th Century to the GPUs of the 21st.
In 1801, Carl Friedrich Gauss employed this method (he claimed to have been using it since 1795, though he only published it in 1809) to predict the orbit of the asteroid Ceres with remarkable accuracy. At the time, Ceres had been observed only briefly before being lost in the Sun’s glare. With the limited data at hand, Gauss calculated its trajectory so accurately that other astronomers found it almost exactly where he predicted.
Sum of Squared Errors
Before this tool, scientists often averaged data or used ad hoc methods. Regression provided an objective, defensible, mathematical criterion for finding the “best fit.” With it, one could fit a function f, linear in multiple variables, that relates the inputs to an output y up to an error term epsilon (ϵ), a gigantic conceptual leap. With numerous refinements over the years, the method made it possible not only to describe a relationship but also to test hypotheses about it (e.g., “Does variable X really have a significant effect on Y?”).
The Power of Squares: By squaring the errors (the differences between the actual measurements and the values estimated on the line), the method disproportionately penalizes distant points (an error of 3 contributes 9), while shrinking the impact of nearby points (an error of 0.5 contributes only 0.25). In effect, the algorithm is terrified of large errors. A single point very far from the line (an outlier) can significantly “pull” the regression line toward itself.
The beauty of Legendre’s and Gauss’s logic is its universality. The principle of adjusting parameters to minimize the sum of squared errors is so fundamental that it reappears, in its essence, at the heart of the most complex Deep Learning architectures. Today, we no longer need to solve the system of equations by hand; we have optimized frameworks that perform this task on a massive scale.
TensorFlow and PyTorch are, at their core, engines for numerical computation and automatic differentiation. They were built to optimize loss functions by adjusting parameters — precisely the problem Gauss solved for Ceres.
What once required the genius of Legendre and Gauss, and likely days of meticulous hand calculation with quill and paper, is now encapsulated with an almost unbelievable simplicity in just a few lines of Python code. This does not diminish the genius of the method; on the contrary, it is the ultimate tribute to its robustness. The logic was so powerful that it has survived two centuries of technological revolution to become a fundamental command, executed in milliseconds by any one of us.
Method of Least Squares with TensorFlow
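A minimal sketch of such a fit in TensorFlow, assuming a small synthetic dataset and illustrative hyperparameters (not a canonical implementation):

```python
import numpy as np
import tensorflow as tf

# Illustrative synthetic data: points scattered around the line y = 3x + 2
rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, size=(200, 1)).astype("float32")
y = 3.0 * x + 2.0 + rng.normal(0.0, 0.1, size=(200, 1)).astype("float32")

# The two "knobs" of the line, initialized arbitrarily
m = tf.Variable(0.0)
b = tf.Variable(0.0)

optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

for step in range(500):
    with tf.GradientTape() as tape:
        y_pred = m * x + b
        # Mean of squared errors: a rescaled version of Legendre's sum of
        # squares, with exactly the same minimizer
        loss = tf.reduce_mean(tf.square(y_pred - y))
    grads = tape.gradient(loss, [m, b])
    optimizer.apply_gradients(zip(grads, [m, b]))

print(m.numpy(), b.numpy())  # should approach 3 and 2
```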
Method of Least Squares with PyTorch
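And an equivalent minimal sketch in PyTorch, under the same assumptions:

```python
import torch

# Illustrative synthetic data: points scattered around the line y = 3x + 2
torch.manual_seed(42)
x = torch.rand(200, 1) * 2 - 1
y = 3.0 * x + 2.0 + 0.1 * torch.randn(200, 1)

# The two "knobs" of the line, with gradient tracking enabled
m = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

optimizer = torch.optim.SGD([m, b], lr=0.1)

for step in range(500):
    optimizer.zero_grad()
    y_pred = m * x + b
    # Mean of squared errors: same minimizer as the sum of squared errors
    loss = torch.mean((y_pred - y) ** 2)
    loss.backward()
    optimizer.step()

print(m.item(), b.item())  # should approach 3 and 2
```

Both snippets do by iterative gradient descent what the normal equations do in closed form: adjust the parameters until the squared error stops decreasing.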
Part 2: The Illusion of the Model: Breiman’s Critique of the First Culture (The Necessary Criticism)
It’s time to explore why Leo Breiman felt the need to declare a crisis. His critique is not about mathematics itself, but about the culture and practice that have developed around it.
2.1 The Fundamental Error: Confusing the Map with the Territory
The starting point of Breiman’s critique is a subtle but profound philosophical error: the confusion between the model and the mechanism of nature. When constructing a regression model, the quantitative conclusions we draw are about the model’s mechanism, not about how reality works. Breiman warns that if the model is a poor emulation of nature, our conclusions become mere caricatures of reality.
This confusion leads to a crucial analytical shortcoming, where the focus shifts from the real problem to the model itself. The question ceases to be “How do I solve this problem?” and becomes “How can I adjust this problem to fit my model?”
2.2 A Question of Faith, Not Fit
Breiman argues that this culture has become so dogmatic that “the belief in the infallibility of data models was almost religious.” In one of his strongest statements, he observes:
“The question of how well the model fits the data is of secondary importance compared to the construction of an ingenious stochastic model.”
This obsession with a narrow class of models has had severe technical consequences. Methods like linear regression, while useful, become hard to validate as problems grow, because goodness-of-fit tests lose almost all of their power beyond four or five dimensions. That lack of power means the same data can be fit by different models which, while all passing the tests, tell completely different stories about reality.
This has led to a dangerous stagnation. Breiman asserts that, with the insistence on data models,
“(…) the tools of multivariate analysis in statistics are frozen in discriminant analysis, logistic regression, and multiple linear regression.”
He adds with the wry observation that
“No one really believes that multivariate data are multivariate normal, yet this data model takes up a large number of pages in every graduate-level textbook on multivariate statistical analysis.”
For Breiman, the data modeling culture had painted itself into a corner, which he sums up with a timeless adage:
“If all a man has is a hammer, then every problem looks like a nail.”
The elegance of a tool, such as a data model, lies in its specific purpose. The true wisdom, however, is in discerning when its application is not the universal solution.
Part 3: The Algorithmic Answer: Breiman’s Vision for the Future (The Visionary Solution That Became the Present)
If the first culture was based on inference from predefined models, Breiman’s answer points to a new culture in his time, forged in practice and validated by performance.
3.1 The New Community
In the mid-1980s, the landscape began to change with the emergence of two powerful new algorithms: neural networks and decision trees. With these tools, a new research community flourished, composed of young computer scientists, physicists, and engineers, with the clear goal of achieving maximum predictive accuracy. They began applying these new methods to complex problems where traditional data models were inadequate, such as speech, image, and handwriting recognition.
The philosophy of this community was fundamentally different. Data models were rarely used. The approach assumed that nature produces data in a black box whose interior is complex, mysterious, and, in part, unknowable. The theory, therefore, shifted its focus: instead of validating data models, it began to characterize the properties of the algorithms themselves — their predictive strength, their convergence, and the factors that confer them high accuracy.
3.2 The New Rules of the Game: Rashomon and Occam’s Dilemma
This new approach brought with it new insights. One of them is what Breiman calls the Rashomon Effect: the realization that there is often a multitude of different models (or equations) that produce virtually the same minimum error rate. This challenges the notion that there is a single “true” model waiting to be discovered.
This brings us to Occam’s Dilemma: in prediction, accuracy and simplicity (interpretability) are in conflict. A single decision tree, for example, is wonderfully interpretable (grade “A+”), but its predictive power is only modest (grade “B”). In contrast, a “forest” of trees (a random forest) can achieve “A+” accuracy, but its decision mechanism is so complex that its interpretability is virtually zero (grade “F”).
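As a rough illustration of that trade-off, the sketch below compares a single shallow decision tree with a random forest using scikit-learn (the dataset, tree depth, and forest size are illustrative assumptions; exact scores will vary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A single shallow tree: its if/then splits can be printed and read by a human
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# A forest of hundreds of trees: usually more accurate, but effectively opaque
forest = RandomForestClassifier(n_estimators=300, random_state=0)

for name, model in [("single tree", tree), ("random forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

On most tabular datasets the forest edges out the single tree in accuracy, while giving up the tree’s readable decision path.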
This raises the question: if the most accurate models are black boxes, how can we extract knowledge from them? Breiman’s answer is the climax of his argument: “The goal is not interpretability, but accurate information.” He demonstrates that greater predictive accuracy is associated with more reliable information about the underlying data mechanism, and that poor accuracy can lead to questionable conclusions. Algorithmic models, because they are more accurate, can provide better insights and uncover important aspects of the data that standard models cannot.
Perhaps no algorithm better exemplifies this new philosophy than Support Vector Machines (SVMs). Instead of conforming to the data, SVMs transform it, projecting it into higher-dimensional spaces to find a separating hyperplane. This is Breiman’s ‘blessing of dimensionality’ in action: a problem that is intractable in three dimensions may become trivially simple in ten. The elegance of the solution lies not in an interpretable model of reality, but in the geometric beauty of its separating power, grounded purely in the concepts of Linear Algebra.
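A small sketch of that idea, using scikit-learn’s make_circles data, where no straight line separates the two classes in the original 2D space but an RBF kernel (an implicit projection into a higher-dimensional space) separates them easily (the dataset and parameters are illustrative assumptions):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2D space
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

# A linear SVM struggles; the RBF kernel implicitly maps the points into a
# higher-dimensional space where a separating hyperplane does exist
models = [("linear kernel", SVC(kernel="linear")),
          ("RBF kernel", SVC(kernel="rbf"))]

for name, model in models:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```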
3.3 Breiman’s Vision: A Plea for Science
Ultimately, Breiman’s message is not a declaration of war, but a call for reunification and evolution. He makes it clear:
“I’m not against data models per se. In some situations, they’re the most appropriate way to solve the problem. But the emphasis needs to be on the problem and the data.”
What he advocates is an expansion of the data scientist’s arsenal. “The trick to being a scientist is being open to using a wide variety of tools.” He reminds us that “the roots of statistics, like science, lie in working with data and testing theory against data,” and expresses the hope that the field will return to these roots.
For Breiman, this evolution is not optional; it is a matter of survival:
“I believe this trend will continue, and indeed, it must continue if we are to survive as an energetic and creative field.”
Epilogue: The Two Cultures 25 Years Later
The algorithmic culture has not only grown; it has become the dominant force driving AI innovation. LLMs and Transformers are the culmination of the “black box” philosophy. Breiman’s criterion — predictive accuracy against benchmarks — is how success is measured.
The conflict between accuracy and interpretability that Breiman described has, today, become a chasm. We have models with trillions of parameters that surpass human performance on many tasks, yet whose inner workings are profoundly obscure even to their own creators.
With just a few lines of Python code in Google Colab and the transformers library, we can load a modern language model, such as Google’s Gemma-7B, and count its parameters: around 8.5 billion. Each one is a “knob” that was adjusted during training to minimize an error function. Trying to “interpret” the relationship between these billions of knobs the way Gauss interpreted the coefficients of his regression is a task that borders on the impossible. This is the abyss of Occam’s Dilemma in its most extreme form.
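A minimal sketch of the kind of snippet described above, assuming the gated google/gemma-7b checkpoint has been unlocked on the Hugging Face Hub (license accepted, token configured) and that the machine has enough memory to hold the weights:

```python
from transformers import AutoModelForCausalLM

# Downloading google/gemma-7b requires accepting the license on the Hub and
# authenticating beforehand (e.g., with huggingface_hub.login).
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")

# Every parameter is one more "knob" that training adjusted to minimize a loss.
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} parameters")  # on the order of 8.5 billion
```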
And herein lies the irony, and perhaps the most powerful insight of all. Precisely because of the triumph of the “black box,” the desire behind the first culture’s goal, understanding and information, is returning with a vengeance under new names: Explainable AI (XAI), AI ethics, and bias analysis. We are desperately trying to build new tools (a “third culture”?) to peer inside the black boxes that the second culture has constructed. Breiman’s prophecy was fulfilled so thoroughly that its very fulfillment created the need for a new movement.