In data science, we strive to improve the less-than-desirable performance of our model as we fit the data at hand. We try techniques ranging from changing model complexity to data massaging and preprocessing. However, more often than not, we are advised to “just” get more data. Besides that being easier said than done, perhaps we should pause and question the conventional wisdom. In other words,
*Does adding more data **always** yield better performance?*
In this article, let’s put this adage to the test using real data and a tool I constructed for such inquiry. We will shed light on the subtleties associated with data collection and expansion, challenging the notion that such endeavors automatically improve performance and calling for a more mindful and strategic practice.
What Does More Data Mean?
Let’s first define what we mean exactly by “more data”. In the most general setting, we commonly imagine data to be tabular. And when the idea of acquiring more data is suggested, adding more rows to our data frame (i.e., more data points or samples) is what first comes to mind.
However, an alternative approach would be adding more columns (i.e., more attributes or features). The first approach expands the data vertically, while the second does so horizontally.
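To make the distinction concrete, here is a minimal pandas sketch (the column names and values are illustrative, not taken from the dataset used later in this article):

```python
import pandas as pd

# Existing data: rows are students, columns are attributes (illustrative names).
df = pd.DataFrame({
    "age_at_enrollment": [19, 23, 21],
    "gdp": [1.74, 0.32, 1.74],
})

# Vertical expansion: more samples (rows) from a new batch of records.
new_rows = pd.DataFrame({
    "age_at_enrollment": [20, 25],
    "gdp": [-0.92, 1.74],
})
df_more_samples = pd.concat([df, new_rows], axis=0, ignore_index=True)

# Horizontal expansion: more attributes (columns) for the same students.
new_cols = pd.DataFrame({
    "unemployment_rate": [10.8, 13.9, 10.8],
})
df_more_attributes = pd.concat([df, new_cols], axis=1)
```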
We will next consider the commonalities and peculiarities of the two approaches.
Data can be expanded by adding more samples or more columns. (Image by author)
Case 1: More Samples
Let’s consider the first case of adding more samples. Does adding more samples *necessarily* improve model performance?
In an attempt to get to the bottom of it, I created a tool hosted as a Hugging Face Space to target this question. This tool allows the user to experiment with the effects of changing the attribute set, the sample size, and/or the model complexity when analyzing the UC Irvine Predict Students’ Dropout and Academic Success dataset [1] with a decision tree. While both the tool and the dataset are meant for educational purposes, we will still be able to derive valuable insights that generalize beyond this basic setting.
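The hosted tool’s code is not reproduced here, but the experiment it wraps boils down to a few lines of scikit-learn. Below is a minimal sketch under stated assumptions: the CSV path, the target column name, and the choice of macro-averaged F1 are mine, and the train/test sizes loosely mirror the scenario described next.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Hypothetical local copy of the UCI students' dropout dataset.
df = pd.read_csv("students_dropout.csv")
X = df.drop(columns=["Target"])   # assumed target column name
y = df["Target"]

# Hold out a fixed test set; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=800, test_size=700, stratify=y, random_state=0
)

# Fit a decision tree and score it on the held-out data.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
print("Test F1:", f1_score(y_test, tree.predict(X_test), average="macro"))
```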

Feature/Depth/Sample Explorer Tool (Image generated by the author using UCI dataset)
Say the school’s dean hands you some student records and asks you to identify the factors that predict student dropout to address the issue. You are given 1500 data points to start with. You create a 700-data-point held-out test set and use the rest for training. The data furnished to you contains the students’ nationalities and parents’ occupations, as well as the GDP, inflation rate, and unemployment rate.
However, the results don’t seem impressive. The F1 score is low. So, naturally, you ask your dean to pull some strings to acquire more student records (perhaps from prior years or other schools), which they do over a couple of weeks. You rerun the experiment every time you get a new batch of student records. Conventional wisdom suggests that adding more data steadily improves the modeling process (the test F1 score should increase monotonically), but that’s not what you see. The performance fluctuates erratically as more data comes in. You are confused. Why would more data ever hurt performance? Why did the F1 score drop from 46% to 39% when one of the batches was added? Shouldn’t more data cause better performance?
Number of samples vs. performance: Even with cross-validated hyper-parameter tuning, both training and test F1 scores fluctuate as the number of samples increases. The impact of adding more samples can be messy and counter-intuitive. (Image generated by author using UCI dataset)
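If you want to recreate this kind of curve outside the hosted tool, one way is to grow the training set in batches and retune the tree each time. Here is a sketch that reuses the `X_train`/`X_test` split from the snippet above; the batch size and depth grid are assumptions:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

test_f1 = []
sample_sizes = range(200, len(X_train) + 1, 200)  # grow the training set in batches

for n in sample_sizes:
    # Cross-validated tuning of tree depth on the first n training samples.
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"max_depth": range(2, 11)},
        scoring="f1_macro",
        cv=5,
    )
    search.fit(X_train.iloc[:n], y_train.iloc[:n])

    # Score the tuned tree on the same fixed held-out test set every time.
    pred = search.best_estimator_.predict(X_test)
    test_f1.append(f1_score(y_test, pred, average="macro"))

print(list(zip(sample_sizes, test_f1)))
```

Plotting `test_f1` against `sample_sizes` gives a curve analogous to the one above, and there is no guarantee that it rises monotonically.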
Well, the question is really whether additional samples necessarily provide more information. Let’s first ponder the nature of these additional samples:
- They could be **false** (i.e., a bug in data collection).
- They could be **biased** (e.g., over-representing a special case that does not align with the true distribution as represented by the test set).
- The test set itself may be biased.
- **Spurious patterns** may be introduced by some batches and later cancelled by other batches.
- The attributes collected establish **little to no correlation or causation** with the target (i.e., there are lurking variables unaccounted for). So, no matter how many samples you add, they are not going to get you anywhere!
So, yes, adding more data is generally a good idea, but we must pay attention to inconsistencies in the data (e.g., two students of the same nationality and social status may end up on different paths due to other factors). We must also carefully assess the usefulness of the available attributes (e.g., perhaps GDP has nothing to do with student dropout rate).
Some may argue that this would not be an issue when you have lots of real data (after all, this is a relatively small dataset). There is merit to that argument, but only if the data is well homogenized and accounts for the different variabilities and “degrees of freedom” of the attribute set (i.e., the range of values each attribute can take and the possible combinations of these values as seen in the real world). Research has shown cases in which large, gold-standard datasets harbor biases in interesting and obscure ways that are not easy to spot at first glance, causing misleading reports of high accuracy [2].
Case 2: More Attributes
Now, speaking of attributes, let’s consider an alternative scenario in which your dean fails to acquire more student records. However, they come and say, “Hey you… I wasn’t able to get more student records… but I was able to use some SQL to get more attributes for your data… I am sure you can improve your performance now. Right?… Right?!”
Feature set vs. performance: Each vertical line shows a retraining of the decision tree (800 samples with cross-validated hyper-parameter tuning) with one additional attribute. Some attributes help (Mother’s occupation), while others hurt (Father’s occupation and Gender). More columns may sometimes mean more noise and more ways to overfit. (Image generated by author using UCI dataset)
Well, let’s put that to the test. Let’s look at the following example where we incrementally add more attributes, **expanding the students’ profile and including their marital, financial, and immigration statuses**. Each time we add an attribute, we retrain the tree and evaluate its performance. As you can see, while some increments improve performance, others actually hurt it. But again, why?
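In code, the incremental-attribute experiment looks roughly like the sketch below. It reuses the split from the earlier snippets; the attribute order is illustrative, and the depth grid is an assumption:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Attributes added one at a time (illustrative order, using the dataset's column names).
attribute_order = [
    "Marital status", "Course", "Mother's occupation",
    "Father's occupation", "Gender",
]

for k in range(1, len(attribute_order) + 1):
    cols = attribute_order[:k]

    # Retrain with cross-validated depth tuning on the current attribute subset.
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"max_depth": range(2, 11)},
        scoring="f1_macro",
        cv=5,
    )
    search.fit(X_train[cols], y_train)

    f1 = f1_score(y_test, search.best_estimator_.predict(X_test[cols]), average="macro")
    print(f"{k} attribute(s), last added {cols[-1]!r}: "
          f"test F1 = {f1:.3f}, best depth = {search.best_params_['max_depth']}")
```

The printed best depth also previews the point made further below: the tuned complexity does not grow monotonically as attributes are added.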
Looking at the attribute set more closely, we find that not all attributes actually carry useful information. The real world is messy… Some attributes (e.g., Gender) might introduce noise or spurious correlations in the training set that do not generalize to the test set (overfitting).
Also, while common wisdom says that as you add more data you should increase your model complexity, this practice does not always yield the best result. Sometimes, when adding an attribute, lowering model complexity may help with overfitting (e.g., when *Course* was introduced to the mix).
Feature set vs. tree depth: The optimal tree depth (chosen by grid search) fluctuates as attributes are added. Notice that more attributes do not always translate to a larger tree. (Image generated by author using UCI dataset)
Conclusion
Taking a step back and looking at the big picture, we see that while collecting more data is a noble cause, we should be careful not to automatically assume that performance will get better. There are two forces at play here: how well the model fits the training data, and how reliably that fit generalizes and extends to unseen data.
Let’s summarize how each type of “more data” influences these forces—depending on whether the added data is good (representative, consistent, informative) or bad (biased, noisy, inconsistent):
| | If data quality is good… | If data quality is poor… |
| --- | --- | --- |
| More samples (rows) | • Training error **may rise** slightly (more variation makes the data harder to fit). • Test error usually drops; the model becomes more stable and confident. | • Training error **may fluctuate** due to conflicting examples. • Test error often rises. |
| More attributes (columns) | • Training error usually **drops** (more signal leads to a richer representation). • Test error drops as attributes encode true, generalizable patterns. | • Training error usually **drops** (the model memorizes noisy patterns). • Test error rises due to spurious correlations. |
Generalization isn’t just about quantity—it’s also about quality and the right level of model complexity.
To wrap up, next time someone suggests that you should “simply” get more data to magically improve accuracy, discuss with them the intricacies of such a plan. Talk about the characteristics of the procured data in terms of nature, size, and quality. Point out the nuanced interplay between data and model complexities. This will help make their effort worthwhile!
Lessons to Internalize:
- Whenever possible, don’t take others’ (or my) word for it. Experiment yourself!
- When adding more data points for training, ask yourself: Do these samples represent the phenomenon you are modeling? Are they showing the model more interesting, realistic cases, or are they biased and/or inconsistent?
- When adding more attributes, ask yourself: Are these attributes hypothesized to carry information that enhances our ability to make better predictions, or are they mostly noise?
- Ultimately, conduct hyper-parameter tuning and proper validation to eliminate doubts when assessing how informative the new training data is (see the sketch after this list).
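As a concrete version of that last point, the sketch below (a hypothetical helper built on scikit-learn, reusing `X_train`/`y_train` from the earlier snippets; file and column names are assumptions) compares the cross-validated score of a depth-tuned tree before and after a new batch of records is appended:

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def tuned_cv_score(X, y):
    """Cross-validated macro F1 of a depth-tuned decision tree (hypothetical helper)."""
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"max_depth": range(2, 11)},
        scoring="f1_macro",
        cv=5,
    )
    search.fit(X, y)
    return search.best_score_


# Newly procured batch of records (hypothetical file and target column).
new_batch = pd.read_csv("new_student_records.csv")
X_new, y_new = new_batch.drop(columns=["Target"]), new_batch["Target"]

baseline = tuned_cv_score(X_train, y_train)
augmented = tuned_cv_score(pd.concat([X_train, X_new]), pd.concat([y_train, y_new]))
print(f"CV F1 before: {baseline:.3f}  |  after adding the new batch: {augmented:.3f}")
```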
Try it yourself!
If you’d like to explore the dynamics showcased in this article yourself, I host the interactive tool here. As you experiment by adjusting the sample size, number of attributes, and/or model depth, you will observe the impact of these adjustments on model performance. Such experimentation enriches your perspective and understanding of the mechanisms underlying data science and analytics.
References:
[1] M. V. Martins, D. Tolledo, J. Machado, L. M. T. Baptista, and V. Realinho, “Early prediction of student’s performance in higher education: a case study” (2021), Trends and Applications in Information Systems and Technologies, vol. 1, Advances in Intelligent Systems and Computing series, Springer. DOI: 10.1007/978-3-030-72657-7_16. The dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, which allows sharing and adaptation for any purpose, provided that appropriate credit is given.
[2] Z. Liu and K. He, A Decade’s Battle on Dataset Bias: Are We There Yet? (2024), arXiv preprint, https://arxiv.org/abs/2403.08632