of the universe (made by one of the most iconic singers ever) says this:
Wish I could go back / And change these years / I’m going through changes
Black Sabbath – Changes
This song is incredibly powerful and talks about how life can change right in front of you so quickly.
That song is about a broken heart and a love story. However, it also reminds me a lot of the changes that my job, as a Data Scientist, has undergone over the last 10 years of my career:
- When I started studying Physics, the only thing I thought of when someone said “Transformer” was Optimus Prime. Machine Learning for me was all about **Linear Regression, SVM, Random Forest, etc.** [2016]
- When I did my Master’s Degree in Big Data and Physics of Complex Systems, I first heard of “BERT” and various Deep Learning technologies that seemed very promising at that time. The first GPT models came out, and they looked very interesting, even though no one expected them to be as powerful as they are today. [2018-2020]
- Fast forward to my life now as a full-time Data Scientist. Today, if you don’t know what GPT stands for and have never read “Attention Is All You Need”, you have very little chance of passing a Data Science System Design interview. [2021 – today]
When people state that the tools and the everyday life of a person working with data are substantially different than 10 (or even 5) years ago, I agree all the way. What *I do not agree* with is the idea that the tools used in the past should be erased just because everything now seems to be solvable with GPT, LLMs, or Agentic AI.
The goal of this article is to consider a single task, which is **classifying the love/hate/neutral intent of a Tweet.** In particular, we will do it with **traditional Machine Learning, Deep Learning, and Large Language Models.**
We will do this hands-on, using Python, and we will describe why and when to use each approach. Hopefully, after this article, you will learn:
- The tools used in the early days should still be considered, studied, and at times adopted.
- Latency, Accuracy, and Cost should be evaluated when choosing the best algorithm for your use case
- Changes in the Data Scientist world are necessary and to be embraced without fear 🙂
Let’s get started!
1. The Use Case
The case we are dealing with is something that is widely adopted in Data Science/AI applications: sentiment analysis. This means that, given a text, we want to infer the “feeling” of its author. This is very useful when you want to gather the sentiment behind a review of a product, a movie, an item you are recommending, etc.
In this blog post, we are using a very “famous” sentiment analysis example, which is classifying the feeling behind a tweet. As I wanted more control, we will not work with organic tweets scraped from the web (where labels are uncertain). Instead, we will be using content generated by Large Language Models that we can control.
This technique also allows us to tune the difficulty and the variety of the problem and to observe how different techniques react.
- Easy case: the love tweets sound like postcards, the hate ones are blunt, and the neutral messages talk about weather and coffee. If a model struggles here, something else is off.
- Harder case: still love, hate, and neutral, but now we inject sarcasm, mixed tones, and subtle hints that demand attention to context. We also have less data to train with.
- Extra Hard case: we move to five emotions (love, hate, anger, disgust, envy), so the model has to parse richer, more layered sentences. Moreover, we have zero labeled entries: we cannot do any training.
I have generated the data and placed each file in a specific folder of the public GitHub repository I created for this project [data].
Our goal is to build a smart classification system that will be able to efficiently grasp the sentiment behind the tweets. But how shall we do it? Let’s figure it out.
2. System Design
A picture that is always extremely helpful to consider is the following:
Image made by author
Accuracy, cost, and scale in a Machine Learning system form a triangle. You can only fully optimize two at the same time.
You can have a very accurate model that scales very well with millions of entries, but it won’t be quick. You can have a quick model that scales with millions of entries, but it won’t be that accurate. You can have an accurate and quick model, but it won’t scale very well.
These considerations are abstracted from the specific problem, but they help guide which ML System Design to build. We will come back to this.
Also, the power of our model should be proportional to the size of our training set. In general, we want to avoid driving the training error down at the cost of an increase in the test error (the famous overfitting).
Image made by author
We don’t want to be in the Underfitting or Overfitting area. Let me explain why.
In simple terms, underfitting happens when your model is too simple to learn the real pattern in your data. It is like trying to draw a straight line through a spiral. Overfitting is the opposite. The model learns the training data too well, including all the noise, so it performs great on what it has already seen but poorly on new data. The sweet spot is the middle ground, where your model understands the structure without memorizing it.
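A toy example makes the picture concrete. The snippet below is not from the original article, just an illustration: it fits polynomials of increasing degree to noisy samples of a sine wave. The most flexible model drives the training error down, while typically generalizing worse than the middle one.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)  # noisy samples

x_new = np.linspace(0, 1, 50)       # unseen points
y_new = np.sin(2 * np.pi * x_new)   # true underlying signal

errors = {}
for degree in (1, 3, 15):           # too simple / about right / too flexible
    coefs = np.polyfit(x, y, degree)
    errors[degree] = (
        np.mean((np.polyval(coefs, x) - y) ** 2),          # training error
        np.mean((np.polyval(coefs, x_new) - y_new) ** 2),  # test error
    )
```

The degree-1 line underfits (high error everywhere), while the degree-15 polynomial memorizes the noise: a near-zero training error that does not translate into a low test error.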
We will come back to this one as well.
3. Easy Case: Traditional Machine Learning
We open with the friendliest scenario: a highly structured dataset of 1,000 tweets that we generated and labelled. The three classes (positive, neutral, negative) are balanced on purpose, the language is very explicit, and every row lives in a clean CSV.
Let’s start with a simple import block of code.
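The import cell itself is not reproduced here; a minimal sketch of what it might contain, assuming pandas and scikit-learn (the CSV path is hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Load the easy dataset (the path below is hypothetical; the real files
# live in the data folder of the project's GitHub repository)
# df = pd.read_csv("data/easy_tweets.csv")
```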
Let’s see what the dataset looks like:
Image made by author
Now, we anticipate that this won’t scale for millions of rows (because the dataset is too structured to be diverse). However, we can build a very quick and accurate method for this tiny and specific use case. Let’s start with the modeling. Three main points to consider:
- We are doing train/test split with 20% of the dataset in the test set.
- We are going to use a TF-IDF approach to get the embeddings of the words. TF-IDF stands for Term Frequency–Inverse Document Frequency. It is a classic technique that transforms text into numbers by giving each word a weight based on how important it is in a document compared to the whole dataset.
- We will combine this technique with two ML models: Logistic Regression and Support Vector Machines, from scikit-learn. Logistic Regression is simple and interpretable, often used as a strong baseline for text classification. Support Vector Machines focus on finding the best boundary between classes and usually perform very well when the data is not too noisy.
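The three steps above can be sketched as follows. This is a minimal illustration, not the article's exact code: it uses a tiny inline stand-in for the real 1,000-tweet CSV, and the `text` and `label` column names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Tiny stand-in for the real dataset: explicit, well-separated classes
df = pd.DataFrame({
    "text": ["I love this so much", "I hate everything about this",
             "The weather is cloudy today", "What a wonderful day with you",
             "This is the worst thing ever", "Coffee at 8am as usual"] * 10,
    "label": ["love", "hate", "neutral"] * 20,
})

# 80/20 train/test split, stratified to keep the classes balanced
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42,
    stratify=df["label"])

results = {}
for model in (LogisticRegression(max_iter=1000), LinearSVC()):
    clf = make_pipeline(TfidfVectorizer(), model)  # TF-IDF features -> classifier
    clf.fit(X_train, y_train)
    results[type(model).__name__] = accuracy_score(y_test, clf.predict(X_test))
print(results)
```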
And the performance is essentially perfect for both models.
Image made by author
For this very simple case, where we have a consistent dataset of 1,000 rows, a traditional approach gets the job done. No need for billion-parameter models like GPT.
4. Hard Case: Deep Learning
The second dataset is still synthetic, but it is designed to be annoying on purpose. Labels remain love, hate, and neutral, yet the tweets lean on sarcasm, mixed tone, and backhanded compliments. On top of that, the training pool is smaller while the validation slice stays large, so the models work with less evidence and more ambiguity.
Now that we have this ambiguity, we need to bring out the bigger guns. There are Deep Learning embedding models that maintain strong accuracy and still scale well in these cases (remember the triangle and the error versus complexity plot!). In particular, Deep Learning embedding models learn the meaning of words from their context instead of treating them as isolated tokens.
For this blog post, we will use BERT, which is one of the most famous embedding models out there. Let’s first import some libraries:
… and some helpers.
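The helper code is not reproduced here; a minimal sketch of the core piece is shown below: a mean-pooling function that turns per-token BERT vectors into one sentence embedding. The commented part shows how it would plug into the standard `bert-base-uncased` Hugging Face checkpoint (downloading it requires internet access, so the call is left commented).

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of each sentence, ignoring padding.

    token_embeddings: (batch, seq_len, hidden) array
    attention_mask:   (batch, seq_len) array of 0/1
    """
    mask = attention_mask[..., None].astype(float)   # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # sum only real tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid divide-by-zero
    return summed / counts

# How it would be used with BERT:
# from transformers import AutoTokenizer, AutoModel
# import torch
# tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# bert = AutoModel.from_pretrained("bert-base-uncased")
# enc = tok(list_of_tweets, padding=True, truncation=True, return_tensors="pt")
# with torch.no_grad():
#     out = bert(**enc).last_hidden_state              # (batch, seq_len, 768)
# X = mean_pool(out.numpy(), enc["attention_mask"].numpy())
```

The resulting sentence vectors can then feed the same scikit-learn classifiers used in the easy case.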
Thanks to these functions, we can quickly evaluate our embedding model vs the TF-IDF approach.
Image made by author
As we can see, the TF-IDF model severely underperforms on the positive label, while the embedding model (BERT) preserves high accuracy.
5. Extra Hard case: LLM Agent
Ok, now let’s make things VERY hard:
- We only have 100 rows.
- We assume we do not know the labels, meaning we cannot train any machine learning model.
- We have five labels: envy, hate, love, disgust, anger.

As we cannot train anything but still want to perform our classification, we must adopt a method that already carries the classification knowledge within it. Large Language Models are the greatest example of such a method.
Note that if we used LLMs for the other two cases, it would be like shooting a fly with a cannon. But here, it makes perfect sense: the task is challenging, and we have no way to do anything smarter, because we cannot train our model (we don’t have the training set).
In this case, we get accuracy at scale. However, each API call takes some time, so we have to wait a second or two before the response comes back (remember the triangle!).
Let’s import some libraries:
And this is the classification API call:
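The call itself is not reproduced here; below is a minimal sketch assuming the OpenAI chat completions API. The model name and client setup are assumptions, and the actual request is commented out since it needs an API key; the prompt-building and label-parsing helpers are the runnable part.

```python
LABELS = ["love", "hate", "anger", "disgust", "envy"]

def build_prompt(tweet: str) -> str:
    # Zero-shot: the instruction carries all the "training" the model gets
    return (
        "Classify the emotion of the following tweet as exactly one of: "
        + ", ".join(LABELS) + ".\n"
        "Answer with the single label only.\n\n"
        f"Tweet: {tweet}"
    )

def parse_label(raw: str) -> str:
    # LLMs sometimes add punctuation or casing; normalize defensively
    cleaned = raw.strip().lower().rstrip(".")
    return cleaned if cleaned in LABELS else "unknown"

# The API call (requires an OPENAI_API_KEY in the environment):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": build_prompt(tweet)}],
#     temperature=0,
# )
# label = parse_label(resp.choices[0].message.content)
```

Setting the temperature to 0 keeps the classification deterministic, and restricting the answer to a single label makes parsing trivial.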
And we can see that the LLM does an amazing classification job:
6. Conclusions
Over the past decade, the role of the Data Scientist has changed as dramatically as the technology itself. This might lead to the idea of just using the most powerful tools out there, but that is NOT the best route for many cases.
Instead of reaching for the biggest model first, we tested one problem through a simple lens: accuracy, latency, and cost.
In particular, here is what we did, step by step:
- We defined our use case as tweet sentiment classification, aiming to detect love, hate, or neutral intent. We designed three datasets of increasing difficulty: a clean one, a sarcastic one, and a zero-training one.
- We tackled the easy case using TF-IDF with Logistic Regression and SVM. The tweets were clear and direct, and both models performed almost perfectly.
- We moved to the hard case, where sarcasm, mixed tone, and subtle context made the task more complex. We used BERT embeddings to capture meaning beyond individual words.
- Finally, for the extra hard case with no training data, we used a Large Language Model to classify emotions directly through zero-shot learning.
Each step showed how the right tool depends on the problem. Traditional ML is fast and reliable when the data is structured. Deep Learning models help when meaning hides between the lines. LLMs are powerful when you have no labels or need broad generalization.
7. Before you head out!
Thank you again for your time. It means a lot ❤️
My name is Piero Paialunga, and I’m this guy here:
Image made by author
I’m originally from Italy, hold a Ph.D. from the University of Cincinnati, and work as a Data Scientist at The Trade Desk in New York City. I write about AI, Machine Learning, and the evolving role of data scientists both here on TDS and on LinkedIn. If you liked the article and want to know more about machine learning and follow my studies, you can:
A. Follow me on LinkedIn, where I publish all my stories
B. Follow me on GitHub, where you can see all my code
C. For questions, you can send me an email at piero.paialunga@hotmail