I'm back with my series of AI paper recommendations. My long-term followers might recall the four previous editions ([1], [2], [3], and [4]). I've been away from writing for quite some time, and I couldn't think of a better way to return than resuming my most successful series — and the one I enjoyed writing the most.
For the uninitiated, this is a very opinionated list, full of perspectives and tangents, meant to keep you updated on AI as a whole. This is not a state-of-the-art models list but real insights on what to look for in the coming years and what you might have missed from the past. The goal is to help you think critically about the state of AI.
In total, there are ten paper suggestions, each with a brief description of the paper’s contribution and explicit reasons why these papers are worth reading. Moreover, each has a dedicated further reading section with one or more tangents to explore.
Before we move on: back in my 2022 article, I kicked things off saying “we don’t need larger models; we need solutions” and *“do not expect me to suggest GPT nonsense here.”* Back then, I was pretty sure I would repeat myself in the future, that a new GPT model would just be a larger and marginally better model, but far from groundbreaking. However, credit where credit is due: since its release, ChatGPT has sparked many new solutions and is certainly a turning point for all of computer science.
Last but not least, as a small disclaimer, most of my AI work centers around Computer Vision, so there are likely many excellent papers out there on topics such as Reinforcement Learning, Graphs, and Audio that are just not under my radar. If there is any paper you believe I should know, please let me know ❤.
Let’s go!
#1 DataPerf: A Benchmark for Data Centric AI
Mazumder, Mark, et al. “Dataperf: Benchmarks for data-centric ai development.” arXiv preprint arXiv:2207.10062 (2022).
From 2021 to 2023, Andrew Ng was very vocal about data-centric AI: shifting our focus from evolving models over static datasets to evolving the datasets themselves, while holding the models fixed or mostly unchanged. In their own words, our current model-centric research philosophy neglects the fundamental importance of data.
In practical terms, it is often the case that increasing the dataset size, correcting mislabeled entries, and removing bogus inputs is far more effective at improving a model’s output than increasing its size, number of layers, or training time.
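To make that concrete, here is a minimal sketch (mine, not the paper's) of one common data-centric step: flagging potentially mislabeled examples by checking where a cross-validated model assigns low probability to the given label. It assumes integer class labels and a scikit-learn-style feature matrix; the threshold is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def suspicious_labels(X, y, threshold=0.1):
    """Return indices of examples whose given label looks unlikely."""
    # Out-of-fold probabilities: every example is scored by a model
    # that never saw it during training.
    probs = cross_val_predict(
        LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
    )
    # Probability the model assigns to the label the dataset claims.
    prob_of_given_label = probs[np.arange(len(y)), y]
    return np.where(prob_of_given_label < threshold)[0]
```

Reviewing (or simply dropping) the flagged rows is often worth more than another round of hyperparameter tuning.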
In 2022, the authors proposed DataPerf, a benchmark for data-centric AI development, including tasks on speech, vision, debugging, acquisition, and adversarial problems, alongside the DataPerf working group. The initiative aims to foster data-aware methods and seeks to close the gap between the data departments of many companies and academia.
Reason 1: Most, if not all, companies working on niche topics end up developing internal datasets. It is wild how little research exists on how to do this properly/better.
Reason 2: A reflection: how many papers provide a solid 2% improvement over the State-of-the-Art (SOTA) nowadays? How much additional data would you need to boost your accuracy by 2%?
**Reason 3:** For the rest of your career, you might wonder, what if instead of doing the proposed X, we just collected more data?
Reason 4: If you are in academia, stuck with some X or Y dataset, trying to figure out how to get 0.1% improvement over SOTA, know that life can be much more than that.
Further Reading: In 2021, it all began with DeepLearning.AI hosting a data-centric AI competition. You can read about the winner's approach here. Since then, there has been plenty of work dedicated to the subject by other authors, for instance, 2023's Data-centric Artificial Intelligence: A Survey. Finally, if you are a talks kind of person, there are many by Andrew Ng on YouTube championing the topic.
#2 GPT-3 / LLMs are Few-Shot Learners
Brown, Tom, et al. “Language models are few-shot learners.” Advances in neural information processing systems 33 (2020): 1877–1901.
This NeurIPS paper presented GPT-3 to the world. OpenAI's third-generation model was in almost every way just a bigger GPT-2, with 116 times more parameters, trained on 50 times more data. Their biggest finding wasn't that it was simply "better," but that how you prompted it could drastically improve its performance on many tasks.
Machine Learning models are often described as predictable functions: given the same input, they will always yield the same output. With current Large Language Models (LLMs), on the other hand, the same question can be posed (and answered) in many different ways. Wording matters.
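The paper's translation demo captures the idea: instead of fine-tuning on translation pairs, you show a handful of examples inside the prompt itself (paraphrasing the paper's figure; no gradient update involved):

```python
# Zero-shot: just ask.
zero_shot = "Translate English to French: cheese =>"

# Few-shot: a handful of in-context examples teach the task format.
few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)
```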
Reason 1: Previously, we discussed keeping models static while we evolve the dataset. With LLMs, we can evolve the questions we ask.
**Reason 2:** GPT-3 sparked the field of prompt engineering. After it, we started seeing authors propose techniques like Chain-of-Thought (CoT) and Retrieval-Augmented Generation (RAG).
**Reason 3:** Prompting well is far more important than knowing how to train or finetune LLMs. Some people say prompting is dead, but I don't see that ever happening. Ask yourself: do you word requests the same way when addressing your boss versus your mom or friends?
**Reason 4:** When transformers came out, most research targeted their training/inference speed and size. Prompting is a genuinely fresh topic in natural language processing.
Reason 5: It’s funny when you realize that the paper doesn’t really propose anything; it just makes an observation. Has 60k citations, though.
**Further Reading:** Prompting reminds me of ensemble models. Instead of repeatedly prompting a single model, we would train several smaller models and aggregate their outputs. Now nearly three decades old, the AdaBoost paper is a classic on the topic and a read that will take you back to way before even word embeddings were a thing. Fast forward to 2016: a modern classic is XGBoost, which is now on its v3 release.
#3 Flash Attention
Dao, Tri, et al. “FlashAttention: Fast and memory-efficient exact attention with io-awareness.” Advances in Neural Information Processing Systems 35 (2022): 16344–16359.
Since the 2017 groundbreaking paper “Attention is All You Need” introduced the Transformer architecture and the attention mechanism, several research groups have dedicated themselves to finding a faster and more scalable alternative to the original quadratic formulation. While many approaches were devised, none has really emerged as a clear successor to the original work.
The original Attention formulation. The softmax term represents how important each token is to each query (so for N tokens, we have N² attention scores). The “transform” (in the name Transformer) is the multiplication between this N² attention map and the N-sized V vector (much like a rotation matrix “transforms” a 3D vector)
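For reference, here is the formulation the caption describes, where Q, K, and V are the N×d query, key, and value matrices and d_k is the key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```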
In this work, the authors do not propose a new formulation or a clever approximation to the original formula. Instead, they present a fast GPU implementation that makes better use of the (complicated) GPU memory structure. The proposed method is significantly faster while having little to no drawbacks over the original.
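You rarely have to touch those kernels directly nowadays. As a hedged example, recent PyTorch versions expose `torch.nn.functional.scaled_dot_product_attention`, which can dispatch to a FlashAttention-style backend when the device, dtypes, and shapes allow it:

```python
import torch
import torch.nn.functional as F

# (batch, heads, sequence length, head dim) in half precision on a GPU,
# the typical setup where the fused FlashAttention kernel is eligible.
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Same result as softmax(QK^T / sqrt(d)) V, but computed tile by tile,
# without materializing the full N x N attention map in GPU memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```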
Reason 1: Many research papers get rejected because they are just new implementations or not “novel enough”. Sometimes, that’s all we need.
**Reason 2:** Research labs crave the attention of being the new Attention, to the point that it's hard for any new Attention to ever get enough attention. In this instance, the authors only improve what already works.
**Reason 3:** In retrospect, ResNet was groundbreaking for CNNs back in the day, proposing the residual block. In the following years, many proposed enhancements to it, varying the residual-block idea. Despite all that effort, most people just stuck with the original. In a research field as crowded as AI, it's best to remain cautious about anything that has many proposed successors.
**Further Reading:** From time to time, I consult Sik-Ho Tsang's list of papers he reviews here on Medium. Each section reveals the leading ideas in each area over the years. It is a bit sad how many of these papers might have seemed groundbreaking at the time and are now completely forgotten. Back to Attention: as of 2025, the hottest attention-replacement candidate is the Sparse Attention by the DeepSeek team.
#4 Training NNs with Posits
Raposo, Gonçalo, Pedro Tomás, and Nuno Roma. “Positnn: Training deep neural networks with mixed low-precision posit.” ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.
Taking a turn into the world of hardware and low-level optimization, some of the most important (but least sexy) advancements in AI training are related to floating-point formats. We went from boring 32-bit floats to halves (FP16), then 8-bit and even 4-bit floats (FP4). The horsepower driving LLMs today comes from eight-bit ponies.
The future of number formats goes hand in hand with matrix-matrix multiplication hardware. However, there can be much more to this topic than just halving bit-depth. This paper, for instance, explores a totally new number format (posits) as a potential replacement for good old IEEE-754 floats. Can you imagine a future *sans* floats?
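Posits aside, a quick way to feel what shrinking the bit-depth does is to push one value through progressively narrower formats. A small sketch with PyTorch dtypes (the FP8 dtype only ships with fairly recent builds, and the printed values are approximate):

```python
import torch

x = torch.tensor(3.14159265, dtype=torch.float32)

# FP16: 5 exponent / 10 mantissa bits.
print(x.to(torch.float16).item())                          # ~3.140625
# BF16: 8 exponent / 7 mantissa bits (FP32 range, less precision).
print(x.to(torch.bfloat16).item())                         # ~3.140625
# FP8 (e4m3): 4 exponent / 3 mantissa bits; values get very coarse.
print(x.to(torch.float8_e4m3fn).to(torch.float32).item())  # ~3.25
```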
Reason 1: While new algorithms take time to find widespread adoption, hardware improves consistently every year. All ships rise with the hardware tide.
**Reason 2:** It's worth questioning how far we would be today if we didn't have as many GPU improvements over the past ten years. For reference, the AlexNet authors broke all ImageNet records in 2012 using two high-end GTX 580 GPUs, about 3 TFLOPS in total. Nowadays, a mid-range GPU such as an RTX 5060 boasts ~19 TFLOPS, roughly six times more.
**Reason 3:** Some technologies are so common that we take them for granted. All things can and should be improved; we don't owe anything to floats (or even to Neural Networks, for that matter).
**Further Reading:** Since we are mentioning hardware, it's also a good time to talk about programming languages. If you haven't been keeping up with the news, the Python team (especially Python's creator) is focused on optimizing Python. However, "optimization" nowadays seems to be slang for rebuilding stuff in Rust. Last but not least, some hype was devoted to Mojo, an AI/speed-focused superset of Python; however, I barely see anyone talking about it today.
#5 AdderNet
Chen, Hanting, et al. “AdderNet: Do we really need multiplications in deep learning?.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.
What if we didn’t do matrix multiplication at all? This paper goes a totally different route, showing it is possible to have effective neural networks without matrix multiplication. The main idea is to replace convolutions with computing the L1 difference between the input and the sliding filters.
I like to think of this paper as neural networks from an alternate world. In some parallel universe, NNs evolved based on addition, and amidst it all, someone proposed a multiplication-based model; however, it never got traction, since all the tooling and hardware were neck-deep in optimizing massive matrix addition and subtraction operators.
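As a rough sketch of the idea (mine, not the authors' implementation), an "adder" layer scores each sliding window by its negative L1 distance to each filter instead of a dot product:

```python
import torch
import torch.nn.functional as F

def adder2d(x, weight, stride=1, padding=0):
    """AdderNet-style layer: negative L1 distance instead of convolution."""
    n, c_in, h, w = x.shape
    c_out, _, k, _ = weight.shape
    # Extract sliding patches: (N, C_in*k*k, L), one column per output position.
    patches = F.unfold(x, kernel_size=k, stride=stride, padding=padding)
    w_flat = weight.view(c_out, -1)  # (C_out, C_in*k*k)
    # -sum |patch - filter| for every (patch, filter) pair: (N, C_out, L).
    out = -(patches.unsqueeze(1) - w_flat.unsqueeze(0).unsqueeze(-1)).abs().sum(dim=2)
    h_out = (h + 2 * padding - k) // stride + 1
    w_out = (w + 2 * padding - k) // stride + 1
    return out.view(n, c_out, h_out, w_out)

x = torch.randn(1, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
print(adder2d(x, w, padding=1).shape)  # torch.Size([1, 4, 8, 8])
```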
Reason 1: We easily forget there are still other algorithms out there we have yet to find, besides CNNs and Transformers. This paper shows that an addition-based neural network is possible; how cool is that?
**Reason 2:** A lot of our hardware and cloud infrastructure is tuned for matrix multiplication and neural networks. Can new models still compete? Can non-neural approaches still make a comeback?
**Further Reading:** Many of you might not be familiar with what existed before NNs took over most fields. Most people know staples like Linear Regression, Decision Trees, and XGBoost. Before NNs became popular, Support Vector Machines were all the rage. It's been a while since I last saw one. In this regard, a cool paper to read is Deep Learning is Not All You Need.
Support Vector Machines learn to separate two groups of points with the best separation line possible. By using the Kernel Trick, these points are cast into a higher-dimensional space, in which a better separation plane might be found, achieving a non-linear decision boundary while maintaining the linear formulation. It's a brilliant solution worth learning about. Source.
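If you have never trained one, a kernelized SVM is a few lines in scikit-learn; the hyperparameters below are just the defaults:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable in the 2D plane.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# The RBF kernel implicitly maps points to a higher-dimensional space,
# so the linear separator there becomes a curved boundary in 2D.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(f"training accuracy: {clf.score(X, y):.2f}")
```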
#6 Interpolation vs Extrapolation
Balestriero, Randall, Jerome Pesenti, and Yann LeCun. “Learning in high dimension always amounts to extrapolation.” arXiv preprint arXiv:2110.09485 (2021).
Some time ago, I used to think the big names in AI were visionaries or at least had very good educated guesses about the future of the field. This changed with this paper and all the debate that followed.
Back in 2021, Yann LeCun pushed this discussion about interpolation vs. extrapolation, claiming that in high-dimensional spaces, such as the ones neural networks operate in, what we call "learning" is data extrapolation. Right after publication, many renowned names joined in, some claiming this was nonsense, some that it was still interpolation, and some taking the extrapolation side.
If you have never heard about this discussion, that alone shows how pointless it really was. As far as I could see (and please write me if you think otherwise), no company changed course, no new extrapolation-aware model was devised, nor did it spark new relevant training techniques. It came and it went.
Reason 1: To be honest, you can just skip this one. I just needed to rant about this for my own peace of mind.
**Reason 2:** From a purely academic point of view, I consider this an interesting take on learning theory, which is indeed a cool topic.
**Further Reading:** Yoshua Bengio, Geoffrey Hinton, and Yann LeCun were awarded the 2018 Turing Award for their pioneering work on Deep Learning foundations. Back in 2023 or so, LeCun was focused on self-supervised learning, Hinton was concerned with Capsule Networks, and Bengio was looking at Generative Flow Networks. By late 2025, LeCun had moved towards world models, while Hinton and Bengio had moved towards AI Safety. If you are second-guessing your academic choices, keep in mind that even the so-called godfathers switch gears.
#7 DINOv3 / Foundation Vision Models
Siméoni, Oriane, et al. “DINOv3.” arXiv preprint arXiv:2508.10104 (2025).
While the world of language processing has evolved towards big universal models that work for every task (aka foundation models), the field of image processing is still working its way up to that. In this paper, we see the current iteration of the DINO model, a self-supervised image model designed to serve as a foundation model for vision.
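To get a feel for what "foundation" means here, the previous iteration (DINOv2) is publicly available through torch.hub and works as a frozen, general-purpose feature extractor. A small sketch, assuming the `facebookresearch/dinov2` hub entry point (the output size noted below is for the ViT-S/14 variant):

```python
import torch

# Load a small pretrained DINOv2 backbone (trained without any labels).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# Any 3-channel image whose height and width are multiples of the 14-px patch.
img = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    features = model(img)  # one global embedding per image
print(features.shape)      # e.g. torch.Size([1, 384])
```

A linear probe or a tiny head on top of these frozen features is often enough for classification or retrieval tasks.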
Reason 1: Compared to text, self-supervised pretraining is still maturing in other problem areas, especially when done entirely within the problem domain (as opposed to adding text descriptions to help).
**Reason 2:** Don't read only language papers, even if your job is working with LLMs. Variety is key.
Reason 3: Language models can only go so far towards AGI. Vision is paramount for human-like intelligence.
**Further Reading:** Continuing on the vision topic, it is worth knowing about YOLO and the Segment Anything Model. The former is a staple for object detection (but also boasts versions for other problems), while the latter is for image segmentation. Regarding image generation, I find it funny that a few years back we would all talk about GANs (generative adversarial networks), and nowadays it is probable that many of you have never heard of one. I even wrote a list like this for GAN papers many years ago.
#8 Small Language Models are the Future
Belcak, Peter, et al. “Small Language Models are the Future of Agentic AI.” arXiv preprint arXiv:2506.02153 (2025).
The field of "Generative AI" is quickly being rebranded to "Agentic AI". As people try to grasp how to make money with that, they bleed VC money running behemoth models. In this paper, the authors argue that Small Language Models (under 10B parameters, by their definition) are the future of Agentic AI development.
In more detail, they argue that most subtasks executed in agentic solutions are repetitive, well-defined, and non-conversational. Therefore, LLMs are somewhat of an overkill. Add fine-tuning to the mix, and SLMs can easily become specialized agents, whereas LLMs thrive on open-ended tasks.
**Reason 1:** What we call "large" language models today might just as well be the "small" of tomorrow. Learning about SLMs is future-proofing.
**Reason 2:** Many people claim AI today is heavily subsidized by VC money. In the near future, we might see a huge increase in AI costs. Using SLMs might be the only option for many businesses.
**Reason 3:** This one is super easy to read. In fact, I think it is the first time I have read a paper that so explicitly defends a thesis.
**Further Reading:** Smaller models are the only option for edge AI / low-latency execution. When applying AI to video streams, the model plus post-processing needs to execute in less than 33 ms for a 30 fps stream. You can't round-trip to a cloud or batch frames. Nowadays, there is a variety of tools, such as Intel's OpenVINO, NVIDIA's TensorRT, and TensorFlow Lite, for fast inference on limited hardware.
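To make that 33 ms budget concrete, here is a rough timing check. I'm using ONNX Runtime here simply as an example of an inference runtime; the model path, input name, and shape are placeholders for whatever you have exported:

```python
import time
import numpy as np
import onnxruntime as ort

# Hypothetical exported model; any ONNX file with a single image input works.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
frame = np.random.rand(1, 3, 640, 640).astype(np.float32)

# Warm up once, then time a single-frame inference against the budget.
sess.run(None, {input_name: frame})
start = time.perf_counter()
sess.run(None, {input_name: frame})
latency_ms = (time.perf_counter() - start) * 1000
print(f"{latency_ms:.1f} ms per frame (budget: ~33 ms at 30 fps)")
```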
#9 The Lottery Ticket Hypothesis (2019)
Frankle, Jonathan, and Michael Carbin. “The lottery ticket hypothesis: Finding sparse, trainable neural networks.” arXiv preprint arXiv:1803.03635 (2018).
As a follow-up to small models, some authors have shown that we most likely aren’t training our networks’ parameters to their fullest potential. This is “humans only use 10% of their brains” applied to neural networks. In this literature, the Lottery Ticket Hypothesis is surely one of the most intriguing papers I’ve seen.
Frankle *et al.* found that if you (1) train a big network, (2) prune all low-valued weights, (3) roll the pruned network back to its untrained state, and (4) retrain, you will get a better-performing network. Putting it differently, what training does is uncover a subnetwork whose initial random parameters happen to be aligned with solving the problem; all else is noise. By leveraging this subnetwork alone, we can surpass the original network's performance. Unlike basic network pruning, this improves the result.
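Here is a sketch of one round of that recipe in PyTorch (mine, not the authors' code): magnitude pruning per layer, with the surviving weights rewound to their original initialization. `train` stands in for your usual training loop.

```python
import copy
import torch

def lottery_ticket_round(model, train, prune_fraction=0.2):
    # (0) Remember the untrained initialization.
    init_state = copy.deepcopy(model.state_dict())

    # (1) Train the dense network.
    train(model)

    # (2) Mask out the lowest-magnitude weights in each weight matrix.
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() > 1:  # skip biases and norm parameters
            k = max(1, int(param.numel() * prune_fraction))
            threshold = param.detach().abs().flatten().kthvalue(k).values
            masks[name] = (param.detach().abs() > threshold).float()

    # (3) Roll the surviving weights back to their initial values.
    model.load_state_dict(init_state)
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])

    # (4) Retrain the sparse "winning ticket" (in practice the masks are
    #     also reapplied after each optimizer step to keep pruned weights at zero).
    train(model)
    return model, masks
```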
**Reason #1:** We are accustomed to "bigger models are better but slower" whereas "small models are dumb but fast". Maybe we are the dumb ones for always insisting on big models.
Reason #2: An open question is how underutilized our parameters are. Likewise, how can we use our weights to their fullest? Is it even possible to measure an NN's learning potential?
Reason #3: How many times have you cared about how your model parameters were initialized before training?
**Further Reading:** While this paper is from 2018, there is a 2024 survey on the hypothesis. On a contrasting note, "The Role of Over-Parameterization in Machine Learning — the Good, the Bad, the Ugly (2024)" discusses how over-parametrization is what really powers NNs. On the more practical side, this survey covers the topic of Knowledge Distillation, using a big network to train a smaller one to perform as close to it as possible.
#10 AlexNet (2012)
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.
Can you believe all this Neural Network content we see today really started just 13 years ago? Before that, NNs were somewhere between a joke and a failed promise. If you wanted a good model, you would use SVMs or a bunch of hand-engineered tricks.
In 2012, the authors proposed the use of GPUs to train a large Convolutional Neural Network (CNN) for the ImageNet challenge. To everyone’s surprise, they won first place, with a ~15% Top-5 error rate, against ~26% for the second place, which used state-of-the-art image processing techniques.
Reason #1: While most of us know AlexNet’s historical importance, not everyone knows which of the techniques we use today were already present before the boom. You might be surprised by how familiar many of the concepts introduced in the paper are, such as dropout and ReLU.
Reason #2: The proposed network had 60 million weights, complete insanity for 2012 standards. Nowadays, trillion-parameter LLMs are around the corner. Reading the AlexNet paper gives us a great deal of insight into how things have developed since then.
Further Reading: Following the history of ImageNet champions, you can read the ZF Net, VGG, Inception-v1, and ResNet papers. This last one achieved super-human performance, solving the challenge. After it, other competitions took over the researchers’ attention. Nowadays, ImageNet is mainly used to validate radical new architectures.
The original portrayal of the AlexNet structure. The top and bottom halves are processed by GPUs 1 and 2, respectively. An early form of model parallelism. Source: The AlexNet Paper
This is all for now. Feel free to comment or connect with me if you have any questions about this article or the papers. Writing such lists is A LOT OF WORK. If this was a rewarding read for you, please be kind and share it among your peers. Thank you!