Data Science Weekly – Issue 630

Hello!

Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

And now…let’s dive into some interesting links from this week.

**Broken Chart: discover 9 visualization alternatives** I’ve wanted to write a post for a while about a graph that the Spanish Ministry for Ecological Transition publishes every month, summarizing the average monthly temperature in Spain. If we look closely, there is a misuse of the geometry type to present the temperature variable…So, what alternatives can we propose? Let’s not forget that our visualization is conditioned by its objective and also by the audience…

**Why You Should Read More Engineering Blogs** One of the early mistakes I made in my career was not reading engineering blogs. I used to think that the only way for an engineer to deepen their knowledge was by reading textbooks like Designing Data-Intensive Applications by Martin Kleppmann, Fundamentals of Software Architecture, and so on. While these are great references to have, they often miss the real-world challenges engineers face when architecting solutions for their customers, especially at scale. A great way to learn new patterns or solutions is through stories. That’s where engineering blogs can help. In this post, I’ll highlight some example stories and share my 25 favourite engineering blogs…

**The Most Useful, Timeless Skill to Learn as a Data Professional** When I started out in data all I wanted was to have a long and satisfying career, to be compensated well regardless of advances in technology or the whims of the market and of course to love what I did…In this issue, I’ll try to achieve three things:

First, I’ll show you why you should focus on finding these leverage points.

Second, I’ll give you ideas on how to learn them.

Finally, I’ll provide a couple of examples so you understand what I’m talking about…

Featured Book

By Allen B. Downey, author of Think Python, Think Bayes, and Think Stats

Now in Paperback!

Think more clearly about data, avoid common statistical pitfalls, and make better decisions in an uncertain world.

Allen speaking about the book: “Probably Overthinking It” Google Talk

“It demands more intellectual engagement than a typical pop science book, drawing readers in with its broad scope of topics and colorful storytelling.”

Book review by Implicit Assumptions: “Probably Overthinking It” Book Review

Available from Bookshop.org and Amazon (affiliate links).

* Want to be featured in the newsletter? Email us for details –> team@datascienceweekly.org

**Statistical Learning Theory and Chat-GPT** Statistical learning theory models generalization as follows. There is a data distribution from which all data – training, validation, test – are drawn independently and identically. The goal of the learner is to then learn a good approximation to this underlying distribution…Since Valiant (1984), there has been a large body of very beautiful mathematical work on when this works and under what conditions on the data distribution and the class of classifiers. My job in this post is to not go into the details of this work, but to talk about very high level insights that we get from the entire body. In this post I will describe what learning theory gets right about ChatGPT, and in the next post, I will talk about where the gaps are…

My “small data” pipeline checklist that saved me from building a fake-big-data mess [Reddit]

I work with datasets that are not huge (GBs to low TBs), but the pipeline still needs to be reliable. I used to overbuild: Kafka, Spark, 12 moving parts, and then spend my life debugging glue. Now I follow a boring checklist to decide what to use and what to skip. If you’re building a pipeline and you’re not sure if you need all the distributed toys, here’s the decision framework I wish I had earlier…

**Cheap science, real harm: the cost of replacing human participation with synthetic data** Driven by the goals of augmenting diversity, increasing speed, reducing cost, the use of synthetic data as a replacement for human participants is gaining traction in AI research and product development. This talk critically examines the claim that synthetic data can “augment diversity,” arguing that this notion is empirically unsubstantiated, conceptually flawed, and epistemically harmful…

**I’ve been writing ring buffers wrong all these years** So there I was, implementing a one element ring buffer. Which, I’m sure you’ll agree, is a perfectly reasonable data structure…It was just surprisingly annoying to write, due to reasons we’ll get to in a bit. After giving it a bit of thought, I realized I’d always been writing ring buffers “wrong”, and there was a better way…

**Pipeline Design Patterns for Data Engineers** In this post, we cover: a) What is a data pipeline, and b) 10 key design patterns, their principles, and practical applications for building effective data pipelines…

**Contra DSPy and GEPA** I must share my authentic truth: I hate DSPy and GEPA to the core of my being. I tried to like them. I really did. I went through the 45-minute Colab notebook. I even made an effort to add a flavor of GEPA to lm-deluge, my open-source LLM SDK, before giving up in a fit of rage (sorry, Claude)…My conclusion? Trying to treat LLM workflows as modular programs is (often) a mistake. It’s backwards, rigid, and the wrong fit for the most interesting tasks…In this post (polemic? rant?) I try to unpack the “why” behind the hate, and consider whether it’s possible to salvage the good parts of GEPA…

**Data management** Research data are like the water of science: When they stop flowing and dry up, everything withers and ultimately dies. In this chapter we discuss the principles and practices for good research data management and organization…

**A Linear-Time Alternative To t-SNE for Dimensionality Reduction and Fast Visualisation** Moving data visualisation from a Python notebook to a web browser usually demands a painful compromise: you either pay for a heavy GPU backend or you force the user to wait while JavaScript struggles through iterative algorithms. This article explores a third option: Sine Landmark Reduction (SLR)…We will cover:

Why t-SNE/UMAP are a poor fit for the browser

The idea of landmarks instead of all-pairs distances

How to build a synthetic “sine skeleton” in high-D

How linearised trilateration turns distances into coordinates

Two important refinements: alpha scaling and distance warping

A compact Python implementation of SLR you can experiment with today

MIT’s Missing Semester 2026

Over the years, we have helped teach several classes at MIT, and over and over we have seen that many students have limited knowledge of the tools available to them. Computers were built to automate manual tasks, yet students often perform repetitive tasks by hand or fail to take full advantage of powerful tools such as version control and text editors. In the best case, this results in inefficiencies and wasted time; in the worst case, it results in issues like data loss or inability to complete certain tasks…To help remedy this, we created a class that covers all the topics we consider crucial to be an effective computer scientist and programmer. The class is pragmatic and practical, and it provides hands-on introduction to tools and techniques that you can immediately apply in a wide variety of situations you will encounter…

**A Complete Guide to Spherical Equivariant Graph Transformers** A 2.5-hour breakdown of spherical equivariant graph neural networks (EGNNs) and a deconstruction of the SE(3)-Transformer model…This article will focus on a specific type of geometric GNN called Spherical Equivariant GNNs (Spherical EGNNs), which are extremely useful in tasks dealing with geometric graph representations of objects with rotational symmetries, like molecules and proteins. Then, we will describe a specific spherical EGNN called the SE(3)-Transformer that incorporates the self-attention mechanism for molecular property prediction…

Which LLM writes the best R code?

In a series of past blog posts, we evaluated how well various models generate R code. To do so, we used the vitals package, a framework for LLM evaluation. vitals contains functions for measuring the effectiveness of an LLM, as well as are, a dataset of challenging R coding problems and their solutions. We evaluated model performance on this set of coding problems…

Does anyone have DS job that is low stress? [Reddit] Started in DA and that was pretty low stress but boring. Mostly doing dashboard. Moved to DS and every project was high stress high priority with executive oversight. I experienced burn out and health issues. I got a low stress DS job just but it’s actually 100% DA so now I’m bored again. I want to go back to something more interesting like ML but don’t want all that stress again…

**The magic (image resampling) kernel** “The magic kernel” is, today, a colloquial name for Magic Kernel Sharp, the world’s gold standard image resizing algorithm that powers all the photos on Facebook and Instagram and has almost universally replaced the previous gold standard Lanczos kernels. Resized images are crystal clear and astoundingly free of the artifacts that all other algorithms produce, no matter how large the enlargement or how small the reduction. Best of all, Magic Kernel Sharp is actually more computationally efficient than the Lanczos kernels…

The Girl Named Florida

How Google Maps quietly allocates survival across London’s restaurants - and how I built a dashboard to see through it

Traditional ML is dead and I’m pissed about it [Reddit]

* Based on unique clicks. ** Please take a look at last week’s issue #629 here.

Beyond Vibes: How to Actually Evaluate AI Agents (Part 1)

How visual workflow automation can integrate with enterprise-scale agentic AI

Great Ideas in Theoretical Computer Science

My year in data visualisation

Computational Fluid Dynamics Course

jax-js: an ML library for the web

Spherical Voronoi - Directional Appearance as a Differentiable Partition of the Sphere

R Consortium’s 2025 in Review

**Looking to get a job? Check out our “Get A Data Science Job” Course** It is a comprehensive course that teaches you everything you need to know about getting a data science job, based on answers to thousands of reader emails like yours. The course has three sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.

**Promote yourself/organization to ~68,750 subscribers** by sponsoring this newsletter. 30-35% weekly open rate.

Thank you for joining us this week! :)

Stay Data Science-y!

All our best, Hannah & Sebastian
