How I Use LLMs for ML Research

I have seen some really good posts recently about how different people are using Claude Code. Over the last few months I have converged to my own Frankenstein LLM setup that seems to be working quite well for now. My day to day role (applied ML Research) at a very small and lean startup (Eightsleep) might be a factor, so I will also try to share my mental models behind my choices.

We are hiring for ML/AI Researchers and ML/AI Engineers at Eightsleep. So if you can build fast, take ownership and want to rethink how health and longevity works in an AI enabled future, DM me or apply.

My setup has 3 pillars -

High knowledge - deep and/or novel research 1.

**High t…

We are hiring for ML/AI Researchers and ML/AI Engineers at Eightsleep. So if you can build fast, take ownership and want to rethink how health and longevity works in an AI enabled future, DM me or apply.

My setup has 3 pillars -

High knowledge - deep and/or novel research 1.

High taste - build and optimize frameworks/libraries 1.

High steering - interactive debugging for code/data

AFAICT OpenAI is still king when it comes to highly technical research answers.

Whether I want to to discuss a SOTA paper or brainstorm variants and adaptations of SOTA to my problem or create a novel loss function, I turn to ChatGPT.

There is a elegant simplicity to the ChatGPT webapp, much like Google Search, its just me and chat riffing on ideas. I just use the top model of the time, started with GPT4, then moved to o1, then o3, now at GPT5.2.

I have a few projects each with their own custom system prompt. The one I use the most is the MLResearch, which sets the two of us (ChatGPT and I) as best friends and ML researcher colleagues. Basically asking it to be extremely competent, yet as disagreeable as necessary.

When building libraries and frameworks from scratch, taste is the most essential quality. What I am looking for is a highly opinionated agent that maintains the same style across the whole library, doesn’t over-engineer features and writes good documentation.

I used to do this in Cursor Plan mode till a couple of months ago, but recently moved to Claude Code. CC is just a step above in taste. Here is how you can tell.

Ask CC/GPT5/Gemini/Opus4 on Cursor, to write documentation about a plan or codebase. They are all pretty good, but in my experience, CC writes docs with half the tokens while keeping all the content.

The same goes with the code, CC is just more judicious with its variable naming and abstractions. I don’t understand Anthropic’s recipe for building CC but the advantages are obvious when I use it. so for building large libraries which, to be honest, I will never go through in detail, I use Claude Code.

The problem with CC is that it is too autonomous. Interrupting CC almost feels bad. It is not easy to inspect its output and build collaboratively. Picking up a project halfway leaves too many gaps in knowledge. One I get the basic training and evaluation pipelines set up, for the gritty details of ML/data work, I need a high touch tool. One that can build, use, improve, iterate with me in the loop.

Cursor has done a great job here. I really like Cursor’s input format, how it lets me plan and build and iterate back and forth. I use CC in Cursor so I can immediately go back and forth between the terminal and Cursor agent. I am quite agnostic to the model I use in Cursor, they are all pretty good and improve over time. I used sonnet 3 a lot, then moved to gemini2.5. I tried Cursor’s auto for while, but I think they throttle the intelligence, which would be quite rational. Right now I am using Opus 4.5.

I realize that my setup assumes being in the privileged position of having the Pro plan of Cursor/ChatGPT/ClaudeCode. Thankfully, I get them from work. But having used these tools for the last year or so, it is very clear to me software engineering as it was before is extinct.

I write 0 code now.

About once a day, one of the agents gets stuck in a loop and needs help to get unstuck. But even that is evolving. 3 months ago, I would have to write <~10-100 loc to unstuck a model, nowadays it might just be reading the code and spotting the bug, or just prompting the LLM to step back and explain itself so I can figure out whats is wrong.

It is a brave new world. I can’t believe this tech is real, and that we got here so fast. The recursive improvement loop has begun.

What a time to be alive.

No posts

Similar Posts