The long-term advancement and equitable diffusion of AI technologies crucially depend on their development in the Open. In the US, a few stalwarts of open-source AI are protecting its future. Today, we are proud to make our first model contribution to the open-source canon with Rnj-1 (an homage to Ramanujan, pronounced “range-1”), a world-class pair of base and instruction-tuned large language models. In this blog, we will summarize the key capabilities of the models, briefly cover the background behind their development, and share our vision for what lies ahead.
Capabilities
Rnj-1 is an 8B model that roughly follows the open-source Gemma 3 architecture. We employ global self-attention and YaRN to extend the context to 32k. The Rnj-1 Base and Instruct models compare favorably against similarly sized open weight models.
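For readers curious about the mechanics, here is a minimal sketch of YaRN-style ("NTK-by-parts") RoPE frequency scaling for context extension. The base frequency, original context length, and ramp thresholds below are illustrative placeholders, not Rnj-1's actual configuration.

```python
import numpy as np

def yarn_inv_frequencies(head_dim, base=10000.0, orig_ctx=8192, target_ctx=32768,
                         beta_fast=32.0, beta_slow=1.0):
    """Illustrative YaRN-style scaling of RoPE inverse frequencies.

    High-frequency dimensions (many rotations over the original context) are
    kept as-is; low-frequency dimensions are interpolated by the extension
    factor; a linear ramp blends the two regimes.
    """
    scale = target_ctx / orig_ctx                                  # e.g. 4x extension
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    rotations = orig_ctx * inv_freq / (2 * np.pi)                  # rotations per dimension
    ramp = np.clip((rotations - beta_slow) / (beta_fast - beta_slow), 0.0, 1.0)
    return (inv_freq / scale) * (1.0 - ramp) + inv_freq * ramp     # blend interpolated and original
```

YaRN additionally applies a mild attention-temperature adjustment, omitted here for brevity.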
We report published numbers where available and internal reproductions otherwise. Pre-training FLOPs were estimated as 6nt, where n is the number of parameters and t is the token budget. GPT OSS 20B was evaluated with reasoning_effort=low. Qwen 3 8B was evaluated with thinking turned off.
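Concretely, the estimate is just the standard dense-transformer approximation; the token budget in the example below is hypothetical.

```python
def pretraining_flops(n_params: float, n_tokens: float) -> float:
    """Approximate pre-training compute as 6 * n * t (forward + backward passes)."""
    return 6 * n_params * n_tokens

# Hypothetical example: an 8B-parameter model trained on a 4T-token budget.
print(f"{pretraining_flops(8e9, 4e12):.2e}")  # ~1.92e+23 FLOPs
```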
Code Generation
On algorithmic code generation tasks like HumanEval+ and MBPP+, and broader coding tasks like BigCodeBench, both Rnj-1 Base and Instruct compete with the strongest open weight models of similar size, sometimes outperforming even larger models such as GPT OSS 20B. Below are two screen recordings illustrating the model’s ability to generate end-to-end applications in a multi-turn setting.
Agentic and Tool Use
Rnj-1 Instruct dominates the pack on agentic coding, one of our target abilities. SWE-bench performance is indicative of the model’s ability to tackle everyday software engineering tasks. We are an order of magnitude stronger than comparably sized models on SWE-bench and approach the capabilities of much larger models (leaderboard: SWE-bench-Verified bash-only). Here is a screen recording of its multi-turn actions in resolving a SWE-bench PR.
There is a surge of interest in developing models’ abilities to write performant code. Rnj-1 can learn to use a profiler to iteratively improve the efficiency of the code that it produces. For instance, on Enamel, which tasks the model with writing efficient solutions to algorithmic problems, Rnj-1 Instruct outperforms very strong baselines. Here is a screen recording of the model iterating with the profiler in the environment to write optimized code.
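To make the loop concrete, here is a simplified sketch of a profile-and-refine cycle like the one in the recording. The `model_client` interface and the `solve` entry point are hypothetical stand-ins, not part of any released Rnj-1 tooling, and sandboxing of the generated code is elided.

```python
import cProfile
import io
import pstats

def profile_report(fn, *args, top: int = 10) -> str:
    """Run fn under cProfile and return a text report of the hottest functions."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn(*args)
    profiler.disable()
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()

def refine_with_profiler(model_client, problem: str, test_input, rounds: int = 3) -> str:
    """Ask the model for a solution, then feed profiler output back as the
    observation for each subsequent refinement attempt."""
    code = model_client.generate(problem)
    for _ in range(rounds):
        namespace: dict = {}
        exec(code, namespace)                       # run generated code in a trusted sandbox only
        report = profile_report(namespace["solve"], test_input)
        code = model_client.generate(problem, feedback=report)
    return code
```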
Furthermore, Rnj-1 Instruct surpasses comparable models in tool use performance as measured by the Berkeley Function Calling Leaderboard (BFCL).
Mathematical Problem Solving and Scientific Reasoning
Rnj-1 Instruct has mathematical problem solving abilities that are on par with the strongest open weight models as measured by AIME’25, an advanced high school mathematics problem solving task. In addition, Rnj-1 Base is on par with similarly sized open weight base models on Minerva-MATH. On GPQA-Diamond – a task with questions in biology, physics, and chemistry designed to be difficult even for non-domain experts with access to the web – we land close to the best similarly sized models.
Quantization & Inference Performance
Our model is robust to quantization. As we go from BF16 to FP8 to NVFP4, we retain model quality while significantly boosting token throughput in prompt-heavy workloads. The token throughput numbers were computed on NVIDIA B200 GPUs with the KV cache dtype set to FP8 and a batch size of 128.
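For illustration, a throughput-oriented deployment might look like the vLLM snippet below; the checkpoint name is hypothetical, and these are not the exact engine flags from our benchmark setup.

```python
from vllm import LLM, SamplingParams

# Hypothetical quantized checkpoint id; kv_cache_dtype="fp8" mirrors the
# benchmark setting described above.
llm = LLM(
    model="EssentialAI/rnj-1-instruct-fp8",
    kv_cache_dtype="fp8",
    max_model_len=32768,
)

outputs = llm.generate(
    ["Write a Python function that checks whether a number is prime."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```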
The Journey to Rnj-1
In February of this year, Essential decided to go back to the basics. Research and product were both competing for a small team’s focus, hindering our ability to make deep contributions to either endeavor. Between the extremes of focusing all our energies on the model’s capabilities and deeply understanding the user’s environment, our talents and passions gravitated to the former. Our long-term view is that mastery over the technological machine from which capabilities emerge is a viable path to useful and enduring AI companies. We also believed that open-source AI would ultimately prevail, and we wanted to be an accelerator of a movement that could change the course of humanity.
We were immediately faced with our first and most spiritual decision to date: choosing between pre- and post-training. As the world was reckoning with the purported omnipotence of RL right after the release of DeepSeek R1, we believed that compression is a necessary component for simulating intelligence, and the predictive task of language model pre-training was the logical choice. We discovered early evidence of reflection and exploratory reasoning abilities in pre-training, vindicating our thesis that strong pre-training is necessary for downstream success. Our approach to the pre- versus post-training question was symbolic of our broader decision framework. We placed longer-term research and engineering bets based on their significance to our roadmap, and each bet was broken into milestones that were prioritized based on the resources at our disposal. Given the constant barrage of distractions in the field, staying focused on a few high-conviction ideas is uncomfortable but necessary for success. We do not expect all of our research ideas to work out, but our ability to focus on fundamentals will give us the tools to revise our beliefs and overcome failure.
We set four high-level goals to achieve by the end of 2025:
- Uncover signs of life or failure on our research bets.
- Set an unimpeachable standard for experimental rigor and engineering.
- Build a model that would be useful to our own work.
- Make a substantial contribution to the open-source AI ecosystem.
Ensuring goals 1 and 2 would create favorable conditions for goals 3 and 4.
To achieve our targets, we split the year into two phases, each phase culminating in a larger flagship model run. The large model runs would validate our most promising research results at larger scales. We employed 200M-2B models to rapidly navigate our experimental search space and chose an 8B dense Transformer as the threshold to balance iteration speed and robust evaluation of our methods. There is burgeoning evidence that signs of emergence appear early, and we believe that there are useful invariant signals at smaller scales that still elude us. In our experiments, not all signals at smaller scales were reliable and we had to reach for more expensive runs for higher signal-to-noise ratio. We believe this is an important area for future work. The figure below chronicles our progress through each phase.
Research and engineering wins were equally responsible for our consistent gains from one phase to another. We cover a few of our successes here, and for those who want to understand our work in greater detail, we will soon publish a technical report.
Research
In the long run, we expect our methods to automatically represent, transform, and blend data to optimize measurable abilities in pre-training. Our work on modeling data taxonomies led to new approaches for jointly clustering and mixing data distributions under data repetition penalties. Many improvements in our STEM abilities can be traced back to this.
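Our actual method is deferred to the technical report; as a toy illustration of mixing under a repetition penalty, one can start from quality-weighted cluster proportions and iteratively discount clusters whose implied epoch counts exceed a cap. The function below is purely illustrative and not our production recipe.

```python
import numpy as np

def penalized_mixture(cluster_tokens, quality, token_budget,
                      max_epochs=4.0, alpha=0.5, iters=50):
    """Toy data-mixing sketch: quality-proportional weights, discounted when a
    cluster would be repeated more than max_epochs times under the budget."""
    cluster_tokens = np.asarray(cluster_tokens, dtype=float)
    weights = np.asarray(quality, dtype=float)
    weights = weights / weights.sum()
    for _ in range(iters):                                  # simple fixed-point iteration
        epochs = weights * token_budget / cluster_tokens
        penalty = np.minimum(1.0, (max_epochs / np.maximum(epochs, 1e-9)) ** alpha)
        new_weights = weights * penalty
        new_weights = new_weights / new_weights.sum()
        if np.allclose(new_weights, weights, atol=1e-6):
            break
        weights = new_weights
    return weights
```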
We demonstrated Muon’s practical advantages over AdamW, and developed a sharding strategy to scale it to larger model sizes. Both flagship runs have benefited from the superior token efficiency of Muon. Broadly understanding the behavior of optimizers and the training dynamics of neural networks is an important area of research for us.
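For context, the core of Muon is a momentum step whose 2-D weight update is approximately orthogonalized with a quintic Newton-Schulz iteration. The sketch below follows the publicly released Muon reference implementation; it omits our sharding strategy and any Rnj-1-specific hyperparameters.

```python
import jax.numpy as jnp

def newton_schulz_orthogonalize(G, steps: int = 5, eps: float = 1e-7):
    """Approximately orthogonalize a 2-D update with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (jnp.linalg.norm(G) + eps)     # Frobenius normalization bounds the spectral norm at 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_update(grad, momentum_buf, lr: float = 0.02, beta: float = 0.95):
    """One illustrative Muon step for a weight matrix (Nesterov-style momentum).
    Returns the weight delta and the updated momentum buffer."""
    momentum_buf = beta * momentum_buf + grad
    direction = newton_schulz_orthogonalize(grad + beta * momentum_buf)
    return -lr * direction, momentum_buf
```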
We believe that LLMs should go beyond just modeling code and learn to simulate important aspects of program behavior in different environments. With Rnj-1, we made a substantial bet on modeling program execution at unprecedented scale. To teach our base models to iteratively refine code, we also made investments in modeling elementary code evolution. Both research bets were vetted thoroughly at smaller scales, and we believe they have significantly improved Rnj-1’s abilities as a software engineer. Due to resource constraints, we have not been able to tease apart capability improvements from each of our research endeavors at larger scales, but it is a priority given our conviction in these research directions.
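To give a flavor of what modeling program execution means in practice, the snippet below records line-by-line local-variable states as a program runs; traces like this can be rendered into text for pre-training. It is a sketch of the idea, not our production pipeline.

```python
import sys

def trace_locals(fn, *args):
    """Record (line number, local variables) at each executed line of fn."""
    events = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            events.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, events

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

result, trace = trace_locals(gcd, 48, 18)
print(result)   # 6
for lineno, local_vars in trace:
    print(lineno, local_vars)
```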
As Rnj-1 entered its final phases of pre-training, we were confident that it possessed useful latent mathematical and programming abilities and scientific knowledge. It was necessary to ask how much supervised fine-tuning was needed to coax out its instruction-following abilities and general reasoning, and to put it to the test on long interactions and difficult tasks in hard real-world environments that are typically the realm of much larger models. Our post-training recipe was inspired by existing work on long context mid-training with YaRN, Nemotron, and simple agentic environments. We had three mandates:
- Determine how targeted data distributions influence reasoning and agentic abilities.
- Track qualitative improvements live by playing with the model.
- Gather downstream feedback for our next batch of pre-training bets.
Overall, we landed close to our year-end goal of building our own useful instrument of intelligence from scratch. We expect that we are a few quarters away from our models being sufficient for our internal engineering and scientific needs.
In our work, we never broke our ironclad rule: plunge into the training data and carefully inspect the tasks that we evaluate on. This knowledge turned out to be priceless as we navigated the space of research ideas.
Infrastructure
The infrastructure team set priorities based on the overarching goal of eliminating any blockers to experimental velocity, for all families of workloads.
Our accelerator infrastructure is distributed across two clouds and vastly different platforms — TPU v5p ASICs and AMD MI300X GPUs. The year started with limited JAX support for AMD chips and our 1.2 exaflop fleet split into two disconnected islands. Today, an Essential staff member can develop models within a unified JAX training framework with TPU/GPU support and schedule workloads seamlessly on any platform. We built a node auto-recovery service that slashed our badput by two thirds. Our MFUs for the flagship runs were ~50% of the maximum achievable FLOPs on MI300X GPUs. Going forward, we expect to be able to push this to at least 65%.
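For reference, we compute MFU as achieved model FLOPs per second over peak hardware FLOPs per second, with achieved FLOPs estimated by the same 6nt rule used above; the throughput and fleet size in the example are hypothetical.

```python
def mfu(n_params: float, tokens_per_second: float,
        peak_flops_per_device: float, num_devices: int) -> float:
    """Model FLOPs utilization: (6 * n * tokens/s) / aggregate peak FLOPs/s."""
    achieved = 6 * n_params * tokens_per_second
    return achieved / (peak_flops_per_device * num_devices)

# Hypothetical numbers: an 8B model at 1.3M tokens/s on 96 MI300X GPUs,
# each with ~1.3 PFLOP/s of peak dense BF16 compute.
print(f"MFU = {mfu(8e9, 1.3e6, 1.3e15, 96):.1%}")  # MFU = 50.0%
```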
We also moved our data infrastructure from GCP’s managed service to our own Kubernetes Spark cluster on GKE, running executors on spot instances to save on cost. We constructed a rigorously tested set of common data jobs and placed them in a shared toolkit so that they can be repurposed efficiently with a minimal learning curve.
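A minimal sketch of such a job's session configuration is below; the container image, node-selector label, and retry settings are assumptions about a typical Spark-on-GKE spot setup, not our exact deployment.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shared-toolkit-data-job")
    .config("spark.kubernetes.container.image", "example-registry/spark:latest")
    .config("spark.executor.instances", "64")
    # Schedule executors onto GKE spot capacity; the driver stays on on-demand nodes.
    .config("spark.kubernetes.executor.node.selector.cloud.google.com/gke-spot", "true")
    # Spot preemptions surface as lost executors, so allow generous task retries.
    .config("spark.task.maxFailures", "8")
    .getOrCreate()
)
```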
Demos
Watch Rnj-1 in action with these demonstrations showcasing its capabilities.
The Journey Ahead
Numerous irresistible ideas are vying for our attention. To name a few, we are passionate about conditional computation, extending and strengthening the models’ abilities to process longer contexts, and lower-precision training. In the medium term, we will continue to push the thesis of compression, extending the type and scope of program behaviors we want to simulate, and code evolution. We expect that scaling reinforcement learning to inculcate sophisticated reasoning will appear on our roadmap soon.
Essential is dedicated to building open instruments of intelligence that benefit society. We stand on the shoulders of giants and consider ourselves lucky to participate in and contribute to one of the most important arcs of technology. The words of the pioneering computer scientist, Alan Perlis, capture our sentiments well.
“I think that it’s extraordinarily important that we in computer science keep fun in computing… I think we’re responsible for stretching them [computers] setting them off in new directions and keeping fun in the house. … Above all I hope we don’t become missionaries. Don’t feel as if you’re Bible salesmen. The world has too many of those already. What you know about computing other people will learn. Don’t feel as if the key to successful computing is only in your hands. What’s in your hands I think and hope is intelligence: the ability to see the machine as more than when you were first led up to it that you can make it more.”
Team
It has been an honor to work with the most brilliant engineers, researchers, and leadership at Essential. The team performed at the peak of their abilities for sustained periods. We list the Rnj-1 contributors below.
Team (alphabetical by last name)
| Code | STEM | Infra | Operations |
|---|---|---|---|
| Adarsh Chaluvaraju | Aleksa Gordić | Mike Callahan | Divya Mansingka |
| Devaansh Gupta | Michael Pust | Alok Tripathy | Mohit Parmar (Data) |
| Yash Jain | Tim Romanski | Yash Vanjani | |
| Somanshu Singla | Ali Shehper | | |
| Anil Thomas | Ameya Velingker | | |
Leadership
- Saurabh Srivastava, Code
- Kurt Smith, STEM
- Philip Monk, Infra
- Khoi Nguyen, Data Acquisition
- Divya Shivaprasad, Organization
- Peter Rushton, Organization
- Ashish Vaswani, Research and Engineering Roadmap