Everything started with one tweet from Andrej Karpathy upon the release of prime-environments:

![[Pasted image 20251018012750.png]]

Right after seeing it, some recent technical issues I had been facing (dealing a lot with the NVIDIA DRIVE AGX Thor) and the books I had been reading immediately popped into my mind. One of them was “Programming Massively Parallel Processors” by David Kirk and Wen-mei Hwu, and luckily Johannes Hagemann from Prime Intellect also mentioned it in his tweet thread about prime-environments. The book is full of coding exercises that are perfect for LLMs to practice on, especially for anyone interested in GPU programming and parallel computing. So I decided to take the initiative and create a new environment called PMPP (Programming Massively Parallel Processors), inspired by the exercises from the book.

![[71VXgKBcF7L._UF894,1000_QL80_.jpg|300]]

## Putting the first book into the library

One of the races local models are winning today is probably image-to-text: there are plenty of VLMs readily accessible on Hugging Face, and you can process many books overnight on a local 3090/4080 setup (@mervenoyann always keeps your timeline up to date on OCR if you are missing out on model releases, especially on the local VLM side). In PMPP's case, a lot of content and code has already been published by the community, including solutions for the textbook exercises shared by GitHub user ‘tugot17’. It's a great starting point and reference for building the environment. But some exercises directly reference figures within the book, and what an LLM needs in order to understand and solve an exercise is a little different from what a human needs. With this in mind, I built a process that uses a local Granite Docling model together with a larger LLM as judge in a feedback loop, running a couple of iterations over both the community-provided data and the book content to produce a dataset that matches the needs of the environment.

Books differ in content and flow from the typical question-answer paradigm. Each chapter usually starts with some theoretical background, followed by examples, and ends with exercises. Depending on the book, this chapter-by-chapter structure can grow steadily more complex, with harder exercises and concepts revisited from previous chapters. Having a rubric that accounts for this complexity progression is therefore important for effective scoring, which I will revisit in later sections.

## Dataset and different task types

Books are mostly already in the pre-training data of LLMs; legally or not, all labs have included easily accessible books in their training data. The missing part is the QA and coding exercises, which are not directly available in the training data, mostly because they are not well structured or easily extractable. So I started by going through the book and identifying the kinds of tasks that could be converted into task types for the environment. I categorized them into coding tasks, multiple-choice questions, and short-answer questions. Each type requires a different approach to evaluation and feedback. The QA part is pretty straightforward and doesn't require much infrastructure beyond a dataset of questions and answers. Coding tasks are more complex, as they require a proper environment for compiling and running CUDA code.
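To make the three task types concrete, here is a rough sketch of how entries could be laid out. The field names, chapter numbers, and questions below are purely illustrative assumptions, not the actual pmpp-eval schema; the real entries are browsable in the dataset viewer linked below.

```python
# Illustrative sketch only: these field names are NOT the actual pmpp-eval schema,
# just one way the three task types could be represented.
qa_entry = {
    "task_type": "qa",
    "chapter": 2,
    "question": "Why can a grid-stride loop cover arrays larger than the launched grid?",
    "reference_answer": "Each thread advances by gridDim.x * blockDim.x until it passes the array end.",
}

mcq_entry = {
    "task_type": "mcq",
    "chapter": 4,
    "question": "Which memory space is visible to all threads within a block?",
    "choices": ["registers", "shared memory", "local memory", "constant cache"],
    "answer": "shared memory",
}

coding_entry = {
    "task_type": "coding",
    "chapter": 7,
    "skeleton_file": "convolution_skeleton.cu",  # TODO markers for the model to fill in
    "harness_file": "test_convolution.cu",       # deterministic inputs + reference check
}
```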
I planned to build a containerized evaluation server that can securely compile and execute submitted code snippets, returning results and any errors back to the user. Feel free to browse the dataset entries, thanks to Hugging Face providing such a great data studio, without needing to download the data:

<iframe src="https://huggingface.co/datasets/sinatras/pmpp-eval/embed/viewer/qa/train" frameborder="0" width="100%" height="560px"></iframe>

## Dealing with coding tasks

For coding tasks, I wanted to continue the idea of creating a book environment for an LLM rather than just an input/expected-output pair. So I designed a skeleton structure for each coding task, with TODO comments indicating where the student (in this case, an LLM) needs to fill in the code. This way, the model can focus on completing specific parts of the code rather than writing everything from scratch (function definitions and imports in particular fail a lot), which gives us a more controlled evaluation setting.

There are multiple ways to approach coding task evaluation. The most straightforward is to compile and run the submitted code against a set of predefined test cases, making testing deterministic and checking both correctness and performance. However, this requires a decent infrastructure setup to handle code compilation and execution securely, especially when dealing with potentially untrusted code submissions from LLMs. On the performance side, the nvcc compiler can run on a CPU, but the actual testing needs to happen on a CUDA-capable GPU. To address this, separate async pipelines for CPU-bound compilation and GPU-bound execution were needed to optimize throughput.

## Coding evaluation infra

The evaluation system's architecture emerged from the observation about resource bottlenecks I just mentioned: compilation is CPU-heavy but embarrassingly parallel, while CUDA execution is GPU-bound and requires careful orchestration. This dichotomy led to a dual-pipeline design that maximizes eval/test throughput.

The CPU-bound compilation pipeline leverages parallel make with optional ccache support. The ccache integration proved essential for practical evaluation speeds: when models attempt similar solutions, or when we're debugging specific tasks repeatedly, compilation drops from seconds to near-instant (I recorded up to a 6.7x speedup over multiple runs post-optimization). Parallel make allows multiple compilation targets to build simultaneously, taking advantage of multi-core systems without overwhelming resources.

Meanwhile, the GPU-bound execution pipeline uses an `asyncio.Semaphore` to cap concurrent GPU runs, ensuring stable, reproducible results. When multiple kernels compete for GPU resources, you get variable performance and potential timeout failures that have nothing to do with code correctness. The semaphore ensures that GPU execution happens sequentially within each slot, maintaining consistent evaluation conditions. It can be tuned to the CUDA resources available on your setup, currently via args/env variables, to get the most out of your kernel compiler/tester node.

Each evaluation gets its own workspace under `/tmp/pmpp_workspaces`, following a strict lifecycle pattern that turned out to be more complex than I initially anticipated. The workspace isolation, using UUID-based directories, prevents any cross-contamination between evaluations.
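To make the dual-pipeline idea concrete, here is a minimal sketch of how compile and run stages could be staged around an `asyncio.Semaphore` and UUID-based workspaces. This is not the actual pmpp-eval runner; the function names, the Makefile-based harness, and the limits are assumptions for illustration.

```python
import asyncio
import shutil
import subprocess
import uuid
from pathlib import Path

WORKSPACE_ROOT = Path("/tmp/pmpp_workspaces")  # per-job isolation root, as described above
GPU_SLOTS = 1                                  # illustrative default; tune per node via args/env
gpu_semaphore = asyncio.Semaphore(GPU_SLOTS)

async def evaluate_submission(task_id: str, source_code: str) -> dict:
    """Hypothetical flow: isolate -> compile (CPU-bound) -> run (GPU-gated) -> cleanup."""
    workspace = WORKSPACE_ROOT / f"{task_id}-{uuid.uuid4().hex}"
    workspace.mkdir(parents=True)
    try:
        # Assumes the task's Makefile and test harness are staged into the workspace.
        (workspace / "solution.cu").write_text(source_code)

        # CPU-bound compilation: parallel make (ccache picks this up if configured).
        compile_proc = await asyncio.create_subprocess_exec(
            "make", "-j", "-C", str(workspace),
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        )
        _, compile_err = await compile_proc.communicate()
        if compile_proc.returncode != 0:
            return {"status": "compile_error", "stderr": compile_err.decode()}

        # GPU-bound execution: the semaphore serializes runs per slot for stable timings.
        async with gpu_semaphore:
            run_proc = await asyncio.create_subprocess_exec(
                str(workspace / "test_harness"),
                stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            )
            run_out, _ = await asyncio.wait_for(run_proc.communicate(), timeout=120)
        status = "passed" if run_proc.returncode == 0 else "failed"
        return {"status": status, "stdout": run_out.decode()}
    finally:
        # Clean up even on errors/timeouts so temporary files never accumulate.
        shutil.rmtree(workspace, ignore_errors=True)
```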
Since we are running untrusted code, I also extended this into a fully containerized solution with a FastAPI server that exposes endpoints to submit jobs, check status, and retrieve results. For both the local and containerized runners, there is automatic cleanup even when things go wrong, because CUDA compilation can generate substantial temporary files for each submitted kernel that would otherwise accumulate.

![[Pasted image 20251018013012.png]]

## Testing, testing and more testing

The decision to prioritize deterministic testing shapes every aspect of the evaluation, basically proving how much I hate string matching. Rather than using random inputs that might catch different edge cases on different runs, every test harness uses predetermined input data with exact comparison against known-correct implementations. Those reference implementations also helped me build CPU oracles for testing, and they are available in the repository as educational material even though they are not crucial for the evaluation itself.

Let's go over what the tests target and challenge:

- Boundary conditions: off-by-one errors in thread indexing, grid-stride loops that don't cover all elements
- Partial tiles: threads that need special handling when the problem size isn't divisible by block dimensions
- Memory patterns: coalesced vs. strided access, bank conflicts in shared memory, alignment requirements
- Edge cases: zero-size inputs, single-element matrices, maximum allocation sizes that stress memory limits
- Synchronization: race conditions from missing `__syncthreads()`, incorrect atomic operation usage
- Numerical stability: floating-point accumulation order, epsilon comparisons for approximate equality

For numerical kernels, we implement epsilon-based comparisons with well-defined tolerance bounds, typically relative error under 1e-5 for single precision against the CPU oracle result. Performance tests avoid wall-clock measurements that vary with system load, instead verifying algorithmic complexity through operation counts or relative comparisons against baseline implementations (a small spoiler for next iterations such as pmpp-hard).

![[image.webp|500]]

## Rubric design

One of the key challenges in evaluating model performance across textbook exercises is that not all chapters are created equal. Early chapters cover fundamental concepts like basic indexing and simple memory access patterns, while later chapters dive into complex topics like radix sort, hybrid sparse matrix formats (HYB/ELL SpMV), and dynamic parallelism. To address this disparity, I implemented a weighted scoring system based on empirical difficulty estimation. The weight for each chapter is calculated from the observed pass rates on a calibration set:

`w_c = (1 - pass_rate_c) + ε`

where `pass_rate_c` is the empirical pass rate for chapter c and ε is an additive smoothing factor that prevents zero weights. These weights are then normalized to have a mean of 1.0 for interpretability. The calibration process uses three models of varying capability (including GPT-5 Pro as an upper bound) to establish a wide difficulty signal. This multi-model approach prevents overfitting to any single model's specific weaknesses (such a case can be seen with Sonnet 4.5). The resulting weights show significant variance: Chapter 13 receives a weight of approximately 2.00, while Chapter 9 gets 0.22, reflecting the difficulty gap between advanced parallel algorithms and basic memory operations.
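As a minimal sketch of the weighting scheme above (the ε value and pass rates below are made up for illustration; the actual calibration numbers come from the three-model run described earlier):

```python
# Sketch of the chapter weighting described above; numbers are illustrative,
# not the actual calibration data.
EPSILON = 0.05  # additive smoothing so no chapter gets a zero weight

def chapter_weights(pass_rates: dict[int, float]) -> dict[int, float]:
    """w_c = (1 - pass_rate_c) + eps, normalized so the mean weight is 1.0."""
    raw = {ch: (1.0 - rate) + EPSILON for ch, rate in pass_rates.items()}
    mean_w = sum(raw.values()) / len(raw)
    return {ch: w / mean_w for ch, w in raw.items()}

# Example with made-up calibration pass rates: an easy chapter vs. a hard one.
weights = chapter_weights({9: 0.95, 13: 0.10})
# A task's final score is then weighted_reward = raw_reward * weights[chapter].
```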
The weighted reward for each task is then calculated as:

`weighted_reward = raw_reward × chapter_weight`

This approach directly changed the results table, and I believe it provides great insight, especially for models that are close calls on raw binary scores.

![[output(2).png]]

## What I have learned over first iterations

After running hundreds of evaluations across different models, patterns emerged in the failure modes that I hadn't considered at the start in terms of specification clarity. My analysis revealed that most failures in the first iterations weren't due to models lacking capability but rather to specifications leaving room for reasonable but incorrect interpretations.

The CUDA-specific idioms issue characterizes this perfectly. Models consistently reached for `std::numeric_limits<float>::lowest()`, a completely reasonable C++ choice that happens to be device-incompatible in CUDA kernels. The fix wasn't to penalize models for not knowing CUDA quirks (come on, labs really only started focusing on including pretraining data for them around the start of this year, which shows in the eval results) but to explicitly specify the use of the `-INFINITY` macro in the skeleton code.

Similarly, sparse matrix format ambiguities led models to implement perfectly valid but incompatible indexing schemes. The ELL format's row-major versus column-major layout wasn't obvious from the task description, leading to systematic failures that had nothing to do with parallel programming understanding. I now provide explicit indexing examples, such as "For the element at `row i`, `column j`, access `data[i * max_cols + j]`", rather than leaving it to interpretation.

The accumulation pattern confusion in multi-pass kernels was also interesting. When instructions said "accumulate results," models split between using assignment for the first pass and addition for subsequent passes, versus using addition throughout with proper initialization. Both interpretations make sense, so the updated specifications now include explicit accumulation examples with the required operators clearly marked.

And the final thing: working with [verifiers](https://github.com/PrimeIntellect-ai/verifiers)/[prime-environments](https://github.com/PrimeIntellect-ai/prime-environments/) is pretty fun. Achieving whatever you want is possible in no time, and the community behind it (thanks to Prime Intellect's support) is very helpful for anyone who wants to get into it. Whether it's an eval or a training-focused RL env doesn't matter; Prime Intellect has the tools and 100+ examples to get you there. Even a simple command added during the dev time of this env, `vf-tui`, made debugging iterations feel almost like a game; I would just put a thumbs up/down button on this UI and go over it all day if I had the time.

![[Pasted image 20251018022227.png]]

## Results

Across the first evaluation wave, the coding benchmark proved to be the true differentiator among models. The weighted scoring system, which amplifies the contribution of more complex chapters, revealed meaningful performance separation beyond what raw pass rates alone could show. GPT-5-High emerged as the clear leader, scoring over 85% weighted, while the base GPT-5 followed closely at 76%. Then come the Grok family and Sonnet 4.5. One of the interesting results is also GPT-OSS, whose score, compared to its QA score, tells a lot about its pre/post-training data. Gemini 2.5 Pro also underperformed compared to the models at the top of the eval.
There seems to be a decline in its performance even across the multiple test runs used to ground the results.

![[output(7).png]]

In contrast, the QA evaluation displayed a much narrower spread, with top models clustering around the low-80% range and stepping down from there. Since these tasks mainly probe conceptual recall and theoretical comprehension, saturation is expected. The coding side, however, demonstrates genuine depth separation among models, serving as a better proxy for procedural understanding and execution-level reasoning.

We can see some outliers whose performance differs from the coding subset. Grok seems to have better QA performance than coding, even passing GPT-5-High by barely a couple of points (a win is a win, even at 3.2%). The QA set also benefitted Gemini 2.5 Pro, pushing it to the competitive top section of the figure with 73.3%. On the other hand, GPT-OSS, as mentioned in the coding part, comes in last in the QA section (synthetic data didn't provide enough entropy, I guess). We can see small improvements from the DeepSeek v1 to v2 release on both evals. Sonnet 4.5 is stuck being a good specialized agent after all these updates, but this brings it down to the middle section of the eval; it couldn't compete with GPT-5/Gemini 2.5 Pro on QA. Claude Haiku 4.5, recently released and included in both evals, underperformed my expectations; it is really fast at inference but performs badly across the questions. There will probably be more silent updates to it in the upcoming weeks, which will require re-evaluation.

![[output(5).png]]

## Next Steps, possibly a PMPP-Hard

These results provide a solid baseline, but they also highlight the need for even more challenging evaluations. The fact that GPT-5-High solves nearly 90% of coding tasks suggests we're approaching saturation at the current difficulty level. The upcoming pmpp-hard variant will address this by extracting only the most challenging problems, using fewer code skeletons, adding performance benchmarking, and possibly teaching models a little more about benchmaxxing.

---

## Special thanks goes to Prime Intellect

Grateful acknowledgment to [Prime Intellect](https://www.primeintellect.ai/) for fully sponsoring this release. As an ongoing RL Resident, the program provided the foundational infrastructure, funding, and guidance that made the PMPP env/eval possible from inception to release.

![[148051844 2.jpg|50]]

---

```
@misc{pmpp_eval,
  author = {Sinatras},
  title  = {pmpp-eval},
  year   = {2025},
  url    = {https://github.com/SinatrasC/pmpp-eval}
}
```