Awesome-ML-SYS-Tutorial
English Version | Chinese Version
My learning notes for ML SYS.
I’ve been writing this blog series intermittently for over a year now, and it’s almost become an RL Infra Learning Note 😂
I often see discussions about whether ML SYS or AI Infra is worth getting into, and how to start. Everyone’s choice is different. For me, I simply want to pursue the truth in algorithms:
Many RL conclusions in published papers rest on open-source RL infrastructure that may be seriously flawed. I’ve been involved in RL infra development for over a year and have seen many community experts working diligently, but the fact is that RL infra, whether open-source or inside major companies, still has many problems. It is absolutely worth questioning whether the high-level conclusions drawn from this flawed infrastructure are correct. When I was reviewing for ICLR this year, I often asked the authors of the papers assigned to me, "If the framework you are using has implementation issues itself, can your conclusions still hold?" Although I never deducted points for this reason, no one gave an answer that resolved my fundamental doubt.
Therefore, some excellent researchers I know are keen to participate in infra development, spending most of their time on foundational work to rigorously ensure that the algorithm they plan to develop next has a correct basis. I greatly admire them and share their commitment to rigor; they are my role models. The same is true for our SGLang RL community. With so much manpower and time invested, we all hope to provide the most correct and concise RL foundation possible, whether for companies training models or researchers developing new algorithms, with the goal of genuinely serving everyone in the community. Thank you for your recognition, and I look forward to hearing from anyone interested in contacting me and joining us!
After a year of going around in circles, this is the resolve that keeps me going in Infra: to make a contribution to the community by building a correct foundation, thereby helping to ensure correct conclusions.
Coming back to the topic, this series of blog posts started in August 2024, when I began writing ML SYS learning notes after getting the chance to use SGLang in my research. It’s largely written by me, with content focusing on RL infra, online/offline inference systems, and some fundamentals of AI Infra. Over the past year, the repository has grown from two or three articles and thirty to fifty GitHub stars to more than 4.5K stars, and I have become a minor technical influencer. I am deeply honored and grateful for the support.
I would like to thank my advisors, Professor Quanquan Gu, Dr. Ying Sheng, and Dr. Lianmin Zheng, for the immense help and guidance they gave me in my study of AI Infra, career choices, and life path. Although I am no longer pursuing a Ph.D. at UCLA due to personal reasons, this journey after my undergraduate graduation has been an incredibly valuable experience. I have now joined RadixArk full-time, continuing my research in RL Infra. We will continue to share AI Infra-related technology and thoughts through my blog, via unofficial channels. I also hope readers interested in AI Infra reach out to us, join the SGLang open-source community, and together build open-source AI Infra that changes the world and is worth being proud of for a lifetime!
RLHF System Development Notes
slime Framework
- Achieving Speed and Accuracy: A Comprehensive Solution to Train-Inference Mismatch in RL: Introduces two solutions provided by the slime framework for the train-inference mismatch problem: achieving perfect True On-Policy training through kernel-level alignment, and mitigating the mismatch using algorithms like TIS/MIS. Also available in Chinese version.
- Support FSDP2 as A Training Backend for slime: Added FSDP as a training backend to slime, and aligned it with Megatron. FSDP is more flexible in supporting models with architectural innovations like Qwen3-Next/gpt-oss and helps us further support VLM RL. Also available in Chinese version and on Zhihu.
- Unified FP8: Moving Beyond Mixed Precision for Stable and Accelerated MoE RL: Fully utilizing FP8 for both sampling (Rollout) and training (Training) in RL. Also available in Chinese version and on Zhihu.
- Power Up Speculative Decoding In Reinforcement Learning: Introduces speculative decoding into the RL sampling process, significantly boosting sampling speed when the batch size is appropriate; moreover, the draft model is updated during training. Compared to freezing the draft model, the accepted length remains consistently high, yielding long-term stable positive returns. Also available in Chinese version.
- An In-Depth Look at the Elegant Design and Source Code of the slime RL Framework: slime source code appreciation. Also available on Zhihu and in Chinese version.
- [Pending Review] slime FSDP Setup Guide: Records how to test FSDP on slime, including H-cards and B-cards, and both Colocate and Disaggregated placement methods.
- [Pending Review] Chunked Parallel Computation of GAE in PPO (slime Implementation): Rewrites the standard backward recurrence of GAE into chunk-based parallel prefix scanning, significantly mitigating the GAE computation bottleneck in long sequence scenarios, achieving about $100\times$ to $300\times$ acceleration in slime. Also available on Zhihu.
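The chunked idea behind the GAE item above can be illustrated in a few lines. This is a minimal pure-Python sketch, not slime's implementation: the function names, the `chunk_size` parameter, and the simple sequential carry pass are assumptions made for readability. It relies only on the standard recurrence $A_t = \delta_t + \gamma\lambda A_{t+1}$, where $\delta_t$ is the TD residual. Each chunk is first solved independently (the part that can run in parallel), and a short pass over chunk boundaries then adds the contribution from everything to the right.

```python
def gae_sequential(deltas, gamma, lam):
    """Reference backward recurrence: A_t = delta_t + gamma * lam * A_{t+1}."""
    adv = [0.0] * len(deltas)
    carry = 0.0
    for t in reversed(range(len(deltas))):
        carry = deltas[t] + gamma * lam * carry
        adv[t] = carry
    return adv


def gae_chunked(deltas, gamma, lam, chunk_size):
    """Chunked variant (illustrative): same result, chunks solved independently."""
    decay = gamma * lam
    n = len(deltas)
    starts = list(range(0, n, chunk_size))

    # Phase 1: per-chunk advantages assuming nothing follows the chunk.
    # The chunks do not depend on each other, so this phase is parallelizable.
    local = [gae_sequential(deltas[s:s + chunk_size], gamma, lam) for s in starts]

    # Phase 2: walk chunk boundaries right to left and add the carried term.
    # Position s + i is (length - i) steps before the next chunk's first token.
    adv = [0.0] * n
    carry = 0.0  # advantage at the first position of the chunk to the right
    for ci in reversed(range(len(starts))):
        s = starts[ci]
        length = len(local[ci])
        for i in range(length):
            adv[s + i] = local[ci][i] + decay ** (length - i) * carry
        carry = adv[s]
    return adv
```

Here Phase 2 is a plain loop over chunks for clarity; a production version would typically express both phases as a parallel prefix scan on the GPU.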
AReal Framework
- AReal Code Walk Through: AReal source code appreciation. Also available on Zhihu and in Chinese version.
verl Framework
- Analyzing VLM RL Training Memory Leaks via Torch Memory Snapshot: Analysis of SGLang memory leak issues and solutions. Also available on Zhihu and in Chinese version.
- Latency optimization for weight updates: A debug process for efficiency. Also available on Zhihu: A record of optimizing SGLang weight update latency.
- In-Depth Understanding of verl Source Code (Initialization): Also available on Zhihu and in Chinese version.
- In-Depth Understanding of verl Source Code (Rollout): Also available on Zhihu and in Chinese version.
- [Pending Review] In-Depth Understanding of verl Source Code (Make Experience): Analysis of the logic for the make experience part in verl.
- AgentLoop Source Code Analysis: Analysis of the multi-turn RL implementation based on AgentLoop in verl.
- verl Parameter Quick Reference: Quick reference for verl parameters. Also available on Zhihu and in Chinese version.
- Analyzing the Complexity of Agentic Multi-Turn Training from a Tokenizer Perspective: Also available on Zhihu and in Chinese version.
- [Pending Review] DAPO Dynamic Filtering Implementation and Batch Size Analysis: Exploring how to achieve higher parallelism by padding prompts to a smaller batch size.
- Systematic Analysis of Time Consumption in verl Multi-Turn Training: verl multi-turn interaction and tool call profile analysis. Also available in Chinese version and on Zhihu.
- SGLang, verl, OpenBMB, and Tsinghua University Team Jointly Open Source: First Support for Multi-Turn Interaction and Tool Calling in Mainstream RLHF Frameworks: First support for multi-turn interaction and tool calling in mainstream RLHF frameworks. Also available on Zhihu.
- Search-R1 & veRL-SGLang: Train LLMs with Multi-Turn RL to Reason and Call a Search Engine: Integrating the Search-R1 framework into the verl-sglang ecosystem. Also available on Zhihu.
- SGLang-veRL Server: From Engine to Server, We Need More Flexible RLHF Rollout Interfaces: To implement more complex RLHF systems, we are gradually replacing the rollout engine in veRL with a rollout server. Also available on Zhihu: SGLang-veRL Server.
- HybridFlow veRL Original Paper Analysis: Principles and implementation of SGLang’s hybrid engine. Also available on Zhihu: HybridFlow veRL Original Paper Analysis.
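For the DAPO dynamic filtering item in the list above, the core filtering step can be sketched as follows. This is a hedged illustration, not verl's code: `fill_batch`, the `(prompt, rewards)` stream, and the "all rewards identical" test are assumptions used to show the idea that prompt groups with no reward variation carry no group-relative learning signal, so sampling continues until enough informative groups are collected.

```python
def is_informative(rewards):
    """A group is informative if its sampled responses do not all share one reward."""
    return len(set(rewards)) > 1


def fill_batch(sample_stream, batch_size):
    """Draw (prompt, rewards) groups until batch_size informative ones are kept.

    Returns the kept groups and how many groups had to be drawn in total.
    """
    batch = {}
    drawn = 0
    for prompt, rewards in sample_stream:
        drawn += 1
        if is_informative(rewards):
            batch[prompt] = rewards
        if len(batch) == batch_size:
            break
    return batch, drawn
```

The ratio `drawn / batch_size` is the oversampling cost of filtering, which is why the effective batch size and the rollout parallelism have to be considered together.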
OpenRLHF Framework
- Illustrated Series on LLM RLHF: PPO Principles and Source Code Interpretation for Everyone and Illustrated Distributed Training Process based on Ray in OpenRLHF: Excellent RLHF introductory resources by Ms. Mengyuan. After reading, you will have a good understanding of RLHF’s computational flow and the OpenRLHF PPO framework. I have also added my own understanding in RLHF Computational Flow.
- Brief Analysis of the Computational Flow of Post-Training Systems Represented by OpenRLHF: Further complement to Ms. Mengyuan’s article. The Github native rendering is terrible; you might as well look at Zhihu.
System Design and Optimization
- Deep Thoughts on RL Systems: In-Depth Understanding of Weight Update Mechanism: Summary of half a year’s work, in-depth understanding of the weight update mechanism. Also available on Zhihu and in Chinese version.
- Deep Thoughts on RL Systems: FSDP Training Backend: Discusses the principles and implementation of FSDP, and analyzes verl’s use of FSDP. Also available on Zhihu and in Chinese version.
- [Pending Review] Deep Thoughts on RL Systems: Megatron: Brief analysis of Megatron’s basic features, focusing on its use in the RL framework.
- Extending the OpenRLHF Inference Engine: Development notes on integrating SGLang into OpenRLHF. The entire process was very painful, and there’s still an nccl hang error that a DeepSpeed core contributor is currently fixing.
- [Pending Review] SGLang as rollout engine of GRPO trainer: Introduction on how to use SGLang as the inference backend for the GRPO Trainer in TRL. GRPO is a PPO variant that optimizes PPO’s memory usage while improving mathematical reasoning capabilities.
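The GRPO item above mentions reduced memory usage relative to PPO. As commonly described, that saving comes from replacing the learned value model with a baseline computed from a group of responses to the same prompt. A minimal sketch of that group-relative advantage, assuming population standard deviation and a small `eps` for stability (both are illustrative choices, and this is not TRL's code):

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds the scalar rewards of all responses sampled for one prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every response in a group receives the same reward, all advantages are zero and the group contributes no gradient.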
Algorithms and Theory
- [Pending Review] Learning to Reason under Off-Policy Guidance: The LUFFY framework uses off-policy assistance for on-policy learning, dynamically balancing imitation and exploration by combining off-policy inference trajectories with on-policy rollouts.
- Kimi K1.5: Successful Practice of Long Context RL: Industrial implementation of Long Context RLHF. I have always liked the technical reports from the Kimi team. Also available on Zhihu: Kimi K1.5: Successful Practice of Long Context RL.
- Rule-based Reward: Only on Zhihu, a brief write-up. Honestly, I didn’t particularly like the original paper, but a deterministic reward is indeed charming.
- SWE-Bench: How to Construct an Excellent Benchmark in the LLM Era: Reading notes on the SWE-Bench paper. How to construct a good benchmark to provide fine-grained reward for post-training is an eternal and beautiful topic.
- Brief Analysis of Mainstream Alignment Algorithms and the NeMo-Aligner Framework
SGLang Learning Notes
SGLang Diffusion Learning Notes
- SGLang Diffusion Code Walk Through: Basic principles of the diffusion model, and the entire process of a request being handled by SGLang-Diffusion. Also available on Zhihu and in Chinese version.
Core Architecture and Optimization
- SGLang Code Walk Through: The entire process of a request being handled by the SGLang Engine. Some parts are unfinished, but most are okay and have served as a starting point for many SGLang beginners. Chinese version is here.
- Walk Through SGLang / VLLM Worker: Incomplete analysis of SGLang code. Also available on Walk Through SGLang / VLLM Worker. We also thoughtfully provide an English version. For a more detailed analysis, refer to SGLang Code Walk Through; this one is just supplementary.
- Walk Through SGLang Scheduler
- [Pending Review] SGLang Scheduler Evolution: Detailed introduction to the technical evolution of the SGLang Scheduler from serial to CPU/GPU overlap, and related components, comparing the previous overlap Scheduler with the current one introducing multiple CUDA streams and FutureMap. Can be viewed on Zhihu article.
- [Pending Review] KV Cache Code Walkthrough: Overview of KV cache management implementation, starting from the Scheduler component, detailing the update process of KV cache and memory pool during prefill and decode stages.
- [Pending Review] SGLang Multimodal Request Lifecycle: A Deep Architectural Analysis with Qwen2.5-VL as an Example: Provides a detailed analysis of the multimodal request processing flow within the SGLang framework, using Qwen2.5-VL as a reference model.
- [Pending Review] How A Model is Loaded in Hugging Face and SGLang: Documents the process of loading models in Hugging Face and SGLang to help understand the weight loading mechanism.
- [Pending Review] Speculative Decoding: Introduces the speculative decoding optimization technique, which uses a smaller draft model to predict the next $K$ tokens, achieving up to $K$-fold acceleration.
- [Pending Review] Zero-Overhead Batch Scheduler: Introduces the zero-overhead batch scheduler, which solves the GPU Bubble problem caused by serial execution of CPU scheduling and GPU computation in traditional inference systems.
- [Pending Review] Data Parallelism Attention: Detailed introduction to the principles and implementation of DP Attention, specifically for models like DeepSeek that use MLA and only have one KV head, to avoid KV cache duplication caused by tensor parallelism.
- Brief Analysis of SGLang Framework’s Quantization Design and Ideas: Also available on Zhihu: Brief Analysis of SGLang Framework’s Quantization Design and Ideas and in Chinese version.
- Constraint Decoding: Concepts, Methods, and Optimization: Also available on Zhihu: Understanding Constraint Decoding: Concepts, Methods, and Optimization in one article.
- [Pending Review] Online Update Weights: Introduction to the implementation of the `online_update_weights` interface in SGLang. Unlike `update_weights`, which reads weights from the disk, this interface broadcasts new weights directly from the training engine via NCCL.
- [Pending Review] SGLang Verl Engine Optimization Analysis: Analysis of optimizations in the SGLang verl engine, including the implementation of interfaces like `update_weights_from_tensor`.
- Latency Accelerate For Weight Updates
- [🔥 Related Debugging] Analyzing VLM RL Training Memory Leaks via Torch Memory Snapshot: Analysis of SGLang memory leak issues and solutions. Also available on Zhihu and in Chinese version.
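The speculative decoding item in the list above describes a small draft model proposing the next $K$ tokens. The sketch below shows the propose-then-verify loop with greedy verification only. Everything in it is a toy assumption: `target_next` and `draft_next` are hypothetical stand-ins for real models, and the per-position check emulates what a real engine would do in a single batched target forward pass. It is not SGLang's implementation, and it omits the rejection-sampling rule used for stochastic decoding.

```python
def target_next(prefix):
    """Toy 'large' model: a deterministic next-token rule (hypothetical)."""
    return (sum(prefix) * 31 + len(prefix)) % 7


def draft_next(prefix):
    """Toy 'small' model: agrees with the target except at every 4th position."""
    tok = target_next(prefix)
    return (tok + 1) % 7 if len(prefix) % 4 == 0 else tok


def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq


def speculative_decode(prompt, n_new, k):
    """Draft proposes k tokens; target keeps the longest matching prefix."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))

        target_passes += 1  # one verification pass per proposal block
        accepted = []
        for tok in proposal:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's correction ends the block
                break
        else:
            # All k drafts accepted: the same pass also yields one extra token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], target_passes
```

Because every kept token is what the target would have produced, the output matches plain greedy decoding; the gain is that several tokens can be committed per target pass when the draft agrees often.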
Usage and Practice
- [Pending Review] Qwen3-Coder Usage: Introduction to using Qwen3-coder in SGLang, including the use of tool-parser.
- [Pending Review] NVIDIA Dynamo: Introduction to NVIDIA Dynamo, a high-throughput, low-latency inference framework designed for generative AI and inference model serving in multi-node distributed environments.
- Viewing HuggingFace Model Structure
- SGLang Backend Original Paper Analysis
- Brief Analysis of the Status Quo of Reward / Embed Model Server Engine
- Newbie Perspective: Experience and Gains from Migrating vllm to SGLang
- Newbie Perspective: Using SGL to Serve Embedding Model
- Newbie Perspective: Using vllm to serve a new Embedding Model
Scheduling and Routing
- Mooncake: Carrying the P/D Separation to the End
- Should Prefill and Decode be Separated onto Different Cards?
- Understanding Prefill and Decode Computation Characteristics Based on Chunked Prefill
- ModelServer: A Frontend Distribution System Based on SGLang
ML System Fundamentals
Transformers & Model Architecture
- [Pending Review] Cross-Attention Mechanism in Transformer: Introduction to the cross-attention mechanism in Transformers, allowing the decoder to access and use relevant information from the encoder. Also available in Chinese version.
- Understanding Special Tokens and Chat Templates in One Article: Also recorded on Zhihu Understanding Special Tokens and Chat Templates in One Article.
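For the cross-attention item above, the mechanism can be written out directly: queries come from the decoder, while keys and values come from the encoder. The following is a minimal single-head sketch in pure Python that assumes the inputs are already projected; the learned projection matrices, masking, and multi-head splitting are deliberately left out, and the function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_queries, encoder_keys, encoder_values):
    """Scaled dot-product attention from decoder positions to encoder positions.

    decoder_queries: list of query vectors, one per decoder position.
    encoder_keys / encoder_values: one key and one value per encoder position.
    """
    d = len(encoder_keys[0])
    value_dim = len(encoder_values[0])
    outputs = []
    for q in decoder_queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in encoder_keys
        ]
        weights = softmax(scores)  # how much each encoder position contributes
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, encoder_values))
            for j in range(value_dim)
        ])
    return outputs
```

The output has one vector per decoder position, each a weighted mix of encoder values, which is how the decoder reads information from the encoder.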
CUDA & GPU
- Brief Analysis of CUDA Graph Based on torch-memory-saver: Also available on Zhihu: Brief Analysis of CUDA Graph Based on torch-memory-saver and in Chinese version.
Distributed Training & Communication
- [Pending Review] Implementing Tensor Parallelism From Scratch: Implementation and practice of Tensor Parallelism.
- NCCL and NVIDIA TOPO: Introduction to NCCL and NVIDIA GPU detection. Also available on NCCL and NVIDIA TOPO.
- NCCL and SGLang: Application of NCCL in SGLang. This is very similar to the Chinese content but includes some additional notes on parallel strategies. I probably won’t complete this note and will write a separate one to record parallel strategies.
- PyTorch Distributed: Communication practice with `torch.distributed`, details on GIL and `all_reduce`. This part is also available on Zhihu: PyTorch Communication Practice.
- [Original][In-Depth][PyTorch] DDP Series Part 1: Introductory Tutorial: Although I didn’t fully grasp the DDP content, I used this to learn about GIL and ring all reduce. This step is recorded in the Postscript of torch-distributed.
- Detailed Explanation of nvidia-smi Command and Some Advanced Tips: Mainly about network topology; my local results are recorded in the NCCL section.
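Ring all-reduce, mentioned in the list above, can be simulated without any GPU or communication library. The sketch below is a single-process illustration, not how `torch.distributed` or NCCL implement it: each "worker" is a Python list, each chunk is a single number (so the vector length must equal the number of workers, an assumption made to keep indexing simple), and "sending" is a list write.

```python
def ring_all_reduce(worker_data):
    """Simulate ring all-reduce (sum) over n workers holding n chunks each."""
    n = len(worker_data)
    data = [list(v) for v in worker_data]  # copy so the input is untouched
    assert all(len(v) == n for v in data), "sketch assumes one chunk per worker"

    # Reduce-scatter: at each step, worker i sends chunk (i - step) to its
    # right neighbor, which accumulates it. Messages are snapshotted first so
    # all sends in a step happen "simultaneously".
    for step in range(n - 1):
        msgs = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in msgs:
            data[(src + 1) % n][chunk] += value
    # Now worker i holds the fully reduced chunk (i + 1) % n.

    # All-gather: pass the finished chunks around the ring, overwriting.
    for step in range(n - 1):
        msgs = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in msgs:
            data[(src + 1) % n][chunk] = value
    return data
```

Each worker sends and receives only one chunk per step over $2(n-1)$ steps, so per-worker traffic stays close to twice the vector size regardless of the number of workers.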
Quantization
- Give me BF16 or Give Me Death: Comprehensive Evaluation of Current Quantization Methods
- AWQ: Model Quantization Should Focus on Activation Values
Developer Guide
- How to use docker: How to use Docker to manage development environments. To foster a good research environment and spare others the familiar "it runs on my machine" excuse, learning Docker is essential for everyone. We also have a Chinese version and a Zhihu post.
- Setting up a Clean Development Environment: Setting up a clean development environment. Also available on Zhihu: Setting up a Clean Development Environment.
- Compiling and Deploying Jupyter Notebooks as Documentation on CI