09 Nov, 2025
Hi! This is kalomaze. In this post, I’ll be covering LoRA training and its recent incorporation into prime-rl for both SFT and RL finetuning, including practical implementation details & experimental training results for some of our RL environments.
When to use LoRA?
A recent blogpost from John Schulman, “LoRA Without Regret” (released in collaboration with other members of the Thinking Machines Lab), caught the attention of many prominent reinforcement learning practitioners.
The blogpost presents some important practical evidence on where LoRA training works well vs. where it is less effective:
- LoRA finetuning is most significantly bottlenecked by the intrinsic density of the learning signal, or more specifically, the fundamental “new” information that is present in the batch.
- This can be a problem for some SFT finetuning setups where the data is inherently information-rich and introduces fundamentally “high rank” changes to the model.
- A practical example of this might be “new knowledge” about newer Python libraries that weren’t originally covered in the pretraining data.
- However, RLVR (Reinforcement Learning with Verifiable Rewards) does not appear to require such a rich supervisory signal, due to the nature of what it optimizes for (sparse outcome rewards based on experience).
In light of the broader industry pivot from RLHF-centric approaches (the kind of RL we first saw widely applied for LLMs in 2023) to more specialized reinforcement learning on environments with programmatically verifiable rewards (RLVR), the benefits of LoRA training become more immediately attractive; RLVR’s low capacity requirements mean LoRA can match full finetuning performance while offering practical benefits (such as multi-tenant serving of multiple adapters) and significant reductions in training memory requirements.
How is LoRA implemented in prime-rl?
Currently, LoRA is implemented in such a fashion that enables the user to flexibly mix and match what parts of the model they want to target as necessary.
target_modules specifies which projections the user wants to target for LoRA adaptation, while modules_to_save specifies which modules the user wants to keep fully trainable rather than LoRA-adapted.
For example, if someone wanted to exclusively adapt the attention weights (which are generally understood to be responsible for routing and composing information across the network’s layers) while leaving the MLPs (understood to be responsible for storing world knowledge) completely unmodified, a filter like so (or more complex patterns via regex) could be used:
target_modules = [
"q_proj", # Attention: Query projection
"k_proj", # Attention: Key projection
"v_proj", # Attention: Value projection
"o_proj" # Attention: Output projection
]
modules_to_save = []
By default, we assume the user wants to keep certain parameters fully trainable rather than adapting them with LoRA, such as normalization parameters (r".*norm$", r".*layernorm$"), since these layers are small enough that fully training them adds negligible memory cost.
The embeddings layer can be considered a special case: while an embedding matrix could in principle be LoRA adapted, embeddings function as lookup tables rather than linear projections, so they are kept fully trainable by default as well.
The lm_head is likewise kept fully trainable by default.
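For illustration, here is a minimal sketch of how such pattern-based filtering might look. This is not prime-rl’s actual code (the helper name is hypothetical); the two patterns are simply the defaults quoted above:

```python
import re

# The two default "keep fully trainable" patterns quoted above; note that
# r".*norm$" already covers names ending in "layernorm" as well.
FULLY_TRAINABLE_PATTERNS = [r".*norm$", r".*layernorm$"]

def keep_fully_trainable(module_name: str) -> bool:
    """Hypothetical helper: True if a module is trained directly instead of LoRA-adapted."""
    return any(re.fullmatch(pattern, module_name) for pattern in FULLY_TRAINABLE_PATTERNS)

print(keep_fully_trainable("model.layers.0.input_layernorm"))  # True  -> stays fully trainable
print(keep_fully_trainable("model.layers.0.mlp.gate_proj"))    # False -> candidate for LoRA
```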
However, today’s experiments will focus on exclusively adapting the MLP and attention projections in Qwen models across multiple environments. A pure LoRA layout (i.e., no modules kept fully trainable via modules_to_save) also guarantees compatibility with vLLM’s LoRA serving feature, which may be useful or convenient depending on your use case.
For reference, this is the configuration that I will be using for the following demonstrations.
modules_to_save is empty, which means we’re exclusively training LoRA adapters on the main modules of every layer (so this config keeps the norms, output head, & embeddings frozen rather than training them):
target_modules = [
"q_proj", # Attention: Query projection
"k_proj", # Attention: Key projection
"v_proj", # Attention: Value projection
"o_proj", # Attention: Output projection
"gate_proj", # MLP: Gating projection
"up_proj", # MLP: Up projection
"down_proj" # MLP: Down projection
]
modules_to_save = [
]
(On a side note, you will likely want to toggle on save_adapter_separately in the weight checkpointing configuration if you wish to have the standalone LoRA adapter weights, prior to merging into the main model, saved to disk at each checkpoint.)
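For context on what “merging” means here, a minimal sketch under standard α/r scaling (illustrative only, not prime-rl’s checkpointing code; the function name is hypothetical):

```python
import torch

def merge_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               alpha: float, rank: int) -> torch.Tensor:
    """Fold a LoRA adapter back into the frozen base weight: W' = W + (alpha / rank) * B @ A."""
    # Shapes: W is (out_features, in_features), B is (out_features, rank), A is (rank, in_features).
    return W + (alpha / rank) * (B @ A)
```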
Experiments with LoRA in prime-rl
Experiment 1: alphabet-sort task with varying Ranks & Alpha
For our first “toy” task, we will be using a custom environment I’ve previously published on the environments hub at kalomaze/alphabet-sort. We are adapting this model: Qwen/Qwen3-4B-Instruct-2507. We will be using this task to demonstrate the basics of LoRA training, as well as the practical benefits compared to full finetuning.
Here’s an example of what a perfectly successful episode looks like when the model is instructed to sort by LAST name:
Turn 1: User (all turns): Sort the following names by LAST name alphabetically. User: EnginDurgun
<alphabetical_sorted>
EnginDurgun
</alphabetical_sorted>
Turn 2: User: Add DenizCakir
<combined_alphabetical_sorted>
DenizCakir // new name!
EnginDurgun
</combined_alphabetical_sorted>
Turn 3: User: Add NurtenAkman
<combined_alphabetical_sorted>
NurtenAkman // new name!
DenizCakir
EnginDurgun
</combined_alphabetical_sorted>
This multi-turn conversation task requires the model to sort names alphabetically by first OR last name (randomly chosen per episode), maintain a cumulative sorted list across multiple turns, and tag new names with // new name! markers per turn.
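To make the success criteria concrete, here is a rough sketch of how a single turn of such a response could be checked. The tag and the // new name! marker come from the transcript above, but the parsing helpers and the camel-case “FirstLast” name handling are assumptions on my part; the real environment’s reward logic may differ.

```python
import re

def extract_block(response: str, tag: str) -> list[str]:
    """Pull the name lines out of a <tag>...</tag> block (tag names taken from the transcript above)."""
    match = re.search(rf"<{tag}>\n(.*?)\n</{tag}>", response, re.DOTALL)
    return match.group(1).splitlines() if match else []

def is_correctly_sorted(lines: list[str], by_last_name: bool) -> bool:
    """Check alphabetical order, assuming camel-case FirstLast names (an assumed format)."""
    names = [line.replace("// new name!", "").strip() for line in lines]
    def sort_key(name: str) -> str:
        parts = re.findall(r"[A-Z][a-z]+", name)
        return parts[-1] if by_last_name else parts[0]
    return names == sorted(names, key=sort_key)

turn_2 = "<combined_alphabetical_sorted>\nDenizCakir // new name!\nEnginDurgun\n</combined_alphabetical_sorted>"
print(is_correctly_sorted(extract_block(turn_2, "combined_alphabetical_sorted"), by_last_name=True))  # True
```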
Deciding LoRA Rank / Alpha
In LoRA training, the rank controls the size (in parameter count) of the A/B adapter projections relative to the modules that they are learning to adapt, while the alpha hyperparameter is used to adjust the effective learning rate.
However, if you are testing or sweeping various LoRA ranks at a constant alpha parameter during training, you might end up observing that the “effective learning rate” behaves inconsistently across different rank sizes.
The gradient norms for the first step at batch_size = 4096, rollouts_per_example = 8, for four individual LoRA adapters trained at ranks 1, 4, 16, and 64, look like this:
- (effective batch size = 512 unique samples per step * 8 rollout attempts)

This is because standard LoRA scales by α/r, which makes the effective learning rate diminish as the rank increases; as such, higher rank LoRAs end up learning slower by default despite their additional capacity.
rsLoRA uses α/√r scaling instead to maintain relatively constant gradient magnitudes for a constant alpha across different ranks, and intuitively, this can be thought of as a simple rule to “normalize” the effective gradient magnitude across different LoRA sizes.
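To make the two scaling rules concrete, here’s a minimal PyTorch-style sketch of a LoRA-adapted linear layer; it is illustrative only (the class name and the use_rslora flag are mine, not prime-rl’s implementation):

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper -- not prime-rl's actual implementation."""
    def __init__(self, base: nn.Linear, rank: int, alpha: float, use_rslora: bool = False):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # the pretrained projection stays frozen
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)          # adapter starts as a no-op
        # Standard LoRA scales the update by alpha / r; rsLoRA scales it by alpha / sqrt(r).
        self.scaling = alpha / (rank ** 0.5) if use_rslora else alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```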
| Rank | Standard LoRA α (adjusted to match rsLoRA) | rsLoRA α |
|---|---|---|
| 1 | 64 | 64 |
| 4 | 128 | 64 |
| 16 | 256 | 64 |
| 64 | 512 | 64 |
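The standard LoRA column follows from equating the two rules: setting α/r equal to rsLoRA’s 64/√r gives α = 64·√r. A quick check:

```python
import math

rs_alpha = 64  # the constant rsLoRA alpha used throughout these experiments
for rank in [1, 4, 16, 64]:
    matched = rs_alpha * math.sqrt(rank)  # standard alpha needed so that alpha/r == rs_alpha/sqrt(r)
    print(f"rank {rank:>2}: standard LoRA alpha = {matched:.0f}")  # 64, 128, 256, 512
```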
The gradient norms for the first step of rsLoRA consequently look a lot more “even”: 
Constant Alpha vs. rsLoRA
Now, let’s compare ~15 steps of rsLoRA-normalized training against standard LoRA using the same constant alpha value across different rank sizes:

As you can see, using a constant alpha across different LoRA sizes/adapter ranks causes the scale of your gradients to become disproportionately small for larger adapters, whereas rsLoRA normalizes the contribution across ranks.
This becomes especially clear when observing the grad norms for the non-rsLoRA runs:

Memory Reduction
The most immediately useful benefit of LoRA for practitioners is the reduced memory usage for the gradients. Since we are only adapting the low rank projections with respect to the original weights, we end up saving a lot of memory; this enables us to do larger batch sizes on setups that are GPU-count limited.
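Since gradient and optimizer-state memory scale with the number of trainable parameters, a back-of-the-envelope count for a single square projection makes the savings obvious (the 2560 × 2560 shape below is purely illustrative, not claimed to match a specific model):

```python
d_in, d_out, rank = 2560, 2560, 16     # illustrative projection shape, not a specific model's

full_params = d_in * d_out             # full finetuning trains the whole matrix: ~6.6M weights
lora_params = rank * (d_in + d_out)    # LoRA trains A (rank x d_in) + B (d_out x rank): ~82K weights

print(f"full: {full_params:,}  lora: {lora_params:,}  ratio: {full_params / lora_params:.0f}x")  # ~80x
```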
What is worth noting is that for this task (which uses two GPUs total: one H100 for training, another for inference), I am able to fit a larger batch size than with full finetuning. At 4096 context (with sample packing), full finetuning of the 4B model can only tolerate a micro_batch_size of 1 on this 2xH100 setup; however, all LoRA configurations shown can tolerate a micro_batch_size of 4 without out-of-memory errors.

Comparison to Full Finetuning
For a task like this, there are many possible valid ways to compose and refine the state tracking without introducing fundamentally high rank changes to the model. While full finetuning still edges out our LoRA runs, the performance gap is small enough, given the simplicity of the task, that LoRA is far from a no-go.
Do also note that in “LoRA Without Regret”, a 10x (single order of magnitude) learning rate difference between full finetuning and LoRA training was found to yield empirically equivalent learning speed across varying rank sizes, which is why this screenshot uses a 1e-6 LR for the full-finetune variant of the training run instead of 1e-5.

So, this experiment demonstrates that rank 1 training is sufficient to capture the changes necessary to become competent at basic state tracking and string manipulation across multiple turns.
But does this hold for tasks requiring “deeper” semantic reasoning?
Experiment 2: unscramble with varying rsLoRA-scaled Ranks
Well, as you’re about to find out, the answer is “yes” to some degree, but not in a way that seems fundamentally limiting.
The unscrambling task I’m about to present (primeintellect/unscramble) is essentially a deeper, more challenging form of sorting. It requires the model to take randomly shuffled sentences that originally followed a chronological sequence and restore them to their correct semantic order.
We will be adapting the same Qwen3-4B instruct model as in the previous experiment, and setting the reward_mode variable in the environment to binary (meaning only perfect outcomes are rewarded, while any imperfection results in a reward of 0).
A 1.0 reward rollout for this environment looks like this:
We are given a scrambled sequence of steps involving a trap and an insect, and we are told that the block labels (*1*, *2*, etc.) are fixed — meaning we cannot change which step has which label — but we need to **reorganize the content** to form the **most sensible flow**.
The given steps are:
*1*: The trap closes completely
*2*: Tap the trap with a toothpick
*3*: Freeze the insect
*4*: Wait for the insect to move
*5*: Place it on the trap
We need to logically order these steps to make a coherent and sensible process, keeping the original labels.
Let's analyze what makes sense:
- To use a trap, we generally start by preparing the target (the insect).
- We need to get the insect to interact with the trap.
- The trap likely only activates or closes when the insect moves or triggers it.
Now, consider the logical sequence:
1. **Start with the insect**: You need an insect to place on the trap.
2. **Freeze the insect** — this might be to make it easy to handle or prevent movement.
3. **Place it on the trap** — now the insect is in position.
4. **Tap the trap with a toothpick** — to start it, perhaps to trigger movement.
5. **Wait for the insect to move** — after tapping, we wait to see if it reacts.
6. **The trap closes completely** — the final action.
But let's check if this makes sense and if the steps are in proper order.
Try this sequence:
→ *3*: Freeze the insect
→ *5*: Place it on the trap
→ *2*: Tap the trap with a toothpick
→ *4*: Wait for the insect to move
→ *1*: The trap closes completely
This sequence follows a logical, step-by-step flow:
- Prepare the insect (freeze it)
- Put it on the trap
- Trigger it (tap)
- Wait for response
- Then, the trap closes
This is a coherent and sensible process.
Now, check if any step is out of place:
- *1* (trap closes) is the final outcome, so it should come last.
- *2* (tap) must come after placing the insect.
- *4* (wait) must follow tapping.
- *3* (freeze) should come before placing, to ensure the insect is handled properly.
- *5* (place) comes right before tapping.
✅ All steps are used, and the flow is logical.
Final ordered sequence with original labels:
<unscrambled_text>
*3*: Freeze the insect
*5*: Place it on the trap
*2*: Tap the trap with a toothpick
*4*: Wait for the insect to move
*1*: The trap closes completely
</unscrambled_text>
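Using the rollout above as a reference, here’s a rough illustration of what reward_mode = binary implies (this is not the environment’s actual scoring code; the function name and index-list representation are hypothetical):

```python
def binary_reward(predicted_order: list[int], gold_order: list[int]) -> float:
    """reward_mode = binary: 1.0 only for an exact reconstruction, 0.0 otherwise."""
    return 1.0 if predicted_order == gold_order else 0.0

gold = [3, 5, 2, 4, 1]                       # the label order from the rollout above
print(binary_reward([3, 5, 2, 4, 1], gold))  # 1.0 -- perfect ordering
print(binary_reward([3, 2, 5, 4, 1], gold))  # 0.0 -- a single swap zeroes the reward
```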
The reward curves over 100 steps for the unscramble environment (for the same set of ranks) look like this:

As you can clearly see, the rank 1 and rank 4 LoRAs don’t come nearly as close to the performance of the higher rank training runs as you may have expected. One could reasonably hypothesize that this is because the implicit semantic understanding being learned in a task like this requires intrinsically higher rank adaptation compared to the “simpler” case of alphabet-sort.
However, just because lower rank LoRAs learn slower and have less capacity doesn’t mean that they can’t eventually find functionally equivalent but simpler solutions, if given sufficient time & data.
I am unsure whether peak performance simply takes longer for lower rank LoRAs to converge to on many real world RLVR tasks; after all, looking at the graph, the reward doesn’t appear to have permanently plateaued, even for rank 1. This remains something to be studied in more depth in future LoRA experiments, especially for longer training runs.
It’s also worth noting that the potential redundancies introduced by additional LoRA capacity could be a detriment to some degree; underparameterized LoRAs might learn more generalizable, stable adaptations to the target task, which could very well be preferable if the model is given enough time to learn. As with anything else in RL, outcomes can vary dramatically depending on the nature of the task or the task’s design; feel free to experiment and determine what the best empirical balance is for the specific task that you are training.
Experiment 3: acereason-math with varying rsLoRA-scaled Ranks
For our third (and final) experiment, we run a much longer-context, single-turn reasoning task: primeintellect/acereason-math, an environment that teaches (you guessed it!) single-turn mathematical reasoning. This task is run on top of a model that was already trained on distilled DeepSeek R1 reasoning traces (deepseek-ai/DeepSeek-R1-Distill-Qwen-7B), with a maximum token count of 8192. (A full example of a reasoning trace will not be shown here for the sake of brevity.)
As we established earlier, it is tricky to directly compare FFT vs. LoRA in a way that is “fair” due to intrinsically different LR scaling rules (as LoRA is adapting different modules), but since rsLoRA keeps the effective scaling consistent across ranks, we can ablate the rank difference more predictably:

One might note that the trend of improvement is somewhat similar to the unscramble task, except with a larger gap between the adjacent ranks over time. Rank 64 and Rank 16 were much more tightly clustered together on the unscramble task, while the differences between those two appear much more pronounced here (and the same pattern applies to rank 1 and 4).
So what we can gather from this is additional evidence that strengthens the earlier hypothesis: rank scaling doesn’t follow a consistent relationship to reward improvement across varying RL environments, and the exact degree to which it matters for learning isn’t easily predictable ahead of time.
Limitations
All experiments were conducted with a constant rsLoRA alpha of 64, which was chosen as a reasonable value to compare varying rank sizes without hurting stability. The relationship between alpha scaling and learning rate, and more specifically, what the empirical or theoretical differences are between scaling your alpha instead of the learning rate to increase the “effective” learning rate, has not been rigorously studied at this point in time.
Conclusion
prime-rl has full support for LoRA. We plan to continue improving it with better LoRA algorithms and more efficient implementations. We are also working on MoE support, as well as the ability to train multiple LoRA adapters at the same time, in preparation for our upcoming Reinforcement Fine-tuning API launch.
If there are any features that you think are missing that would be useful to have, feel free to leave an issue or open a pull request on the repository.
As always, we’ll continue to push forward on our vision to make reinforcement learning as accessible as possible. Our mission is to build the open superintelligence stack. This starts with empowering businesses looking to adapt the technology for real world use cases, as well as the broader research community at large.
If you’re excited to help shape the future of a truly sovereign open-source AI ecosystem, we’d love to hear from you and invite researchers and companies to:
- Contribute: Develop environments & evals for the Environments Hub. (Get Started)
- Collaborate: We’re hiring engineers and researchers at the intersection of AI and distributed systems. (Careers)