RL for Language Models, From First Principles
This post introduces the ideas behind post-training LLMs. I’ve always learned best by tracing a concept’s lineage — building intuition from where an idea started toward where it is now. Here I do that for post-training: starting from probability basics, deriving the policy gradient, and building up to the modern RL algorithms. Largely inspired by my reading of rlhfbook.com.
The probability you actually need
Almost everything in reinforcement learning is built out of five small ideas from probability. Not a semester’s worth — five. If you’re solid on these, the scary equations later turn out to be just these ideas wearing formal clothes. So we start here, from absolute zero, and we don’t move on until they feel obvious.
// random variable
Most numbers have one fixed value: 5 is 5. A random variable is a quantity whose value isn’t decided until some uncertain process resolves. A die roll is a random variable: before you roll, it doesn’t have a value — it has a set of possible values , each with some chance. We write to mean “the process came out 3.” Convention: capital for the uncertain thing itself, lowercase for a specific value it might take.
// probability distribution
This is the full description of a random variable: for every possible value, how likely it is. For a fair die, “each of 1–6 has probability .” Two rules every distribution obeys: each probability sits between 0 and 1, and they all sum to 1 (something must happen). We write for “the probability that takes value .”
// conditional probability
Often the likelihood of one thing depends on another. We write , read “probability of given .” The bar means “assuming has happened, here’s the distribution over .” For example is high; is low — same outcome, different conditioning, different probability.
// expectation — the probability-weighted average
Suppose a game pays you the value of a die roll in rupees. Any single play is random. What do you get on average over many plays? Weight each outcome by its probability and add:
The symbol means “expected value of.” Notice 3.5 isn’t even a possible roll — expectation is the long-run average, the center of mass of the distribution, not a value you’ll necessarily ever see. More generally, for any function of the random variable:
“Average the value of , weighting each input by how likely it is.” When is continuous, the sum becomes an integral — same meaning, a weighted average over all possibilities. A subscript tells you which distribution you’re averaging over: means ” is drawn from , average over that.” The "" reads “is distributed as.” Keep an eye on that subscript — in RL, the distribution you average over is the very thing you’re trying to change.
// sampling & Monte Carlo estimation
“Sampling from a distribution” means drawing one concrete outcome according to its probabilities — rolling once and getting a 4. Here’s the practical punchline: if you can’t compute an expectation by hand (too many outcomes to sum), you can estimate it by sampling many times and averaging:
Roll the die 10,000 times, average — you’ll get close to 3.5. This is Monte Carlo estimation, and it is how essentially all of practical RL turns otherwise-impossible expectations into things a computer can actually compute. The more samples , the closer the estimate. Hold onto this; it’s the workhorse of the entire field.
Reinforcement learning is about an agent (a decision-maker — a robot, a game player, a language model) interacting with an environment (everything outside it). The interaction is a loop over discrete timesteps
- The agent observes the current state — a description of the situation right now.
- It picks an action .
- The environment responds with a reward (a number scoring that action) and a new state .
- Repeat.
The goal is to pick actions that lead to lots of reward over time — not just immediate reward, but accumulated reward. That’s the whole game.
The MDP: bundling this into math
A Markov Decision Process (MDP) is the formal package describing such a problem. It’s written as a tuple of five things bundled together with a name, the way you’d define a struct:
| Symbol | Name | What it is |
|---|---|---|
| State space | The set of all possible states. In chess, every board configuration. The calligraphic letter denotes the whole set; is one particular state. | |
| Action space | The set of all possible actions. {up, down, left, right} for a grid-walker; a continuous range of angles for a steering wheel. | |
| Transition dynamics | A conditional distribution: given you’re in and take , the probabilities over each possible next state. A distribution, not one outcome, because environments can be stochastic. | |
| Reward function | A scalar scoring action in state . Higher is better. This is how you tell the agent what you want. | |
| Discount factor | A number in controlling how much future reward is worth relative to immediate reward. We’ll see exactly how it enters next. |
Run the agent from start to finish and you get a trajectory (also: rollout, episode) — the whole logged sequence:
( is the Greek letter tau; the episode ends in the terminal state , so actions run to .) One trajectory is one complete playthrough. We score it by its total reward, but with a twist — we weight each reward by raised to its timestep. This is the return:
Reading the sum: means “add up the following for , then , … up to .” Since , higher powers are smaller (), so rewards far in the future count for less.
// why discount at all?
- Modeling preference: a reward now is generally better than the same reward later (like money today vs. next year), and encodes that.
- Mathematical safety: if the loop runs forever, the undiscounted sum could blow up to infinity. With and bounded rewards, a geometric series converges to something finite.
- A dial on far-sightedness: the effective horizon is roughly steps. ≈ 10-step lookahead; ≈ 100.
The policy: how the agent decides
The agent needs a rule for choosing actions given states. That rule is the policy, written (pi):
Read it carefully. It’s a conditional distribution: given the current state , it gives a probability for each possible action. So the policy doesn’t output one fixed action — it outputs a distribution, and the agent samples from it. The subscript (theta) is the crucial part: stands for the parameters — concretely, the weights of a neural network. Different → different behavior. Learning = searching for good values of .
A trajectory now gets generated by alternating two random draws:
- the policy picks an action: ,
- the environment picks the next state: .
Because two random processes are involved, the trajectory is itself a random variable — run the same policy twice, get two different trajectories. We write for “the trajectory produced by running policy .”
Putting it all together:
Here is the objective — the thing we maximize:
Read it right to left, using everything above. The inner bracket is the return — the discounted reward of one trajectory. The is the expectation from Part 1 — the probability-weighted average — taken over trajectories the policy produces. So in one English sentence:
is the expected total discounted reward when you let policy run in the environment.
Written out explicitly as a weighted average (the is just shorthand; for an LM, whose trajectories are discrete token sequences, read it as a plain sum over all possible sequences):
where (rho) is the distribution of starting states, the agent product runs over all chosen actions , and the environment product over the transitions , ending in the terminal state . This sum can’t be computed directly — the space of sequences is astronomically large — so we estimate it by Monte Carlo: sample trajectories by running the policy, average their returns. And the entire goal of RL, in one line:
"" means “the that makes largest.” We climb by repeatedly nudging in the direction that increases it. The one hard part — and the reason RL is its own field — is computing that gradient , because sits inside the sampling distribution. That’s Parts 5 and 6.
Language models are MDPs
Here’s the bridge that makes all of this apply to LLMs. Take the abstract MDP and fill in every slot for a language model generating text. Notation first: is the prompt, is the completion, individual generated tokens are , and means “all completion tokens before position .”
| MDP concept | Language model |
|---|---|
| State | Prompt + tokens so far: — literally the transformer’s context window at step . It grows as generation proceeds. |
| Action | The next token . The action space is the vocabulary — typically 32K–256K options at every step. |
| Transition | Deterministic: append the chosen token to the sequence. No randomness, nothing to learn. |
| Policy | The LM’s next-token distribution. The model already is the policy — the softmax over the vocabulary is exactly . |
| Episode | One prompt → one completion. Feed a prompt, generate until end-of-sequence or a length cap. |
| Terminal reward | A reward-model score (RLHF) or a verifier’s output (RLVR — did the math check out, did the code pass tests). Arrives only at the end. |
| Typically — no discounting. Careful with the why, though: it’s not that would do nothing here. With a terminal-only reward, would multiply the score by — an implicit brevity penalty that grows with completion length. Setting is the deliberate choice not to smuggle a length preference into the objective; when you want one, you put it in the reward explicitly. |
// consequence 1 — the trajectory probability is pure policy
Every factor in equals 1, so the whole product collapses to just the policy terms:
This is literally the ordinary autoregressive likelihood a language model already computes during pretraining. The RL “trajectory distribution” and the LM’s “sequence likelihood” are the same object — which is the deep reason RL slots onto LMs so naturally.
// consequence 2 — all randomness lives in one place
In a robot, randomness comes from two taps: the policy’s action noise and the environment’s transition noise. In an LM there’s exactly one tap: the sampling of which token to emit. Decode greedily (always argmax) and the whole trajectory is deterministic. This is why sampling temperature is the entire exploration mechanism in LLM-RL. And a concrete reason to care, planted early: the group-based methods of Part 8 score a group of sampled completions and divide by the group’s spread, so the group needs temperature to have any spread at all. If every completion comes out identical, their rewards are identical too: the numerator goes to , the standard deviation in the denominator goes to , and every advantage is — no spread, no signal, the gradient vanishes. (That expression is GRPO’s advantage formula; it arrives properly in Part 8.)
How a policy gradient works
Here is the heart of the whole thing, and it really does decompose into exactly two questions. The principle in one line: make actions more likely when they lead to better outcomes. The update has this shape:
is the change you make to the parameters this step. The "" means “proportional to” — there’s a learning-rate constant out front we’re suppressing to focus on direction and sign. The right side is a product of two things: a scalar (capital psi) and a vector . Read it as two questions answered at once.
Factor 1 — which action?
Start inside. is the probability the policy assigned to the action it actually took. Take its log (products of small probabilities underflow; logs turn products into sums — and, as we’ll see, the log is what makes the whole trick work). Now the gradient of that log-prob.
What is a gradient here? is a giant vector — every weight in the network. is a single number depending on all of them. The gradient is a vector the same size as , where each entry says: “if I nudge this weight up a touch, how much does the log-prob of this action change?” It is the direction in weight-space that most rapidly increases the probability of taking in .
That’s the whole meaning of Factor 1: it points the parameters toward making “do here” more likely next time. It carries no information about whether that’s a good idea — it’s pure “here’s how to do more of this.” Which is why you need Factor 2.
Factor 2 — how good was it?
is a scalar scoring the outcome. Its sign is the entire learning mechanism, because a scalar times a vector either keeps the vector’s direction or flips it:
- (good outcome): points along Factor 1 → make more likely. Reinforce it.
- (bad outcome): the negative scalar flips the vector → make less likely. Suppress it.
- Magnitude : bigger means a stronger push. Great outcome → shoved up hard; mildly good → gentle nudge.
Move the parameters to make good actions more likely and bad actions less likely, in proportion to how good or bad they were.
Where that equation comes from
You can take Part 5 on faith, but the derivation — one identity plus bookkeeping — makes the log feel inevitable rather than magic, and it explains the one subtlety that makes RL hard. We want to ascend , so we need its gradient .
// the obstacle
Write the objective as an integral. Since is just a number for a fixed trajectory (only its probability depends on ), the gradient lands only on :
This is correct — but not computable in this form, and here’s exactly why. The only tool we have for an integral this large is Monte Carlo: rewrite it as an expectation, then sample. But Monte Carlo only works on integrals shaped like an expectation — a genuine probability distribution multiplying some function:
The sampling is the weighting: draw from , and frequent samples land where is large, exactly reproducing the "" factor. Now compare the two integrals:
- itself has in the multiplying slot — a real distribution. Sampleable. Run the policy times, average returns. ✓
- has in that slot — and that is not a probability distribution. ✗
// the fix — the log-derivative trick
One algebraic identity rescues everything. It’s just the chain rule on , run backwards ():
Substitute it into the broken integral and watch a real distribution reappear in the multiplying slot:
The trick manufactured a out front and pushed the gradient onto , where it becomes a well-behaved, finite, signed vector — perfectly fine to evaluate (it’s allowed to be negative; only the distribution part needs to be a real distribution, and now it is). The integral is an expectation again. Sampleable. That is precisely why the log is forced on you.
// the final form
Recall . Those transition terms don’t depend on (for an LM they’re literally zero; in general they just vanish under ). So:
REINFORCE, and the trick that saves it
REINFORCE is just the bare skeleton with the simplest possible . For an LM — terminal reward, — every token in a completion shares one scalar, the completion’s reward :
Generate a completion, score it, push up the log-probability of every token in it by an amount proportional to that score. It’s exactly correct on average. So what’s wrong with it?
The baseline — variance reduction for free
This is the single most important idea in the entire lineage. Subtract a baseline from the reward before using it as :
The remarkable fact: as long as doesn’t depend on the action (it may depend on the state ), this does not change the gradient’s expectation — the estimator stays exactly unbiased — yet it can dramatically reduce the variance. The one-line reason: pulls out of the action-average, and what’s left differentiates the constant :
So you may subtract something for free, and the smart choice is the average reward, which re-centers the signal:
Now better-than-average → (reinforce); worse-than-average → (suppress). Even if all rewards are to , after subtracting the mean (~7.5) you get properly signed signals from to . This re-centered quantity is exactly what we call the advantage.
Every algorithm that follows is a different answer to one question: what baseline do I subtract, and how do I compute it?
The algorithm zoo
From here the post shifts from foundations to the frontier. The derivation is behind us; Parts 8–9 take the practitioner’s view — the bets people actually make and why their runs break — so the pace picks up a notch.
The whole family tree is “same gradient, progressively smarter , plus eventually some step-size control.” Two levers are turning throughout: the baseline inside (a variance story) and step-size control (a stability story). Let’s walk it in the order the ideas actually depend on each other.
RLOO — a clever, unbiased, critic-free baseline
The naive baseline is “average reward over the batch.” RLOO (REINFORCE Leave-One-Out) does something sharper. For each prompt, generate completions ( plays the same role as the group size in GRPO below); the baseline for completion is the mean of all the others, excluding itself:
Empirically it punches above its weight: Ahmadian et al. (2024, “Back to Basics”) found vanilla REINFORCE — and RLOO as its multi-sample extension — matching or beating PPO in their RLHF experiments, arguing PPO’s machinery is often unnecessary there. But mind the setting: those results are on short-form preference tasks (summarization, dialogue — completions of a few hundred tokens), where treating the whole completion as a single “bandit” action — one decision, scored once, no sequential credit — is reasonable. Whether one shared, sequence-level advantage suffices over long reasoning chains is actively contested — VinePPO (ICML 2025) and a line of credit-assignment work argue it’s too coarse there and report gains from finer per-step credit on math benchmarks. Well-replicated for short-form RLHF; genuinely disputed for long-horizon reasoning — not a universal law in either direction.
PPO — the other branch: a learned baseline + step control
PPO comes from a different instinct: instead of estimating the baseline from sibling samples, learn a value function — a separate network predicting expected reward from a state — and use it to build a per-token advantage via GAE. This is the token-level view: every token gets its own advantage. Two costs. First, the value network itself (≈ policy-sized, noisy, finicky to train). Second, and subtler: a baseline that depends only on the state is action-independent, so on its own it stays unbiased (the same free-lunch identity as Part 7) — the bias actually enters through GAE’s bootstrapped advantage targets, which estimate returns partly from other estimates. GAE’s is the dial that trades that bootstrap bias against variance.
PPO also adds the second, orthogonal ingredient — step-size control — that GRPO will inherit. The problem: you’d like to reuse expensive samples for several gradient steps, but after the first step the policy has moved, so the samples are “off-policy.” The fix is importance sampling: reweight each sample by the ratio of how likely it is under the new vs. old policy (written by convention — an unfortunate collision with the reward symbol; this is a ratio, not a reward),
and then clip it into so no single sample with a runaway ratio can blow up the gradient. That clipped-ratio objective is PPO’s “surrogate objective” — a safety rail on the step.
GRPO — RLOO’s baseline idea wearing PPO’s clip
The synthesis, and the one you’ve probably actually run. GRPO (introduced with DeepSeekMath) takes the critic-free, group-based baseline (the RLOO branch) and bolts on PPO’s clipped surrogate (the stability branch). For each prompt, sample a group of completions and normalize:
Above-average completions get positive advantage, below-average negative — and the std-normalization rescales it. The group is the baseline; no value function needed. Then GRPO keeps PPO’s per-token importance ratio and clipping essentially unchanged. In the lineage: GRPO is a simplified PPO that groups completions per prompt and normalizes rewards within the group, removing the need for a value function.
Flag that last phrase now, because it’s about to do a lot of work: “token-level vs. sequence-level” is really three separate questions — where the reward lands, how credit is spread across tokens, and where the importance ratio is clipped — and GRPO already answers them differently (sequence-level reward, sequence-level advantage, token-level ratio). Knob 2 in Part 9 pulls the three apart; for now just don’t collapse them into one.
A worked example: three advantages, one group
Make it concrete. One prompt, a group of four sampled completions, binary reward — exactly one is correct:
Here is the advantage each recipe assigns to those four completions (group mean ; sample std ):
| Completion | reward | RLOO () | GRPO () | GRPO, no std (Dr. GRPO) |
|---|---|---|---|---|
| 1 | 0 | |||
| 2 | 0 | |||
| 3 | 1 | |||
| 4 | 0 |
All three agree on the sign — push the one correct completion up, the three wrong ones down — and since each is mean-centered, every column sums to zero. They differ only in scale and split. Dividing by the std (middle column) inflates magnitudes, and that is exactly the lever that bites: a group whose completions barely differ has a tiny std, so dividing by it blows the advantages up, while a high-variance group gets damped. Dropping the std (right column) removes that distortion — but also stops amplifying the rare all-but-one-wrong groups that are often your hardest, most informative prompts. Same skeleton, different bet.
// the same idea, in my own training loop
After all the machinery, GRPO’s is four lines of PyTorch. This is from qwen3-nano-math-reasoner, where I post-trained Qwen3-0.6B on MATH-500 with GRPO — lightly annotated, otherwise as it runs:
# src/qwen_grpo.py — the heart of one GRPO update (G rollouts, one prompt)
rewards = torch.tensor(roll_rewards, device=device) # G verifier scores
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
adv = advantages.detach()
# Dr. GRPO is a one-line edit: delete the "/ (rewards.std() + 1e-4)".
# all rewards identical -> advantages all ~0 -> no gradient in this group
is_zero_adv = torch.allclose(
advantages, torch.zeros_like(advantages), atol=1e-8, rtol=0.0
)
if skip_zero_adv and is_zero_adv:
return {"loss": 0.0} # nothing to learn here; skip the update
Two details the equations glossed over, sitting right there in the code: the 1e-4 in the denominator is the practical answer to Part 4’s zero-spread collapse (without it, a group of identical rewards divides by ), and the explicit zero-advantage skip is the honest admission that such a group teaches nothing. Hold that thought — the next section’s dynamic sampling is this same skip, turned into a sampling strategy.
The frontier — DAPO, GSPO, CISPO
GRPO has residual leaks; each recent algorithm patches one — still the same skeleton.
Fixes a length bias in how the loss is averaged. GRPO averages token-contributions within each response then across responses, which favors shorter completions. DAPO switches to a token-level aggregation over all responses, so length no longer dilutes the signal — better for multi-step reasoning. (Also adds clip-higher — a looser upper clip bound, so high-advantage exploratory tokens aren’t over-suppressed — and dynamic sampling, which discards all-right or all-wrong groups since their advantages are zero and they contribute no gradient.)
Fixes the granularity of the importance ratio. Per-token clipping zeros out the gradient for any token whose ratio strays outside the band — exactly the rare, important reasoning tokens you most want the model to learn. GSPO computes and clips the ratio at the sequence level. Its principle: the unit of optimization should match the unit of reward. Especially important for MoE, where per-token ratios have explosive variance.
Attacks the same token-dropping problem differently — it drops PPO’s clip on the objective and instead clips the importance-sampling weight, wrapped in a stop-gradient. Detaching the clipped weight means it only rescales each token’s REINFORCE-style update and can never zero it out, so every token always contributes a gradient — including the rare high-ratio reasoning tokens that GRPO’s clip would discard. GSPO’s answer is “lift the ratio to the sequence level”; CISPO’s is “keep every token, but cap the weight so none is ever fully dropped.”
Three axes separate every algorithm above. Understand these and you understand why each design choice was made — and why your training run does or doesn’t work.
Knob 1 — bias vs. variance
Your gradient is computed from samples, so it’s a random variable with two properties. Bias: is it centered on the true gradient — would infinite samples recover it, or something systematically off? Variance: how much does it jitter batch-to-batch? They trade against each other, and they hurt differently: bias sends you confidently toward the wrong optimum; variance makes training noisy, forces tiny learning rates, and can collapse. You met all of these moves in Part 8 — the point now is to read the whole zoo as one progression along this single axis:
Unbiased, catastrophic variance. Exactly correct on average; one noisy uncentered scalar makes it unusable raw.
Variance reduction for free. The one move that cuts variance without adding bias. RLOO’s leave-one-out is the careful version: maximal variance reduction subject to staying unbiased.
Deliberate bias-for-variance. The state-value baseline is action-independent (so unbiased as a baseline); the bias enters through GAE’s bootstrapped advantage targets. GAE’s is literally the bias↔variance dial. The opposite bet from RLOO.
A small bias for stability. Dividing by group std stabilizes magnitudes but biases toward low-variance prompts — the live knob inside GRPO.
Bias for step-safety. Truncating the ratio is a biased operation accepted to stop any step blowing up.
A recurring empirical finding (not a theorem): in several reported LLM RLHF/RLVR settings, the unbiased-sampled-baseline branch (RLOO → GRPO → GSPO) matches or beats the learned-critic branch (PPO), because the critic’s bias and engineering cost can outweigh its variance reduction in that regime. How far it generalizes is still open.
Knob 2 — token-level vs. sequence-level
This phrase means three independent things; conflating them is the #1 source of confusion.
- The reward. Where does the scalar land? Standard setup: sequence-level (terminal) — the verifier/RM scores the whole completion. There’s no natural per-token reward. (Process supervision, scoring each reasoning step, is the denser exception.) This is the ground truth: the reward genuinely lives at the sequence level.
- The advantage / credit assignment. Given a sequence-level reward, how is credit spread over tokens? Sequence-level (REINFORCE/RLOO/GRPO): every token gets the same advantage, no value function. Token-level (PPO+GAE): each token gets its own, via the critic — finer, but you’re manufacturing per-token signal that wasn’t really there.
- The importance ratio + clipping. Where is the off-policy correction computed? Token-level (PPO, GRPO) vs. sequence-level (GSPO).
// importance sampling, in one breath
The policy gradient is an expectation over trajectories from the current policy. Reusing samples from a slightly-older policy makes them off-policy; importance sampling reweights each by to correct the mismatch. It only behaves when old and new are close — diverge, and the ratio explodes into huge variance, which is why you clip it. The token-vs-sequence question for the ratio is separate from the advantage question, and GSPO’s whole pitch is that GRPO’s remaining token-level ratio is the leak.
Knob 3 — the KL penalty: in the reward vs. as a regularizer
The KL penalty’s purpose is identical either way: keep from drifting too far from a reference (usually the SFT model), so it doesn’t hack the reward or destroy general capability. What differs is where you attach the leash.
| Placement A — in the reward (RLOO) | Placement B — separate loss term (GRPO) |
|---|---|
| — drift is treated as part of the scored outcome, flows through the baseline and gets the same credit-assignment treatment as task reward. | — advantage is pure task reward; KL is bolted on like weight decay, penalizing the distribution directly. |
One more frontier note: when the reward is a clean verifier rather than an exploitable learned RM, the original justification for the leash weakens, and several recent RLVR recipes (DAPO among them) drop the KL penalty entirely, arguing it just holds the model back from the large policy moves that reasoning gains require.
The one-page map to carry
Everything in this post is a single equation with two levers:
The direction factor (“which action”) never changes. Everything is (a) what you put in to lower variance, and (b) how you control the step.
REINFORCE (none) → RLOO (unbiased leave-one-out group mean, no critic) → PPO (learned value fn + per-token GAE) → GRPO (group mean/std, critic-free like RLOO but per-token like PPO).
REINFORCE/RLOO (none) → PPO (clipped importance ratio) → GRPO (inherits the clip) → DAPO (fix length-averaging), GSPO (ratio + clip at sequence level), CISPO (clip the weight, not the objective).
And the whole family as one matrix — every row is the same policy-gradient skeleton with different entries in a few columns:
| Method | Baseline | Advantage unit | Ratio + clip unit | Length handling | What it fixes / its failure mode |
|---|---|---|---|---|---|
| REINFORCE | none | sequence | none (on-policy) | — | unbiased, but catastrophic variance |
| RLOO | leave-one-out group mean (critic-free) | sequence | none (single step) | token-mean | cuts variance, stays unbiased |
| PPO | learned value Vφ | token (GAE, bootstrapped → biased) | token | token-level | enables sample reuse; cost is the critic |
| GRPO | group mean / std (critic-free) | sequence (shared over tokens) | token | per-response then per-group → length bias | drops the critic; std + length biases remain |
| DAPO | group | sequence | token, “clip-higher” + dynamic sampling | token-level over all responses | fixes GRPO’s length-averaging; often drops KL |
| GSPO | group | sequence | sequence | sequence-level | tames token-ratio variance (esp. MoE) |
| CISPO | group | sequence | token, clips the IS-weight (stop-grad) | token-level | keeps every token’s gradient |
A key for the Length column, since the terms haven’t all appeared yet: it records how the per-token loss terms are averaged, which is exactly where DAPO showed length bias hides. Token-mean / token-level = all tokens pooled into one flat average (each token weighs the same, regardless of which completion it came from). Per-response then per-group = GRPO’s nested averaging — average within each completion first, then across completions — which quietly down-weights every token of a long completion. Sequence-level = one loss term per completion, so length never enters the averaging at all.
The unit of optimization should match the unit of reward.
That one sentence — GSPO’s — is the throughline the modern papers keep rediscovering, and the most useful thing to keep in your head. The reward is sequence-level, which is exactly why — in the reasoning/RLVR results reported so far — the sequence-level branch (RLOO → GRPO → GSPO) keeps winning on simplicity and robustness over dragging a token-level value function (PPO) or a token-level importance correction (vanilla GRPO) into a setting where the reward never actually lived at the token level. Start from the unbiased-but-explosive REINFORCE estimator; every method after it is either reducing variance while preserving unbiasedness, or trading a little bias for a large stability win. Once you can place any new algorithm on those two levers, you understand it.
References
- Williams (1992) — Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning — REINFORCE.
- Schulman et al. (2015) — High-Dimensional Continuous Control Using Generalized Advantage Estimation — GAE.
- Schulman et al. (2017) — Proximal Policy Optimization Algorithms — PPO.
- Schulman — Approximating KL Divergence — the k1/k2/k3 estimators.
- Shao et al. (2024) — DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — where GRPO was introduced.
- Ahmadian et al. (2024) — Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs — RLOO.
- Kazemnejad et al. (2024; ICML 2025) — VinePPO: Refining Credit Assignment in RL Training of LLMs — the credit-assignment counterpoint.
- Liu et al. (2025) — Understanding R1-Zero-Like Training: A Critical Perspective — Dr. GRPO.
- Yu et al. (2025) — DAPO: An Open-Source LLM Reinforcement Learning System at Scale — clip-higher, dynamic sampling, token-level loss.
- Zheng et al. (2025) — Group Sequence Policy Optimization — GSPO.
- Chen et al. (2025) — MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention — CISPO.
- Lambert — The RLHF Book — the reading that inspired this post’s arc.
thanks for reading —j.