fri, jun 12, 2026, 14:30:02
An impressionist landscape: a faint path winding uphill through a sunlit summer meadow toward a glowing horizon.

RL for Language Models, From First Principles

This post introduces the ideas behind post-training LLMs. I’ve always learned best by tracing a concept’s lineage — building intuition from where an idea started toward where it is now. Here I do that for post-training: starting from probability basics, deriving the policy gradient, and building up to the modern RL algorithms. Largely inspired by my reading of rlhfbook.com.

Almost everything in reinforcement learning is built out of five small ideas from probability. Not a semester’s worth — five. If you’re solid on these, the scary equations later turn out to be just these ideas wearing formal clothes. So we start here, from absolute zero, and we don’t move on until they feel obvious.

// random variable

Most numbers have one fixed value: 5 is 5. A random variable is a quantity whose value isn’t decided until some uncertain process resolves. A die roll XX is a random variable: before you roll, it doesn’t have a value — it has a set of possible values {1,2,3,4,5,6}\{1,2,3,4,5,6\}, each with some chance. We write X=3X=3 to mean “the process came out 3.” Convention: capital XX for the uncertain thing itself, lowercase xx for a specific value it might take.

// probability distribution

This is the full description of a random variable: for every possible value, how likely it is. For a fair die, “each of 1–6 has probability 1/61/6.” Two rules every distribution obeys: each probability sits between 0 and 1, and they all sum to 1 (something must happen). We write p(x)p(x) for “the probability that XX takes value xx.”

// conditional probability

Often the likelihood of one thing depends on another. We write p(yx)p(y \mid x), read “probability of yy given xx.” The bar means “assuming xx has happened, here’s the distribution over yy.” For example p(umbrellaraining)p(\text{umbrella} \mid \text{raining}) is high; p(umbrellasunny)p(\text{umbrella} \mid \text{sunny}) is low — same outcome, different conditioning, different probability.

// expectation — the probability-weighted average

Suppose a game pays you the value of a die roll in rupees. Any single play is random. What do you get on average over many plays? Weight each outcome by its probability and add:

E[X]=xp(x)x=16(1)+16(2)++16(6)=3.5\mathbb{E}[X] = \sum_x p(x)\cdot x = \tfrac{1}{6}(1)+\tfrac{1}{6}(2)+\cdots+\tfrac{1}{6}(6) = 3.5

The symbol E[]\mathbb{E}[\cdot] means “expected value of.” Notice 3.5 isn’t even a possible roll — expectation is the long-run average, the center of mass of the distribution, not a value you’ll necessarily ever see. More generally, for any function ff of the random variable:

E[f(X)]=xp(x)f(x)\mathbb{E}[f(X)] = \sum_x p(x)\cdot f(x)

“Average the value of ff, weighting each input by how likely it is.” When XX is continuous, the sum becomes an integral p(x)f(x)dx\int p(x)\,f(x)\,dx — same meaning, a weighted average over all possibilities. A subscript tells you which distribution you’re averaging over: EXp[f(X)]\mathbb{E}_{X\sim p}[f(X)] means ”XX is drawn from pp, average f(X)f(X) over that.” The "\sim" reads “is distributed as.” Keep an eye on that subscript — in RL, the distribution you average over is the very thing you’re trying to change.

// sampling & Monte Carlo estimation

“Sampling from a distribution” means drawing one concrete outcome according to its probabilities — rolling once and getting a 4. Here’s the practical punchline: if you can’t compute an expectation by hand (too many outcomes to sum), you can estimate it by sampling many times and averaging:

E[f(X)]1Ni=1Nf(xi),xi sampled from p\mathbb{E}[f(X)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i), \qquad x_i \text{ sampled from } p

Roll the die 10,000 times, average — you’ll get close to 3.5. This is Monte Carlo estimation, and it is how essentially all of practical RL turns otherwise-impossible expectations into things a computer can actually compute. The more samples NN, the closer the estimate. Hold onto this; it’s the workhorse of the entire field.


Reinforcement learning is about an agent (a decision-maker — a robot, a game player, a language model) interacting with an environment (everything outside it). The interaction is a loop over discrete timesteps t=0,1,2,t = 0, 1, 2, \dots

  1. The agent observes the current state sts_t — a description of the situation right now.
  2. It picks an action ata_t.
  3. The environment responds with a reward rtr_t (a number scoring that action) and a new state st+1s_{t+1}.
  4. Repeat.

The goal is to pick actions that lead to lots of reward over time — not just immediate reward, but accumulated reward. That’s the whole game.

The MDP: bundling this into math

A Markov Decision Process (MDP) is the formal package describing such a problem. It’s written as a tuple of five things bundled together with a name, the way you’d define a struct:

MDP  (S,  A,  P,  r,  γ)\text{MDP}\;(\,\mathcal{S},\;\mathcal{A},\;P,\;r,\;\gamma\,)
SymbolNameWhat it is
S\mathcal{S}State spaceThe set of all possible states. In chess, every board configuration. The calligraphic letter denotes the whole set; stSs_t \in \mathcal{S} is one particular state.
A\mathcal{A}Action spaceThe set of all possible actions. {up, down, left, right} for a grid-walker; a continuous range of angles for a steering wheel.
P(st+1st,at)P(s_{t+1}\mid s_t,a_t)Transition dynamicsA conditional distribution: given you’re in sts_t and take ata_t, the probabilities over each possible next state. A distribution, not one outcome, because environments can be stochastic.
r(st,at)r(s_t,a_t)Reward functionA scalar scoring action ata_t in state sts_t. Higher is better. This is how you tell the agent what you want.
γ\gammaDiscount factorA number in [0,1][0,1] controlling how much future reward is worth relative to immediate reward. We’ll see exactly how it enters next.

Run the agent from start to finish and you get a trajectory (also: rollout, episode) — the whole logged sequence:

τ=(s0,a0,s1,a1,,sT1,aT1,sT)\tau = (s_0,a_0,s_1,a_1,\dots,s_{T-1},a_{T-1},s_T)

(τ\tau is the Greek letter tau; the episode ends in the terminal state sTs_T, so actions run t=0t=0 to T1T-1.) One trajectory is one complete playthrough. We score it by its total reward, but with a twist — we weight each reward by γ\gamma raised to its timestep. This is the return:

R(τ)=t=0T1γtr(st,at)=r0+γr1+γ2r2+γ3r3+R(\tau) = \sum_{t=0}^{T-1}\gamma^t\,r(s_t,a_t) = r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3 + \cdots

Reading the sum: t=0T1\sum_{t=0}^{T-1} means “add up the following for t=0t=0, then 11, … up to T1T-1.” Since γ<1\gamma<1, higher powers are smaller (0.90=1, 0.91=0.9, 0.92=0.81,0.9^0=1,\ 0.9^1=0.9,\ 0.9^2=0.81,\dots), so rewards far in the future count for less.

// why discount at all?

  • Modeling preference: a reward now is generally better than the same reward later (like money today vs. next year), and γ\gamma encodes that.
  • Mathematical safety: if the loop runs forever, the undiscounted sum could blow up to infinity. With γ<1\gamma<1 and bounded rewards, a geometric series converges to something finite.
  • A dial on far-sightedness: the effective horizon is roughly 11γ\frac{1}{1-\gamma} steps. γ=0.9\gamma=0.9 ≈ 10-step lookahead; γ=0.99\gamma=0.99 ≈ 100.

The policy: how the agent decides

The agent needs a rule for choosing actions given states. That rule is the policy, written π\pi (pi):

πθ(atst)\pi_\theta(a_t \mid s_t)

Read it carefully. It’s a conditional distribution: given the current state sts_t, it gives a probability for each possible action. So the policy doesn’t output one fixed action — it outputs a distribution, and the agent samples from it. The subscript θ\theta (theta) is the crucial part: θ\theta stands for the parameters — concretely, the weights of a neural network. Different θ\theta → different behavior. Learning = searching for good values of θ\theta.

A trajectory now gets generated by alternating two random draws:

  • the policy picks an action: atπθ(st)a_t \sim \pi_\theta(\cdot \mid s_t),
  • the environment picks the next state: st+1P(st,at)s_{t+1} \sim P(\cdot \mid s_t, a_t).

Because two random processes are involved, the trajectory τ\tau is itself a random variable — run the same policy twice, get two different trajectories. We write τπ\tau \sim \pi for “the trajectory produced by running policy π\pi.”

Putting it all together: J(π)J(\pi)

Here is the objective — the thing we maximize:

J(π)=Eτπ ⁣[t=0T1γtr(st,at)]J(\pi) = \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t=0}^{T-1}\gamma^t\,r(s_t,a_t)\right]

Read it right to left, using everything above. The inner bracket is the return R(τ)R(\tau) — the discounted reward of one trajectory. The Eτπ[]\mathbb{E}_{\tau\sim\pi}[\cdots] is the expectation from Part 1 — the probability-weighted average — taken over trajectories the policy produces. So in one English sentence:

J(π)J(\pi) is the expected total discounted reward when you let policy π\pi run in the environment.

Written out explicitly as a weighted average (the  ⁣dτ\int\!\cdots d\tau is just shorthand; for an LM, whose trajectories are discrete token sequences, read it as a plain sum over all possible sequences):

J(π)=pπ(τ)R(τ)dτ,pπ(τ)=ρ(s0)t=0T1πθ(atst)agent  t=0T1P(st+1st,at)environment\begin{aligned} J(\pi) &= \int p_\pi(\tau)\,R(\tau)\,d\tau, \\ p_\pi(\tau) &= \rho(s_0)\,\underbrace{\prod_{t=0}^{T-1}\pi_\theta(a_t\mid s_t)}_{\text{agent}}\;\underbrace{\prod_{t=0}^{T-1}P(s_{t+1}\mid s_t,a_t)}_{\text{environment}} \end{aligned}

where ρ\rho (rho) is the distribution of starting states, the agent product runs over all TT chosen actions a0,,aT1a_0,\dots,a_{T-1}, and the environment product over the TT transitions s0s1sTs_0\to s_1\to\cdots\to s_T, ending in the terminal state sTs_T. This sum can’t be computed directly — the space of sequences is astronomically large — so we estimate it by Monte Carlo: sample NN trajectories by running the policy, average their returns. And the entire goal of RL, in one line:

θ=argmaxθJ(πθ),θθ+αθJ(πθ)(gradient ascent)\begin{aligned} \theta^* &= \arg\max_\theta\, J(\pi_\theta), \\ \theta &\leftarrow \theta + \alpha\,\nabla_\theta J(\pi_\theta) \quad (\text{gradient ascent}) \end{aligned}

"argmaxθ\arg\max_\theta" means “the θ\theta that makes JJ largest.” We climb JJ by repeatedly nudging θ\theta in the direction that increases it. The one hard part — and the reason RL is its own field — is computing that gradient θJ\nabla_\theta J, because θ\theta sits inside the sampling distribution. That’s Parts 5 and 6.


Here’s the bridge that makes all of this apply to LLMs. Take the abstract MDP and fill in every slot for a language model generating text. Notation first: xx is the prompt, yy is the completion, individual generated tokens are y1,y2,y_1, y_2, \dots, and y<ty_{<t} means “all completion tokens before position tt.”

MDP conceptLanguage model
State sts_tPrompt + tokens so far: (x,y<t)(x, y_{<t}) — literally the transformer’s context window at step tt. It grows as generation proceeds.
Action ata_tThe next token yty_t. The action space is the vocabulary — typically 32K–256K options at every step.
TransitionDeterministic: append the chosen token to the sequence. No randomness, nothing to learn.
Policy πθ(atst)\pi_\theta(a_t\mid s_t)The LM’s next-token distribution. The model already is the policy — the softmax over the vocabulary is exactly πθ\pi_\theta.
EpisodeOne prompt → one completion. Feed a prompt, generate until end-of-sequence or a length cap.
Terminal rewardA reward-model score (RLHF) or a verifier’s output (RLVR — did the math check out, did the code pass tests). Arrives only at the end.
γ\gammaTypically 1.01.0 — no discounting. Careful with the why, though: it’s not that γ\gamma would do nothing here. With a terminal-only reward, γ<1\gamma<1 would multiply the score by γT1\gamma^{T-1} — an implicit brevity penalty that grows with completion length. Setting γ=1\gamma=1 is the deliberate choice not to smuggle a length preference into the objective; when you want one, you put it in the reward explicitly.

// consequence 1 — the trajectory probability is pure policy

Every P()P(\cdot) factor in pπ(τ)p_\pi(\tau) equals 1, so the whole product collapses to just the policy terms:

pπ(yx)=t=1Tπθ(ytx,y<t)p_\pi(y\mid x) = \prod_{t=1}^{T}\pi_\theta(y_t\mid x, y_{<t})

This is literally the ordinary autoregressive likelihood a language model already computes during pretraining. The RL “trajectory distribution” and the LM’s “sequence likelihood” are the same object — which is the deep reason RL slots onto LMs so naturally.

// consequence 2 — all randomness lives in one place

In a robot, randomness comes from two taps: the policy’s action noise and the environment’s transition noise. In an LM there’s exactly one tap: the sampling of which token to emit. Decode greedily (always argmax) and the whole trajectory is deterministic. This is why sampling temperature is the entire exploration mechanism in LLM-RL. And a concrete reason to care, planted early: the group-based methods of Part 8 score a group of sampled completions and divide by the group’s spread, so the group needs temperature >0>0 to have any spread at all. If every completion comes out identical, their rewards are identical too: the numerator RimeanR_i - \text{mean} goes to 00, the standard deviation in the denominator goes to 00, and every advantage is 00\tfrac{0}{0} — no spread, no signal, the gradient vanishes. (That expression is GRPO’s advantage formula; it arrives properly in Part 8.)


Here is the heart of the whole thing, and it really does decompose into exactly two questions. The principle in one line: make actions more likely when they lead to better outcomes. The update has this shape:

Δθ    Ψtθlogπθ(atst)\Delta\theta \;\propto\; \Psi_t\,\nabla_\theta \log \pi_\theta(a_t\mid s_t)

Δθ\Delta\theta is the change you make to the parameters this step. The "\propto" means “proportional to” — there’s a learning-rate constant out front we’re suppressing to focus on direction and sign. The right side is a product of two things: a scalar Ψt\Psi_t (capital psi) and a vector θlogπθ(atst)\nabla_\theta \log \pi_\theta(a_t\mid s_t). Read it as two questions answered at once.

Factor 1 — which action? θlogπθ(atst)\nabla_\theta \log \pi_\theta(a_t\mid s_t)

Start inside. πθ(atst)\pi_\theta(a_t\mid s_t) is the probability the policy assigned to the action it actually took. Take its log (products of small probabilities underflow; logs turn products into sums — and, as we’ll see, the log is what makes the whole trick work). Now the gradient θ\nabla_\theta of that log-prob.

What is a gradient here? θ\theta is a giant vector — every weight in the network. logπθ(atst)\log\pi_\theta(a_t\mid s_t) is a single number depending on all of them. The gradient is a vector the same size as θ\theta, where each entry says: “if I nudge this weight up a touch, how much does the log-prob of this action change?” It is the direction in weight-space that most rapidly increases the probability of taking ata_t in sts_t.

That’s the whole meaning of Factor 1: it points the parameters toward making “do ata_t here” more likely next time. It carries no information about whether that’s a good idea — it’s pure “here’s how to do more of this.” Which is why you need Factor 2.

Factor 2 — how good was it? Ψt\Psi_t

Ψt\Psi_t is a scalar scoring the outcome. Its sign is the entire learning mechanism, because a scalar times a vector either keeps the vector’s direction or flips it:

  • Ψt>0\Psi_t > 0 (good outcome): Δθ\Delta\theta points along Factor 1 → make ata_t more likely. Reinforce it.
  • Ψt<0\Psi_t < 0 (bad outcome): the negative scalar flips the vector → make ata_t less likely. Suppress it.
  • Magnitude Ψt|\Psi_t|: bigger means a stronger push. Great outcome → shoved up hard; mildly good → gentle nudge.

Move the parameters to make good actions more likely and bad actions less likely, in proportion to how good or bad they were.


You can take Part 5 on faith, but the derivation — one identity plus bookkeeping — makes the log feel inevitable rather than magic, and it explains the one subtlety that makes RL hard. We want to ascend J(θ)=Eτπθ[R(τ)]J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)], so we need its gradient θJ\nabla_\theta J.

// the obstacle

Write the objective as an integral. Since R(τ)R(\tau) is just a number for a fixed trajectory (only its probability depends on θ\theta), the gradient lands only on pθp_\theta:

J(θ)=τpθ(τ)R(τ)dτθJ(θ)=τθpθ(τ)R(τ)dτ\begin{aligned} J(\theta) &= \int_\tau p_\theta(\tau)\,R(\tau)\,d\tau \\ \Longrightarrow\quad \nabla_\theta J(\theta) &= \int_\tau \nabla_\theta\, p_\theta(\tau)\,R(\tau)\,d\tau \end{aligned}

This is correct — but not computable in this form, and here’s exactly why. The only tool we have for an integral this large is Monte Carlo: rewrite it as an expectation, then sample. But Monte Carlo only works on integrals shaped like an expectation — a genuine probability distribution multiplying some function:

Eτp[f(τ)]=τp(τ)a real distribution  f(τ)dτ\mathbb{E}_{\tau\sim p}[f(\tau)] = \int_\tau \underbrace{p(\tau)}_{\text{a real distribution}}\;f(\tau)\,d\tau

The sampling is the weighting: draw from pp, and frequent samples land where pp is large, exactly reproducing the "p(τ)×p(\tau)\times" factor. Now compare the two integrals:

  • JJ itself has pθ(τ)p_\theta(\tau) in the multiplying slot — a real distribution. Sampleable. Run the policy NN times, average returns. ✓
  • θJ\nabla_\theta J has θpθ(τ)\nabla_\theta p_\theta(\tau) in that slot — and that is not a probability distribution.

// the fix — the log-derivative trick

One algebraic identity rescues everything. It’s just the chain rule on log\log, run backwards (logp=p/p\nabla \log p = \nabla p / p):

θpθ(τ)=pθ(τ)θlogpθ(τ)\nabla_\theta\, p_\theta(\tau) = p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)

Substitute it into the broken integral and watch a real distribution reappear in the multiplying slot:

θJ=τθpθ(τ)R(τ)dτ=τpθ(τ)distributionθlogpθ(τ)R(τ)dτ=Eτπθ ⁣[θlogpθ(τ)R(τ)]\begin{aligned} \nabla_\theta J &= \int_\tau \nabla_\theta p_\theta(\tau)\,R(\tau)\,d\tau \\ &= \int_\tau \underbrace{p_\theta(\tau)}_{\text{distribution}}\,\nabla_\theta \log p_\theta(\tau)\,R(\tau)\,d\tau \\ &= \mathbb{E}_{\tau\sim\pi_\theta}\!\big[\nabla_\theta \log p_\theta(\tau)\,R(\tau)\big] \end{aligned}

The trick manufactured a pθ(τ)p_\theta(\tau) out front and pushed the gradient onto logpθ(τ)\log p_\theta(\tau), where it becomes a well-behaved, finite, signed vector — perfectly fine to evaluate (it’s allowed to be negative; only the distribution part needs to be a real distribution, and now it is). The integral is an expectation again. Sampleable. That is precisely why the log is forced on you.

// the final form

Recall logpθ(τ)=tlogπθ(atst)+(transition terms)\log p_\theta(\tau) = \sum_t \log\pi_\theta(a_t\mid s_t) + (\text{transition terms}). Those transition terms don’t depend on θ\theta (for an LM they’re literally zero; in general they just vanish under θ\nabla_\theta). So:

θJ=Eτπθ ⁣[tθlogπθ(atst)  R(τ)]\nabla_\theta J = \mathbb{E}_{\tau\sim\pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t\mid s_t)\;R(\tau)\right]

REINFORCE is just the bare skeleton with the simplest possible Ψt=R(τ)\Psi_t = R(\tau). For an LM — terminal reward, γ=1\gamma=1 — every token in a completion shares one scalar, the completion’s reward R(x,y)R(x,y):

θJ=Eyπθ ⁣[R(x,y)θtlogπθ(ytx,y<t)]\nabla_\theta J = \mathbb{E}_{y\sim\pi_\theta}\!\left[ R(x,y)\,\nabla_\theta \sum_{t} \log\pi_\theta(y_t\mid x, y_{<t})\right]

Generate a completion, score it, push up the log-probability of every token in it by an amount proportional to that score. It’s exactly correct on average. So what’s wrong with it?

The baseline — variance reduction for free

This is the single most important idea in the entire lineage. Subtract a baseline bb from the reward before using it as Ψt\Psi_t:

Ψt=R(x,y)b\Psi_t = R(x,y) - b

The remarkable fact: as long as bb doesn’t depend on the action (it may depend on the state ss), this does not change the gradient’s expectation — the estimator stays exactly unbiased — yet it can dramatically reduce the variance. The one-line reason: b(s)b(s) pulls out of the action-average, and what’s left differentiates the constant 11:

Eaπθ(s) ⁣[b(s)θlogπθ(as)]=b(s)aπθ(as)θπθ(as)πθ(as)=b(s)θ ⁣aπθ(as)=b(s)θ1=0\begin{aligned} \mathbb{E}_{a\sim\pi_\theta(\cdot\mid s)}\!\big[b(s)\,\nabla_\theta\log\pi_\theta(a\mid s)\big] &= b(s)\sum_a \pi_\theta(a\mid s)\,\frac{\nabla_\theta\pi_\theta(a\mid s)}{\pi_\theta(a\mid s)} \\ &= b(s)\,\nabla_\theta\!\sum_a \pi_\theta(a\mid s) \\ &= b(s)\,\nabla_\theta 1 = 0 \end{aligned}

So you may subtract something for free, and the smart choice is the average reward, which re-centers the signal:

Ψt=R(x,y)Rˉaverage reward\Psi_t = R(x,y) - \underbrace{\bar R}_{\text{average reward}}

Now better-than-average → Ψt>0\Psi_t > 0 (reinforce); worse-than-average → Ψt<0\Psi_t < 0 (suppress). Even if all rewards are +5+5 to +10+10, after subtracting the mean (~7.5) you get properly signed signals from 2.5-2.5 to +2.5+2.5. This re-centered quantity RbR - b is exactly what we call the advantage.

Every algorithm that follows is a different answer to one question: what baseline do I subtract, and how do I compute it?

Part Eight

The algorithm zoo

From here the post shifts from foundations to the frontier. The derivation is behind us; Parts 8–9 take the practitioner’s view — the bets people actually make and why their runs break — so the pace picks up a notch.

The whole family tree is “same gradient, progressively smarter Ψt\Psi_t, plus eventually some step-size control.” Two levers are turning throughout: the baseline inside Ψt\Psi_t (a variance story) and step-size control (a stability story). Let’s walk it in the order the ideas actually depend on each other.

RLOO — a clever, unbiased, critic-free baseline

The naive baseline is “average reward over the batch.” RLOO (REINFORCE Leave-One-Out) does something sharper. For each prompt, generate KK completions (KK plays the same role as the group size GG in GRPO below); the baseline for completion ii is the mean of all the others, excluding itself:

A^i=Ri1K1jiRj\hat A_i = R_i - \frac{1}{K-1}\sum_{j\neq i} R_j

Empirically it punches above its weight: Ahmadian et al. (2024, “Back to Basics”) found vanilla REINFORCE — and RLOO as its multi-sample extension — matching or beating PPO in their RLHF experiments, arguing PPO’s machinery is often unnecessary there. But mind the setting: those results are on short-form preference tasks (summarization, dialogue — completions of a few hundred tokens), where treating the whole completion as a single “bandit” action — one decision, scored once, no sequential credit — is reasonable. Whether one shared, sequence-level advantage suffices over long reasoning chains is actively contested — VinePPO (ICML 2025) and a line of credit-assignment work argue it’s too coarse there and report gains from finer per-step credit on math benchmarks. Well-replicated for short-form RLHF; genuinely disputed for long-horizon reasoning — not a universal law in either direction.

PPO — the other branch: a learned baseline + step control

PPO comes from a different instinct: instead of estimating the baseline from sibling samples, learn a value function Vϕ(st)V_\phi(s_t) — a separate network predicting expected reward from a state — and use it to build a per-token advantage via GAE. This is the token-level view: every token gets its own advantage. Two costs. First, the value network itself (≈ policy-sized, noisy, finicky to train). Second, and subtler: a baseline that depends only on the state is action-independent, so on its own it stays unbiased (the same free-lunch identity as Part 7) — the bias actually enters through GAE’s bootstrapped advantage targets, which estimate returns partly from other estimates. GAE’s λ\lambda is the dial that trades that bootstrap bias against variance.

PPO also adds the second, orthogonal ingredient — step-size control — that GRPO will inherit. The problem: you’d like to reuse expensive samples for several gradient steps, but after the first step the policy has moved, so the samples are “off-policy.” The fix is importance sampling: reweight each sample by the ratio of how likely it is under the new vs. old policy (written rt(θ)r_t(\theta) by convention — an unfortunate collision with the reward symbol; this rr is a ratio, not a reward),

rt(θ)=πθ(atst)πθold(atst)r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}

and then clip it into [1ϵ,1+ϵ][1-\epsilon, 1+\epsilon] so no single sample with a runaway ratio can blow up the gradient. That clipped-ratio objective is PPO’s “surrogate objective” — a safety rail on the step.

GRPO — RLOO’s baseline idea wearing PPO’s clip

The synthesis, and the one you’ve probably actually run. GRPO (introduced with DeepSeekMath) takes the critic-free, group-based baseline (the RLOO branch) and bolts on PPO’s clipped surrogate (the stability branch). For each prompt, sample a group of GG completions and normalize:

A^i=Rimean({Rj}j=1G)std({Rj}j=1G)\hat A_i = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G)}

Above-average completions get positive advantage, below-average negative — and the std-normalization rescales it. The group is the baseline; no value function needed. Then GRPO keeps PPO’s per-token importance ratio and clipping essentially unchanged. In the lineage: GRPO is a simplified PPO that groups completions per prompt and normalizes rewards within the group, removing the need for a value function.

Flag that last phrase now, because it’s about to do a lot of work: “token-level vs. sequence-level” is really three separate questions — where the reward lands, how credit is spread across tokens, and where the importance ratio is clipped — and GRPO already answers them differently (sequence-level reward, sequence-level advantage, token-level ratio). Knob 2 in Part 9 pulls the three apart; for now just don’t collapse them into one.

A worked example: three advantages, one group

Make it concrete. One prompt, a group of four sampled completions, binary reward — exactly one is correct:

R=[0, 0, 1, 0]R = [\,0,\ 0,\ 1,\ 0\,]

Here is the advantage A^i\hat A_i each recipe assigns to those four completions (group mean =0.25=0.25; sample std =0.5=0.5):

Completionreward RiR_iRLOO (Rimean of othersR_i-\text{mean of others})GRPO (Rimeanstd\tfrac{R_i-\text{mean}}{\text{std}})GRPO, no std (Dr. GRPO)
1013-\tfrac{1}{3}0.5-0.50.25-0.25
2013-\tfrac{1}{3}0.5-0.50.25-0.25
31+1+1+1.5+1.5+0.75+0.75
4013-\tfrac{1}{3}0.5-0.50.25-0.25

All three agree on the sign — push the one correct completion up, the three wrong ones down — and since each is mean-centered, every column sums to zero. They differ only in scale and split. Dividing by the std (middle column) inflates magnitudes, and that is exactly the lever that bites: a group whose completions barely differ has a tiny std, so dividing by it blows the advantages up, while a high-variance group gets damped. Dropping the std (right column) removes that distortion — but also stops amplifying the rare all-but-one-wrong groups that are often your hardest, most informative prompts. Same skeleton, different bet.

// the same idea, in my own training loop

After all the machinery, GRPO’s Ψt\Psi_t is four lines of PyTorch. This is from qwen3-nano-math-reasoner, where I post-trained Qwen3-0.6B on MATH-500 with GRPO — lightly annotated, otherwise as it runs:

# src/qwen_grpo.py — the heart of one GRPO update (G rollouts, one prompt)
rewards = torch.tensor(roll_rewards, device=device)   # G verifier scores
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
adv = advantages.detach()
# Dr. GRPO is a one-line edit: delete the "/ (rewards.std() + 1e-4)".

# all rewards identical -> advantages all ~0 -> no gradient in this group
is_zero_adv = torch.allclose(
    advantages, torch.zeros_like(advantages), atol=1e-8, rtol=0.0
)
if skip_zero_adv and is_zero_adv:
    return {"loss": 0.0}  # nothing to learn here; skip the update

Two details the equations glossed over, sitting right there in the code: the 1e-4 in the denominator is the practical answer to Part 4’s zero-spread collapse (without it, a group of identical rewards divides 00 by 00), and the explicit zero-advantage skip is the honest admission that such a group teaches nothing. Hold that thought — the next section’s dynamic sampling is this same skip, turned into a sampling strategy.

The frontier — DAPO, GSPO, CISPO

GRPO has residual leaks; each recent algorithm patches one — still the same skeleton.

DAPO

Fixes a length bias in how the loss is averaged. GRPO averages token-contributions within each response then across responses, which favors shorter completions. DAPO switches to a token-level aggregation over all responses, so length no longer dilutes the signal — better for multi-step reasoning. (Also adds clip-higher — a looser upper clip bound, so high-advantage exploratory tokens aren’t over-suppressed — and dynamic sampling, which discards all-right or all-wrong groups since their advantages are zero and they contribute no gradient.)

GSPO

Fixes the granularity of the importance ratio. Per-token clipping zeros out the gradient for any token whose ratio strays outside the band — exactly the rare, important reasoning tokens you most want the model to learn. GSPO computes and clips the ratio at the sequence level. Its principle: the unit of optimization should match the unit of reward. Especially important for MoE, where per-token ratios have explosive variance.

CISPO

Attacks the same token-dropping problem differently — it drops PPO’s clip on the objective and instead clips the importance-sampling weight, wrapped in a stop-gradient. Detaching the clipped weight means it only rescales each token’s REINFORCE-style update and can never zero it out, so every token always contributes a gradient — including the rare high-ratio reasoning tokens that GRPO’s clip would discard. GSPO’s answer is “lift the ratio to the sequence level”; CISPO’s is “keep every token, but cap the weight so none is ever fully dropped.”


Three axes separate every algorithm above. Understand these and you understand why each design choice was made — and why your training run does or doesn’t work.

Knob 1 — bias vs. variance

Your gradient is computed from samples, so it’s a random variable with two properties. Bias: is it centered on the true gradient — would infinite samples recover it, or something systematically off? Variance: how much does it jitter batch-to-batch? They trade against each other, and they hurt differently: bias sends you confidently toward the wrong optimum; variance makes training noisy, forces tiny learning rates, and can collapse. You met all of these moves in Part 8 — the point now is to read the whole zoo as one progression along this single axis:

REINFORCE

Unbiased, catastrophic variance. Exactly correct on average; one noisy uncentered scalar makes it unusable raw.

Baseline / RLOO

Variance reduction for free. The one move that cuts variance without adding bias. RLOO’s leave-one-out is the careful version: maximal variance reduction subject to staying unbiased.

PPO value fn

Deliberate bias-for-variance. The state-value baseline is action-independent (so unbiased as a baseline); the bias enters through GAE’s bootstrapped advantage targets. GAE’s λ\lambda is literally the bias↔variance dial. The opposite bet from RLOO.

GRPO std-norm

A small bias for stability. Dividing by group std stabilizes magnitudes but biases toward low-variance prompts — the live knob inside GRPO.

Clipping

Bias for step-safety. Truncating the ratio is a biased operation accepted to stop any step blowing up.

A recurring empirical finding (not a theorem): in several reported LLM RLHF/RLVR settings, the unbiased-sampled-baseline branch (RLOO → GRPO → GSPO) matches or beats the learned-critic branch (PPO), because the critic’s bias and engineering cost can outweigh its variance reduction in that regime. How far it generalizes is still open.

Knob 2 — token-level vs. sequence-level

This phrase means three independent things; conflating them is the #1 source of confusion.

  • The reward. Where does the scalar land? Standard setup: sequence-level (terminal) — the verifier/RM scores the whole completion. There’s no natural per-token reward. (Process supervision, scoring each reasoning step, is the denser exception.) This is the ground truth: the reward genuinely lives at the sequence level.
  • The advantage / credit assignment. Given a sequence-level reward, how is credit spread over tokens? Sequence-level (REINFORCE/RLOO/GRPO): every token gets the same advantage, no value function. Token-level (PPO+GAE): each token gets its own, via the critic — finer, but you’re manufacturing per-token signal that wasn’t really there.
  • The importance ratio + clipping. Where is the off-policy correction computed? Token-level (PPO, GRPO) vs. sequence-level (GSPO).

// importance sampling, in one breath

The policy gradient is an expectation over trajectories from the current policy. Reusing samples from a slightly-older policy makes them off-policy; importance sampling reweights each by r(θ)=πθ()/πθold()r(\theta) = \pi_\theta(\cdot)/\pi_{\theta_{\text{old}}}(\cdot) to correct the mismatch. It only behaves when old and new are close — diverge, and the ratio explodes into huge variance, which is why you clip it. The token-vs-sequence question for the ratio is separate from the advantage question, and GSPO’s whole pitch is that GRPO’s remaining token-level ratio is the leak.

Knob 3 — the KL penalty: in the reward vs. as a regularizer

The KL penalty’s purpose is identical either way: keep πθ\pi_\theta from drifting too far from a reference πref\pi_{\text{ref}} (usually the SFT model), so it doesn’t hack the reward or destroy general capability. What differs is where you attach the leash.

Placement A — in the reward (RLOO)Placement B — separate loss term (GRPO)
Ri=RiβKL(πθπref)R'_i = R_i - \beta\,\text{KL}(\pi_\theta \Vert \pi_{\text{ref}}) — drift is treated as part of the scored outcome, flows through the baseline and gets the same credit-assignment treatment as task reward.L=E[A^ilogπθ]+βKL(πθπref)\mathcal{L} = -\mathbb{E}[\hat A_i \log\pi_\theta] + \beta\,\text{KL}(\pi_\theta \Vert \pi_{\text{ref}}) — advantage is pure task reward; KL is bolted on like weight decay, penalizing the distribution directly.

One more frontier note: when the reward is a clean verifier rather than an exploitable learned RM, the original justification for the leash weakens, and several recent RLVR recipes (DAPO among them) drop the KL penalty entirely, arguing it just holds the model back from the large policy moves that reasoning gains require.

Everything in this post is a single equation with two levers:

θJ=E[tΨtθlogπθ(atst)]\nabla_\theta J = \mathbb{E}\Big[\textstyle\sum_t \Psi_t\,\nabla_\theta \log\pi_\theta(a_t\mid s_t)\Big]

The direction factor θlogπθ\nabla_\theta\log\pi_\theta (“which action”) never changes. Everything is (a) what you put in Ψt\Psi_t to lower variance, and (b) how you control the step.

Lever 1 · Ψ

REINFORCE (none) → RLOO (unbiased leave-one-out group mean, no critic) → PPO (learned value fn + per-token GAE) → GRPO (group mean/std, critic-free like RLOO but per-token like PPO).

Lever 2 · step

REINFORCE/RLOO (none) → PPO (clipped importance ratio) → GRPO (inherits the clip) → DAPO (fix length-averaging), GSPO (ratio + clip at sequence level), CISPO (clip the weight, not the objective).

And the whole family as one matrix — every row is the same policy-gradient skeleton with different entries in a few columns:

MethodBaselineAdvantage unitRatio + clip unitLength handlingWhat it fixes / its failure mode
REINFORCEnonesequencenone (on-policy)unbiased, but catastrophic variance
RLOOleave-one-out group mean (critic-free)sequencenone (single step)token-meancuts variance, stays unbiased
PPOlearned value Vφtoken (GAE, bootstrapped → biased)tokentoken-levelenables sample reuse; cost is the critic
GRPOgroup mean / std (critic-free)sequence (shared over tokens)tokenper-response then per-group → length biasdrops the critic; std + length biases remain
DAPOgroupsequencetoken, “clip-higher” + dynamic samplingtoken-level over all responsesfixes GRPO’s length-averaging; often drops KL
GSPOgroupsequencesequencesequence-leveltames token-ratio variance (esp. MoE)
CISPOgroupsequencetoken, clips the IS-weight (stop-grad)token-levelkeeps every token’s gradient

A key for the Length column, since the terms haven’t all appeared yet: it records how the per-token loss terms are averaged, which is exactly where DAPO showed length bias hides. Token-mean / token-level = all tokens pooled into one flat average (each token weighs the same, regardless of which completion it came from). Per-response then per-group = GRPO’s nested averaging — average within each completion first, then across completions — which quietly down-weights every token of a long completion. Sequence-level = one loss term per completion, so length never enters the averaging at all.

The unit of optimization should match the unit of reward.

That one sentence — GSPO’s — is the throughline the modern papers keep rediscovering, and the most useful thing to keep in your head. The reward is sequence-level, which is exactly why — in the reasoning/RLVR results reported so far — the sequence-level branch (RLOO → GRPO → GSPO) keeps winning on simplicity and robustness over dragging a token-level value function (PPO) or a token-level importance correction (vanilla GRPO) into a setting where the reward never actually lived at the token level. Start from the unbiased-but-explosive REINFORCE estimator; every method after it is either reducing variance while preserving unbiasedness, or trading a little bias for a large stability win. Once you can place any new algorithm on those two levers, you understand it.


References

thanks for reading —j.