RL for Language Models, From First Principles

This post introduces the ideas behind post-training LLMs. I’ve always learned best by tracing a concept’s lineage — building intuition from where an idea started toward where it is now. Here I do that for post-training: starting from probability basics, deriving the policy gradient, and building up to the modern RL algorithms. Largely inspired by my reading of rlhfbook.com.

Part One

The probability you actually need

Almost everything in reinforcement learning is built out of five small ideas from probability. Not a semester’s worth — five. If you’re solid on these, the scary equations later turn out to be just these ideas wearing formal clothes. So we start here, from absolute zero, and we don’t move on until they feel obvious.

// random variable

Most numbers have one fixed value: 5 is 5. A random variable is a quantity whose value isn’t decided until some uncertain process resolves. A die roll $X$ is a random variable: before you roll, it doesn’t have a value — it has a set of possible values $\{1,2,3,4,5,6\}$ , each with some chance. We write $X=3$ to mean “the process came out 3.” Convention: capital $X$ for the uncertain thing itself, lowercase $x$ for a specific value it might take.

// probability distribution

This is the full description of a random variable: for every possible value, how likely it is. For a fair die, “each of 1–6 has probability $1/6$ .” Two rules every distribution obeys: each probability sits between 0 and 1, and they all sum to 1 (something must happen). We write $p(x)$ for “the probability that $X$ takes value $x$ .”

// conditional probability

Often the likelihood of one thing depends on another. We write $p(y \mid x)$ , read “probability of $y$ given $x$ .” The bar means “assuming $x$ has happened, here’s the distribution over $y$ .” For example $p(\text{umbrella} \mid \text{raining})$ is high; $p(\text{umbrella} \mid \text{sunny})$ is low — same outcome, different conditioning, different probability.

// expectation — the probability-weighted average

Suppose a game pays you the value of a die roll in rupees. Any single play is random. What do you get on average over many plays? Weight each outcome by its probability and add:

\mathbb{E}[X] = \sum_x p(x)\cdot x = \tfrac{1}{6}(1)+\tfrac{1}{6}(2)+\cdots+\tfrac{1}{6}(6) = 3.5

The symbol $\mathbb{E}[\cdot]$ means “expected value of.” Notice 3.5 isn’t even a possible roll — expectation is the long-run average, the center of mass of the distribution, not a value you’ll necessarily ever see. More generally, for any function $f$ of the random variable:

\mathbb{E}[f(X)] = \sum_x p(x)\cdot f(x)

“Average the value of $f$ , weighting each input by how likely it is.” When $X$ is continuous, the sum becomes an integral $\int p(x)\,f(x)\,dx$ — same meaning, a weighted average over all possibilities. A subscript tells you which distribution you’re averaging over: $\mathbb{E}_{X\sim p}[f(X)]$ means ” $X$ is drawn from $p$ , average $f(X)$ over that.” The " $\sim$ " reads “is distributed as.” Keep an eye on that subscript — in RL, the distribution you average over is the very thing you’re trying to change.

// sampling & Monte Carlo estimation

“Sampling from a distribution” means drawing one concrete outcome according to its probabilities — rolling once and getting a 4. Here’s the practical punchline: if you can’t compute an expectation by hand (too many outcomes to sum), you can estimate it by sampling many times and averaging:

\mathbb{E}[f(X)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i), \qquad x_i \text{ sampled from } p

Roll the die 10,000 times, average — you’ll get close to 3.5. This is Monte Carlo estimation, and it is how essentially all of practical RL turns otherwise-impossible expectations into things a computer can actually compute. The more samples $N$ , the closer the estimate. Hold onto this; it’s the workhorse of the entire field.

Part Two

The RL setup, and the box we put it in

Reinforcement learning is about an agent (a decision-maker — a robot, a game player, a language model) interacting with an environment (everything outside it). The interaction is a loop over discrete timesteps $t = 0, 1, 2, \dots$

The agent observes the current state $s_t$ — a description of the situation right now.
It picks an action $a_t$ .
The environment responds with a reward $r_t$ (a number scoring that action) and a new state $s_{t+1}$ .
Repeat.

The goal is to pick actions that lead to lots of reward over time — not just immediate reward, but accumulated reward. That’s the whole game.

The MDP: bundling this into math

A Markov Decision Process (MDP) is the formal package describing such a problem. It’s written as a tuple of five things bundled together with a name, the way you’d define a struct:

\text{MDP}\;(\,\mathcal{S},\;\mathcal{A},\;P,\;r,\;\gamma\,)

Symbol	Name	What it is
$\mathcal{S}$	State space	The set of all possible states. In chess, every board configuration. The calligraphic letter denotes the whole set; $s_t \in \mathcal{S}$ is one particular state.
$\mathcal{A}$	Action space	The set of all possible actions. {up, down, left, right} for a grid-walker; a continuous range of angles for a steering wheel.
$P(s_{t+1}\mid s_t,a_t)$	Transition dynamics	A conditional distribution: given you’re in $s_t$ and take $a_t$ , the probabilities over each possible next state. A distribution, not one outcome, because environments can be stochastic.
$r(s_t,a_t)$	Reward function	A scalar scoring action $a_t$ in state $s_t$ . Higher is better. This is how you tell the agent what you want.
$\gamma$	Discount factor	A number in $[0,1]$ controlling how much future reward is worth relative to immediate reward. We’ll see exactly how it enters next.

Part Three

The objective: what we are actually maximizing

Run the agent from start to finish and you get a trajectory (also: rollout, episode) — the whole logged sequence:

\tau = (s_0,a_0,s_1,a_1,\dots,s_{T-1},a_{T-1},s_T)

( $\tau$ is the Greek letter tau; the episode ends in the terminal state $s_T$ , so actions run $t=0$ to $T-1$ .) One trajectory is one complete playthrough. We score it by its total reward, but with a twist — we weight each reward by $\gamma$ raised to its timestep. This is the return:

R(\tau) = \sum_{t=0}^{T-1}\gamma^t\,r(s_t,a_t) = r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3 + \cdots

Reading the sum: $\sum_{t=0}^{T-1}$ means “add up the following for $t=0$ , then $1$ , … up to $T-1$ .” Since $\gamma<1$ , higher powers are smaller ( $0.9^0=1,\ 0.9^1=0.9,\ 0.9^2=0.81,\dots$ ), so rewards far in the future count for less.

// why discount at all?

Modeling preference: a reward now is generally better than the same reward later (like money today vs. next year), and $\gamma$ encodes that.
Mathematical safety: if the loop runs forever, the undiscounted sum could blow up to infinity. With $\gamma<1$ and bounded rewards, a geometric series converges to something finite.
A dial on far-sightedness: the effective horizon is roughly $\frac{1}{1-\gamma}$ steps. $\gamma=0.9$ ≈ 10-step lookahead; $\gamma=0.99$ ≈ 100.

The policy: how the agent decides

The agent needs a rule for choosing actions given states. That rule is the policy, written $\pi$ (pi):

\pi_\theta(a_t \mid s_t)

Read it carefully. It’s a conditional distribution: given the current state $s_t$ , it gives a probability for each possible action. So the policy doesn’t output one fixed action — it outputs a distribution, and the agent samples from it. The subscript $\theta$ (theta) is the crucial part: $\theta$ stands for the parameters — concretely, the weights of a neural network. Different $\theta$ → different behavior. Learning = searching for good values of $\theta$ .

A trajectory now gets generated by alternating two random draws:

the policy picks an action: $a_t \sim \pi_\theta(\cdot \mid s_t)$ ,
the environment picks the next state: $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ .

Because two random processes are involved, the trajectory $\tau$ is itself a random variable — run the same policy twice, get two different trajectories. We write $\tau \sim \pi$ for “the trajectory produced by running policy $\pi$ .”

Putting it all together: $J(\pi)$

Here is the objective — the thing we maximize:

J(\pi) = \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t=0}^{T-1}\gamma^t\,r(s_t,a_t)\right]

Read it right to left, using everything above. The inner bracket is the return $R(\tau)$ — the discounted reward of one trajectory. The $\mathbb{E}_{\tau\sim\pi}[\cdots]$ is the expectation from Part 1 — the probability-weighted average — taken over trajectories the policy produces. So in one English sentence:

$J(\pi)$ is the expected total discounted reward when you let policy $\pi$ run in the environment.

Written out explicitly as a weighted average (the $\int\!\cdots d\tau$ is just shorthand; for an LM, whose trajectories are discrete token sequences, read it as a plain sum over all possible sequences):

\begin{aligned} J(\pi) &= \int p_\pi(\tau)\,R(\tau)\,d\tau, \\ p_\pi(\tau) &= \rho(s_0)\,\underbrace{\prod_{t=0}^{T-1}\pi_\theta(a_t\mid s_t)}_{\text{agent}}\;\underbrace{\prod_{t=0}^{T-1}P(s_{t+1}\mid s_t,a_t)}_{\text{environment}} \end{aligned}

where $\rho$ (rho) is the distribution of starting states, the agent product runs over all $T$ chosen actions $a_0,\dots,a_{T-1}$ , and the environment product over the $T$ transitions $s_0\to s_1\to\cdots\to s_T$ , ending in the terminal state $s_T$ . This sum can’t be computed directly — the space of sequences is astronomically large — so we estimate it by Monte Carlo: sample $N$ trajectories by running the policy, average their returns. And the entire goal of RL, in one line:

\begin{aligned} \theta^* &= \arg\max_\theta\, J(\pi_\theta), \\ \theta &\leftarrow \theta + \alpha\,\nabla_\theta J(\pi_\theta) \quad (\text{gradient ascent}) \end{aligned}

" $\arg\max_\theta$ " means “the $\theta$ that makes $J$ largest.” We climb $J$ by repeatedly nudging $\theta$ in the direction that increases it. The one hard part — and the reason RL is its own field — is computing that gradient $\nabla_\theta J$ , because $\theta$ sits inside the sampling distribution. That’s Parts 5 and 6.

Part Four

Language models are MDPs

Here’s the bridge that makes all of this apply to LLMs. Take the abstract MDP and fill in every slot for a language model generating text. Notation first: $x$ is the prompt, $y$ is the completion, individual generated tokens are $y_1, y_2, \dots$ , and $y_{<t}$ means “all completion tokens before position $t$ .”

MDP concept	Language model
State $s_t$	Prompt + tokens so far: $(x, y_{<t})$ — literally the transformer’s context window at step $t$ . It grows as generation proceeds.
Action $a_t$	The next token $y_t$ . The action space is the vocabulary — typically 32K–256K options at every step.
Transition	Deterministic: append the chosen token to the sequence. No randomness, nothing to learn.
Policy $\pi_\theta(a_t\mid s_t)$	The LM’s next-token distribution. The model already is the policy — the softmax over the vocabulary is exactly $\pi_\theta$ .
Episode	One prompt → one completion. Feed a prompt, generate until end-of-sequence or a length cap.
Terminal reward	A reward-model score (RLHF) or a verifier’s output (RLVR — did the math check out, did the code pass tests). Arrives only at the end.
$\gamma$	Typically $1.0$ — no discounting. Careful with the why, though: it’s not that $\gamma$ would do nothing here. With a terminal-only reward, $\gamma<1$ would multiply the score by $\gamma^{T-1}$ — an implicit brevity penalty that grows with completion length. Setting $\gamma=1$ is the deliberate choice not to smuggle a length preference into the objective; when you want one, you put it in the reward explicitly.

// consequence 1 — the trajectory probability is pure policy

Every $P(\cdot)$ factor in $p_\pi(\tau)$ equals 1, so the whole product collapses to just the policy terms:

p_\pi(y\mid x) = \prod_{t=1}^{T}\pi_\theta(y_t\mid x, y_{<t})

This is literally the ordinary autoregressive likelihood a language model already computes during pretraining. The RL “trajectory distribution” and the LM’s “sequence likelihood” are the same object — which is the deep reason RL slots onto LMs so naturally.

// consequence 2 — all randomness lives in one place

In a robot, randomness comes from two taps: the policy’s action noise and the environment’s transition noise. In an LM there’s exactly one tap: the sampling of which token to emit. Decode greedily (always argmax) and the whole trajectory is deterministic. This is why sampling temperature is the entire exploration mechanism in LLM-RL. And a concrete reason to care, planted early: the group-based methods of Part 8 score a group of sampled completions and divide by the group’s spread, so the group needs temperature $>0$ to have any spread at all. If every completion comes out identical, their rewards are identical too: the numerator $R_i - \text{mean}$ goes to $0$ , the standard deviation in the denominator goes to $0$ , and every advantage is $\tfrac{0}{0}$ — no spread, no signal, the gradient vanishes. (That expression is GRPO’s advantage formula; it arrives properly in Part 8.)

Part Five

How a policy gradient works

Here is the heart of the whole thing, and it really does decompose into exactly two questions. The principle in one line: make actions more likely when they lead to better outcomes. The update has this shape:

\Delta\theta \;\propto\; \Psi_t\,\nabla_\theta \log \pi_\theta(a_t\mid s_t)

$\Delta\theta$ is the change you make to the parameters this step. The " $\propto$ " means “proportional to” — there’s a learning-rate constant out front we’re suppressing to focus on direction and sign. The right side is a product of two things: a scalar $\Psi_t$ (capital psi) and a vector $\nabla_\theta \log \pi_\theta(a_t\mid s_t)$ . Read it as two questions answered at once.

Factor 1 — which action? $\nabla_\theta \log \pi_\theta(a_t\mid s_t)$

Start inside. $\pi_\theta(a_t\mid s_t)$ is the probability the policy assigned to the action it actually took. Take its log (products of small probabilities underflow; logs turn products into sums — and, as we’ll see, the log is what makes the whole trick work). Now the gradient $\nabla_\theta$ of that log-prob.

What is a gradient here? $\theta$ is a giant vector — every weight in the network. $\log\pi_\theta(a_t\mid s_t)$ is a single number depending on all of them. The gradient is a vector the same size as $\theta$ , where each entry says: “if I nudge this weight up a touch, how much does the log-prob of this action change?” It is the direction in weight-space that most rapidly increases the probability of taking $a_t$ in $s_t$ .

That’s the whole meaning of Factor 1: it points the parameters toward making “do $a_t$ here” more likely next time. It carries no information about whether that’s a good idea — it’s pure “here’s how to do more of this.” Which is why you need Factor 2.

Factor 2 — how good was it? $\Psi_t$

$\Psi_t$ is a scalar scoring the outcome. Its sign is the entire learning mechanism, because a scalar times a vector either keeps the vector’s direction or flips it:

$\Psi_t > 0$ (good outcome): $\Delta\theta$ points along Factor 1 → make $a_t$ more likely. Reinforce it.
$\Psi_t < 0$ (bad outcome): the negative scalar flips the vector → make $a_t$ less likely. Suppress it.
Magnitude $|\Psi_t|$ : bigger means a stronger push. Great outcome → shoved up hard; mildly good → gentle nudge.

Move the parameters to make good actions more likely and bad actions less likely, in proportion to how good or bad they were.

Part Six

Where that equation comes from

You can take Part 5 on faith, but the derivation — one identity plus bookkeeping — makes the log feel inevitable rather than magic, and it explains the one subtlety that makes RL hard. We want to ascend $J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]$ , so we need its gradient $\nabla_\theta J$ .

// the obstacle

Write the objective as an integral. Since $R(\tau)$ is just a number for a fixed trajectory (only its probability depends on $\theta$ ), the gradient lands only on $p_\theta$ :

\begin{aligned} J(\theta) &= \int_\tau p_\theta(\tau)\,R(\tau)\,d\tau \\ \Longrightarrow\quad \nabla_\theta J(\theta) &= \int_\tau \nabla_\theta\, p_\theta(\tau)\,R(\tau)\,d\tau \end{aligned}

This is correct — but not computable in this form, and here’s exactly why. The only tool we have for an integral this large is Monte Carlo: rewrite it as an expectation, then sample. But Monte Carlo only works on integrals shaped like an expectation — a genuine probability distribution multiplying some function:

\mathbb{E}_{\tau\sim p}[f(\tau)] = \int_\tau \underbrace{p(\tau)}_{\text{a real distribution}}\;f(\tau)\,d\tau

The sampling is the weighting: draw from $p$ , and frequent samples land where $p$ is large, exactly reproducing the " $p(\tau)\times$ " factor. Now compare the two integrals:

$J$ itself has $p_\theta(\tau)$ in the multiplying slot — a real distribution. Sampleable. Run the policy $N$ times, average returns. ✓
$\nabla_\theta J$ has $\nabla_\theta p_\theta(\tau)$ in that slot — and that is not a probability distribution. ✗

Why $\nabla_\theta p_\theta(\tau)$ is not a distribution — and why that’s fatal It’s signed: increasing $\theta$ raises some trajectories’ probabilities and lowers others, so the gradient goes negative — but probabilities can’t be negative. And it integrates to zero, not one: since $\int p_\theta(\tau)\,d\tau = 1$ for every $\theta$ , its derivative is $\nabla_\theta(1) = 0$ (the positive bumps must cancel the negative dips, because total probability is conserved). You cannot sample from an object that goes negative and sums to zero — there’s no sampler you can build for it. So: the distribution we can sample ( $p_\theta$ ) isn’t in the gradient integral, and the object that is in it ( $\nabla_\theta p_\theta$ ) can’t be sampled. Dead end.

// the fix — the log-derivative trick

One algebraic identity rescues everything. It’s just the chain rule on $\log$ , run backwards ( $\nabla \log p = \nabla p / p$ ):

\nabla_\theta\, p_\theta(\tau) = p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)

Substitute it into the broken integral and watch a real distribution reappear in the multiplying slot:

\begin{aligned} \nabla_\theta J &= \int_\tau \nabla_\theta p_\theta(\tau)\,R(\tau)\,d\tau \\ &= \int_\tau \underbrace{p_\theta(\tau)}_{\text{distribution}}\,\nabla_\theta \log p_\theta(\tau)\,R(\tau)\,d\tau \\ &= \mathbb{E}_{\tau\sim\pi_\theta}\!\big[\nabla_\theta \log p_\theta(\tau)\,R(\tau)\big] \end{aligned}

The trick manufactured a $p_\theta(\tau)$ out front and pushed the gradient onto $\log p_\theta(\tau)$ , where it becomes a well-behaved, finite, signed vector — perfectly fine to evaluate (it’s allowed to be negative; only the distribution part needs to be a real distribution, and now it is). The integral is an expectation again. Sampleable. That is precisely why the log is forced on you.

// the final form

Recall $\log p_\theta(\tau) = \sum_t \log\pi_\theta(a_t\mid s_t) + (\text{transition terms})$ . Those transition terms don’t depend on $\theta$ (for an LM they’re literally zero; in general they just vanish under $\nabla_\theta$ ). So:

\nabla_\theta J = \mathbb{E}_{\tau\sim\pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t\mid s_t)\;R(\tau)\right]

Part Seven

REINFORCE, and the trick that saves it

REINFORCE is just the bare skeleton with the simplest possible $\Psi_t = R(\tau)$ . For an LM — terminal reward, $\gamma=1$ — every token in a completion shares one scalar, the completion’s reward $R(x,y)$ :

\nabla_\theta J = \mathbb{E}_{y\sim\pi_\theta}\!\left[ R(x,y)\,\nabla_\theta \sum_{t} \log\pi_\theta(y_t\mid x, y_{<t})\right]

Generate a completion, score it, push up the log-probability of every token in it by an amount proportional to that score. It’s exactly correct on average. So what’s wrong with it?

The baseline — variance reduction for free

This is the single most important idea in the entire lineage. Subtract a baseline $b$ from the reward before using it as $\Psi_t$ :

\Psi_t = R(x,y) - b

The remarkable fact: as long as $b$ doesn’t depend on the action (it may depend on the state $s$ ), this does not change the gradient’s expectation — the estimator stays exactly unbiased — yet it can dramatically reduce the variance. The one-line reason: $b(s)$ pulls out of the action-average, and what’s left differentiates the constant $1$ :

\begin{aligned} \mathbb{E}_{a\sim\pi_\theta(\cdot\mid s)}\!\big[b(s)\,\nabla_\theta\log\pi_\theta(a\mid s)\big] &= b(s)\sum_a \pi_\theta(a\mid s)\,\frac{\nabla_\theta\pi_\theta(a\mid s)}{\pi_\theta(a\mid s)} \\ &= b(s)\,\nabla_\theta\!\sum_a \pi_\theta(a\mid s) \\ &= b(s)\,\nabla_\theta 1 = 0 \end{aligned}

So you may subtract something for free, and the smart choice is the average reward, which re-centers the signal:

\Psi_t = R(x,y) - \underbrace{\bar R}_{\text{average reward}}

Now better-than-average → $\Psi_t > 0$ (reinforce); worse-than-average → $\Psi_t < 0$ (suppress). Even if all rewards are $+5$ to $+10$ , after subtracting the mean (~7.5) you get properly signed signals from $-2.5$ to $+2.5$ . This re-centered quantity $R - b$ is exactly what we call the advantage.

Every algorithm that follows is a different answer to one question: what baseline do I subtract, and how do I compute it?

Part Eight

The algorithm zoo

From here the post shifts from foundations to the frontier. The derivation is behind us; Parts 8–9 take the practitioner’s view — the bets people actually make and why their runs break — so the pace picks up a notch.

The whole family tree is “same gradient, progressively smarter $\Psi_t$ , plus eventually some step-size control.” Two levers are turning throughout: the baseline inside $\Psi_t$ (a variance story) and step-size control (a stability story). Let’s walk it in the order the ideas actually depend on each other.

RLOO — a clever, unbiased, critic-free baseline

The naive baseline is “average reward over the batch.” RLOO (REINFORCE Leave-One-Out) does something sharper. For each prompt, generate $K$ completions ( $K$ plays the same role as the group size $G$ in GRPO below); the baseline for completion $i$ is the mean of all the others, excluding itself:

\hat A_i = R_i - \frac{1}{K-1}\sum_{j\neq i} R_j

Empirically it punches above its weight: Ahmadian et al. (2024, “Back to Basics”) found vanilla REINFORCE — and RLOO as its multi-sample extension — matching or beating PPO in their RLHF experiments, arguing PPO’s machinery is often unnecessary there. But mind the setting: those results are on short-form preference tasks (summarization, dialogue — completions of a few hundred tokens), where treating the whole completion as a single “bandit” action — one decision, scored once, no sequential credit — is reasonable. Whether one shared, sequence-level advantage suffices over long reasoning chains is actively contested — VinePPO (ICML 2025) and a line of credit-assignment work argue it’s too coarse there and report gains from finer per-step credit on math benchmarks. Well-replicated for short-form RLHF; genuinely disputed for long-horizon reasoning — not a universal law in either direction.

PPO — the other branch: a learned baseline + step control

PPO comes from a different instinct: instead of estimating the baseline from sibling samples, learn a value function $V_\phi(s_t)$ — a separate network predicting expected reward from a state — and use it to build a per-token advantage via GAE. This is the token-level view: every token gets its own advantage. Two costs. First, the value network itself (≈ policy-sized, noisy, finicky to train). Second, and subtler: a baseline that depends only on the state is action-independent, so on its own it stays unbiased (the same free-lunch identity as Part 7) — the bias actually enters through GAE’s bootstrapped advantage targets, which estimate returns partly from other estimates. GAE’s $\lambda$ is the dial that trades that bootstrap bias against variance.

PPO also adds the second, orthogonal ingredient — step-size control — that GRPO will inherit. The problem: you’d like to reuse expensive samples for several gradient steps, but after the first step the policy has moved, so the samples are “off-policy.” The fix is importance sampling: reweight each sample by the ratio of how likely it is under the new vs. old policy (written $r_t(\theta)$ by convention — an unfortunate collision with the reward symbol; this $r$ is a ratio, not a reward),

r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}

and then clip it into $[1-\epsilon, 1+\epsilon]$ so no single sample with a runaway ratio can blow up the gradient. That clipped-ratio objective is PPO’s “surrogate objective” — a safety rail on the step.

GRPO — RLOO’s baseline idea wearing PPO’s clip

The synthesis, and the one you’ve probably actually run. GRPO (introduced with DeepSeekMath) takes the critic-free, group-based baseline (the RLOO branch) and bolts on PPO’s clipped surrogate (the stability branch). For each prompt, sample a group of $G$ completions and normalize:

\hat A_i = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G)}

Above-average completions get positive advantage, below-average negative — and the std-normalization rescales it. The group is the baseline; no value function needed. Then GRPO keeps PPO’s per-token importance ratio and clipping essentially unchanged. In the lineage: GRPO is a simplified PPO that groups completions per prompt and normalizes rewards within the group, removing the need for a value function.

Flag that last phrase now, because it’s about to do a lot of work: “token-level vs. sequence-level” is really three separate questions — where the reward lands, how credit is spread across tokens, and where the importance ratio is clipped — and GRPO already answers them differently (sequence-level reward, sequence-level advantage, token-level ratio). Knob 2 in Part 9 pulls the three apart; for now just don’t collapse them into one.

A worked example: three advantages, one group

Make it concrete. One prompt, a group of four sampled completions, binary reward — exactly one is correct:

R = [\,0,\ 0,\ 1,\ 0\,]

Here is the advantage $\hat A_i$ each recipe assigns to those four completions (group mean $=0.25$ ; sample std $=0.5$ ):

Completion	reward $R_i$	RLOO ( $R_i-\text{mean of others}$ )	GRPO ( $\tfrac{R_i-\text{mean}}{\text{std}}$ )	GRPO, no std (Dr. GRPO)
1	0	$-\tfrac{1}{3}$	$-0.5$	$-0.25$
2	0	$-\tfrac{1}{3}$	$-0.5$	$-0.25$
3	1	$+1$	$+1.5$	$+0.75$
4	0	$-\tfrac{1}{3}$	$-0.5$	$-0.25$

All three agree on the sign — push the one correct completion up, the three wrong ones down — and since each is mean-centered, every column sums to zero. They differ only in scale and split. Dividing by the std (middle column) inflates magnitudes, and that is exactly the lever that bites: a group whose completions barely differ has a tiny std, so dividing by it blows the advantages up, while a high-variance group gets damped. Dropping the std (right column) removes that distortion — but also stops amplifying the rare all-but-one-wrong groups that are often your hardest, most informative prompts. Same skeleton, different bet.

// the same idea, in my own training loop

After all the machinery, GRPO’s $\Psi_t$ is four lines of PyTorch. This is from qwen3-nano-math-reasoner, where I post-trained Qwen3-0.6B on MATH-500 with GRPO — lightly annotated, otherwise as it runs:

# src/qwen_grpo.py — the heart of one GRPO update (G rollouts, one prompt)
rewards = torch.tensor(roll_rewards, device=device)   # G verifier scores
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
adv = advantages.detach()
# Dr. GRPO is a one-line edit: delete the "/ (rewards.std() + 1e-4)".

# all rewards identical -> advantages all ~0 -> no gradient in this group
is_zero_adv = torch.allclose(
    advantages, torch.zeros_like(advantages), atol=1e-8, rtol=0.0
)
if skip_zero_adv and is_zero_adv:
    return {"loss": 0.0}  # nothing to learn here; skip the update

Two details the equations glossed over, sitting right there in the code: the 1e-4 in the denominator is the practical answer to Part 4’s zero-spread collapse (without it, a group of identical rewards divides $0$ by $0$ ), and the explicit zero-advantage skip is the honest admission that such a group teaches nothing. Hold that thought — the next section’s dynamic sampling is this same skip, turned into a sampling strategy.

The frontier — DAPO, GSPO, CISPO

GRPO has residual leaks; each recent algorithm patches one — still the same skeleton.

DAPO

Fixes a length bias in how the loss is averaged. GRPO averages token-contributions within each response then across responses, which favors shorter completions. DAPO switches to a token-level aggregation over all responses, so length no longer dilutes the signal — better for multi-step reasoning. (Also adds clip-higher — a looser upper clip bound, so high-advantage exploratory tokens aren’t over-suppressed — and dynamic sampling, which discards all-right or all-wrong groups since their advantages are zero and they contribute no gradient.)

GSPO

Fixes the granularity of the importance ratio. Per-token clipping zeros out the gradient for any token whose ratio strays outside the band — exactly the rare, important reasoning tokens you most want the model to learn. GSPO computes and clips the ratio at the sequence level. Its principle: the unit of optimization should match the unit of reward. Especially important for MoE, where per-token ratios have explosive variance.

CISPO

Attacks the same token-dropping problem differently — it drops PPO’s clip on the objective and instead clips the importance-sampling weight, wrapped in a stop-gradient. Detaching the clipped weight means it only rescales each token’s REINFORCE-style update and can never zero it out, so every token always contributes a gradient — including the rare high-ratio reasoning tokens that GRPO’s clip would discard. GSPO’s answer is “lift the ratio to the sequence level”; CISPO’s is “keep every token, but cap the weight so none is ever fully dropped.”

Part Nine

The knobs that actually decide everything

Three axes separate every algorithm above. Understand these and you understand why each design choice was made — and why your training run does or doesn’t work.

Knob 1 — bias vs. variance

Your gradient is computed from samples, so it’s a random variable with two properties. Bias: is it centered on the true gradient — would infinite samples recover it, or something systematically off? Variance: how much does it jitter batch-to-batch? They trade against each other, and they hurt differently: bias sends you confidently toward the wrong optimum; variance makes training noisy, forces tiny learning rates, and can collapse. You met all of these moves in Part 8 — the point now is to read the whole zoo as one progression along this single axis:

REINFORCE

Unbiased, catastrophic variance. Exactly correct on average; one noisy uncentered scalar makes it unusable raw.

Baseline / RLOO

Variance reduction for free. The one move that cuts variance without adding bias. RLOO’s leave-one-out is the careful version: maximal variance reduction subject to staying unbiased.

PPO value fn

Deliberate bias-for-variance. The state-value baseline is action-independent (so unbiased as a baseline); the bias enters through GAE’s bootstrapped advantage targets. GAE’s $\lambda$ is literally the bias↔variance dial. The opposite bet from RLOO.

GRPO std-norm

A small bias for stability. Dividing by group std stabilizes magnitudes but biases toward low-variance prompts — the live knob inside GRPO.

Clipping

Bias for step-safety. Truncating the ratio is a biased operation accepted to stop any step blowing up.

A recurring empirical finding (not a theorem): in several reported LLM RLHF/RLVR settings, the unbiased-sampled-baseline branch (RLOO → GRPO → GSPO) matches or beats the learned-critic branch (PPO), because the critic’s bias and engineering cost can outweigh its variance reduction in that regime. How far it generalizes is still open.

Knob 2 — token-level vs. sequence-level

This phrase means three independent things; conflating them is the #1 source of confusion.

The reward. Where does the scalar land? Standard setup: sequence-level (terminal) — the verifier/RM scores the whole completion. There’s no natural per-token reward. (Process supervision, scoring each reasoning step, is the denser exception.) This is the ground truth: the reward genuinely lives at the sequence level.
The advantage / credit assignment. Given a sequence-level reward, how is credit spread over tokens? Sequence-level (REINFORCE/RLOO/GRPO): every token gets the same advantage, no value function. Token-level (PPO+GAE): each token gets its own, via the critic — finer, but you’re manufacturing per-token signal that wasn’t really there.
The importance ratio + clipping. Where is the off-policy correction computed? Token-level (PPO, GRPO) vs. sequence-level (GSPO).

// importance sampling, in one breath

The policy gradient is an expectation over trajectories from the current policy. Reusing samples from a slightly-older policy makes them off-policy; importance sampling reweights each by $r(\theta) = \pi_\theta(\cdot)/\pi_{\theta_{\text{old}}}(\cdot)$ to correct the mismatch. It only behaves when old and new are close — diverge, and the ratio explodes into huge variance, which is why you clip it. The token-vs-sequence question for the ratio is separate from the advantage question, and GSPO’s whole pitch is that GRPO’s remaining token-level ratio is the leak.

Knob 3 — the KL penalty: in the reward vs. as a regularizer

The KL penalty’s purpose is identical either way: keep $\pi_\theta$ from drifting too far from a reference $\pi_{\text{ref}}$ (usually the SFT model), so it doesn’t hack the reward or destroy general capability. What differs is where you attach the leash.

Placement A — in the reward (RLOO)	Placement B — separate loss term (GRPO)
$R'_i = R_i - \beta\,\text{KL}(\pi_\theta \Vert \pi_{\text{ref}})$ — drift is treated as part of the scored outcome, flows through the baseline and gets the same credit-assignment treatment as task reward.	$\mathcal{L} = -\mathbb{E}[\hat A_i \log\pi_\theta] + \beta\,\text{KL}(\pi_\theta \Vert \pi_{\text{ref}})$ — advantage is pure task reward; KL is bolted on like weight decay, penalizing the distribution directly.

Why this is where silent failures live The KL isn’t computed exactly — it’s estimated from samples, and the estimator choice matters (following Schulman’s note on approximating KL). For $\text{KL}(\pi_\theta \Vert \pi_{\text{ref}})$ with samples drawn from $\pi_\theta$ , the k1 estimator $\log\tfrac{\pi_\theta}{\pi_{\text{ref}}}$ is unbiased but high-variance and can go negative on individual samples. Depending on how the penalty is wired — its sign, and whether the term carries a gradient or is stop-gradiented — a run of negative samples can flip the leash into a drift bonus: the term that should pull the policy back instead rewards it for fleeing the reference. Catastrophic, and silent.

The k3 estimator $\tfrac{\pi_{\text{ref}}}{\pi_\theta} - 1 - \log\tfrac{\pi_{\text{ref}}}{\pi_\theta}$ is constructed to stay $\geq 0$ for every sample, which is why switching to it fixes the drift. This bug is more dangerous in Placement B, where the term acts directly on the loss without being laundered through a baseline. The same nominal $\beta$ also behaves differently across the two placements (it interacts with the std-normalization in B but scales with task reward in A) — so “the same $\beta$ ” doesn’t mean the same thing in RLOO vs. GRPO.

One more frontier note: when the reward is a clean verifier rather than an exploitable learned RM, the original justification for the leash weakens, and several recent RLVR recipes (DAPO among them) drop the KL penalty entirely, arguing it just holds the model back from the large policy moves that reasoning gains require.

The knobs aren’t independent — and that’s the real lesson Tuning one axis in isolation will mislead you, because they pull on each other:

Knob 2 feeds back on Knob 1. Lifting the importance ratio to the sequence level (GSPO) removes a major variance source — the same variance the group-std normalization (a Knob-1 bias) was bolted on to suppress. Fix the ratio granularity and you can often retire the std hack, not just stack it.
Knob 1 distorts Knob 3. When the KL rides in the reward (Placement A), the group-std normalization rescales it, so the same nominal $\beta$ is a different leash than it is as a separate loss term. You can’t port $\beta$ across placements — the Knob-1 choice silently changed what Knob 3 means.
Knob 3 depends on the reward type. A clean verifier weakens the case for the leash at all; dropping it (DAPO) only pays off because the sequence-level machinery is stable enough to survive the larger policy moves that follow.

Read together, the modern recipes aren’t independent dials — they’re coordinated bets across all three axes. That’s why “I copied GRPO’s $\beta$ into my RLOO run” breaks: you moved one knob and silently turned two others.

Part Ten

The one-page map to carry

Everything in this post is a single equation with two levers:

\nabla_\theta J = \mathbb{E}\Big[\textstyle\sum_t \Psi_t\,\nabla_\theta \log\pi_\theta(a_t\mid s_t)\Big]

The direction factor $\nabla_\theta\log\pi_\theta$ (“which action”) never changes. Everything is (a) what you put in $\Psi_t$ to lower variance, and (b) how you control the step.

Lever 1 · Ψ

REINFORCE (none) → RLOO (unbiased leave-one-out group mean, no critic) → PPO (learned value fn + per-token GAE) → GRPO (group mean/std, critic-free like RLOO but per-token like PPO).

Lever 2 · step

REINFORCE/RLOO (none) → PPO (clipped importance ratio) → GRPO (inherits the clip) → DAPO (fix length-averaging), GSPO (ratio + clip at sequence level), CISPO (clip the weight, not the objective).

And the whole family as one matrix — every row is the same policy-gradient skeleton with different entries in a few columns:

Method	Baseline	Advantage unit	Ratio + clip unit	Length handling	What it fixes / its failure mode
REINFORCE	none	sequence	none (on-policy)	—	unbiased, but catastrophic variance
RLOO	leave-one-out group mean (critic-free)	sequence	none (single step)	token-mean	cuts variance, stays unbiased
PPO	learned value V_φ	token (GAE, bootstrapped → biased)	token	token-level	enables sample reuse; cost is the critic
GRPO	group mean / std (critic-free)	sequence (shared over tokens)	token	per-response then per-group → length bias	drops the critic; std + length biases remain
DAPO	group	sequence	token, “clip-higher” + dynamic sampling	token-level over all responses	fixes GRPO’s length-averaging; often drops KL
GSPO	group	sequence	sequence	sequence-level	tames token-ratio variance (esp. MoE)
CISPO	group	sequence	token, clips the IS-weight (stop-grad)	token-level	keeps every token’s gradient

A key for the Length column, since the terms haven’t all appeared yet: it records how the per-token loss terms are averaged, which is exactly where DAPO showed length bias hides. Token-mean / token-level = all tokens pooled into one flat average (each token weighs the same, regardless of which completion it came from). Per-response then per-group = GRPO’s nested averaging — average within each completion first, then across completions — which quietly down-weights every token of a long completion. Sequence-level = one loss term per completion, so length never enters the averaging at all.

The unit of optimization should match the unit of reward.

That one sentence — GSPO’s — is the throughline the modern papers keep rediscovering, and the most useful thing to keep in your head. The reward is sequence-level, which is exactly why — in the reasoning/RLVR results reported so far — the sequence-level branch (RLOO → GRPO → GSPO) keeps winning on simplicity and robustness over dragging a token-level value function (PPO) or a token-level importance correction (vanilla GRPO) into a setting where the reward never actually lived at the token level. Start from the unbiased-but-explosive REINFORCE estimator; every method after it is either reducing variance while preserving unbiasedness, or trading a little bias for a large stability win. Once you can place any new algorithm on those two levers, you understand it.

References

Williams (1992) — Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning — REINFORCE.
Schulman et al. (2015) — High-Dimensional Continuous Control Using Generalized Advantage Estimation — GAE.
Schulman et al. (2017) — Proximal Policy Optimization Algorithms — PPO.
Schulman — Approximating KL Divergence — the k1/k2/k3 estimators.
Shao et al. (2024) — DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — where GRPO was introduced.
Ahmadian et al. (2024) — Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs — RLOO.
Kazemnejad et al. (2024; ICML 2025) — VinePPO: Refining Credit Assignment in RL Training of LLMs — the credit-assignment counterpoint.
Liu et al. (2025) — Understanding R1-Zero-Like Training: A Critical Perspective — Dr. GRPO.
Yu et al. (2025) — DAPO: An Open-Source LLM Reinforcement Learning System at Scale — clip-higher, dynamic sampling, token-level loss.
Zheng et al. (2025) — Group Sequence Policy Optimization — GSPO.
Chen et al. (2025) — MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention — CISPO.
Lambert — The RLHF Book — the reading that inspired this post’s arc.