PYTHON · PYTORCH · GRPO · 2026

qwen3-nano-math-reasoner

post-training a 0.6B model to reason about math — GRPO vs. distillation

[ shipped ]

How far can a 0.6B model be pushed on math reasoning? This project takes Qwen3-0.6B and runs three post-training recipes against the MATH-500 benchmark, holding everything else fixed so the comparison is honest.

The three approaches:

  • GRPO — reinforcement learning with a simple correctness reward, the same algorithm I derive step by step in RL for Language Models, From First Principles.
  • Cross-family SFT distillation — supervised fine-tuning on reasoning traces from DeepSeek V4 Pro.
  • Same-family SFT distillation — supervised fine-tuning on traces from the much larger Qwen3-235B-A22B.
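To make the GRPO bullet concrete, here is a minimal sketch of a binary correctness reward and the group-relative advantage that GRPO builds from it. It is not the project's code: the function names, the \boxed{...} answer convention, exact string matching, and the use of the population standard deviation are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None.

    Assumes answers are reported in \\boxed{...}; this is an illustrative
    convention, not necessarily the one the project uses.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never balanced


def correctness_reward(completion, gold_answer):
    """1.0 if the boxed answer matches the reference string exactly, else 0.0."""
    return 1.0 if extract_boxed(completion) == gold_answer.strip() else 0.0


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Each completion is scored relative to its siblings: (r - mean) / (std + eps).
    Population std is an assumption; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one hypothetical problem whose answer is "42".
completions = [
    "... so the result is \\boxed{42}",
    "... therefore \\boxed{41}",
    "I am not sure.",
    "... giving \\boxed{42}",
]
rewards = [correctness_reward(c, "42") for c in completions]
advantages = group_advantages(rewards)
```

Correct completions get a positive advantage and incorrect ones a negative advantage. When every completion in a group earns the same reward, all advantages are zero, so that prompt contributes no learning signal.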

The headline finding is about ordering: distillation followed by a GRPO continuation does best, reaching 50% on a 50-problem MATH-500 subset, up from the base model’s 16% and about 10 points behind the official Qwen3 reasoning variant at 60.4%. Same-family distillation alone lands at 37.8%, ahead of cross-family at 34.6%. Warm-start the policy with distillation, then let RL sharpen it.

Built on PyTorch and Transformers, with the reasoning_from_scratch package and support for both custom and HuggingFace runtimes.