mon, apr 27, 2026, 05:15:15

Distributed training notes: DDP, NCCL, and the bf16 question

Notes I keep adding to as I learn. Less an essay than a working set of footnotes.

DDP is wonderfully simple until something hangs. Then it’s wonderfully complex. The first hour of any new training run, I assume hangs are NCCL’s fault until proven otherwise. The second hour, I assume they are data-loader stalls. The third hour, I sit down with the trace.
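
A minimal sketch of that first-hour ritual, assuming torchrun is launching the processes: make NCCL loud, and give the process group a finite timeout so a hang becomes an error with a stack trace instead of a silent stall. The env-var names are the current PyTorch ones; check against your version.

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist

# Set these before init_process_group, or NCCL won't see them.
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # log every collective setup
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # PyTorch-side collective checks

def init_ddp() -> int:
    # torchrun supplies LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR/PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # A finite timeout turns "hung forever" into a timeout error you can read.
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
    return local_rank
```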

bf16 versus fp16: bf16 wins for training. fp16 wins for nothing in 2026. bf16 keeps fp32’s eight exponent bits and pays for them with mantissa precision, so the overflows that loss scaling exists to paper over mostly stop happening. That dynamic-range win is decisive enough that I’ve stopped reasoning about loss-scaling entirely, which gives me back hours of my life.
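
Two numbers carry the whole argument, plus the training-step consequence. The step below is a sketch, not my actual loop; model, batch, loss_fn, and optimizer are stand-ins.

```python
import torch

# fp16 tops out at 65504; bf16 shares fp32's 8-bit exponent, so ~3.4e38.
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # 3.3895313892515355e+38

def step(model, batch, loss_fn, optimizer):
    # bf16 autocast: no GradScaler, no loss-scaling schedule to babysit.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()  # grads accumulate in the params' dtype (fp32 here)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```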

Sequence packing earned its keep on the last run: throughput went up 40%. Implementation took longer than I expected, mostly because the attention mask matters more than the loss mask. Packed rows need a block-diagonal causal mask so tokens in one document can’t attend into the next, and getting that subtly wrong costs you accuracy without telling you; sketch below.
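
Here is the part that bit me, reduced to its core: per-token segment ids for one packed row, turned into a causal, block-diagonal attention mask. The segment_ids convention and the function name are mine, not from any particular library.

```python
import torch

def packed_attention_mask(segment_ids: torch.Tensor) -> torch.Tensor:
    """segment_ids: (seq_len,) ints, e.g. [0,0,0,1,1,2,2,2] for three packed docs.
    Returns (seq_len, seq_len) bool, True where attention is allowed."""
    same_doc = segment_ids[:, None] == segment_ids[None, :]  # block-diagonal part
    pos = torch.arange(segment_ids.shape[0], device=segment_ids.device)
    causal = pos[:, None] >= pos[None, :]                    # lower-triangular part
    return same_doc & causal  # attend only backwards, only within your own doc

ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
mask = packed_attention_mask(ids)
assert not mask[3, 2]  # first token of doc 1 must not see doc 0
```

The loss mask, by contrast, just zeroes positions out of the loss. The attention mask changes what every token computes, which is why getting it wrong is silent.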

thanks for reading —j.