Sherman Wong

I write tech reports on AI, ML, and systems.

Latest from the Blog

Scaling Reinforcement Learning with Verifiable Reward (RLVR)

The Basics of Post-training – Scaling Test-Time Compute. The biggest innovation from GPT-o1 is that it proves test-time scaling is another dimension besides scaling data and model parameters. Both RL and best-of-n therefore share a common structure, differing only in when the optimization cost is paid: RL pays it during training, best-of-n pays it during…
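As a rough sketch of the inference-time side of that trade-off (the names generate and verifier_score below are hypothetical placeholders, not code from the post): best-of-n spends its search budget per query, sampling n candidates and keeping the one a verifier scores highest, whereas RL spends a comparable optimization budget once during training so a single sample at inference is already good.

```python
# Minimal best-of-n sketch: the optimization cost is paid at inference time
# by drawing n candidate completions and keeping the highest-scoring one.
# `generate` and `verifier_score` are assumed callables (hypothetical).

def best_of_n(prompt, generate, verifier_score, n=8):
    # n forward passes per query: this is where the compute is spent
    candidates = [generate(prompt) for _ in range(n)]
    # keep the candidate the verifier rates highest
    return max(candidates, key=lambda c: verifier_score(prompt, c))
```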

LLM-RL Fine-Tuning – Math Collections

A humble attempt to establish a systematic, theoretical understanding of LLM RL fine-tuning. This is an initial effort to summarize how traditional RL loss formulations transition into those used in LLMs. Note that this is an ongoing list; I plan to gradually enrich it with more equations as the framework matures.
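As one example of the kind of transition the collection covers (a standard identity, not an equation taken from the post): the generic policy-gradient estimator over trajectories specializes, for an LLM, to a token-level sum over the generated response, with the state becoming the prompt plus the tokens generated so far and the action becoming the next token:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t\right]
\;\longrightarrow\;
\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[\sum_{t=1}^{|y|} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})\, A_t\right]
$$

where $x$ is the prompt, $y_{<t}$ are the previously generated tokens, and $A_t$ is the advantage assigned to token $y_t$.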

State of SFT

In the previous post, we articulated the distinction between supervised fine-tuning (SFT) and instruction fine-tuning (IFT). As the field advanced through 2024, the focus shifted to applications. IFT is now widely regarded as alignment research, while SFT serves as a versatile tool to adapt generic LLM checkpoints to specific domains. For practitioners, SFT remains the…