Andrew Ng (@AndrewYNg) has just released a new course: "Post-training of LLMs".
This course is taught by Banghua Zhu (@BanghuaZ), Assistant Professor at the University of Washington and co-founder of NexusFlow.
This course targets one of the most practical techniques in today's AI field: how to turn a model that only predicts the next word into a truly useful assistant.
It's important to know that training an LLM involves two key stages: pre-training and post-training.
In the pre-training stage, the model learns to predict the next word or token from vast amounts of unlabeled text. It's only in the post-training stage that it learns truly useful behaviors: following instructions, using tools, and reasoning.
Post-training transforms a general token predictor (a model trained on trillions of unlabeled text tokens) into an assistant capable of following instructions and performing specific tasks.
More importantly, post-training is significantly cheaper than pre-training, allowing far more teams to integrate these methods into their workflows.
Three Post-Training Methods
The course highlights three common post-training methods:
Supervised Fine-Tuning (SFT)
You provide the model with pairs of inputs and desired outputs to train on. This is the most direct method, like teaching a child to recognize words by telling it, "when you see this question, answer like this."
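To make this concrete, here is a minimal sketch of a single SFT step (an illustration, not the course's code), using PyTorch and Hugging Face transformers. The model name and the lone training pair are placeholders; the key detail is that the loss is computed only on the response tokens:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder: any small base model works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# a toy (input, desired output) pair standing in for a real SFT dataset
pairs = [("What is the capital of France?", "The capital of France is Paris.")]

model.train()
for prompt, response in pairs:
    prompt_ids = tok(prompt + "\n", return_tensors="pt").input_ids
    full_ids = tok(prompt + "\n" + response + tok.eos_token,
                   return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore loss on the prompt tokens
    loss = model(input_ids=full_ids, labels=labels).loss  # next-token loss on the response
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```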
Direct Preference Optimization (DPO)
You simultaneously provide a preferred (chosen) and a less preferred (rejected) response, training the model to favor better outputs. This is like telling the model "this answer is good, that one is bad," teaching it to distinguish quality.
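The objective behind this is easy to sketch. The snippet below (an illustration with assumed notation, not the course implementation) computes the DPO loss from per-sequence log-probabilities under the policy and a frozen reference model; the loss shrinks as the margin in favor of the chosen response grows:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # implicit "rewards": how much the policy deviates from the frozen reference
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # the loss is small when the chosen response wins by a clear margin
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# toy numbers standing in for real per-sequence log-probabilities
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # a scalar the optimizer would minimize
```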
Online Reinforcement Learning (RL)
After the model generates an output, it receives a reward score based on human or automated feedback, and then the model is updated to improve performance. This is more like a "candy for doing it right, correction for doing it wrong" training approach.
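Conceptually, the loop looks something like the toy sketch below: a simple REINFORCE-style update rather than PPO or GRPO, with a placeholder model and a made-up reward function, just to show the generate, score, and update cycle:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def reward_fn(text: str) -> float:
    # stand-in for human or automated feedback
    return 1.0 if "Paris" in text else -1.0

prompt = "What is the capital of France? Answer briefly:"
inputs = tok(prompt, return_tensors="pt")
gen = model.generate(**inputs, max_new_tokens=16, do_sample=True)
completion = tok.decode(gen[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
reward = reward_fn(completion)  # score the sampled output

# recompute log-probabilities of the sampled completion under the current policy
logits = model(gen).logits[:, :-1, :]
token_logps = torch.log_softmax(logits, dim=-1).gather(
    -1, gen[:, 1:].unsqueeze(-1)).squeeze(-1)
completion_logp = token_logps[:, inputs.input_ids.shape[1] - 1:].sum()

loss = -reward * completion_logp  # push up log-prob when the reward is positive
loss.backward()
optimizer.step()
optimizer.zero_grad()
```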
Hands-On Practice is Key
The biggest highlight of this course is the extensive hands-on labs.
You will:
- Build an SFT pipeline to transform a base model into an instruction-following model
- Explore how DPO reshapes behavior by minimizing a contrastive loss that penalizes rejected responses and reinforces preferred ones
- Implement a DPO pipeline to change the identity of a chatbot assistant
- Learn online RL methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), and how to design reward functions
- Train a model using GRPO to improve its mathematical abilities through verifiable rewards (see the sketch after this list)
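For the GRPO lab mentioned above, here is a hedged sketch of what a "verifiable reward" and group-relative advantages might look like; the function names and the exact reward rule are illustrative, not the course's:

```python
import re
import torch

def math_reward(completion: str, ground_truth: str) -> float:
    # verifiable reward: 1.0 if the last number in the completion matches
    # the reference answer, otherwise 0.0
    numbers = re.findall(r"-?\d+\.?\d*", completion)
    return 1.0 if numbers and numbers[-1] == ground_truth else 0.0

def group_relative_advantages(rewards):
    # GRPO's core trick: normalize rewards within a group of completions
    # sampled for the same prompt, so no separate value model is needed
    r = torch.tensor(rewards)
    return (r - r.mean()) / (r.std() + 1e-8)

completions = ["... so the answer is 42", "The result is 41",
               "Answer: 42", "I think it is 40"]
rewards = [math_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```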
All labs are based on pre-trained models downloaded from Hugging Face, allowing you to see firsthand how each technique shapes model behavior.
Enthusiastic Community Response
The course sparked heated discussion upon its release.
TaskDrift™ (@TaskDrift) highlighted the key point:
More people need to understand the power of post-training. SFT, DPO, and RL are not just for large labs; they unlock real use cases for smaller teams. Glad this course makes it practical and hands-on.
Consciousness is logic (@logicThink11031) opined:
Theoretically, it can fill the gaps in LLMs, but essentially, this is still an effort to refine the grid (similar to the concept of calculus approaching infinity).
He further noted:
I have always believed: from Turing tools → McCulloch-Pitts neural networks → LLMs, from the perspective of intelligence, the overall direction should be wrong. If we don't change direction theoretically and just blindly fill gaps, it doesn't make much sense!
Sudhir Gajre (@SudhirGajre) offered practical advice:
Andrew, I haven't seen the course materials yet. But I suggest including some discussion on the boundaries and limitations of context engineering. In my opinion, you should only consider post-training after you have exhausted CE.
Post-training is one of the fastest-growing areas in LLM training
Whether you want to build highly accurate context-specific assistants, fine-tune a model's tone, or improve accuracy for specific tasks, this course will give you hands-on experience with the most important techniques shaping LLM post-training today.
Victor Ajayi (@the_victorajayi)'s comment is representative:
I've been eager to dive into post-training, and this course looks like the perfect opportunity. SFT, DPO, and RL are powerful tools for shaping real-world AI behavior, and I can't wait to see how each method works in practice. Thanks for making this accessible!
If pre-training is like teaching a person to read, so they can recognize every word and understand the basic rules of language, then post-training is like teaching them to write: when to use which words, and how to organize language to convey a specific meaning. The former makes the model "know"; the latter makes it able to use what it knows.
This also explains why post-training is so important:
Without it, we would only have a knowledgeable but unhelpful "bookworm."
Now, Andrew Ng has placed the key to this technology in your hands.
Whether you learn it or not is up to you; that's all the help I can offer.
Course Link: https://www.deeplearning.ai/short-courses/post-training-of-llms/