\definecolor{cb1}{RGB}{76,114,176}
\definecolor{cb2}{RGB}{221,132,82}
\definecolor{cb3}{RGB}{85,168,104}
\definecolor{cb4}{RGB}{196,78,82}
\definecolor{cb5}{RGB}{129,114,179}
\definecolor{cb6}{RGB}{147,120,96}
\definecolor{cb7}{RGB}{218,139,195}
\newcommand{\abs}[1]{\left\lvert #1 \right\rvert}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\bigO}{\mathcal{O}}
\DeclareMathOperator{\Cov}{Cov}
\DeclareMathOperator{\E}{\mathbb{E}}
\newcommand{\indic}{\mathbb{I}}
\newcommand{\R}{\mathbb{R}}
\DeclareMathOperator{\Softmax}{Softmax}
\newcommand{\tp}{^\mathsf{T}}
\DeclareMathOperator{\Tr}{Tr}
\DeclareMathOperator{\Var}{Var}
\DeclareMathOperator{\NTK}{NTK}
\DeclareMathOperator{\eNTK}{\mathcal K}
\DeclareMathOperator{\pNTK}{pNTK}
\newcommand{\lin}{\mathit{lin}}
\newcommand{\xt}{{\color{cb2} \tilde x}}
\newcommand{\yt}{{\color{cb2} \tilde y}}
\newcommand{\exi}{{\color{cb1} x_i}}
\newcommand{\yi}{{\color{cb1} y_i}}
\newcommand{\yip}{{\color{cb1} y_i^+}}
\newcommand{\yim}{{\color{cb1} y_i^-}}
\newcommand{\Xs}{{\color{cb1} \mathbf X}}
\newcommand{\ys}{{\color{cb1} \mathbf y}}
\newcommand{\f}{\mathbf{f}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\z}{\mathbf{z}}
\newcommand{\st}{\boldsymbol{\chi}}
\newcommand{\sti}{{\color{cb1} \chi_i}}
\newcommand{\stip}{{\color{cb1} \chi_i^+}}
\newcommand{\stim}{{\color{cb1} \chi_i^-}}
\newcommand{\stt}{{\color{cb2} \tilde\chi}}
\newcommand{\tar}{\mathit{tar}}
\newcommand{\ptari}{{\color{cb1} p^\tar_i}}
\newcommand{\pstari}{{\color{cb1} p^*_i}}
Local Learning Dynamics Help Explain (Post-)Training Behaviour

Danica J. Sutherland (she)
University of British Columbia (UBC) / Alberta Machine Intelligence Institute (Amii)
Knowledge distillation analysis [ICLR-22]: Yi Ren, Shangmin Guo
Fine-tuning analysis [ICLR-23]: Yi Ren, Shangmin Guo, Wonho Bae
Active learning [NeurIPS-22]: M. Amin Mohamadi, Wonho Bae
Additional collaborators: M. Amin Mohamadi, Wonho Bae, Wenlong Deng, Yi Ren, Muchen Li, Xiaoxiao Li, Christos Thrampoulidis
Mila – June 2025
Slides available at djsutherland.ml/slides/entk-mila
LLM “post-training”

Language models (e.g. GPT-2): scrape up a ton of the internet (usually illegally), then train a big Transformer for next-token prediction. Super-useful as a component of lots of things…but not necessarily what we want by itself.

Turning a language model into a chatbot (e.g. ChatGPT):
  Run “supervised fine-tuning” on a dataset of chatbot-like interactions
  Run “preference optimization”: given prompt $x$, say $A$, not $B$

Surprises in LLM post-training

Preference optimization: “given prompt $x$, say $A$, not $B$.” A common algorithm is Direct Preference Optimization [RSM+ NeurIPS-23].
Weird things can happen here! It makes $B$ way less likely, but eventually the model almost always says some $C$. There are some workarounds, but…why?

Learning dynamics

Most theoretical analyses in this area ask: what do optimal solutions look like? It turns out the loss function is very underspecified; there are many optimal solutions. “Implicit regularization” studies which one GD/SGD/Adam/… eventually converges to, but “eventually” can take a really long time.

We'll take a related, but more qualitative, approach: what does each step do to the model? The mapping from parameters to “what the model does” is complicated. When I take an SGD step to “learn” $f(\exi) \approx \yi$, what happens to my predictions on $\xt$?

Learning dynamics (Taylor's version)

“Learning dynamics” of a model: $f_t(\xt)$ for some fixed $\xt$ as the params $\w_t$ change.

Suppose $\z = z(x)$, $\f = \sigma(\z)$, and we run “plain” SGD on $\frac1N \sum_{i=1}^N \mathcal L_t(\exi, \yi)$:
\begin{align*}
\!\!\!\!\!\!\!\!\!\!\!
\underbrace{\f_{t+1}(\xt)}_{k \times 1} - \underbrace{\f_t(\xt)}_{k \times 1}
&= \underbrace{\left( \nabla_\w \f(\xt) \rvert_{\w_t} \right)}_{k \times p}
\,
\underbrace{(\w_{t+1} - \w_t)}_{p \times 1}
{\color{gray}{} + \bigO\left( \norm{\w_{t+1} - \w_t}^2 \right)}
\\&\fragment[3]{ {}
= \underbrace{\left( \nabla_\w \f(\xt) \rvert_{\w_t} \right)}_{k \times p}
\Bigl( - \eta \underbrace{\nabla_\w \mathcal L(\exi, \yi) \rvert_{\w_t}}_{1 \times p} \Bigr)\tp
{\color{gray}{} + \bigO(\eta^2)}
}
\\&\fragment[4]{ {}
= - \eta \,
\underbrace{\bigl( \nabla_\z \f(\xt) \rvert_{\z_t} \bigr)}_{k \times k}
\underbrace{\bigl( \nabla_\w \z(\xt) \rvert_{\w_t} \bigr)}_{k \times p}
\;
\underbrace{ \left( \nabla_\w \z(\exi) \rvert_{\w_t} \right)\tp }_{p \times k} \,
\underbrace{ \left( \nabla_\z \mathcal L(\exi, \yi) \rvert_{\z_t} \right)\tp }_{k \times 1} \,
{\color{gray}{} + \bigO(\eta^2)}
}
\\&\fragment[5]{ {}
= - \eta \,
\underbrace{\mathcal A_t(\xt)}_{k \times k} \;
\underbrace{\mathcal K_t(\xt, \exi)}_{k \times k} \;
\underbrace{\mathcal G_t(\exi, \yi)}_{k \times 1}
{\color{gray}{} + \bigO(\eta^2)}
}
\end{align*}
Learning dynamics

“Learning dynamics” of a model: $f_t(\xt)$ for some fixed $\xt$ as the params $\w_t$ change.

To start: $\z = h_\theta(x)$, $f = \sigma(\z)$, “plain” SGD on $\frac1N \sum_{i=1}^N \mathcal L_t(\exi, \yi)$:
\f_{t+1}(\xt) - \f_t(\xt)
= - \eta \; \mathcal A_t(\xt) \; \mathcal K_t(\xt, \exi) \; \mathcal G_t(\exi, \yi)
{\color{gray}{} + \bigO(\eta^2)}

$\mathcal G_t(\exi, \yi) = \nabla_\z \mathcal L(\exi, \yi) \rvert_{\z_t}$: how much do I need to change my $\exi$ prediction?
  For square loss with $\sigma(\z) = \z$, $\mathcal G_t = \f_t(\exi) - \yi$: how wrong was I before?
  For cross-entropy on logits, $\mathcal G_t = \Softmax(\z_t(\exi)) - e_{\yi}$: how wrong was I before?

$\mathcal A_t(\xt) = \nabla_\z \sigma(\z_t)$ just “converts” prediction changes.
  If $\sigma(\z) = \z$, $\mathcal A_t$ is the identity; if $\sigma = \log \Softmax$, $\mathcal A_t = \mathbf I_k - \mathbf{1}_k \, \pi_t(\xt)\tp$.

$\mathcal K_t(\xt, \exi) = (\nabla_\w \z(\xt)\rvert_{\w_t}) (\nabla_\w \z(\exi)\rvert_{\w_t})\tp$ is the $k \times k$ empirical neural tangent kernel of $\z$.
  If $\exi$, $\xt$ are “dissimilar” (small eNTK), stepping on $(\exi, \yi)$ barely changes the $\xt$ prediction.
  If $\exi$, $\xt$ are “similar” (large eNTK), it makes the $\xt$ prediction more like $\yi$.
Example: learning dynamics on MNIST
\log \pi_{t+1}(\xt) - \log \pi_t(\xt) \approx - \eta \mathcal A_t(\xt) \, \mathcal K_t(\xt, \exi) \, \mathcal G_t(\exi, \yi)
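To make the decomposition concrete, here is a minimal sketch (mine, not from the slides) that checks $\f_{t+1}(\xt) - \f_t(\xt) \approx -\eta \, \mathcal A_t(\xt) \, \mathcal K_t(\xt, \exi) \, \mathcal G_t(\exi, \yi)$ numerically for a tiny two-layer network with log-softmax outputs and cross-entropy loss; the architecture, sizes, and step size are all arbitrary illustrative choices.

```python
import jax, jax.numpy as jnp

def logits(w, x):                                   # z = h_w(x): a tiny 2-layer MLP
    return jnp.tanh(x @ w["W1"] + w["b1"]) @ w["W2"] + w["b2"]

def log_probs(w, x):                                # f = log softmax(z)
    return jax.nn.log_softmax(logits(w, x))

def loss(w, x, y):                                  # cross-entropy on one example
    return -log_probs(w, x)[y]

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
d, width, k, eta = 5, 16, 3, 1e-2                   # arbitrary sizes / step size
w = {"W1": 0.3 * jax.random.normal(k1, (d, width)), "b1": jnp.zeros(width),
     "W2": 0.3 * jax.random.normal(k2, (width, k)), "b2": jnp.zeros(k)}
x_i, x_tilde = jax.random.normal(k3, (d,)), jax.random.normal(k4, (d,))
y_i = 1

# Exact change in predictions on x_tilde from one SGD step on (x_i, y_i)
g = jax.grad(loss)(w, x_i, y_i)
w_new = jax.tree_util.tree_map(lambda p, gp: p - eta * gp, w, g)
exact = log_probs(w_new, x_tilde) - log_probs(w, x_tilde)

# First-order decomposition: -eta * A_t(x~) @ K_t(x~, x_i) @ G_t(x_i, y_i)
def jac_z(x):                                       # (k, p) Jacobian of the logits w.r.t. params
    J = jax.jacrev(lambda w_: logits(w_, x))(w)
    return jnp.concatenate([a.reshape(k, -1) for a in jax.tree_util.tree_leaves(J)], axis=1)

K = jac_z(x_tilde) @ jac_z(x_i).T                   # empirical NTK of z, k x k
pi = jax.nn.softmax(logits(w, x_tilde))
A = jnp.eye(k) - jnp.ones((k, 1)) * pi[None, :]     # A_t = I_k - 1_k pi_t(x~)^T for log-softmax
G = jax.nn.softmax(logits(w, x_i)) - jax.nn.one_hot(y_i, k)   # G_t for cross-entropy
approx = -eta * A @ K @ G

print(exact)    # the two printed vectors should agree up to O(eta^2)
print(approx)
```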
But wait…aren't NTKs an unrealistic approximation?

A quick aside: the “NTK regime” and infinite limits
Full-batch GD: “stacking things up”,
\begin{align*}
f_{t+1}(\xt) - f_t(\xt)
&= - \frac{\eta}{N} \sum_{i=1}^N \mathcal A_t(\xt) \eNTK_{t}(\xt, \exi) \mathcal G_t(\exi, \yi)
{\color{gray}{} + \bigO(\eta^2)}
\\&\fragment[1]{ {}
= - \frac{\eta}{N}
\underbrace{\mathcal A_t(\xt)}_{k \times k} \;
\underbrace{\eNTK_t(\xt, \Xs)}_{k \times k N} \;
\underbrace{\mathcal G_t(\Xs, \ys)}_{k N \times 1}
{\color{gray}{} + \bigO(\eta^2)}
}
\end{align*}
Observation I: If $f$ is “wide enough” with any usual architecture+init* [Yang+Litwin 2021], $\eNTK_t(\cdot, \Xs)$ is roughly constant through training.
For square loss, $\mathcal G_t(\Xs, \ys) = f_t(\Xs) - \ys$: the dynamics agree with kernel regression!
f_t(\xt) \xrightarrow{t \to \infty} \eNTK_{0}(\xt, \Xs) \eNTK_{0}(\Xs, \Xs)^{-1} (\ys - f_0(\Xs)) + f_0(\xt)

Observation II: As $f$ becomes “infinitely wide” with any usual architecture+init* [Yang 2019], $\eNTK_{0}(x_1, x_2) \xrightarrow{a.s.} \NTK(x_1, x_2)$, independent of the random $\w_0$.
Infinite NTKs are great

Infinitely-wide neural networks have very simple behaviour! No need to worry about bad local minima, optimization complications, …
Understanding the “implicit bias” of wide nets $\approx$ understanding the NTK norm of functions.
Can compute $\NTK$ exactly for many architectures: a great kernel for many kernel methods!

But (infinite) NTKs aren't “the answer”

Computational expense:
  Poor scaling for large-data problems: typically $n^2$ memory and $n^2$ to $n^3$ computation.
    CIFAR-10 has $n = 50\,000$, $k = 10$: an $nk \times nk$ matrix of float64s is 2 terabytes!
    ILSVRC2012 has $n \approx 1\,200\,000$, $k = 1\,000$: 11.5 million terabytes (exabytes).
  For deep/complex models (especially CNNs), each pair is very slow / memory-intensive.
  Attention is even harder to handle.
Practical performance:
  Typically performs worse than GD for “non-small-data” tasks (MNIST and up).
Theoretical limitations:
  NTK “doesn't do feature learning”: we now know many problems where gradient descent on an NN $\gg$ any kernel method.
  There are cases where the GD error $\to 0$, but any kernel is barely better than random [Malach+ 2021].

What can we learn from empirical NTKs?

As a theoretical tool for local understanding:
  Why DPO breaks
  Why GRPO does weird stuff + how to fix
  Fine-grained explanation for early stopping in knowledge distillation
  How you should fine-tune models
As a practical tool for approximating “lookahead” in active learning
Plus: efficiently approximating $\eNTK$s for large output dimensions $k$, with guarantees
Adapting to the LLM setting

First problem: we don't classify a full response at a time; we do it token-by-token.
Once we've framed it correctly, this is fine: stack prompt+response into $\sti = [ \exi, \yi ]$.
The change in $\log \pi(\yt_{\color{cb1}m} \mid \xt, \yt_{:{\color{cb1}m}-1})$ based on a token-by-token update of $\exi, \yi$ is
[\Delta \log \pi_{t}(\yt \mid \stt)]_m
= - \sum_{{\color{cb1}l} = 1}^{\color{cb1} L_i} \eta [\mathcal A_t(\stt)]_{\color{cb2} m} [ \eNTK_t(\stt, \sti) ]_{{\color{cb2}m},\color{cb1}{l}} [ \mathcal G_t(\sti) ]_{\color{cb1}l}
{\color{gray}{} + \bigO(\eta^2)}
Second problem: we can't check all possible output probabilities anymore.
Workaround: track some informative possible responses:
  The dataset responses, rephrases, similar strings with different meanings
  Irrelevant responses in the training set, random sentences…

LLM supervised fine-tuning

SFT makes dispreferred answers more likely…because they're “similar enough” to the preferred ones:
  $\eNTK$ is reasonably large; $\mathcal G$ starts big (pulls up), gets small (pulls up less)
  Ungrammatical responses just go down; $\eNTK$ is small, so there is no upwards pressure
Also makes answers to different questions more likely…one form of hallucination?

Direct Preference Optimization (DPO)
\mathcal L_t^{\mathrm{DPO}}(\exi, \yip, \yim)
= -\log \sigma\left(\beta\left[
\log \frac{\pi_t(\yip \mid \exi)}{\pi_\mathrm{ref}(\yip \mid \exi)}
- \log \frac{\pi_t(\yim \mid \exi)}{\pi_\mathrm{ref}(\yim \mid \exi)}
\right]\right)
which gives that $[\Delta \log \pi_{t}(\yt \mid \stt)]_m$ is approximately the following, writing $\mathcal G_t^\mathrm{DPO}(\st) = \beta \bigl(1 - \sigma\bigl( \dots \bigr)\bigr) \bigl( \pi_t(\y \mid \st) - e_{\y} \bigr)$:
-\eta [\mathcal A_t(\stt)]_{\color{cb2} m} \Biggl(
\sum_{{\color{cb1}l} = 1}^{\color{cb1} L_i}
[ \eNTK_t(\stt, \stip) ]_{{\color{cb2}m},\color{cb1}{l}}
[ \mathcal G_t^\mathrm{DPO}(\stip) ]_{\color{cb1}l}
-
\sum_{{\color{cb1}l} = 1}^{\color{cb1} L_i}
[ \eNTK_t(\stt, \stim) ]_{{\color{cb2}m},\color{cb1}{l}}
[ \mathcal G_t^\mathrm{DPO}(\stim) ]_{\color{cb1}l}
\Biggr)
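As a sanity check on the per-response term, here is a toy sketch (not from the talk) in which the “policy” is just a pair of free logit vectors for single-token responses; autograd on the DPO loss recovers $\mathcal G_t^\mathrm{DPO}$ for the preferred response and its negation for the dispreferred one, which is where the minus sign in front of the second sum above comes from. All names and sizes are illustrative.

```python
import jax, jax.numpy as jnp

V, beta = 6, 0.1                                          # vocab size and DPO temperature (arbitrary)
key = jax.random.PRNGKey(0)
kp, km, kr = jax.random.split(key, 3)
params = {"z_plus":  0.5 * jax.random.normal(kp, (V,)),   # logits in the y+ context
          "z_minus": 0.5 * jax.random.normal(km, (V,))}   # logits in the y- context
z_ref = 0.5 * jax.random.normal(kr, (V,))                 # frozen reference-model logits
y_plus, y_minus = 2, 4                                    # preferred / dispreferred tokens

def margin(p):
    logp = lambda z, y: jax.nn.log_softmax(z)[y]
    return (logp(p["z_plus"], y_plus) - logp(z_ref, y_plus)) \
         - (logp(p["z_minus"], y_minus) - logp(z_ref, y_minus))

def dpo_loss(p):                                          # -log sigma(beta * margin)
    return -jnp.log(jax.nn.sigmoid(beta * margin(p)))

grads = jax.grad(dpo_loss)(params)

# Closed form from the slide: G^DPO(chi) = beta (1 - sigma(...)) (pi_t(y | chi) - e_y)
coef = beta * (1.0 - jax.nn.sigmoid(beta * margin(params)))
G_plus  = coef * (jax.nn.softmax(params["z_plus"])  - jax.nn.one_hot(y_plus, V))
G_minus = coef * (jax.nn.softmax(params["z_minus"]) - jax.nn.one_hot(y_minus, V))

print(jnp.allclose(grads["z_plus"],  G_plus,  atol=1e-6),   # gradient on the y+ logits:  G^DPO(chi+)
      jnp.allclose(grads["z_minus"], -G_minus, atol=1e-6))  # gradient on the y- logits: -G^DPO(chi-)
```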
This negative gradient can do really weird things:
  Positive gradients cancel out…in the positive context
  The squeezing effect accumulates over time

What can we learn from empirical NTKs?

As a theoretical tool for local understanding:
  Why DPO breaks
  Why GRPO does weird stuff + how to fix
  Fine-grained explanation for early stopping in knowledge distillation
  How you should fine-tune models
As a practical tool for approximating “lookahead” in active learning
Plus: efficiently approximating $\eNTK$s for large output dimensions $k$, with guarantees

Group Relative Policy Optimization (GRPO) [DeepSeekMath 24]

Similar to a “group-wise” version of DPO; negative gradients have a similar effect!
Negative token hidden rewards: estimate which tokens are bad by their correlation with tokens in positive responses, and down-weight penalties on tokens that are probably okay.

What can we learn from empirical NTKs?

As a theoretical tool for local understanding:
  Why DPO breaks
  Why GRPO does weird stuff + how to fix
  Fine-grained explanation for early stopping in knowledge distillation
  How you should fine-tune models
As a practical tool for approximating “lookahead” in active learning
Plus: efficiently approximating $\eNTK$s for large output dimensions $k$, with guarantees

Better supervisory signal implies better learning

Classification: the target is $\mathcal L_P = \E_{(x, y)} \mathcal L(x, y) = \E_x \E_{y \mid x} \ell_y(f(x))$.
Normally: see $\{(\exi, \yi)\}$ and minimize ($\vec\ell(\hat y) \in \R^k$ is the vector of losses for all possible labels):
\!\!\!
\mathcal L_{\Xs, \ys}
= \frac1N \sum_{i=1}^N \ell_{\yi}(f(\exi))
\fragment[1]{= \frac1N \sum_{i=1}^N \sum_{c=1}^k \indic(\yi = c) \ell_c(f(\exi)) }
\fragment[2]{= \frac1N \sum_{i=1}^N e_{\yi} \cdot \vec\ell(f(\exi))}
A potentially better scheme: see $\{(\exi, \ptari)\}$ and minimize
\displaystyle L^\tar(f) = \frac1N \sum_{i=1}^N \ptari \cdot \vec\ell(f(\exi))
This can reduce variance if $\ptari \approx \pstari$, the true conditional probabilities.

Knowledge distillation

Process:
  Train a teacher $f^\mathit{teacher}$ on $\{ (\exi, \yi) \}$ with standard ERM, $L(f)$
  Train a student on $\{ (\exi, {\color{cb4} f^\mathit{teacher}}(\exi)) \}$ with $L^\tar$
Usually ${\color{cb5} f^\mathit{student}}$ is “smaller” than ${\color{cb4} f^\mathit{teacher}}$.
But with “self-distillation” (using the same architecture), ${\color{cb5} f^\mathit{student}}$ often outperforms ${\color{cb4} f^\mathit{teacher}}$!
One possible explanation: ${\color{cb4} f^\mathit{teacher}(\exi)}$ is closer to $\pstari$ than the sampled $\yi$.
But why would that be?

Zig-Zagging behaviour in learning

Plots of (three-way) probabilistic predictions: one $\boldsymbol\times$ marker shows $\pstari$, the other shows $\yi$.
eNTK explains it

Let $q_t(\xt) = \operatorname{softmax}(f_t(\xt)) \in \R^k$; for cross-entropy loss, one SGD step gives us
q_{t+1}(\xt) - q_t(\xt)
= \eta \; \mathcal A_t(\xt) \, \eNTK_{\w_t}(\xt, \exi) \, (\ptari - q_t(\exi))
{\color{gray}{} + \bigO(\eta^2)}
$\mathcal A_t(\xt) = \operatorname{diag}(q_t(\xt)) - q_t(\xt) q_t(\xt)\tp$ is the covariance of a $\operatorname{Categorical}(q_t(\xt))$ distribution.
It improves distillation (especially with noisy labels) to take a moving average of $q_t(\exi)$ as $\ptari$.
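A minimal sketch (not the paper's actual procedure) of the two pieces above: the soft-target objective $L^\tar$, which reduces to ordinary cross-entropy for one-hot targets, and using a moving average of $q_t(\exi)$ as $\ptari$ for a distilled student. The linear model, EMA rate, and step counts are illustrative assumptions.

```python
import jax, jax.numpy as jnp

def predict(w, X):                             # a linear "network": logits = X @ w
    return X @ w

def soft_target_loss(w, X, P_tar):             # L^tar: (1/N) sum_i p_i^tar . (-log q(x_i))
    logq = jax.nn.log_softmax(predict(w, X), axis=-1)
    return -jnp.mean(jnp.sum(P_tar * logq, axis=-1))

key = jax.random.PRNGKey(0)
kx, ky, kw, ks = jax.random.split(key, 4)
N, d, k, eta, ema = 64, 10, 3, 0.5, 0.9        # sizes and rates are arbitrary
X = jax.random.normal(kx, (N, d))
y = jax.random.randint(ky, (N,), 0, k)         # (possibly noisy) hard labels

# With one-hot targets, L^tar is just ordinary cross-entropy: e_y . loss_vec(f(x))
hard_targets = jax.nn.one_hot(y, k)
grad_fn = jax.jit(jax.grad(soft_target_loss))

# Phase 1: train a "teacher" on hard labels, keeping a moving average of q_t(x_i)
w_teacher = 0.01 * jax.random.normal(kw, (d, k))
p_tar = jnp.full((N, k), 1.0 / k)
for t in range(300):
    w_teacher = w_teacher - eta * grad_fn(w_teacher, X, hard_targets)
    q_t = jax.nn.softmax(predict(w_teacher, X), axis=-1)
    p_tar = ema * p_tar + (1 - ema) * q_t      # averaged predictions used as p_i^tar

# Phase 2: train a fresh "student" against the averaged soft targets (distillation)
w_student = 0.01 * jax.random.normal(ks, (d, k))
for t in range(300):
    w_student = w_student - eta * grad_fn(w_student, X, p_tar)
```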
What can we learn from empirical NTKs?

As a theoretical tool for local understanding:
  Why DPO breaks
  Why GRPO does weird stuff + how to fix
  Fine-grained explanation for early stopping in knowledge distillation
  How you should fine-tune models
As a practical tool for approximating “lookahead” in active learning
Plus: efficiently approximating $\eNTK$s for large output dimensions $k$, with guarantees

Fine-tuning

Pretrain, re-initialize a random head, then adapt to a downstream task. Two phases:
  Head probing: only update the head $g(z)$
  Fine-tuning: update the head $g(z)$ and the backbone $z = f(x)$ together
If we only fine-tune: noise from the random head might break our features!
If we head-probe to convergence: we might already fit the training data and not change the features!

How much do we change our features?

Same kind of decomposition, with backbone features $z = f(x)$ and head $q = \operatorname{softmax}(g(z))$:
z_{t+1}(\xt) - z_t(\xt)
= \frac{\eta}{N} \sum_{i=1}^N
\underbrace{\eNTK_{\w_t}^{f}(\xt, \exi)}_\text{eNTK of backbone}
\, \underbrace{\left(\nabla_z q_t(\exi)\right)\tp}_\text{direction of head}
\, \underbrace{(e_{\yi} - q_t(\exi))}_\text{“energy”}
{\color{gray}{} + \bigO(\eta^2)}
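A small illustration (my toy setup, not the paper's) of the “energy” term above: with a linear backbone and a softmax head, a head probed to convergence leaves $\norm{e_{\yi} - q(\exi)}$ small, so one step pushes the backbone much less than a freshly re-initialized head does. All shapes and learning rates are illustrative.

```python
import jax, jax.numpy as jnp

def loss(params, X, y):
    z = X @ params["B"]                              # backbone features z = f(x)
    logits = z @ params["W"]                         # head g(z)
    return -jnp.mean(jax.nn.log_softmax(logits)[jnp.arange(X.shape[0]), y])

key = jax.random.PRNGKey(0)
kx, kb, kw, kt = jax.random.split(key, 4)
N, d, m, k = 128, 20, 8, 3                           # arbitrary sizes
X = jax.random.normal(kx, (N, d))
B = 0.1 * jax.random.normal(kb, (d, m))              # "pretrained" backbone, held fixed here
W_true = jax.random.normal(kt, (m, k))
y = jnp.argmax(X @ B @ W_true, axis=-1)              # labels the head can actually fit

def backbone_grad_norm(W):                           # how hard one step would push the backbone
    g = jax.grad(loss)({"B": B, "W": W}, X, y)
    return jnp.linalg.norm(g["B"])

def energy(W):                                       # || e_y - q(x) || over the training set
    q = jax.nn.softmax(X @ B @ W, axis=-1)
    return jnp.linalg.norm(jax.nn.one_hot(y, k) - q)

W_random = 0.5 * jax.random.normal(kw, (m, k))       # freshly re-initialized head

# "Head probing": update only W for a while, keeping the backbone B fixed
head_grad = jax.jit(lambda W: jax.grad(loss)({"B": B, "W": W}, X, y)["W"])
W_probed = W_random
for t in range(500):
    W_probed = W_probed - 1.0 * head_grad(W_probed)

print(energy(W_random), backbone_grad_norm(W_random))   # larger energy, larger feature change
print(energy(W_probed), backbone_grad_norm(W_probed))   # smaller energy, smaller feature change
```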
If the initial “energy”, e.g. $\E_{\exi,\yi} \norm{e_{\yi} - q_0(\exi)}$, is small, the features don't change much.
If we didn't do any head probing, the “direction” is very random, especially if $g$ is rich.
Specializing to a simple linear-linear model, we can get insights about trends in $z$.

Recommendations from the paper:
  Early stop during head probing (ideally, try multiple lengths for the downstream task)
  Label smoothing can help; so can more complex heads, but be careful
  With a random head (no head probing), generalized cross-validation on the eNTK model gives an excellent estimate of downstream loss

What can we learn from empirical NTKs?

As a theoretical tool for local understanding:
  Why DPO breaks
  Why GRPO does weird stuff + how to fix
  Fine-grained explanation for early stopping in knowledge distillation
  How you should fine-tune models
As a practical tool for approximating “lookahead” in active learning
Plus: efficiently approximating $\eNTK$s for large output dimensions $k$, with guarantees

Pool-based active learning

A pool of unlabeled data is available; ask for annotations of the “most informative” points.
Kinds of acquisition functions used for deep active learning:
  Uncertainty-based: maximum entropy, BALD
  Representation-based: BADGE, LL4AL
Another kind, used for simpler models, is lookahead criteria: “How much would my model change if I saw $\exi$ with label $\yi$?”
That's too expensive for deep learning…unless you use a local approximation to retraining.

Approximate retraining with local linearization

Given $f_{\mathcal L}$ trained on labeled data $\mathcal L$, approximate $f_{\mathcal L \cup \{(\exi, \yi)\}}$ with the local linearization
f_{\mathcal L \cup \{(\exi, \yi)\}}(\xt)
\approx
f_{\mathcal L}(\xt) +
\eNTK_{\w_{\mathcal L}}\left(\xt, \begin{bmatrix} \Xs_{\mathcal L} \\ \exi \end{bmatrix} \right)
\eNTK_{\w_{\mathcal L}}\left(\begin{bmatrix} \Xs_{\mathcal L} \\ \exi \end{bmatrix}, \begin{bmatrix} \Xs_{\mathcal L} \\ \exi \end{bmatrix} \right)^{-1}
\left( \begin{bmatrix} \ys_{\mathcal L} \\ \yi \end{bmatrix} - f_{\mathcal L}\left(\begin{bmatrix} \Xs_{\mathcal L} \\ \exi \end{bmatrix} \right) \right)
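Here is a compact sketch of that lookahead computation for a scalar-output toy regression model, so every kernel block is a scalar; for brevity it linearizes around a randomly initialized $w$ standing in for $w_{\mathcal L}$, and the ridge jitter is an added numerical convenience, not part of the formula. The data and labels are made up.

```python
import jax, jax.numpy as jnp

def f(w, x):                                           # tiny scalar-output MLP
    return (jnp.tanh(x @ w["W1"] + w["b1"]) @ w["W2"] + w["b2"])[0]

key = jax.random.PRNGKey(0)
k1, k2, kx, kc, ks = jax.random.split(key, 5)
d, m, nL = 4, 32, 20                                   # arbitrary sizes
w = {"W1": 0.5 * jax.random.normal(k1, (d, m)), "b1": jnp.zeros(m),
     "W2": 0.5 * jax.random.normal(k2, (m, 1)), "b2": jnp.zeros(1)}
X_L = jax.random.normal(kx, (nL, d))                   # labeled set L
y_L = jnp.sin(X_L @ jnp.ones(d))                       # toy labels
x_cand, y_cand = jax.random.normal(kc, (d,)), jnp.array(0.7)   # candidate point + hypothesized label
x_test = jax.random.normal(ks, (d,))                   # where we watch the prediction

flat_grad = lambda x: jnp.concatenate(
    [g.ravel() for g in jax.tree_util.tree_leaves(jax.grad(f)(w, x))])

def entk(A, B):                                        # eNTK_w(A, B) for scalar outputs
    return jnp.stack([flat_grad(a) for a in A]) @ jnp.stack([flat_grad(b) for b in B]).T

X_aug = jnp.vstack([X_L, x_cand[None, :]])             # [X_L; x_i]
y_aug = jnp.concatenate([y_L, y_cand[None]])           # [y_L; y_i]
f_aug = jnp.array([f(w, x) for x in X_aug])

K_tx = entk(x_test[None, :], X_aug)                    # 1 x (nL+1)
K_xx = entk(X_aug, X_aug)                              # (nL+1) x (nL+1)
jitter = 1e-6 * jnp.eye(len(X_aug))                    # small ridge for numerical stability
lookahead = f(w, x_test) + (K_tx @ jnp.linalg.solve(K_xx + jitter, y_aug - f_aug))[0]
print(lookahead)   # approximates f_{L ∪ {(x_cand, y_cand)}}(x_test) without any retraining
```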
Rank-one updates for efficient computation: [schematic]

We prove this is exact for infinitely wide networks: $f_0 \to f_{\mathcal L} \to f_{\mathcal L \cup \{(\exi, \yi)\}}$ agrees with the direct $f_0 \to f_{\mathcal L \cup \{(\exi, \yi)\}}$.
The local approximation with the eNTK “should” work much more broadly than the “NTK regime”:
  Much faster than SGD
  Much more effective than the infinite NTK and one-step SGD
  Matches/beats the state of the art
  Downside: usually more computationally expensive (especially memory)

Enables new interaction modes

“Sequential” querying: incorporate true new labels one at a time instead of in a batch; only update the $\eNTK$ occasionally.
Makes sense when labels cost \$ but are fast; other deep AL methods need to retrain.

What can we learn from empirical NTKs?

As a theoretical tool for local understanding:
  Why DPO breaks
  Why GRPO does weird stuff + how to fix
  Fine-grained explanation for early stopping in knowledge distillation
  How you should fine-tune models
As a practical tool for approximating “lookahead” in active learning
Plus: efficiently approximating $\eNTK$s for large output dimensions $k$, with guarantees
Approximating empirical NTKs

I hid something from you on the active learning (and Wei/Hu/Steinhardt fine-tuning) results…
With $k$ classes, $\eNTK(\Xs, \Xs) \in \R^{k N \times k N}$ – potentially very big.
But actually, we know that $\E_{\w} \eNTK_\w(x_1, x_2)$ is diagonal for most architectures. Let
\pNTK_\w(x_1, x_2) =
\underbrace{\left[ \nabla_\w f_1(x_1) \right]}_{1 \times p}
\underbrace{\left[ \nabla_\w f_1(x_2) \right]\tp}_{p \times 1} ;
\qquad \text{then} \quad
\E_\w \eNTK_\w(x_1, x_2) = \E_\w\left[ \pNTK_\w(x_1, x_2) \right] I_k .
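A quick numerical sketch (not from the paper) comparing the full $k \times k$ $\eNTK$ on one pair of inputs to the $\pNTK \cdot I_k$ approximation at two widths; the relative Frobenius error should shrink as the net gets wider. The one-hidden-layer ReLU architecture, initialization scaling, and widths are arbitrary assumptions.

```python
import jax, jax.numpy as jnp

k_out, d = 5, 8                                        # outputs and input dim (arbitrary)

def make_net(width, key):
    k1, k2 = jax.random.split(key)
    w = {"W1": jax.random.normal(k1, (d, width)) / jnp.sqrt(d),
         "W2": jax.random.normal(k2, (width, k_out)) / jnp.sqrt(width)}
    net = lambda w_, x: jnp.maximum(x @ w_["W1"], 0.0) @ w_["W2"]   # one-hidden-layer ReLU net
    return w, net

def jac_flat(net, w, x):                               # (k, p) Jacobian of the outputs w.r.t. params
    J = jax.jacrev(lambda w_: net(w_, x))(w)
    return jnp.concatenate([a.reshape(k_out, -1)
                            for a in jax.tree_util.tree_leaves(J)], axis=1)

key = jax.random.PRNGKey(0)
kx1, kx2, kw = jax.random.split(key, 3)
x1, x2 = jax.random.normal(kx1, (d,)), jax.random.normal(kx2, (d,))

for width in [64, 1024]:
    w, net = make_net(width, kw)
    J1, J2 = jac_flat(net, w, x1), jac_flat(net, w, x2)
    eNTK = J1 @ J2.T                                   # full k x k empirical NTK
    pNTK = J1[0] @ J2[0]                               # scalar kernel from the first logit
    rel_err = jnp.linalg.norm(eNTK - pNTK * jnp.eye(k_out)) / jnp.linalg.norm(eNTK)
    print(width, rel_err)                              # relative Frobenius error; shrinks with width
```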
$\pNTK(\Xs, \Xs) \in \R^{N \times N}$ (no $k$!)
Can also use the “sum of logits” $\frac{1}{\sqrt k} \sum_{j=1}^k f_j$ instead of just the “first logit” $f_1$.
Lots of work (including the above) has used $\pNTK$ instead of $\eNTK$, often without saying anything; sometimes it doesn't seem like they know they're doing it. Can we justify this more rigorously?

pNTK motivation

Say $f(x) = V \phi(x)$, $\phi(x) \in \R^h$, and $V \in \R^{k \times h}$ has rows $v_j \in \R^h$ with iid entries.
If $v_{j,i} \sim \mathcal N(0, \sigma^2)$, then $v_1$ and $\frac{1}{\sqrt k} \sum_{j=1}^k v_j$ have the same distribution.
\begin{align*}
\fragment[1]{
\eNTK_\w(x_1, x_2)_{jj'}
}&\fragment[1]{ {}
= v_j\tp \eNTK^\phi_{\w \setminus V}(x_1, x_2) \, v_{j'} + \indic(j = j') \phi(x_1)\tp \phi(x_2)
}
\\
\fragment[2]{\pNTK_\w(x_1, x_2)}
&\fragment[2]{ {}
= {\color{cb4} v_1\tp} \eNTK^\phi_{\w \setminus V}(x_1, x_2) \, {\color{cb4} v_1} + \phi(x_1)\tp \phi(x_2)
}
\end{align*}
We want to bound the difference $\eNTK(x_1, x_2) - \pNTK(x_1, x_2) I_k$.
We want $v_1\tp A v_1$ and $v_j\tp A v_j$ to be close, and $v_j\tp A v_{j'}$ to be small, for random $v$ and fixed $A$. Using Hanson-Wright:
\displaystyle \frac{\norm{\eNTK - \pNTK I}_F}{\norm{\eNTK}_F} \le \frac{\norm{\eNTK^\phi}_F + 4 \sqrt h}{\Tr(\eNTK^\phi)} k \log \frac{2 k^2}{\delta}
For fully-connected ReLU nets at init., in fan-in mode: the numerator is $\bigO(h \sqrt h)$, the denominator $\Theta(h^2)$.

pNTK's Frobenius error

Same kind of theorem / empirical results for the largest eigenvalue, and empirical results for $\lambda_\min$ and the condition number.

Kernel regression with pNTK

Reshape things to handle prediction appropriately:
\begin{align*}
\underbrace{f_{\eNTK}(\xt)}_{k \times 1} &=
\underbrace{f_0(\xt)}_{k \times 1}
+ \phantom{\Big(}
\underbrace{\eNTK_{\w_0}(\xt, \Xs)}_{k \times k N}
\,\underbrace{\eNTK_{\w_0}(\Xs, \Xs)^{-1}}_{k N \times k N}
\,\underbrace{(\ys - f_0(\Xs))}_{kN \times 1}
\\
\underbrace{f_{\pNTK}(\xt)}_{k \times 1} &=
\underbrace{f_0(\xt)}_{k \times 1}
+ \Big(
\underbrace{\pNTK_{\w_0}(\xt, \Xs)}_{1 \times N}
\,\underbrace{\pNTK_{\w_0}(\Xs, \Xs)^{-1}}_{N \times N}
\,\underbrace{(\ys - f_0(\Xs))}_{N \times k}
\Big)\tp
\end{align*}
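The reshaping above, as a runnable sketch (my toy setup, not the paper's experiments): both predictors are computed from Jacobians at $\w_0$ for a small ReLU net, and they should agree up to the $\bigO(1/\sqrt h)$ error discussed next. Sizes, targets, and the jitter term are illustrative.

```python
import jax, jax.numpy as jnp

k_out, d, width, N = 3, 6, 512, 40                     # all sizes arbitrary
key = jax.random.PRNGKey(1)
k1, k2, kx, ky, kt = jax.random.split(key, 5)
w0 = {"W1": jax.random.normal(k1, (d, width)) / jnp.sqrt(d),
      "W2": jax.random.normal(k2, (width, k_out)) / jnp.sqrt(width)}
net = lambda w, x: jnp.maximum(x @ w["W1"], 0.0) @ w["W2"]

X = jax.random.normal(kx, (N, d))
Y = jax.nn.one_hot(jax.random.randint(ky, (N,), 0, k_out), k_out)   # one-hot regression targets
x_tilde = jax.random.normal(kt, (d,))

def jac(x):                                            # (k, p) Jacobian at w0
    J = jax.jacrev(lambda w_: net(w_, x))(w0)
    return jnp.concatenate([a.reshape(k_out, -1)
                            for a in jax.tree_util.tree_leaves(J)], axis=1)

J_X = jnp.concatenate([jac(x) for x in X], axis=0)     # (kN, p): k rows per training point
J_t = jac(x_tilde)                                     # (k, p)
f0_X = jnp.stack([net(w0, x) for x in X])              # (N, k)
f0_t = net(w0, x_tilde)
jitter = 1e-6                                          # small ridge term for numerical stability

# Full eNTK regression: k x kN and kN x kN kernels, residuals flattened to length kN
resid = (Y - f0_X).reshape(-1)
f_eNTK = f0_t + J_t @ J_X.T @ jnp.linalg.solve(J_X @ J_X.T + jitter * jnp.eye(k_out * N), resid)

# pNTK regression: N x N kernel from the first logit, residuals kept as an N x k matrix
g_X, g_t = J_X[0::k_out], J_t[0]
f_pNTK = f0_t + g_t @ g_X.T @ jnp.linalg.solve(g_X @ g_X.T + jitter * jnp.eye(N), Y - f0_X)

print(f_eNTK)   # the two k-dimensional predictions should be close for wide networks
print(f_pNTK)
```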
We have $\norm{f_{\eNTK}(\xt) - f_{\pNTK}(\xt)} = \bigO(\frac{1}{\sqrt h})$ again.
If we add regularization, we need to “scale” $\lambda$ between the two.

[Plots: kernel regression with pNTK; pNTK speed-up; pNTK speed-up on an active learning task]

pNTK for full CIFAR-10 regression
$\eNTK(\Xs, \Xs)$ on CIFAR-10: 1.8 terabytes of memory.
$\pNTK(\Xs, \Xs)$ on CIFAR-10: 18 gigabytes of memory.
Worse than the infinite NTK for FCN/ConvNet (where they can be computed, if you try hard); way worse than SGD.

Recap

eNTK is a good tool for intuitive understanding of the learning process
[Deng, Ren, M. Li, S., X. Li, Thrampoulidis]
eNTK is practically very effective at “lookahead” for active learning
You should probably use pNTK instead of eNTK for high-dim output problems: