Can Uniform Convergence Explain Interpolation Learning?

Danica J. Sutherland (TTI-Chicago → UBC)

based on arXiv:2006.05942 (NeurIPS 2020), with:

Lijia Zhou

U Chicago

Nati Srebro

TTI-Chicago

Penn State Statistics Seminar, October 8 2020


- Given i.i.d. samples $(x_1, y_1), \dots, (x_n, y_n) \sim \mathcal{D}$
- features/covariates $x_i \in \mathbb{R}^d$, labels/targets $y_i \in \mathbb{R}$

- Want $h$ such that $L(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\, \ell(h(x), y)$ is small for new samples from $\mathcal{D}$
- e.g. squared loss: $\ell(h(x), y) = (h(x) - y)^2$

- Standard approaches based on empirical risk minimization: $\hat h = \operatorname*{argmin}_{h \in \mathcal{H}} \hat L(h)$, with $\hat L(h) = \frac{1}{n} \sum_{i=1}^n \ell(h(x_i), y_i)$
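As a concrete toy instance of this setup (the dimensions, distribution, and noise level below are my own illustrative choices, not from the talk), ERM with squared loss over linear predictors is just least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: y = <w*, x> + noise, squared loss.
d, n, n_test, sigma = 5, 200, 10_000, 0.5
w_star = rng.standard_normal(d)

def sample(m):
    X = rng.standard_normal((m, d))
    return X, X @ w_star + sigma * rng.standard_normal(m)

X, y = sample(n)             # training set
X_te, y_te = sample(n_test)  # fresh samples, to estimate population risk

# ERM over linear predictors h_w(x) = <w, x>: least squares.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

train_risk = np.mean((X @ w_hat - y) ** 2)       # empirical risk
test_risk = np.mean((X_te @ w_hat - y_te) ** 2)  # ~ population risk
# With n >> d, both land near the noise floor sigma**2.
```

With $n \gg d$ both risks sit near $\sigma^2$; the interesting regime later in the talk is the opposite, overparameterized one.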

We have lots of bounds like: with probability $\ge 1 - \delta$, for all $h \in \mathcal{H}$, $L(h) \le \hat L(h) + \varepsilon_n(\mathcal{H}, \delta)$

$\varepsilon_n$ could be from VC dimension, covering number, RKHS norm, Rademacher complexity, fat-shattering dimension, …

Then for large $n$, $\varepsilon_n \to 0$, so $L(\hat h) \le \hat L(\hat h) + o(1)$

Classical wisdom: “a model with zero training error is overfit to the training data and will typically generalize poorly”

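The classical picture is easy to reproduce in a toy example (mine, not the talk's): interpolating noisy 1-d data with a high-degree polynomial gives zero training error but poor test error, while a simple line does fine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration of the classical wisdom: noisy samples of f(x) = x.
n = 15
x = np.linspace(-1.0, 1.0, n)
y = x + 0.3 * rng.standard_normal(n)

interp = np.polynomial.Polynomial.fit(x, y, deg=n - 1)  # zero training error
line = np.polynomial.Polynomial.fit(x, y, deg=1)        # smooths out the noise

x_te = np.linspace(-1.0, 1.0, 1001)
err_interp = np.mean((interp(x_te) - x_te) ** 2)  # large: wild oscillations
err_line = np.mean((line(x_te) - x_te) ** 2)      # small
```

The interpolating fit passes through every noisy label, so between the samples it oscillates; the talk's point is that this intuition fails in other regimes.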

We'll call a model with $\hat L(h) = 0$ an *interpolating* predictor

[Figure: performance with added label noise on MNIST (%); Belkin/Ma/Mandal, ICML 2018]

Lots of recent theoretical work on interpolation.

[Belkin+ NeurIPS 2018],
[Belkin+ AISTATS 2018],
[Belkin+ 2019],
[Hastie+ 2019],

[Muthukumar+ JSAIT 2020],
[Bartlett+ PNAS 2020],
[Liang+ COLT 2020],
[Montanari+ 2019],
many more…

None* bound $L(\hat h)$ via uniform convergence, i.e. via $L(\hat h) \le \hat L(\hat h) + \sup_{h \in \mathcal{H}} \bigl| L(h) - \hat L(h) \bigr|$.

Is it possible to find such a bound?

Can uniform convergence explain interpolation learning?

*One exception-ish
[Negrea/Dziugaite/Roy, ICML 2020]:

relates the learned interpolator to a surrogate predictor,

shows uniform convergence for the surrogate

We're only going to worry about *consistency*: $L(\hat h_n) \to L^*$ as $n \to \infty$…

…in a *non-realizable* setting: $L^* = \inf_h L(h) > 0$ (there is label noise)

Is it possible to show consistency of an interpolator with uniform convergence?

This requires tight constants! An interpolator has $\hat L(\hat h) = 0$, so uniform convergence gives only $L(\hat h) \le \sup_{h \in \mathcal{H}} |L(h) - \hat L(h)|$; for consistency, that supremum must converge to exactly $L^*$, with no slack.

$x = (x_S, x_J)$: “signal” $x_S \sim \mathcal{N}(0, I_{d_S})$, | “junk” $x_J \sim \mathcal{N}\bigl(0, \tfrac{\lambda_n}{d_J} I_{d_J}\bigr)$, | $y = \langle w_S^*, x_S \rangle + \xi$, $\xi \sim \mathcal{N}(0, \sigma^2)$

$\lambda_n$ controls scale of junk: $\mathbb{E} \|x_J\|^2 = \lambda_n$

Linear regression: predictors $h_w(x) = \langle w, x \rangle$

Min-norm interpolator: $\hat w = \operatorname*{argmin}_{w : X w = y} \|w\|_2$

As $d_J \to \infty$, $\hat w_S$ approaches ridge regression on the signal: $\hat w_S \to \operatorname*{argmin}_{w_S} \|X_S w_S - y\|^2 + \lambda_n \|w_S\|^2$ for almost all samples

If $\lambda_n \to \infty$ with $\lambda_n = o(n)$, this ridge predictor is consistent: $L \to L^* = \sigma^2$

$\Rightarrow$ the min-norm interpolator $\hat w$ is consistent when $d_S$ is fixed, $\lambda_n \to \infty$, $\lambda_n = o(n)$.
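A quick simulation sketch of this setting (my notation and constants, an illustration of the claim rather than the paper's exact parameters): with many junk features, the signal coordinates of the min-norm interpolator track ridge regression on the signal alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Signal + junk" sketch: x = (x_S, x_J), d_S signal features ~ N(0, I),
# d_J junk features ~ N(0, (lam/d_J) I), so E||x_J||^2 = lam.
n, d_S, d_J, lam, sigma = 40, 3, 50_000, 20.0, 0.5
w_star = np.array([1.0, -2.0, 0.5])

X_S = rng.standard_normal((n, d_S))
X_J = rng.standard_normal((n, d_J)) * np.sqrt(lam / d_J)
y = X_S @ w_star + sigma * rng.standard_normal(n)

# Minimum-norm interpolator of the full (signal + junk) design.
X = np.hstack([X_S, X_J])
w_min = np.linalg.pinv(X) @ y
assert np.allclose(X @ w_min, y)  # zero training error

# For large d_J, its signal coordinates approach ridge regression
# on the signal features alone, with regularization lam.
w_ridge = np.linalg.solve(X_S.T @ X_S + lam * np.eye(d_S), X_S.T @ y)
gap = np.linalg.norm(w_min[:d_S] - w_ridge)
```

Intuitively, the junk block absorbs the residual at a norm cost of roughly $\|y - X_S w_S\|^2 / \lambda_n$, which is exactly the ridge trade-off on the signal coordinates.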

Could we have shown that with uniform convergence?

__Theorem:__ If $\hat w_n$ is any natural, consistent interpolator in this setting, then almost surely $\sup_{\|w\| \le \|\hat w_n\|} \bigl| L(w) - \hat L(w) \bigr|$ stays bounded away from $L^*$: uniform convergence over the norm ball cannot certify consistency.

Proof idea:

For each sample $(X, y)$,
let $\tilde y = -y$ (flip the labels),

run the *natural* consistent interpolator on $(X, \tilde y)$,

and call the result $\tilde w$, so $\|\tilde w\| = \|\hat w\|$. Then, almost surely, $\tilde w$ lies in the norm ball but its gap $\bigl| L(\tilde w) - \hat L(\tilde w) \bigr|$ stays bounded away from $L^*$.

([Negrea/Dziugaite/Roy, ICML 2020] had a very similar result for the minimum-norm interpolator)

*Natural* interpolators: $\|\hat w\|$ doesn't change if $y$ flips to $-y$. Examples:

minimum $\ell_2$ norm,
minimum $\ell_1$ norm,
$\operatorname*{argmin}_{w : X w = y} \psi(w)$,

with each $\psi$ convex and even

Proof shows that for most samples $(X, y)$,

there's a typical predictor (inside the norm ball)

that's good on most inputs (small $L$),

but very bad on *specifically* $(X, y)$ (large $\hat L$)

- Convergence of surrogates [Negrea/Dziugaite/Roy, ICML 2020]?
- Nice, but not really the same thing…

- Give up?
- Or…

We don't really care about predictors with small $L$, big $\hat L$….

Could we bound $\sup_h \bigl( L(h) - \hat L(h) \bigr)$ instead of $\sup_h \bigl| L(h) - \hat L(h) \bigr|$?

- Existing uniform convergence proofs are “really” about the one-sided $\sup_h \bigl( L(h) - \hat L(h) \bigr)$ [Nagarajan/Kolter, NeurIPS 2019]
- Strongly expect this is still too large for norm balls in our testbed

- Not possible to show the one-sided gap is big for *all* natural consistent interpolators
- If consistent and interpolating, use …

So far, used hypothesis classes $\mathcal{H}_B = \{ w : \|w\| \le B \}$

But *we only care about interpolators*. How about $\sup \bigl\{ L(w) : \|w\| \le B,\ \hat L(w) = 0 \bigr\}$?

Is this “uniform convergence”?

It's the standard notion for realizable ($L^* = 0$) analyses…

Are there analyses like this for $L^* > 0$?

Applying [Srebro/Sridharan/Tewari 2010]: for all $B$, a high-probability bound on $\sup \{ L(w) : \|w\| \le B,\ \hat L(w) = 0 \}$ of order $B^2 \log^3(n) / n$: unspecified constants and log factors, so not tight enough here.

But if we suppose the same bound holds with tight constants and no log factors, we would get a novel prediction:

__Theorem:__ In our setting, uniform convergence over low-norm interpolators holds with tight constants, and in particular certifies consistency of the min-norm interpolator.

- Confirms the speculation based on the tight-constants assumption
- Shows consistency with uniform convergence (of interpolators)
- New result for error of not-quite-minimal-norm interpolators
- Minimal norm is asymptotically consistent
- Larger norm is at worst proportionally worse

What does $\{ w : \hat L(w) = 0,\ \|w\| \le B \}$ look like?

$\{ w : \hat L(w) = 0 \}$ is the plane $\{ w : X w = y \}$

Intersection of the $B$-ball with this hyperplane:

a ball centered at the min-norm interpolator $\hat w$, of radius $\sqrt{B^2 - \|\hat w\|^2}$

Can write as $w = w_0 + N v$

where $w_0$ is *any* interpolator,
$N$ is a basis for $\operatorname{null}(X)$

Can change variables in $\sup \{ L(w) : X w = y,\ \|w\| \le B \}$ to an optimization over $v$

Quadratic program, one quadratic constraint: strong duality

Exactly equivalent to a problem in *one* scalar variable:

Can analyze this for different choices of …
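The change of variables above is easy to check numerically; a small sketch with arbitrary sizes, taking `N` from an SVD:

```python
import numpy as np

rng = np.random.default_rng(0)

# Every interpolator is w0 + N v, where w0 is any one interpolator
# and the columns of N span null(X). Sizes here are arbitrary.
n, d = 10, 25
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w0 = np.linalg.pinv(X) @ y               # minimum-norm interpolator
_, _, Vt = np.linalg.svd(X, full_matrices=True)
N = Vt[n:].T                             # d x (d - n) orthonormal basis of null(X)

v = rng.standard_normal(d - n)
w = w0 + N @ v
assert np.allclose(X @ w, y)             # still interpolates, for any v

# Since w0 is orthogonal to null(X), ||w||^2 = ||w0||^2 + ||v||^2:
# the interpolators with ||w|| <= B form a ball of radius
# sqrt(B^2 - ||w0||^2) around w0 inside the plane {X w = y}.
norm_sq = np.linalg.norm(w) ** 2
decomposed = np.linalg.norm(w0) ** 2 + np.linalg.norm(v) ** 2
```

Substituting $w = w_0 + N v$ turns the constrained supremum over $w$ into an unconstrained quadratic objective in $v$ with one quadratic norm constraint, which is what makes the dual tractable.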

In Gaussian least squares generally, we have $L(w) = \sigma^2 + (w - w^*)^\top \Sigma\, (w - w^*)$, so $w_n$ is consistent iff $\|w_n - w^*\|_\Sigma \to 0$.
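A short derivation of that risk decomposition, under the model $y = \langle w^*, x \rangle + \xi$ with $x \sim \mathcal{N}(0, \Sigma)$ and independent noise $\xi$, $\mathbb{E}\xi = 0$, $\mathbb{E}\xi^2 = \sigma^2$:

```latex
L(w) = \mathbb{E}\bigl(\langle w, x \rangle - y\bigr)^2
     = \mathbb{E}\bigl(\langle w - w^*, x \rangle - \xi\bigr)^2
     = (w - w^*)^\top \Sigma\, (w - w^*) + \sigma^2 ,
```

since the cross term vanishes by independence of $x$ and $\xi$. Hence $L(w_n) \to L^* = \sigma^2$ iff $\|w_n - w^*\|_\Sigma^2 = (w_n - w^*)^\top \Sigma\, (w_n - w^*) \to 0$.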

Very useful for lower bounds! [Muthukumar+ JSAIT 2020]

Roughly, “how much” of the signal is “missed” by the predictor

Analyzing the dual with one choice of the scalar variable,

we get, without **any** distributional assumptions, a bound of the form

(amount of missed energy) $\times$ (available norm)

If consistent, everything of smaller norm is also consistent iff this term vanishes

In our setting:

the min-norm interpolator $\hat w$ is consistent,

Plugging in:

Analyzing the dual with a second choice of the scalar variable, we get in general: if the min-norm interpolator is consistent,

In our setting:

the min-norm interpolator is consistent, because

Plugging in:

…and we're done!

On Uniform Convergence and Low-Norm Interpolation Learning