\definecolor{cb1}{RGB}{76,114,176}
\definecolor{cb2}{RGB}{221,132,82}
\definecolor{cb3}{RGB}{85,168,104}
\definecolor{cb4}{RGB}{196,78,82}
\definecolor{cb5}{RGB}{129,114,179}
\definecolor{cb6}{RGB}{147,120,96}
\definecolor{cb7}{RGB}{218,139,195}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\argmax}{arg\,max}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\bSigma}{\mathbf{\Sigma}}
\newcommand{\D}{\mathcal{D}}
\DeclareMathOperator*{\E}{\mathbb{E}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\F}{\mathbf{F}}
\newcommand{\K}{\mathcal{K}}
\newcommand{\Lip}{\mathrm{Lip}}
\DeclareMathOperator{\loss}{\mathit{L}}
\DeclareMathOperator{\lossdist}{\loss_\D}
\DeclareMathOperator{\losssamp}{\loss_\samp}
\newcommand{\N}{\mathcal{N}}
\newcommand{\norm}[1]{\lVert #1 \rVert}
\newcommand{\Norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\op}{\mathit{op}}
\newcommand{\samp}{\mathbf{S}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\tp}{^{\mathsf{T}}}
\DeclareMathOperator{\Tr}{Tr}
\newcommand{\w}{\mathbf{w}}
\newcommand{\wmn}{\hat{\w}_{\mathit{MN}}}
\newcommand{\wmr}{\hat{\w}_{\mathit{MR}}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\X}{\mathbf{X}}
\newcommand{\Y}{\mathbf{y}}
\newcommand{\y}{\mathrm{y}}
\newcommand{\z}{\mathbf{z}}
\newcommand{\zero}{\mathbf{0}}
\newcommand{\Scol}[1]{{\color{cb1} #1}}
\newcommand{\Jcol}[1]{{\color{cb2} #1}}
\newcommand{\dS}{\Scol{d_S}}
\newcommand{\dJ}{\Jcol{d_J}}
\newcommand{\xS}{\Scol{\x_S}}
\newcommand{\xJ}{\Jcol{\x_J}}
\newcommand{\XS}{\Scol{\X_S}}
\newcommand{\XJ}{\Jcol{\X_J}}
\newcommand{\wS}{\Scol{\w_S}}
\newcommand{\wsS}{\Scol{\w_S^*}}
\newcommand{\wJ}{\Jcol{\w_J}}
\newcommand{\wlam}{\Scol{\hat{\w}_{\lambda_n}}}
Can Uniform Convergence Explain Interpolation Learning?
Danica J. Sutherland (she/her)
University of British Columbia (UBC) / Alberta Machine Intelligence Institute (Amii)
NYU Center for Data Science – November 10, 2021
Supervised learning
Given i.i.d. samples
\samp = \{(\x_i, y_i)\}_{i=1}^n \sim \D^n
features/covariates
\x_i \in \R^d
, labels/targets
y_i \in \R
Want
f
such that
f(\x) \approx y
for new samples from
\D
:
f^* = \argmin_f\left[ \lossdist(f) := \E_{(\x, \y) \sim \D} \loss(f(\x), \y) \right]
e.g. squared loss:
L(\hat y, y) = (\hat y - y)^2
Standard approaches based on empirical risk minimization:
\hat{f} \approx \argmin_f\left[ \losssamp(f) := \frac1n \sum_{i=1}^n \loss(f(\x_i), y_i) \right]
Statistical learning theory
We have lots of bounds like:
with probability
\ge 1 - \delta
,
\sup_{f \in \mathcal F} \left\lvert \lossdist(f) - \losssamp(f) \right\rvert
\le \sqrt{ \frac{C_{\mathcal F, \delta}}{n} }
C_{\mathcal F,\delta}
could be from VC dimension, covering number, RKHS norm, Rademacher complexity, fat-shattering dimension, …
Then for large
n
,
\lossdist(f) \approx \losssamp(f)
, so
\hat f \approx f^*
\lossdist(\hat f) \le \losssamp(\hat f) + \sup_{f \in \mathcal F} \left\lvert \lossdist(f) - \losssamp(f) \right\rvert
Interpolation learning
Classical wisdom: “a model with zero training error is overfit to the training data and will typically generalize poorly”
(when
\lossdist(f^*) > 0
) We'll call a model with
\losssamp(f) = 0
an interpolating predictor
[Figure: performance of interpolating models vs. added label noise on MNIST (%); Belkin/Ma/Mandal, ICML 2018]
\lossdist(\hat f) \le \underbrace{\losssamp(\hat f)}_0 + \sqrt{\frac{C_{\mathcal F, \delta}}{n}}
Since the training-loss term vanishes for an interpolator, any explanation has to come from the complexity term itself; we would want bounds of the form
\lossdist(\hat f) \le {\color{#aaa} \losssamp(\hat f) +} \text{bound}
\lossdist(\hat f) \le \lossdist(f^*) + \text{bound}
*One exception-ish
[Negrea/Dziugaite/Roy, ICML 2020 ] :
relates
\hat f
to a surrogate predictor,
shows uniform convergence for the surrogate.
(Also, a few things since our first paper.)
A more specific version of the question
Today, we're mainly going to worry about consistency:
\E[ \lossdist(\hat f) - \lossdist(f^*)] \to 0
…in a noisy setting:
\lossdist(f^*) > 0
…for Gaussian linear regression:
\x \sim \N(\zero, \Sigma) \quad y = \langle \x, \w^* \rangle + \N(\zero, \sigma^2) \quad L(y, \hat y) = (y - \hat y)^2
Is it possible to show consistency of an interpolator with
\lossdist(\hat f)
\le \underbrace{\losssamp(\hat f)}_0
+ \sup_{f \in \mathcal F} \left\lvert \lossdist(f) - \losssamp(f) \right\rvert
?
This requires tight constants!
A testbed problem: “junk features”
Features split into a “signal” block of dimension
\dS
and a “junk” block of dimension
\dJ \to \infty
:
\x = (\xS, \xJ), \qquad \xS \sim \mathcal N\left( \zero_{d_S}, \eye_{d_S} \right), \qquad \xJ \sim \mathcal N\left( \zero_{d_J}, \frac{\lambda_n}{d_J} \eye_{d_J} \right)
\w^* = (\wsS, \zero)
y = \underbrace{\langle \x, \w^* \rangle}_{\langle \xS, \wsS \rangle} + \mathcal N(0, \sigma^2)
\lambda_n
controls scale of junk:
\E \lVert \xJ \rVert_2^2 = \lambda_n
Linear regression:
\loss(y, \hat y) = (y - \hat y)^2
Min-norm interpolator:
\displaystyle \wmn = \argmin_{\X \w = \Y} \lVert \w \rVert_2 = \X^\dagger \Y
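As a concrete, purely illustrative sketch of this testbed (the sizes n, d_S, d_J and the values of λ_n, σ are arbitrary choices, not from the paper), the NumPy snippet below draws one junk-features sample and computes the min-norm interpolator and its population risk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_S, d_J = 100, 5, 5_000          # samples, signal dims, junk dims (d_J large)
lam_n, sigma = 10.0, 1.0             # junk scale lambda_n and noise level sigma

w_star_S = rng.standard_normal(d_S)                          # true signal weights
X_S = rng.standard_normal((n, d_S))                          # signal features ~ N(0, I)
X_J = rng.standard_normal((n, d_J)) * np.sqrt(lam_n / d_J)   # junk ~ N(0, (lam_n/d_J) I)
X = np.hstack([X_S, X_J])
y = X_S @ w_star_S + sigma * rng.standard_normal(n)

# Min-norm interpolator: the least-norm solution of X w = y (equivalently X^+ y)
w_mn, *_ = np.linalg.lstsq(X, y, rcond=None)
print("interpolates:", np.allclose(X @ w_mn, y))

# Population risk L_D(w) = sigma^2 + (w - w*)' Sigma (w - w*), with block-diagonal Sigma
w_star = np.concatenate([w_star_S, np.zeros(d_J)])
diff = w_mn - w_star
risk = sigma**2 + diff[:d_S] @ diff[:d_S] + (lam_n / d_J) * (diff[d_S:] @ diff[d_S:])
print(f"||w_mn||_2 = {np.linalg.norm(w_mn):.2f},  L_D(w_mn) = {risk:.2f},  L_D(w*) = sigma^2 = {sigma**2:.2f}")
```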
Consistency of
\wmn
\displaystyle \wmn = \argmin_{\X \w = \Y} \lVert \w \rVert_2 = \X^\dagger \Y
As
\dJ \to \infty
,
\wmn
approaches ridge regression on the signal, with ridge parameter \lambda_n
\wmn
is consistent when
\dS
fixed,
\dJ \to \infty
,
\lambda_n = o(n)
Could we have shown that with uniform convergence?
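A quick numerical check of the ridge limit (again only an illustrative sketch; the planted w*_S = (1, …, 1), the sizes, and λ_n are arbitrary choices): as d_J grows, the signal block of the min-norm interpolator should match ridge regression on X_S with parameter λ_n.

```python
import numpy as np

def signal_block_of_min_norm(d_J, n=100, d_S=3, lam_n=10.0, sigma=1.0, seed=0):
    """Signal coordinates of the min-norm interpolator in the junk-features model."""
    rng = np.random.default_rng(seed)
    X_S = rng.standard_normal((n, d_S))                         # signal features
    eps = sigma * rng.standard_normal(n)                        # label noise (same across d_J)
    X_J = rng.standard_normal((n, d_J)) * np.sqrt(lam_n / d_J)  # junk features
    y = X_S @ np.ones(d_S) + eps                                # planted w*_S = (1, ..., 1)
    w_mn, *_ = np.linalg.lstsq(np.hstack([X_S, X_J]), y, rcond=None)
    return w_mn[:d_S], X_S, y

lam_n = 10.0
for d_J in (200, 2_000, 20_000):
    w_S, X_S, y = signal_block_of_min_norm(d_J, lam_n=lam_n)
    # ridge regression on the signal features with regularization lambda_n
    w_ridge = np.linalg.solve(X_S.T @ X_S + lam_n * np.eye(X_S.shape[1]), X_S.T @ y)
    print(d_J, np.round(w_S, 3), np.round(w_ridge, 3))
```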
No uniform convergence on norm balls
Theorem:
In junk features with
\lambda_n = o(n)
,
\lim_{n\to\infty} \lim_{\dJ \to \infty}
\E\left[ \sup_{\norm\w_2 \le \norm\wmn_2} \lvert \lossdist(\w) - \losssamp(\w) \rvert \right]
= \infty
.
No uniform convergence on norm balls - proof sketch
Theorem:
In junk features with
\lambda_n = o(n)
,
\lim_{n\to\infty} \lim_{\dJ \to \infty}
\E\left[ \sup_{\norm\w_2 \le \norm\wmn_2} \lvert \lossdist(\w) - \losssamp(\w) \rvert \right]
= \infty
.
Proof idea:
\begin{align*}
\lossdist(\w)
&= (\w - \w^*)\tp \bSigma (\w - \w^*) + \lossdist(\w^*)
\\
\lossdist(\w) - \losssamp(\w)
&= (\w - \w^*)\tp (\bSigma - \hat\bSigma) (\w - \w^*)
\\&\quad
+ \left( \lossdist(\w^*) - \losssamp(\w^*) \right)
- \text{cross term}
\\
\sup [\dots]
&\ge \lVert \bSigma - \hat\bSigma \rVert_\op \cdot ( \norm\wmn_2 - \norm{\w^*}_2 )^2 + o(1)
\to \infty
\end{align*}
where \norm\wmn_2^2 = \Theta\left( \frac{n}{\lambda_n} \right) and, by [Koltchinskii/Lounici, Bernoulli 2017], \lVert \bSigma - \hat\bSigma \rVert_\op = \Theta\left( \sqrt{\frac{\lambda_n}{n}} \right), so the product is \Theta\left( \sqrt{\frac{n}{\lambda_n}} \right) \to \infty when \lambda_n = o(n).
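A rough Monte Carlo illustration of those two quantities (an assumption-laden sketch: λ_n = √n, finite d_J, and small n are arbitrary choices, so the growth of the product is slow and only directional):

```python
import numpy as np

rng = np.random.default_rng(1)
d_S, sigma, d_J = 2, 1.0, 2_000

def norm_and_covariance_gap(n, lam_n):
    X_S = rng.standard_normal((n, d_S))
    X_J = rng.standard_normal((n, d_J)) * np.sqrt(lam_n / d_J)
    X = np.hstack([X_S, X_J])
    y = X_S @ np.ones(d_S) + sigma * rng.standard_normal(n)
    w_mn, *_ = np.linalg.lstsq(X, y, rcond=None)
    Sigma = np.diag(np.r_[np.ones(d_S), np.full(d_J, lam_n / d_J)])
    Sigma_hat = X.T @ X / n                        # empirical covariance
    op_gap = np.linalg.norm(Sigma - Sigma_hat, 2)  # operator norm of the deviation
    return np.linalg.norm(w_mn) ** 2, op_gap

for n in (50, 200, 800):
    lam_n = np.sqrt(n)                             # one choice with lambda_n = o(n)
    sq_norm, op_gap = norm_and_covariance_gap(n, lam_n)
    print(f"n={n:4d}  ||w_mn||^2={sq_norm:6.1f}  op gap={op_gap:5.2f}  product={sq_norm * op_gap:6.1f}")
```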
A more refined uniform convergence analysis?
\{ \w : \lVert\w\rVert \le B \}
is no good. Maybe
\{ \w : A \le \lVert\w\rVert \le B \}
?
In junk features,
for each
\delta \in (0, \frac12)
,
let
\mathcal{S}_{n,\delta}
be any set of training samples with
\Pr\left(\samp \in \mathcal{S}_{n,\delta}\right) \ge 1 - \delta
,
let
\hat\w
be a natural consistent interpolator,
and let
\mathcal{W}_{n,\delta} = \left\{ \hat\w(\samp) : \samp \in \mathcal{S}_{n,\delta} \right\}
.
Then, almost surely,
\lim_{n \to \infty} \lim_{\dJ \to \infty}
\sup_{\samp \in \mathcal{S}_{n,\delta}}
\sup_{\w \in \mathcal{W}_{n,\delta}}
\lvert \lossdist(\w) - \losssamp(\w) \rvert
\ge 3 \sigma^2
.
([Negrea/Dziugaite/Roy, ICML 2020 ] had a very similar result for
\wmn
)
Natural interpolators:
\Scol{\hat\w_S}
doesn't change if
\XJ
flips to
-\XJ
. Examples:
\wmn
,
\displaystyle \argmin_{\w:\X\w = \Y} \lVert \w \rVert_1
,
\displaystyle \argmin_{\w:\X\w = \Y} \lVert \w - \w^* \rVert_2
,
\displaystyle \argmin_{\w:\X\w = \Y} \Scol{f_S(\w_S)} + \Jcol{f_J(\w_J)}
with each
f
convex,
\Jcol{f_J(-\w_J) = f_J(\w_J)}
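A quick numerical check of this sign-flip property for the min-norm interpolator (a sketch with arbitrary sizes, not from the paper): flipping X_J negates the junk block of the min-norm solution and leaves the signal block unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_S, d_J, lam_n, sigma = 50, 4, 500, 5.0, 1.0

X_S = rng.standard_normal((n, d_S))
X_J = rng.standard_normal((n, d_J)) * np.sqrt(lam_n / d_J)
y = X_S @ np.ones(d_S) + sigma * rng.standard_normal(n)

def min_norm(X_J_block):
    """Min-norm interpolator for the design [X_S, X_J_block]."""
    w, *_ = np.linalg.lstsq(np.hstack([X_S, X_J_block]), y, rcond=None)
    return w

w_plus, w_minus = min_norm(X_J), min_norm(-X_J)
print(np.allclose(w_plus[:d_S], w_minus[:d_S]))   # signal block unchanged -> True
print(np.allclose(w_plus[d_S:], -w_minus[d_S:]))  # junk block flips sign   -> True
```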
Proof shows that for most
\samp
,
there's a typical predictor
\w
(in
\mathcal W_{n,\delta}
)
that's good on most inputs (
\lossdist(\w) \to \sigma^2
),
but very bad on specifically
\samp
(
\losssamp(\w) \to 4 \sigma^2
)
So, what are we left with?
Convergence of surrogates [Negrea/Dziugaite/Roy, ICML 2020]? Nice, but not really the same thing…
Only do analyses based on e.g. the exact form of
\wmn
?
We'd like to keep the good things about uniform convergence:
Applies to more than just one specific predictor
Tells us more about “why” things generalize
Easier to apply without a nice closed form
Or… one-sided uniform convergence?
We don't really care about small
\lossdist
, big
\losssamp
….
Could we bound
\sup \lossdist - \losssamp
instead of
\sup \lvert \lossdist - \losssamp \rvert
?
Existing uniform convergence proofs are “really” about
\lvert \lossdist - \losssamp \rvert
[Nagarajan/Kolter, NeurIPS 2019 ] Strongly expect still
\infty
for norm balls in our testbed
\lambda_\max(\bSigma - \hat\bSigma)
instead of
\lVert \bSigma - \hat\bSigma \rVert_\op
Not possible to show
\sup_{f \in \mathcal F} \lossdist - \losssamp
is big for all
\mathcal F
If
\hat f
consistent and
\inf_f \losssamp(f) \ge 0
, use
\mathcal F = \{ f : \lossdist(f) \le \lossdist(f^*) + \epsilon_{n,\delta} \}
A broader view of uniform convergence
So far, used
\displaystyle
\lossdist(\w) - \losssamp(\w)
\le \sup_{\norm\w_2 \le B} \left\lvert \lossdist(\w) - \losssamp(\w) \right\rvert
But we only care about interpolators . How about
\sup_{\norm\w_2 \le B, \;\color{blue}{\losssamp(\w) = 0}} \left\lvert \lossdist(\w) {\color{#aaa} {} - \losssamp(\w)} \right\rvert
?
Is this “uniform convergence”?
It's the standard notion for noiseless (
\lossdist(w^*) = 0
) analyses…
The interpolator ball in linear regression
What does
\{\w :
\norm\w_2 \le B, \,
\losssamp(\w) = 0
\}
look like?
\{ \w : \losssamp(\w) = \frac1n \lVert \X \w - \Y \rVert_2^2 = 0 \}
is the plane
\X\w = \Y
Intersection of
d
-ball with
(d-n)
-hyperplane :
(d-n)
-ball centered at
\wmn
Optimistic rates
Applying [Srebro/Sridharan/Tewari 2010]:
for all
\norm\w_2 \le B
,
\textstyle
\lossdist(\w) - \losssamp(\w)
\le \tilde{\bigO}_P\left( \frac{B^2 \psi_n}{n} + \sqrt{\losssamp(\w) \frac{B^2 \psi_n}{n} } \right)
\psi_n
: high-prob bound on
\max_{i=1,\dots,n} \lVert \x_i \rVert_2^2
\sup_{\lVert\w\rVert_2 \le B,\, {\color{blue} \losssamp(\w) = 0 }}
\lossdist(\w)
\le {\color{red} c} \frac{B^2 \psi_n}{n} + o_P(1)
if
1 \ll \lambda_n \ll n
,
B = \lVert\wmn\rVert_2
,
\to c \lossdist(\w^*)
c \le 200,000 \, \log^3(n)
If this holds with
c = 1
(and maybe
\psi_n = \E \norm{\x}_2^2
) ,
would explain consistency on junk features,
and predict that
B = \alpha \lVert\wmn\rVert_2
gives
\alpha^2 \lossdist(\w^*)
Main result of first paper
Theorem: If
\lambda_n = o(n)
,
\lim_{n \to \infty} \lim_{\dJ \to \infty} \E\left[
\sup_{\substack{\lVert\w\rVert \le \alpha \lVert\wmn\rVert\\\losssamp(\w) = 0}}
{\color{#aaa} \!\!\!\vert}\!\!\! \lossdist(\w) {\color{#aaa} {} - \losssamp(\w) \rvert}
\right]
\!= \alpha^2 \lossdist(\w^*)
Confirms speculation based on
\color{red} c = 1
assumption
Shows consistency with uniform convergence (of interpolators)
New result for the error of not-quite-minimal-norm interpolators:
Norm
\lVert\wmn\rVert + \text{const}
is asymptotically consistent
Norm
1.1 \lVert\wmn\rVert
is at worst
1.21 \lossdist(\w^*)
What does
\{\w :
\lVert\w\rVert \le B, \,
\losssamp(\w) = 0
\}
look like?
\{ \w : \losssamp(\w) = \frac1n \lVert \X \w - \Y \rVert^2 = 0 \}
is the plane
\X\w = \Y
Intersection of
d
-ball with
(d-n)
-hyperplane :
(d-n)
-ball centered at
\wmn
Can write as
\{\hat\w + \F \z : \z \in \R^{d-n},\, \lVert \hat\w + \F \z \rVert \le B \}
where
\hat\w
is any interpolator,
\F
is basis for
\operatorname{ker}(\X)
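A small sketch of this parametrization (arbitrary sizes, not from the paper) using an orthonormal null-space basis; it also shows the norm decomposition that makes the constraint set a (d−n)-ball when the base interpolator is the min-norm one.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(3)
n, d = 20, 60
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_mn, *_ = np.linalg.lstsq(X, y, rcond=None)   # one particular interpolator (min-norm)
F = null_space(X)                              # d x (d-n) orthonormal basis of ker(X)

z = rng.standard_normal(d - n)
w = w_mn + F @ z                               # every such w still interpolates
print(np.allclose(X @ w, y))                   # True

# w_mn is orthogonal to ker(X), so norms decompose: ||w||^2 = ||w_mn||^2 + ||z||^2
print(np.isclose(np.linalg.norm(w)**2, np.linalg.norm(w_mn)**2 + np.linalg.norm(z)**2))
```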
Decomposition via strong duality
Can change variables in
\sup_{\w : \lVert \w \rVert \le B, \, \losssamp(\w) = 0} \lossdist(\w)
to
\lossdist(\w^*) +
\sup_{\z : \lVert \hat\w + \F \z \rVert^2 \le B^2}
(\hat\w + \F \z - \w^*)\tp \bSigma (\hat\w + \F \z - \w^*)
Quadratic program, one quadratic constraint: strong duality
Exactly equivalent to problem in one scalar variable:
\lossdist(\hat{\w})
+ \inf_{\mu > \lVert \F\tp \bSigma \F \rVert }
\left\lVert \F\tp [ \mu \hat{\w} - \bSigma(\hat{\w} - \w^*)] \right\rVert_{
(\mu \eye_{d-n} - \F\tp \bSigma \F)^{-1}
}
+ \mu (B^2 - \lVert \hat{\w} \rVert^2)
Can analyze this for different choices of
\hat\w
…
The minimal-risk interpolator
\begin{align*}
\wmr \,
&{}= \argmin_{\w : \X \w = \Y} \lossdist(\w)
\\&= \w^* + \Sigma^{-1} \X\tp (\X \Sigma^{-1} \X\tp)^{-1} (\Y - \X \w^*)
\end{align*}
In Gaussian least squares generally, have that
\E \lossdist(\wmr) = \frac{d - 1}{d - 1 - n} \lossdist(\w^*)
so
\wmr
is consistent iff
n = o(d)
.
Very useful for lower bounds! [Muthukumar+ JSAIT 2020 ]
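A sketch verifying the closed form numerically (the generic covariance and all sizes are arbitrary choices): the minimal-risk interpolator interpolates the data, and its population risk is no larger than that of any other interpolator, e.g. the min-norm one.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 30, 100, 1.0
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d + 0.1 * np.eye(d)          # a generic full-rank covariance
w_star = rng.standard_normal(d) / np.sqrt(d)

L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, d)) @ L.T          # rows ~ N(0, Sigma)
y = X @ w_star + sigma * rng.standard_normal(n)

def risk(w):
    """Population risk L_D(w) = sigma^2 + (w - w*)' Sigma (w - w*)."""
    diff = w - w_star
    return sigma**2 + diff @ Sigma @ diff

Sinv = np.linalg.inv(Sigma)
w_mr = w_star + Sinv @ X.T @ np.linalg.solve(X @ Sinv @ X.T, y - X @ w_star)
w_mn, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(X @ w_mr, y))                # w_mr interpolates: True
print(risk(w_mr) <= risk(w_mn) + 1e-10)        # smallest population risk among interpolators
print(f"L_D(w_mr)={risk(w_mr):.3f}  L_D(w_mn)={risk(w_mn):.3f}  L_D(w*)={sigma**2:.3f}")
```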
Restricted eigenvalue under interpolation
\kappa_\X(\bSigma)
= \sup_{\lVert\w\rVert = 1, \; \X\w = \zero} \w\tp \bSigma \w
Roughly, “how much” of
\bSigma
is “missed” by
\X
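Numerically, this quantity can be computed by restricting Σ to the kernel of X (a sketch; the diagonal covariance and sizes are arbitrary choices): it is the top eigenvalue of F^T Σ F for an orthonormal basis F of ker(X).

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(5)
n, d = 15, 40
Sigma = np.diag(np.linspace(1.0, 0.1, d))                   # some covariance
X = rng.standard_normal((n, d)) * np.sqrt(np.diag(Sigma))   # rows ~ N(0, Sigma)

# kappa_X(Sigma) = sup_{||w||=1, Xw=0} w' Sigma w = lambda_max(F' Sigma F),
# where F is an orthonormal basis of ker(X)
F = null_space(X)
kappa = np.linalg.eigvalsh(F.T @ Sigma @ F).max()
print(f"kappa_X(Sigma) = {kappa:.3f}   (vs lambda_max(Sigma) = {np.diag(Sigma).max():.3f})")
```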
Consistency up to
\lVert\wmr\rVert
Analyzing dual with
\wmr
,
get without
any distributional assumptions that
\sup_{\substack{\lVert\w\rVert \le \lVert\wmr\rVert\\\losssamp(\w) = 0}} \lossdist(\w)
= \lossdist(\wmr) + \beta \, \kappa_\X(\bSigma) \left[ \lVert\wmr\rVert^2 - \lVert\wmn\rVert^2 \right]
(amount of missed energy)
\cdot
(available norm)
If
\wmr
consistent, everything smaller-norm also consistent iff
\beta
term
\to 0
In our setting:
\wmr
is consistent,
\lossdist(\wmr) \to \lossdist(\w^*)
\kappa_\X(\bSigma) \approx \frac{\lambda_n}{n}
\quad
\E\left[ \lVert\wmr\rVert^2 - \lVert\wmn\rVert^2 \right]
= \frac{\sigma^2 d_S}{\lambda_n} + o\left(1\right)
Plugging in:
\quad
\E \sup_{\lVert\w\rVert \le \lVert\wmr\rVert,\, \losssamp(\w) = 0} \lossdist(\w)
\to \lossdist(\w^*)
In the generic results,
\lossdist
means
\lossdist(\w) = \lossdist(\w^*) + (\w - \w^*)\tp \Sigma (\w - \w^*)
for some
\w^*
Error up to
\alpha \lVert\wmn\rVert
Analyzing dual with
\wmn
for
\hat\w
,
\alpha \ge 1
, get in general:
\sup_{\substack{\lVert\w\rVert \le \alpha \lVert\wmn\rVert \\\losssamp(\w) = 0}} \lossdist(\w)
= \lossdist(\wmn) + (\alpha^2 - 1) \, \kappa_\X(\bSigma) \, \lVert\wmn\rVert^2 + R_n
R_n \to 0
if
\wmn
is consistent
In our setting:
\wmn
is consistent, because
\lVert\wmn\rVert \le \lVert\wmr\rVert
\E \kappa_\X(\bSigma) \, \lVert\wmn\rVert^2 \to \sigma^2 = \lossdist(\w^*)
Plugging in:
\quad \E \sup_{\lVert\w\rVert \le \alpha \lVert\wmn\rVert,\, \losssamp(\w) = 0} \lossdist(\w) \to \alpha^2 \lossdist(\w^*)
…and we're done!
Conjecture holds (for Gaussian linear regression)
Specifically, our more general bound implies that w.h.p.
\sup_{\norm{\w}_2 \le B, \, \losssamp(\w) = 0} \lossdist(\w)
\le (1 + o(1)) \frac{B^2 \Tr(\Sigma_2)}{n}
\Sigma = \Sigma_1 \oplus \Sigma_2
splits up covariance eigenvectors;
\Tr(\Sigma_2) \le \Tr(\Sigma) = \E \norm\x^2
For this to mean anything, need
B \ge \norm\wmn_2
Combine with a new analysis on
\norm\wmn
: whp,
\norm\wmn_2 \le \norm{\w^*}_2 + (1 + o(1)) \; \sqrt{\frac{\sigma^2 n}{\Tr(\Sigma_2)}}
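A rough finite-sample comparison of the two sides of this norm bound in the junk-features setting, taking Σ₂ to be the junk block so that Tr(Σ₂) = λ_n (an illustrative sketch only; the o(1) factors are ignored and all sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d_S, d_J, lam_n, sigma = 200, 2, 20_000, 20.0, 1.0

w_star_S = np.ones(d_S)
X_S = rng.standard_normal((n, d_S))
X_J = rng.standard_normal((n, d_J)) * np.sqrt(lam_n / d_J)
y = X_S @ w_star_S + sigma * rng.standard_normal(n)
w_mn, *_ = np.linalg.lstsq(np.hstack([X_S, X_J]), y, rcond=None)

# With Sigma_2 taken to be the junk block, Tr(Sigma_2) = lambda_n
lhs = np.linalg.norm(w_mn)
rhs = np.linalg.norm(w_star_S) + np.sqrt(sigma**2 * n / lam_n)   # ignoring the (1 + o(1)) factor
print(f"||w_mn||_2 = {lhs:.2f}   vs   ||w*||_2 + sqrt(sigma^2 n / Tr(Sigma_2)) = {rhs:.2f}")
```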
Benign overfitting of
\wmn
Plugging the two bounds together:
\lossdist(\hat\w) \le (1 + o(1)) \left( \sigma + \norm{\w^*} \sqrt{\frac{\Tr(\Sigma_2)}{n}} \right)^2
Including all the fiddly conditions I didn't mention,
we recover the consistency conditions of the landmark paper
[Bartlett/Long/Lugosi/Tsigler PNAS 2020 ]
Additionally tells us about nearly-minimal-norm interpolators
Generalization error in compact sets
Theorem.
If
\Sigma = \Sigma_1 \oplus \Sigma_2
with
\operatorname{rank}(\Sigma_1) = o(n)
,
w.h.p.
\sup_{\w \in \K, \, \losssamp(\w) = 0} \lossdist(\w)
\le (1 + o(1)) \, \frac{W(\Sigma_2^{1/2} \mathcal K)^2}{n}
where
W(\K) := \E_{H \sim \N(\zero, \eye_d)} \left[ \sup_{\w \in \K} \lvert\langle H, \w \rangle\rvert \right]
is the Gaussian width
(a standard tool)
this is an informal statement, but gets the gist
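For intuition about the Gaussian width, a tiny Monte Carlo sketch for the simplest case K = {w : ‖w‖₂ ≤ B}, where the supremum has the closed form B‖H‖₂ (B and d are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(7)
d, B, num_samples = 50, 2.0, 2_000

# Gaussian width of the Euclidean ball K = {w : ||w||_2 <= B}:
# sup_{||w|| <= B} |<H, w>| = B ||H||_2, so W(K) = B * E||H||_2 (roughly B * sqrt(d))
H = rng.standard_normal((num_samples, d))
width_mc = B * np.linalg.norm(H, axis=1).mean()
print(f"Monte Carlo W(K) = {width_mc:.2f},   B*sqrt(d) = {B * np.sqrt(d):.2f}")
```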
Norm needed to interpolate for general norms
Theorem.
Let
\norm\cdot_*
be the dual norm of
\norm\cdot
.
Call
\hat\w = \argmin_{\w : \losssamp(\w) = 0} \norm\w
.
Under some conditions, w.h.p.
\norm{\hat\w}
\le \norm{\w^*} + (1 + o(1)) \frac{\sigma \sqrt{n}}{\E_{H \sim \N(\zero, \eye_d)} \norm{\Sigma_2^{1/2} H}_*}
.
Plugging them together,
get consistency conditions analogous to the [BLLT] ones
for minimal-norm interpolators for any norm.
New application: minimum
\norm\w_1
LASSO, Adaboost, compressed sensing, basis pursuit, …
Much harder to analyze directly, because no closed form!
Some analysis in isotropic case; didn't show consistency
[Ju/Lin/Liu NeurIPS 2020 ]
[Chinot/Löffler/van de Geer 2021 ]
Our conditions hold in a junk features setting, if
d = e^{\omega(n)}
Very limited setting, but (as far as we know) first consistency result for
\sigma > 0
,
\w^* \ne \zero
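The minimum-ℓ1 interpolator has no closed form, but computing it is a small linear program (basis pursuit); below is a sketch with arbitrary sizes and a planted sparse w*, not tied to the junk-features asymptotics.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(8)
n, d = 20, 200
X = rng.standard_normal((n, d))
y = X @ np.r_[np.ones(3), np.zeros(d - 3)] + 0.1 * rng.standard_normal(n)

# min ||w||_1  s.t.  X w = y,  via the split w = u - v with u, v >= 0
c = np.ones(2 * d)
A_eq = np.hstack([X, -X])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
w_l1 = res.x[:d] - res.x[d:]

print(np.allclose(X @ w_l1, y, atol=1e-6))        # it interpolates
print(f"||w_l1||_1 = {np.abs(w_l1).sum():.3f},  nonzeros = {(np.abs(w_l1) > 1e-8).sum()}")
```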
On Uniform Convergence and Low-Norm Interpolation Learning
Zhou, Sutherland, Srebro
NeurIPS 2020
Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds and Benign Overfitting
Koehler*, Zhou*, Sutherland, Srebro
NeurIPS 2021
Junk features example:
\wmn
is consistent; usual uniform convergence can't show that
Uniform convergence over norm ball can't show any learning
Uniform convergence of interpolators does work
Matches previously known (nearly necessary) sufficient conditions
Applies to general norm balls (though can be hard to evaluate)
Our analysis is very specific to Gaussian data
Coming soon: extension to near-interpolators via optimistic rates