# Modern Kernel Methods in Machine Learning: Part I

Danica J. Sutherland (she/her)
Computer Science, University of British Columbia
ETICS "summer" school, Oct 2022

## Motivation

• Machine learning! …but how do we actually do it?
• Linear models! $f(x) = w^\mathsf{T} x + b$
• Extend to nonlinear models via a feature map $\phi$: $f(x) = w^\mathsf{T} \phi(x) + b$
• Kernels are basically a way to study doing this with any, potentially very complicated, feature map $\phi$
• Convenient way to make models on documents, graphs, videos, datasets, …
• The features $\phi(x)$ will live in a reproducing kernel Hilbert space

## Hilbert spaces

• A complete (real or complex) inner product space.
• Inner product space: a vector space $\mathcal H$ with an inner product $\langle \cdot, \cdot \rangle_{\mathcal H}$:
  • linear: $\langle \alpha_1 f_1 + \alpha_2 f_2, g \rangle_{\mathcal H} = \alpha_1 \langle f_1, g \rangle_{\mathcal H} + \alpha_2 \langle f_2, g \rangle_{\mathcal H}$
  • symmetric: $\langle f, g \rangle_{\mathcal H} = \langle g, f \rangle_{\mathcal H}$ (conjugate-symmetric in the complex case)
  • $\langle f, f \rangle_{\mathcal H} \ge 0$, with $\langle f, f \rangle_{\mathcal H} = 0$ iff $f = 0$
• Induces a norm: $\lVert f \rVert_{\mathcal H} = \sqrt{\langle f, f \rangle_{\mathcal H}}$
• Complete: “well-behaved” (Cauchy sequences have limits in $\mathcal H$)

## Kernel: an inner product between feature maps

• Call our domain $\mathcal X$, some set
  • $\mathbb R^d$, functions, distributions, graphs, images, …
• $k : \mathcal X \times \mathcal X \to \mathbb R$ is a kernel on $\mathcal X$ if there exists a Hilbert space $\mathcal H$ and a feature map $\phi : \mathcal X \to \mathcal H$ so that $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H}$
• Roughly, $k(x, y)$ is a notion of “similarity” between inputs
• Linear kernel on $\mathbb R^d$: $k(x, y) = \langle x, y \rangle = x^\mathsf{T} y$
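
As a quick illustration in code (a minimal sketch; the quadratic feature map here is made up for the example, not taken from the slides):

```python
import numpy as np

# Hypothetical feature map phi(x) = (x, x^2) on X = R, chosen for illustration
def phi(x):
    return np.array([x, x**2])

def k(x, y):
    # k(x, y) = <phi(x), phi(y)> = x*y + x^2 * y^2
    return phi(x) @ phi(y)

print(k(2.0, 3.0))       # 2*3 + 4*9 = 42.0
print(2.0 * 3.0)         # linear kernel on R: just the product, 6.0
```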

## Aside: the name “kernel”

• Our concept: "positive semi-definite kernel," "Mercer kernel," "RKHS kernel"
• Semi-related: kernel density estimation
  • A function $k(x, y)$, usually symmetric, like an RKHS kernel
  • Always requires $\int k(x, y) \,\mathrm{d}y = 1$, unlike RKHS kernels
  • Often requires $k(x, y) \ge 0$, unlike RKHS kernels
  • Not required to be an inner product, unlike RKHS kernels
• Unrelated:
  • The kernel (null space) of a linear map
  • The kernel of a probability density
  • The kernel of a convolution
  • CUDA kernels
  • The Linux kernel
  • Popcorn kernels

## Building kernels from other kernels

• Scaling: if $\gamma \ge 0$, $\gamma\, k(x, y)$ is a kernel
• Sum: $k_1(x, y) + k_2(x, y)$ is a kernel
• Is $k_1(x, y) - k_2(x, y)$ necessarily a kernel?
  • Take $k_1(x, y) = x^\mathsf{T} y$ on $\mathcal X = \mathbb R$, and $k_2(x, y) = 2\, x^\mathsf{T} y$.
  • Then $k_1(x, x) - k_2(x, x) = -x^2 < 0$ for $x \ne 0$.
  • But any kernel needs $k(x, x) = \lVert \phi(x) \rVert_{\mathcal H}^2 \ge 0$.
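
A numerical sanity check of this counterexample (a sketch using the linear kernel on random points; checking eigenvalues is just one way to test matrix psd-ness):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

K1 = X @ X.T        # linear-kernel matrix
K2 = 2 * K1         # k2 = 2 * k1, also a kernel (scaling)

# Sum of kernels stays psd: smallest eigenvalue >= 0 (up to round-off).
print(np.linalg.eigvalsh(K1 + K2).min())

# Difference k1 - k2 = -k1 has negative eigenvalues: not a kernel.
print(np.linalg.eigvalsh(K1 - K2).min())
```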

## Positive definiteness

• A symmetric function $k : \mathcal X \times \mathcal X \to \mathbb R$ (i.e. $k(x, y) = k(y, x)$) is positive semi-definite (psd) if for all $n \ge 1$, $(a_1, \ldots, a_n) \in \mathbb R^n$, $(x_1, \ldots, x_n) \in \mathcal X^n$,
$$\sum_{i=1}^n \sum_{j=1}^n a_i a_j\, k(x_i, x_j) \ge 0$$
• Equivalently: the kernel matrix $K$, with $K_{ij} = k(x_i, x_j)$, is PSD
• Hilbert space kernels are psd: $\sum_{i,j} a_i a_j \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal H} = \big\lVert \sum_i a_i \phi(x_i) \big\rVert_{\mathcal H}^2 \ge 0$
• psd functions are Hilbert space kernels
• Moore-Aronszajn Theorem; we'll come back to this
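
To connect the definition to the kernel-matrix statement, here is a small numerical check (a sketch; the Gaussian kernel, point counts, and tolerance are arbitrary choices):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
K = gaussian_kernel_matrix(X)

# Definition: sum_ij a_i a_j k(x_i, x_j) >= 0 for any coefficients a.
for _ in range(1000):
    a = rng.normal(size=20)
    assert a @ K @ a >= -1e-10

# Equivalent matrix statement: all eigenvalues of K are nonnegative.
print(np.linalg.eigvalsh(K).min())
```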

## Some more ways to build kernels

• Limits: if $k(x, y) = \lim_{m \to \infty} k_m(x, y)$ exists, $k$ is psd
• Products: $k_1(x, y)\, k_2(x, y)$ is psd
  • Let $V \sim \mathcal N(0, K_1)$, $W \sim \mathcal N(0, K_2)$ be independent
  • Then $\operatorname{Cov}(V_i W_i, V_j W_j) = (K_1)_{ij} (K_2)_{ij}$
  • Covariance matrices are psd, so the entrywise product $K_1 \circ K_2$ is too
• Powers: $k(x, y)^p$ is psd for any integer $p \ge 0$

$$k_{\text{poly}}(x, y) = (x^\mathsf{T} y + c)^p \text{ for } c \ge 0 \text{: the polynomial kernel}$$

• Exponents: $\exp(k(x, y))$ is psd
  • Limit of the Taylor series: $\exp(k(x, y)) = \lim_{M \to \infty} \sum_{m=0}^{M} \frac{k(x, y)^m}{m!}$
• If $f : \mathcal X \to \mathbb R$, then $f(x)\, k(x, y)\, f(y)$ is psd
  • Use the feature map $x \mapsto f(x)\, \phi(x)$

$$k_\sigma(x, y) = \exp\!\Big( -\tfrac{1}{2\sigma^2} \lVert x - y \rVert^2 \Big) = e^{-\frac{\lVert x \rVert^2}{2\sigma^2}} \exp\!\Big( \tfrac{1}{\sigma^2} x^\mathsf{T} y \Big)\, e^{-\frac{\lVert y \rVert^2}{2\sigma^2}} \text{: the Gaussian kernel}$$
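
That factorization can be verified numerically (a sketch: $f(x) = e^{-\lVert x \rVert^2 / (2\sigma^2)}$ normalizes the exponentiated, scaled linear kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
sigma = 1.5

f = lambda z: np.exp(-(z @ z) / (2 * sigma**2))  # normalizer: f(x) k0(x,y) f(y)
k0 = np.exp((x @ y) / sigma**2)                  # exp of scaled linear kernel: psd

# f(x) * exp(x.y / sigma^2) * f(y) = exp(-||x - y||^2 / (2 sigma^2))
k_gauss = f(x) * k0 * f(y)
assert np.isclose(k_gauss, np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2)))
print(k_gauss)
```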

## Reproducing property

• Recall the original motivating example, with feature map $\phi : \mathcal X \to \mathcal H$
  • Kernel is $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H}$
  • Classifier based on linear $f(x) = \langle w, \phi(x) \rangle_{\mathcal H}$
• $f$ is the function itself; corresponds to the vector $w$ in $\mathcal H$
  • $f(x)$ is the function evaluated at a point $x$
• Elements of $\mathcal H$ are functions, $f : \mathcal X \to \mathbb R$
• Reproducing prop.: $f(x) = \langle f, \phi(x) \rangle_{\mathcal H}$ for $f \in \mathcal H$

## Reproducing kernel Hilbert space (RKHS)

• Every psd kernel $k$ on $\mathcal X$ defines a (unique) Hilbert space, its RKHS $\mathcal H_k$, and a map $\phi : \mathcal X \to \mathcal H_k$ where
  • $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H_k}$
  • Elements $f \in \mathcal H_k$ are functions on $\mathcal X$, with $f(x) = \langle f, \phi(x) \rangle_{\mathcal H_k}$
• Combining the two, we sometimes write $\phi(x) = k(x, \cdot)$, so $k(x, y) = \langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal H_k}$
• $\delta_x : f \mapsto f(x)$ is the evaluation functional
  • An RKHS is defined by $\delta_x$ being continuous, or equivalently bounded: $\lvert f(x) \rvert \le M_x \lVert f \rVert_{\mathcal H_k}$

## Moore-Aronszajn Theorem

• Building $\mathcal H_k$ for a given psd $k$:
  • Define $\mathcal H_0 = \operatorname{span}\big( \{ k(x, \cdot) : x \in \mathcal X \} \big)$, with inner product from $\big\langle \sum_i a_i k(x_i, \cdot),\, \sum_j b_j k(y_j, \cdot) \big\rangle = \sum_{i,j} a_i b_j\, k(x_i, y_j)$
  • Take $\mathcal H_k$ to be the completion of $\mathcal H_0$ in the metric from this inner product
  • Get that the reproducing property holds for $k(x, \cdot)$ in $\mathcal H_k$
  • Can also show uniqueness
• Theorem: $k$ is psd iff it's the reproducing kernel of an RKHS
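
A sketch of this construction on finite spans (the points, coefficients, and the Gaussian kernel are arbitrary illustrative choices):

```python
import numpy as np

def k(x, y, sigma=1.0):
    # Gaussian kernel, so k(z, z) = 1 for every z
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
Xs, a = rng.normal(size=(4, 2)), rng.normal(size=4)   # f = sum_i a_i k(x_i, .)
Ys, b = rng.normal(size=(3, 2)), rng.normal(size=3)   # g = sum_j b_j k(y_j, .)

# Pre-RKHS inner product: <f, g> = sum_ij a_i b_j k(x_i, y_j)
K_xy = np.array([[k(xi, yj) for yj in Ys] for xi in Xs])
print("<f, g> =", a @ K_xy @ b)

# Norm from the same inner product: ||f||^2 = a^T K_xx a
K_xx = np.array([[k(xi, xj) for xj in Xs] for xi in Xs])
norm_f = np.sqrt(a @ K_xx @ a)

# Reproducing property + Cauchy-Schwarz: |f(z)| <= ||f|| sqrt(k(z, z))
z = rng.normal(size=2)
f_z = sum(ai * k(xi, z) for ai, xi in zip(a, Xs))
assert abs(f_z) <= norm_f * np.sqrt(k(z, z)) + 1e-10
```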

## A quick check: linear kernels

• $k(x, y) = x^\mathsf{T} y$ on $\mathcal X = \mathbb R^d$
• $k(x, \cdot) = x^\mathsf{T} (\cdot)$ “corresponds to” the vector $x$
• If $f = \sum_i a_i k(x_i, \cdot)$, then $f(y) = \big( \sum_i a_i x_i \big)^\mathsf{T} y = w^\mathsf{T} y$, with $\lVert f \rVert_{\mathcal H_k} = \lVert w \rVert_2$
• Closure doesn't add anything here, since $\mathbb R^d$ is closed
• So, the linear kernel gives you the RKHS of linear functions

## More complicated: Gaussian kernels

• $\mathcal H_k$ is infinite-dimensional
• Functions in $\mathcal H_k$ are bounded: $\lvert f(x) \rvert = \lvert \langle f, k(x, \cdot) \rangle_{\mathcal H_k} \rvert \le \lVert f \rVert_{\mathcal H_k} \sqrt{k(x, x)} = \lVert f \rVert_{\mathcal H_k}$
• Choice of $\sigma$ controls how fast functions can vary: $\lvert f(x) - f(y) \rvert \le \lVert f \rVert_{\mathcal H_k} \lVert k(x, \cdot) - k(y, \cdot) \rVert_{\mathcal H_k} = \lVert f \rVert_{\mathcal H_k} \sqrt{2 - 2 k(x, y)}$
• Can say lots more with Fourier properties
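
A numerical check of that variation bound (a sketch; the centers, coefficients, and bandwidth are arbitrary):

```python
import numpy as np

def k(x, y, sigma):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
sigma = 0.5
Xs, a = rng.normal(size=(5, 2)), rng.normal(size=5)

f = lambda z: sum(ai * k(xi, z, sigma) for ai, xi in zip(a, Xs))
K = np.array([[k(xi, xj, sigma) for xj in Xs] for xi in Xs])
norm_f = np.sqrt(a @ K @ a)          # RKHS norm of f = sum_i a_i k(x_i, .)

# |f(x) - f(y)| <= ||f||_H ||k(x,.) - k(y,.)||_H = ||f||_H sqrt(2 - 2 k(x,y))
x, y = rng.normal(size=2), rng.normal(size=2)
bound = norm_f * np.sqrt(2 - 2 * k(x, y, sigma))
assert abs(f(x) - f(y)) <= bound + 1e-10
print(abs(f(x) - f(y)), bound)
```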

## Kernel ridge regression

$$\hat f = \operatorname*{argmin}_{f \in \mathcal H_k} \frac{1}{n} \sum_{i=1}^n \big( f(x_i) - y_i \big)^2 + \lambda \lVert f \rVert_{\mathcal H_k}^2$$

The linear kernel gives normal ridge regression; nonlinear kernels will give nonlinear regression!

How to find $\hat f$? Representer Theorem: the minimizer has the form $\hat f = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$:

• Let $\mathcal H_X = \operatorname{span}\big( \{ k(x_i, \cdot) \}_{i=1}^n \big)$,
  $\mathcal H_\perp$ its orthogonal complement in $\mathcal H_k$
• Decompose $f = f_X + f_\perp$ with $f_X \in \mathcal H_X$, $f_\perp \in \mathcal H_\perp$
• $f(x_i) = \langle f_X + f_\perp, k(x_i, \cdot) \rangle_{\mathcal H_k} = f_X(x_i)$, while $\lVert f \rVert^2 = \lVert f_X \rVert^2 + \lVert f_\perp \rVert^2$
• Minimizer needs $f_\perp = 0$, and so $\hat f = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$

Setting the derivative of $\frac{1}{n} \lVert K \alpha - y \rVert^2 + \lambda\, \alpha^\mathsf{T} K \alpha$ to zero gives $K (K + n \lambda I)\, \alpha = K y$,
satisfied by $\alpha = (K + n \lambda I)^{-1} y$
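
Putting the pieces together, kernel ridge regression is a few lines of linear algebra (a minimal sketch with a Gaussian kernel on a toy 1-D dataset; all names and constants are illustrative):

```python
import numpy as np

def gaussian_K(A, B, sigma=1.0):
    # Pairwise kernel matrix: K[i, j] = k(a_i, b_j)
    sq = np.sum(A**2, 1)[:, None] - 2 * A @ B.T + np.sum(B**2, 1)[None, :]
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# alpha = (K + n lambda I)^{-1} y, from the derivation above
n, lam = len(X), 1e-2
alpha = np.linalg.solve(gaussian_K(X, X) + n * lam * np.eye(n), y)

# Predictions: f_hat(x) = sum_i alpha_i k(x_i, x)
X_test = np.linspace(-3, 3, 7)[:, None]
print(gaussian_K(X_test, X) @ alpha)   # compare to np.sin(X_test[:, 0])
```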

## Other kernel algorithms

• Representer theorem applies to $\min_f L\big(f(x_1), \ldots, f(x_n)\big) + R\big(\lVert f \rVert_{\mathcal H_k}\big)$ if $R$ is strictly increasing in $\lVert f \rVert_{\mathcal H_k}$
• Kernel methods can then train based on the kernel matrix $K$
• Classification algorithms:
  • Support vector machines: $L$ is the hinge loss, $\max(0, 1 - y_i f(x_i))$
  • Kernel logistic regression: $L$ is the logistic loss, $\log(1 + e^{-y_i f(x_i)})$
• Principal component analysis, canonical correlation analysis
• Many, many more…
• But not everything works… e.g. the Lasso's $\ell_1$ regularizer isn't a function of $\lVert f \rVert_{\mathcal H_k}$
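
For instance, scikit-learn's SVC accepts a precomputed kernel matrix directly, so any psd kernel plugs in (a sketch, assuming scikit-learn is installed; the data and kernel here are synthetic examples):

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_K(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] - 2 * A @ B.T + np.sum(B**2, 1)[None, :]
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.sum(X**2, axis=1) > 2).astype(int)      # nonlinear decision boundary

# Train from the kernel matrix alone, as the representer theorem suggests
clf = SVC(kernel="precomputed").fit(gaussian_K(X, X), y)

# At test time, we only need k(x_new, x_i) against the training points
X_new = rng.normal(size=(5, 2))
print(clf.predict(gaussian_K(X_new, X)))
```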

## Some theory: generalization

• The Rademacher complexity of $\mathcal F_B = \{ f : \lVert f \rVert_{\mathcal H_k} \le B \}$ on $n$ samples is upper-bounded by $B \sqrt{\tau^2 / n}$ if $k(x, x) \le \tau^2$
• Implies for $\rho$-Lipschitz losses that the generalization gap is $\mathcal O\big( \rho B \tau / \sqrt{n} \big)$, uniformly over $\mathcal F_B$
• Same kind of rates with stability-based analyses
• Implies that, if the “truth” is low-norm, most kernel methods aren't too suboptimal
• Difficulty of learning is controlled by RKHS norm of target
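
A sketch of the standard argument behind that complexity bound (details beyond the slide, using $k(x, x) \le \tau^2$):

$$
\frac{1}{n}\, \mathbb E_\sigma \sup_{\lVert f \rVert \le B} \sum_{i=1}^n \sigma_i f(x_i)
= \frac{1}{n}\, \mathbb E_\sigma \sup_{\lVert f \rVert \le B} \Big\langle f,\, \sum_{i=1}^n \sigma_i k(x_i, \cdot) \Big\rangle
= \frac{B}{n}\, \mathbb E_\sigma \Big\lVert \sum_{i=1}^n \sigma_i k(x_i, \cdot) \Big\rVert
\le \frac{B}{n} \sqrt{\sum_{i=1}^n k(x_i, x_i)}
\le B \sqrt{\frac{\tau^2}{n}}
$$

The second equality uses the reproducing property and Cauchy-Schwarz; the first inequality is Jensen's, since the cross terms $\mathbb E[\sigma_i \sigma_j] = 0$ vanish for $i \ne j$.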

## Some theory: universality

• One definition: a continuous kernel $k$ on a compact metric space $\mathcal X$ is universal if $\mathcal H_k$ is $L_\infty$-dense in $C(\mathcal X)$:
  for every continuous $g : \mathcal X \to \mathbb R$, for every $\varepsilon > 0$, there is an $f \in \mathcal H_k$ with $\lVert f - g \rVert_\infty \le \varepsilon$
• Implies that, on compact $\mathcal X$, $\mathcal H_k$ can separate compact sets
  • $f \in \mathcal H_k$ with $f(x) > 0$ for $x \in X_1$, $f(x) < 0$ for $x \in X_2$
• Which implies there are $f \in \mathcal H_k$ with arbitrarily small loss
• Might take arbitrarily large norm: approximation/estimation tradeoff
• Can prove via Stone-Weierstrass or Fourier properties
• Never true for finite-dim kernels: need $\dim(\mathcal H_k) = \infty$

## Translation-invariant kernels on $\mathbb R^d$

• Assume $k(x, y) = \psi(x - y)$ is bounded, continuous, and translation invariant
• Then $\psi$ is proportional to the Fourier transform of a probability measure (Bochner's theorem)
• If $\psi \in L_1(\mathbb R^d)$, the measure has a density
• If that density is positive everywhere, $k$ is universal
  • For all nonzero finite signed measures $\mu$, $\iint k(x, y) \,\mathrm d\mu(x) \,\mathrm d\mu(y) > 0$
• True for the Gaussian, $\exp\big( -\tfrac{1}{2\sigma^2} \lVert x - y \rVert_2^2 \big)$, and Laplace, $\exp\big( -\tfrac{1}{\sigma} \lVert x - y \rVert_1 \big)$
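
Bochner's theorem is also practical: sampling frequencies from the kernel's spectral measure gives random Fourier features whose inner products approximate $k$ (a sketch of the Rahimi-Recht construction for the Gaussian kernel; $D$, the number of features, is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 3, 5000, 1.0

# The Gaussian kernel's spectral measure is N(0, I / sigma^2); sample from it
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2 * np.pi, size=D)

def features(x):
    # z(x) with E[z(x) . z(y)] = exp(-||x - y||^2 / (2 sigma^2))
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
print(features(x) @ features(y))                          # approximation
print(np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2)))     # exact kernel
```

Linear models on such features run in time linear in $n$, previewing the approximation methods mentioned below.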

## Limitations of kernel-based learning

• Generally bad at learning sparsity
  • e.g. a target depending on only a few of the $d$ input coordinates, for large $d$
• Provably statistically slower than deep learning for a few problems
  • e.g. to learn a single ReLU, $x \mapsto \max(0, w^\mathsf{T} x)$, need RKHS norm exponential in the dimension $d$
• Also some hierarchical problems, etc
• $\mathcal O(n^3)$ computational complexity, $\mathcal O(n^2)$ memory for exact solutions
• Various approximations you can make
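
One such approximation is the Nyström method: build a low-rank surrogate for $K$ from a few landmark points, needing only $\mathcal O(nm)$ kernel evaluations (a sketch; the landmark count and data are arbitrary):

```python
import numpy as np

def gaussian_K(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] - 2 * A @ B.T + np.sum(B**2, 1)[None, :]
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
n, m = 2000, 100                    # m << n landmark points
X = rng.normal(size=(n, 3))
Z = X[rng.choice(n, size=m, replace=False)]

# Nystrom: K ~= K_nm pinv(K_mm) K_nm^T, a rank-m approximation
K_nm = gaussian_K(X, Z)
K_mm = gaussian_K(Z, Z)
K_approx = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T

i, j = 3, 17                        # spot-check one entry against exact k
print(K_approx[i, j], gaussian_K(X[i:i+1], X[j:j+1])[0, 0])
```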

## Relationship to deep learning

• Deep models usually end as $f(x) = w^\mathsf{T} h(x)$ for a learned representation $h$
  • Can think of $h$ as a learned feature map, with kernel $k(x, y) = h(x)^\mathsf{T} h(y)$
• Does this gain us anything?
• Random nets with trained last layer (NNGP) can be decent
• As width $\to \infty$, nets become the neural tangent kernel (NTK)
• Widely used for theoretical analysis… more tomorrow
• SVMs with NTK can be great on small data
• Inspiration: learn the kernel model end-to-end
• Ongoing area; good results in two-sample testing, GANs, density estimation, meta-learning, semi-supervised learning, …
• Explored a bit in interactive session!

## What's next

• After break: interactive session exploring kernels w/ ridge regression
• Tomorrow: a subset of
• Representing distributions
• Uses for statistical testing + generative models
• Connections to Gaussian processes, probabilistic numerics
• Approximation methods for faster computation
• Deeper connection to deep learning
• More details on basics:
  • Berlinet and Thomas-Agnan, *Reproducing Kernel Hilbert Spaces in Probability and Statistics*
  • Steinwart and Christmann, *Support Vector Machines*