Modern Kernel Methods
in Machine Learning:
Part I
Danica J. Sutherland (she/her)
Computer Science, University of British Columbia
ETICS "summer" school, Oct 2022
Motivation
- Machine learning! …but how do we actually do it?
- Linear models! $f(x) = \langle w, x \rangle + b$
- Extend with a feature map $\phi$: $f(x) = \langle w, \phi(x) \rangle + b$ …
- Kernels are basically a way to study doing this with any, potentially very complicated, feature map $\phi$
- Convenient way to make models on documents, graphs, videos, datasets, …
- $\phi(x)$ will live in a reproducing kernel Hilbert space
Hilbert spaces
- A complete (real or complex) inner product space.
- Inner product space: a vector space $\mathcal{H}$ with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$:
- bilinear, symmetric, and $\langle f, f \rangle_{\mathcal{H}} > 0$ for $f \ne 0$
- Induces a norm: $\|f\|_{\mathcal{H}} = \sqrt{\langle f, f \rangle_{\mathcal{H}}}$
- Complete: “well-behaved” (Cauchy sequences have limits in $\mathcal{H}$)
Kernel: an inner product between feature maps
- Call our domain $\mathcal{X}$, some set
- $\mathbb{R}^d$, functions, distributions of graphs of images, …
- $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel on $\mathcal{X}$ if there exists a Hilbert space $\mathcal{H}$ and a feature map $\phi : \mathcal{X} \to \mathcal{H}$ so that $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$
- Roughly, $k(x, y)$ is a notion of “similarity” between inputs
- Linear kernel on $\mathbb{R}^d$: $k(x, y) = \langle x, y \rangle = x^\top y$, with $\phi(x) = x$
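A quick concreteness check (a minimal NumPy sketch, not part of the slides; the explicit quadratic-feature map is just an assumed illustration): the kernel value is the inner product of feature vectors, whether or not we ever build $\phi$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear kernel on R^d: k(x, y) = <x, y>, with feature map phi(x) = x
def linear_kernel(x, y):
    return x @ y

# An assumed illustrative feature map phi(x) = (x_1, ..., x_d, x_1^2, ..., x_d^2);
# its kernel <phi(x), phi(y)> = <x, y> + <x^2, y^2> can be computed without building phi.
def phi(x):
    return np.concatenate([x, x**2])

x, y = rng.normal(size=3), rng.normal(size=3)
assert np.isclose(linear_kernel(x, y), x @ y)
assert np.isclose(phi(x) @ phi(y), x @ y + (x**2) @ (y**2))
```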
Aside: the name “kernel”
- Our concept: "positive semi-definite kernel," "Mercer kernel," "RKHS kernel"
- Semi-related: kernel density estimation
- $\hat p(x) = \frac{1}{n} \sum_{i=1}^n k(x_i, x)$, usually symmetric, like an RKHS kernel
- Always requires $k(x, y) \ge 0$, unlike RKHS kernels
- Often requires $\int k(x, y) \,\mathrm{d}y = 1$, unlike RKHS kernels
- Not required to be an inner product, unlike RKHS kernels
- Unrelated:
- The kernel (null space) of a linear map
- The kernel of a probability density
- The kernel of a convolution
- CUDA kernels
- The Linux kernel
- Popcorn kernels
Building kernels from other kernels
- Scaling: if $\gamma \ge 0$, $\gamma k$ is a kernel (use the feature map $\sqrt{\gamma}\,\phi$)
- Sum: $k_1 + k_2$ is a kernel (concatenate the feature maps)
- Is $k_1 - k_2$ necessarily a kernel?
- Take $\mathcal{X} = \mathbb{R}$, $k_1(x, y) = x y$, $k_2 = 2 k_1$.
- Then $(k_1 - k_2)(x, y) = -x y$
- But any kernel needs $k(x, x) = \|\phi(x)\|_{\mathcal{H}}^2 \ge 0$, and $(k_1 - k_2)(x, x) = -x^2 < 0$ for $x \ne 0$.
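A few-line numerical version of the counterexample above (a sketch, not from the slides):

```python
# k1(x, y) = x*y is a kernel on R; k2 = 2*k1 is too (scaling by 2 >= 0).
k1 = lambda x, y: x * y
k2 = lambda x, y: 2 * x * y

# Any kernel must have k(x, x) = ||phi(x)||^2 >= 0, but the difference violates this:
x = 3.0
print(k1(x, x) - k2(x, x))  # -9.0 < 0, so k1 - k2 is not a kernel
```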
Positive definiteness
- A symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (i.e. we have $k(x, y) = k(y, x)$)
is positive semi-definite (psd) if for all $n \ge 1$, $(a_1, \dots, a_n) \in \mathbb{R}^n$, $(x_1, \dots, x_n) \in \mathcal{X}^n$,
$\sum_{i=1}^n \sum_{j=1}^n a_i a_j k(x_i, x_j) \ge 0$
- Equivalently: the kernel matrix $K$ with $K_{ij} = k(x_i, x_j)$ is PSD
- Hilbert space kernels are psd: $\sum_{i,j} a_i a_j \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}} = \big\| \sum_i a_i \phi(x_i) \big\|_{\mathcal{H}}^2 \ge 0$
- psd functions are Hilbert space kernels
- Moore-Aronszajn Theorem; we'll come back to this
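To see the “kernel matrix is PSD” statement numerically, here is a minimal sketch (a Gaussian kernel with bandwidth 1 is assumed) checking the eigenvalues and a random quadratic form:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                    # 50 points in R^2

# Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / 2)  (bandwidth sigma = 1 assumed)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2)

# psd <=> all eigenvalues >= 0 (up to round-off) <=> a^T K a >= 0 for every a
print(np.linalg.eigvalsh(K).min())              # any negative values are round-off only
a = rng.normal(size=50)
print(a @ K @ a)                                # nonnegative quadratic form
```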
Some more ways to build kernels
- Limits: if $k_m$ are psd and $k(x, y) = \lim_{m \to \infty} k_m(x, y)$ exists, then $k$ is psd
- Products: $k_1 k_2$ is psd
- Let $V \sim \mathcal{N}(0, K_1)$, $W \sim \mathcal{N}(0, K_2)$ be independent
- $\operatorname{Cov}(V_i W_i, V_j W_j) = (K_1)_{ij} (K_2)_{ij}$; covariance matrices are psd, so the elementwise product is too
- Powers: $k(x, y)^m$ is psd for any integer $m \ge 0$
- $(x^\top y + c)^m$ for $c \ge 0$: the polynomial kernel
- Exponents: $\exp(k(x, y))$ is psd
- If $f : \mathcal{X} \to \mathbb{R}$, $f(x) \, k(x, y) \, f(y)$ is psd
- Use the feature map $x \mapsto f(x) \, \phi(x)$
- $\exp\big(-\tfrac{1}{2\sigma^2} \|x - y\|^2\big) = \exp\big(-\tfrac{\|x\|^2}{2\sigma^2}\big) \exp\big(\tfrac{x^\top y}{\sigma^2}\big) \exp\big(-\tfrac{\|y\|^2}{2\sigma^2}\big)$,
the Gaussian kernel
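The closure properties above are enough to assemble the Gaussian kernel from the linear one; a minimal sketch verifying the identity numerically (the bandwidth value is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
sigma = 1.5                                      # any positive bandwidth

# Scale the linear kernel, exponentiate, then multiply by f(x) f(y)
# with f(v) = exp(-||v||^2 / (2 sigma^2)); each step preserves psd-ness.
f = lambda v: np.exp(-(v @ v) / (2 * sigma**2))
built = f(x) * np.exp((x @ y) / sigma**2) * f(y)

direct = np.exp(-((x - y) @ (x - y)) / (2 * sigma**2))   # Gaussian kernel directly
assert np.isclose(built, direct)
```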
Reproducing property
- Recall the original motivating example, with a feature map $\phi : \mathcal{X} \to \mathcal{H}$
- Kernel is $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$
- Classifier based on linear $f(x) = \langle w, \phi(x) \rangle_{\mathcal{H}}$
- $f$ is the function itself; corresponds to the vector $w$ in $\mathcal{H}$
- $f(x)$ is the function evaluated at a point
- Elements of $\mathcal{H}$ are functions, $f : \mathcal{X} \to \mathbb{R}$
- Reproducing prop.: $f(x) = \langle f, \phi(x) \rangle_{\mathcal{H}}$
for $f \in \mathcal{H}$
Reproducing kernel Hilbert space (RKHS)
- Every psd kernel $k$ on $\mathcal{X}$ defines a (unique) Hilbert space, its RKHS $\mathcal{H}_k$,
and a map $\phi : \mathcal{X} \to \mathcal{H}_k$ where $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}_k}$
- Elements $f \in \mathcal{H}_k$ are functions on $\mathcal{X}$, with $f(x) = \langle f, \phi(x) \rangle_{\mathcal{H}_k}$
- Combining the two, we sometimes write $\phi(x) = k(x, \cdot)$, so $k(x, y) = \langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal{H}_k}$
- $\delta_x : f \mapsto f(x)$ is the evaluation functional
- An RKHS is defined by $\delta_x$ being continuous, or equivalently bounded
Moore-Aronszajn Theorem
- Building $\mathcal{H}_k$ for a given psd $k$:
- Start with $\mathcal{H}_0 = \operatorname{span}\{ k(x, \cdot) : x \in \mathcal{X} \}$
- Define $\langle \cdot, \cdot \rangle_{\mathcal{H}_0}$ from $\langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal{H}_0} = k(x, y)$, extended bilinearly
- Take $\mathcal{H}_k$ to be the completion of $\mathcal{H}_0$ in the metric from $\langle \cdot, \cdot \rangle_{\mathcal{H}_0}$
- Get that the reproducing property $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}_k}$ holds for $f$ in $\mathcal{H}_k$
- Can also show uniqueness
- Theorem: $k$ is psd iff it's the reproducing kernel of an RKHS
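The construction is directly computable for elements of $\mathcal{H}_0$: a minimal sketch (Gaussian kernel assumed for illustration) of the bilinearly extended inner product and the reproducing property:

```python
import numpy as np

def k(x, y, sigma=1.0):                          # Gaussian kernel, assumed for illustration
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
Xf, a = rng.normal(size=(4, 2)), rng.normal(size=4)    # f = sum_i a_i k(x_i, .)
Xg, b = rng.normal(size=(3, 2)), rng.normal(size=3)    # g = sum_j b_j k(y_j, .)

# Bilinear extension of <k(x, .), k(y, .)> = k(x, y):
inner_fg = sum(a_i * b_j * k(x_i, y_j) for a_i, x_i in zip(a, Xf)
                                        for b_j, y_j in zip(b, Xg))

# ||f||^2 = a^T K_ff a >= 0 because k is psd (this is where psd-ness is needed)
K_ff = np.array([[k(xi, xj) for xj in Xf] for xi in Xf])
print(inner_fg, a @ K_ff @ a)

# Reproducing property: <f, k(t, .)> = sum_i a_i k(x_i, t) = f(t) by construction
t = rng.normal(size=2)
print(sum(a_i * k(x_i, t) for a_i, x_i in zip(a, Xf)))
```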
A quick check: linear kernels
- $k(x, y) = x^\top y$ on $\mathcal{X} = \mathbb{R}^d$
- $f(\cdot) = \sum_i a_i k(x_i, \cdot) = \langle \sum_i a_i x_i, \cdot \rangle$ "corresponds to" $w = \sum_i a_i x_i \in \mathbb{R}^d$
- If $f$ corresponds to $w$ and $g$ to $v$, then $\langle f, g \rangle_{\mathcal{H}_k} = \langle w, v \rangle$
- Closure doesn't add anything here, since $\{ \langle w, \cdot \rangle : w \in \mathbb{R}^d \}$ is already closed
- So, linear kernel gives you RKHS of linear functions
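A minimal numerical check of this correspondence (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X, a = rng.normal(size=(5, 3)), rng.normal(size=5)

# With the linear kernel, f(.) = sum_i a_i k(x_i, .) corresponds to w = sum_i a_i x_i
w = a @ X

t = rng.normal(size=3)
f_t = sum(a_i * (x_i @ t) for a_i, x_i in zip(a, X))
assert np.isclose(f_t, w @ t)                    # same function of t

# ||f||_H^2 = sum_ij a_i a_j k(x_i, x_j) = a^T (X X^T) a = ||w||^2
assert np.isclose(a @ (X @ X.T) @ a, w @ w)
```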
More complicated: Gaussian kernels
- $k(x, y) = \exp\big(-\tfrac{1}{2\sigma^2} \|x - y\|^2\big)$: $\mathcal{H}_k$ is infinite-dimensional
- Functions in $\mathcal{H}_k$ are bounded: $|f(x)| = |\langle f, k(x, \cdot) \rangle| \le \|f\|_{\mathcal{H}_k} \sqrt{k(x, x)} = \|f\|_{\mathcal{H}_k}$
- Choice of $\sigma$ controls how fast functions can vary: $|f(x) - f(y)| \le \|f\|_{\mathcal{H}_k} \, \|k(x, \cdot) - k(y, \cdot)\|_{\mathcal{H}_k} = \|f\|_{\mathcal{H}_k} \sqrt{2 - 2 k(x, y)}$
- Can say lots more with Fourier properties
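Both inequalities are easy to check numerically for an element of the span; a minimal sketch (bandwidth chosen arbitrarily):

```python
import numpy as np

def k(x, y, sigma=0.5):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
Xc, a = rng.normal(size=(6, 2)), rng.normal(size=6)     # f = sum_i a_i k(x_i, .)
f = lambda t: sum(a_i * k(x_i, t) for a_i, x_i in zip(a, Xc))

K = np.array([[k(xi, xj) for xj in Xc] for xi in Xc])
norm_f = np.sqrt(a @ K @ a)                             # ||f||_H for this span element

x, y = rng.normal(size=2), rng.normal(size=2)
assert abs(f(x)) <= norm_f * np.sqrt(k(x, x)) + 1e-10                  # boundedness
assert abs(f(x) - f(y)) <= norm_f * np.sqrt(2 - 2 * k(x, y)) + 1e-10   # variation bound
```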
Kernel ridge regression
$\hat f = \operatorname{argmin}_{f \in \mathcal{H}_k} \frac{1}{n} \sum_{i=1}^n (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}_k}^2$
Linear kernel gives normal ridge regression: $\hat w = \operatorname{argmin}_{w} \frac{1}{n} \sum_{i=1}^n (w^\top x_i - y_i)^2 + \lambda \|w\|^2$
Nonlinear kernels will give nonlinear regression!
How to find $\hat f$? Representer Theorem:
- Let $\mathcal{H}_X = \operatorname{span}\{ k(x_i, \cdot) : i = 1, \dots, n \}$,
$\mathcal{H}_\perp$ its orthogonal complement in $\mathcal{H}_k$
- Decompose $f = f_X + f_\perp$ with $f_X \in \mathcal{H}_X$, $f_\perp \in \mathcal{H}_\perp$
- $f_\perp$ doesn't change any $f(x_i)$ but only increases $\|f\|_{\mathcal{H}_k}$, so the minimizer needs $f_\perp = 0$, and so $\hat f = \sum_{i=1}^n \hat\alpha_i k(x_i, \cdot)$
Setting the derivative with respect to $\alpha$ to zero gives $K \big[ (K + n \lambda I) \alpha - y \big] = 0$,
satisfied by $\hat\alpha = (K + n \lambda I)^{-1} y$ (see the sketch below)
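A minimal NumPy implementation of this solution on toy data (the $\frac{1}{n}$-averaged objective and the resulting $(K + n\lambda I)^{-1} y$ form are the convention assumed here; other normalizations just rescale $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-d regression data
n = 60
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

def gauss_gram(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

lam = 1e-3
K = gauss_gram(X, X)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)   # representer: f = sum_i alpha_i k(x_i, .)

X_test = np.linspace(-3, 3, 200)[:, None]
f_test = gauss_gram(X_test, X) @ alpha                # predictions f(x) = sum_i alpha_i k(x_i, x)
print(np.mean((f_test - np.sin(X_test[:, 0])) ** 2))  # small error vs the noiseless truth
```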
Other kernel algorithms
- Representer theorem applies if the objective is $L(f(x_1), \dots, f(x_n)) + R(\|f\|_{\mathcal{H}_k})$, where $R$ is strictly increasing in $\|f\|_{\mathcal{H}_k}$
- Kernel methods can then train based on the kernel matrix $K_{ij} = k(x_i, x_j)$
- Classification algorithms:
- Support vector machines: $L$ is the hinge loss
- Kernel logistic regression: $L$ is the logistic loss
- Principal component analysis (see the sketch after this list), canonical correlation analysis
- Many, many more…
- But not everything works… e.g. the Lasso ($\ell_1$) regularizer
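As one example of a Gram-matrix-only algorithm from the list above, a minimal kernel PCA sketch (a Gaussian kernel and two components are assumed choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
X[:, 1] = X[:, 0] ** 2 + 0.1 * rng.normal(size=n)     # data near a curved 1-d manifold

def gauss_gram(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

K = gauss_gram(X, X)

# Kernel PCA: center the Gram matrix in feature space, then eigendecompose
H = np.eye(n) - np.ones((n, n)) / n
eigvals, eigvecs = np.linalg.eigh(H @ K @ H)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]    # sort descending

# Projections of the training points onto the top two kernel principal components
Z = eigvecs[:, :2] * np.sqrt(np.clip(eigvals[:2], 0, None))
print(Z.shape)                                        # (100, 2)
```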
Some theory: generalization
- Rademacher complexity of $\{ f \in \mathcal{H}_k : \|f\|_{\mathcal{H}_k} \le B \}$ is upper-bounded by $B / \sqrt{n}$ if $k(x, x) \le 1$
- Implies for $L$-Lipschitz losses that the expected loss is within $\mathcal{O}(L B / \sqrt{n})$ of the training loss
- Same kind of rates with stability-based analyses
- Implies that, if the “truth” is low-norm, most kernel methods are at most $\mathcal{O}(1/\sqrt{n})$-suboptimal
- Difficulty of learning is controlled by RKHS norm of target
Some theory: universality
- One definition:
a continuous kernel $k$ on a compact metric space $\mathcal{X}$ is universal if $\mathcal{H}_k$ is $L^\infty$-dense in $C(\mathcal{X})$:
for every continuous $g : \mathcal{X} \to \mathbb{R}$, for every $\varepsilon > 0$,
there is an $f \in \mathcal{H}_k$ with $\sup_{x \in \mathcal{X}} |f(x) - g(x)| \le \varepsilon$
- Implies that, on compact $\mathcal{X}$, $\mathcal{H}_k$ can separate compact sets
- There is an $f \in \mathcal{H}_k$ with $f(x) > 0$ for $x$ in one set, $f(x) < 0$ for $x$ in the other
- Which implies there are $f \in \mathcal{H}_k$ with arbitrarily small loss
- Might take arbitrarily large norm: approximation/estimation tradeoff
- Can prove via Stone-Weierstrass or Fourier properties
- Never true for finite-dimensional kernels: need $\dim(\mathcal{H}_k) = \infty$
Translation-invariant kernels on $\mathbb{R}^d$
- Assume $k(x, y) = \psi(x - y)$ is bounded, continuous, and translation invariant
- Then $\psi$ is proportional to the Fourier transform of a probability measure $\Lambda$
(Bochner's theorem): $\psi(x - y) \propto \int_{\mathbb{R}^d} e^{i \omega^\top (x - y)} \,\mathrm{d}\Lambda(\omega)$
- If $\psi \in L^1(\mathbb{R}^d)$, the measure has a density
- If that density is positive everywhere, $k$ is universal
- For all nonzero finite signed measures $\mu$, $\iint k(x, y) \,\mathrm{d}\mu(x) \,\mathrm{d}\mu(y) > 0$
- True for the Gaussian $\exp\big(-\tfrac{1}{2\sigma^2} \|x - y\|^2\big)$
- and the Laplace kernel $\exp\big(-\tfrac{1}{\sigma} \|x - y\|\big)$
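A minimal Monte Carlo check of Bochner's theorem for the Gaussian kernel, whose spectral measure is $\mathcal{N}(0, \sigma^{-2} I)$ (the sample size here is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 3, 1.0
x, y = rng.normal(size=d), rng.normal(size=d)

# Gaussian kernel = E_omega[cos(omega^T (x - y))] with omega ~ N(0, I / sigma^2)
m = 200_000
omega = rng.normal(scale=1.0 / sigma, size=(m, d))
mc_estimate = np.cos(omega @ (x - y)).mean()

exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))
print(mc_estimate, exact)        # agree up to Monte Carlo error, roughly 1/sqrt(m)
```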
Limitations of kernel-based learning
- Generally bad at learning sparsity
- e.g. a target depending on only a few of $d$ input coordinates, for large $d$
- Provably statistically slower than deep learning for a few problems
- $\mathcal{O}(n^2)$ or $\mathcal{O}(n^3)$ computational complexity, $\mathcal{O}(n^2)$ memory
- Various approximations you can make (e.g. the Nyström sketch below)
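One such approximation is Nyström; a minimal sketch (the landmark count and kernel choice are assumptions) that never forms the full $n \times n$ Gram matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 5, 100                 # n points, m << n landmarks
X = rng.normal(size=(n, d))

def gauss_gram(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

# Nystrom: K ~= K_nm K_mm^{-1} K_mn, costing O(n m^2) time and O(n m) memory
idx = rng.choice(n, size=m, replace=False)
K_nm = gauss_gram(X, X[idx])
K_mm = gauss_gram(X[idx], X[idx])

w, U = np.linalg.eigh(K_mm)
w = np.clip(w, 1e-12, None)
Z = K_nm @ U / np.sqrt(w)              # rows are features with Z Z^T ~= K

# Spot-check the approximation quality on a random 5 x 5 block
i = rng.choice(n, size=5, replace=False)
print(np.abs(Z[i] @ Z[i].T - gauss_gram(X[i], X[i])).max())
```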
Relationship to deep learning
- Deep models usually end as $f(x) = \langle w, h_\theta(x) \rangle$ for a learned representation $h_\theta$
- Can think of $k_\theta(x, y) = \langle h_\theta(x), h_\theta(y) \rangle$ as a learned kernel, with $\phi = h_\theta$
- Does this gain us anything?
- Random nets with trained last layer (NNGP) can be decent
- As width $\to \infty$, nets become the neural tangent kernel
- Widely used theoretical analysis… more tomorrow
- SVMs with NTK can be great on small data
- Inspiration: learn the kernel model end-to-end
- Ongoing area; good results in two-sample testing, GANs, density estimation, meta-learning, semi-supervised learning, …
- Explored a bit in interactive session!
What's next
- After break: interactive session exploring kernels w/ ridge regression
- Tomorrow: a subset of
- Representing distributions
- Uses for statistical testing + generative models
- Connections to Gaussian processes, probabilistic numerics
- Approximation methods for faster computation
- Deeper connection to deep learning
- More details on basics:
- Berlinet and Thomas-Agnan, RKHS in Probability and Statistics
- Steinwart and Christmann, Support Vector Machines