\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \newcommand{\cH}{\mathcal{H}} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\MMD}{MMD} \newcommand{\PP}{\mathbb{P}} \newcommand{\QQ}{\mathbb{Q}} \newcommand{\R}{\mathbb{R}} \DeclareMathOperator{\sign}{sign} \DeclareMathOperator{\span}{span} \newcommand{\tp}{^\mathsf{T}} \DeclareMathOperator{\Var}{Var} \newcommand{\X}{\mathcal{X}} \newcommand{\abs}[1]{\lvert #1 \rvert} \newcommand{\Abs}[1]{\left\lvert #1 \right\rvert} \newcommand{\norm}[1]{\lVert #1 \rVert} \newcommand{\Norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\hnorm}[2][\cH]{\norm{#2}_{#1}} \newcommand{\hNorm}[2][\cH]{\Norm{#2}_{#1}} \newcommand{\inner}[2]{\langle #1, #2 \rangle} \newcommand{\Inner}[2]{\left\langle #1, #2 \right\rangle} \newcommand{\hinner}[3][\cH]{\inner{#2}{#3}_{#1}} \newcommand{\hInner}[3][\cH]{\Inner{#2}{#3}_{#1}}

Kernel Methods:
From Basics to Modern Applications

Danica J. Sutherland (she/her)
Computer Science, University of British Columbia
Data Science Summer School, January 2021


Motivation

  • Machine learning! …but how do we actually do it?
  • Linear models! f(x) = w_0 + w x , \hat{y}(x) = \sign(f(x))
  • Extend x to \phi(x) = (1, x, x^2) : f(x) = w\tp (1, x, x^2) = w\tp \phi(x)
  • Kernels are basically a way to study doing this with any, potentially very complicated, \phi
  • Convenient way to make models on documents, graphs, videos, datasets, …
  • \phi will live in a reproducing kernel Hilbert space

Hilbert spaces

  • A complete (real or complex) inner product space.
  • Inner product space: a vector space with an inner product:
    • \hinner{\alpha_1 f_1 + \alpha_2 f_2}{g} = \alpha_1 \hinner{f_1}{g} + \alpha_2 \hinner{f_2}{g}
    • \hinner{f}{g} = \hinner{g}{f}
    • \hinner{f}{f} > 0 for f \ne 0 , \hinner{0}{0} = 0
    Induces a norm: \hnorm f = \sqrt{\hinner{f}{f}}
  • Complete: “well-behaved” (Cauchy sequences have limits in \cH )

Kernel: an inner product between feature maps

  • Call our domain \X , some set
    • \R^d , functions, distributions of graphs of images, …
  • k : \X \times \X \to \R is a kernel on \X if there exists a Hilbert space \cH and a feature map \phi : \X \to \cH so that k(x, y) = \hinner{\phi(x)}{\phi(y)}
  • Roughly, k is a notion of “similarity” between inputs
  • Linear kernel on \R^d : k(x, y) = \hinner[\R^d]{x}{y}

Building kernels from other kernels

  • Scaling: if \gamma \ge 0 , k_\gamma(x, y) = \gamma k(x, y) is a kernel
    • k_\gamma(x, y) = \gamma \hinner{\phi(x)}{\phi(y)} = \hinner{\sqrt\gamma \phi(x)}{\sqrt\gamma \phi(y)}
  • Sum: k_+(x, y) = k_1(x, y) + k_2(x, y) is a kernel
    • k_+(x, y) = \hInner[\cH_1 \oplus \cH_2]{\begin{bmatrix}\phi_1(x) \\ \phi_2(x)\end{bmatrix}}{\begin{bmatrix}\phi_1(y) \\ \phi_2(y)\end{bmatrix}}
  • Is k_1(x, y) - k_2(x, y) necessarily a kernel?
    • Take k_1(x, y) = 0 , k_2(x, y) = x y , x \ne 0 .
    • Then k_1(x, x) - k_2(x, x) = - x^2 < 0
    • But k(x, x) = \hnorm{\phi(x)}^2 \ge 0 .
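A small numerical sketch of the scaling and sum constructions above (the toy feature maps and test points are arbitrary choices of mine, not from the slides):

```python
# Sketch: scaling and sum of kernels, checked through explicit feature maps.
import numpy as np

phi1 = lambda x: np.array([1.0, x])     # feature map for k1(x, y) = 1 + x*y
phi2 = lambda x: np.array([x**2])       # feature map for k2(x, y) = x^2 * y^2

k1 = lambda x, y: phi1(x) @ phi1(y)
k2 = lambda x, y: phi2(x) @ phi2(y)

x, y, gamma = 0.7, -1.3, 2.5

# Scaling: gamma * k1 has feature map sqrt(gamma) * phi1
assert np.isclose(gamma * k1(x, y),
                  (np.sqrt(gamma) * phi1(x)) @ (np.sqrt(gamma) * phi1(y)))

# Sum: k1 + k2 has the concatenated feature map [phi1; phi2]
phi_sum = lambda x: np.concatenate([phi1(x), phi2(x)])
assert np.isclose(k1(x, y) + k2(x, y), phi_sum(x) @ phi_sum(y))
print("scaling and sum constructions check out")
```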

Positive definiteness

  • A symmetric function k : \X \times \X \to \R is positive semi-definite (psd) if for all n \ge 1 , all a_1, \dots, a_n \in \R , and all x_1, \dots, x_n \in \X , \sum_{i=1}^n \sum_{j=1}^n a_i a_j k(x_i, x_j) \ge 0
  • Equivalently: the kernel matrix K is PSD, where K := \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \dots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \dots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \dots & k(x_n, x_n) \end{bmatrix}
  • Hilbert space kernels are psd
    \begin{align*} \sum_{i=1}^n \sum_{j=1}^n \hinner{a_i \phi(x_i)}{a_j \phi(x_j)} &= \hInner{\sum_{i=1}^n a_i \phi(x_i)}{\sum_{j=1}^n a_j \phi(x_j)} \\&= \hNorm{ \sum_{i=1}^n a_i \phi(x_i) }^2 \ge 0 \end{align*}
  • psd functions are Hilbert space kernels
    • Moore-Aronszajn Theorem; we'll come back to this
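As a quick empirical illustration of positive semi-definiteness (a sketch; the kernel, points, and tolerance are arbitrary choices of mine):

```python
# Sketch: the Gram matrix of a psd kernel should have no (significantly)
# negative eigenvalues, and the quadratic form a' K a is nonnegative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # 50 points in R^3

poly = lambda A, B: (A @ B.T + 1.0) ** 3     # polynomial kernel (x'y + 1)^3
K = poly(X, X)

eigs = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigs.min())    # >= 0 up to numerical error

a = rng.normal(size=50)
print("a' K a =", a @ K @ a)                 # = ||sum_i a_i phi(x_i)||^2 >= 0
```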

Some more ways to build kernels

  • Limits: if k_\infty(x, y) = \lim_{n \to \infty} k_n(x, y) exists, k_\infty is psd
    • \displaystyle \lim_{n\to\infty} \sum_{i=1}^n \sum_{j=1}^n a_i a_j k_n(x_i, x_j) \ge 0
  • Products: k_\times(x, y) = k_1(x, y) k_2(x, y) is psd
    • Let V \sim \mathcal N(0, K_1) , W \sim \mathcal N(0, K_2) be independent
    • \Cov(V_i W_i, V_j W_j) = \Cov(V_i, V_j) \Cov(W_i, W_j) = k_\times(x_i, x_j)
    • Covariance matrices are psd, so k_\times is too
  • Powers: k_n(x, y) = k(x, y)^n is psd for any integer n \ge 0

    \big( x\tp y + c \big)^n for c \ge 0 , the polynomial kernel

  • Exponents: k_{\exp}(x, y) = \exp(k(x, y)) is psd
    • k_{\exp}(x, y) = \lim_{N \to \infty} \sum_{n=0}^N \frac{1}{n!} k(x, y)^n
  • If f : \X \to \R , k_f(x, y) = f(x) k(x, y) f(y) is psd
    • Use the feature map x \mapsto f(x) \phi(x)

\exp\Big( -\frac{1}{2 \sigma^2} \norm{x}^2 \Big) \exp\Big( \frac{1}{\sigma^2} x\tp y \Big) \exp\Big( -\frac{1}{2 \sigma^2} \norm{y}^2 \Big)

{} = \exp\Big( - \frac{1}{2 \sigma^2} \left[ \norm{x}^2 - 2 x\tp y + \norm{y}^2 \right] \Big)

{} = \exp\Big( - \frac{\norm{x - y}^2}{2 \sigma^2} \Big) , the Gaussian kernel
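A numerical check of this factorization (a sketch; the test points and bandwidth are arbitrary):

```python
# Sketch: exp(-||x||^2/(2 s^2)) * exp(x'y / s^2) * exp(-||y||^2/(2 s^2))
# equals the Gaussian kernel exp(-||x - y||^2 / (2 s^2)).
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=5), rng.normal(size=5)
sigma = 1.7

lhs = (np.exp(-x @ x / (2 * sigma**2))
       * np.exp(x @ y / sigma**2)
       * np.exp(-y @ y / (2 * sigma**2)))
rhs = np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))
assert np.isclose(lhs, rhs)
print("Gaussian kernel factorization verified:", lhs)
```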

Reproducing property

  • Recall original motivating example with \X = \R \qquad \phi(x) = (1, x, x^2) \in \R^3
  • Kernel is k(x, y) = \hinner{\phi(x)}{\phi(y)} = 1 + x y + x^2 y^2
  • Classifier based on linear f(x) = \hinner{w}{\phi(x)}
  • f(\cdot) is the function f itself, represented by a vector in \R^3
    f(x) \in \R is the function evaluated at a point x
  • Elements of \cH correspond to functions, f : \X \to \R
  • Reproducing prop.: f(x) = \hinner{f(\cdot)}{\phi(x)} for f \in \cH
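A concrete sketch of the reproducing property for this quadratic feature map (the weight vector below is an arbitrary illustrative choice):

```python
# Sketch: for phi(x) = (1, x, x^2), a function f(x) = w0 + w1 x + w2 x^2
# is represented by the vector w in R^3, and f(x) = <f(.), phi(x)>.
import numpy as np

phi = lambda x: np.array([1.0, x, x**2])
w = np.array([0.5, -2.0, 3.0])      # represents f(x) = 0.5 - 2x + 3x^2 (arbitrary)

x = 1.3
f_of_x = w @ phi(x)                 # reproducing property: f(x) = <f(.), phi(x)>
assert np.isclose(f_of_x, 0.5 - 2 * x + 3 * x**2)

# and the kernel is k(x, y) = <phi(x), phi(y)> = 1 + x y + x^2 y^2
y = -0.4
assert np.isclose(phi(x) @ phi(y), 1 + x * y + x**2 * y**2)
print("f(1.3) =", f_of_x)
```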

Reproducing kernel Hilbert space (RKHS)

  • Every psd kernel k on \X defines a (unique) Hilbert space, its RKHS \cH , and a map \phi : \X \to \cH where
    • k(x, y) = \hinner{\phi(x)}{\phi(y)}
    • Elements f \in \cH are functions on \X , with f(x) = \hinner{f}{\phi(x)}
  • Combining the two, we sometimes write k(x, \cdot) = \phi(x)
  • k(x, \cdot) is the representer of the evaluation functional f \mapsto f(x)
    An RKHS is defined by this functional being continuous, i.e. \abs{f(x)} \le M_x \hnorm{f}

Moore-Aronszajn Theorem

  • Building \cH for a given psd k :
    • Start with \cH_0 = \span\left(\left\{ k(x, \cdot) : x \in \X \right\}\right)
    • Define \hinner[\cH_0]{\cdot}{\cdot} from \hinner[\cH_0]{k(x,\cdot)}{k(y,\cdot)} = k(x, y)
    • Take \cH to be completion of \cH_0 in the metric from \hinner[\cH_0]{\cdot}{\cdot}
    • Get that the reproducing property holds for k(x, \cdot) in \cH
    • Can also show uniqueness
  • Theorem: k is psd iff it's the reproducing kernel of an RKHS

A quick check: linear kernels

  • k(x, y) = x\tp y on \X = \R^d
  • If \displaystyle f(y) = \sum_{i=1}^n a_i k(x_i, y) , then f(y) = \left[ \sum_{i=1}^n a_i x_i \right]\tp y
  • Taking the completion doesn't add anything here, since \R^d (and hence \cH_0 ) is already complete
  • So, linear kernel gives you RKHS of linear functions
  • \hnorm{f} = \sqrt{\sum_{i=1}^n \sum_{j=1}^n a_i a_j k(x_i, x_j)} = \Norm{\sum_{i=1}^n a_i x_i}
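A numerical sketch of that last identity, with arbitrary coefficients and points:

```python
# Sketch: for the linear kernel, the RKHS norm of f = sum_i a_i k(x_i, .)
# equals the Euclidean norm of w = sum_i a_i x_i.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))          # x_1, ..., x_10 in R^4
a = rng.normal(size=10)

K = X @ X.T                           # linear kernel matrix
rkhs_norm = np.sqrt(a @ K @ a)        # sqrt(sum_ij a_i a_j k(x_i, x_j))
w = X.T @ a                           # sum_i a_i x_i
assert np.isclose(rkhs_norm, np.linalg.norm(w))
print("RKHS norm = ||w|| =", rkhs_norm)
```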

More complicated: Gaussian kernels

k(x, y) = \exp\Big( -\frac{1}{2 \sigma^2} \norm{x - y}^2 \Big)

  • \cH is infinite-dimensional
  • Functions in \cH are bounded: \abs{f(x)} = \abs{\hinner{f}{k(x, \cdot)}} \le \sqrt{k(x, x)} \hnorm{f} = \hnorm{f}
  • Choice of \sigma controls how fast functions can vary: \begin{align*} \abs{f(x + t) - f(x)} &\le \hnorm{k(x + t, \cdot) - k(x, \cdot)} \hnorm{f} \\ \hnorm{k(x + t, \cdot) - k(x, \cdot)}^2 & = 2 - 2 k(x, x + t) = 2 - 2 \exp\left(-\tfrac{\norm{t}^2}{2 \sigma^2}\right) \end{align*}
  • Can say lots more with Fourier properties
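A sketch checking the boundedness and smoothness claims numerically for a function in the span of Gaussian kernel features (data, coefficients, and the test point are arbitrary choices of mine):

```python
# Sketch: for f = sum_i a_i k(x_i, .) with a Gaussian kernel, check
# |f(x)| <= ||f||_H  and  |f(x+t) - f(x)| <= ||k(x+t,.) - k(x,.)|| ||f||_H.
import numpy as np

sigma = 1.0
gauss = lambda A, B: np.exp(
    -((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * sigma**2))

rng = np.random.default_rng(3)
Xs = rng.normal(size=(8, 2))
a = rng.normal(size=8)
f = lambda x: a @ gauss(Xs, x[None, :])[:, 0]   # f(x) = sum_i a_i k(x_i, x)
f_norm = np.sqrt(a @ gauss(Xs, Xs) @ a)         # ||f||_H = sqrt(a' K a)

x = np.array([0.3, -0.7])
t = np.array([0.1, 0.2])
assert abs(f(x)) <= f_norm + 1e-9               # boundedness, since k(x, x) = 1

lhs = abs(f(x + t) - f(x))
rhs = np.sqrt(2 - 2 * np.exp(-(t @ t) / (2 * sigma**2))) * f_norm
assert lhs <= rhs + 1e-9
print(f"|f(x+t)-f(x)| = {lhs:.4f} <= bound {rhs:.4f}")
```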

Kernel ridge regression

\hat f = \argmin_{f \in \cH} \frac1n \sum_{i=1}^n ( f(x_i) - y_i )^2 + \lambda \hnorm{f}^2

Linear kernel gives normal ridge regression: \hat f(x) = \hat w\tp x; \quad \hat w = \argmin_{w \in \R^d} \sum_{i=1}^n ( w\tp x_i - y_i )^2 + \lambda \norm{w}^2

Nonlinear kernels will give nonlinear regression!

How to find \hat f ? Representer Theorem: \hat f = \sum_{i=1}^n \hat\alpha_i k(x_i, \cdot)

  • Let \cH_X = \span\{ k(x_i, \cdot) \}_{i=1}^n
    \cH_\perp its orthogonal complement in \cH
  • Decompose f = f_X + f_\perp with f_X \in \cH_X , f_\perp \in \cH_\perp
  • f(x_i) = \hinner{f_X + f_\perp}{k(x_i, \cdot)} = \hinner{f_X}{k(x_i, \cdot)}
  • \hnorm{f}^2 = \hnorm{f_X}^2 + \hnorm{f_\perp}^2
  • Minimizer needs f_\perp = 0 , and so \hat f = \sum_{i=1}^n \alpha_i k(x_i, \cdot)

\begin{align*} \sum_{i=1}^n \left( \sum_{j=1}^n \alpha_j k(x_i, x_j) - y_i \right)^2 &= \sum_{i=1}^n \left( [K \alpha]_i - y_i \right)^2 = \norm{K \alpha - y}^2 \\&= \alpha\tp K^2 \alpha - 2 y \tp K \alpha + y\tp y \end{align*}

\hNorm{\sum_{i=1}^n \alpha_i k(x_i, \cdot)}^2 = \sum_{i=1}^n \sum_{j=1}^n \alpha_i k(x_i, x_j) \alpha_j = \alpha\tp K \alpha

\begin{align*} \hat\alpha &= \argmin_{\alpha \in \R^n} \alpha\tp K^2 \alpha - 2 y \tp K \alpha + y\tp y + n \lambda \alpha\tp K \alpha \\&= \argmin_{\alpha \in \R^n} \alpha\tp K (K + n \lambda I) \alpha - 2 y \tp K \alpha \end{align*}

Setting derivative to zero gives K (K + n \lambda I) \hat\alpha = K y ,
satisfied by \hat\alpha = (K + n \lambda I)^{-1} y
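A compact sketch of kernel ridge regression following the formulas above, with a Gaussian kernel (the synthetic data and hyperparameters are arbitrary choices of mine):

```python
# Sketch of kernel ridge regression: alpha = (K + n*lambda*I)^{-1} y,
# f_hat(x) = sum_i alpha_i k(x_i, x).
import numpy as np

def gauss_kernel(A, B, sigma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(4)
n = 100
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)   # noisy sine as a toy target

lam = 1e-3
K = gauss_kernel(X, X)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

X_test = np.linspace(-3, 3, 200)[:, None]
y_pred = gauss_kernel(X_test, X) @ alpha         # f_hat at test points
print("train RMSE:", np.sqrt(np.mean((K @ alpha - y) ** 2)))
```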

Other kernel algorithms

  • Representer theorem applies if R is strictly increasing: \min_{f \in \cH} L(f(x_1), \cdots, f(x_n)) + R(\hnorm{f})
  • Classification algorithms:
    • Support vector machines: L is hinge loss
    • Kernel logistic regression: L is logistic loss
  • Principal component analysis, canonical correlation analysis
  • Many, many more…

Some theory

  • If \cH is universal, f \in \cH can approximate any continuous function
    • \int_\X \int_\X k(x, y) \, \mathrm{d}\mu(x) \, \mathrm{d}\mu(y) > 0 for all nonzero finite signed measures \mu
    • True for Gaussian, many other common kernels (but no finite-dimensional ones!)
    • Norm may go to \infty as approximation gets better
  • If RKHS norm is small, can learn quickly
    • e.g. Rademacher complexity of \{ f \in \cH : \hnorm{f} \le B \} is at most \sqrt{\frac{1}{m} B^2 \sup_{x \in \X} k(x, x)}

Limitations of kernel-based learning

  • Generally bad at learning sparsity
    • e.g. f(x_1, \dots, x_d) = 3 x_2 - 5 x_{17} for large d
  • Provably slower than deep learning for a few problems

Relationship to deep learning

  • Deep models usually end as f_L(x) = w_L\tp f_{L-1}(x)
  • Can think of this as a learned kernel, k(x, y) = f_{L-1}(x)\tp f_{L-1}(y)
  • Does this gain us anything?
    • Random nets with trained last layer (NNGP) can be decent
    • As width \to \infty , nets become neural tangent kernel
      • Widely used theoretical analysis
      • SVMs with NTK can be great on small data
    • Inspiration: learn the kernel model end-to-end
      • Ongoing area; good results in two-sample testing, GANs, density estimation, meta-learning, semi-supervised learning, …

Mean embeddings of distributions

  • Represent point x \in \X as \phi(x) , f(x) = \hinner{f}{k(x, \cdot)}
  • Represent distribution \PP as \mu_\PP := \E_{X \sim \PP} k(X, \cdot) , so that \E_{X \sim \PP} f(X) = \E_{X \sim \PP} \hinner{f}{k(X, \cdot)} = \hInner{f}{\mu_\PP}
    • Last step assumed e.g. \E \sqrt{k(X, X)} < \infty
  • Okay. Why?
    • One reason: ML on distributions [Szabó+ JMLR-16]
    • More common reason: comparing distributions

Maximum Mean Discrepancy

\begin{align*} \MMD(\PP, \QQ) &= \hnorm{\mu_\PP - \mu_\QQ} \\&= \sup_{\hnorm{f} \le 1} \hinner{f}{\mu_\PP - \mu_\QQ} \\&= \sup_{\hnorm{f} \le 1} \E_{X \sim \PP} f(X) - \E_{Y \sim \QQ} f(Y) \end{align*}

  • Last line is Integral Probability Metric (IPM) form
  • f is called the “witness function” or “critic”: high on \PP , low on \QQ
    f^*(t) \propto \hinner{\mu_\PP - \mu_\QQ}{k(t, \cdot)} = \E_\PP k(t, X) - \E_\QQ k(t, Y)
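A sketch of the standard plug-in (biased) MMD estimator with a Gaussian kernel, using the expansion \MMD^2(\PP, \QQ) = \E k(X, X') - 2 \E k(X, Y) + \E k(Y, Y') ; the samples and bandwidth below are arbitrary illustrative choices:

```python
# Sketch: biased (plug-in) estimate of MMD^2 between two samples,
# MMD^2 = E k(X,X') - 2 E k(X,Y) + E k(Y,Y'), with a Gaussian kernel.
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    return (gauss_kernel(X, X, sigma).mean()
            - 2 * gauss_kernel(X, Y, sigma).mean()
            + gauss_kernel(Y, Y, sigma).mean())

rng = np.random.default_rng(5)
X = rng.normal(0.0, 1.0, size=(500, 1))   # sample from P = N(0, 1)
Y = rng.normal(0.5, 1.0, size=(500, 1))   # sample from Q = N(0.5, 1)
Z = rng.normal(0.0, 1.0, size=(500, 1))   # another sample from P

print("MMD^2(P, Q) approx:", mmd2_biased(X, Y))   # clearly positive
print("MMD^2(P, P) approx:", mmd2_biased(X, Z))   # near zero
```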

More!