\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\bigO}{\mathcal O}
\newcommand{\cH}{\mathcal{H}}
\DeclareMathOperator{\Cov}{Cov}
\DeclareMathOperator{\E}{\mathbb{E}}
\DeclareMathOperator{\MMD}{MMD}
\newcommand{\PP}{\mathbb{P}}
\newcommand{\QQ}{\mathbb{Q}}
\newcommand{\R}{\mathbb{R}}
\DeclareMathOperator{\sign}{sign}
\DeclareMathOperator{\span}{span}
\newcommand{\tp}{^\mathsf{T}}
\DeclareMathOperator{\Tr}{Tr}
\DeclareMathOperator{\Var}{Var}
\newcommand{\X}{\mathcal{X}}
\newcommand{\abs}[1]{\lvert #1 \rvert}
\newcommand{\Abs}[1]{\left\lvert #1 \right\rvert}
\newcommand{\norm}[1]{\lVert #1 \rVert}
\newcommand{\Norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\hnorm}[2][\cH]{\norm{#2}_{#1}}
\newcommand{\hNorm}[2][\cH]{\Norm{#2}_{#1}}
\newcommand{\inner}[2]{\langle #1, #2 \rangle}
\newcommand{\Inner}[2]{\left\langle #1, #2 \right\rangle}
\newcommand{\hinner}[3][\cH]{\inner{#2}{#3}_{#1}}
\newcommand{\hInner}[3][\cH]{\Inner{#2}{#3}_{#1}}
Modern Kernel Methods in Machine Learning: Part I
Danica J. Sutherland (she/her)
Computer Science, University of British Columbia
ETICS "summer" school, Oct 2022
Motivation Machine learning! …but how do we actually do it? Linear models!
f(x) = w_0 + w x
,
\hat{y}(x) = \sign(f(x))
Extend
x
…
f(x) = w\tp (1, x, x^2) = w\tp \phi(x)
Kernels are basically a way to study doing this with any, potentially very complicated,
\phi
Convenient way to make models on documents, graphs, videos, datasets, …
\phi
will live in a reproducing kernel Hilbert space
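To make this concrete before the formal definitions, here is a minimal NumPy sketch (not from the slides; the data and the least-squares fit are illustrative assumptions) of a linear model on the explicit feature map \phi(x) = (1, x, x^2):

```python
import numpy as np

def phi(x):
    """Explicit feature map phi(x) = (1, x, x^2) for scalar inputs x."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x**2], axis=-1)  # shape (n, 3)

# Illustrative 1-d data: labels depend on x only through x^2.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = np.sign(x**2 - 1.0)  # +1 outside [-1, 1], -1 inside

# Linear model in feature space: f(x) = w^T phi(x), fit here by least squares.
w, *_ = np.linalg.lstsq(phi(x), y, rcond=None)
y_hat = np.sign(phi(x) @ w)
print("training accuracy:", np.mean(y_hat == y))
```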
Hilbert spaces A complete (real or complex) inner product space. Inner product space: a vector space with an inner product:
\hinner{\alpha_1 f_1 + \alpha_2 f_2}{g} = \alpha_1 \hinner{f_1}{g} + \alpha_2 \hinner{f_2}{g}
\hinner{f}{g} = \hinner{g}{f}
\hinner{f}{f} > 0
for
f \ne 0
,
\hinner{0}{0} = 0
Induces a norm :
\hnorm f = \sqrt{\hinner{f}{f}}
Complete: “well-behaved” (Cauchy sequences have limits in
\cH
) Kernel: an inner product between feature maps Call our domain
\X
, some set
\R^d
, functions, distributions of graphs of images, …
k : \X \times \X \to \R
is a kernel on
\X
if there exists a Hilbert space
\cH
and a feature map
\phi : \X \to \cH
so that
k(x, y) = \hinner{\phi(x)}{\phi(y)}
Roughly,
k
is a notion of “similarity” between inputs Linear kernel on
\R^d
:
k(x, y) = \hinner[\R^d]{x}{y}
Aside: the name “kernel” Our concept: "positive semi-definite kernel," "Mercer kernel," "RKHS kernel" Semi-related: kernel density estimation
k : \X \times \X \to \R
, usually symmetric, like an RKHS kernel. Always requires
\int k(x, y) \mathrm d y = 1
, unlike RKHS kernel. Often requires
k(x, y) \ge 0
, unlike RKHS kernel. Not required to be an inner product, unlike RKHS kernel. Unrelated: the kernel (null space) of a linear map, the kernel of a probability density, the kernel of a convolution, CUDA kernels, the Linux kernel, popcorn kernels.
Building kernels from other kernels Scaling: if
\gamma \ge 0
,
k_\gamma(x, y) = \gamma k(x, y)
is a kernel
k_\gamma(x, y) = \gamma \hinner{\phi(x)}{\phi(y)} = \hinner{\sqrt\gamma \phi(x)}{\sqrt\gamma \phi(y)}
Sum:
k_+(x, y) = k_1(x, y) + k_2(x, y)
is a kernel
k_+(x, y) = \hInner[\cH_1 \oplus \cH_2]{\begin{bmatrix}\phi_1(x) \\ \phi_2(x)\end{bmatrix}}{\begin{bmatrix}\phi_1(y) \\ \phi_2(y)\end{bmatrix}}
Is
k_1(x, y) - k_2(x, y)
necessarily a kernel? Take
k_1(x, y) = 0
,
k_2(x, y) = x y
,
x \ne 0
. Then
k_1(x, x) - k_2(x, x) = - x^2 < 0
But
k(x, x) = \hnorm{\phi(x)}^2 \ge 0
. Positive definiteness A symmetric function
k : \X \times \X \to \R
(i.e. have
k(x, y) = k(y, x)
)
is positive semi-definite (psd) if
for all
n \ge 1
,
(a_1, \dots, a_n) \in \R^n
,
(x_1, \dots, x_n) \in \X^n
,
\sum_{i=1}^n \sum_{j=1}^n a_i a_j k(x_i, x_j) \ge 0
Equivalently: kernel matrix
K
is PSD
K := \begin{bmatrix}
k(x_1, x_1) & k(x_1, x_2) & \dots & k(x_1, x_n) \\
k(x_2, x_1) & k(x_2, x_2) & \dots & k(x_2, x_n) \\
\vdots & \vdots & \ddots & \vdots \\
k(x_n, x_1) & k(x_n, x_2) & \dots & k(x_n, x_n)
\end{bmatrix}
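As a quick numerical illustration (not from the slides; the kernels, points, and helper names are assumptions), the sketch below builds kernel matrices for a few kernels, including ones obtained by scaling, summing, and multiplying base kernels, and checks that their eigenvalues are nonnegative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 arbitrary points in R^3

def linear_kernel(X):
    return X @ X.T

def gaussian_kernel(X, sigma=1.0):
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

K1, K2 = linear_kernel(X), gaussian_kernel(X)

# Scaling, sums, and (elementwise) products of kernels are again kernels,
# so all of these kernel matrices should be positive semi-definite.
for name, K in [("linear", K1), ("gaussian", K2),
                ("2 * linear", 2.0 * K1),
                ("sum", K1 + K2),
                ("product", K1 * K2)]:
    min_eig = np.linalg.eigvalsh(K).min()
    print(f"{name:10s} min eigenvalue = {min_eig:.2e}")  # >= 0 up to roundoff
```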
Hilbert space kernels are psd
\begin{align*}
\sum_{i=1}^n \sum_{j=1}^n a_i a_j k(x_i, x_j)
&= \sum_{i=1}^n \sum_{j=1}^n \hinner{a_i \phi(x_i)}{a_j \phi(x_j)}
\\&= \hInner{\sum_{i=1}^n a_i \phi(x_i)}{\sum_{j=1}^n a_j \phi(x_j)}
\\&= \hNorm{ \sum_{i=1}^n a_i \phi(x_i) }^2
\ge 0
\end{align*}
psd functions are Hilbert space kernels: Moore-Aronszajn Theorem; we'll come back to this.
Some more ways to build kernels Limits: if
k_\infty(x, y) = \lim_{m \to \infty} k_m(x, y)
exists,
k_\infty
is psd
\displaystyle \lim_{m\to\infty} \sum_{i=1}^n \sum_{j=1}^n a_i a_j k_m(x_i, x_j) \ge 0
Products:
k_\times(x, y) = k_1(x, y) k_2(x, y)
is psd. Let
V \sim \mathcal N(0, K_1)
,
W \sim \mathcal N(0, K_2)
be independent
\Cov(V_i W_i, V_j W_j) = \Cov(V_i, V_j) \Cov(W_i, W_j) = k_\times(x_i, x_j)
Covariance matrices are psd, so
k_\times
is too. Powers:
k_n(x, y) = k(x, y)^n
is psd for any integer
n \ge 0
(x\tp y + c)^n
, the polynomial kernel (with c \ge 0)
Exponents:
k_{\exp}(x, y) = \exp(k(x, y))
is psd
k_{\exp}(x, y) = \lim_{N \to \infty} \sum_{n=0}^N \frac{1}{n!} k(x, y)^n
If
f : \X \to \R
,
k_f(x, y) = f(x) k(x, y) f(y)
is psd. Use the feature map
x \mapsto f(x) \phi(x)
\exp\Big( -\frac{1}{2 \sigma^2} \norm{x}^2 \Big)
\exp\Big( \frac{1}{\sigma^2} x\tp y \Big)
\exp\Big( -\frac{1}{2 \sigma^2} \norm{y}^2 \Big)
{} = \exp\Big( - \frac{1}{2 \sigma^2} \left[ \norm{x}^2 - 2 x\tp y + \norm{y}^2 \right] \Big)
{} = \exp\Big( - \frac{\norm{x - y}^2}{2 \sigma^2} \Big)
,
the Gaussian kernel
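A small sketch (illustrative, not from the slides; the points and bandwidth are assumptions) checking numerically that the three-factor construction above reproduces the closed-form Gaussian kernel:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=5), rng.normal(size=5)
sigma = 0.7

# Constructive form: f(x) exp(x^T y / sigma^2) f(y),
# with f(z) = exp(-||z||^2 / (2 sigma^2)).
f = lambda z: np.exp(-np.dot(z, z) / (2 * sigma**2))
constructed = f(x) * np.exp(np.dot(x, y) / sigma**2) * f(y)

# Closed form: exp(-||x - y||^2 / (2 sigma^2)).
closed_form = np.exp(-np.sum((x - y)**2) / (2 * sigma**2))

print(constructed, closed_form)          # should agree
assert np.isclose(constructed, closed_form)
```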
Reproducing property Recall original motivating example with
\X = \R \qquad \phi(x) = (1, x, x^2) \in \R^3
Kernel is
k(x, y) = \hinner{\phi(x)}{\phi(y)} = 1 + x y + x^2 y^2
Classifier based on linear
f(x) = \hinner{w}{\phi(x)}
f(\cdot)
is the function
f
itself; corresponds to vector
w
in
\R^3
f(x) \in \R
is the function evaluated at a point
x
Elements of
\cH
are functions,
f : \X \to \R
Reproducing property:
f(x) = \hinner{f(\cdot)}{\phi(x)}
for
f \in \cH
Reproducing kernel Hilbert space (RKHS) Every psd kernel
k
on
\X
defines a (unique) Hilbert space, its RKHS
\cH
,
and a map
\phi : \X \to \cH
where
k(x, y) = \hinner{\phi(x)}{\phi(y)}
Elements
f \in \cH
are functions on
\X
, with
f(x) = \hinner{f}{\phi(x)}
Combining the two, we sometimes write
k(x, \cdot) = \phi(x)
k(x, \cdot)
represents the evaluation functional f \mapsto f(x) (via Riesz representation). An RKHS is defined by these evaluation functionals being continuous, i.e.
\abs{f(x)} \le M_x \hnorm{f}
Moore-Aronszajn Theorem Building
\cH
for a given psd
k
:Start with
\cH_0 = \span\left(\left\{ k(x, \cdot) : x \in \X \right\}\right)
Define
\hinner[\cH_0]{\cdot}{\cdot}
from
\hinner[\cH_0]{k(x,\cdot)}{k(y,\cdot)} = k(x, y)
Take
\cH
to be completion of
\cH_0
in the metric from
\hinner[\cH_0]{\cdot}{\cdot}
Get that the reproducing property holds for
k(x, \cdot)
in
\cH
Can also show uniqueness Theorem:
k
is psd iff it's the reproducing kernel of an RKHS A quick check: linear kernels
k(x, y) = x\tp y
on
\X = \R^d
k(x, \cdot) = [y \mapsto x\tp y]
"corresponds to"
x
If
\displaystyle f(y) = \sum_{i=1}^n a_i k(x_i, y)
, then
f(y) = \left[ \sum_{i=1}^n a_i x_i \right]\tp y
Closure doesn't add anything here, since
\R^d
is closed So, linear kernel gives you RKHS of linear functions
\hnorm{f} = \sqrt{\sum_{i=1}^n \sum_{j=1}^n a_i a_j k(x_i, x_j)} = \Norm{\sum_{i=1}^n a_i x_i}
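A quick numerical check of this identity (an illustrative sketch; the points and coefficients are arbitrary assumptions): for the linear kernel, the RKHS norm computed from the kernel matrix agrees with the Euclidean norm of the corresponding weight vector.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 4))     # points x_1, ..., x_n in R^4
a = rng.normal(size=8)          # coefficients a_1, ..., a_n

K = X @ X.T                     # linear-kernel matrix, K_ij = x_i^T x_j

# ||f||_H^2 = sum_ij a_i a_j k(x_i, x_j) = a^T K a ...
norm_sq_rkhs = a @ K @ a
# ... which for the linear kernel equals ||sum_i a_i x_i||^2 in R^d.
norm_sq_euclid = np.sum((a @ X)**2)

print(norm_sq_rkhs, norm_sq_euclid)      # should agree
assert np.isclose(norm_sq_rkhs, norm_sq_euclid)
```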
More complicated: Gaussian kernels
k(x, y) = \exp\left(-\tfrac{1}{2 \sigma^2} \norm{x - y}^2\right)
\cH
is infinite-dimensional Functions in
\cH
are bounded:
\abs{f(x)} = \abs{\hinner{f}{k(x, \cdot)}} \le \sqrt{k(x, x)} \hnorm{f} = \hnorm{f}
Choice of
\sigma
controls how fast functions can vary:
\begin{align*}
\abs{f(x + t) - f(x)}
&\le \hnorm{k(x + t, \cdot) - k(x, \cdot)} \hnorm{f}
\\
\hnorm{k(x + t, \cdot) - k(x, \cdot)}^2
& = 2 - 2 k(x, x + t)
= 2 - 2 \exp\left(-\tfrac{\norm{t}^2}{2 \sigma^2}\right)
\end{align*}
Can say lots more with Fourier properties
Kernel ridge regression
\hat f = \argmin_{f \in \cH} \frac1n \sum_{i=1}^n ( f(x_i) - y_i )^2 + \lambda \hnorm{f}^2
Linear kernel gives normal ridge regression:
\hat f(x) = \hat w\tp x; \quad
\hat w = \argmin_{w \in \R^d} \frac1n \sum_{i=1}^n ( w\tp x_i - y_i )^2 + \lambda \norm{w}^2
Nonlinear kernels will give nonlinear regression!
How to find
\hat f
? Representer Theorem :
\hat f = \sum_{i=1}^n \hat\alpha_i k(x_i, \cdot)
Let
\cH_X = \span\{ k(x_i, \cdot) \}_{i=1}^n
\cH_\perp
its orthogonal complement in
\cH
Decompose
f = f_X + f_\perp
with
f_X \in \cH_X
,
f_\perp \in \cH_\perp
f(x_i) = \hinner{f_X + f_\perp}{k(x_i, \cdot)} = \hinner{f_X}{k(x_i, \cdot)}
\hnorm{f}^2 = \hnorm{f_X}^2 + \hnorm{f_\perp}^2
Minimizer needs
f_\perp = 0
, and so
\hat f = \sum_{i=1}^n \alpha_i k(x_i, \cdot)
\begin{align*}
\sum_{i=1}^n \left( \sum_{j=1}^n \alpha_j k(x_i, x_j) - y_i \right)^2
&= \sum_{i=1}^n \left( [K \alpha]_i - y_i \right)^2
{}= \norm{K \alpha - y}^2
\\&{} = \alpha\tp K^2 \alpha - 2 y \tp K \alpha + y\tp y
\end{align*}
\hNorm{\sum_{i=1}^n \alpha_i k(x_i, \cdot)}^2
= \sum_{i=1}^n \sum_{j=1}^n \alpha_i k(x_i, x_j) \alpha_j
{} = \alpha\tp K \alpha
\begin{align*}
\hat\alpha
&= \argmin_{\alpha \in \R^n} \alpha\tp K^2 \alpha - 2 y \tp K \alpha + y\tp y + n \lambda \alpha\tp K \alpha
\\&{} = \argmin_{\alpha \in \R^n} \alpha\tp K (K + n \lambda I) \alpha - 2 y \tp K \alpha
\end{align*}
Setting derivative to zero gives
K (K + n \lambda I) \hat\alpha = K y
,
satisfied by
\hat\alpha = (K + n \lambda I)^{-1} y
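A minimal kernel ridge regression sketch following this formula (assuming a Gaussian kernel; the data, bandwidth, regularization strength, and the helper names krr_fit / krr_predict are illustrative, not from any library):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def krr_fit(X, y, lam=1e-2, sigma=1.0):
    """Dual coefficients: alpha_hat = (K + n * lam * I)^{-1} y."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def krr_predict(X_train, alpha, X_test, sigma=1.0):
    """f_hat(x) = sum_i alpha_i k(x_i, x)."""
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

# Illustrative 1-d regression problem: noisy samples of sin(x).
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

alpha = krr_fit(X, y, lam=1e-3, sigma=0.5)
X_test = np.linspace(-3, 3, 9)[:, None]
print(np.round(krr_predict(X, alpha, X_test, sigma=0.5), 2))
print(np.round(np.sin(X_test[:, 0]), 2))  # nonlinear fit should roughly match
```

Swapping in the linear kernel recovers ordinary ridge regression, matching the earlier slide.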
Other kernel algorithms The representer theorem applies whenever
R
is strictly increasing, for problems of the form
\min_{f \in \cH} L(f(x_1), \cdots, f(x_n)) + R(\hnorm{f})
Kernel methods can then train based on kernel matrix
K
Classification algorithms:Support vector machines:
L
is hinge loss Kernel logistic regression:
L
is logistic loss Principal component analysis, canonical correlation analysis Many, many more… But not everything works: e.g. the Lasso's
\norm{w}_1
regularizer Some theory: generalization Rademacher complexity of
\{ f \in \cH : \hnorm f \le B \}
is upper-bounded by
B / \sqrt n
if
k(x, x) \le 1
Implies for
L
-Lipschitz losses
\ell(\cdot, y)
that
\sup_{f : \hnorm f \le B} \E[\ell(f(x), y)] - \frac1n \sum_{i=1}^n \ell(f(x_i), y_i) \le \frac{2 L B}{\sqrt n}
Same kind of rates with stability-based analyses Implies that, if the “truth” is low-norm, most kernel methods are at most
\tilde\bigO(1 / \sqrt n)
suboptimal Difficulty of learning is controlled by the RKHS norm of the target Some theory: universality One definition:
a continuous kernel on a compact metric space
\X
is universal if
\cH
is
L_\infty
-dense in
C(\X)
: for every continuous
g : \X \to \R
, for every
\varepsilon > 0
,
there is an
f \in \cH
with
\norm{f - g}_\infty = \sup_{x \in \X} \abs{f(x) - g(x)} \le \varepsilon
Implies that, on compact
\X
,
\cH
can separate disjoint compact sets
\exists f \in \cH
with
f(x) > 0
for
x \in X_1
,
f(x) < 0
for
x \in X_2
Which implies there are
f \in \cH
with arbitrarily small loss Might take arbitrarily large norm: approximation/estimation tradeoff Can prove via Stone-Weierstrass or Fourier properties Never true for finite-dim kernels: need
\operatorname{rank}(K) = n
Translation-invariant kernels on
\R^d
Assume
k
is bounded, continuous, and translation invariant, i.e. k(x, y) = \psi(x - y). Then
\psi
is proportional to the Fourier transform of a probability measure
(Bochner's theorem) If
\psi \in L_1
, the measure has a density If that density is positive everywhere,
k
is universal For all nonzero finite signed measures
\mu
,
\int_\X \int_\X k(x, y) \, \mathrm{d}\mu(x) \, \mathrm{d}\mu(y) > 0
True for Gaussian
\exp(-\tfrac1{2 \sigma^2} \norm{x - y}^2 )
and Laplace
\exp(-\tfrac1\sigma \norm{x - y} )
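A small numerical illustration (an assumption-laden sketch, not from the slides) of the earlier remark that universality needs rank(K) = n: Gaussian and Laplace kernel matrices at distinct points are strictly positive definite, while a finite-dimensional polynomial kernel's matrix cannot exceed its feature dimension in rank.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 2))           # 10 distinct points in R^2
sq = np.sum((X[:, None] - X[None, :])**2, axis=-1)

K_gauss = np.exp(-sq / 2)              # Gaussian kernel, sigma = 1
K_lap = np.exp(-np.sqrt(sq))           # Laplace kernel, sigma = 1
K_poly = (X @ X.T + 1)**2              # quadratic kernel: phi lives in R^6

for name, K in [("Gaussian", K_gauss), ("Laplace", K_lap), ("quadratic", K_poly)]:
    print(f"{name:9s} min eigenvalue = {np.linalg.eigvalsh(K).min():.2e}")
# Gaussian and Laplace matrices are strictly positive definite here (min
# eigenvalue clearly > 0, full rank); the quadratic kernel's matrix has rank
# at most 6 < 10, so its smallest eigenvalues are (numerically) zero.
```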
Limitations of kernel-based learning Generally bad at learning sparsity e.g.
f(x_1, \dots, x_d) = 3 x_2 - 5 x_{17}
for large
d
Provably statistically slower than deep learning for a few problems
\bigO(n^3)
computational complexity,
\bigO(n^2)
memoryVarious approximations you can make Relationship to deep learning Deep models usually end as
f_L(x) = w_L\tp f_{L-1}(x)
Can think of as learned kernel,
k(x, y) = f_{L-1}(x)\tp f_{L-1}(y)
Does this gain us anything? Random nets with a trained last layer (NNGP) can be decent. As width
\to \infty
, nets become the neural tangent kernel. Widely used for theoretical analysis... more tomorrow. SVMs with the NTK can be great on small data. Inspiration: learn the kernel model end-to-end. Ongoing area; good results in two-sample testing, GANs, density estimation, meta-learning, semi-supervised learning, … Explored a bit in the interactive session!
What's next After the break: interactive session exploring with ridge regression. Tomorrow: a subset of representing distributions (uses for statistical testing + generative models), connections to Gaussian processes and probabilistic numerics, approximation methods for faster computation, and a deeper connection to deep learning.
More details on basics: Berlinet and Thomas-Agnan, RKHS in Probability and Statistics; Steinwart and Christmann, Support Vector Machines.