# Modern Kernel Methods in Machine Learning: Part I

Danica J. Sutherland (she/her)
Computer Science, University of British Columbia
ETICS "summer" school, Oct 2022

## Motivation

• Machine learning! …but how do we actually do it?
• Linear models! $f(x) = w^\mathsf{T} x + b$
• Extend to nonlinear models via a feature map $\phi$: $f(x) = w^\mathsf{T} \phi(x) + b$
• Kernels are basically a way to study doing this with any, potentially very complicated, feature map $\phi$
• Convenient way to make models on documents, graphs, videos, datasets, …
• The features $\phi(x)$ will live in a reproducing kernel Hilbert space

## Hilbert spaces

• A complete (real or complex) inner product space.
• Inner product space: a vector space $\mathcal H$ with an inner product $\langle \cdot, \cdot \rangle_{\mathcal H}$:
  • linear: $\langle \alpha_1 f_1 + \alpha_2 f_2, g \rangle_{\mathcal H} = \alpha_1 \langle f_1, g \rangle_{\mathcal H} + \alpha_2 \langle f_2, g \rangle_{\mathcal H}$
  • symmetric: $\langle f, g \rangle_{\mathcal H} = \langle g, f \rangle_{\mathcal H}$ (conjugate-symmetric in the complex case)
  • $\langle f, f \rangle_{\mathcal H} \ge 0$, with $\langle f, f \rangle_{\mathcal H} = 0$ iff $f = 0$
• Induces a norm: $\lVert f \rVert_{\mathcal H} = \sqrt{\langle f, f \rangle_{\mathcal H}}$
• Complete: “well-behaved” (Cauchy sequences have limits in $\mathcal H$)

## Kernel: an inner product between feature maps

• Call our domain $\mathcal X$, some set
  • $\mathbb R^d$, functions, distributions, graphs, images, …
• $k : \mathcal X \times \mathcal X \to \mathbb R$ is a kernel on $\mathcal X$ if there exists a Hilbert space $\mathcal H$ and a feature map $\phi : \mathcal X \to \mathcal H$ so that $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H}$
• Roughly, $k(x, y)$ is a notion of “similarity” between inputs
• Linear kernel on $\mathbb R^d$: $k(x, y) = \langle x, y \rangle = x^\mathsf{T} y$
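
As a quick illustration in code (a minimal sketch; the quadratic feature map here is made up for the example, not taken from the slides):

```python
import numpy as np

# Hypothetical feature map phi(x) = (x, x^2) on X = R, chosen for illustration
def phi(x):
    return np.array([x, x**2])

def k(x, y):
    # k(x, y) = <phi(x), phi(y)> = x*y + x^2 * y^2
    return phi(x) @ phi(y)

print(k(2.0, 3.0))       # 2*3 + 4*9 = 42.0
print(2.0 * 3.0)         # linear kernel on R: just the product, 6.0
```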

## Aside: the name “kernel”

• Our concept: "positive semi-definite kernel," "Mercer kernel," "RKHS kernel"
• Semi-related: kernel density estimation
  • A function $k(x, y)$, usually symmetric, like an RKHS kernel
  • Always requires $\int k(x, y) \,\mathrm{d}y = 1$, unlike RKHS kernels
  • Often requires $k(x, y) \ge 0$, unlike RKHS kernels
  • Not required to be an inner product, unlike RKHS kernels
• Unrelated:
  • The kernel (null space) of a linear map
  • The kernel of a probability density
  • The kernel of a convolution
  • CUDA kernels
  • The Linux kernel
  • Popcorn kernels

## Building kernels from other kernels

• Scaling: if $\gamma \ge 0$, $\gamma\, k(x, y)$ is a kernel
• Sum: $k_1(x, y) + k_2(x, y)$ is a kernel
• Is $k_1(x, y) - k_2(x, y)$ necessarily a kernel?
  • Take $k_1(x, y) = x^\mathsf{T} y$ on $\mathcal X = \mathbb R$, and $k_2(x, y) = 2\, x^\mathsf{T} y$.
  • Then $k_1(x, x) - k_2(x, x) = -x^2 < 0$ for $x \ne 0$.
  • But any kernel needs $k(x, x) = \lVert \phi(x) \rVert_{\mathcal H}^2 \ge 0$.
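
A numerical sanity check of this counterexample (a sketch using the linear kernel on random points; checking eigenvalues is just one way to test matrix psd-ness):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

K1 = X @ X.T        # linear-kernel matrix
K2 = 2 * K1         # k2 = 2 * k1, also a kernel (scaling)

# Sum of kernels stays psd: smallest eigenvalue >= 0 (up to round-off).
print(np.linalg.eigvalsh(K1 + K2).min())

# Difference k1 - k2 = -k1 has negative eigenvalues: not a kernel.
print(np.linalg.eigvalsh(K1 - K2).min())
```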

## Positive definiteness

• A symmetric function $k : \mathcal X \times \mathcal X \to \mathbb R$ (i.e. $k(x, y) = k(y, x)$) is positive semi-definite (psd) if for all $n \ge 1$, $(a_1, \ldots, a_n) \in \mathbb R^n$, $(x_1, \ldots, x_n) \in \mathcal X^n$,
$$\sum_{i=1}^n \sum_{j=1}^n a_i a_j\, k(x_i, x_j) \ge 0$$
• Equivalently: the kernel matrix $K$, with $K_{ij} = k(x_i, x_j)$, is PSD
• Hilbert space kernels are psd: $\sum_{i,j} a_i a_j \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal H} = \big\lVert \sum_i a_i \phi(x_i) \big\rVert_{\mathcal H}^2 \ge 0$
• psd functions are Hilbert space kernels
• Moore-Aronszajn Theorem; we'll come back to this
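
To connect the definition to the kernel-matrix statement, here is a small numerical check (a sketch; the Gaussian kernel, point counts, and tolerance are arbitrary choices):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
K = gaussian_kernel_matrix(X)

# Definition: sum_ij a_i a_j k(x_i, x_j) >= 0 for any coefficients a.
for _ in range(1000):
    a = rng.normal(size=20)
    assert a @ K @ a >= -1e-10

# Equivalent matrix statement: all eigenvalues of K are nonnegative.
print(np.linalg.eigvalsh(K).min())
```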

## Some more ways to build kernels

• Limits: if $k(x, y) = \lim_{m \to \infty} k_m(x, y)$ exists, $k$ is psd
• Products: $k_1(x, y)\, k_2(x, y)$ is psd
  • Let $V \sim \mathcal N(0, K_1)$, $W \sim \mathcal N(0, K_2)$ be independent
  • Then $\operatorname{Cov}(V_i W_i, V_j W_j) = (K_1)_{ij} (K_2)_{ij}$
  • Covariance matrices are psd, so the entrywise product $K_1 \circ K_2$ is too
• Powers: $k(x, y)^p$ is psd for any integer $p \ge 0$

$$k_{\text{poly}}(x, y) = (x^\mathsf{T} y + c)^p \text{ for } c \ge 0 \text{: the polynomial kernel}$$

• Exponents: $\exp(k(x, y))$ is psd
  • Limit of the Taylor series: $\exp(k(x, y)) = \lim_{M \to \infty} \sum_{m=0}^{M} \frac{k(x, y)^m}{m!}$
• If $f : \mathcal X \to \mathbb R$, then $f(x)\, k(x, y)\, f(y)$ is psd
  • Use the feature map $x \mapsto f(x)\, \phi(x)$

$$k_\sigma(x, y) = \exp\!\Big( -\tfrac{1}{2\sigma^2} \lVert x - y \rVert^2 \Big) = e^{-\frac{\lVert x \rVert^2}{2\sigma^2}} \exp\!\Big( \tfrac{1}{\sigma^2} x^\mathsf{T} y \Big)\, e^{-\frac{\lVert y \rVert^2}{2\sigma^2}} \text{: the Gaussian kernel}$$
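
That factorization can be verified numerically (a sketch: $f(x) = e^{-\lVert x \rVert^2 / (2\sigma^2)}$ normalizes the exponentiated, scaled linear kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
sigma = 1.5

f = lambda z: np.exp(-(z @ z) / (2 * sigma**2))  # normalizer: f(x) k0(x,y) f(y)
k0 = np.exp((x @ y) / sigma**2)                  # exp of scaled linear kernel: psd

# f(x) * exp(x.y / sigma^2) * f(y) = exp(-||x - y||^2 / (2 sigma^2))
k_gauss = f(x) * k0 * f(y)
assert np.isclose(k_gauss, np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2)))
print(k_gauss)
```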

## Reproducing property

• Recall the original motivating example, with feature map $\phi : \mathcal X \to \mathcal H$
  • Kernel is $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H}$
  • Classifier based on linear $f(x) = \langle w, \phi(x) \rangle_{\mathcal H}$
• $f$ is the function itself; corresponds to the vector $w$ in $\mathcal H$
  • $f(x)$ is the function evaluated at a point $x$
• Elements of $\mathcal H$ are functions, $f : \mathcal X \to \mathbb R$
• Reproducing prop.: $f(x) = \langle f, \phi(x) \rangle_{\mathcal H}$ for $f \in \mathcal H$

## Reproducing kernel Hilbert space (RKHS)

• Every psd kernel $k$ on $\mathcal X$ defines a (unique) Hilbert space, its RKHS $\mathcal H_k$, and a map $\phi : \mathcal X \to \mathcal H_k$ where
  • $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H_k}$
  • Elements $f \in \mathcal H_k$ are functions on $\mathcal X$, with $f(x) = \langle f, \phi(x) \rangle_{\mathcal H_k}$
• Combining the two, we sometimes write $\phi(x) = k(x, \cdot)$, so $k(x, y) = \langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal H_k}$
• $\delta_x : f \mapsto f(x)$ is the evaluation functional
  • An RKHS is defined by $\delta_x$ being continuous, or equivalently bounded: $\lvert f(x) \rvert \le M_x \lVert f \rVert_{\mathcal H_k}$

## Moore-Aronszajn Theorem

• Building $\mathcal H_k$ for a given psd $k$:
  • Define $\mathcal H_0 = \operatorname{span}\big( \{ k(x, \cdot) : x \in \mathcal X \} \big)$, with inner product from $\big\langle \sum_i a_i k(x_i, \cdot),\, \sum_j b_j k(y_j, \cdot) \big\rangle = \sum_{i,j} a_i b_j\, k(x_i, y_j)$
  • Take $\mathcal H_k$ to be the completion of $\mathcal H_0$ in the metric from this inner product
  • Get that the reproducing property holds for $k(x, \cdot)$ in $\mathcal H_k$
  • Can also show uniqueness
• Theorem: $k$ is psd iff it's the reproducing kernel of an RKHS
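
A sketch of this construction on finite spans (the points, coefficients, and the Gaussian kernel are arbitrary illustrative choices):

```python
import numpy as np

def k(x, y, sigma=1.0):
    # Gaussian kernel, so k(z, z) = 1 for every z
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
Xs, a = rng.normal(size=(4, 2)), rng.normal(size=4)   # f = sum_i a_i k(x_i, .)
Ys, b = rng.normal(size=(3, 2)), rng.normal(size=3)   # g = sum_j b_j k(y_j, .)

# Pre-RKHS inner product: <f, g> = sum_ij a_i b_j k(x_i, y_j)
K_xy = np.array([[k(xi, yj) for yj in Ys] for xi in Xs])
print("<f, g> =", a @ K_xy @ b)

# Norm from the same inner product: ||f||^2 = a^T K_xx a
K_xx = np.array([[k(xi, xj) for xj in Xs] for xi in Xs])
norm_f = np.sqrt(a @ K_xx @ a)

# Reproducing property + Cauchy-Schwarz: |f(z)| <= ||f|| sqrt(k(z, z))
z = rng.normal(size=2)
f_z = sum(ai * k(xi, z) for ai, xi in zip(a, Xs))
assert abs(f_z) <= norm_f * np.sqrt(k(z, z)) + 1e-10
```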

## A quick check: linear kernels

• $k(x, y) = x^\mathsf{T} y$ on $\mathcal X = \mathbb R^d$
• $k(x, \cdot) = x^\mathsf{T} (\cdot)$ “corresponds to” the vector $x$
• If $f = \sum_i a_i k(x_i, \cdot)$, then $f(y) = \big( \sum_i a_i x_i \big)^\mathsf{T} y = w^\mathsf{T} y$, with $\lVert f \rVert_{\mathcal H_k} = \lVert w \rVert_2$
• Closure doesn't add anything here, since $\mathbb R^d$ is closed
• So, the linear kernel gives you the RKHS of linear functions

## More complicated: Gaussian kernels

• $\mathcal H_k$ is infinite-dimensional
• Functions in $\mathcal H_k$ are bounded: $\lvert f(x) \rvert = \lvert \langle f, k(x, \cdot) \rangle_{\mathcal H_k} \rvert \le \lVert f \rVert_{\mathcal H_k} \sqrt{k(x, x)} = \lVert f \rVert_{\mathcal H_k}$
• Choice of $\sigma$ controls how fast functions can vary: $\lvert f(x) - f(y) \rvert \le \lVert f \rVert_{\mathcal H_k} \lVert k(x, \cdot) - k(y, \cdot) \rVert_{\mathcal H_k} = \lVert f \rVert_{\mathcal H_k} \sqrt{2 - 2 k(x, y)}$
• Can say lots more with Fourier properties
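
A numerical check of that variation bound (a sketch; the centers, coefficients, and bandwidth are arbitrary):

```python
import numpy as np

def k(x, y, sigma):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
sigma = 0.5
Xs, a = rng.normal(size=(5, 2)), rng.normal(size=5)

f = lambda z: sum(ai * k(xi, z, sigma) for ai, xi in zip(a, Xs))
K = np.array([[k(xi, xj, sigma) for xj in Xs] for xi in Xs])
norm_f = np.sqrt(a @ K @ a)          # RKHS norm of f = sum_i a_i k(x_i, .)

# |f(x) - f(y)| <= ||f||_H ||k(x,.) - k(y,.)||_H = ||f||_H sqrt(2 - 2 k(x,y))
x, y = rng.normal(size=2), rng.normal(size=2)
bound = norm_f * np.sqrt(2 - 2 * k(x, y, sigma))
assert abs(f(x) - f(y)) <= bound + 1e-10
print(abs(f(x) - f(y)), bound)
```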

## Kernel ridge regression

$$\hat f = \operatorname*{argmin}_{f \in \mathcal H_k} \frac{1}{n} \sum_{i=1}^n \big( f(x_i) - y_i \big)^2 + \lambda \lVert f \rVert_{\mathcal H_k}^2$$

The linear kernel gives normal ridge regression; nonlinear kernels will give nonlinear regression!

How to find $\hat f$? Representer Theorem: the minimizer has the form $\hat f = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$:

• Let $\mathcal H_X = \operatorname{span}\big( \{ k(x_i, \cdot) \}_{i=1}^n \big)$,
  $\mathcal H_\perp$ its orthogonal complement in $\mathcal H_k$
• Decompose $f = f_X + f_\perp$ with $f_X \in \mathcal H_X$, $f_\perp \in \mathcal H_\perp$
• $f(x_i) = \langle f_X + f_\perp, k(x_i, \cdot) \rangle_{\mathcal H_k} = f_X(x_i)$, while $\lVert f \rVert^2 = \lVert f_X \rVert^2 + \lVert f_\perp \rVert^2$
• Minimizer needs $f_\perp = 0$, and so $\hat f = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$

Setting the derivative of $\frac{1}{n} \lVert K \alpha - y \rVert^2 + \lambda\, \alpha^\mathsf{T} K \alpha$ to zero gives $K (K + n \lambda I)\, \alpha = K y$,
satisfied by $\alpha = (K + n \lambda I)^{-1} y$
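
Putting the pieces together, kernel ridge regression is a few lines of linear algebra (a minimal sketch with a Gaussian kernel on a toy 1-D dataset; all names and constants are illustrative):

```python
import numpy as np

def gaussian_K(A, B, sigma=1.0):
    # Pairwise kernel matrix: K[i, j] = k(a_i, b_j)
    sq = np.sum(A**2, 1)[:, None] - 2 * A @ B.T + np.sum(B**2, 1)[None, :]
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# alpha = (K + n lambda I)^{-1} y, from the derivation above
n, lam = len(X), 1e-2
alpha = np.linalg.solve(gaussian_K(X, X) + n * lam * np.eye(n), y)

# Predictions: f_hat(x) = sum_i alpha_i k(x_i, x)
X_test = np.linspace(-3, 3, 7)[:, None]
print(gaussian_K(X_test, X) @ alpha)   # compare to np.sin(X_test[:, 0])
```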

## Other kernel algorithms

• Representer theorem applies to $\min_f L\big(f(x_1), \ldots, f(x_n)\big) + R\big(\lVert f \rVert_{\mathcal H_k}\big)$ if $R$ is strictly increasing in $\lVert f \rVert_{\mathcal H_k}$
• Kernel methods can then train based on the kernel matrix $K$
• Classification algorithms:
  • Support vector machines: $L$ is the hinge loss, $\max(0, 1 - y_i f(x_i))$
  • Kernel logistic regression: $L$ is the logistic loss, $\log(1 + e^{-y_i f(x_i)})$
• Principal component analysis, canonical correlation analysis
• Many, many more…
• But not everything works… e.g. the Lasso's $\ell_1$ regularizer isn't a function of $\lVert f \rVert_{\mathcal H_k}$
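
For instance, scikit-learn's SVC accepts a precomputed kernel matrix directly, so any psd kernel plugs in (a sketch, assuming scikit-learn is installed; the data and kernel here are synthetic examples):

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_K(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] - 2 * A @ B.T + np.sum(B**2, 1)[None, :]
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.sum(X**2, axis=1) > 2).astype(int)      # nonlinear decision boundary

# Train from the kernel matrix alone, as the representer theorem suggests
clf = SVC(kernel="precomputed").fit(gaussian_K(X, X), y)

# At test time, we only need k(x_new, x_i) against the training points
X_new = rng.normal(size=(5, 2))
print(clf.predict(gaussian_K(X_new, X)))
```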

## Some theory: generalization

• The Rademacher complexity of $\mathcal F_B = \{ f : \lVert f \rVert_{\mathcal H_k} \le B \}$ on $n$ samples is upper-bounded by $B \sqrt{\tau^2 / n}$ if $k(x, x) \le \tau^2$
• Implies for $\rho$-Lipschitz losses that the generalization gap is $\mathcal O\big( \rho B \tau / \sqrt{n} \big)$, uniformly over $\mathcal F_B$
• Same kind of rates with stability-based analyses
• Implies that, if the “truth” is low-norm, most kernel methods aren't too suboptimal
• Difficulty of learning is controlled by RKHS norm of target
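
A sketch of the standard argument behind that complexity bound (details beyond the slide, using $k(x, x) \le \tau^2$):

$$
\frac{1}{n}\, \mathbb E_\sigma \sup_{\lVert f \rVert \le B} \sum_{i=1}^n \sigma_i f(x_i)
= \frac{1}{n}\, \mathbb E_\sigma \sup_{\lVert f \rVert \le B} \Big\langle f,\, \sum_{i=1}^n \sigma_i k(x_i, \cdot) \Big\rangle
= \frac{B}{n}\, \mathbb E_\sigma \Big\lVert \sum_{i=1}^n \sigma_i k(x_i, \cdot) \Big\rVert
\le \frac{B}{n} \sqrt{\sum_{i=1}^n k(x_i, x_i)}
\le B \sqrt{\frac{\tau^2}{n}}
$$

The second equality uses the reproducing property and Cauchy-Schwarz; the first inequality is Jensen's, since the cross terms $\mathbb E[\sigma_i \sigma_j] = 0$ vanish for $i \ne j$.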

## Some theory: universality

• One definition: a continuous kernel $k$ on a compact metric space $\mathcal X$ is universal if $\mathcal H_k$ is $L_\infty$-dense in $C(\mathcal X)$:
  for every continuous $g : \mathcal X \to \mathbb R$, for every $\varepsilon > 0$, there is an $f \in \mathcal H_k$ with $\lVert f - g \rVert_\infty \le \varepsilon$
• Implies that, on compact $\mathcal X$, $\mathcal H_k$ can separate compact sets
  • $f \in \mathcal H_k$ with $f(x) > 0$ for $x \in X_1$, $f(x) < 0$ for $x \in X_2$
• Which implies there are $f \in \mathcal H_k$ with arbitrarily small loss
• Might take arbitrarily large norm: approximation/estimation tradeoff
• Can prove via Stone-Weierstrass or Fourier properties
• Never true for finite-dim kernels: need $\dim(\mathcal H_k) = \infty$

## Translation-invariant kernels on $\mathbb R^d$

• Assume $k(x, y) = \psi(x - y)$ is bounded, continuous, and translation invariant
• Then $\psi$ is proportional to the Fourier transform of a probability measure (Bochner's theorem)
• If $\psi \in L_1(\mathbb R^d)$, the measure has a density
• If that density is positive everywhere, $k$ is universal
  • For all nonzero finite signed measures $\mu$, $\iint k(x, y) \,\mathrm d\mu(x) \,\mathrm d\mu(y) > 0$
• True for the Gaussian, $\exp\big( -\tfrac{1}{2\sigma^2} \lVert x - y \rVert_2^2 \big)$, and Laplace, $\exp\big( -\tfrac{1}{\sigma} \lVert x - y \rVert_1 \big)$
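
Bochner's theorem is also practical: sampling frequencies from the kernel's spectral measure gives random Fourier features whose inner products approximate $k$ (a sketch of the Rahimi-Recht construction for the Gaussian kernel; $D$, the number of features, is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 3, 5000, 1.0

# The Gaussian kernel's spectral measure is N(0, I / sigma^2); sample from it
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2 * np.pi, size=D)

def features(x):
    # z(x) with E[z(x) . z(y)] = exp(-||x - y||^2 / (2 sigma^2))
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
print(features(x) @ features(y))                          # approximation
print(np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2)))     # exact kernel
```

Linear models on such features run in time linear in $n$, previewing the approximation methods mentioned below.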

## Limitations of kernel-based learning

• Generally bad at learning sparsity
  • e.g. a target depending on only a few of the $d$ input coordinates, for large $d$
• Provably statistically slower than deep learning for a few problems
  • e.g. to learn a single ReLU, $x \mapsto \max(0, w^\mathsf{T} x)$, need RKHS norm exponential in the dimension $d$
• Also some hierarchical problems, etc
• $\mathcal O(n^3)$ computational complexity, $\mathcal O(n^2)$ memory for exact solutions
• Various approximations you can make
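
One such approximation is the Nyström method: build a low-rank surrogate for $K$ from a few landmark points, needing only $\mathcal O(nm)$ kernel evaluations (a sketch; the landmark count and data are arbitrary):

```python
import numpy as np

def gaussian_K(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] - 2 * A @ B.T + np.sum(B**2, 1)[None, :]
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
n, m = 2000, 100                    # m << n landmark points
X = rng.normal(size=(n, 3))
Z = X[rng.choice(n, size=m, replace=False)]

# Nystrom: K ~= K_nm pinv(K_mm) K_nm^T, a rank-m approximation
K_nm = gaussian_K(X, Z)
K_mm = gaussian_K(Z, Z)
K_approx = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T

i, j = 3, 17                        # spot-check one entry against exact k
print(K_approx[i, j], gaussian_K(X[i:i+1], X[j:j+1])[0, 0])
```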

## Relationship to deep learning

• Deep models usually end as $f(x) = w^\mathsf{T} h(x)$ for a learned representation $h$
  • Can think of $h$ as a learned feature map, with kernel $k(x, y) = h(x)^\mathsf{T} h(y)$
• Does this gain us anything?
• Random nets with trained last layer (NNGP) can be decent
• As width $\to \infty$, nets become the neural tangent kernel (NTK)
• Widely used for theoretical analysis… more tomorrow
• SVMs with NTK can be great on small data
• Inspiration: learn the kernel model end-to-end
• Ongoing area; good results in two-sample testing, GANs, density estimation, meta-learning, semi-supervised learning, …
• Explored a bit in interactive session!

## What's next

• After break: interactive session exploring kernels w/ ridge regression
• Tomorrow: a subset of
• Representing distributions
• Uses for statistical testing + generative models
• Connections to Gaussian processes, probabilistic numerics
• Approximation methods for faster computation
• Deeper connection to deep learning
• More details on basics:
  • Berlinet and Thomas-Agnan, *Reproducing Kernel Hilbert Spaces in Probability and Statistics*
  • Steinwart and Christmann, *Support Vector Machines*