\definecolor{cb1}{RGB}{76,114,176} \definecolor{cb2}{RGB}{221,132,82} \definecolor{cb3}{RGB}{85,168,104} \definecolor{cb4}{RGB}{196,78,82} \definecolor{cb5}{RGB}{129,114,179} \definecolor{cb6}{RGB}{147,120,96} \definecolor{cb7}{RGB}{218,139,195} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator{\bigO}{\mathcal O} \newcommand{\cH}{\mathcal{H}} \newcommand{\cX}{\mathcal{X}} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\E}{\mathbb{E}} \newcommand{\HS}{\mathrm{HS}} \DeclareMathOperator{\HSIC}{HSIC} \DeclareMathOperator{\mean}{mean} \DeclareMathOperator{\MMD}{MMD} \DeclareMathOperator{\MMDhat}{\widehat{MMD}} \newcommand{\R}{\mathbb{R}} \DeclareMathOperator{\sign}{sign} \DeclareMathOperator{\span}{span} \newcommand{\tp}{^\mathsf{T}} \DeclareMathOperator{\Tr}{Tr} \DeclareMathOperator{\Var}{Var} % \newcommand{\indep}{\mathrel{\unicode{x2AEB}}} \newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\cP}[1]{{\color{cb1} #1}} \newcommand{\PP}{\cP{\mathbb{P}}} \newcommand{\pp}{\cP{p}} \newcommand{\X}{\cP{X}} \newcommand{\Xi}{\cP{X_i}} \newcommand{\Xp}{\cP{X'}} \newcommand{\Xpp}{\cP{X''}} \newcommand{\Hx}{\cP{\cH_x}} \newcommand{\kx}{\cP{k_x}} \newcommand{\fc}{\cP{f}} \newcommand{\muP}{\cP{\mu_{\mathbb P}}} \newcommand{\Pdata}{\cP{\mathbb{P}_\mathrm{data}}} \newcommand{\cQ}[1]{{\color{cb2} #1}} \newcommand{\QQ}{\cQ{\mathbb{Q}}} \newcommand{\qq}{\cQ{q}} \newcommand{\qtheta}{\cQ{q_\theta}} \newcommand{\Y}{\cQ{Y}} \newcommand{\Yp}{\cQ{Y'}} \newcommand{\Ypp}{\cQ{Y''}} \newcommand{\Yj}{\cQ{Y_j}} \newcommand{\thetac}{\cQ{\theta}} \newcommand{\vtheta}{\thetac} \newcommand{\Qtheta}{\QQ_\thetac} \newcommand{\Gtheta}{\cQ{G_\theta}} \newcommand{\Hy}{\cQ{\cH_y}} \newcommand{\ky}{\cQ{k_y}} \newcommand{\gc}{\cQ{g}} \newcommand{\muQ}{\cQ{\mu_{\mathbb Q}}} \newcommand{\abs}[1]{\lvert #1 \rvert} \newcommand{\Abs}[1]{\left\lvert #1 \right\rvert} \newcommand{\norm}[1]{\lVert #1 \rVert} \newcommand{\Norm}[1]{\left\lVert #1 \right\rVert} 
\newcommand{\hnorm}[2][\cH]{\norm{#2}_{#1}} \newcommand{\hNorm}[2][\cH]{\Norm{#2}_{#1}} \newcommand{\inner}[2]{\langle #1, #2 \rangle} \newcommand{\Inner}[2]{\left\langle #1, #2 \right\rangle} \newcommand{\hinner}[3][\cH]{\inner{#2}{#3}_{#1}} \newcommand{\hInner}[3][\cH]{\Inner{#2}{#3}_{#1}} \newcommand{\cpsi}[1]{{\color{cb3} #1}} \newcommand{\psic}{\cpsi{\psi}} \newcommand{\Psic}{\cpsi{\Psi}} \newcommand{\Dpsi}{\cpsi{D_\psi}}

Modern Kernel Methods
in Machine Learning:
Part II

Danica J. Sutherland (she/her)
Computer Science, University of British Columbia
ETICS "summer" school, Oct 2022


Yesterday, we saw:

  • RKHS \cH is a function space, f : \cX \to \R
  • Reproducing property: \hinner{f}{k(x, \cdot)} = f(x)
  • Representer theorem: \argmin L(f(x_1), \dots, f(x_n)) + R(\hnorm f) \in \span\{ k(x_i, \cdot) \}_{i=1}^n
  • Can use to do kernel ridge regression, SVMs, etc


  • Kernel mean embeddings of distributions
  • Gaussian processes and probabilistic numerics
  • Kernel approximations, for better computation
  • Neural tangent kernels

Mean embeddings of distributions

  • Represent point x \in \cX as k(x, \cdot) : \quad f(x) = \hinner{f}{k(x, \cdot)}
  • Represent distribution \PP as \muP : \quad \E_{\X \sim \PP} f(\X) = \hinner{f}{\muP}
    \E_{\X \sim \PP} f(\X) = \E_{\X \sim \PP} \hinner{f}{k(\X, \cdot)} = \hInner{f}{\fragment[][highlight-blue]{\E_{\X \sim \PP} k(\X, \cdot)}}
    • Last step assumed e.g. \E \sqrt{k(\X, \X)} < \infty
  • \hinner{\muP}{\muQ} = \E_{\X \sim \PP, \Y \sim \QQ} k(\X, \Y)
  • Okay. Why?
    • One reason: ML on distributions [Szabó+ JMLR-16]
    • More common reason: comparing distributions

Maximum Mean Discrepancy

\begin{align*} \MMD(\PP, \QQ) &= \hnorm{\muP - \muQ} \\&\fragment[0]{= \sup_{\hnorm{f} \le 1} \hinner{f}{\muP - \muQ}} \\&\fragment[1]{= \sup_{\hnorm{f} \le 1} \E_{X \sim \PP} f(X) - \E_{Y \sim \QQ} f(Y)} \end{align*}

  • Last line is Integral Probability Metric (IPM) form
  • f is called “witness function” or “critic”: high on \PP , low on \QQ
    f^*(t) \propto \hinner{\muP - \muQ}{k(t, \cdot)} = \E_{\X \sim \PP} k(t, \X) - \E_{\Y \sim \QQ} k(t, \Y)
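
The witness formula above can be evaluated directly from samples by replacing each expectation with a sample mean. The following is a minimal sketch, assuming a Gaussian kernel with bandwidth 1 and two 1-D Gaussian samples; `gaussian_kernel` and `empirical_witness` are illustrative names, not from the slides.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=1.0):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * bandwidth**2))


def empirical_witness(t, X, Y, bandwidth=1.0):
    """Unnormalized witness at points t: mean_i k(t, X_i) - mean_j k(t, Y_j)."""
    return (
        gaussian_kernel(t, X, bandwidth).mean(axis=1)
        - gaussian_kernel(t, Y, bandwidth).mean(axis=1)
    )


rng = np.random.default_rng(0)
X = rng.normal(loc=-1.0, size=(500, 1))  # sample from P, centered at -1
Y = rng.normal(loc=+1.0, size=(500, 1))  # sample from Q, centered at +1

t = np.array([[-1.0], [0.0], [1.0]])
w = empirical_witness(t, X, Y)
print(w)  # positive near -1 (more P mass), negative near +1 (more Q mass)
```

The witness is left unnormalized here; dividing by the estimated MMD would give the unit-norm critic.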

MMD properties

  • \MMD(\PP, \PP) = 0 , symmetry, triangle inequality
  • If k is characteristic, then \MMD(\PP, \QQ) = 0 iff \PP = \QQ
    • i.e. \PP \mapsto \muP is injective
    • Makes MMD a metric on probability distributions
    • Universal => characteristic
  • Linear kernel: \MMD(\PP, \QQ) = \hnorm{\muP - \muQ} is just Euclidean distance between means

Application: Kernel Herding

  • Want a "super-sample" from \PP : \frac1T \sum_{i=1}^T f(\Xi) \approx \E f(\X)
  • If f \in \cH , error \le \hnorm f \MMD(\PP, \frac1T \sum_{i=1}^T \delta_{\Xi})
  • Greedily minimize the MMD: \displaystyle\!\!\!\X_{\cP{T+1}} \in \argmax_{\X \in \cX} \E_{\Xp \sim \PP} k(\X, \Xp) - \frac{1}{T+1} \sum_{i=1}^T k(\X, \Xi)
    • Maximizing this is the same as minimizing the MMD over the new point when k(\X, \X) is constant
  • Get \bigO(1 / T) approximation instead of \bigO(1 / \sqrt T) with random samples
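
A rough sketch of the greedy herding step, under two simplifying assumptions that are not in the slides: \PP is approximated by the uniform distribution on a large pool of samples, and the search over \cX is restricted to that same pool. With a Gaussian kernel, k(x, x) is constant, so decreasing the MMD greedily amounts to maximizing \E k(x, X') - \frac{1}{T+1} \sum_i k(x, X_i). The names `herd` and `mmd_to_pool` are hypothetical.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=1.0):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * bandwidth**2))


def herd(K, n_select):
    """Greedy herding over a finite candidate pool with kernel matrix K."""
    mean_k = K.mean(axis=1)        # stands in for E_{X' ~ P} k(x, X')
    sum_k = np.zeros(K.shape[0])   # running sum_i k(x, X_i) over chosen points
    chosen = []
    for T in range(n_select):      # T points have been chosen so far
        scores = mean_k - sum_k / (T + 1)
        idx = int(np.argmax(scores))
        chosen.append(idx)
        sum_k += K[:, idx]
    return np.array(chosen)


def mmd_to_pool(K, subset):
    """MMD between the pool's empirical distribution and a subset of it."""
    mmd2 = K.mean() - 2 * K[:, subset].mean() + K[np.ix_(subset, subset)].mean()
    return np.sqrt(max(mmd2, 0.0))


rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 2))
K = gaussian_kernel(pool, pool)

T = 20
herded = herd(K, T)
herded_mmd = mmd_to_pool(K, herded)
random_mmd = np.mean(
    [mmd_to_pool(K, rng.choice(len(pool), size=T, replace=False)) for _ in range(20)]
)
print(herded_mmd, random_mmd)
```

The comparison at the end contrasts the herded subset with random subsets of the same size.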

Estimating MMD from samples

\begin{gather}
\MMD_k^2(\PP, \QQ) = \E_{\substack{\X, \Xp \sim \PP\\\Y, \Yp \sim \QQ}}\left[ k(\X, \Xp) - 2 k(\X, \Y) + k(\Y, \Yp) \right]
\\ \fragment[0]{ \MMDhat_k^2(\X, \Y) = \fragment[1][highlight-current-red]{\mean(K_{\X\X})} + \fragment[2][highlight-current-red]{\mean(K_{\Y\Y})} - 2 \fragment[3][highlight-current-red]{\mean(K_{\X\Y})} }
\end{gather}
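
The plug-in estimator above is only a few lines of NumPy. This is a minimal sketch, assuming a Gaussian kernel with bandwidth 1 (the slides do not fix a kernel here); `mmd2_biased` is an illustrative name. Diagonal terms are included in the means, so this is the simple biased version.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=1.0):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * bandwidth**2))


def linear_kernel(A, B):
    """K[i, j] = <A[i], B[j]>."""
    return A @ B.T


def mmd2_biased(X, Y, kernel):
    """mean(K_XX) + mean(K_YY) - 2 mean(K_XY), diagonals included."""
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))            # sample from P
X2 = rng.normal(size=(300, 2))           # a second sample from P
Y = rng.normal(loc=1.0, size=(300, 2))   # sample from a shifted Q

mmd2_same = mmd2_biased(X, X2, gaussian_kernel)  # small, but not exactly 0
mmd2_diff = mmd2_biased(X, Y, gaussian_kernel)   # clearly larger

# Sanity check: with a linear kernel, the estimate is the squared
# Euclidean distance between the two sample means.
mmd2_linear = mmd2_biased(X, Y, linear_kernel)
mean_gap2 = np.sum((X.mean(axis=0) - Y.mean(axis=0)) ** 2)
print(mmd2_same, mmd2_diff, mmd2_linear, mean_gap2)
```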




MMD vs other distances

  • MMD has easy \bigO(n^2) estimator
    • block or incomplete estimators are \bigO(n^\alpha) for \alpha \in [1, 2] , but noisier
  • For bounded kernel, \bigO_p(1 / \sqrt n) estimation error
    • Independent of data dimension!
    • But, no free lunch…the value of the MMD generally shrinks with growing dimension, so constant \bigO_p(1 / \sqrt n) error gets worse relatively

Application: Two-sample testing

  • Given samples from two unknown distributions \X \sim \PP \qquad \Y \sim \QQ
  • Question: is \PP = \QQ ?
  • Hypothesis testing approach: H_0: \PP = \QQ \qquad H_1: \PP \ne \QQ
  • Reject H_0 if \MMDhat(\X, \Y) > c_\alpha
  • Do smokers/non-smokers get different cancers?
  • Do Brits have the same friend network types as Americans?
  • When does my laser agree with the one on Mars?
  • Are storms in the 2000s different from storms in the 1800s?
  • Does presence of this protein affect DNA binding? [MMDiff2]
  • Do these dob and birthday columns mean the same thing?
  • Does my generative model \Qtheta match \Pdata ?

What's a hypothesis test again?

MMD-based testing

  • H_0 : n \MMDhat^2 converges in distribution to…something
    • Infinite mixture of \chi^2 s, params depend on \PP and k
    • Can estimate threshold with permutation testing
  • H_1 : \sqrt n (\MMDhat^2 - \MMD^2) \stackrel{d}{\to} \mathcal N(0, \sigma_{H_1}^2) , i.e. asymptotically normal
  • Any characteristic kernel gives consistent test…eventually
  • Need enormous n if kernel is bad for problem
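
Permutation testing for the threshold can be sketched as follows: pool the two samples, repeatedly reshuffle which points count as \X and which as \Y, and compare the observed statistic to the reshuffled ones. This is a minimal sketch, assuming a Gaussian kernel, the biased estimator, and 200 permutations; the function name and defaults are illustrative choices.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=1.0):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * bandwidth**2))


def mmd_permutation_test(X, Y, n_perms=200, alpha=0.05, seed=0):
    """Return (observed biased MMD^2, permutation p-value, reject H_0?)."""
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y])
    K = gaussian_kernel(Z, Z)  # computed once; permutations only re-index it
    n = len(X)

    def statistic(idx):
        ix, iy = idx[:n], idx[n:]
        return (
            K[np.ix_(ix, ix)].mean()
            + K[np.ix_(iy, iy)].mean()
            - 2 * K[np.ix_(ix, iy)].mean()
        )

    observed = statistic(np.arange(len(Z)))
    null = np.array([statistic(rng.permutation(len(Z))) for _ in range(n_perms)])
    # The +1 terms count the observed statistic as one of the permutations.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perms)
    return observed, p_value, p_value <= alpha


rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))
Y = rng.normal(loc=1.0, size=(100, 1))  # shifted by one standard deviation
observed, p_value, reject = mmd_permutation_test(X, Y)
print(observed, p_value, reject)
```

Rejecting when the p-value is at most \alpha plays the role of comparing \MMDhat to c_\alpha, with the threshold taken from the permuted statistics.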

Classifier two-sample tests

  • Train a classifier f to tell \X from \Y ; \hat T(\X, \Y) is the accuracy of f on the test set
  • Under H_0 , classification impossible: n \hat T \sim \mathrm{Binomial}(n, \frac12) for n test points
  • With k(x, y) = \frac14 f(x) f(y) where f(x) \in \{0, 1\} (equal-size test samples),
    get \MMDhat(\X, \Y) = \left\lvert \hat{T}(\X, \Y) - \frac12 \right\rvert
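
The link between classifier accuracy and the MMD can be checked numerically. The sketch below assumes 0/1 indicator outputs (1 meaning "predicted to come from \PP") and equal-size test samples; with \pm 1 outputs the same computation gives twice this value. The threshold classifier `f` is a hypothetical stand-in for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X_test = rng.normal(loc=+0.5, size=n)  # held-out points from P
Y_test = rng.normal(loc=-0.5, size=n)  # held-out points from Q


def f(x):
    """Hypothetical fixed classifier: 1 = 'looks like P', 0 = 'looks like Q'."""
    return (x > 0).astype(float)


# Accuracy on the balanced test set: P points should get 1, Q points 0.
accuracy = 0.5 * (f(X_test).mean() + (1 - f(Y_test)).mean())


def k(a, b):
    """Rank-one kernel k(x, y) = f(x) f(y) / 4."""
    return 0.25 * np.outer(f(a), f(b))


mmd2 = (
    k(X_test, X_test).mean()
    + k(Y_test, Y_test).mean()
    - 2 * k(X_test, Y_test).mean()
)
mmd = np.sqrt(max(mmd2, 0.0))
print(mmd, abs(accuracy - 0.5))  # the two values agree
```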

Deep learning and deep kernels

  • k(x, y) = \tfrac14 f(x) f(y) is one form of deep kernel
  • Deep models are usually of the form f(x) = w\tp \phi_\psic(x)
    • With a learned \phi_\psic(x) : \mathcal X \to \R^{D}
  • If we fix \psic , have f \in \cH_\psic with k_\psic(x, y) = \phi_\psic(x)\tp \phi_\psic(y)
    • Same idea as NNGP approximation
  • Generalize to a deep kernel: k_\psic(x, y) = \kappa\left( \phi_\psic(x), \phi_\psic(y) \right)
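
To make the composition k_\psi(x, y) = \kappa(\phi_\psi(x), \phi_\psi(y)) concrete, here is a minimal sketch with a tiny, untrained two-layer feature map and a Gaussian \kappa. The architecture, initialization, and names (`init_psi`, `phi`, `deep_kernel`) are illustrative assumptions; in practice \psi would be learned.

```python
import numpy as np


def init_psi(rng, d_in, d_hidden, d_out):
    """Random (untrained) parameters psi for the feature map."""
    return {
        "W1": rng.normal(size=(d_in, d_hidden)) / np.sqrt(d_in),
        "W2": rng.normal(size=(d_hidden, d_out)) / np.sqrt(d_hidden),
    }


def phi(psi, X):
    """Feature map phi_psi: R^{d_in} -> R^{d_out}."""
    return np.tanh(X @ psi["W1"]) @ psi["W2"]


def deep_kernel(psi, X, Y, bandwidth=1.0):
    """Gaussian kappa applied to learned features: kappa(phi(x), phi(y))."""
    FX, FY = phi(psi, X), phi(psi, Y)
    sq_dists = (
        np.sum(FX**2, axis=1)[:, None]
        + np.sum(FY**2, axis=1)[None, :]
        - 2 * FX @ FY.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * bandwidth**2))


rng = np.random.default_rng(0)
psi = init_psi(rng, d_in=10, d_hidden=32, d_out=4)
X = rng.normal(size=(5, 10))
K = deep_kernel(psi, X, X)
print(K.shape)
```

Because \kappa is a valid kernel on feature space, k_\psi is a valid kernel on inputs for any fixed \psi; the matrix `K` is symmetric and positive semi-definite.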

Normal deep learning \subset deep kernels

  • Take k_\psic(x, y) = \tfrac14 f_\psic(x) f_\psic(y) \fragment[1]{+ 1}
  • Final function in \cH_\psic will be a f_\psic(x) \fragment[1]{+ b}
  • With logistic loss: this is Platt scaling

“Normal deep learning \subset deep kernels” – so?

  • This definitely does not say that deep learning is (even approximately) a kernel method
  • …despite what some people might want you to think
  • We know theoretically deep learning can learn some things faster than any kernel method [see Malach+ ICML-21 + refs]
  • But deep kernel learning ≠ traditional kernel models
    • exactly like how usual deep learning ≠ linear models

Optimizing power of MMD tests

  • Asymptotics of \MMDhat^2 give us immediately that \Pr_{H_1}\left( n \MMDhat^2 > c_\alpha \right) \approx \Phi\left( \frac{\sqrt n \MMD^2}{\sigma_{H_1}} - \frac{c_\alpha}{\sqrt n \sigma_{H_1}} \right)
    • \MMD , \sigma_{H_1} , c_\alpha are constants: first term usually dominates
  • Pick k to maximize an estimate of \MMD^2 / \sigma_{H_1}
  • Use \MMDhat from before, get \hat\sigma_{H_1} from U-statistic theory
  • Can show uniform \mathcal O_P(n^{-\frac13}) convergence of estimator
  • Get better tests (even after data splitting)
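
A rough sketch of the selection criterion, used here to pick a Gaussian bandwidth from a small grid. It assumes equal sample sizes and uses one plug-in variance estimate of the kind U-statistic theory suggests, built from H_{ij} = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(y_i, x_j); this may differ in detail from the estimator the slides have in mind. The regularizer `reg` and all function names are assumptions.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=1.0):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * bandwidth**2))


def power_criterion(X, Y, bandwidth, reg=1e-8):
    """Estimate MMD^2 / sigma_{H_1}; assumes len(X) == len(Y)."""
    n = len(X)
    K_XY = gaussian_kernel(X, Y, bandwidth)
    H = (
        gaussian_kernel(X, X, bandwidth)
        + gaussian_kernel(Y, Y, bandwidth)
        - K_XY
        - K_XY.T
    )
    # U-statistic estimate of MMD^2: average H over pairs i != j.
    mmd2 = (H.sum() - np.trace(H)) / (n * (n - 1))
    # Plug-in estimate of sigma_{H_1}^2 = 4 Var_z[ E_{z'} h(z, z') ].
    row_sums = H.sum(axis=1)
    var = 4 / n**3 * np.sum(row_sums**2) - 4 / n**4 * H.sum() ** 2
    return mmd2 / np.sqrt(max(var, 0.0) + reg)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Y = rng.normal(loc=1.0, size=(200, 1))

bandwidths = [0.1, 0.3, 1.0, 3.0, 10.0]
scores = [power_criterion(X, Y, bw) for bw in bandwidths]
best_bandwidth = bandwidths[int(np.argmax(scores))]
print(best_bandwidth, scores)
```

In a real test the kernel would be chosen on a training split and the test run on held-out data, so that the selection step does not invalidate the threshold.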

Application: (S)MMD GANs

  • An implicit generative model:
    • A generator net outputs samples from \Qtheta
    • Minimize estimate of \MMD_{\psic}(\PP^m, \Qtheta^n) on a minibatch
  • MMD GAN: \min_{\vtheta} \left[ \max_{\psic} \MMD_{\psic}(\PP, \Qtheta) \right]
  • SMMD GAN: \min_{\vtheta} \left[ \max_{\psic} \textcolor{red}{\mathrm S}\!\MMD_{\psic}(\PP, \Qtheta) \right]
    • Scaled MMD uses kernel properties to ensure smooth loss for \vtheta by making witness function smooth [Arbel+ NeurIPS-18]
    • Uses \hinner{f}{\partial_{x_1} k(x, \cdot)} = \partial_{x_1} f(x)
    • Standard WGAN-GP better thought of in kernel framework

Application: distribution regression/classification/…

Example: age from face images [Law+ AISTATS-18]

Bayesian distribution regression: incorporate \muP uncertainty

\Biggl\{ \text{[set of face images]} \Biggr\} \to 35

IMDb database [Rothe+ 2015]: 400k images of 20k celebrities


  • \X \indep \Y iff \Cov(f(\X), g(\Y)) = 0 for all measurable f , g
  • Let's implement for RKHS functions \fc \in \Hx , \gc \in \Hy :
    \begin{align*}
    \E[\fc(\X)] \E[\gc(\Y)] &\fragment[1]{{}= \hinner[\Hx]{\fc}{\muP} \hinner[\Hy]{\muQ}{\gc}}
    \\&\fragment[2]{{}= \hinner[\Hx]{\fc}{(\muP \otimes \muQ) \gc}}
    \\ \fragment[3]{\E[\fc(\X) \gc(\Y)]} &\fragment[4]{{}= \E[ \hinner[\Hx]{\fc}{\kx(\X, \cdot)} \hinner[\Hy]{\ky(\Y, \cdot)}{\gc}]}
    \\&\fragment[5]{{}= \hinner[\Hx]{\fc}{ \E\left[ \kx(\X, \cdot) \otimes \ky(\Y, \cdot) \right] \, \gc }}
    \\ \fragment[6]{\Cov(\fc(\X), \gc(\Y))} &\fragment[6]{{}= \hinner[\Hx]{\fc}{ C_{\X\Y} \gc }}
    \end{align*}
    where C_{\X\Y} : \Hy \to \Hx is \E\left[ \kx(\X, \cdot) \otimes \ky(\Y, \cdot) \right] - \E\left[ \kx(\X, \cdot) \right] \otimes \E\left[ \ky(\Y, \cdot) \right]

Cross-covariance operator and independence

  • \Cov(\fc(\X), \gc(\Y)) = \hinner[\Hx]{\fc}{C_{\X\Y} \gc}
  • C_{\X\Y} = \E\left[ \kx(\X, \cdot) \otimes \ky(\Y, \cdot) \right] - \muP \otimes \muQ
  • If \X \indep \Y , then C_{\X\Y} = 0
  • If C_{\X\Y} = 0 , \Cov(\fc(\X), \gc(\Y)) = 0 \quad \forall \fc \in \Hx, \gc \in \Hy
  • If \kx , \ky are characteristic:
    • C_{\X\Y} = 0 implies \X \indep \Y [Szabó/Sriperumbudur JMLR-18]
    • \X \indep \Y iff C_{\X\Y} = 0
    • \X \indep \Y iff 0 = \norm{C_{\X\Y}}_\HS^2 (sum squared singular values)
      • HSIC: "Hilbert-Schmidt Independence Criterion"


\begin{align*}
C_{\X\Y} &= \E\left[ \kx(\X, \cdot) \otimes \ky(\Y, \cdot) \right] - \muP \otimes \muQ
\\ \norm{C_{\X\Y}}_{\HS}^2 &= \hNorm[\Hx \otimes \Hy]{\mu_{\mathbb{P}_{\X\Y}} - \muP \otimes \muQ}^2
\\&\fragment[1]{{}= \MMD(\mathbb{P}_{\X\Y}, \PP \times \QQ)^2}
\\&\fragment[2]{{} = \E[ \kx(\X, \Xp) \ky(\Y, \Yp) ] }
\\&\fragment[2]{{}\quad - 2 \E[\kx(\X, \Xp) \ky(\Y, \Ypp)] }
\\&\fragment[2]{{}\quad + \E[\kx(\X, \Xp)] \E[\ky(\Y, \Yp)] }
\end{align*}
  • Linear case: C_{\X\Y} is cross-covariance matrix,
    HSIC is squared Frobenius norm
  • Default estimator (biased, but simple):
    \frac{1}{n^2} \Tr(H K_\X H K_\Y) where H = I - \frac1n \mathbf{1} \mathbf{1}\tp
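
The trace estimator can be sketched in a few lines. This sketch assumes the standard conventions of a centering matrix H = I - \frac1n \mathbf{1} \mathbf{1}\tp and a 1/n^2 normalization, with Gaussian kernels of bandwidth 1 on both variables; `hsic_biased` is an illustrative name. It also checks the linear case against the squared Frobenius norm of the empirical cross-covariance matrix.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=1.0):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * bandwidth**2))


def hsic_biased(K_X, K_Y):
    """Biased HSIC estimate: Tr(H K_X H K_Y) / n^2 with H = I - 11^T / n."""
    n = K_X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ K_X @ H @ K_Y) / n**2


rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 1))
# Y_dep is uncorrelated with X but clearly dependent on it.
Y_dep = X**2 + 0.1 * rng.normal(size=(n, 1))
# Y_ind has the same marginal distribution but is independent of X.
Y_ind = rng.normal(size=(n, 1)) ** 2 + 0.1 * rng.normal(size=(n, 1))

K_X = gaussian_kernel(X, X)
hsic_dep = hsic_biased(K_X, gaussian_kernel(Y_dep, Y_dep))
hsic_ind = hsic_biased(K_X, gaussian_kernel(Y_ind, Y_ind))

# Linear kernels: HSIC is the squared Frobenius norm of the cross-covariance.
Xl = rng.normal(size=(n, 3))
Yl = Xl[:, :2] + rng.normal(size=(n, 2))
hsic_linear = hsic_biased(Xl @ Xl.T, Yl @ Yl.T)
cross_cov = (Xl - Xl.mean(axis=0)).T @ (Yl - Yl.mean(axis=0)) / n
frob2 = np.linalg.norm(cross_cov) ** 2
print(hsic_dep, hsic_ind, hsic_linear, frob2)
```

The dependent pair has zero linear correlation by construction, so a plain covariance would miss it while the kernel statistic does not.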

HSIC applications

Example: SSL-HSIC [Li+ NeurIPS-21]

  • Maximizes dependence between image features f and its identity on a minibatch
  • Using a learned deep kernel based on g


  • Mean embedding \muP = \E k(\X, \cdot)
  • \MMD(\PP, \QQ) = \hnorm{\muP - \muQ} is 0 iff \PP = \QQ (for characteristic kernels)
  • \HSIC(\X, \Y) = \hnorm[\HS]{C_{\X\Y}}^2 = \MMD(\mathbb{P}_{\X\Y}, \PP \times \QQ)^2 is 0 iff \X \indep \Y (for characteristic \kx , \ky )
  • After break: last interactive session exploring testing
  • More details: