# Modern Kernel Methods in Machine Learning: Part II

Danica J. Sutherland (she/her)
Computer Science, University of British Columbia
ETICS "summer" school, Oct 2022

## Yesterday, we saw:

• RKHS $\mathcal{H}$ is a function space, built from the functions $k(x, \cdot)$
• Reproducing property: $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$
• Representer theorem: the optimizer is $\hat{f} = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$
• Can use these to do kernel ridge regression, SVMs, etc.

## Today:

• Kernel mean embeddings of distributions
• Gaussian processes and probabilistic numerics
• Kernel approximations, for better computation
• Neural tangent kernels

## Mean embeddings of distributions

• Represent a point $x$ as $k(x, \cdot) \in \mathcal{H}$
• Represent a distribution $P$ as its mean embedding $\mu_P = \mathbb{E}_{X \sim P}[k(X, \cdot)] \in \mathcal{H}$
• Last step assumed e.g. $\mathbb{E}_{X \sim P} \sqrt{k(X, X)} < \infty$, so the expectation exists in $\mathcal{H}$
• Okay. Why?
• One reason: ML on distributions
• More common reason: comparing distributions (see the sketch after this list)
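A minimal NumPy sketch (not from the slides) of evaluating the empirical mean embedding $\hat{\mu}_P = \frac{1}{n}\sum_i k(x_i, \cdot)$ at query points, assuming a Gaussian kernel; the helper names are illustrative, and later sketches reuse `gaussian_kernel`:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma**2))

def mean_embedding_at(X, T, sigma=1.0):
    """Evaluate mu_P_hat(t) = (1/n) sum_i k(x_i, t) at each row t of T."""
    return gaussian_kernel(X, T, sigma).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))   # sample from P
T = rng.normal(size=(3, 2))     # query points
print(mean_embedding_at(X, T))  # three evaluations of the embedding
```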

## Maximum Mean Discrepancy

• $\mathrm{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}} = \sup_{\|f\|_{\mathcal{H}} \le 1} \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)]$
• Last line is Integral Probability Metric (IPM) form
• $f^\star \propto \mu_P - \mu_Q$ is called the “witness function” or “critic”: high on $P$, low on $Q$
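A sketch of the standard unbiased U-statistic estimator of $\mathrm{MMD}^2$ (usable with `gaussian_kernel` above or any other kernel function):

```python
def mmd2_unbiased(X, Y, kernel):
    """Unbiased U-statistic estimate of MMD^2 between samples X ~ P, Y ~ Q."""
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    # Exclude the diagonal (i = j) terms so the estimator is exactly unbiased
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2 * Kxy.mean()
```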

## MMD properties

• $\mathrm{MMD} \ge 0$, symmetry, triangle inequality
• If $k$ is characteristic, then $\mathrm{MMD}(P, Q) = 0$ iff $P = Q$
• i.e. $P \mapsto \mu_P$ is injective
• Makes MMD a metric on probability distributions
• Universal $\Rightarrow$ characteristic
• Linear kernel: $\mathrm{MMD}$ is just the Euclidean distance between means, $\|\mathbb{E}[X] - \mathbb{E}[Y]\|$ (checked numerically below)
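A quick numerical check of the last point (a toy setup, reusing `mmd2_unbiased` from the sketch above):

```python
linear = lambda A, B: A @ B.T  # linear kernel k(x, y) = <x, y>
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(2000, 3))
Y = rng.normal(0.5, 1.0, size=(2000, 3))
print(mmd2_unbiased(X, Y, linear))            # ~ ||E X - E Y||^2 = 0.75
print(((X.mean(0) - Y.mean(0)) ** 2).sum())   # plug-in squared mean distance
```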

## Application: Kernel Herding

• Want a "super-sample" from :
• If , error
• Greedily minimize the MMD:
• Get approximation instead of with random samples
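A simplified sketch of that greedy step over a finite candidate pool (hypothetical helper names; `mean_embedding_at` and `gaussian_kernel` are from the first sketch):

```python
def herd(candidates, X_from_P, num_points, sigma=1.0):
    """Greedy kernel herding: repeatedly pick the candidate maximizing
    mu_P_hat(x) - (1/(t+1)) * sum_j k(x, x_j) over already-chosen x_j."""
    chosen = []
    mu_vals = mean_embedding_at(X_from_P, candidates, sigma)
    for t in range(num_points):
        scores = mu_vals.copy()
        if chosen:
            S = np.stack(chosen)
            scores -= gaussian_kernel(candidates, S, sigma).sum(axis=1) / (t + 1)
        chosen.append(candidates[np.argmax(scores)])
    return np.stack(chosen)
```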

[Figure: three example 3×3 kernel matrices, apparently the within-sample blocks $K_{XX}$, $K_{YY}$ and the cross block $K_{XY}$ used in an empirical MMD computation]

## MMD vs other distances

• $\mathrm{MMD}^2$ has an easy unbiased estimator, computable in $O(n^2)$ time
• Block or incomplete estimators are $O(n)$, but noisier (see the sketch after this list)
• For a bounded kernel, estimation error is $O(1/\sqrt{n})$
• Independent of the data dimension!
• But, no free lunch… the value of the MMD generally shrinks with growing dimension, so constant error gets worse relatively
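A sketch of a linear-time variant, in the spirit of the block/incomplete estimators mentioned above (assumes a Gaussian kernel and paired-off samples):

```python
def mmd2_linear_time(X, Y, sigma=1.0):
    """O(n) estimator: average h = k(x1,x2) + k(y1,y2) - k(x1,y2) - k(x2,y1)
    over disjoint pairs. Unbiased, but noisier than the O(n^2) estimator."""
    n = min(len(X), len(Y)) // 2
    x1, x2, y1, y2 = X[:n], X[n:2*n], Y[:n], Y[n:2*n]
    k = lambda a, b: np.exp(-((a - b) ** 2).sum(-1) / (2 * sigma**2))
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    return h.mean()
```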

## Application: Two-sample testing

• Given samples from two unknown distributions: $X_1, \dots, X_n \sim P$ and $Y_1, \dots, Y_m \sim Q$
• Question: is $P = Q$?
• Hypothesis testing approach:
• Reject $H_0\!: P = Q$ if $\widehat{\mathrm{MMD}}(X, Y) > c_\alpha$
• Do smokers/non-smokers get different cancers?
• Do Brits have the same friend network types as Americans?
• When does my laser agree with the one on Mars?
• Are storms in the 2000s different from storms in the 1800s?
• Does presence of this protein affect DNA binding? [MMDiff2]
• Do these dob and birthday columns mean the same thing?
• Does my generative model $Q_\theta$ match the data distribution $P$?

## MMD-based testing

• Under $H_0$ ($P = Q$): $n \widehat{\mathrm{MMD}}^2$ converges in distribution to… something
• An infinite mixture of $\chi^2$s, whose params depend on $k$ and $P$
• Can estimate the threshold with permutation testing (sketched below)
• Under $H_1$ ($P \ne Q$): $\widehat{\mathrm{MMD}}^2$ is asymptotically normal
• Any characteristic kernel gives a consistent test… eventually
• Need enormous $n$ if the kernel is bad for the problem
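A sketch of the permutation approach (reusing `mmd2_unbiased`; the function name and defaults are illustrative):

```python
def mmd_permutation_test(X, Y, kernel, alpha=0.05, num_perms=500, seed=0):
    """Reject H0: P = Q if the observed MMD^2 estimate exceeds the
    (1 - alpha) quantile of its permutation null distribution."""
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y])
    observed = mmd2_unbiased(X, Y, kernel)
    null_stats = []
    for _ in range(num_perms):
        idx = rng.permutation(len(Z))  # relabel the pooled sample
        null_stats.append(mmd2_unbiased(Z[idx[:len(X)]], Z[idx[len(X):]], kernel))
    threshold = np.quantile(null_stats, 1 - alpha)
    return observed > threshold, observed, threshold
```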

## Classifier two-sample tests

• Train a classifier $f$ to distinguish the two samples; the statistic is the accuracy of $f$ on the test set
• Under $H_0$, classification is impossible: accuracy $\to \frac{1}{2}$
• With $k(x, x') = \frac{1}{4} \mathbb{1}[f(x) = 1] \, \mathbb{1}[f(x') = 1]$ where $f : \mathcal{X} \to \{0, 1\}$,
get $\widehat{\mathrm{MMD}} = \text{accuracy} - \frac{1}{2}$ (checked numerically below)
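A toy numerical check of that identity (everything here is an illustrative assumption, including the fixed classifier):

```python
f = lambda x: (x.sum(-1) > 0).astype(float)   # any fixed {0,1} classifier
rng = np.random.default_rng(2)
X = rng.normal(0.3, 1.0, size=(1000, 2))      # "P" test sample, labeled 1
Y = rng.normal(0.0, 1.0, size=(1000, 2))      # "Q" test sample, labeled 0
accuracy = 0.5 * (f(X).mean() + (1 - f(Y)).mean())
# plug-in MMD with feature map phi(x) = (1/2) * 1[f(x) = 1]
mmd_hat = abs(f(X).mean() - f(Y).mean()) / 2
print(accuracy - 0.5, mmd_hat)                # agree (up to the sign convention)
```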

## Deep learning and deep kernels

• The classifier kernel above is one form of deep kernel
• Deep models are usually of the form $f(x) = w^\top \phi_\theta(x)$
• With a learned feature extractor $\phi_\theta$
• If we fix $\phi_\theta$, have $f \in \mathcal{H}_k$ with $k(x, x') = \phi_\theta(x)^\top \phi_\theta(x')$
• Same idea as NNGP approximation
• Generalize to a deep kernel: $k_\theta(x, x') = \kappa(\phi_\theta(x), \phi_\theta(x'))$, e.g. with $\kappa$ Gaussian (sketched below)
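A minimal sketch of such a deep kernel, assuming a Gaussian $\kappa$ on top of an arbitrary feature map (here a random network stands in for a learned $\phi_\theta$):

```python
def deep_kernel(phi, sigma=1.0):
    """k(x, x') = kappa(phi(x), phi(x')) with kappa a Gaussian kernel."""
    def k(X, Y):
        FX, FY = phi(X), phi(Y)
        sq = ((FX[:, None, :] - FY[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))
    return k

W = np.random.default_rng(3).normal(size=(2, 16))
phi = lambda X: np.tanh(X @ W)   # stand-in for a trained feature extractor
k_deep = deep_kernel(phi)        # drop-in for mmd2_unbiased or the test above
```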

## Normal deep learning ⊂ deep kernels

• Take $\kappa$ linear: $k_\theta(x, x') = \phi_\theta(x)^\top \phi_\theta(x')$
• Final function in $\mathcal{H}_{k_\theta}$ will be $x \mapsto w^\top \phi_\theta(x)$
• With logistic loss: this is Platt scaling

## “Normal deep learning ⊂ deep kernels” – so?

• This definitely does not say that deep learning is (even approximately) a kernel method
• …despite what some people might want you to think
• We know theoretically that deep learning can learn some things faster than any kernel method [see Malach+ ICML-21 + refs]
• But deep kernel learning ≠ traditional kernel models
• exactly like how usual deep learning ≠ linear models

## Optimizing power of MMD tests

• Asymptotics of $\widehat{\mathrm{MMD}}^2$ give us immediately that power $\approx \Phi\!\left( \frac{\sqrt{n}\, \mathrm{MMD}^2}{\sigma_{H_1}} - \frac{c_\alpha}{\sqrt{n}\, \sigma_{H_1}} \right)$
• $\mathrm{MMD}$, $\sigma_{H_1}$, $c_\alpha$ are constants: first term usually dominates
• Pick $k$ to maximize an estimate of $\mathrm{MMD}^2 / \sigma_{H_1}$ (sketched after this list)
• Use $\widehat{\mathrm{MMD}}^2$ from before, get $\hat{\sigma}_{H_1}$ from U-statistic theory
• Can show uniform convergence of the estimator
• Get better tests (even after data splitting)
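A simplified sketch of that criterion; the variance term follows the usual U-statistic estimate (as in Liu+ ICML-20), and `lam` is a small stabilizer assumed here for illustration:

```python
def power_criterion(X, Y, kernel, lam=1e-8):
    """Rough estimate of MMD^2 / sigma_{H1}, the quantity to maximize over
    kernels for test power. Assumes len(X) == len(Y) == n."""
    n = len(X)
    Kxy = kernel(X, Y)
    H = kernel(X, X) + kernel(Y, Y) - Kxy - Kxy.T  # pairwise h-statistics
    mmd2 = (H.sum() - np.trace(H)) / (n * (n - 1))
    var = 4 / n**3 * (H.sum(axis=1) ** 2).sum() - 4 / n**4 * H.sum() ** 2
    return mmd2 / np.sqrt(max(var, 0) + lam)
```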

## Application: (S)MMD GANs

• An implicit generative model:
• A generator net $G_\theta$ maps noise $Z$ to samples, defining $Q_\theta$
• Minimize an estimate of $\mathrm{MMD}(P, Q_\theta)$ on a minibatch
• MMD GAN: $\min_\theta \max_\psi \widehat{\mathrm{MMD}}_{k_\psi}(P, Q_\theta)$, with a learned deep kernel $k_\psi$
• SMMD GAN: the same, with a scaled MMD
• Scaled MMD uses kernel properties to ensure a smooth loss for $G_\theta$ by making the witness function smooth
• Uses …
• Standard WGAN-GP is better thought of in the kernel framework

## Application: distribution regression/classification/…

• We can define a kernel on distributions by, e.g., $K(P, Q) = \exp\!\left( -\frac{1}{2\sigma^2} \mathrm{MMD}^2(P, Q) \right)$, as sketched after this list
• Some pointers:
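A hedged sketch of the resulting Gram matrix between “bags” of samples, one bag per distribution (reusing `mmd2_unbiased`; the helper name is made up):

```python
def mmd_gram(bags, kernel, sigma=1.0):
    """Gram matrix K[i, j] = exp(-MMD^2(P_i, P_j) / (2 sigma^2)), where each
    bag is an array of samples from one distribution; feed into any kernel method."""
    B = len(bags)
    G = np.eye(B)  # MMD(P, P) = 0, so the diagonal is exp(0) = 1
    for i in range(B):
        for j in range(i + 1, B):
            m2 = mmd2_unbiased(bags[i], bags[j], kernel)
            G[i, j] = G[j, i] = np.exp(-m2 / (2 * sigma**2))
    return G
```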

## Example: age from face images [Law+ AISTATS-18]

Bayesian distribution regression: incorporate uncertainty


IMDb database: 400k images of 20k celebrities

## Independence

• $X \perp\!\!\!\perp Y$ iff for all measurable $f$, $g$: $\operatorname{Cov}(f(X), g(Y)) = 0$
• Let's implement for RKHS functions $f \in \mathcal{H}_k$, $g \in \mathcal{H}_l$: $\operatorname{Cov}(f(X), g(Y)) = \langle f, C_{XY} g \rangle_{\mathcal{H}_k}$, where $C_{XY}$ is the cross-covariance operator

## Cross-covariance operator and independence

• If $X \perp\!\!\!\perp Y$, then $C_{XY} = 0$
• If $C_{XY} = 0$, then $\operatorname{Cov}(f(X), g(Y)) = 0$ for all $f \in \mathcal{H}_k$, $g \in \mathcal{H}_l$
• If $k$, $l$ are characteristic:
• $C_{XY} = 0$ implies $X \perp\!\!\!\perp Y$
• $X \perp\!\!\!\perp Y$ iff $C_{XY} = 0$
• iff $\|C_{XY}\|_{\mathrm{HS}}^2 = 0$ (sum of squared singular values)
• HSIC: "Hilbert-Schmidt Independence Criterion", $\operatorname{HSIC}(X, Y) = \|C_{XY}\|_{\mathrm{HS}}^2$

## HSIC

• Linear case: $C_{XY}$ is the cross-covariance matrix, and
HSIC is its squared Frobenius norm
• Default estimator (biased, but simple): $\widehat{\operatorname{HSIC}}(X, Y) = \frac{1}{(n-1)^2} \operatorname{tr}(K H L H)$,
where $H = I - \frac{1}{n} \mathbf{1} \mathbf{1}^\top$ is the centering matrix (sketched below)
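That estimator in a few lines (kernel arguments as in the earlier sketches):

```python
def hsic_biased(X, Y, kernel_k, kernel_l):
    """Biased-but-simple HSIC estimate: tr(KHLH) / (n - 1)^2."""
    n = len(X)
    K = kernel_k(X, X)                    # kernel on the X space
    L = kernel_l(Y, Y)                    # kernel on the Y space
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```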

## HSIC applications

• Independence testing
• Clustering
• Feature selection
• Self-supervised learning
• Broadly: easier-to-estimate, sometimes-nicer version of mutual information

## Example: SSL-HSIC [Li+ NeurIPS-21]

• Maximizes the dependence between an image's features and its identity on a minibatch
• Using a learned deep kernel based on

## Recap

• Mean embedding: $\mu_P = \mathbb{E}_{X \sim P}[k(X, \cdot)]$
• $\mathrm{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}$ is 0 iff $P = Q$ (for characteristic kernels)
• $\operatorname{HSIC}(X, Y) = \|C_{XY}\|_{\mathrm{HS}}^2$ is 0 iff $X \perp\!\!\!\perp Y$ (for characteristic $k$, $l$)
• After the break: last interactive session, exploring testing
• More details:
• Close connections to Gaussian processes
• Mean embeddings: survey