in Machine Learning:

Part II

Danica J. Sutherland(she/her)

Computer Science, University of British Columbia

ETICS "summer" school, Oct 2022


- An RKHS H is a space of functions f : X → ℝ
- Reproducing property: f(x) = ⟨f, k(x, ⋅)⟩_H
- Representer theorem: regularized empirical risk minimizers lie in span{k(x_i, ⋅)}
- Can use to do kernel ridge regression, SVMs, etc.

- Kernel mean embeddings of distributions
- Gaussian processes and probabilistic numerics
- Kernel approximations, for better computation
- Neural tangent kernels

- Represent a point x as φ(x) = k(x, ⋅)
- Represent a *distribution* P as its mean embedding μ_P = E_{X∼P}[k(X, ⋅)]
- Last step assumed e.g. E_{X∼P} √k(X, X) < ∞

- Okay. Why?
- One reason: ML on distributions [Szabó+ JMLR-16]
- More common reason: comparing distributions

- Last line is the Integral Probability Metric (IPM) form: MMD(P, Q) = sup_{‖f‖_H ≤ 1} [E_P f(X) − E_Q f(Y)]
- The maximizing f is called the "witness function" or "critic": high on P, low on Q

- MMD(P, P) = 0, symmetry, triangle inequality
- If k is *characteristic*, then MMD(P, Q) = 0 iff P = Q
- i.e. the map P ↦ μ_P is injective
- Makes MMD a metric on probability distributions
- Universal ⇒ characteristic

- Linear kernel: MMD(P, Q) is just the Euclidean distance between the means, ‖E_P[X] − E_Q[Y]‖
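A quick numerical check of this fact (an illustrative numpy sketch, not code from the talk): with the linear kernel, the biased MMD² estimate reduces exactly to the squared distance between sample means.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 3))
Y = rng.normal(0.5, 1.0, size=(500, 3))

# Biased MMD^2 with the linear kernel k(x, y) = <x, y>:
# mean(K_XX) + mean(K_YY) - 2 mean(K_XY)
mmd2 = (X @ X.T).mean() + (Y @ Y.T).mean() - 2 * (X @ Y.T).mean()
```

Since mean(K_XX) = ⟨x̄, x̄⟩ and so on, the three terms collapse to ‖x̄ − ȳ‖².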

- Want a "super-sample" from P: a small point set whose empirical distribution represents P well
- If f ∈ H, the error satisfies |E_P f(X) − (1/m) Σ_j f(x̃_j)| ≤ ‖f‖_H ⋅ MMD(P, P̃), where P̃ is the super-sample's empirical distribution
- Greedily minimize the MMD: kernel herding
- Get O(1/m) approximation error instead of O(1/√m) with random samples
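A minimal sketch of this greedy procedure, restricting candidate super-sample points to the observed sample; `gaussian_kernel` and `herd` are illustrative names of my own, not from the talk.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def herd(X, m, sigma=1.0):
    """Greedily choose m 'super-sample' indices from X: each step picks
    the candidate most similar to the sample on average, penalized by
    similarity to the points already chosen (i.e. it greedily shrinks
    the MMD between the chosen set and the full sample)."""
    K = gaussian_kernel(X, X, sigma)
    target = K.mean(axis=1)            # (1/n) sum_i k(x, x_i)
    chosen = []
    sum_chosen = np.zeros(len(X))      # sum over chosen s_j of k(x, s_j)
    for t in range(m):
        scores = target - sum_chosen / (t + 1)
        idx = int(np.argmax(scores))
        chosen.append(idx)
        sum_chosen += K[:, idx]
    return chosen

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
picks = herd(X, 10)
```

Real herding optimizes over the whole input space, not just the sample; this restriction keeps the sketch short.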

Example kernel matrices for three points each from X and Y (within-X, within-Y, and cross-sample values):

k(X, X):
| 1.0 | 0.2 | 0.6 |
| 0.2 | 1.0 | 0.5 |
| 0.6 | 0.5 | 1.0 |

k(Y, Y):
| 1.0 | 0.8 | 0.7 |
| 0.8 | 1.0 | 0.6 |
| 0.7 | 0.6 | 1.0 |

k(X, Y):
| 0.3 | 0.1 | 0.2 |
| 0.2 | 0.3 | 0.3 |
| 0.2 | 0.1 | 0.4 |

- MMD² has an easy estimator: average the within-sample kernel values and subtract twice the average cross-sample value; computing it costs O(n²)
- *Block* or *incomplete* estimators run in O(n), but are noisier
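The estimator can be computed directly from kernel matrices. Here is a sketch of the unbiased (U-statistic) version, using the 3×3 example values shown nearby; the function name is mine.

```python
import numpy as np

# example kernel matrices: within-X, within-Y, and cross-sample values
K_xx = np.array([[1.0, 0.2, 0.6], [0.2, 1.0, 0.5], [0.6, 0.5, 1.0]])
K_yy = np.array([[1.0, 0.8, 0.7], [0.8, 1.0, 0.6], [0.7, 0.6, 1.0]])
K_xy = np.array([[0.3, 0.1, 0.2], [0.2, 0.3, 0.3], [0.2, 0.1, 0.4]])

def mmd2_unbiased(K_xx, K_yy, K_xy):
    """U-statistic estimate of MMD^2: the within-sample averages exclude
    the diagonal k(x_i, x_i) terms, which makes the estimate unbiased."""
    n, m = K_xx.shape[0], K_yy.shape[0]
    term_x = (K_xx.sum() - np.trace(K_xx)) / (n * (n - 1))
    term_y = (K_yy.sum() - np.trace(K_yy)) / (m * (m - 1))
    return term_x + term_y - 2 * K_xy.mean()

est = mmd2_unbiased(K_xx, K_yy, K_xy)
```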

- For a bounded kernel, the estimation error is O(1/√n)
- Independent of the data dimension!
- But, no free lunch… the *value* of the MMD generally shrinks with growing dimension, so a constant error gets relatively worse

- Given samples from two unknown distributions P and Q
- Question: is P = Q?
- Hypothesis testing approach:
- Reject H₀: P = Q if the estimated MMD exceeds a threshold

- Do smokers/non-smokers get different cancers?
- Do Brits have the same friend network types as Americans?
- When does my laser agree with the one on Mars?
- Are storms in the 2000s different from storms in the 1800s?
- Does presence of this protein affect DNA binding? [`MMDiff2`]
- Do these `dob` and `birthday` columns mean the same thing?
- Does my generative model match the data distribution?

- Under H₀: the (scaled) statistic converges in distribution to… something
- An infinite mixture of χ²s, whose parameters depend on k and P
- Can estimate the rejection threshold with *permutation testing*
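A sketch of the permutation-testing recipe, assuming a Gaussian kernel and the biased MMD² statistic; function names and defaults here are illustrative, not from the talk.

```python
import numpy as np

def mmd2_biased(K, n):
    # K: kernel matrix of the pooled sample; first n rows/cols are X, rest Y
    return K[:n, :n].mean() + K[n:, n:].mean() - 2 * K[:n, n:].mean()

def permutation_test(X, Y, sigma=1.0, n_perms=200, seed=0):
    """p-value for H0: P = Q, recomputing the statistic under random
    relabelings of the pooled sample."""
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y])
    d2 = ((Z[:, None] - Z[None, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma**2))
    n, N = len(X), len(Z)
    observed = mmd2_biased(K, n)
    count = 0
    for _ in range(n_perms):
        p = rng.permutation(N)
        count += mmd2_biased(K[np.ix_(p, p)], n) >= observed
    return (count + 1) / (n_perms + 1)

rng = np.random.default_rng(1)
p_same = permutation_test(rng.normal(0, 1, (30, 1)), rng.normal(0, 1, (30, 1)))
p_diff = permutation_test(rng.normal(0, 1, (30, 1)), rng.normal(3, 1, (30, 1)))
```

Reusing the pooled kernel matrix and only permuting its rows and columns avoids recomputing kernel values for every relabeling.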

- Under H₁: the statistic is asymptotically normal
- Any characteristic kernel gives a consistent test… eventually
- Need an enormous n if the kernel is bad for the problem

- The test statistic is the accuracy of a classifier f on the test set
- Under H₀, classification is impossible: accuracy converges to ½
- With a kernel built from the classifier f, this becomes a special case of an MMD test

- k(x, y) = κ(φ(x), φ(y)) is one form of *deep kernel*
- Deep models are usually of the form f(x) = ⟨w, φ(x)⟩
- With a *learned* feature extractor φ

- If we fix φ, we get a kernel method with k(x, y) = ⟨φ(x), φ(y)⟩
- Same idea as the NNGP approximation

- Generalize to a **deep kernel**: k(x, y) = κ(φ(x), φ(y))
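A toy illustration of the deep-kernel form k(x, y) = κ(φ(x), φ(y)), using a random one-layer "network" as a stand-in for a learned φ; everything here is hypothetical scaffolding, not the talk's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 16))  # stand-in "network": one random linear layer

def phi(X):
    # feature extractor: linear layer + ReLU
    return np.maximum(X @ W, 0.0)

def deep_kernel(X, Y, sigma=1.0):
    """k(x, y) = kappa(phi(x), phi(y)), with a Gaussian kappa on features."""
    FX, FY = phi(X), phi(Y)
    d2 = ((FX[:, None] - FY[None, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

K = deep_kernel(rng.normal(size=(4, 5)), rng.normal(size=(3, 5)))
```

Because κ is a valid kernel on the feature space, k is automatically a valid kernel on inputs for any fixed φ.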

- Take a linear kernel on top of the classifier output f(x)
- The final function in H will be of the form x ↦ a f(x) + b
- With logistic loss: this is Platt scaling

- This definitely does *not* say that deep learning is (even approximately) a kernel method
- …despite what some people might want you to think
- We know theoretically deep learning can learn some things faster than any kernel method [see Malach+ ICML-21 + refs]
- But deep kernel learning ≠ traditional kernel models
- exactly like how usual deep learning ≠ linear models

- Asymptotics of the estimator immediately give the test power: roughly Φ(√n MMD²/σ_{H₁} − c_α/(√n σ_{H₁})), where c_α and σ_{H₁} are constants; the first term usually dominates
- Pick k to maximize an estimate of MMD²/σ_{H₁}
- Use the MMD² estimator from before; get an estimate of σ_{H₁} from U-statistic theory
- Can show uniform convergence of this estimator
- Get better tests (even after data splitting)
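A crude sketch of kernel selection by this power proxy: here the denominator σ_{H₁} is approximated by the spread of MMD² estimates over disjoint blocks rather than the closed-form U-statistic variance the slides refer to, and all names and constants are illustrative.

```python
import numpy as np

def block_mmd2s(X, Y, sigma, block=32):
    """Biased MMD^2 on disjoint blocks; the spread of these values gives
    a crude stand-in for the statistic's standard deviation."""
    def k(A, B):
        d2 = ((A[:, None] - B[None, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))
    vals = []
    for i in range(0, len(X) - block + 1, block):
        Xb, Yb = X[i:i + block], Y[i:i + block]
        vals.append(k(Xb, Xb).mean() + k(Yb, Yb).mean() - 2 * k(Xb, Yb).mean())
    return np.array(vals)

def power_criterion(X, Y, sigma):
    # proxy for MMD^2 / sigma_{H1}: mean over blocks divided by their std
    vals = block_mmd2s(X, Y, sigma)
    return vals.mean() / (vals.std() + 1e-8)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (256, 1))
Y = rng.normal(0.5, 1.0, (256, 1))
best_sigma = max([0.1, 0.5, 1.0, 2.0, 5.0],
                 key=lambda s: power_criterion(X, Y, s))
```

In practice the criterion is maximized by gradient descent over deep-kernel parameters on held-out data, not by a grid over bandwidths.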

- An implicit generative model:
- A generator net outputs samples from a model distribution Q_θ
- Minimize an estimate of MMD(P, Q_θ) on a minibatch

- MMD GAN:
- SMMD GAN:
- Scaled MMD uses kernel properties to ensure a smooth loss for the generator by making the witness function smooth [Arbel+ NeurIPS-18]
- Uses
- Standard WGAN-GP is better thought of in the kernel framework

- We can define a kernel on distributions by, e.g., K(P, Q) = ⟨μ_P, μ_Q⟩_H
- Some pointers:

[Muandet+ NeurIPS-12] [Sutherland 2016] [Szabó+ JMLR-16]

Bayesian distribution regression: incorporate uncertainty


IMDb database [Rothe+ 2015]: 400k images of 20k celebrities

- X ⊥ Y iff for all (bounded) measurable f, g: Cov(f(X), g(Y)) = 0
- Let's implement for RKHS functions f ∈ H_k, g ∈ H_l: Cov(f(X), g(Y)) = ⟨f, C_{XY} g⟩, where C_{XY} is the cross-covariance operator

- If X ⊥ Y, then C_{XY} = 0
- If C_{XY} = 0, then Cov(f(X), g(Y)) = 0 for all f ∈ H_k, g ∈ H_l
- If k, l are characteristic:
- C_{XY} = 0 implies X ⊥ Y [Szabó/Sriperumbudur JMLR-18]
- so X ⊥ Y iff C_{XY} = 0
- iff ‖C_{XY}‖²_HS = 0 (sum of squared singular values)
- HSIC: "Hilbert-Schmidt Independence Criterion", HSIC(X, Y) = ‖C_{XY}‖²_HS

- Linear case: C_{XY} is the cross-covariance matrix, and HSIC is its squared Frobenius norm
- Default estimator (biased, but simple): estimate HSIC as (1/n²) tr(K H L H), where H = I − (1/n) 1 1ᵀ is the centering matrix
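A sketch of this default estimator, with a linear-kernel sanity check against the cross-covariance matrix; the function name is mine.

```python
import numpy as np

def hsic_biased(K, L):
    """Biased HSIC estimate: (1/n^2) tr(K H L H), with the centering
    matrix H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

# linear-kernel sanity check: HSIC equals the squared Frobenius norm
# of the empirical cross-covariance matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X[:, :2] + rng.normal(size=(50, 2))
val = hsic_biased(X @ X.T, Y @ Y.T)
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
C = Xc.T @ Yc / len(X)   # empirical cross-covariance matrix
```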

- Independence testing [Gretton+ NeurIPS-07]
- Clustering [Song+ ICML-07]
- Feature selection [Song+ JMLR-12]
- Self-supervised learning [Li+ NeurIPS-21]
- ⋮
- Broadly: easier-to-estimate, sometimes-nicer version of mutual information

- Maximizes dependence between an image's features and its identity on a minibatch
- Using a learned deep kernel based on

- Mean embedding μ_P = E_{X∼P}[k(X, ⋅)] represents a distribution in the RKHS
- MMD(P, Q) is 0 iff P = Q (for characteristic kernels)
- HSIC(X, Y) is 0 iff X ⊥ Y (for characteristic k, l)
- After break: last interactive session exploring testing
- More details:
- Close connections to Gaussian processes [Kanagawa+ 'GPs and Kernel Methods' 2018]
- Mean embeddings: survey [Muandet+ 'Kernel Mean Embedding of Distributions']