Based on work with:

Michael Arbel

Mikołaj Bińkowski

Soumyajit De

Arthur Gretton

Feng Liu

Jie Lu

Aaditya Ramdas

Alex Smola

Heiko Strathmann

Hsiao-Yu (Fish) Tung

Wenkai Xu

Guangquan Zhang

PIHOT kick-off, 30 Jan 2021


- Linear classifiers: $f(x) = \langle w, x \rangle + b$
- Use a “richer” feature map $\phi$: $f(x) = \langle w, \phi(x) \rangle + b$
- Can avoid computing $\phi$ explicitly; instead use $k(x, y) = \langle \phi(x), \phi(y) \rangle$
- “Kernelized” algorithms access data only through $k(x, y)$

- Ex: Gaussian RBF / exponentiated quadratic / squared exponential / …: $k(x, y) = \exp\left( -\frac{\lVert x - y \rVert^2}{2 \sigma^2} \right)$
- $\phi(x) = k(x, \cdot)$ is the *feature map* for $k$
- Reproducing property: $f(x) = \langle f, \phi(x) \rangle_{\mathcal{H}}$ for $f \in \mathcal{H}$
- The optimal $f$ is in $\operatorname{span}\{ \phi(x_i) \}$ – the representer theorem
- If $\mu_P = \mathbb{E}_{X \sim P}[\phi(X)]$, then $\operatorname{MMD}(P, Q) = \lVert \mu_P - \mu_Q \rVert_{\mathcal{H}}$; MMD is the distance between means
- Many kernels: $\phi$ is **infinite-dimensional**
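As an illustration (not from the slides), the Gaussian RBF kernel above can be evaluated directly, without ever forming the infinite-dimensional feature map; a minimal numpy sketch:

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2 * X @ Y.T
    )
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.array([[0.0], [1.0]])
K = rbf_kernel(X, X)
# k(x, x) = 1 on the diagonal; k(0, 1) = exp(-1/2) off the diagonal
```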

Example kernel matrices (values illustrative). Within the first sample, $K_{XX}$:

1.0 | 0.2 | 0.6
0.2 | 1.0 | 0.5
0.6 | 0.5 | 1.0

Within the second sample, $K_{YY}$:

1.0 | 0.8 | 0.7
0.8 | 1.0 | 0.6
0.7 | 0.6 | 1.0

Across the two samples, $K_{XY}$:

0.3 | 0.1 | 0.2
0.2 | 0.3 | 0.3
0.2 | 0.1 | 0.4
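Given kernel matrices like the three above, a (biased) MMD² estimate is just a difference of means; a sketch using those illustrative values:

```python
import numpy as np

# The illustrative kernel matrices from the slide: within the first sample,
# within the second sample, and across the two samples
K_XX = np.array([[1.0, 0.2, 0.6], [0.2, 1.0, 0.5], [0.6, 0.5, 1.0]])
K_YY = np.array([[1.0, 0.8, 0.7], [0.8, 1.0, 0.6], [0.7, 0.6, 1.0]])
K_XY = np.array([[0.3, 0.1, 0.2], [0.2, 0.3, 0.3], [0.2, 0.1, 0.4]])

# Biased (V-statistic) estimate: MMD^2 = E k(X,X') + E k(Y,Y') - 2 E k(X,Y)
mmd2 = K_XX.mean() + K_YY.mean() - 2 * K_XY.mean()
# High within-group similarity, low cross similarity: large MMD^2 (~0.956)
```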

- Given samples $X_1, \dots, X_n \sim P$ and $Y_1, \dots, Y_m \sim Q$ from two unknown distributions
- Question: is $P = Q$?
- Hypothesis testing approach: $H_0 \colon P = Q$ versus $H_1 \colon P \ne Q$
- Reject $H_0$ if the test statistic exceeds a threshold $c_\alpha$

- Do smokers/non-smokers get different cancers?
- Do Brits have the same friend network types as Americans?
- When does my laser agree with the one on Mars?
- Are storms in the 2000s different from storms in the 1800s?
- Does presence of this protein affect DNA binding? [`MMDiff2`]
- Do these `dob` and `birthday` columns mean the same thing?
- Does my generative model match the data distribution $P$?
- Independence testing: is $P_{XY} = P_X P_Y$?

Need $c_\alpha$: the $(1 - \alpha)$th quantile of the test statistic's distribution under $H_0$.

- If $k$ is *characteristic*, $\operatorname{MMD}(P, Q) = 0$ iff $P = Q$
- Efficient permutation testing for $c_\alpha$
- Under $H_0$: $n \, \widehat{\operatorname{MMD}}^2$ converges in distribution
- Under $H_1$: $\widehat{\operatorname{MMD}}^2$ is asymptotically normal

- Any characteristic kernel gives a consistent test… eventually
- Need enormous $n$ if the kernel is bad for the problem
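The permutation-testing approach mentioned above can be sketched in a few lines (helper names here are hypothetical; Gaussian kernel assumed): pool the samples, recompute the statistic under random relabelings, and reject when the observed statistic exceeds the null's $(1 - \alpha)$ quantile:

```python
import numpy as np

def mmd2_biased(K, n):
    """Biased MMD^2 estimate from the pooled kernel matrix.
    K is (n+m) x (n+m); the first n rows/columns correspond to X."""
    K_XX, K_YY, K_XY = K[:n, :n], K[n:, n:], K[:n, n:]
    return K_XX.mean() + K_YY.mean() - 2 * K_XY.mean()

def permutation_test(X, Y, sigma=1.0, n_perms=500, alpha=0.05, seed=0):
    """Reject H0: P = Q when the observed MMD^2 exceeds the
    (1 - alpha) quantile of its permutation null distribution."""
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y])
    sq = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    K = np.exp(-sq / (2 * sigma**2))  # Gaussian RBF on the pooled sample
    n = len(X)
    stat = mmd2_biased(K, n)
    # Shuffling rows/columns of K simulates relabeling the pooled sample,
    # so the kernel matrix never needs to be recomputed
    null = [
        mmd2_biased(K[np.ix_(p, p)], n)
        for p in (rng.permutation(len(Z)) for _ in range(n_perms))
    ]
    return stat > np.quantile(null, 1 - alpha)

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(100, 1))
Y = rng.normal(3.0, 1.0, size=(100, 1))  # means 3 sigma apart: should reject
```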

- The statistic $\hat{t}$ is the accuracy of the classifier $f$ on the test set
- Under $H_0$, classification is impossible: accuracy $\to \tfrac{1}{2}$
- With a deep kernel $k(x, y) = \kappa(\phi(x), \phi(y))$, where $\phi$ is a learned feature extractor, we get a test statistic whose power we can optimize

- Asymptotics of $\widehat{\operatorname{MMD}}^2$ give us immediately that the power is approximately $\Phi\left( \frac{\sqrt{n} \operatorname{MMD}^2}{\sigma_{H_1}} - \frac{c_\alpha}{\sqrt{n} \, \sigma_{H_1}} \right)$; as $n$ grows, $\operatorname{MMD}$, $\sigma_{H_1}$, $c_\alpha$ are constants: the first term dominates
- Pick $k$ to maximize an estimate of $\operatorname{MMD}^2 / \sigma_{H_1}$
- Can show uniform convergence of this estimator
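One way to sketch this kernel-selection step (not the exact estimator from the talk): grid-search the bandwidth maximizing an estimate of $\operatorname{MMD}^2 / \sigma_{H_1}$, with a crude variance proxy standing in for the full variance estimator; the constants and the regularizer `eps` here are illustrative:

```python
import numpy as np

def power_criterion(X, Y, sigma, eps=1e-8):
    """Rough estimate of MMD^2 / sigma_H1 for a Gaussian kernel of
    bandwidth sigma, via the pairwise h-statistic
    h_ij = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)."""
    def k(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * sigma**2))
    K_XY = k(X, Y)
    H = k(X, X) + k(Y, Y) - K_XY - K_XY.T
    mmd2 = H.mean()                   # estimates MMD^2
    var = H.mean(axis=1).var() + eps  # crude variance proxy
    return mmd2 / np.sqrt(var)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (200, 1))
Y = rng.normal(1.0, 1.0, (200, 1))
# Grid-search the bandwidth maximizing the estimated criterion
sigmas = [0.01, 0.1, 1.0, 10.0, 100.0]
best = max(sigmas, key=lambda s: power_criterion(X, Y, s))
```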

Train on 1 000, test on 1 031, repeat 10 times. Rejection rates:

ME | SCF | C2ST | MMD-O | MMD-D |
---|---|---|---|---|
0.588 | 0.171 | 0.452 | 0.316 | 0.744 |

Dataset | Sign (cross-entropy) | Lin (cross-entropy) | Ours (cross-entropy) | Sign (max power) | Lin (max power) | Ours (max power) |
---|---|---|---|---|---|---|
Blob | 0.84 | 0.94 | 0.90 | – | 0.95 | 0.99 |
High-$d$ Gauss. mix. | 0.47 | 0.59 | 0.29 | – | 0.64 | 0.66 |
Higgs | 0.26 | 0.40 | 0.35 | – | 0.30 | 0.40 |
MNIST vs GAN | 0.65 | 0.71 | 0.80 | – | 0.94 | 1.00 |

Given samples from a distribution $P$ over $\mathcal{X}$,

we want a model that can produce new samples from $Q \approx P$

“Everybody Dance Now” [Chan et al. ICCV-19]

Fixed distribution of latents: $Z \sim \mathcal{N}(0, I)$

Maps through a network: $G_\theta(Z) \sim Q_\theta$

DCGAN generator [Radford+ ICLR-16]

How to choose $\theta$?

- GANs [Goodfellow+ NeurIPS-14] minimize discriminator accuracy (like the classifier test) between $P$ and $Q_\theta$
- Problem: if there's a perfect classifier, the loss is discontinuous in $\theta$, with no gradient to improve it [Arjovsky/Bottou ICLR-17]
- Supports are disjoint at initialization:
- For usual architectures, $Q_\theta$ is supported on a countable union of manifolds with dimension at most $\dim(Z)$
- “Natural image manifold” usually considered low-dimensional
- The supports won't align at initialization, so they won't ever align

- Integral probability metrics, $D_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(Y)]$, with “smooth” $\mathcal{F}$ are continuous
- WGAN: $\mathcal{F}$ a set of neural networks satisfying a Lipschitz constraint
- WGAN-GP: instead penalize the critic's gradient norm near the data
- Both losses can be viewed as MMD with a learned kernel
- Some kind of constraint on $\mathcal{F}$ is important!

Illustrative problem in $\mathbb{R}$, DiracGAN [Mescheder+ ICML-18]:

- Just need to stay away from tiny bandwidths $\sigma$
- …the deep kernel analogue is hard.
- Instead, keep the witness function from being too steep
- $\mathcal{F} = \{ f : \lVert f \rVert_{\mathrm{Lip}} \le 1 \}$ would give the Wasserstein distance
- Nice distance, but hard to estimate
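For the DiracGAN setup everything is available in closed form, which makes the bandwidth point concrete; a sketch (Gaussian kernel assumed, function names hypothetical):

```python
import numpy as np

def mmd2_dirac(theta, sigma):
    """MMD^2 between Dirac masses at theta (model) and 0 (target) under a
    Gaussian kernel: k(t,t) + k(0,0) - 2 k(t,0) = 2 (1 - exp(-t^2 / 2 sigma^2))."""
    return 2 * (1 - np.exp(-theta**2 / (2 * sigma**2)))

def grad_mmd2(theta, sigma):
    """d/dtheta of the above: (2 theta / sigma^2) exp(-theta^2 / 2 sigma^2)."""
    return (2 * theta / sigma**2) * np.exp(-theta**2 / (2 * sigma**2))

# Generator initialized at theta = 5, far from the target at 0
g_tiny = grad_mmd2(5.0, sigma=0.1)  # tiny bandwidth: gradient underflows to 0
g_wide = grad_mmd2(5.0, sigma=5.0)  # wide bandwidth: a usable gradient
```

With a tiny bandwidth the loss is flat almost everywhere away from the target, so gradient descent on $\theta$ never moves; a wide bandwidth keeps a usable training signal.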

- Control the witness function's gradient *on average, near the data*

- If $k$ gives uniformly Lipschitz critics, the loss is smooth in $\theta$
- Original MMD-GAN paper [Li+ NeurIPS-17]: box constraint on the critic's weights
- We [Bińkowski+ ICLR-18] used a gradient penalty on the critic instead
- Better in practice, but doesn't fix the Dirac problem…

Want to ensure the critic's gradient is small on average over the data.

Can solve for this exactly… but too expensive!

Guaranteed if we rescale the kernel by a gradient-based normalizer.

Gives the Scaled MMD distance.

**Theorem:** the Scaled MMD is continuous in $\theta$.

If $P$ has a density;
the base kernel is Gaussian/linear/…;
the critic network is fully-connected, Leaky-ReLU, with non-increasing widths;
and all weight matrices in the critic have bounded condition number;
then $\theta \mapsto \operatorname{SMMD}(P, Q_\theta)$ is continuous.

SN-SMMD-GAN: KID 0.006

WGAN-GP: KID 0.022

- Human evaluation: good at precision, bad at recall
- Likelihood: hard for GANs, maybe not the right thing anyway
- Two-sample tests: always reject!
- Most common: Fréchet Inception Distance, FID
- Run a pretrained featurizer on model and target samples
- Model each as a Gaussian; compute $\lVert \mu_P - \mu_Q \rVert^2 + \operatorname{Tr}\left( \Sigma_P + \Sigma_Q - 2 (\Sigma_P \Sigma_Q)^{1/2} \right)$
- Strong bias, small variance: very misleading
- Simple examples where the true FIDs satisfy $\operatorname{FID}(Q_1) < \operatorname{FID}(Q_2)$, but the estimates reliably order them the other way
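KID, reported in the results above, avoids FID's Gaussian modeling step: it is an unbiased MMD² estimate with a cubic polynomial kernel on featurizer outputs. A minimal sketch on stand-in features:

```python
import numpy as np

def kid(X, Y):
    """Unbiased MMD^2 with the cubic polynomial kernel
    k(x, y) = (x . y / d + 1)^3, computed on featurizer outputs."""
    d = X.shape[1]
    n, m = len(X), len(Y)
    K_XX = (X @ X.T / d + 1) ** 3
    K_YY = (Y @ Y.T / d + 1) ** 3
    K_XY = (X @ Y.T / d + 1) ** 3
    # Drop diagonals so the within-group terms are unbiased
    term_x = (K_XX.sum() - np.trace(K_XX)) / (n * (n - 1))
    term_y = (K_YY.sum() - np.trace(K_YY)) / (m * (m - 1))
    return term_x + term_y - 2 * K_XY.mean()

rng = np.random.default_rng(0)
feats_p = rng.normal(0.0, 1.0, (500, 64))  # stand-in "target" features
feats_q = rng.normal(0.0, 1.0, (500, 64))  # same distribution: KID near 0
feats_r = rng.normal(0.5, 1.0, (500, 64))  # mean-shifted: KID clearly positive
```

Because the estimator is unbiased, it can even go slightly negative when the two feature distributions match, unlike FID's systematically biased estimates.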