\definecolor{cb1}{RGB}{76,114,176} \definecolor{cb2}{RGB}{221,132,82} \definecolor{cb3}{RGB}{85,168,104} \definecolor{cb4}{RGB}{196,78,82} \definecolor{cb5}{RGB}{129,114,179} \definecolor{cb6}{RGB}{147,120,96} \definecolor{cb7}{RGB}{218,139,195} \DeclareMathOperator{\argmin}{argmin} \DeclareMathOperator{\argmax}{argmax} \DeclareMathOperator{\D}{\mathcal{D}} \DeclareMathOperator*{\E}{\mathbb{E}} \DeclareMathOperator*{\Var}{Var} \DeclareMathOperator{\mean}{mean} \DeclareMathOperator{\W}{\mathcal{W}} \DeclareMathOperator{\KL}{KL} \DeclareMathOperator{\JS}{JS} \DeclareMathOperator{\mmd}{MMD} \DeclareMathOperator{\smmd}{SMMD} \DeclareMathOperator{\mmdhat}{\widehat{MMD}} \newcommand{\optmmd}[1][\Psic]{\operatorname{\mathcal{D}_\mathrm{MMD}^{#1}}} \DeclareMathOperator{\optmmdhat}{\hat{\mathcal{D}}_\mathrm{MMD}} \newcommand{\optsmmdp}{\operatorname{\mathcal{D}_\mathrm{SMMD}}} \newcommand{\optsmmd}[1][\Psic]{\operatorname{\mathcal{D}_\mathrm{SMMD}^{\SS,#1,\lambda}}} \newcommand{\ktop}{k_\mathrm{top}} \newcommand{\lip}{\mathrm{Lip}} \newcommand{\tp}{^\mathsf{T}} \newcommand{\F}{\mathcal{F}} \newcommand{\h}{\mathcal{H}} \newcommand{\hk}{{\mathcal{H}_k}} \newcommand{\Xc}{\mathcal{X}} \newcommand{\cP}[1]{{\color{cb1} #1}} \newcommand{\PP}{\cP{\mathbb P}} \newcommand{\pp}{\cP{p}} \newcommand{\X}{\cP{X}} \newcommand{\Xp}{\cP{X'}} \newcommand{\Pdata}{\cP{\mathbb{P}_\mathrm{data}}} \newcommand{\cQ}[1]{{\color{cb2} #1}} \newcommand{\QQ}{\cQ{\mathbb Q}} \newcommand{\qtheta}{\cQ{q_\theta}} \newcommand{\Y}{\cQ{Y}} \newcommand{\Yp}{\cQ{Y'}} \newcommand{\thetac}{\cQ{\theta}} \newcommand{\vtheta}{\thetac} \newcommand{\Qtheta}{\QQ_\thetac} \newcommand{\Gtheta}{\cQ{G_\theta}} \newcommand{\cZ}[1]{{\color{cb5} #1}} \newcommand{\Z}{\cZ{Z}} \newcommand{\Zc}{\cZ{\mathcal Z}} \newcommand{\ZZ}{\cZ{\mathbb Z}} \newcommand{\cpsi}[1]{{\color{cb3} #1}} \newcommand{\psic}{\cpsi{\psi}} \newcommand{\Psic}{\cpsi{\Psi}} \newcommand{\Dpsi}{\cpsi{D_\psi}} \newcommand{\SS}{\cpsi{\mathbb{S}}} \newcommand{\Xtilde}{\cpsi{\tilde{X}}} \newcommand{\Xtildep}{\cpsi{\tilde{X}'}} \newcommand{\R}{\mathbb R} \newcommand{\ud}{\mathrm d}

Better GANs by Using Kernels

D.J. Sutherland
TTIC
Michael Arbel
UCL
Mikołaj Bińkowski
Imperial
Arthur Gretton
UCL
\D\left( \,\cdot\, , \; \cdot\, \right) [title graphic: the distance applied to two image distributions]

UMass Amherst, Sep 30 2019


Implicit generative models

Given samples from a distribution \PP over \Xc ,
we want a model that can produce new samples from \Qtheta \approx \PP


\X \sim \PP

\Y \sim \Qtheta

Why implicit generative models?

How to generate things like images?

One choice: with a generator!

Fixed distribution of latents: \Z \sim \mathrm{Uniform}\left( [-1, 1]^{100} \right)

Maps through a network: G_\thetac(\Z) \sim \Qtheta

How to choose \thetac ?

GANs: trick a discriminator [Goodfellow+ NeurIPS-14]

Generator ( \Qtheta )

Discriminator

Target ( \PP )

Is this real?

No way! \Pr(\text{real}) = 0.03

:( I'll try harder…

Is this real?

Umm… \Pr(\text{real}) = 0.48

One view: distances between distributions

  • What happens when \Dpsi is at its optimum?
  • If distributions have densities, \Dpsi^*(x) = \frac{\pp(x)}{\pp(x) + \qtheta(x)}
  • If \Dpsi stays optimal throughout, \vtheta tries to minimize \frac12 \E_{\X \sim \PP}\left[ \log \frac{\pp(\X)}{\pp(\X) + \qtheta(\X)} \right] + \frac12 \E_{\Y \sim \Qtheta}\left[ \log \frac{\qtheta(\Y)}{\pp(\Y) + \qtheta(\Y)} \right] , which is \JS(\PP, \Qtheta) - \log 2

JS with disjoint support [Arjovsky/Bottou ICLR-17]

\begin{align} \JS(\PP, \Qtheta) &= \frac12 \int \pp(x) \log \frac{\pp(x)}{\frac12 \pp(x) + \frac12 \qtheta(x)} \ud x \\&+ \frac12 \int \qtheta(x) \log \frac{\qtheta(x)}{\frac12 \pp(x) + \frac12 \qtheta(x)} \ud x \end{align}

  • If \PP and \Qtheta have (almost) disjoint support, each term becomes \frac12 \int \pp(x) \log \frac{\pp(x)}{\frac12 \pp(x)} \ud x \fragment{= \frac12 \int \pp(x) \log(2) \ud x} \fragment{= \frac12 \log 2} , so \JS(\PP, \Qtheta) = \log 2

Discriminator point of view

Generator ( \Qtheta )

Discriminator

Target ( \PP )

Is this real?

No way! \Pr(\text{real}) = 0.00

:( I don't know how to do any better…

How likely is disjoint support?

  • At initialization, pretty reasonable: [sample images from \PP and from \Qtheta ]
  • Remember we might have \Gtheta : \R^{100} \to \R^{64 \times 64 \times 3}
  • For usual \Gtheta , \Qtheta is supported on a countable union of
    manifolds with dim \le 100
  • “Natural image manifold” usually considered low-dim
  • No chance that they'd align at init, so \JS(\PP, \Qtheta) = \log 2

Path to a solution: integral probability metrics

\D_\F(\PP, \QQ) = \sup_{f \in \F} \E_{\X \sim \PP}[f(\X)] - \E_{\Y \sim \QQ}[f(\Y)]

f : \Xc \to \R is a critic function

Total variation: \color{cb4}{\F} = \{ f : f \text{ continuous, } \lvert f(x) \rvert \le 1 \}

Wasserstein: \color{cb5}{\F} = \{ f : \lVert f \rVert_\lip \le 1 \}

Maximum Mean Discrepancy [Gretton+ 2012]

\mmd_k(\PP, \QQ) = \sup_{f : \lVert f \rVert_\hk \le 1} \E_{\X \sim \PP}[f(\X)] - \E_{\Y \sim \QQ}[f(\Y)]

Kernel k : \Xc \times \Xc \to \R – a “similarity” function

f^*(t) \propto \E_{\X \sim \PP} k(t, \X) - \E_{\Y \sim \QQ} k(t, \Y)

For many kernels, \mmd(\PP, \QQ) = 0 iff \PP = \QQ

MMD as feature matching

\mmd_k(\PP, \QQ) = \left\lVert \E_{\X \sim \PP}[ \varphi(\X) ] - \E_{\Y \sim \QQ}[ \varphi(\Y) ] \right\rVert_{\hk}

  • \varphi : \Xc \to \hk is the feature map for k(x, y) = \langle \varphi(x), \varphi(y) \rangle
  • If k(x, y) = x\tp y , \varphi(x) = x ; MMD is distance between means
  • Many kernels: infinite-dimensional \hk
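
(Quick sanity check of the last bullet, as a minimal NumPy sketch not from the talk: with the linear kernel, the plug-in MMD estimate is exactly the distance between the sample means.)

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(1000, 2))   # samples from P
    Y = rng.normal(1.0, 1.0, size=(1000, 2))   # samples from Q

    # Linear kernel: phi(x) = x, so the plug-in MMD is the Euclidean
    # distance between the two sample means.
    mmd_sq = np.mean(X @ X.T) + np.mean(Y @ Y.T) - 2 * np.mean(X @ Y.T)
    print(np.sqrt(mmd_sq), np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0)))  # equal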

Derivation of MMD

Reproducing property: if f \in \hk , f(x) = \langle f, \varphi(x) \rangle_\hk

\begin{align} \mmd&_k(\PP, \QQ) = \sup_{\lVert f \rVert_\hk \le 1} \E_{\X \sim \PP}[f(\X)] - \E_{\Y \sim \QQ}[f(\Y)] \\&\fragment[1]{ = \sup_{\lVert f \rVert_\hk \le 1} \E_{\X \sim \PP}[\langle f, \varphi(\X) \rangle_\hk] - \E_{\Y \sim \QQ}[\langle f, \varphi(\Y) \rangle_\hk] } \\&\fragment[2]{ = \sup_{\lVert f \rVert_\hk \le 1} \left\langle f, \E_{\X \sim \PP}[\varphi(\X)] - \E_{\Y \sim \QQ}[\varphi(\Y)] \right\rangle_\hk } \\&\fragment[3]{ = \sup_{\lVert f \rVert_\hk \le 1} \left\langle f, \mu^k_\PP - \mu^k_\QQ \right\rangle_\hk } \fragment[4]{ = \left\lVert \mu^k_\PP - \mu^k_\QQ \right\rVert_\hk } \\ &\fragment[5]{ \langle \mu_\PP^k, \mu_\QQ^k \rangle_\hk = \E_{\substack{\X \sim \PP\\\Y \sim \QQ}} \langle \varphi(\X), \varphi(\Y) \rangle_\hk = \E_{\substack{\X \sim \PP\\\Y \sim \QQ}} k(\X, \Y) } \end{align}

Estimating MMD

\begin{gather} \mmd_k^2(\PP, \QQ) = \E_{\X, \Xp \sim \PP}[k(\X, \Xp)] + \E_{\Y, \Yp \sim \QQ}[k(\Y, \Yp)] - 2 \E_{\substack{\X \sim \PP\\\Y \sim \QQ}}[k(\X, \Y)] \\ \fragment[0]{ \mmdhat_k^2(\X, \Y) = \fragment[1][highlight-current-red]{\mean(K_{\X\X})} + \fragment[2][highlight-current-red]{\mean(K_{\Y\Y})} - 2 \fragment[3][highlight-current-red]{\mean(K_{\X\Y})} } \end{gather}

K_{\X\X} = \begin{pmatrix} 1.0 & 0.2 & 0.6 \\ 0.2 & 1.0 & 0.5 \\ 0.6 & 0.5 & 1.0 \end{pmatrix} \qquad K_{\Y\Y} = \begin{pmatrix} 1.0 & 0.8 & 0.7 \\ 0.8 & 1.0 & 0.6 \\ 0.7 & 0.6 & 1.0 \end{pmatrix} \qquad K_{\X\Y} = \begin{pmatrix} 0.3 & 0.1 & 0.2 \\ 0.2 & 0.3 & 0.3 \\ 0.2 & 0.1 & 0.4 \end{pmatrix}
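
(A minimal NumPy sketch of the plug-in estimator above; the Gaussian kernel and its bandwidth are illustrative choices only.)

    import numpy as np

    def gaussian_kernel(A, B, bandwidth=1.0):
        # K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * bandwidth^2))
        sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq_dists / (2 * bandwidth ** 2))

    def mmd_hat_sq(X, Y, bandwidth=1.0):
        # Plug-in estimate: mean(K_XX) + mean(K_YY) - 2 mean(K_XY).
        K_XX = gaussian_kernel(X, X, bandwidth)
        K_YY = gaussian_kernel(Y, Y, bandwidth)
        K_XY = gaussian_kernel(X, Y, bandwidth)
        return K_XX.mean() + K_YY.mean() - 2 * K_XY.mean()

    rng = np.random.default_rng(0)
    X = rng.normal(0, 1, size=(500, 2))   # samples from P
    Y = rng.normal(1, 1, size=(500, 2))   # samples from Q
    print(mmd_hat_sq(X, Y))               # clearly > 0: the distributions differ

(The MMD GAN work below typically uses the unbiased variant, which drops the diagonal entries of K_{\X\X} and K_{\Y\Y} .)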

MMD as loss [Li+ ICML-15, Dziugaite+ UAI-15]

  • No need for a discriminator – just minimize \mmdhat_k !
  • Continuous loss, gives “partial credit”

Generator ( \Qtheta )

Critic

Target ( \PP )

How are these?

Not great! \mmdhat(\Qtheta, \PP) = 0.75

:( I'll try harder…

MMD models [Li+ ICML-15, Dziugaite+ UAI-15]

MNIST, mix of Gaussian kernels


\Pdata

\Qtheta

Celeb-A, mix of rational quadratic + linear kernels


\Pdata

\Qtheta

Deep kernels

\begin{gather} k_\psic(x, y) = \ktop(\phi_\psic(x), \phi_\psic(y)) \\ \phi_\psic : \Xc \to \R^D \qquad \ktop : \R^D \times \R^D \to \R \end{gather}
  • \ktop usually Gaussian, linear, …
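
(For concreteness, a PyTorch sketch of one such deep kernel, with a small fully-connected \phi_\psic and a Gaussian \ktop ; the architecture and bandwidth are placeholders, not those used in the experiments.)

    import torch
    import torch.nn as nn

    class DeepGaussianKernel(nn.Module):
        # k_psi(x, y) = exp(-||phi_psi(x) - phi_psi(y)||^2 / (2 * sigma^2))
        def __init__(self, in_dim, feat_dim=32, sigma=1.0):
            super().__init__()
            self.phi = nn.Sequential(
                nn.Linear(in_dim, 64), nn.LeakyReLU(0.2),
                nn.Linear(64, feat_dim),
            )
            self.sigma = sigma

        def forward(self, x, y):
            fx, fy = self.phi(x), self.phi(y)      # embed both batches
            sq_dists = torch.cdist(fx, fy) ** 2    # pairwise ||fx_i - fy_j||^2
            return torch.exp(-sq_dists / (2 * self.sigma ** 2))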

MMD loss with a deep kernel

k(x, y) = \ktop(\phi(x), \phi(y))

  • \phi : \Xc \to \R^{2048} from pretrained Inception net
  • \ktop simple: exponentiated quadratic or polynomial

\Pdata

\Qtheta

We just got adversarial examples!


[anishathalye/obfuscated-gradients]

Optimized MMD: MMD GANs [Li+ NeurIPS-17]

  • Don't just use one kernel, use a class parameterized by \psic : k_\psic(x, y) = \ktop(\phi_\psic(x), \phi_\psic(y))
  • New distance based on all these kernels: \begin{align*} \optmmd(\PP, \QQ) &= \sup_{\psic \in \Psic} \mmd_{\psic}(\PP, \QQ) \end{align*}
  • Minimax optimization problem \inf_\thetac \sup_\psic \mmd_\psic(\Pdata, \Qtheta)
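
(Schematically, the minimax problem is solved by alternating gradient steps, as in the sketch below. Here generator, kernel, data_loader, and mmd_sq_from_kernel — which evaluates \mean(K_{\X\X}) + \mean(K_{\Y\Y}) - 2 \mean(K_{\X\Y}) with the deep kernel — are assumed to exist; real implementations add the gradient control discussed next, several critic steps per generator step, etc.)

    import torch

    opt_psi = torch.optim.Adam(kernel.parameters(), lr=1e-4)
    opt_theta = torch.optim.Adam(generator.parameters(), lr=1e-4)

    for x_real in data_loader:
        # Critic step: maximize the MMD estimate over the kernel parameters psi.
        z = torch.randn(x_real.shape[0], 100)
        loss_psi = -mmd_sq_from_kernel(kernel, x_real, generator(z).detach())
        opt_psi.zero_grad(); loss_psi.backward(); opt_psi.step()

        # Generator step: minimize the MMD estimate over theta.
        z = torch.randn(x_real.shape[0], 100)
        loss_theta = mmd_sq_from_kernel(kernel, x_real, generator(z))
        opt_theta.zero_grad(); loss_theta.backward(); opt_theta.step()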

Non-smoothness of Optimized MMD

Illustrative problem in \R , DiracGAN [Mescheder+ ICML-18]:

  • Just need to stay away from tiny bandwidths \psi
  • …deep kernel analogue is hard.
  • Instead, keep witness function from being too steep
  • \sup_x \lVert \nabla f(x) \rVert would give Wasserstein
    • Nice distance, but hard to estimate
  • Control \lVert \nabla f(\Xtilde) \rVert on average, near the data

MMD GANs versus WGANs

  • Linear- \ktop MMD GAN, k(x, y) = \phi(x) \phi(y) :
    \begin{gather} \text{loss} = \lvert \E_\PP \phi(\X) - \E_\QQ \phi(\Y) \rvert \\ f(t) = \operatorname{sign}\left( \E_\PP \phi(\X) - \E_\QQ \phi(\Y) \right) \phi(t) \end{gather}
  • WGAN has:
    \begin{gather} \text{loss} = \E_\PP \phi(\X) - \E_\QQ \phi(\Y) \\ f(t) = \phi(t) \end{gather}
  • We were just trying something like an unregularized WGAN…

MMD-GAN with gradient control

  • If \Psic gives uniformly Lipschitz critics, \optmmd is smooth
  • Original MMD-GAN paper [Li+ NeurIPS-17]: box constraint
  • We [Bińkowski+ ICLR-18] used gradient penalty on critic instead
    • Better in practice, but doesn't fix the Dirac problem…

New distance: Scaled MMD

Want to ensure \E_{\Xtilde \sim \SS}[ \lVert \nabla f(\Xtilde) \rVert^2 ] \le 1

Can do directly with kernel properties…but too expensive!

Guaranteed if \lVert f \rVert_\hk \le \sigma_{\SS,k,\lambda}

\sigma_{\SS,k,\lambda} := \left( \lambda + \E_{\Xtilde \sim \SS}\left[ k(\Xtilde, \Xtilde) + [\nabla_1 \!\cdot\! \nabla_2 k](\Xtilde, \Xtilde) \right] \right)^{-\frac12}

Gives distance \smmd_{\SS,k,\lambda}(\PP, \QQ) = \sigma_{\SS,k,\lambda} \mmd_k(\PP, \QQ)

\begin{align} \optmmd \text{ has } & \F = \bigcup_{\psic \in \Psic} \left\{ f : \lVert f \rVert_{\h_{\psic}} \le 1 \right\} \\ \optsmmd \text{ has } & \F = \bigcup_{\psic \in \Psic} \left\{ f : \lVert f \rVert_{\h_{\psic}} \le \sigma_{\SS,k,\lambda} \right\} \end{align}
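
(To make the scaling concrete: for a plain Gaussian kernel on the raw inputs, k(\Xtilde, \Xtilde) = 1 and [\nabla_1 \!\cdot\! \nabla_2 k](\Xtilde, \Xtilde) = d / \text{bandwidth}^2 , so \sigma_{\SS,k,\lambda} has a closed form. Below is a minimal sketch of that special case, reusing mmd_hat_sq from the estimator slide; the deep-kernel setting used in the experiments replaces the trace term with an expected Jacobian norm.)

    def smmd_sq(X, Y, bandwidth=1.0, lam=1.0):
        # Scaled MMD^2 = sigma^2 * MMD^2, with sigma as defined on this slide.
        # For a Gaussian kernel on raw inputs: k(x, x) = 1 and
        # [grad_1 . grad_2 k](x, x) = d / bandwidth^2, independent of x,
        # so the expectation over S is trivial.
        d = X.shape[1]
        sigma_sq = 1.0 / (lam + 1.0 + d / bandwidth ** 2)
        return sigma_sq * mmd_hat_sq(X, Y, bandwidth)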

Deriving the Scaled MMD

\begin{gather} \fragment[0]{\E_{\Xtilde \sim \SS}[ f(\Xtilde)^2 ] + } \E_{\Xtilde \sim \SS}[ \lVert \nabla f(\Xtilde) \rVert^2 ] \fragment[0]{+ \lambda \lVert f \rVert_\h^2} \le 1 \\ \fragment{ \E_{\Xtilde \sim \SS}[ f(\Xtilde)^2 ] = \left\langle f, \E_{\Xtilde \sim \SS}\left[ \varphi(\Xtilde) \otimes \varphi(\Xtilde) \right] f \right\rangle_\h } \\ \fragment{ \E_{\Xtilde \sim \SS}[ \lVert \nabla f(\Xtilde) \rVert^2 ] = \left\langle f, \E_{\Xtilde \sim \SS}\left[ \sum_{i=1}^d \partial_i \varphi(\Xtilde) \otimes \partial_i \varphi(\Xtilde) \right] f \right\rangle_\h } \end{gather}

Constraint can be written \langle f, D_\lambda f \rangle_\h \le 1

\langle f, D_\lambda f \rangle_\h \fragment{\le \lVert D_\lambda \rVert \, \lVert f \rVert_\h^2} \fragment{ \le \sigma_{\SS,k,\lambda}^{-2} \lVert f \rVert_\h^2}

Smoothness of \D_\mathrm{SMMD}

Theorem: \optsmmd is continuous.

If \SS has a density; \ktop is Gaussian/linear/…; \phi_\psic is fully-connected, Leaky-ReLU, non-increasing width; all weights in \Psic have bounded condition number; then

\W(\QQ_n, \PP) \to 0 \text{ implies } \optsmmd(\QQ_n, \PP) \to 0 .

Keeping weight condition numbers bounded

  • Spectral parameterization [Miyato+ ICLR-18]:
  • W = \gamma \bar W / \lVert \bar W \rVert_\mathrm{op} ; learn \gamma and \bar W freely
  • Encourages diversity without limiting representation
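
(A minimal PyTorch sketch of the idea for one linear layer, not the authors' implementation; torch.linalg.matrix_norm with ord=2 computes the spectral norm via an SVD.)

    import torch
    import torch.nn as nn

    class SpectralParamLinear(nn.Module):
        # W = gamma * W_bar / ||W_bar||_op, with gamma and W_bar both learned freely.
        def __init__(self, in_features, out_features):
            super().__init__()
            self.w_bar = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
            self.gamma = nn.Parameter(torch.ones(()))
            self.bias = nn.Parameter(torch.zeros(out_features))

        def forward(self, x):
            w = self.gamma * self.w_bar / torch.linalg.matrix_norm(self.w_bar, ord=2)
            return x @ w.T + self.bias

(In practice the spectral norm is usually tracked with power iteration, as in torch.nn.utils.spectral_norm, rather than a full SVD every step.)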

Rank collapse

  • Occasional optimization failure without spectral param:
    • Generator doing reasonably well
    • Critic filters become low-rank
    • Generator corrects it by breaking everything else
    • Generator gets stuck

What if we just did spectral normalization?

  • W = \bar W / \lVert \bar W \rVert_\text{op} , so that \lVert W \rVert_\text{op} = 1 , \lVert \phi_\psic \rVert_L \le 1
  • Works well for original GANs [Miyato+ ICLR-18]
  • …but doesn't work at all as the only constraint in a WGAN
  • Limits representation too much
    • In DiracGAN, only allows bandwidth 1
    • \lVert x \mapsto \sigma(W_n \cdots \sigma(W_1 x)) \rVert_L \ll \lVert W_n \rVert_\text{op} \cdots \lVert W_1 \rVert_\text{op}

Continuity theorem proof

  • k_\psic(x, y) = \ktop(\phi_\psic(x), \phi_\psic(y)) means \small d_\psic(x, y) = \lVert k_\psic(x, \cdot) - k_\psic(y, \cdot) \rVert_{\h_{k_\psic}} \le L_{\ktop} \lVert \phi_\psic \rVert_\lip \lVert x - y \rVert
  • Can show \mmd_\psic \le \W_{d_\psic} \le L_{\ktop} \lVert \phi_\psic \rVert_\lip \W
  • By assumption on \ktop , \sigma_{\SS,k,\lambda}^{-2} \ge \gamma_{\ktop}^2 \E[\lVert \nabla \phi_\psic(\Xtilde) \rVert_F^2]
  • \smmd^2 \le \frac{L_{\ktop}^2 \lVert \phi_\psic \rVert_\lip^2}{\gamma_{\ktop}^2 \E\left[\lVert \nabla_{\Xtilde} \phi_\psic(\Xtilde) \rVert_F^2\right]} \W \fragment[5]{\le \frac{L_{\ktop}^2 \kappa^L}{\gamma_{\ktop}^2 d_\mathrm{top} \alpha^L} \W}
  • Because Leaky-ReLU, \phi_\psic(X) = \alpha(\psic) \phi_{\bar\psic}(X) , \lVert \phi_{\bar\psic} \rVert_\lip \le 1
  • For Lebesgue-almost all \Xtilde , \lVert \nabla_\Xtilde \phi_{\bar\psic}(\Xtilde) \rVert_F^2 \ge \frac{d_\mathrm{top} \alpha^L}{\kappa^L}

\D_\mathrm{SMMD} : 2d example

[Figure: target \PP and model \Qtheta samples in 2d, with the learned kernels and critic gradient fields for \mathrm{SMMD}_{\PP, k, \lambda} vs. \mathrm{MMD}_{k} , shown early and late in optimization]

Model on 160 \times 160 CelebA

SN-SMMD-GAN
KID: 0.006
WGAN-GP
KID: 0.022

Implicit generative model evaluation

  • No likelihoods, so…how to compare models?
  • Main approach:
    look at a bunch of pictures and see if they're pretty or not
    • Easy to find (really) bad samples
    • Hard to see if modes are missing / have wrong probabilities
    • Hard to compare models beyond certain threshold
  • Need better, quantitative methods
  • Our method: Kernel Inception Distance (KID)

Inception score [Salimans+ NIPS-16]

  • Previously standard quantitative method
  • Based on ImageNet classifier label predictions
    • Classifier should be confident on individual images
    • Predicted labels should be diverse across sample
  • No notion of target distribution \Pdata
  • Scores completely meaningless on LSUN, Celeb-A, SVHN, …
  • Not great on CIFAR-10 either

Fréchet Inception Distance (FID) [Heusel+ NIPS-17]

  • Fit normals to Inception hidden layer activations of \PP and \QQ
  • Compute Fréchet (Wasserstein-2) distance between fits
  • Meaningful on not-ImageNet datasets
  • Estimator extremely biased, tiny variance
  • Possible that \operatorname{FID}(\PP_1, \QQ) < \operatorname{FID}(\PP_2, \QQ) but \E \operatorname{FID}(\hat \PP_1, \QQ) > \E \operatorname{FID}(\hat \PP_2, \QQ)

New method: Kernel Inception Distance (KID)
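
  • KID: squared MMD between Inception representations of real and generated samples, with the polynomial kernel k(x, y) = \left( \tfrac1d x\tp y + 1 \right)^3 [Bińkowski+ ICLR-18]
  • Has an unbiased estimator, unlike FID

(A minimal NumPy sketch of that estimator, assuming the Inception features are already extracted; the subset averaging used in practice is omitted.)

    import numpy as np

    def polynomial_kernel(A, B):
        d = A.shape[1]
        return (A @ B.T / d + 1.0) ** 3

    def kid(feats_real, feats_gen):
        # Unbiased MMD^2 with the cubic polynomial kernel on Inception features.
        m, n = len(feats_real), len(feats_gen)
        K_XX = polynomial_kernel(feats_real, feats_real)
        K_YY = polynomial_kernel(feats_gen, feats_gen)
        K_XY = polynomial_kernel(feats_real, feats_gen)
        term_x = (K_XX.sum() - np.trace(K_XX)) / (m * (m - 1))
        term_y = (K_YY.sum() - np.trace(K_YY)) / (n * (n - 1))
        return term_x + term_y - 2 * K_XY.mean()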

Automatic learning rate adaptation with KID

  • Models need appropriate learning rate schedule to work well
  • Automate with three-sample MMD test [Bounliphone+ ICLR-16]:
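
(Roughly: decay the learning rate when the test stops detecting improvement over an earlier checkpoint. Below is a placeholder sketch of that logic only; p_value is assumed to come from the relative MMD test comparing current vs. checkpoint samples against real data, and the threshold, patience, and decay values are made-up defaults, not the paper's.)

    def adapt_learning_rate(p_value, state, patience=3, decay=2.0):
        # state: dict with keys "fails" and "lr". Placeholder logic only.
        if p_value > 0.5:                 # no evidence the model has improved
            state["fails"] += 1
            if state["fails"] >= patience:
                state["lr"] /= decay      # decay generator & critic learning rates
                state["fails"] = 0
        else:
            state["fails"] = 0
        return state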

Training process on CelebA


Controlling critic complexity

Model on 64 \times 64 ImageNet

SN-SMMD-GAN
KID: 0.035
SN-GAN
KID: 0.044
BGAN
KID: 0.047

Recap

  • Can train generative models by minimizing a flexible, smooth distance between distributions
  • Combine kernels with gradient penalties
  • Strong practical results, some understanding of theory
Demystifying MMD GANs
Bińkowski*, Sutherland*, Arbel, and Gretton
ICLR 2018
On Gradient Regularizers for MMD GANs
Arbel*, Sutherland*, Bińkowski, and Gretton
NeurIPS 2018

Links + code: see djsutherland.ml. Thanks!

Learning deep kernels for exponential family densities
Wenliang*, Sutherland*, Strathmann, and Gretton
ICML 2019