Michael Arbel (UCL)

Mikołaj Bińkowski (Imperial)

Arthur Gretton (UCL)

UMass Amherst, Sep 30 2019


Given samples $X_1, \dots, X_n \sim \mathbb{P}$, a distribution over $\mathcal{X}$,

we want a model $\mathbb{Q}_\theta$ that can produce new samples from $\mathbb{P}$

- Automated animation (anime characters)
- Learn calligraphic font style (`zi2zi`)
- Image translation (`pix2pix`, Wolf+ 17, Everybody Dance Now)
- Help musicians improvise, “natural” image editing, plan robot actions, anonymize datasets, …
- Make $432,500 selling a generated image

One choice: define $\mathbb{Q}_\theta$ with a generator!

Fixed distribution of latents: $Z \sim \mathcal{N}(0, I)$

Maps through a network: $G_\theta(Z) \sim \mathbb{Q}_\theta$

How to choose $\theta$?
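As a concrete (hypothetical) instance, a generator is nothing more than a fixed latent sampler pushed through a network; here a tiny random MLP in NumPy stands in for $G_\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer MLP generator G_theta: latent z in R^2 -> sample x in R^2.
# The weights play the role of theta (here fixed at random init).
W1, b1 = rng.normal(size=(16, 2)), np.zeros(16)
W2, b2 = rng.normal(size=(2, 16)), np.zeros(2)

def generator(z):
    h = np.maximum(0.0, W1 @ z + b1)  # ReLU hidden layer
    return W2 @ h + b2

# Fixed latent distribution Z ~ N(0, I); a model sample is G_theta(Z).
z = rng.normal(size=2)
x = generator(z)
print(x)
```

Training then amounts to adjusting the weights so that the distribution of `generator(z)` matches the data distribution.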

Generator ($\mathbb{Q}_\theta$)

Discriminator

Target ($\mathbb{P}$)

Is this real?

No way!

:( I'll try harder…

⋮

Is this real?

Umm…

- What happens when the discriminator $D$ is at its optimum?
- If the distributions have densities, $D^*(x) = \frac{p(x)}{p(x) + q_\theta(x)}$
- If $D$ stays optimal throughout, the generator tries to minimize $\mathbb{E}_{\mathbb{P}}[\log D^*(X)] + \mathbb{E}_{\mathbb{Q}_\theta}[\log(1 - D^*(Y))]$, which is $2\,\mathrm{JS}(\mathbb{P}, \mathbb{Q}_\theta) - \log 4$
- If $\mathbb{P}$ and $\mathbb{Q}_\theta$ have (almost) disjoint support, the JS divergence saturates at $\log 2$, so the generator gets no useful gradient
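The identity between the optimal-discriminator objective and Jensen–Shannon divergence can be checked numerically; this small sketch (my own illustration, with made-up discrete densities) verifies that the GAN objective at $D^*$ equals $2\,\mathrm{JS} - \log 4$:

```python
import numpy as np

# Two made-up discrete densities on a 4-point grid (stand-ins for p and q_theta).
p = np.array([0.5, 0.3, 0.2, 0.0])
q = np.array([0.0, 0.2, 0.3, 0.5])

# Optimal discriminator D*(x) = p(x) / (p(x) + q(x)).
d_star = np.divide(p, p + q, out=np.full_like(p, 0.5), where=(p + q) > 0)

def weighted_log(w, d):
    # sum of w * log(d), with the convention 0 * log(0) = 0
    mask = w > 0
    return float(np.sum(w[mask] * np.log(d[mask])))

# GAN objective evaluated at the optimal discriminator.
value = weighted_log(p, d_star) + weighted_log(q, 1 - d_star)

# Jensen-Shannon divergence between the same densities.
m = 0.5 * (p + q)
def kl(a, b):
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(value, 2 * js - np.log(4))  # the two values agree
```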

Generator ($\mathbb{Q}_\theta$)

Discriminator

Target ($\mathbb{P}$)

Is this real?

No way!

:( I don't know how to do any better…

- At initialization, this looks pretty reasonable:
- Remember $\mathcal{X}$ may be very high-dimensional (raw image space)
- For the usual $\mathbb{Q}_\theta$, samples are supported on a countable union of manifolds with dimension at most that of the latent space
- The “natural image manifold” is usually considered low-dimensional
- No chance that they'd align at initialization, so the supports are (almost) disjoint

$f$ is a *critic function*:

Total variation: $\mathrm{TV}(\mathbb{P}, \mathbb{Q}) = \sup_{\|f\|_\infty \le 1} \mathbb{E}_{\mathbb{P}}[f(X)] - \mathbb{E}_{\mathbb{Q}}[f(Y)]$

Wasserstein: $\mathcal{W}(\mathbb{P}, \mathbb{Q}) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \mathbb{E}_{\mathbb{P}}[f(X)] - \mathbb{E}_{\mathbb{Q}}[f(Y)]$

Kernel $k(x, y)$ – a “similarity” function

For many kernels, $\mathrm{MMD}_k(\mathbb{P}, \mathbb{Q}) = 0$ iff $\mathbb{P} = \mathbb{Q}$

- $\varphi(x) = k(x, \cdot)$ is the *feature map* for $k$
- If $f \in \mathcal{H}_k$, $\mathbb{E}[f(X)] = \langle f, \mu_{\mathbb{P}} \rangle_{\mathcal{H}_k}$ with $\mu_{\mathbb{P}} = \mathbb{E}[\varphi(X)]$; MMD is the distance between means: $\mathrm{MMD}_k(\mathbb{P}, \mathbb{Q}) = \|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|_{\mathcal{H}_k}$
- Many kernels: **infinite-dimensional** feature space

*Reproducing property*: if $f \in \mathcal{H}_k$, $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}_k}$

Kernel matrices (illustrative values):

$K_{XX}$ (within one sample):

| 1.0 | 0.2 | 0.6 |
| 0.2 | 1.0 | 0.5 |
| 0.6 | 0.5 | 1.0 |

$K_{YY}$ (within the other sample):

| 1.0 | 0.8 | 0.7 |
| 0.8 | 1.0 | 0.6 |
| 0.7 | 0.6 | 1.0 |

$K_{XY}$ (across the two samples):

| 0.3 | 0.1 | 0.2 |
| 0.2 | 0.3 | 0.3 |
| 0.2 | 0.1 | 0.4 |

- No need for a discriminator – just minimize the MMD!
- Continuous loss, gives “partial credit”
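A biased $\widehat{\mathrm{MMD}}^2$ estimate combines the three kernel matrices as $\mathrm{mean}(K_{XX}) + \mathrm{mean}(K_{YY}) - 2\,\mathrm{mean}(K_{XY})$. A minimal NumPy sketch with a Gaussian kernel (the bandwidth, dimensions, and sample sizes here are arbitrary choices):

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth**2))

def mmd2_biased(X, Y, bandwidth=1.0):
    # Biased estimator: mean(K_XX) + mean(K_YY) - 2 * mean(K_XY)
    K_XX = gaussian_kernel(X, X, bandwidth)
    K_YY = gaussian_kernel(Y, Y, bandwidth)
    K_XY = gaussian_kernel(X, Y, bandwidth)
    return K_XX.mean() + K_YY.mean() - 2 * K_XY.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 2))       # "target" sample
Y_same = rng.normal(0.0, 1.0, size=(300, 2))  # same distribution
Y_far = rng.normal(3.0, 1.0, size=(300, 2))   # shifted distribution
print(mmd2_biased(X, Y_same))  # small
print(mmd2_biased(X, Y_far))   # clearly larger: "partial credit" as Y moves away
```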

Generator ($\mathbb{Q}_\theta$)

Critic

Target ($\mathbb{P}$)

How are these?

Not great!

:( I'll try harder…

⋮

MNIST, mix of Gaussian kernels

Celeb-A, mix of rational quadratic + linear kernels

- Kernel usually Gaussian, linear, …

- Features from a pretrained Inception net
- Top-level kernel simple: exponentiated quadratic or polynomial

- Don't just use one kernel, use a *class* parameterized by $\psi$: $k_\psi(x, y) = \kappa(\phi_\psi(x), \phi_\psi(y))$
- New distance based on *all* these kernels: $\mathcal{D}(\mathbb{P}, \mathbb{Q}) = \sup_\psi \mathrm{MMD}_{k_\psi}(\mathbb{P}, \mathbb{Q})$
- Minimax optimization problem: $\min_\theta \sup_\psi \mathrm{MMD}_{k_\psi}(\mathbb{P}, \mathbb{Q}_\theta)$
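The sup over a kernel class can be illustrated with a toy stand-in in which the kernel parameter is just a Gaussian bandwidth chosen from a grid (a hypothetical simplification; in the talk the kernels involve deep feature networks):

```python
import numpy as np

def mmd2(X, Y, bw):
    # Biased MMD^2 estimate with a Gaussian kernel of bandwidth bw.
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * bw**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(300, 1))
Y = rng.normal(0.5, 1.0, size=(300, 1))

# "Class" of kernels = Gaussians with a few bandwidths; the critic's job
# (the sup over the class) reduces here to picking the bandwidth that
# makes the two samples look most different.
bandwidths = [0.1, 0.5, 1.0, 2.0, 5.0]
best = max(bandwidths, key=lambda bw: mmd2(X, Y, bw))
print(best, mmd2(X, Y, best))
```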

Illustrative problem in $\mathbb{R}$, DiracGAN [Mescheder+ ICML-18]: target $\mathbb{P} = \delta_0$, model $\mathbb{Q}_\theta = \delta_\theta$

- Just need to stay away from tiny kernel bandwidths
- …but the deep-kernel analogue is hard.
- Instead, keep the witness function from being too steep
- Requiring $\|f\|_{\mathrm{Lip}} \le 1$ would give the Wasserstein distance
- Nice distance, but hard to estimate
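The DiracGAN setup admits a closed-form MMD with a Gaussian kernel: $\mathrm{MMD}^2(\delta_0, \delta_\theta) = 2 - 2\exp(-\theta^2 / (2\sigma^2))$. This sketch (my own derivation, not from the slides) shows how the generator's gradient vanishes for tiny bandwidths:

```python
import numpy as np

# DiracGAN-style toy: target delta_0, model delta_theta, Gaussian kernel.
# MMD^2 = k(0,0) + k(t,t) - 2 k(0,t) = 2 - 2 exp(-t^2 / (2 bw^2)).
def mmd2_dirac(theta, bw):
    return 2.0 - 2.0 * np.exp(-theta**2 / (2 * bw**2))

def grad_mmd2_dirac(theta, bw):
    # d/dtheta MMD^2 = (2 theta / bw^2) * exp(-theta^2 / (2 bw^2))
    return (2 * theta / bw**2) * np.exp(-theta**2 / (2 * bw**2))

theta = 3.0
print(grad_mmd2_dirac(theta, bw=1.0))   # useful gradient toward theta = 0
print(grad_mmd2_dirac(theta, bw=0.01))  # vanishes for a tiny bandwidth
```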

- Control the critic's gradient *on average, near the data*

- Linear-kernel MMD GAN: the critic is a linear function of the learned features $\phi_\psi$
- WGAN has: a scalar critic network trained directly
- We were just trying something like an unregularized WGAN…

- If the kernel class gives uniformly Lipschitz critics, the loss $\sup_\psi \mathrm{MMD}_{k_\psi}(\mathbb{P}, \mathbb{Q}_\theta)$ is smooth in $\theta$
- Original MMD-GAN paper [Li+ NeurIPS-17]: box constraint
- We [Bińkowski+ ICLR-18] used gradient penalty on critic instead
- Better in practice, but doesn't fix the Dirac problem…
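A gradient penalty on the critic can be sketched in NumPy; to avoid autodiff, this toy uses a hypothetical quadratic critic whose input gradient is analytic (a stand-in for the real deep critic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quadratic critic f(x) = x^T A x + b^T x, so its input
# gradient has the closed form grad f(x) = (A + A^T) x + b.
A = rng.normal(size=(2, 2))
b = rng.normal(size=2)

def critic_grad(x):
    return (A + A.T) @ x + b

def gradient_penalty(real, fake):
    # Penalize critic gradient norms away from 1, evaluated at points
    # interpolated between real and generated samples.
    eps = rng.uniform(size=(len(real), 1))
    mix = eps * real + (1 - eps) * fake
    grads = np.array([critic_grad(x) for x in mix])
    norms = np.linalg.norm(grads, axis=1)
    return np.mean((norms - 1.0) ** 2)

real = rng.normal(0.0, 1.0, size=(64, 2))
fake = rng.normal(2.0, 1.0, size=(64, 2))
print(gradient_penalty(real, fake))
```

In a real critic the gradient would come from automatic differentiation, and the penalty would be added to the critic's loss with a weighting coefficient.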

Want to ensure

Can do directly with kernel properties… but too expensive!

Guaranteed if

Gives the *Scaled MMD* distance

Constraint can be written

**Theorem:** $\theta \mapsto \mathrm{SMMD}(\mathbb{P}, \mathbb{Q}_\theta)$ is continuous.

If $\mathbb{P}$ has a density; the top-level kernel is Gaussian/linear/…; the network is fully-connected, Leaky-ReLU, with non-increasing widths; and all weight matrices have bounded condition number; then the theorem applies.

- Spectral parameterization [Miyato+ ICLR-18]:
- $W = \sigma \, \bar{W} / \|\bar{W}\|_{\mathrm{op}}$; learn $\sigma$ and $\bar{W}$ freely
- Encourages diversity without limiting representation
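A minimal sketch of the parameterization, using power iteration for the spectral norm (the matrix shapes and the learned scale here are arbitrary choices):

```python
import numpy as np

def spectral_norm(W, n_iters=50):
    # Power iteration for the largest singular value of W.
    v = np.ones(W.shape[1]) / np.sqrt(W.shape[1])
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

rng = np.random.default_rng(0)
W_bar = rng.normal(size=(8, 4))  # unconstrained parameter
sigma = 2.5                      # learned scale, a free parameter

# Spectral parameterization: W = sigma * W_bar / ||W_bar||_op,
# so ||W||_op = sigma exactly, while W_bar's "direction" stays free.
W = sigma * W_bar / spectral_norm(W_bar)
print(spectral_norm(W))  # approximately sigma
```

Because the operator norm is carried entirely by the scalar, the layer can still represent any direction of weights while its Lipschitz scale is explicit.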

- Occasional optimization failure without spectral param:
- Generator doing reasonably well
- Critic filters become low-rank
- Generator corrects it by breaking everything else
- Generator gets stuck

- Spectral normalization: $W = \bar{W} / \|\bar{W}\|_{\mathrm{op}}$, so that $\|W\|_{\mathrm{op}} = 1$
- Works well for original GANs [Miyato+ ICLR-18]
- …but doesn't work at all as the only constraint in a WGAN
- Limits representation too much
- In DiracGAN, only allows bandwidth 1

- means
- Can show
- By assumption on ,
- Because Leaky-ReLU, ,
- For Lebesgue-almost all ,

Target and model samples

Kernels from , early in optimization

Kernels from (early)

Critic gradients from (early)

Critic gradients from (early)

Kernels from , **late** in optimization

Kernels from (late)

Critic gradients from (late)