Kernel Methods for Representing and Learning on Distributions

Danica J. Sutherland(she/her)

University of British Columbia + Amii

Lifting Inference with Kernel Embeddings (LIKE-23), June 2023

This talk: how to lift inference with kernel embeddings

Slides at `djsutherland.ml/slides/like23`

(Swipe or arrow keys to move through slides; Esc for a menu to jump; ? for more.)

PDF version at `djsutherland.ml/slides/like23.pdf`

HTML version at `djsutherland.ml/slides/like23`

- Machine learning! …but how do we actually do it?
- Linear models! $f(x) = \langle w, x \rangle + b$
- Extend with fixed features: $f(x) = \langle w, \phi(x) \rangle + b$
- Kernels are basically a way to study doing this with any, potentially very complicated, $\phi$
- Convenient way to make models on documents, graphs, videos, datasets, __probability distributions__, …
- $\phi(x)$ will live in a *reproducing kernel Hilbert space*

- A **Hilbert space** $\mathcal H$ is a complete (real or complex) inner product space
- Inner product space: a vector space with an **inner product** $\langle \cdot, \cdot \rangle_{\mathcal H}$:
  - $\langle \alpha_1 f_1 + \alpha_2 f_2, g \rangle_{\mathcal H} = \alpha_1 \langle f_1, g \rangle_{\mathcal H} + \alpha_2 \langle f_2, g \rangle_{\mathcal H}$
  - $\langle f, g \rangle_{\mathcal H} = \overline{\langle g, f \rangle_{\mathcal H}}$
  - $\langle f, f \rangle_{\mathcal H} > 0$ for $f \ne 0$
- Defines a **norm**: $\lVert f \rVert_{\mathcal H} = \sqrt{\langle f, f \rangle_{\mathcal H}}$
- Complete: “well-behaved” (Cauchy sequences have limits in $\mathcal H$)

- Call our domain $\mathcal X$, some set
  - $\mathbb R^d$, functions, distributions, graphs, images, …
- $k : \mathcal X \times \mathcal X \to \mathbb R$ is a kernel on $\mathcal X$ if there exists a Hilbert space $\mathcal H$ and a *feature map* $\phi : \mathcal X \to \mathcal H$ so that $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H}$
- Roughly, $k(x, y)$ is a notion of “similarity” between inputs
- *Linear kernel* on $\mathbb R^d$: $k(x, y) = \langle x, y \rangle = x^\top y$, with $\phi(x) = x$

- Our concept: "positive semi-definite kernel," "Mercer kernel," "RKHS kernel"
- Exactly the same: GP covariance function
- Semi-related: kernel density estimation
  - Usually symmetric, $k(x, y) = k(y, x)$, like an RKHS kernel
  - Always requires $\int k(x, y) \,\mathrm{d}x = 1$, unlike RKHS kernels
  - Often requires $k(x, y) \ge 0$, unlike RKHS kernels
  - Not required to be an inner product, unlike RKHS kernels

- Unrelated:
- The kernel (null space) of a linear map
- The kernel of a probability density
- The kernel of a convolution
- CUDA kernels
- The Linux kernel
- Popcorn kernels

- Scaling: if $\gamma \ge 0$, $\gamma k$ is a kernel (use $\phi_{\gamma k}(x) = \sqrt{\gamma}\, \phi_k(x)$)
- Sum: $k_1 + k_2$ is a kernel (use $\phi(x) = (\phi_{k_1}(x), \phi_{k_2}(x))$)
- Is $k_1 - k_2$ necessarily a kernel?
  - Take $k_1(x, y) = 0$, $k_2(x, y) = x y$ on $\mathcal X = \mathbb R$.
  - Then $(k_1 - k_2)(x, x) = -x^2 < 0$ for $x \ne 0$
  - But any kernel has $k(x, x) = \langle \phi(x), \phi(x) \rangle_{\mathcal H} = \lVert \phi(x) \rVert_{\mathcal H}^2 \ge 0$.

- A symmetric function $k : \mathcal X \times \mathcal X \to \mathbb R$,
  i.e. $k(x, y) = k(y, x)$,

  is *positive semi-definite* (psd)

  if for all $n \ge 1$, $(a_1, \dots, a_n) \in \mathbb R^n$, $(x_1, \dots, x_n) \in \mathcal X^n$, $\sum_{i=1}^n \sum_{j=1}^n a_i a_j k(x_i, x_j) \ge 0$
- Equivalent: the __kernel matrix__ $K$, with $K_{ij} = k(x_i, x_j)$, is psd (eigenvalues $\ge 0$)
- Hilbert space kernels are psd: $\sum_i \sum_j a_i a_j \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal H} = \bigl\lVert \sum_i a_i \phi(x_i) \bigr\rVert_{\mathcal H}^2 \ge 0$
- psd functions are Hilbert space kernels
  - Moore-Aronszajn Theorem; we'll come back to this
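As a quick numerical check of the kernel-matrix characterization above, here is a small numpy sketch (the helper names are ours, not from the talk): a Gaussian kernel matrix has nonnegative eigenvalues, while a difference of two kernels can fail to.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian kernel on scalars: exp(-(x - y)^2 / (2 sigma^2))."""
    return np.exp(-((x - y) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
xs = rng.normal(size=10)

# Kernel matrix K_ij = k(x_i, x_j): psd <=> all eigenvalues >= 0
K = rbf(xs[:, None], xs[None, :])
min_eig = np.linalg.eigvalsh(K).min()

# k1 - k2 with k1 = 0 and k2 the linear kernel: its "kernel matrix"
# is -x x^T, which has a strictly negative eigenvalue, so not psd
min_eig_diff = np.linalg.eigvalsh(-np.outer(xs, xs)).min()
```

Up to floating-point error, `min_eig` is nonnegative while `min_eig_diff` is clearly negative.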

- Limits: if $k(x, y) = \lim_{m \to \infty} k_m(x, y)$ exists, $k$ is psd
- Products: $k_1 k_2$ is psd
  - Let $V \sim \mathcal N(0, K_1)$, $W \sim \mathcal N(0, K_2)$ be independent
  - $\operatorname{Cov}(V_i W_i, V_j W_j) = \operatorname{Cov}(V_i, V_j) \operatorname{Cov}(W_i, W_j) = (K_1)_{ij} (K_2)_{ij}$
  - Covariance matrices are psd, so the entrywise product $K_1 \circ K_2$ is too

- Powers: $k^m$ is psd for any integer $m \ge 0$
  - $k(x, y) = (x^\top y + c)^m$ for $c \ge 0$: the polynomial kernel
- Exponents: $\exp(k(x, y))$ is psd
  - Take the limit of the psd functions $\sum_{m=0}^M k(x, y)^m / m!$
- If $f : \mathcal X \to \mathbb R$, then $f(x)\, k(x, y)\, f(y)$ is psd
  - Use the feature map $x \mapsto f(x)\, \phi(x)$
- $\exp\bigl( -\tfrac{1}{2\sigma^2} \lVert x - y \rVert^2 \bigr)$, the Gaussian kernel
  - $= \exp\bigl( -\tfrac{\lVert x \rVert^2}{2\sigma^2} \bigr) \exp\bigl( \tfrac{x^\top y}{\sigma^2} \bigr) \exp\bigl( -\tfrac{\lVert y \rVert^2}{2\sigma^2} \bigr)$: psd by the last two rules
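This numpy snippet (our own illustration, not from the slides) confirms numerically that the Gaussian kernel factors as $f(x)\, \exp(x^\top y / \sigma^2)\, f(y)$ with $f(x) = \exp(-\lVert x \rVert^2 / (2\sigma^2))$, combining the exponent and $f \cdot k \cdot f$ rules:

```python
import numpy as np

sigma = 1.3
rng = np.random.default_rng(1)
x, y = rng.normal(size=3), rng.normal(size=3)

# Gaussian kernel computed directly
gauss = np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

# Built from the combination rules:
#   exp of a scaled linear kernel (psd by the exponent rule),
#   multiplied by f(x) f(y) with f(x) = exp(-||x||^2 / (2 sigma^2))
f = lambda z: np.exp(-np.sum(z**2) / (2 * sigma**2))
built = f(x) * np.exp(x @ y / sigma**2) * f(y)
```

The two values agree exactly, since $-\lVert x - y \rVert^2 = -\lVert x \rVert^2 + 2 x^\top y - \lVert y \rVert^2$.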

- Recall original motivating example with $\mathcal H = \mathbb R^d$, $\phi(x) = x$
  - Kernel is $k(x, y) = \langle x, y \rangle$
  - Classifier based on linear $f(x) = \langle w, x \rangle$
  - $w$ is the function $f$ itself; corresponds to a vector in $\mathcal H$
  - $f(x) = \langle w, \phi(x) \rangle$ is the function evaluated at a point
- Elements of $\mathcal H$ are functions, $f : \mathcal X \to \mathbb R$
- Reproducing property: $f(x) = \langle f, \phi(x) \rangle_{\mathcal H}$ for $f \in \mathcal H$

- Every psd kernel $k$ on $\mathcal X$ defines a (unique) Hilbert space, its RKHS $\mathcal H_k$,

  and a map $\phi : \mathcal X \to \mathcal H_k$ where $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H_k}$
- Elements $f \in \mathcal H_k$ are functions on $\mathcal X$, with $f(x) = \langle f, \phi(x) \rangle_{\mathcal H_k}$

- Combining the two, we sometimes write $\phi(x) = k(x, \cdot)$, so $k(x, y) = \langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal H_k}$
- $f \mapsto f(x)$ is the
**evaluation functional**

An RKHS is defined by it being *continuous*, or bounded

- Building $\mathcal H_k$ for a given psd $k$:
  - Start with $\mathcal H_0 = \operatorname{span}\bigl( \{ k(x, \cdot) : x \in \mathcal X \} \bigr)$
  - Define $\langle \cdot, \cdot \rangle_{\mathcal H_0}$ from $\bigl\langle \sum_i a_i k(x_i, \cdot), \sum_j b_j k(y_j, \cdot) \bigr\rangle_{\mathcal H_0} = \sum_i \sum_j a_i b_j k(x_i, y_j)$
  - Take $\mathcal H_k$ to be the completion of $\mathcal H_0$ in the metric from $\langle \cdot, \cdot \rangle_{\mathcal H_0}$
  - Get that the reproducing property holds for $k(x, \cdot)$ in $\mathcal H_k$
  - Can also show uniqueness

- Theorem (Moore-Aronszajn): $k$ is psd iff it's the reproducing kernel of an RKHS

- $k(x, y) = \langle x, y \rangle$ on $\mathcal X = \mathbb R^d$
- $k(x, \cdot) = [\, y \mapsto \langle x, y \rangle \,]$ “corresponds to” $x$

- If $f = \sum_i a_i k(x_i, \cdot)$, then $f(y) = \bigl\langle \sum_i a_i x_i, y \bigr\rangle$
- Closure doesn't add anything here, since $\mathbb R^d$ is closed
- So, the linear kernel gives you the RKHS of linear functions on $\mathbb R^d$

- For the Gaussian kernel, $\mathcal H_k$ is
*infinite-dimensional*
- Functions in $\mathcal H_k$ are bounded: $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal H_k} \le \lVert f \rVert_{\mathcal H_k} \sqrt{k(x, x)} = \lVert f \rVert_{\mathcal H_k}$
- Choice of $\sigma$ controls how fast functions can vary: $\lvert f(x) - f(y) \rvert \le \lVert f \rVert_{\mathcal H_k} \lVert k(x, \cdot) - k(y, \cdot) \rVert_{\mathcal H_k} = \lVert f \rVert_{\mathcal H_k} \sqrt{2 - 2 k(x, y)}$
- Can say lots more with Fourier properties

Kernel ridge regression: $\hat f = \operatorname*{argmin}_{f \in \mathcal H_k} \frac{1}{n} \sum_{i=1}^n (f(x_i) - y_i)^2 + \lambda \lVert f \rVert_{\mathcal H_k}^2$

The linear kernel gives normal ridge regression; nonlinear kernels will give nonlinear regression!

How to find $\hat f$? Representer Theorem:

- Let $\mathcal H_X = \operatorname{span}\bigl( \{ k(x_i, \cdot) \}_{i=1}^n \bigr)$,
and $\mathcal H_\perp$ its
**orthogonal complement** in $\mathcal H_k$
- Decompose $f = f_X + f_\perp$ with $f_X \in \mathcal H_X$, $f_\perp \in \mathcal H_\perp$
- $f(x_i) = \langle f_X + f_\perp, k(x_i, \cdot) \rangle = f_X(x_i)$, while $\lVert f \rVert^2 = \lVert f_X \rVert^2 + \lVert f_\perp \rVert^2$
- Minimizer needs $f_\perp = 0$, and so $\hat f = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$

Setting the derivative to zero gives $K \bigl( (K + n \lambda I) \alpha - y \bigr) = 0$,

satisfied by $\alpha = (K + n \lambda I)^{-1} y$

- Compare to GP regression with prior $f \sim \mathcal{GP}(0, k)$, observation noise $\mathcal N(0, \sigma^2)$
- If we take $\lambda = \sigma^2 / n$, KRR is exactly the GP regression posterior mean
- Note that GP posterior samples
**are not** in $\mathcal H_k$, but are in a slightly bigger RKHS
- Also a connection between posterior variance and KRR worst-case error
- For many more details:

- Representer theorem applies to any objective $L\bigl( f(x_1), \dots, f(x_n) \bigr) + R\bigl( \lVert f \rVert_{\mathcal H_k} \bigr)$ if $R$ is strictly increasing in $\lVert f \rVert_{\mathcal H_k}$
- Kernel methods can then train based on the kernel matrix $K$
- Classification algorithms:
  - Support vector machines: $L$ is the hinge loss, $\max(0, 1 - y_i f(x_i))$
  - Kernel logistic regression: $L$ is the logistic loss, $\log(1 + \exp(-y_i f(x_i)))$

- Principal component analysis, canonical correlation analysis
- Many, many more…
- But
*not everything* works… e.g. the Lasso regularizer $\lVert w \rVert_1$
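As one example beyond regression, kernel PCA needs only the kernel matrix; here is a bare-bones numpy sketch (our own illustration, assuming a Gaussian kernel):

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
n = len(X)

# Center the kernel matrix in feature space: K_c = H K H, H = I - 11^T/n
K = gauss_kernel(X, X)
H = np.eye(n) - np.ones((n, n)) / n
K_c = H @ K @ H

# Eigenvectors of K_c give the principal components in feature space
eigvals, eigvecs = np.linalg.eigh(K_c)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending order

# Projections of the training points onto the top two components
proj = eigvecs[:, :2] * np.sqrt(np.maximum(eigvals[:2], 0.0))
```

Everything runs off $K$; the feature map $\phi$ is never materialized.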

- Generalization: how close is my training set error to the population error?
  - Say $\lVert f \rVert_{\mathcal H_k} \le B$, $k(x, x) \le 1$, and the loss is $L$-Lipschitz
  - Rademacher argument implies expected overfitting $\le 2 L B / \sqrt{n}$
  - If the “truth” has low RKHS norm, we can learn efficiently

- Approximation: how big is the RKHS norm of the target function?
  - For
*universal* kernels, can approximate any target with finite norm
  - Gaussian is universal 💪 (nothing finite-dimensional can be)
  - But “finite” can be
*really really really* big

- Generally bad at learning
*sparsity*: e.g. $f(x) = x_1$ with $d$ large

- Provably statistically slower than deep learning for a few problems
  - e.g. to learn a single ReLU, $x \mapsto \max(0, \langle w, x \rangle)$, need RKHS norm exponential in $d$ [Yehudai/Shamir NeurIPS-19]
  - Also some hierarchical problems, etc. [Kamath+ COLT-20]
  - These generally apply to learning with
*any fixed kernel*

- $\mathcal O(n^3)$ computational complexity, $\mathcal O(n^2)$ memory
- Various approximations you can make
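One common approximation (not detailed on the slide; due to Rahimi and Recht) is random Fourier features, which replace the $n \times n$ kernel matrix with explicit $D$-dimensional features. A numpy sketch, with all names ours:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, d, D = 1.0, 3, 2000   # input dim d, number of random features D

# Random Fourier features for the Gaussian kernel:
#   z(x) = sqrt(2/D) cos(W x + b),  W ~ N(0, 1/sigma^2),  b ~ Unif[0, 2 pi)
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))
approx = z(x) @ z(y)
err = abs(exact - approx)
```

Then $z(x)^\top z(y) \approx k(x, y)$ with error $\mathcal O(1/\sqrt D)$, so linear methods on $z$ approximate the kernel method at $\mathcal O(nD)$ memory.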

- Represent point $x$ as $\phi(x) = k(x, \cdot)$: $f(x) = \langle f, \phi(x) \rangle_{\mathcal H_k}$
- Represent
*distribution* $\mathbb P$ as the mean embedding $\mu_{\mathbb P} = \mathbb E_{X \sim \mathbb P}\, \phi(X)$: $\mathbb E_{X \sim \mathbb P} f(X) = \langle f, \mu_{\mathbb P} \rangle_{\mathcal H_k}$
  - Last step assumed $\mathbb E_{X \sim \mathbb P} \sqrt{k(X, X)} < \infty$
*(Bochner integrability)*

- Okay. Why?
  - One reason: ML on distributions [Szabó+ JMLR-16]
  - More common reason: comparing distributions

$\operatorname{MMD}(\mathbb P, \mathbb Q) = \lVert \mu_{\mathbb P} - \mu_{\mathbb Q} \rVert_{\mathcal H_k} = \sup_{\lVert f \rVert_{\mathcal H_k} \le 1} \bigl[ \mathbb E_{X \sim \mathbb P} f(X) - \mathbb E_{Y \sim \mathbb Q} f(Y) \bigr]$

- Last line is the Integral Probability Metric (IPM) form
- The maximizing $f$ is called the “witness function” or “critic”: high on $\mathbb P$, low on $\mathbb Q$
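A minimal numpy sketch (ours) of comparing distributions through mean embeddings: the squared distance $\lVert \hat\mu_{\mathbb P} - \hat\mu_{\mathbb Q} \rVert^2$ between empirical embeddings expands into means of kernel evaluations.

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared kernel distance:
    mean k(X,X) - 2 mean k(X,Y) + mean k(Y,Y)."""
    return (gauss_kernel(X, X, sigma).mean()
            - 2 * gauss_kernel(X, Y, sigma).mean()
            + gauss_kernel(Y, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))  # samples from P
Y = rng.normal(1.0, 1.0, size=(200, 1))  # samples from Q (shifted mean)
Z = rng.normal(0.0, 1.0, size=(200, 1))  # more samples from P

mmd_diff = mmd2_biased(X, Y)  # P vs Q: clearly positive
mmd_same = mmd2_biased(X, Z)  # P vs P: near zero
```

Different distributions give a clearly positive value; two samples from the same distribution give a value near zero.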