# (Deep) Kernel Mean Embeddings for Representing and Learning on Distributions

Danica J. Sutherland (she/her)
University of British Columbia + Amii
Lifting Inference with Kernel Embeddings (LIKE-23), June 2023

This talk: how to lift inference with kernel embeddings

Slides at djsutherland.ml/slides/like23


# Part I: Kernels

## Why kernels?

• Machine learning! …but how do we actually do it?
• Linear models! $f(x) = w^\top x + b$
• Extend: $f(x) = w^\top \phi(x) + b$ for a feature map $\phi$
• Kernels are basically a way to study doing this with any, potentially very complicated, $\phi$
• Convenient way to make models on documents, graphs, videos, datasets, probability distributions, …
• $\phi(x)$ will live in a reproducing kernel Hilbert space

## Hilbert spaces

• A complete (real or complex) inner product space $\mathcal H$
• Inner product space: a vector space with an inner product:
• $\langle f, g \rangle_{\mathcal H} \in \mathbb R$ for $f, g \in \mathcal H$: bilinear, symmetric, with $\langle f, f \rangle_{\mathcal H} > 0$ for $f \ne 0$
Induces a norm: $\|f\|_{\mathcal H} = \sqrt{\langle f, f \rangle_{\mathcal H}}$
• Complete: “well-behaved” (Cauchy sequences have limits in $\mathcal H$)

## Kernel: an inner product between feature maps

• Call our domain $\mathcal X$, some set
• $\mathbb R^d$, functions, distributions, graphs, images, …
• $k : \mathcal X \times \mathcal X \to \mathbb R$ is a kernel on $\mathcal X$ if there exists a Hilbert space $\mathcal H$ and a feature map $\phi : \mathcal X \to \mathcal H$ so that $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H}$
• Roughly, $k(x, y)$ is a notion of “similarity” between inputs
• Linear kernel on $\mathbb R^d$: $\phi(x) = x$, so $k(x, y) = \langle x, y \rangle = x^\top y$
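A quick numeric check of this definition (a sketch, not from the slides): the quadratic kernel $k(x, y) = (x^\top y)^2$ has the explicit feature map $\phi(x) = \mathrm{vec}(x x^\top)$, so it really is an inner product between feature maps.

```python
import numpy as np

# Hypothetical check: the quadratic kernel k(x, y) = (x . y)^2 equals
# <phi(x), phi(y)> for phi(x) = vec(x x^T), since
# sum_ij x_i x_j y_i y_j = (sum_i x_i y_i)^2.
def phi(x):
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=4)
assert np.isclose(phi(x) @ phi(y), (x @ y) ** 2)
```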

## Aside: the name “kernel”

• Our concept: "positive semi-definite kernel," "Mercer kernel," "RKHS kernel"
• Exactly the same: GP covariance function
• Semi-related: kernel density estimation
• Similar form $\hat p(x) = \frac{1}{n} \sum_{i=1}^n k(x_i, x)$; usually symmetric, like an RKHS kernel
• Always requires $\int k(x, y) \,\mathrm{d}y = 1$, unlike an RKHS kernel
• Often requires $k(x, y) \ge 0$, unlike an RKHS kernel
• Not required to be an inner product, unlike an RKHS kernel
• Unrelated:
• The kernel (null space) of a linear map
• The kernel of a probability density
• The kernel of a convolution
• CUDA kernels
• The Linux kernel
• Popcorn kernels

## Building kernels from other kernels

• Scaling: if $\gamma \ge 0$, $\gamma\, k(x, y)$ is a kernel: use the feature map $\sqrt{\gamma}\, \phi(x)$
• Sum: $k_1(x, y) + k_2(x, y)$ is a kernel: use the feature map $(\phi_1(x), \phi_2(x))$
• Is $k_1(x, y) - k_2(x, y)$ necessarily a kernel?
• Take $k_1(x, y) = x^\top y$, $k_2(x, y) = 2\, x^\top y$, any $x \ne 0$.
• Then $k_1(x, x) - k_2(x, x) = -\|x\|^2 < 0$
• But any kernel has $k(x, x) = \|\phi(x)\|_{\mathcal H}^2 \ge 0$.
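A numeric companion to this counterexample (a sketch, assuming the choice $k_1(x, y) = x^\top y$ and $k_2 = 2 k_1$): the resulting matrix for $k_1 - k_2$ has negative eigenvalues, so no feature map can produce it.

```python
import numpy as np

# Sketch: with k1(x, y) = x . y and k2 = 2 * k1, the matrix for k1 - k2
# is -K1, which has negative eigenvalues, so k1 - k2 can't be written as
# <phi(x), phi(y)> for any feature map phi.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K1 = X @ X.T
K2 = 2 * K1
print(np.linalg.eigvalsh(K1 - K2).min())   # negative: not psd
```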

## Positive definiteness

• A symmetric function $k : \mathcal X \times \mathcal X \to \mathbb R$, i.e. $k(x, y) = k(y, x)$,
is positive semi-definite (psd)
if for all $n \ge 1$, $(a_1, \dots, a_n) \in \mathbb R^n$, $(x_1, \dots, x_n) \in \mathcal X^n$,
$\sum_{i=1}^n \sum_{j=1}^n a_i a_j k(x_i, x_j) \ge 0$
• Equivalent: the kernel matrix $K$, with $K_{ij} = k(x_i, x_j)$, is psd (eigenvalues $\ge 0$)
• Hilbert space kernels are psd: $\sum_{i,j} a_i a_j \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal H} = \bigl\| \sum_i a_i \phi(x_i) \bigr\|_{\mathcal H}^2 \ge 0$
• psd functions are Hilbert space kernels
• Moore-Aronszajn Theorem; we'll come back to this
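A minimal sketch of the “kernel matrix is psd” equivalence, using the linear kernel from earlier (the data here is illustrative):

```python
import numpy as np

# For points x_1..x_n, the Gram matrix K_ij = k(x_i, x_j) of a valid kernel
# (here the linear kernel) is psd: all eigenvalues >= 0 up to float error.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = X @ X.T                          # linear-kernel Gram matrix
eigs = np.linalg.eigvalsh(K)         # eigvalsh: for symmetric matrices
print(eigs.min() >= -1e-8)           # True: psd up to numerical error
```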

## Some more ways to build kernels

• Limits: if $k_\infty(x, y) = \lim_{m \to \infty} k_m(x, y)$ exists, $k_\infty$ is psd
• Products: $k_1(x, y)\, k_2(x, y)$ is psd
• Let $V \sim \mathcal N(0, K_1)$, $W \sim \mathcal N(0, K_2)$ be independent, so $\operatorname{Cov}(V_i W_i, V_j W_j) = (K_1)_{ij} (K_2)_{ij}$
• Covariance matrices are psd, so the entrywise product $K_1 \odot K_2$ is too
• Powers: $k(x, y)^n$ is psd for any integer $n \ge 0$

$k_{\mathrm{poly}(c, n)}(x, y) = (x^\top y + c)^n$ for $c \ge 0$, the polynomial kernel

• Exponents: $\exp(k(x, y)) = \lim_{m \to \infty} \sum_{n=0}^m \frac{k(x, y)^n}{n!}$ is psd
• If $f : \mathcal X \to \mathbb R$, then $f(x)\, k(x, y)\, f(y)$ is psd
• Use the feature map $x \mapsto f(x)\, \phi(x)$

$k_\gamma(x, y) = \exp\bigl( -\tfrac{\|x\|^2}{2\gamma^2} \bigr) \exp\bigl( \tfrac{x^\top y}{\gamma^2} \bigr) \exp\bigl( -\tfrac{\|y\|^2}{2\gamma^2} \bigr) = \exp\bigl( -\tfrac{1}{2\gamma^2} \|x - y\|^2 \bigr)$, the Gaussian kernel
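The construction above can be verified numerically; a sketch, using the factorization $f(x) \exp(x^\top y / \gamma^2) f(y)$ with $f(x) = \exp(-\|x\|^2 / (2\gamma^2))$:

```python
import numpy as np

# The Gaussian kernel factors as f(x) * exp(x . y / gamma^2) * f(y)
# with f(z) = exp(-||z||^2 / (2 gamma^2)): an exponentiated (psd) kernel
# times a function of each argument, so it is psd by the rules above.
rng = np.random.default_rng(0)
x, y, gamma = rng.normal(size=3), rng.normal(size=3), 1.5

gauss = np.exp(-np.sum((x - y) ** 2) / (2 * gamma**2))
f = lambda z: np.exp(-np.sum(z**2) / (2 * gamma**2))
built = f(x) * np.exp(x @ y / gamma**2) * f(y)
assert np.isclose(gauss, built)
```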

## Reproducing property

• Recall the original motivating example with a feature map $\phi : \mathcal X \to \mathcal H$
• Kernel is $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H}$
• Classifier based on linear $f(x) = \langle w, \phi(x) \rangle_{\mathcal H}$
• $w$ is the function $f$ itself; corresponds to a vector in $\mathcal H$
$f(x)$ is the function evaluated at a point
• Elements of $\mathcal H$ are functions, $f : \mathcal X \to \mathbb R$
• Reproducing property: $f(x) = \langle f, \phi(x) \rangle_{\mathcal H}$ for $f \in \mathcal H$

## Reproducing kernel Hilbert space (RKHS)

• Every psd kernel $k$ on $\mathcal X$ defines a (unique) Hilbert space, its RKHS $\mathcal H_k$,
and a map $\phi : \mathcal X \to \mathcal H_k$ where $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H_k}$
• Elements $f \in \mathcal H_k$ are functions on $\mathcal X$, with $f(x) = \langle f, \phi(x) \rangle_{\mathcal H_k}$
• Combining the two, we sometimes write $\phi(x) = k(x, \cdot)$
• $\phi(x)$ gives the evaluation functional $f \mapsto f(x)$
An RKHS is defined by this being continuous, or $\lvert f(x) \rvert \le M_x \, \|f\|_{\mathcal H_k}$

## Moore-Aronszajn Theorem

• Building $\mathcal H$ for a given psd $k$:
• Define $\mathcal H_0 = \operatorname{span}\bigl( \{ k(x, \cdot) : x \in \mathcal X \} \bigr)$, with $\bigl\langle \sum_i a_i k(x_i, \cdot), \sum_j b_j k(y_j, \cdot) \bigr\rangle_{\mathcal H_0} = \sum_{i, j} a_i b_j k(x_i, y_j)$
• Take $\mathcal H$ to be the completion of $\mathcal H_0$ in the metric from $\langle \cdot, \cdot \rangle_{\mathcal H_0}$
• Get that the reproducing property holds for $\phi(x) = k(x, \cdot)$ in $\mathcal H$
• Can also show uniqueness
• Theorem: $k$ is psd iff it's the reproducing kernel of an RKHS

## A quick check: linear kernels

• $k(x, y) = x^\top y$ on $\mathcal X = \mathbb R^d$
• $k(x, \cdot) = [y \mapsto x^\top y]$ “corresponds to” $x$
• If $f = \sum_i a_i k(x_i, \cdot)$, then $f(y) = w^\top y$ for $w = \sum_i a_i x_i$
• Closure doesn't add anything here, since $\mathbb R^d$ is closed
• So, the linear kernel gives you the RKHS of linear functions, with $\|f\|_{\mathcal H} = \|w\|$
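A small numeric sketch of this slide (illustrative data): a function $f = \sum_i a_i k(x_i, \cdot)$ under the linear kernel is just $y \mapsto w^\top y$ with $w = \sum_i a_i x_i$, and its RKHS norm matches $\|w\|$.

```python
import numpy as np

# Linear-kernel RKHS: f = sum_i a_i k(x_i, .) is the linear function
# y -> w . y with w = sum_i a_i x_i, and ||f||_H^2 = sum_ij a_i a_j k(x_i, x_j).
rng = np.random.default_rng(0)
X, a = rng.normal(size=(5, 3)), rng.normal(size=5)
w = a @ X                                   # w = sum_i a_i x_i
y = rng.normal(size=3)
f_y = sum(a[i] * (X[i] @ y) for i in range(5))
assert np.isclose(f_y, w @ y)
norm_sq = a @ (X @ X.T) @ a                 # sum_ij a_i a_j k(x_i, x_j)
assert np.isclose(norm_sq, w @ w)           # ||f||_H^2 = ||w||^2
```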

## More complicated: Gaussian kernels

• $\mathcal H_\gamma$ is infinite-dimensional
• Functions in $\mathcal H_\gamma$ are bounded: $\lvert f(x) \rvert = \lvert \langle f, k(x, \cdot) \rangle \rvert \le \|f\| \sqrt{k(x, x)} = \|f\|$
• Choice of $\gamma$ controls how fast functions can vary: $\lvert f(x) - f(y) \rvert \le \|f\| \, \|k(x, \cdot) - k(y, \cdot)\| = \|f\| \sqrt{2 - 2 k_\gamma(x, y)}$
• Can say lots more with Fourier properties

## Kernel ridge regression

$$\hat f = \operatorname*{argmin}_{f \in \mathcal H} \frac{1}{n} \sum_{i=1}^n \bigl( f(x_i) - y_i \bigr)^2 + \lambda \|f\|_{\mathcal H}^2$$

A linear kernel gives normal ridge regression; nonlinear kernels will give nonlinear regression!

How to find $\hat f$? Representer Theorem:

• Let $\mathcal H_X = \operatorname{span}\{ k(x_i, \cdot) \}_{i=1}^n$, and $\mathcal H_\perp$ its orthogonal complement in $\mathcal H$
• Decompose $f = f_X + f_\perp$ with $f_X \in \mathcal H_X$, $f_\perp \in \mathcal H_\perp$
• Then $f(x_i) = \langle f_X + f_\perp, k(x_i, \cdot) \rangle = f_X(x_i)$, while $\|f\|^2 = \|f_X\|^2 + \|f_\perp\|^2$
• Minimizer needs $f_\perp = 0$, and so $\hat f(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$

Setting the derivative to zero gives $K (K + n \lambda I) \alpha = K y$,
satisfied by $\alpha = (K + n \lambda I)^{-1} y$
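Putting the pieces together, a minimal kernel ridge regression sketch (the Gaussian-kernel helper and the data here are illustrative, not from the talk):

```python
import numpy as np

# Kernel ridge regression following the derivation above:
# alpha = (K + n * lam * I)^{-1} y, predictions f(x) = sum_i alpha_i k(x_i, x).
def gaussian_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * gamma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

n, lam = len(X), 1e-3
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

X_test = np.linspace(-3, 3, 50)[:, None]
preds = gaussian_kernel(X_test, X) @ alpha   # nonlinear fit to sin(x)
```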

## Kernel ridge regression and GP regression

• Compare to GP regression with prior $f \sim \mathcal{GP}(0, k)$, observation noise $\mathcal N(0, \sigma^2)$
• If we take $\lambda = \sigma^2 / n$, KRR is exactly the GP regression posterior mean
• Note that GP posterior samples are not in $\mathcal H_k$, but are in a slightly bigger RKHS
• Also a connection between the GP posterior variance and the KRR worst-case error
• For many more details: Kanagawa, Hennig, Sejdinovic, and Sriperumbudur, Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences (2018)
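A sketch of the first two bullets, assuming the formulas above: with $\lambda = \sigma^2 / n$, the KRR coefficients coincide with the GP posterior-mean weights $(K + \sigma^2 I)^{-1} y$.

```python
import numpy as np

# With lam = sigma^2 / n, the KRR coefficients (K + n * lam * I)^{-1} y
# equal the GP-regression posterior mean weights (K + sigma^2 * I)^{-1} y.
rng = np.random.default_rng(0)
n, sigma2 = 30, 0.1
X = rng.normal(size=(n, 2))
y = rng.normal(size=n)
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2)   # Gaussian kernel

alpha_krr = np.linalg.solve(K + n * (sigma2 / n) * np.eye(n), y)
alpha_gp = np.linalg.solve(K + sigma2 * np.eye(n), y)
assert np.allclose(alpha_krr, alpha_gp)
```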

## Other kernel algorithms

• The Representer Theorem applies to $\min_f \sum_{i=1}^n L(f(x_i), y_i) + R(\|f\|_{\mathcal H})$ whenever $R$ is strictly increasing in $\|f\|_{\mathcal H}$
• Kernel methods can then train based on the kernel matrix $K$ (see the sketch after this list)
• Classification algorithms:
• Support vector machines: $L$ is the hinge loss, $\max(0, 1 - y f(x))$
• Kernel logistic regression: $L$ is the logistic loss, $\log(1 + \exp(-y f(x)))$
• Principal component analysis, canonical correlation analysis
• Many, many more…
• But not everything works… e.g. the Lasso regularizer $\|w\|_1$
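As referenced above, a sketch of training from a kernel matrix alone, using scikit-learn's precomputed-kernel option for SVMs (the data and kernel helper are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Train an SVM directly from a Gram matrix via kernel='precomputed';
# at test time, pass k(new points, training points).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)    # nonlinearly separable labels

def gauss(A, B):
    return np.exp(-((A[:, None] - B[None, :]) ** 2).sum(-1) / 2)

clf = SVC(kernel="precomputed").fit(gauss(X, X), y)
X_new = rng.normal(size=(5, 2))
print(clf.predict(gauss(X_new, X)))   # shape (n_test, n_train) kernel matrix
```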

## Some very very quick theory

• Generalization: how close is my training set error to the population error?
• Say $k(x, x) \le 1$; consider $\{ f : \|f\|_{\mathcal H} \le B \}$ and an $L$-Lipschitz loss
• Rademacher argument implies expected overfitting $\lesssim 2 L B / \sqrt{n}$
• If “truth” has low RKHS norm, can learn efficiently
• Approximation: how big is RKHS norm of target function?
• For universal kernels, can approximate any target with finite norm
• Gaussian is universal 💪 (nothing finite-dimensional can be)
• But “finite” can be really really really big

## Limitations of kernel-based learning

• Generally bad at learning sparsity
• e.g. $f(x) = x_1$ for large $d$
• Provably statistically slower than deep learning for a few problems
• e.g. to learn a single ReLU, $x \mapsto \max(0, w^\top x)$, need RKHS norm exponential in $d$
• Also some hierarchical problems, etc
• These generally apply to learning with any fixed kernel
• $\mathcal O(n^3)$ computational complexity, $\mathcal O(n^2)$ memory
• Various approximations you can make (one sketched below)
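One such approximation, sketched below under illustrative parameter choices: random Fourier features (Rahimi & Recht, 2007) replace the Gaussian kernel with an explicit finite-dimensional feature map, so linear methods can stand in for the kernel solve.

```python
import numpy as np

# Random Fourier features: z(x) . z(y) ~= exp(-||x - y||^2 / (2 gamma^2)),
# using frequencies W ~ N(0, I / gamma^2) and uniform phases b.
rng = np.random.default_rng(0)
d, D, gamma = 5, 2000, 1.0
W = rng.normal(scale=1 / gamma, size=(d, D))
b = rng.uniform(0, 2 * np.pi, size=D)

def z(X):
    return np.sqrt(2 / D) * np.cos(X @ W + b)

x, y = rng.normal(size=(2, d))
approx = z(x[None]) @ z(y[None]).T
exact = np.exp(-np.sum((x - y) ** 2) / (2 * gamma**2))
print(approx.item(), exact)   # close for large D
```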

## Mean embeddings of distributions

• Represent point $x$ as $\phi(x) = k(x, \cdot)$: $f(x) = \langle f, \phi(x) \rangle_{\mathcal H}$
• Represent distribution $\mathbb P$ as $\mu_{\mathbb P} = \mathbb E_{X \sim \mathbb P}[\phi(X)]$: $\mathbb E_{X \sim \mathbb P} f(X) = \langle f, \mu_{\mathbb P} \rangle_{\mathcal H}$
• Last step assumed $\mathbb E_{X \sim \mathbb P} \sqrt{k(X, X)} < \infty$ (Bochner integrability)
• Okay. Why?
• One reason: ML on distributions
• More common reason: comparing distributions
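A sketch of empirical mean embeddings (all names illustrative): estimating $\mu_{\mathbb P}$ by $\hat\mu_{\mathbb P} = \frac{1}{n} \sum_i \phi(x_i)$ turns inner products in $\mathcal H$ into averages of kernel evaluations.

```python
import numpy as np

# Empirical embeddings: <hat mu_P, hat mu_Q> = (1/(n m)) sum_ij k(x_i, y_j),
# and <k(z, .), hat mu_P> = (1/n) sum_i k(z, x_i) estimates E_{X~P} f(X)
# for f = k(z, .).
def gauss(A, B):
    return np.exp(-((A[:, None] - B[None, :]) ** 2).sum(-1) / 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))             # sample from P
Y = rng.normal(loc=1.0, size=(300, 1))    # sample from Q
inner = gauss(X, Y).mean()                # <hat mu_P, hat mu_Q>_H

z = np.zeros((1, 1))
print(gauss(z, X).mean(), inner)
```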

## Maximum Mean Discrepancy

$$\begin{aligned}
\mathrm{MMD}(\mathbb P, \mathbb Q) &= \| \mu_{\mathbb P} - \mu_{\mathbb Q} \|_{\mathcal H} \\
&= \sup_{\|f\|_{\mathcal H} \le 1} \langle f, \mu_{\mathbb P} - \mu_{\mathbb Q} \rangle_{\mathcal H} \\
&= \sup_{\|f\|_{\mathcal H} \le 1} \mathbb E_{X \sim \mathbb P}[f(X)] - \mathbb E_{Y \sim \mathbb Q}[f(Y)]
\end{aligned}$$

• Last line is the Integral Probability Metric (IPM) form
• The maximizing $f$ is called the “witness function” or “critic”: high on $\mathbb P$, low on $\mathbb Q$
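This suggests a plug-in estimate; below is a sketch of the (biased, V-statistic) estimator of $\mathrm{MMD}^2$ from samples, expanding $\|\hat\mu_{\mathbb P} - \hat\mu_{\mathbb Q}\|^2$. The unbiased version would drop the diagonal $i = j$ terms.

```python
import numpy as np

# MMD^2 = <mu_P - mu_Q, mu_P - mu_Q>
#       = E k(X, X') - 2 E k(X, Y) + E k(Y, Y'),
# estimated here by plugging in empirical means over samples.
def gauss(A, B):
    return np.exp(-((A[:, None] - B[None, :]) ** 2).sum(-1) / 2)

def mmd2(X, Y):
    return gauss(X, X).mean() - 2 * gauss(X, Y).mean() + gauss(Y, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
Y_same = rng.normal(size=(500, 1))
Y_diff = rng.normal(loc=1.0, size=(500, 1))
print(mmd2(X, Y_same))   # near 0: same distribution
print(mmd2(X, Y_diff))   # clearly positive: different distributions
```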