Why kernels?
- Machine learning! …but how do we actually do it?
- Linear models! $f(x) = \langle w, x \rangle + b$
- Extend to nonlinear functions with a feature map: $f(x) = \langle w, \phi(x) \rangle + b$
- Kernels are basically a way to study doing this with any, potentially very complicated, feature map $\phi$
- Convenient way to make models on documents, graphs, videos, datasets, probability distributions, …
- Features $\phi(x)$ will live in a reproducing kernel Hilbert space
Hilbert spaces
- A complete (real or complex) inner product space
- Inner product space: a vector space with an inner product $\langle \cdot, \cdot \rangle$:
- $\langle \alpha_1 f_1 + \alpha_2 f_2, g \rangle = \alpha_1 \langle f_1, g \rangle + \alpha_2 \langle f_2, g \rangle$
- $\langle f, g \rangle = \overline{\langle g, f \rangle}$
- $\langle f, f \rangle > 0$ for $f \ne 0$
- Induces a norm: $\|f\| = \sqrt{\langle f, f \rangle}$
- Complete: “well-behaved” (Cauchy sequences have limits in $\mathcal H$)
Kernel: an inner product between feature maps
- Call our domain $\mathcal X$, some set
- $\mathbb R^d$, functions, distributions of graphs of images, …
- $k : \mathcal X \times \mathcal X \to \mathbb R$ is a kernel on $\mathcal X$ if there exists a Hilbert space $\mathcal H$ and a feature map $\phi : \mathcal X \to \mathcal H$ so that $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H}$
- Roughly, $k(x, y)$ is a notion of “similarity” between inputs
- Linear kernel on $\mathbb R^d$: $k(x, y) = \langle x, y \rangle$, with $\phi(x) = x$
Aside: the name “kernel”
- Our concept: "positive semi-definite kernel," "Mercer kernel," "RKHS kernel"
- Exactly the same: GP covariance function
- Semi-related: kernel density estimation
- $k : \mathcal X \times \mathcal X \to \mathbb R$, usually symmetric, like RKHS kernel
- Always requires $\int k(x, y) \,\mathrm d y = 1$, unlike RKHS kernel
- Often requires $k(x, y) \ge 0$, unlike RKHS kernel
- Not required to be inner product, unlike RKHS kernel
- Unrelated:
- The kernel (null space) of a linear map
- The kernel of a probability density
- The kernel of a convolution
- CUDA kernels
- The Linux kernel
- Popcorn kernels
Building kernels from other kernels
- Scaling: if $\gamma \ge 0$, $\gamma\, k(x, y)$ is a kernel
- Sum: $k_1(x, y) + k_2(x, y)$ is a kernel
- Is $k_1(x, y) - k_2(x, y)$ necessarily a kernel?
- Take $k_1 = k$, $k_2 = 2k$ for any kernel $k$ with some $k(x, x) > 0$.
- Then $k_1(x, x) - k_2(x, x) = -k(x, x) < 0$
- But $k'(x, x) = \langle \phi(x), \phi(x) \rangle = \|\phi(x)\|^2 \ge 0$ for any kernel $k'$.
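A quick numerical version of this argument (a sketch, not from the slides; the data and kernel are illustrative): build Gram matrices for a linear kernel $k$ and for $2k$ on a few points, and check that their difference has a negative eigenvalue, so it cannot be a kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # 5 points in R^3

K1 = X @ X.T                       # Gram matrix of the linear kernel k
K2 = 2 * K1                        # Gram matrix of 2k, also a valid kernel

eigs = np.linalg.eigvalsh(K1 - K2) # = -K1, so eigenvalues are <= 0
print(eigs.min())                  # negative => k1 - k2 is not psd, not a kernel
```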
Positive definiteness
- A symmetric function $k : \mathcal X \times \mathcal X \to \mathbb R$,
i.e. $k(x, y) = k(y, x)$,
is positive semi-definite (psd)
if for all $n \ge 1$,
$(a_1, \dots, a_n) \in \mathbb R^n$, $(x_1, \dots, x_n) \in \mathcal X^n$,
$\sum_{i=1}^n \sum_{j=1}^n a_i a_j\, k(x_i, x_j) \ge 0$
- Equivalent: the kernel matrix $K_{ij} = k(x_i, x_j)$ is psd (eigenvalues $\ge 0$) for every choice of points
- Hilbert space kernels are psd: $\sum_{i,j} a_i a_j \langle \phi(x_i), \phi(x_j) \rangle = \big\| \sum_i a_i \phi(x_i) \big\|^2 \ge 0$
- psd functions are Hilbert space kernels
- Moore-Aronszajn Theorem; we'll come back to this
Some more ways to build kernels
- Limits: if $k(x, y) = \lim_{m \to \infty} k_m(x, y)$ exists, $k$ is psd
- Products: $k_1(x, y)\, k_2(x, y)$ is psd
- Let $V \sim \mathcal N(0, K_1)$, $W \sim \mathcal N(0, K_2)$ be independent
- Covariance matrices are psd, so $\operatorname{Cov}(V_i W_i, V_j W_j) = (K_1)_{ij} (K_2)_{ij}$ is too
- Powers: $k(x, y)^m$ is psd for any integer $m \ge 0$
- $(x^\top y + c)^m$ for $c \ge 0$: the polynomial kernel
- Exponents: $\exp(k(x, y))$ is psd
- If $f : \mathcal X \to \mathbb R$, $f(x)\, k(x, y)\, f(y)$ is psd
- Use the feature map $x \mapsto f(x)\, \phi(x)$
- $\exp\!\left( -\frac{\|x - y\|^2}{2\sigma^2} \right) = e^{-\frac{\|x\|^2}{2\sigma^2}} \exp\!\left( \frac{x^\top y}{\sigma^2} \right) e^{-\frac{\|y\|^2}{2\sigma^2}}$,
the Gaussian kernel
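As a sanity check (a sketch, not part of the slides; points and bandwidth are illustrative), the code below builds a Gaussian kernel matrix via the factorization above, compares it to the direct formula, and confirms the eigenvalues are nonnegative up to numerical error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
sigma = 1.5

# Direct formula: exp(-||x - y||^2 / (2 sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_direct = np.exp(-sq_dists / (2 * sigma**2))

# Factorized form: f(x) * exp(<x, y> / sigma^2) * f(y), with f(x) = exp(-||x||^2 / (2 sigma^2))
f = np.exp(-np.sum(X**2, axis=1) / (2 * sigma**2))
K_factored = f[:, None] * np.exp(X @ X.T / sigma**2) * f[None, :]

print(np.allclose(K_direct, K_factored))             # True
print(np.linalg.eigvalsh(K_direct).min() >= -1e-10)  # psd up to numerical error
```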
Reproducing property
- Recall the original motivating example, with a feature map $\phi : \mathcal X \to \mathcal H$
- Kernel is $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H}$
- Classifier based on linear $f(x) = \langle w, \phi(x) \rangle_{\mathcal H}$
- $w$ is the function itself; $f$ corresponds to the vector $w \in \mathcal H$
- $\langle w, \phi(x) \rangle_{\mathcal H}$ is the function evaluated at a point
- Elements of $\mathcal H$ are functions, $f : \mathcal X \to \mathbb R$
- Reproducing property: $f(x) = \langle f, \phi(x) \rangle_{\mathcal H}$
for $f \in \mathcal H$
Reproducing kernel Hilbert space (RKHS)
- Every psd kernel $k$ on $\mathcal X$ defines a (unique) Hilbert space, its RKHS $\mathcal H_k$,
and a map $\phi : \mathcal X \to \mathcal H_k$ where $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H_k}$
- Elements $f \in \mathcal H_k$ are functions on $\mathcal X$, with $f(x) = \langle f, \phi(x) \rangle_{\mathcal H_k}$
- Combining the two, we sometimes write $\phi(x) = k(x, \cdot)$, so $k(x, y) = \langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal H_k}$
- $f \mapsto f(x)$ is the evaluation functional
- An RKHS is defined by it being continuous, or equivalently bounded: $|f(x)| \le M_x \|f\|_{\mathcal H_k}$
Moore-Aronszajn Theorem
- Building $\mathcal H_k$ for a given psd $k$:
- Start with $\mathcal H_0 = \operatorname{span}\{ k(x, \cdot) : x \in \mathcal X \}$
- Define $\langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal H_0} = k(x, y)$, extended bilinearly, from the psd property
- Take $\mathcal H_k$ to be the completion of $\mathcal H_0$ in the metric from $\|f\| = \sqrt{\langle f, f \rangle_{\mathcal H_0}}$
- Get that the reproducing property holds for $k$ in $\mathcal H_k$
- Can also show uniqueness
- Theorem: $k$ is psd iff it's the reproducing kernel of an RKHS
A quick check: linear kernels
- $k(x, y) = x^\top y$ on $\mathcal X = \mathbb R^d$
- $k(x, \cdot) = \langle x, \cdot \rangle$ “corresponds to” $x$ itself
- If $f = \sum_i a_i k(x_i, \cdot)$, then $f(y) = \big( \sum_i a_i x_i \big)^\top y$: a linear function
- Closure doesn't add anything here, since the span is finite-dimensional, hence closed
- So, linear kernel gives you RKHS of linear functions
More complicated: Gaussian kernels
- $\mathcal H_k$ is infinite-dimensional
- Functions in $\mathcal H_k$ are bounded: $|f(x)| = |\langle f, k(x, \cdot) \rangle| \le \|f\|_{\mathcal H_k} \sqrt{k(x, x)} = \|f\|_{\mathcal H_k}$
- Choice of $\sigma$ controls how fast functions can vary: $|f(x) - f(y)| \le \|f\|_{\mathcal H_k} \| k(x, \cdot) - k(y, \cdot) \|_{\mathcal H_k} = \|f\|_{\mathcal H_k} \sqrt{2 - 2 k(x, y)}$
- Can say lots more with Fourier properties
Kernel ridge regression
- Kernel ridge regression: $\hat f = \operatorname*{argmin}_{f \in \mathcal H_k} \frac1n \sum_{i=1}^n (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal H_k}^2$
- Linear kernel gives normal ridge regression: $\hat w = (X^\top X + n\lambda I)^{-1} X^\top y$
- Nonlinear kernels will give nonlinear regression!
- How to find $\hat f$? Representer Theorem: the minimizer is of the form $\hat f = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$
- Let $\mathcal H_X = \operatorname{span}\{ k(x_i, \cdot) : i = 1, \dots, n \}$,
and $\mathcal H_\perp$ its orthogonal complement in $\mathcal H_k$
- Decompose $f = f_X + f_\perp$ with $f_X \in \mathcal H_X$, $f_\perp \in \mathcal H_\perp$
- $f(x_i) = \langle f_X, k(x_i, \cdot) \rangle + \langle f_\perp, k(x_i, \cdot) \rangle = f_X(x_i)$, but $\|f\|^2 = \|f_X\|^2 + \|f_\perp\|^2$
- Minimizer needs $f_\perp = 0$, and so $\hat f = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$
- Setting the derivative in $\alpha$ to zero gives $K(K\alpha - y) + n\lambda K\alpha = 0$,
satisfied by $\alpha = (K + n\lambda I)^{-1} y$, where $K_{ij} = k(x_i, x_j)$
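A minimal sketch of this closed form (not from the slides; the `krr_fit` helper, Gaussian kernel, and the choices of `sigma` and `lam` are illustrative):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def krr_fit(X, y, lam=1e-2, sigma=1.0):
    """Solve (K + n*lam*I) alpha = y; the prediction is sum_i alpha_i k(x_i, .)."""
    n = len(X)
    K = gaussian_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return lambda X_new: gaussian_kernel(X_new, X, sigma) @ alpha

# Toy usage: fit a nonlinear function
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
f_hat = krr_fit(X, y)
print(f_hat(np.array([[0.0], [1.5]])))   # roughly sin(0), sin(1.5)
```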
Kernel ridge regression and GP regression
- Compare to GP regression with prior $f \sim \mathcal{GP}(0, k)$, observation noise $\mathcal N(0, \sigma^2)$
- If we take $\lambda = \sigma^2 / n$, KRR is exactly the GP regression posterior mean
- Note that GP posterior samples are not in $\mathcal H_k$, but are in a slightly bigger RKHS
- Also a connection between posterior variance and KRR worst-case error
- For many more details:
Other kernel algorithms
- Representer theorem applies if the objective is $L(f(x_1), \dots, f(x_n)) + R(\|f\|_{\mathcal H_k})$ with $R$ strictly increasing in $\|f\|_{\mathcal H_k}$
- Kernel methods can then train based on kernel matrix
- Classification algorithms:
- Support vector machines: $L$ is the hinge loss
- Kernel logistic regression: $L$ is the logistic loss
- Principal component analysis, canonical correlation analysis
- Many, many more…
- But not everything works… e.g. the Lasso ($\|w\|_1$) regularizer
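As one concrete algorithm beyond regression, here is a sketch of kernel PCA working purely from the kernel matrix (a standard construction, not spelled out in the slides; the `kernel_pca` helper is ours): center the Gram matrix, eigendecompose, and project.

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Kernel PCA from an (n x n) kernel matrix K: returns projections of the
    n points onto the top principal components in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = H @ K @ H                               # center features in the RKHS
    eigvals, eigvecs = np.linalg.eigh(Kc)        # ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    vals, vecs = eigvals[idx], eigvecs[:, idx]
    # scale so projections have the right length: alpha_j = v_j / sqrt(lambda_j)
    return Kc @ (vecs / np.sqrt(np.maximum(vals, 1e-12)))

X = np.random.default_rng(0).normal(size=(30, 5))
Z = kernel_pca(X @ X.T, n_components=2)   # linear kernel -> ordinary PCA scores
print(Z.shape)                            # (30, 2)
```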
Some very very quick theory
- Generalization: how close is my training set error to the population error?
- Say $k(x, x) \le 1$, consider $\{ f : \|f\|_{\mathcal H_k} \le B \}$ and an $L$-Lipschitz loss
- Rademacher argument implies expected overfitting $\le 2 L B / \sqrt n$
- If “truth” has low RKHS norm, can learn efficiently
- Approximation: how big is RKHS norm of target function?
- For universal kernels, can approximate any target with finite norm
- Gaussian is universal 💪 (nothing finite-dimensional can be)
- But “finite” can be really really really big
Limitations of kernel-based learning
- Generally bad at learning sparsity
- e.g. a target depending on only a few of the $d$ coordinates, for large $d$
- Provably statistically slower than deep learning for a few problems
- e.g. to learn a single ReLU, $x \mapsto \max(0, w^\top x)$, need norm exponential in $d$ [Yehudai/Shamir NeurIPS-19]
- Also some hierarchical problems, etc [Kamath+ COLT-20]
- Generally apply to learning with any fixed kernel
- $\mathcal O(n^2)$ to $\mathcal O(n^3)$ computational complexity, $\mathcal O(n^2)$ memory
- Various approximations you can make
Part II: (Deep) Kernel Mean Embeddings
Mean embeddings of distributions
- Represent point $x$ as $\phi(x) = k(x, \cdot) \in \mathcal H_k$: $f(x) = \langle f, \phi(x) \rangle_{\mathcal H_k}$
- Represent distribution $P$ as $\mu_P = \mathbb E_{X \sim P}\, \phi(X) \in \mathcal H_k$: $\mathbb E_{X \sim P} f(X) = \langle f, \mu_P \rangle_{\mathcal H_k}$
- Last step assumed $\mathbb E_{X \sim P} \sqrt{k(X, X)} < \infty$ (Bochner integrability)
- Okay. Why?
- One reason: ML on distributions [Szabó+ JMLR-16]
- More common reason: comparing distributions
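A minimal sketch of the empirical version (not from the slides; `empirical_mean_embedding` and the Gaussian kernel are illustrative choices): with samples $x_1, \dots, x_n \sim P$, $\hat\mu_P = \frac1n \sum_i k(x_i, \cdot)$ is itself a function we can evaluate anywhere.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def empirical_mean_embedding(X, sigma=1.0):
    """Return the function t -> (1/n) sum_i k(x_i, t), the empirical embedding of P."""
    return lambda T: gaussian_kernel(T, X, sigma).mean(axis=1)

X = np.random.default_rng(0).normal(size=(500, 2))   # samples from P
mu_hat = empirical_mean_embedding(X)
print(mu_hat(np.zeros((1, 2))))   # estimate of E_{X~P} k(X, 0)
```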
Maximum Mean Discrepancy
$\operatorname{MMD}_k(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal H_k} = \sup_{\|f\|_{\mathcal H_k} \le 1} \langle f, \mu_P - \mu_Q \rangle_{\mathcal H_k} = \sup_{\|f\|_{\mathcal H_k} \le 1} \big[ \mathbb E_{X \sim P} f(X) - \mathbb E_{Y \sim Q} f(Y) \big]$
- Last line is the Integral Probability Metric (IPM) form
- $f^\star \propto \mu_P - \mu_Q$ is called the “witness function” or “critic”: high on $P$, low on $Q$
MMD properties
- $\operatorname{MMD}_k(P, Q) \ge 0$, symmetry, triangle inequality
- If $k$ is characteristic, then $\operatorname{MMD}_k(P, Q) = 0$ iff $P = Q$
- i.e. $P \mapsto \mu_P$ is injective
- Makes MMD a metric on probability distributions
- Universal $\Rightarrow$ characteristic
- If we use a linear kernel: $\operatorname{MMD}(P, Q) = \| \mathbb E_P[X] - \mathbb E_Q[Y] \|$
- just the Euclidean distance between means
- If we use $k(x, y) = \|x\| + \|y\| - \|x - y\|$,
the squared MMD becomes the energy distance [Sejdinovic+ Annals-13]
Application: Kernel Herding
- Want a "super-sample" from $P$: $\frac1n \sum_{i=1}^n f(x_i) \approx \mathbb E_{X \sim P} f(X)$ for all $f \in \mathcal H_k$
- Letting $\hat P = \frac1n \sum_{i=1}^n \delta_{x_i}$, want $\langle f, \mu_{\hat P} \rangle \approx \langle f, \mu_P \rangle$ for all $\|f\|_{\mathcal H_k} \le 1$
- Error is $\sup_{\|f\|_{\mathcal H_k} \le 1} \big| \frac1n \sum_i f(x_i) - \mathbb E_P f(X) \big| = \operatorname{MMD}(\hat P, P)$
- Greedily minimize the MMD: $x_{t+1} = \operatorname*{argmin}_x \operatorname{MMD}\!\big( \tfrac{1}{t+1}\big( \textstyle\sum_{i \le t} \delta_{x_i} + \delta_x \big),\, P \big)$
- Get $\mathcal O(1/n)$ approximation instead of $\mathcal O(1/\sqrt n)$ with random samples
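A rough sketch of greedy herding over a finite candidate set (an illustrative simplification; the `herd` helper, candidate pool, and Gaussian kernel are our assumptions): at each step pick the candidate maximizing $\mu_P(x) - \frac{1}{t+1}\sum_{i \le t} k(x_i, x)$, which is equivalent to greedily shrinking the MMD.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def herd(P_samples, candidates, n_points, sigma=1.0):
    """Greedily choose n_points from `candidates` to match the mean embedding
    of P (estimated from P_samples)."""
    mu_P = gaussian_kernel(candidates, P_samples, sigma).mean(axis=1)  # mu_P(c) per candidate
    K_cc = gaussian_kernel(candidates, candidates, sigma)
    chosen = []
    for t in range(n_points):
        # herding rule: argmax_x mu_P(x) - (1/(t+1)) * sum_{chosen} k(x_i, x)
        penalty = K_cc[:, chosen].sum(axis=1) / (t + 1) if chosen else 0.0
        chosen.append(int(np.argmax(mu_P - penalty)))
    return candidates[chosen]

rng = np.random.default_rng(0)
P_samples = rng.normal(size=(2000, 2))
candidates = rng.normal(size=(500, 2))
super_sample = herd(P_samples, candidates, n_points=20)
```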
Estimating MMD from samples
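The standard unbiased estimator, given $X = (x_1, \dots, x_n) \sim P^n$ and $Y = (y_1, \dots, y_m) \sim Q^m$, is
$\widehat{\operatorname{MMD}}^2_k = \frac{1}{n(n-1)} \sum_{i \ne j} k(x_i, x_j) + \frac{1}{m(m-1)} \sum_{i \ne j} k(y_i, y_j) - \frac{2}{nm} \sum_{i, j} k(x_i, y_j)$.
A sketch in code (the Gaussian kernel, the `mmd2_unbiased` helper, and the toy data are illustrative):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of MMD^2 between the distributions of X and Y."""
    K_XX = gaussian_kernel(X, X, sigma)
    K_YY = gaussian_kernel(Y, Y, sigma)
    K_XY = gaussian_kernel(X, Y, sigma)
    n, m = len(X), len(Y)
    term_xx = (K_XX.sum() - np.trace(K_XX)) / (n * (n - 1))  # exclude i == j
    term_yy = (K_YY.sum() - np.trace(K_YY)) / (m * (m - 1))
    return term_xx + term_yy - 2 * K_XY.mean()

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(300, 2))
Y = rng.normal(0.5, 1, size=(300, 2))   # shifted mean: estimate clearly > 0
print(mmd2_unbiased(X, Y))
```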
MMD vs other distances
- MMD has an easy $\mathcal O(n^2)$ estimator
- block or incomplete estimators are $\mathcal O(n)$, but noisier
- For bounded kernel, estimation error is $\mathcal O(1/\sqrt n)$
- Independent of data dimension!
- But, no free lunch…the value of the MMD generally shrinks with growing dimension, so constant error gets worse relatively
GP view of MMD
Application: Two-sample testing
- Given samples $X \sim P^n$, $Y \sim Q^m$ from two unknown distributions
- Question: is $P = Q$?
- Hypothesis testing approach: test $H_0 : P = Q$ versus $H_1 : P \ne Q$
- Reject $H_0$ if $\widehat{\operatorname{MMD}}(X, Y) > c_\alpha$
- Do smokers/non-smokers get different cancers?
- Do Brits have the same friend network types as Americans?
- When does my laser agree with the one on Mars?
- Are storms in the 2000s different from storms in the 1800s?
- Does presence of this protein affect DNA binding? [MMDiff2]
- Do these dob and birthday columns mean the same thing?
- Does my generative model match the data distribution?
What's a hypothesis test again?
MMD-based testing
- Under $H_0$ ($P = Q$): $n \widehat{\operatorname{MMD}}^2$ converges in distribution to…something
- Infinite mixture of $\chi^2$s, params depend on $P$ and $k$
- Can estimate threshold with permutation testing
- Under $H_1$ ($P \ne Q$): $\widehat{\operatorname{MMD}}^2$ is asymptotically normal
- Any characteristic kernel gives consistent test…eventually
- Need enormous $n$ if kernel is bad for problem
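A sketch of the permutation-testing idea mentioned above (illustrative helper names; Gaussian kernel assumed): compute the observed statistic, then recompute it many times with the pooled sample randomly re-split to approximate the null distribution.

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma=1.0):
    k = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :])**2).sum(-1) / (2 * sigma**2))
    K_XX, K_YY, K_XY = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    return ((K_XX.sum() - np.trace(K_XX)) / (n * (n - 1))
            + (K_YY.sum() - np.trace(K_YY)) / (m * (m - 1))
            - 2 * K_XY.mean())

def mmd_permutation_test(X, Y, n_perms=500, alpha=0.05, sigma=1.0):
    """Reject H0: P = Q if the observed MMD^2 exceeds the (1 - alpha) quantile
    of the permutation null (pooled sample reshuffled into two groups)."""
    observed = mmd2_unbiased(X, Y, sigma)
    Z, n = np.concatenate([X, Y]), len(X)
    rng = np.random.default_rng(0)
    null = [mmd2_unbiased(Z[p[:n]], Z[p[n:]], sigma)
            for p in (rng.permutation(len(Z)) for _ in range(n_perms))]
    return observed > np.quantile(null, 1 - alpha)
```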
Classifier two-sample tests
- $\hat t$ is the accuracy of the trained classifier $f$ on the test set
- Under $H_0$, classification impossible: $\hat t \to \frac12$
- With $k(x, y) = f(x)\, f(y)$ where $f$ is the classifier,
get $\operatorname{MMD}(P, Q) = \big| \mathbb E_P f(X) - \mathbb E_Q f(Y) \big|$: a classifier test is an MMD test with this kernel
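A sketch of a basic classifier two-sample test (illustrative; it uses scikit-learn's logistic regression as the classifier and a normal approximation to the Binomial$(n_{\text{te}}, \tfrac12)$ null for the held-out accuracy; the helper name is ours):

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

def classifier_two_sample_test(X, Y, alpha=0.05, seed=0):
    """Label X as 0 and Y as 1, train on half the pooled data, and test
    whether held-out accuracy is significantly above chance (0.5)."""
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y])
    labels = np.concatenate([np.zeros(len(X)), np.ones(len(Y))])
    perm = rng.permutation(len(Z))
    half = len(Z) // 2
    tr, te = perm[:half], perm[half:]
    clf = LogisticRegression(max_iter=1000).fit(Z[tr], labels[tr])
    acc = clf.score(Z[te], labels[te])
    # One-sided normal approximation to the Binomial(n_te, 1/2) null
    threshold = 0.5 + norm.ppf(1 - alpha) * np.sqrt(0.25 / len(te))
    return acc > threshold, acc
```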
Deep learning and deep kernels
- $k(x, y) = f(x)\, f(y)$ is one form of deep kernel
- Deep models are usually of the form $f(x) = \langle w, \phi_\theta(x) \rangle$
- With a learned feature extractor $\phi_\theta$
- If we fix $\phi_\theta$, have $f \in \mathcal H_{k_\theta}$ with $k_\theta(x, y) = \langle \phi_\theta(x), \phi_\theta(y) \rangle$
- Same idea as NNGP approximation
- Generalize to a deep kernel: $k_\theta(x, y) = \kappa\big( \phi_\theta(x), \phi_\theta(y) \big)$ for a base kernel $\kappa$ (e.g. Gaussian)
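A minimal sketch of that deep-kernel form (the tiny random featurizer is a stand-in for a trained $\phi_\theta$; in practice the weights would be learned for the downstream objective):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(10, 16)), rng.normal(size=(16, 4))  # stand-in for learned weights

def phi(X):
    """A toy two-layer featurizer standing in for a trained network phi_theta."""
    return np.maximum(X @ W1, 0) @ W2

def deep_kernel(X, Y, sigma=1.0):
    """k_theta(x, y) = kappa(phi(x), phi(y)) with a Gaussian base kernel kappa."""
    A, B = phi(X), phi(Y)
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

X = rng.normal(size=(5, 10))
print(deep_kernel(X, X).shape)   # (5, 5) psd kernel matrix
```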
Normal deep learning ⊂ deep kernels
- Take $\kappa$ to be the linear kernel: $k_\theta(x, y) = \phi_\theta(x)^\top \phi_\theta(y)$
- Final function in $\mathcal H_{k_\theta}$ will be $f(x) = w^\top \phi_\theta(x)$
- With logistic loss: this is Platt scaling
“Normal deep learning ⊂ deep kernels” – so?
- This does not say that deep learning is (even approximately) a kernel method
- …despite what some people might want you to think
- We know theoretically deep learning can learn some things faster than any kernel method [see Malach+ ICML-21 + refs]
- But deep kernel learning ≠ traditional kernel models
- exactly like how usual deep learning ≠ linear models
Optimizing power of MMD tests
- Asymptotics of $\widehat{\operatorname{MMD}}^2$ give us immediately that the asymptotic power is $\approx \Phi\!\left( \frac{\sqrt n\, \operatorname{MMD}^2}{\sigma_{H_1}} - \frac{c_\alpha}{\sqrt n\, \sigma_{H_1}} \right)$
- $\operatorname{MMD}$, $\sigma_{H_1}$, $c_\alpha$ are constants:
first term usually dominates
- Pick $k$ to maximize an estimate of $\operatorname{MMD}^2 / \sigma_{H_1}$
- Use $\widehat{\operatorname{MMD}}^2$ from before, get $\hat\sigma_{H_1}$ from U-statistic theory
- Can show uniform convergence of estimator
- Get better tests (even after data splitting)
Application: (S)MMD GANs
- An implicit generative model:
- A generator net $G_\psi$ outputs samples from $\mathbb Q_\psi$
- Minimize an estimate of $\operatorname{MMD}_{k_\theta}(\mathbb P, \mathbb Q_\psi)$ on a minibatch
- MMD GAN: $\min_\psi \max_\theta \widehat{\operatorname{MMD}}_{k_\theta}(\mathbb P, \mathbb Q_\psi)$, with a deep kernel $k_\theta$
- SMMD GAN: same, with the Scaled MMD
- Scaled MMD uses kernel properties to ensure a smooth loss for $G_\psi$,
by making the witness function smooth [Arbel+ NeurIPS-18]
- Uses a form of gradient control on the witness
- Standard WGAN-GP better thought of in kernel framework
Application: fair representation learning (MMD-B-FAIR) [Deka/Sutherland AISTATS-23]
- Want to find a representation where
- We can tell whether an applicant is “creditworthy”
- We can't distinguish applicants by race
- Find a good classifier with near-zero test power for race
- Minimizing the test power criterion turns out to be hard
- Workaround: minimize test power of a (theoretical) block test
Application: distribution regression/classification/…
Independence
- $X \perp Y$ iff $\operatorname{Cov}(f(X), g(Y)) = 0$ for all square-integrable $f$, $g$
- Let's implement for RKHS functions $f \in \mathcal H_k$, $g \in \mathcal H_l$: $\operatorname{Cov}(f(X), g(Y)) = \langle f, C_{XY}\, g \rangle_{\mathcal H_k}$, where $C_{XY} = \mathbb E[\phi(X) \otimes \psi(Y)] - \mu_X \otimes \mu_Y$ is the cross-covariance operator
Cross-covariance operator and independence
- If $X \perp Y$, then $C_{XY} = 0$
- If $C_{XY} = 0$, is $X \perp Y$?
- If $k$, $l$ are characteristic:
- $C_{XY} = 0$ implies $X \perp Y$ [Szabó/Sriperumbudur JMLR-18]
- $X \perp Y$ iff $C_{XY} = 0$
- iff $\|C_{XY}\|_{\mathrm{HS}}^2 = 0$ (sum of squared singular values)
- $\operatorname{HSIC}(X, Y) = \|C_{XY}\|_{\mathrm{HS}}^2$: the "Hilbert-Schmidt Independence Criterion"
HSIC
- Linear case: $C_{XY}$ is the cross-covariance matrix, HSIC is its squared Frobenius norm
- Default estimator (biased, but simple): $\widehat{\operatorname{HSIC}} = \frac{1}{n^2} \operatorname{tr}(K H L H)$, with $H = I - \frac1n \mathbf 1 \mathbf 1^\top$
- Example application: maximize dependence between image features and image identity on a minibatch
- Using a learned deep kernel based on the feature network
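A sketch of that biased estimator (the `hsic_biased` helper and Gaussian kernels are illustrative choices):

```python
import numpy as np

def gaussian_kernel(A, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2 * A @ A.T
    return np.exp(-sq / (2 * sigma**2))

def hsic_biased(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased HSIC estimate: (1/n^2) tr(K H L H), with centering H = I - 11^T/n."""
    n = len(X)
    K, L = gaussian_kernel(X, sigma_x), gaussian_kernel(Y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
print(hsic_biased(X, X**2))                        # dependent: clearly positive
print(hsic_biased(X, rng.normal(size=(300, 1))))   # independent: near zero
```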
Recap
- Point embedding $\phi(x) = k(x, \cdot)$: if $f \in \mathcal H_k$ then $f(x) = \langle f, \phi(x) \rangle_{\mathcal H_k}$
- Mean embedding $\mu_P = \mathbb E_{X \sim P}\, \phi(X)$: if $f \in \mathcal H_k$ then $\mathbb E_P f(X) = \langle f, \mu_P \rangle_{\mathcal H_k}$
- $\operatorname{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal H_k}$ is 0 iff $P = Q$ (for characteristic kernels)
- $\operatorname{HSIC}(X, Y) = \|C_{XY}\|_{\mathrm{HS}}^2$ is 0 iff $X \perp Y$
(for characteristic $k$, $l$ …or slightly weaker)
- Often need to learn a kernel for good performance on complicated data
- Can often learn it end-to-end for downstream loss, asymptotic test power, …