# Modern Kernel Methods in Machine Learning: Part I

Danica J. Sutherland (she/her)
Computer Science, University of British Columbia
ETICS "summer" school, Oct 2022

## Motivation

• Machine learning! …but how do we actually do it?
• Linear models! $f(x) = w^\mathsf{T} x + b$
• Extend to nonlinear models via a feature map $\phi$: $f(x) = w^\mathsf{T} \phi(x) + b$
• Kernels are basically a way to study doing this with any, potentially very complicated, feature map $\phi$
• Convenient way to make models on documents, graphs, videos, datasets, …
• The features $\phi(x)$ will live in a reproducing kernel Hilbert space

## Hilbert spaces

• A complete (real or complex) inner product space.
• Inner product space: a vector space $\mathcal H$ with an inner product $\langle \cdot, \cdot \rangle_{\mathcal H}$:
  • linear: $\langle \alpha_1 f_1 + \alpha_2 f_2, g \rangle_{\mathcal H} = \alpha_1 \langle f_1, g \rangle_{\mathcal H} + \alpha_2 \langle f_2, g \rangle_{\mathcal H}$
  • symmetric: $\langle f, g \rangle_{\mathcal H} = \langle g, f \rangle_{\mathcal H}$ (conjugate-symmetric in the complex case)
  • $\langle f, f \rangle_{\mathcal H} \ge 0$, with $\langle f, f \rangle_{\mathcal H} = 0$ iff $f = 0$
• Induces a norm: $\lVert f \rVert_{\mathcal H} = \sqrt{\langle f, f \rangle_{\mathcal H}}$
• Complete: “well-behaved” (Cauchy sequences have limits in $\mathcal H$)

## Kernel: an inner product between feature maps

• Call our domain $\mathcal X$, some set
  • $\mathbb R^d$, functions, distributions, graphs, images, …
• $k : \mathcal X \times \mathcal X \to \mathbb R$ is a kernel on $\mathcal X$ if there exists a Hilbert space $\mathcal H$ and a feature map $\phi : \mathcal X \to \mathcal H$ so that $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H}$
• Roughly, $k(x, y)$ is a notion of “similarity” between inputs
• Linear kernel on $\mathbb R^d$: $k(x, y) = \langle x, y \rangle = x^\mathsf{T} y$
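
As a quick illustration in code (a minimal sketch; the quadratic feature map here is made up for the example, not taken from the slides):

```python
import numpy as np

# Hypothetical feature map phi(x) = (x, x^2) on X = R, chosen for illustration
def phi(x):
    return np.array([x, x**2])

def k(x, y):
    # k(x, y) = <phi(x), phi(y)> = x*y + x^2 * y^2
    return phi(x) @ phi(y)

print(k(2.0, 3.0))       # 2*3 + 4*9 = 42.0
print(2.0 * 3.0)         # linear kernel on R: just the product, 6.0
```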

## Aside: the name “kernel”

• Our concept: "positive semi-definite kernel," "Mercer kernel," "RKHS kernel"
• Semi-related: kernel density estimation
  • A function $k(x, y)$, usually symmetric, like an RKHS kernel
  • Always requires $\int k(x, y) \,\mathrm{d}y = 1$, unlike RKHS kernels
  • Often requires $k(x, y) \ge 0$, unlike RKHS kernels
  • Not required to be an inner product, unlike RKHS kernels
• Unrelated:
  • The kernel (null space) of a linear map
  • The kernel of a probability density
  • The kernel of a convolution
  • CUDA kernels
  • The Linux kernel
  • Popcorn kernels

## Building kernels from other kernels

• Scaling: if $\gamma \ge 0$, $\gamma\, k(x, y)$ is a kernel
• Sum: $k_1(x, y) + k_2(x, y)$ is a kernel
• Is $k_1(x, y) - k_2(x, y)$ necessarily a kernel?
  • Take $k_1(x, y) = x^\mathsf{T} y$ on $\mathcal X = \mathbb R$, and $k_2(x, y) = 2\, x^\mathsf{T} y$.
  • Then $k_1(x, x) - k_2(x, x) = -x^2 < 0$ for $x \ne 0$.
  • But any kernel needs $k(x, x) = \lVert \phi(x) \rVert_{\mathcal H}^2 \ge 0$.
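
A numerical sanity check of this counterexample (a sketch using the linear kernel on random points; checking eigenvalues is just one way to test matrix psd-ness):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

K1 = X @ X.T        # linear-kernel matrix
K2 = 2 * K1         # k2 = 2 * k1, also a kernel (scaling)

# Sum of kernels stays psd: smallest eigenvalue >= 0 (up to round-off).
print(np.linalg.eigvalsh(K1 + K2).min())

# Difference k1 - k2 = -k1 has negative eigenvalues: not a kernel.
print(np.linalg.eigvalsh(K1 - K2).min())
```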

## Positive definiteness

• A symmetric function $k : \mathcal X \times \mathcal X \to \mathbb R$ (i.e. $k(x, y) = k(y, x)$) is positive semi-definite (psd) if for all $n \ge 1$, $(a_1, \ldots, a_n) \in \mathbb R^n$, $(x_1, \ldots, x_n) \in \mathcal X^n$,
$$\sum_{i=1}^n \sum_{j=1}^n a_i a_j\, k(x_i, x_j) \ge 0$$
• Equivalently: the kernel matrix $K$, with $K_{ij} = k(x_i, x_j)$, is PSD
• Hilbert space kernels are psd: $\sum_{i,j} a_i a_j \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal H} = \big\lVert \sum_i a_i \phi(x_i) \big\rVert_{\mathcal H}^2 \ge 0$
• psd functions are Hilbert space kernels
• Moore-Aronszajn Theorem; we'll come back to this
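
To connect the definition to the kernel-matrix statement, here is a small numerical check (a sketch; the Gaussian kernel, point counts, and tolerance are arbitrary choices):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
K = gaussian_kernel_matrix(X)

# Definition: sum_ij a_i a_j k(x_i, x_j) >= 0 for any coefficients a.
for _ in range(1000):
    a = rng.normal(size=20)
    assert a @ K @ a >= -1e-10

# Equivalent matrix statement: all eigenvalues of K are nonnegative.
print(np.linalg.eigvalsh(K).min())
```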

## Some more ways to build kernels

• Limits: if $k(x, y) = \lim_{m \to \infty} k_m(x, y)$ exists, $k$ is psd
• Products: $k_1(x, y)\, k_2(x, y)$ is psd
  • Let $V \sim \mathcal N(0, K_1)$, $W \sim \mathcal N(0, K_2)$ be independent
  • Then $\operatorname{Cov}(V_i W_i, V_j W_j) = (K_1)_{ij} (K_2)_{ij}$
  • Covariance matrices are psd, so the entrywise product $K_1 \circ K_2$ is too
• Powers: $k(x, y)^p$ is psd for any integer $p \ge 0$

$$k_{\text{poly}}(x, y) = (x^\mathsf{T} y + c)^p \text{ for } c \ge 0 \text{: the polynomial kernel}$$

• Exponents: $\exp(k(x, y))$ is psd
  • Limit of the Taylor series: $\exp(k(x, y)) = \lim_{M \to \infty} \sum_{m=0}^{M} \frac{k(x, y)^m}{m!}$
• If $f : \mathcal X \to \mathbb R$, then $f(x)\, k(x, y)\, f(y)$ is psd
  • Use the feature map $x \mapsto f(x)\, \phi(x)$

$$k_\sigma(x, y) = \exp\!\Big( -\tfrac{1}{2\sigma^2} \lVert x - y \rVert^2 \Big) = e^{-\frac{\lVert x \rVert^2}{2\sigma^2}} \exp\!\Big( \tfrac{1}{\sigma^2} x^\mathsf{T} y \Big)\, e^{-\frac{\lVert y \rVert^2}{2\sigma^2}} \text{: the Gaussian kernel}$$
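
That factorization can be verified numerically (a sketch: $f(x) = e^{-\lVert x \rVert^2 / (2\sigma^2)}$ normalizes the exponentiated, scaled linear kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
sigma = 1.5

f = lambda z: np.exp(-(z @ z) / (2 * sigma**2))  # normalizer: f(x) k0(x,y) f(y)
k0 = np.exp((x @ y) / sigma**2)                  # exp of scaled linear kernel: psd

# f(x) * exp(x.y / sigma^2) * f(y) = exp(-||x - y||^2 / (2 sigma^2))
k_gauss = f(x) * k0 * f(y)
assert np.isclose(k_gauss, np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2)))
print(k_gauss)
```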

## Reproducing property

• Recall the original motivating example, with feature map $\phi : \mathcal X \to \mathcal H$
  • Kernel is $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H}$
  • Classifier based on linear $f(x) = \langle w, \phi(x) \rangle_{\mathcal H}$
• $f$ is the function itself; corresponds to the vector $w$ in $\mathcal H$
  • $f(x)$ is the function evaluated at a point $x$
• Elements of $\mathcal H$ are functions, $f : \mathcal X \to \mathbb R$
• Reproducing prop.: $f(x) = \langle f, \phi(x) \rangle_{\mathcal H}$ for $f \in \mathcal H$

## Reproducing kernel Hilbert space (RKHS)

• Every psd kernel $k$ on $\mathcal X$ defines a (unique) Hilbert space, its RKHS $\mathcal H_k$, and a map $\phi : \mathcal X \to \mathcal H_k$ where
  • $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H_k}$
  • Elements $f \in \mathcal H_k$ are functions on $\mathcal X$, with $f(x) = \langle f, \phi(x) \rangle_{\mathcal H_k}$
• Combining the two, we sometimes write $\phi(x) = k(x, \cdot)$, so $k(x, y) = \langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal H_k}$
• $\delta_x : f \mapsto f(x)$ is the evaluation functional
  • An RKHS is defined by $\delta_x$ being continuous, or equivalently bounded: $\lvert f(x) \rvert \le M_x \lVert f \rVert_{\mathcal H_k}$

## Moore-Aronszajn Theorem

• Building $\mathcal H_k$ for a given psd $k$:
  • Define $\mathcal H_0 = \operatorname{span}\big( \{ k(x, \cdot) : x \in \mathcal X \} \big)$, with inner product from $\big\langle \sum_i a_i k(x_i, \cdot),\, \sum_j b_j k(y_j, \cdot) \big\rangle = \sum_{i,j} a_i b_j\, k(x_i, y_j)$
  • Take $\mathcal H_k$ to be the completion of $\mathcal H_0$ in the metric from this inner product
  • Get that the reproducing property holds for $k(x, \cdot)$ in $\mathcal H_k$
  • Can also show uniqueness
• Theorem: $k$ is psd iff it's the reproducing kernel of an RKHS
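
A sketch of this construction on finite spans (the points, coefficients, and the Gaussian kernel are arbitrary illustrative choices):

```python
import numpy as np

def k(x, y, sigma=1.0):
    # Gaussian kernel, so k(z, z) = 1 for every z
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
Xs, a = rng.normal(size=(4, 2)), rng.normal(size=4)   # f = sum_i a_i k(x_i, .)
Ys, b = rng.normal(size=(3, 2)), rng.normal(size=3)   # g = sum_j b_j k(y_j, .)

# Pre-RKHS inner product: <f, g> = sum_ij a_i b_j k(x_i, y_j)
K_xy = np.array([[k(xi, yj) for yj in Ys] for xi in Xs])
print("<f, g> =", a @ K_xy @ b)

# Norm from the same inner product: ||f||^2 = a^T K_xx a
K_xx = np.array([[k(xi, xj) for xj in Xs] for xi in Xs])
norm_f = np.sqrt(a @ K_xx @ a)

# Reproducing property + Cauchy-Schwarz: |f(z)| <= ||f|| sqrt(k(z, z))
z = rng.normal(size=2)
f_z = sum(ai * k(xi, z) for ai, xi in zip(a, Xs))
assert abs(f_z) <= norm_f * np.sqrt(k(z, z)) + 1e-10
```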

## A quick check: linear kernels

• $k(x, y) = x^\mathsf{T} y$ on $\mathcal X = \mathbb R^d$
• $k(x, \cdot) = x^\mathsf{T} (\cdot)$ “corresponds to” the vector $x$
• If $f = \sum_i a_i k(x_i, \cdot)$, then $f(y) = \big( \sum_i a_i x_i \big)^\mathsf{T} y = w^\mathsf{T} y$, with $\lVert f \rVert_{\mathcal H_k} = \lVert w \rVert_2$
• Closure doesn't add anything here, since $\mathbb R^d$ is closed
• So, the linear kernel gives you the RKHS of linear functions

## More complicated: Gaussian kernels

• $\mathcal H_k$ is infinite-dimensional
• Functions in $\mathcal H_k$ are bounded: $\lvert f(x) \rvert = \lvert \langle f, k(x, \cdot) \rangle_{\mathcal H_k} \rvert \le \lVert f \rVert_{\mathcal H_k} \sqrt{k(x, x)} = \lVert f \rVert_{\mathcal H_k}$
• Choice of $\sigma$ controls how fast functions can vary: $\lvert f(x) - f(y) \rvert \le \lVert f \rVert_{\mathcal H_k} \lVert k(x, \cdot) - k(y, \cdot) \rVert_{\mathcal H_k} = \lVert f \rVert_{\mathcal H_k} \sqrt{2 - 2 k(x, y)}$
• Can say lots more with Fourier properties
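
A numerical check of that variation bound (a sketch; the centers, coefficients, and bandwidth are arbitrary):

```python
import numpy as np

def k(x, y, sigma):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
sigma = 0.5
Xs, a = rng.normal(size=(5, 2)), rng.normal(size=5)

f = lambda z: sum(ai * k(xi, z, sigma) for ai, xi in zip(a, Xs))
K = np.array([[k(xi, xj, sigma) for xj in Xs] for xi in Xs])
norm_f = np.sqrt(a @ K @ a)          # RKHS norm of f = sum_i a_i k(x_i, .)

# |f(x) - f(y)| <= ||f||_H ||k(x,.) - k(y,.)||_H = ||f||_H sqrt(2 - 2 k(x,y))
x, y = rng.normal(size=2), rng.normal(size=2)
bound = norm_f * np.sqrt(2 - 2 * k(x, y, sigma))
assert abs(f(x) - f(y)) <= bound + 1e-10
print(abs(f(x) - f(y)), bound)
```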

## Kernel ridge regression

$$\hat f = \operatorname*{argmin}_{f \in \mathcal H_k} \frac{1}{n} \sum_{i=1}^n \big( f(x_i) - y_i \big)^2 + \lambda \lVert f \rVert_{\mathcal H_k}^2$$

The linear kernel gives normal ridge regression; nonlinear kernels will give nonlinear regression!

How to find $\hat f$? Representer Theorem: the minimizer has the form $\hat f = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$:

• Let $\mathcal H_X = \operatorname{span}\big( \{ k(x_i, \cdot) \}_{i=1}^n \big)$,
  $\mathcal H_\perp$ its orthogonal complement in $\mathcal H_k$
• Decompose $f = f_X + f_\perp$ with $f_X \in \mathcal H_X$, $f_\perp \in \mathcal H_\perp$
• $f(x_i) = \langle f_X + f_\perp, k(x_i, \cdot) \rangle_{\mathcal H_k} = f_X(x_i)$, while $\lVert f \rVert^2 = \lVert f_X \rVert^2 + \lVert f_\perp \rVert^2$
• Minimizer needs $f_\perp = 0$, and so $\hat f = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$

Setting the derivative of $\frac{1}{n} \lVert K \alpha - y \rVert^2 + \lambda\, \alpha^\mathsf{T} K \alpha$ to zero gives $K (K + n \lambda I)\, \alpha = K y$,
satisfied by $\alpha = (K + n \lambda I)^{-1} y$
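
Putting the pieces together, kernel ridge regression is a few lines of linear algebra (a minimal sketch with a Gaussian kernel on a toy 1-D dataset; all names and constants are illustrative):

```python
import numpy as np

def gaussian_K(A, B, sigma=1.0):
    # Pairwise kernel matrix: K[i, j] = k(a_i, b_j)
    sq = np.sum(A**2, 1)[:, None] - 2 * A @ B.T + np.sum(B**2, 1)[None, :]
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# alpha = (K + n lambda I)^{-1} y, from the derivation above
n, lam = len(X), 1e-2
alpha = np.linalg.solve(gaussian_K(X, X) + n * lam * np.eye(n), y)

# Predictions: f_hat(x) = sum_i alpha_i k(x_i, x)
X_test = np.linspace(-3, 3, 7)[:, None]
print(gaussian_K(X_test, X) @ alpha)   # compare to np.sin(X_test[:, 0])
```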

## Other kernel algorithms

• Representer theorem applies to $\min_f L\big(f(x_1), \ldots, f(x_n)\big) + R\big(\lVert f \rVert_{\mathcal H_k}\big)$ if $R$ is strictly increasing in $\lVert f \rVert_{\mathcal H_k}$
• Kernel methods can then train based on the kernel matrix $K$
• Classification algorithms:
  • Support vector machines: $L$ is the hinge loss, $\max(0, 1 - y_i f(x_i))$
  • Kernel logistic regression: $L$ is the logistic loss, $\log(1 + e^{-y_i f(x_i)})$
• Principal component analysis, canonical correlation analysis
• Many, many more…
• But not everything works… e.g. the Lasso's $\ell_1$ regularizer isn't a function of $\lVert f \rVert_{\mathcal H_k}$
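
For instance, scikit-learn's SVC accepts a precomputed kernel matrix directly, so any psd kernel plugs in (a sketch, assuming scikit-learn is installed; the data and kernel here are synthetic examples):

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_K(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] - 2 * A @ B.T + np.sum(B**2, 1)[None, :]
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.sum(X**2, axis=1) > 2).astype(int)      # nonlinear decision boundary

# Train from the kernel matrix alone, as the representer theorem suggests
clf = SVC(kernel="precomputed").fit(gaussian_K(X, X), y)

# At test time, we only need k(x_new, x_i) against the training points
X_new = rng.normal(size=(5, 2))
print(clf.predict(gaussian_K(X_new, X)))
```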

## Some theory: generalization

• The Rademacher complexity of $\mathcal F_B = \{ f : \lVert f \rVert_{\mathcal H_k} \le B \}$ on $n$ samples is upper-bounded by $B \sqrt{\tau^2 / n}$ if $k(x, x) \le \tau^2$
• Implies for $\rho$-Lipschitz losses that the generalization gap is $\mathcal O\big( \rho B \tau / \sqrt{n} \big)$, uniformly over $\mathcal F_B$
• Same kind of rates with stability-based analyses
• Implies that, if the “truth” is low-norm, most kernel methods aren't too suboptimal
• Difficulty of learning is controlled by RKHS norm of target
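
A sketch of the standard argument behind that complexity bound (details beyond the slide, using $k(x, x) \le \tau^2$):

$$
\frac{1}{n}\, \mathbb E_\sigma \sup_{\lVert f \rVert \le B} \sum_{i=1}^n \sigma_i f(x_i)
= \frac{1}{n}\, \mathbb E_\sigma \sup_{\lVert f \rVert \le B} \Big\langle f,\, \sum_{i=1}^n \sigma_i k(x_i, \cdot) \Big\rangle
= \frac{B}{n}\, \mathbb E_\sigma \Big\lVert \sum_{i=1}^n \sigma_i k(x_i, \cdot) \Big\rVert
\le \frac{B}{n} \sqrt{\sum_{i=1}^n k(x_i, x_i)}
\le B \sqrt{\frac{\tau^2}{n}}
$$

The second equality uses the reproducing property and Cauchy-Schwarz; the first inequality is Jensen's, since the cross terms $\mathbb E[\sigma_i \sigma_j] = 0$ vanish for $i \ne j$.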

## Some theory: universality

• One definition: a continuous kernel $k$ on a compact metric space $\mathcal X$ is universal if $\mathcal H_k$ is $L_\infty$-dense in $C(\mathcal X)$:
  for every continuous $g : \mathcal X \to \mathbb R$, for every $\varepsilon > 0$, there is an $f \in \mathcal H_k$ with $\lVert f - g \rVert_\infty \le \varepsilon$
• Implies that, on compact $\mathcal X$, $\mathcal H_k$ can separate compact sets
  • $f \in \mathcal H_k$ with $f(x) > 0$ for $x \in X_1$, $f(x) < 0$ for $x \in X_2$
• Which implies there are $f \in \mathcal H_k$ with arbitrarily small loss
• Might take arbitrarily large norm: approximation/estimation tradeoff
• Can prove via Stone-Weierstrass or Fourier properties
• Never true for finite-dim kernels: need $\dim(\mathcal H_k) = \infty$

## Translation-invariant kernels on $\mathbb R^d$

• Assume $k(x, y) = \psi(x - y)$ is bounded, continuous, and translation invariant
• Then $\psi$ is proportional to the Fourier transform of a probability measure (Bochner's theorem)
• If $\psi \in L_1(\mathbb R^d)$, the measure has a density
• If that density is positive everywhere, $k$ is universal
  • For all nonzero finite signed measures $\mu$, $\iint k(x, y) \,\mathrm d\mu(x) \,\mathrm d\mu(y) > 0$
• True for the Gaussian, $\exp\big( -\tfrac{1}{2\sigma^2} \lVert x - y \rVert_2^2 \big)$, and Laplace, $\exp\big( -\tfrac{1}{\sigma} \lVert x - y \rVert_1 \big)$
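
Bochner's theorem is also practical: sampling frequencies from the kernel's spectral measure gives random Fourier features whose inner products approximate $k$ (a sketch of the Rahimi-Recht construction for the Gaussian kernel; $D$, the number of features, is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 3, 5000, 1.0

# The Gaussian kernel's spectral measure is N(0, I / sigma^2); sample from it
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2 * np.pi, size=D)

def features(x):
    # z(x) with E[z(x) . z(y)] = exp(-||x - y||^2 / (2 sigma^2))
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
print(features(x) @ features(y))                          # approximation
print(np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2)))     # exact kernel
```

Linear models on such features run in time linear in $n$, previewing the approximation methods mentioned below.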

## Limitations of kernel-based learning

• Generally bad at learning sparsity
  • e.g. a target depending on only a few of the $d$ input coordinates, for large $d$
• Provably statistically slower than deep learning for a few problems
  • e.g. to learn a single ReLU, $x \mapsto \max(0, w^\mathsf{T} x)$, need RKHS norm exponential in the dimension $d$
• Also some hierarchical problems, etc
• $\mathcal O(n^3)$ computational complexity, $\mathcal O(n^2)$ memory for exact solutions
• Various approximations you can make
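
One such approximation is the Nyström method: build a low-rank surrogate for $K$ from a few landmark points, needing only $\mathcal O(nm)$ kernel evaluations (a sketch; the landmark count and data are arbitrary):

```python
import numpy as np

def gaussian_K(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] - 2 * A @ B.T + np.sum(B**2, 1)[None, :]
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
n, m = 2000, 100                    # m << n landmark points
X = rng.normal(size=(n, 3))
Z = X[rng.choice(n, size=m, replace=False)]

# Nystrom: K ~= K_nm pinv(K_mm) K_nm^T, a rank-m approximation
K_nm = gaussian_K(X, Z)
K_mm = gaussian_K(Z, Z)
K_approx = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T

i, j = 3, 17                        # spot-check one entry against exact k
print(K_approx[i, j], gaussian_K(X[i:i+1], X[j:j+1])[0, 0])
```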

## Relationship to deep learning

• Deep models usually end as $f(x) = w^\mathsf{T} h(x)$ for a learned representation $h$
  • Can think of $h$ as a learned feature map, with kernel $k(x, y) = h(x)^\mathsf{T} h(y)$
• Does this gain us anything?
• Random nets with trained last layer (NNGP) can be decent
• As width $\to \infty$, nets become the neural tangent kernel (NTK)
• Widely used for theoretical analysis… more tomorrow
• SVMs with NTK can be great on small data
• Inspiration: learn the kernel model end-to-end
• Ongoing area; good results in two-sample testing, GANs, density estimation, meta-learning, semi-supervised learning, …
• Explored a bit in interactive session!

## What's next

• After break: interactive session exploring kernels w/ ridge regression
• Tomorrow: a subset of
• Representing distributions
• Uses for statistical testing + generative models
• Connections to Gaussian processes, probabilistic numerics
• Approximation methods for faster computation
• Deeper connection to deep learning
• More details on basics:
  • Berlinet and Thomas-Agnan, *Reproducing Kernel Hilbert Spaces in Probability and Statistics*
  • Steinwart and Christmann, *Support Vector Machines*