# Kernel Methods: From Basics to Modern Applications

Danica J. Sutherland (she/her)
Computer Science, University of British Columbia
Data Science Summer School, January 2021

## Motivation

• Machine learning! …but how do we actually do it?
• Linear models! $f(x) = w^\top x + b$
• Extend with an explicit feature map: $f(x) = w^\top \phi(x) + b$
• Kernels are basically a way to study doing this with any, potentially very complicated, feature map $\phi$
• Convenient way to make models on documents, graphs, videos, datasets, …
• $\phi(x)$ will live in a reproducing kernel Hilbert space

## Hilbert spaces

• A complete (real or complex) inner product space.
• Inner product space: a vector space with an inner product $\langle \cdot, \cdot \rangle_{\mathcal H}$:
  • Linear: $\langle \alpha_1 f_1 + \alpha_2 f_2, g \rangle_{\mathcal H} = \alpha_1 \langle f_1, g \rangle_{\mathcal H} + \alpha_2 \langle f_2, g \rangle_{\mathcal H}$
  • Symmetric: $\langle f, g \rangle_{\mathcal H} = \langle g, f \rangle_{\mathcal H}$ (conjugate-symmetric in the complex case)
  • Positive definite: $\langle f, f \rangle_{\mathcal H} \ge 0$, with equality iff $f = 0$
• Induces a norm: $\|f\|_{\mathcal H} = \sqrt{\langle f, f \rangle_{\mathcal H}}$
• Complete: “well-behaved” (Cauchy sequences have limits in $\mathcal H$)

## Kernel: an inner product between feature maps

• Call our domain $\mathcal X$, some set
  • $\mathbb R^d$, functions, distributions, graphs, images, …
• $k : \mathcal X \times \mathcal X \to \mathbb R$ is a kernel on $\mathcal X$ if there exists a Hilbert space $\mathcal H$ and a feature map $\phi : \mathcal X \to \mathcal H$ so that $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H}$
• Roughly, $k(x, y)$ is a notion of “similarity” between inputs
• Linear kernel on $\mathbb R^d$: $\phi(x) = x$, so $k(x, y) = x^\top y$

## Building kernels from other kernels

• Scaling: if $\gamma \ge 0$, then $k_\gamma(x, y) = \gamma \, k(x, y)$ is a kernel: use $\phi_\gamma(x) = \sqrt{\gamma} \, \phi(x)$
• Sum: $k_+(x, y) = k_1(x, y) + k_2(x, y)$ is a kernel: use $\phi_+(x) = (\phi_1(x), \phi_2(x))$
• Is $k_1(x, y) - k_2(x, y)$ necessarily a kernel?
  • Take $\mathcal X = \mathbb R$, $k_1(x, y) = x y$, $k_2(x, y) = 2 x y$.
  • Then $k_1(x, x) - k_2(x, x) = -x^2 < 0$ for $x \ne 0$
  • But any kernel satisfies $k(x, x) = \| \phi(x) \|_{\mathcal H}^2 \ge 0$.
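The counterexample above can be checked numerically; a tiny sketch (function names are ours, not from the slides):

```python
# Toy check: k1(x, y) = x*y and k2(x, y) = 2*x*y are both kernels on R,
# but their difference is not, because k(x, x) = ||phi(x)||^2 must be >= 0.
def k1(x, y):
    return x * y

def k2(x, y):
    return 2 * x * y

def k_diff(x, y):
    return k1(x, y) - k2(x, y)

print(k_diff(3.0, 3.0))  # -9.0 < 0, so k1 - k2 cannot be a kernel
```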

## Positive definiteness

• A symmetric function $k : \mathcal X \times \mathcal X \to \mathbb R$ is positive semi-definite (psd) if for all $n \ge 1$, $(a_1, \dots, a_n) \in \mathbb R^n$, $(x_1, \dots, x_n) \in \mathcal X^n$, $\sum_{i=1}^n \sum_{j=1}^n a_i a_j \, k(x_i, x_j) \ge 0$
• Equivalently: every kernel matrix $K$, $K_{ij} = k(x_i, x_j)$, is PSD
• Hilbert space kernels are psd: $\sum_i \sum_j a_i a_j \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal H} = \left\| \sum_i a_i \phi(x_i) \right\|_{\mathcal H}^2 \ge 0$
• psd functions are Hilbert space kernels
  • Moore-Aronszajn Theorem; we'll come back to this
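The matrix characterization is easy to test empirically. A minimal sketch (our own helper names) that checks a kernel matrix's eigenvalues:

```python
import numpy as np

def kernel_matrix(k, xs):
    """Kernel matrix K_ij = k(x_i, x_j) for points xs."""
    return np.array([[k(x, y) for y in xs] for x in xs])

def is_psd(K, tol=1e-10):
    # A symmetric matrix is PSD iff all eigenvalues are >= 0
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

rng = np.random.default_rng(0)
xs = rng.normal(size=10)

linear = lambda x, y: x * y
K = kernel_matrix(linear, xs)
print(is_psd(K))   # True: the linear kernel is psd
print(is_psd(-K))  # False: its negation is not
```

This only checks PSD-ness on one sample of points, of course; a psd *function* must satisfy it for every finite set.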

## Some more ways to build kernels

• Limits: if $k_\infty(x, y) = \lim_{m \to \infty} k_m(x, y)$ exists for psd $k_m$, then $k_\infty$ is psd
• Products: $k_\times(x, y) = k_1(x, y) \, k_2(x, y)$ is psd
  • Let $(V_1, \dots, V_n) \sim \mathcal N(0, K_1)$ and $(W_1, \dots, W_n) \sim \mathcal N(0, K_2)$ be independent
  • Then $\operatorname{Cov}(V_i W_i, V_j W_j) = \operatorname{Cov}(V_i, V_j) \operatorname{Cov}(W_i, W_j) = (K_1)_{ij} (K_2)_{ij}$
  • Covariance matrices are psd, so $k_\times$ is too
• Powers: $k_m(x, y) = k(x, y)^m$ is psd for any integer $m \ge 0$

$k_{\mathrm{poly}(m, c)}(x, y) = (x^\top y + c)^m$ for $c \ge 0$: the polynomial kernel

• Exponents: $k_{\exp}(x, y) = \exp(k(x, y))$ is psd
• If $f : \mathcal X \to \mathbb R$, then $k_f(x, y) = f(x) \, k(x, y) \, f(y)$ is psd
  • Use the feature map $\phi_f(x) = f(x) \, \phi(x)$

$k_\sigma(x, y) = \exp\!\left( -\tfrac{\|x\|^2}{2\sigma^2} \right) \exp\!\left( \tfrac{x^\top y}{\sigma^2} \right) \exp\!\left( -\tfrac{\|y\|^2}{2\sigma^2} \right) = \exp\!\left( -\tfrac{\|x - y\|^2}{2\sigma^2} \right)$: the Gaussian kernel
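The Gaussian-kernel construction above (exponentiate the scaled linear kernel, then rescale by $f(x) = \exp(-\|x\|^2 / 2\sigma^2)$ on both sides) can be verified directly; a small numeric sketch with our own variable names:

```python
import numpy as np

sigma = 1.5
rng = np.random.default_rng(1)
x, y = rng.normal(size=3), rng.normal(size=3)

# f(v) = exp(-||v||^2 / (2 sigma^2)), the rescaling function
f = lambda v: np.exp(-v @ v / (2 * sigma**2))

# Build the kernel step by step: exp of the scaled linear kernel,
# rescaled by f on both arguments...
constructed = f(x) * np.exp(x @ y / sigma**2) * f(y)

# ...and compare with the usual closed form.
gaussian = np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))
print(np.isclose(constructed, gaussian))  # True
```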

## Reproducing property

• Recall the original motivating example, with an explicit feature map $\phi : \mathcal X \to \mathcal H$
  • Kernel is $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H}$
  • Classifier based on linear $f(x) = \langle w, \phi(x) \rangle_{\mathcal H}$
  • $w$ is the function itself, represented by a vector in $\mathcal H$; $f(x)$ is the function evaluated at a point
• Elements of $\mathcal H$ correspond to functions, $f : \mathcal X \to \mathbb R$
• Reproducing prop.: $f(x) = \langle f, \phi(x) \rangle_{\mathcal H}$ for $f \in \mathcal H$

## Reproducing kernel Hilbert space (RKHS)

• Every psd kernel $k$ on $\mathcal X$ defines a (unique) Hilbert space, its RKHS $\mathcal H$, and a map $\phi : \mathcal X \to \mathcal H$ where $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal H}$
• Elements $f \in \mathcal H$ are functions on $\mathcal X$, with $f(x) = \langle f, \phi(x) \rangle_{\mathcal H}$
• Combining the two, we sometimes write $\phi(x) = k(x, \cdot)$
• $\delta_x : f \mapsto f(x)$ is the evaluation functional
  • An RKHS is defined by every $\delta_x$ being continuous, or equivalently bounded

## Moore-Aronszajn Theorem

• Building $\mathcal H$ for a given psd $k$:
  • Define $\mathcal H_0$ from $\operatorname{span}\{ k(x, \cdot) : x \in \mathcal X \}$, with $\left\langle \sum_i a_i k(x_i, \cdot), \sum_j b_j k(y_j, \cdot) \right\rangle_{\mathcal H_0} = \sum_{i,j} a_i b_j \, k(x_i, y_j)$
  • Take $\mathcal H$ to be the completion of $\mathcal H_0$ in the metric from $\langle \cdot, \cdot \rangle_{\mathcal H_0}$
  • Get that the reproducing property holds for $\phi(x) = k(x, \cdot)$ in $\mathcal H$
  • Can also show uniqueness
• Theorem: $k$ is psd iff it's the reproducing kernel of an RKHS

## A quick check: linear kernels

• $k(x, y) = x^\top y$ on $\mathcal X = \mathbb R^d$
• If $f = \sum_i a_i k(x_i, \cdot)$, then $f(y) = \sum_i a_i x_i^\top y = w^\top y$ for $w = \sum_i a_i x_i$
• Closure doesn't add anything here, since $\mathbb R^d$ is closed
• So, the linear kernel gives you the RKHS of linear functions, with $\|f\|_{\mathcal H} = \|w\|$
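A quick numeric illustration of this (our own setup): a span element $\sum_i a_i k(x_i, \cdot)$ under the linear kernel really is the linear function $w^\top(\cdot)$ with $w = \sum_i a_i x_i$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))  # points x_1, ..., x_5 in R^3
a = rng.normal(size=5)       # span coefficients a_i
w = a @ X                    # w = sum_i a_i x_i

y = rng.normal(size=3)       # an arbitrary evaluation point

# Evaluate f(y) = sum_i a_i k(x_i, y) with the linear kernel k(x, y) = x.y
f_y = sum(a[i] * (X[i] @ y) for i in range(5))
print(np.isclose(f_y, w @ y))  # True: f is the linear function w.y
```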

## More complicated: Gaussian kernels

• $\mathcal H$ is infinite-dimensional
• Functions in $\mathcal H$ are bounded: $|f(x)| = |\langle f, k(x, \cdot) \rangle_{\mathcal H}| \le \|f\|_{\mathcal H} \sqrt{k(x, x)} = \|f\|_{\mathcal H}$
• Choice of $\sigma$ controls how fast functions can vary: $|f(x) - f(y)| \le \|f\|_{\mathcal H} \, \|k(x, \cdot) - k(y, \cdot)\|_{\mathcal H} = \|f\|_{\mathcal H} \sqrt{2 - 2 k(x, y)}$
• Can say lots more with Fourier properties

## Kernel ridge regression

$\hat f = \operatorname{argmin}_{f \in \mathcal H} \sum_{i=1}^n (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal H}^2$

The linear kernel gives normal ridge regression; nonlinear kernels will give nonlinear regression!

How to find $\hat f$? Representer Theorem: the minimizer lies in $\operatorname{span}\{ k(x_i, \cdot) \}_{i=1}^n$.

• Let $\mathcal H_X = \operatorname{span}\{ k(x_i, \cdot) \}_{i=1}^n$, and $\mathcal H_\perp$ its orthogonal complement in $\mathcal H$
• Decompose $f = f_X + f_\perp$ with $f_X \in \mathcal H_X$, $f_\perp \in \mathcal H_\perp$
• Then $f(x_i) = \langle f_X + f_\perp, k(x_i, \cdot) \rangle_{\mathcal H} = f_X(x_i)$, while $\|f\|_{\mathcal H}^2 = \|f_X\|_{\mathcal H}^2 + \|f_\perp\|_{\mathcal H}^2$
• Minimizer needs $f_\perp = 0$, and so $\hat f = \sum_{i=1}^n \hat\alpha_i \, k(x_i, \cdot)$

Substituting, the objective becomes $\|K \alpha - y\|^2 + \lambda \, \alpha^\top K \alpha$. Setting the derivative to zero gives $2 K (K \alpha - y) + 2 \lambda K \alpha = 0$, satisfied by $\hat\alpha = (K + \lambda I)^{-1} y$.

## Other kernel algorithms

• The representer theorem applies to any objective $L(f(x_1), \dots, f(x_n)) + R(\|f\|_{\mathcal H})$ with $R$ strictly increasing
• Classification algorithms:
  • Support vector machines: $L$ is the hinge loss
  • Kernel logistic regression: $L$ is the logistic loss
• Principal component analysis, canonical correlation analysis
• Many, many more…
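As one example of the algorithms listed above, here is a hedged sketch of kernel PCA (names are ours): center the kernel matrix in feature space and take its top eigenvectors.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] - 2 * A @ B.T + np.sum(B**2, 1)[None, :]
    return np.exp(-d2 / (2 * sigma**2))

def kernel_pca(X, n_components=2, sigma=1.0):
    n = len(X)
    K = gaussian_kernel(X, X, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    Kc = H @ K @ H                       # center phi(x_i) in feature space
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]
    # projections of the training points onto the principal components
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))
Z = kernel_pca(X, n_components=2)
print(Z.shape)  # (40, 2)
```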

## Some theory

• If $k$ is universal, $\mathcal H$ can approximate any continuous function (uniformly on compact sets)
  • Equivalently: $\int k(x, \cdot) \, \mathrm d\mu(x) \ne 0$ for all nonzero finite signed measures $\mu$
  • True for Gaussian, many other common kernels (but no finite-dimensional ones!)
  • Norm $\|f\|_{\mathcal H}$ may go to $\infty$ as the approximation gets better
• If the RKHS norm is small, can learn quickly
  • e.g. the Rademacher complexity of $\{ f : \|f\|_{\mathcal H} \le B \}$ on $(x_1, \dots, x_n)$ is at most $\frac{B}{n} \sqrt{\sum_{i=1}^n k(x_i, x_i)}$

## Limitations of kernel-based learning

• Generally bad at learning sparsity
  • e.g. a function depending on only a few of the input coordinates, for large dimension $d$
• Provably slower than deep learning for a few problems
  • e.g. to learn a single ReLU, $x \mapsto \max(0, w^\top x)$, need RKHS norm exponential in $d$
  • Also some hierarchical problems, etc

## Relationship to deep learning

• Deep models usually end as $f(x) = \langle w, \phi_\theta(x) \rangle$ for a learned feature extractor $\phi_\theta$
• Can think of $\phi_\theta$ as the feature map of a learned kernel, $k_\theta(x, y) = \langle \phi_\theta(x), \phi_\theta(y) \rangle$
• Does this gain us anything?
  • Random nets with a trained last layer (NNGP) can be decent
  • As width $\to \infty$, nets become the neural tangent kernel (NTK)
• Widely used theoretical analysis
• SVMs with NTK can be great on small data
• Inspiration: learn the kernel model end-to-end
• Ongoing area; good results in two-sample testing, GANs, density estimation, meta-learning, semi-supervised learning, …

## Mean embeddings of distributions

• Represent a point $x$ as $\phi(x) = k(x, \cdot)$, so $f(x) = \langle f, \phi(x) \rangle_{\mathcal H}$
• Represent a distribution $P$ as $\mu_P = \mathbb E_{X \sim P}[\phi(X)]$, so $\mathbb E_{X \sim P}[f(X)] = \langle f, \mu_P \rangle_{\mathcal H}$
  • Last step assumed e.g. $\mathbb E_{X \sim P} \sqrt{k(X, X)} < \infty$
• Okay. Why?
• One reason: ML on distributions
• More common reason: comparing distributions

## Maximum Mean Discrepancy

$\operatorname{MMD}(P, Q) = \| \mu_P - \mu_Q \|_{\mathcal H} = \sup_{\|f\|_{\mathcal H} \le 1} \left[ \mathbb E_{X \sim P} f(X) - \mathbb E_{Y \sim Q} f(Y) \right]$

• The last form is an Integral Probability Metric (IPM)
• The maximizing $f$ is called the “witness function” or “critic”: high on $P$, low on $Q$
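A minimal sketch of estimating $\operatorname{MMD}^2(P, Q) = \mathbb E\, k(X, X') - 2\, \mathbb E\, k(X, Y) + \mathbb E\, k(Y, Y')$ from samples, using the simple biased (V-statistic) estimator with a Gaussian kernel; names are ours:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] - 2 * A @ B.T + np.sum(B**2, 1)[None, :]
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X and Y."""
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return Kxx.mean() - 2 * Kxy.mean() + Kyy.mean()

rng = np.random.default_rng(4)
X = rng.normal(0, 1, size=(500, 2))
Y_same = rng.normal(0, 1, size=(500, 2))
Y_diff = rng.normal(2, 1, size=(500, 2))

print(mmd2_biased(X, Y_same))  # near 0: same distribution
print(mmd2_biased(X, Y_diff))  # clearly positive: different distributions
```

The unbiased U-statistic version (drop the diagonal terms of $K_{xx}$ and $K_{yy}$) is what two-sample tests typically use.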

## More!

• Foundations: Berlinet and Thomas-Agnan, RKHS in Probability and Statistics
• Hardcore theoretical details: Steinwart and Christmann, Support Vector Machines
• Close connections to Gaussian processes
• Mean embeddings: survey by Muandet et al., Kernel Mean Embedding of Distributions: A Review and Beyond
• The practical sessions! Some pointers in there