# Kernel Theory

## The Kernel Trick

Many machine learning and statistical learning algorithms, such as support vector machines and principal component analysis, are based on **inner products**. These methods can often be generalized through use of the **kernel trick** to create a nonlinear decision boundary without using an explicit mapping to another space.

The kernel trick makes use of **Mercer kernels**, which operate on vectors in the input space but can be expressed as inner products in another space. In other words, if $\mathcal{X}$ is the input vector space and $\kappa$ is the Mercer kernel function, then for some inner product space $\mathcal{V}$ there exists a function $\phi : \mathcal{X} \rightarrow \mathcal{V}$ such that:

$$\kappa(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle_{\mathcal{V}} \quad \text{for all } \mathbf{x}, \mathbf{y} \in \mathcal{X}$$

In machine learning, the vector space $\mathcal{V}$ is known as the feature space and the function $\phi$ is known as a feature map. A simple example of a feature map can be shown with the polynomial kernel:

$$\kappa(\mathbf{x}, \mathbf{y}) = \left( a\, \mathbf{x}^\top \mathbf{y} + c \right)^d, \qquad \mathbf{x}, \mathbf{y} \in \mathbb{R}^n$$

In our example, we will use $n = 2$, $d = 2$, $a = 1$ and $c = 0$. Substituting these values in, we get the following kernel function:

$$\kappa(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^\top \mathbf{y})^2 = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle$$

where the feature map $\phi : \mathbb{R}^2 \rightarrow \mathbb{R}^3$ is defined by:

$$\phi(\mathbf{x}) = \left( x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2 \right)$$
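As a sanity check, this identity can be verified numerically. The sketch below is a minimal example using NumPy; the names `poly_kernel` and `phi` are our own, not from any library. It evaluates the kernel in the input space and the inner product in the feature space:

```python
import numpy as np

def poly_kernel(x, y):
    """Polynomial kernel with n = 2, d = 2, a = 1, c = 0: k(x, y) = (x . y)^2."""
    return np.dot(x, y) ** 2

def phi(x):
    """Explicit feature map phi : R^2 -> R^3 for the kernel above."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Both quantities agree: the kernel computes the R^3 inner product
# without ever forming phi(x) explicitly.
print(poly_kernel(x, y))    # 1.0
print(phi(x) @ phi(y))      # 1.0
```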

The advantage of the implicit feature map is that data which is not linearly separable in the input space may become linearly separable in the implicit space.
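For instance, points labelled by whether they lie on a small or a large circle around the origin cannot be separated by a line in $\mathbb{R}^2$, but the feature map above sends the two classes to opposite sides of a plane. A minimal sketch, assuming NumPy and arbitrarily chosen radii:

```python
import numpy as np

# Two classes in R^2: points on a circle of radius 0.5 and radius 2.0.
# No straight line in the input space separates the classes.
theta = np.linspace(0, 2 * np.pi, 50, endpoint=False)
ring = np.column_stack([np.cos(theta), np.sin(theta)])
inner, outer = 0.5 * ring, 2.0 * ring

def phi(X):
    """Feature map (x1^2, sqrt(2) x1 x2, x2^2), applied row-wise."""
    return np.column_stack([X[:, 0] ** 2,
                            np.sqrt(2) * X[:, 0] * X[:, 1],
                            X[:, 1] ** 2])

# In feature space, z1 + z3 = x1^2 + x2^2 (the squared radius), so the
# plane z1 + z3 = 1 linearly separates the two classes.
print((phi(inner)[:, 0] + phi(inner)[:, 2] < 1).all())   # True (0.25 < 1)
print((phi(outer)[:, 0] + phi(outer)[:, 2] > 1).all())   # True (4.0 > 1)
```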

## Kernels

Kernel methods are a class of algorithms used for pattern analysis. These methods make use of **kernel** functions. A symmetric, real-valued kernel function $\kappa : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is said to be **positive definite** or **Mercer** if and only if:

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \kappa(\mathbf{x}_i, \mathbf{x}_j) \geq 0$$

for all $n \in \mathbb{N}$, $\{\mathbf{x}_1, \dots, \mathbf{x}_n\} \subseteq \mathcal{X}$ and $\{c_1, \dots, c_n\} \subseteq \mathbb{R}$. Similarly, a symmetric, real-valued kernel function is said to be **negative definite** if and only if:

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \kappa(\mathbf{x}_i, \mathbf{x}_j) \leq 0$$

for all $n \geq 2$, $\{\mathbf{x}_1, \dots, \mathbf{x}_n\} \subseteq \mathcal{X}$ and $\{c_1, \dots, c_n\} \subseteq \mathbb{R}$ with $\sum_{i=1}^{n} c_i = 0$. In the machine learning literature, **conditionally positive definite** kernels are often studied instead. This is simply a reversal of the above inequality, under the same constraint on the $c_i$. Trivially, every negative definite kernel can be transformed into a conditionally positive definite kernel by negation.
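On a finite sample, the positive definiteness condition says exactly that the Gram matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ is positive semi-definite. The sketch below (assuming NumPy; the Gaussian RBF kernel and bandwidth $\gamma = 1$ are illustrative choices) checks this via the eigenvalues of $K$:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel, a standard example of a Mercer kernel."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))   # 20 arbitrary points in R^3

# Gram matrix K[i, j] = kappa(x_i, x_j) on the sample.
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

# The Mercer condition (the double sum above being >= 0 for every c)
# is equivalent to K being positive semi-definite, i.e. having no
# negative eigenvalues (up to floating-point round-off).
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # True
```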

## Further Reading

- Berg C, Christensen JPR, Ressel P. 1984. *Harmonic Analysis on Semigroups*. Springer-Verlag New York. Chapter 3, General Results on Positive and Negative Definite Matrices and Kernels; p. 66-85.
- Bouboulis P. 2014. *Academic Press Library in Signal Processing, Volume 1: Array and Statistical Signal Processing (1st ed.)*. Academic Press. Chapter 17, Online Learning in Reproducing Kernel Hilbert Spaces; p. 883-987.
- Genton MG. 2002. *Classes of kernels for machine learning: a statistics perspective*. The Journal of Machine Learning Research. Volume 2 (March 2002), 299-312.
- Rasmussen C, Williams CKI. 2005. *Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning)*. The MIT Press. Chapter 4, Covariance Functions; p. 79-104.