I took a graduate class in machine learning last fall. In the past I've taken a fair number of courses that taught various topics and algorithms in artificial intelligence and machine learning, but this was the first graduate level course I've taken. In terms of content, it went over many of the same algorithms I've seen before. In terms of rigor, it went above and beyond in trying to hammer in the theoretical underpinnings of many of those algorithms while at the same time making us somewhat struggle to implement them.

During the first half of the course, we looked long and hard at support vector machines (SVMs) and kernel functions. Though I was familiar with them, I learned some pretty amazing intuitions on why they work so well.

In this post, I want to share the derivation behind one fairly amazing fact: SVMs can tractably work with a data set in an infinite dimensional space when using an appropriate kernel. Before the derivation, I'll give some definitions.

## Basis functions

For our purposes, a basis function is any function $$\phi$$ that takes a data point $$x$$ and returns a vector with real numbered entries. Formally, the function signature is $\phi: \mathcal{X} \to \mathbb{R}^{m}$ where $$\mathcal{X}$$ is the set of all possible data points, and $$m$$ is the dimensionality of the vector. The value of $$\phi(x)$$ serves as a representation that we can work with in subsequent steps. It is important that the representation encodes enough information to associate the data point with others of the same class and distinguish it from other data points from a different class. Imagine a basis function that outputted the same vector for every data point regardless of the features that data point exhibits. It would be useless because we've lost all the information that would've been otherwise useful in discriminating between different classes.

Discovering a good basis function is in itself an interesting question. They can be manually handcrafted using domain knowledge or automatically learned with techniques like neural networks. With SVMs, basis functions are not explicitly computed for each data point, but instead implicitly used during training and classification. The next section will explain what I mean.

## Kernel functions

In the context of SVMs, a kernel function computes the dot product of two data points in the range of the basis function. Again, more formally, a kernel $$K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$$ computes $K(x,z) = \left\langle \phi(x), \phi(z) \right\rangle$ Kernels are useful because we can compute the dot product without having to explicitly compute the representation of both data points and then compute the dot product. During training and classification, the SVM algorithm only touches the data when it needs to perform a dot product between two data points. We take advantage of this fact by simply using a kernel wherever we see a dot product, and capitalize on the improvements in efficiency.

Consider an example where every data point is simply a scalar, and our basis function is $\phi(x) = \begin{bmatrix} 1 \\ x \\ x^{2} \end{bmatrix}$ An explicit computation of the inner product is $K(x,z) = \left\langle \phi(x), \phi(z) \right\rangle = x^{2}z^{2} + xz + 1$ If addition and multiplication took 1 "step" each, the explicit computation takes 6 steps (treating the "$$+1$$" as a constant) on a single process. An implicit computation, however, is more efficient and computes $K(x,z) = (xz + 1)^{2}$ This expression takes only 3 steps. If I expanded the expression, it would be $K(x,z) = (xz + 1)^{2} = x^{2}z^{2} + 2xz + 1$ which is like the explicit expression save for the coefficient on the $$xz$$ term. The point of the example is that we worked in a polynomial space in both the explicit and implicit cases, but the implicit version led to a performance boost. The example worked in a degree-2 polynomial space,A degree-$$d$$ polynomial space would use $$K(x,z) = (xz + 1)^{d}$$ as its kernel, known as a $$d$$-polynomial kernel. but if the degree was much larger, the gains in performance would be more profound.

Now for the main question: can we work in a space of infinite dimensionality? That is, is there a kernel that is equivalent to using a basis function that maps each data point to an infinitely long vector, perhaps uncountably infinite? Explicit computation is impossible because we can't hold such a vector in memory,Not always true! There are ways to "hold" certain infinite sequences in memory, but practically speaking it's not something done when dealing with real world data. so the performance gain will be huge, to put it lightly. But it can be done implicitly using a special kernel we'll derive next.

## Derivation of a Gaussian-based kernel

Our data is in the space $$\mathbb{R}^{n}$$ (i.e. $$\mathcal{X} \subseteq \mathbb{R}^{n}$$). Let's define a basis function by first defining each entry individually. Let an entry be defined as $\phi_{\sigma,w}(x) = \mathcal{N}(w|x,\sigma^{2}I)$ The right hand side is a multivariate Gaussian distribution with mean $$x$$ and covariance matrix $$\sigma^{2}I$$ evaluated at some $$w$$. With this, we can imagine the following basis function: $\phi_{\sigma}(x) = [\phi_{\sigma,w}(x)]_{w \in \mathbb{R}^{n}}$ which is a vectorAnother interpretation: transform $$x$$ into a Gaussian distribution where $$x$$ is the mean. of the Gaussian evaluated at all points $$w$$. Note that this basis function is parameterized on $$\sigma$$.

Clearly this is an infinitely long vector. We can't evaluate and use it at present. However, we can determine a closed form that bypasses this. Let's compute the dot product between two data points $$x$$ and $$z$$. Because the basis function is defined over a continuous range, we'll do the computation using an integral over $$\mathbb{R}^{n}$$: \begin{align} \left\langle \phi_{\sigma}(x), \phi_{\sigma}(z) \right\rangle &= \int_{\mathbb{R}^{n}} \phi_{\sigma,w}(x) \phi_{\sigma,w}(z) dw \\ &= \int_{\mathbb{R}^{n}} \mathcal{N}(w|x,\sigma^{2}I) \mathcal{N}(w|z,\sigma^{2}I) dw \end{align} As is, it's an intractable computation. To proceed, we use the following algebraic trick with Gaussians: $\mathcal{N}(c|a,A) \cdot \mathcal{N}(c|b,B) = \mathcal{N}(a|b,A+B) \cdot \mathcal{N}(c|d,D)$ where \begin{align} D &= (A^{-1} + B^{-1})^{-1} \\ d &= D(A^{-1}a + B^{-1}b) \end{align} Armed with this, we continue: \begin{align} \int_{\mathbb{R}^{n}} \mathcal{N}(w|x,\sigma^{2}I) \mathcal{N}(w|z,\sigma^{2}I) dw &= \int_{\mathbb{R}^{n}} \mathcal{N}(x|z,2\sigma^{2}I) \mathcal{N}(w|d,D) dw \\ &= \mathcal{N}(x|z,2\sigma^{2}I) \int_{\mathbb{R}^{n}} \mathcal{N}(w|d,D) dw \\ &= \mathcal{N}(x|z,2\sigma^{2}I) \end{align} Wow. Using the trick, we moved all terms with $$w$$ into one Gaussian and factored out the other. The integral disappeared because at that point it was just a full interval integration of a probability distribution, which is always 1. We didn't even give two flying functions what $$d$$ or $$D$$ were. To finish, we just need to write out the final Gaussian and drop some constants: \begin{align} \mathcal{N}(x|z,2\sigma^{2}I) &= \frac{1}{\sigma \sqrt{4\pi}} \exp\left(-\frac{\lVert x-z \rVert^{2}}{4\sigma^{2}}\right) \\ &\propto \exp\left(-\frac{\lVert x-z \rVert^{2}}{2\sigma^{2}}\right) \end{align} The last expression is known as the radial basis function kernel (or RBF kernel for short), and it's a powerful (and very popular) kernel used with SVMs. Like the basis function we began with, the kernel is parameterized on $$\sigma$$, called the bandwidth of the kernel. The final equation is $K_{\sigma}(x,z) = \exp\left(-\frac{\lVert x-z \rVert^{2}}{2\sigma^{2}}\right)$

And there you have it. There's more to kernels than what's been written here. For one thing, SVMs aren't the only machine learning algorithms that use kernels. This short page shows how kernels can be used in the $$k$$-Nearest Neighbors algorithm.

Even with all the (well-deserved) praise for neural networks, there's little theory as to why they work so well. With SVMs, the theory is so well formulated that I just find it hard to not be amazed by it!