Before I start any name-dropping Hilbert spaces and Lagrangian multipliers, I feel the urge to mention that I wrote this blog post, not ChatGPT. AI is a valuable tool for many things, from boilerplate code generation to serving as my personal tutor. However, even beyond environmental and social concerns (I spend all day inside and couldn’t possibly care less about the actual state of the world), it’s important to get some brain reps in actually thinking for yourself and talking with other humans. Just like how walking around outside is important despite the existence of the internal combustion engine.

Why kernels?

Some ML problems have tons of datapoints available and you’ll get the best modeling performance by throwing a black-box function approximator at the problem. In a business data science-type context, you might have a ton of datapoints, which you can feed into XGBoost and get a ton of predictive performance.

These black box function approximation models generally need a lot of data to get great performance out of, though. Sometimes, you don’t have a lot of data, but you have a lot of human-type knowledge about the process you’re trying to model. Feature engineering is one answer: if you’re trying to model hourly temperature, you can add a new feature based on a sine function, stretched to fit a 24-hour period. This sort of thing, combined with XGBoost, is a recipe that’s won a lot of Kaggle competitions.

Other times, though, that kind of feature engineering won’t work, and all your prior knowledge about what you’re modeling is that it’s too complicated for manual feature engineering (like that basis change, or interaction terms for multivariate data) but is smooth. What you’d want is a model that learns a smooth but arbitrary curve (remember, XGBoost is decision trees, and learns complicated, jagged, piecewise constant/axis-aligned curves) arbitrarily about the datapoints.

You can do this with kernels. Today, we’ll be making a model that can accurately model a complicated regression function, using a bit of clever linear algebra and some 20 lines of Python.

What’s a kernel? What models can we use kernels with?

A kernel is a function that returns some kind of similarity between two things. It can be nonlinear, and it can operate on vectors (tabular datapoints), strings, or even graphs. You obviously use them in machine learning algorithms. They have a few caveats, but they’re really as flexible and powerful as they sound.

Every machine learning model can use the input datapoints themselves (duh). However, some machine learning models’ training objectives (and inference processes) can be formulated to use the inner (dot) products between all pairs of datapoints $\langle x_i, x_j \rangle$, without paying any attention to the datapoints themselves.

This isn’t at all useful on its own, but if you can mathematically formulate a model’s training objective to use only those inner products, you can replace $\langle x_i, x_j \rangle$ with $\text{kernel}(x_i, x_j)$, which can be any nonlinear kernel function you want, not just a simple (linear) inner product (which restricts you to a linear trendline, decision boundary, or mapping).

While a simple example is the linear kernel (which returns the inner product between two vectors, which isn’t interesting or illustrative at all), there are kernel functions that implicitly project the data into even infinite-dimensional space, in which you can draw trendlines or decision boundaries. These, back in feature space, correspond with arbitrarily complex smooth curves around data. You can’t do that with manual feature engineering/basis changes.

Okay, you actually can draw “arbitrary smooth” curves with things like splines or random Fourier features, too, but they work because they approximate kernels!

The only constraints are that the kernel must be symmetric (meaning that the similarity between $x_i$ and $x_j$ must be the similarity between $x_j$ and $x_i$), and the matrix $K \in \mathbb{R}^{n \times n}$ of pairwise (between the training datapoints) kernel evaluations must not have negative eigenvalues. This just makes sure that the kernel function’s evaluation corresponds with a valid inner product space where a vector’s squared length isn’t negative.

This is called the “kernel trick,” and it allows you to make a linear model that’s still able to draw a nonlinear trendline: you get expressive, nonlinear function approximation without using black-box function approximator-type models. The most popular kernel machine people use is probably a support vector machine (which is a classifier similar to a logistic regression); it actually has performance on MNIST just behind the huge convolutional neural networks, but that performance can be achieved with a few seconds of training time on a laptop– no GPUs necessary. You can kernelize a bunch of learning algorithms across tasks: there are kernelizable objectives for PCA and K-means clustering as well as SVMs. Today, we’ll be deriving, re-implementing, and benchmarking a kernelized learning algorithm for regression, called kernel ridge regression.

(You can also have a kernel between mathematical objects that aren’t vectors. I have another blog post locked and loaded where we use kernels to do data science on raw molecules.)

So, to recap, we can apply the kernel trick to any model whose training objective can be written as $XX^T$ (which is a $n \times n$ matrix of inner products (a linear operator) between data points), which we’ll replace with $K$, which is a $n \times n$ matrix of nonlinear kernel evaluations between datapoints. Now, let’s derive us a kernel regression learning algorithm, which is called kernel ridge regression.

Review: What is ridge regression?

Linear regression is a problem of trying to find “the best” trendline to fit your data. There are different answers of “the best,” though: if you want to minimize the squared error between your trendline and datapoints, it’s solving this problem, called “least squares” (all norms are Euclidean, and all opinions are my own):

\[\mathcal{L}_{\text{linear regression}} = \lVert \text{y\_train} - X\mathbf{w} \rVert^2\]

Ridge regression is similar, but it puts a penalty (a regularizer) on the squared sum of all the features’ weights. In its regular form here, its solution can help with overfitting, but the regularizer term will be important to deriving a kernelizable (is that a word?) formulation, as I’ll show in a second. It’s the solution to a very similar problem:

\[\mathcal{L}_{\text{ridge regression}} = \lVert \text{y\_train} - X\mathbf{w} \rVert^2 + \lambda \lVert \mathbf{w} \rVert^2\]

Both of these problems can be solved in closed-form by setting the derivative of $\mathcal{L}$ to 0, which lets you use (fast) matrix operations (in NumPy, for instance) instead of slower iterative processes like gradient descent variations.

Deriving kernelizable $XX^T$ terms in the ridge regression formulas

Remember, our goal is to derive a closed-form solution for $\mathbf{w}$ in $\mathcal{L}_{\text{ridge regression}}$, where every occurrence of $X$ is inside of $XX^T$, which we can replace with $K$ (whose entries are the kernel function between datapoints). I’ll try and keep this intuition-heavy and derivation-light, but this will involve a little bit of linear algebra. Here we go.

The key insight in deriving a form of $\lVert \text{y_train} - X\mathbf{w} \rVert^2 + \lambda \lVert \mathbf{w} \rVert^2$ that involves our $K = XX^T$ term is that the optimal $\mathbf{w}$ is a linear combination of training datapoints $X$. This is because if it wasn’t, it’d would have an entry orthogonal to the span of our datapoints. That means that an orthogonal entry in $\mathbf{w}$ wouldn’t affect the model’s predictions. That means that it couldn’t increase model’s accuracy, but it would increase the value of $\lambda \lVert \mathbf{w} \rVert^2$, so it couldn’t possibly be the optimal solution.

That line of reasoning allows us to write $\mathbf{w}$ as a linear combination of $X$:

\[\mathbf{w} = X^T \alpha\]

So we can rewrite the first term in $\mathcal{L}_{\text{ridge regression}}$ to be in our desired $XX^T$ form:

\[\lVert y - X \mathbf{w} \rVert^2 = \lVert y - XX^T \alpha \rVert^2\]

We also know that the squared norm of $\mathbf{w}$ is $\mathbf{w}^T \mathbf{w}$ from the definition of a norm, so we can rewrite the second term to also be in our desired form:

\[\lambda \lVert \mathbf{w} \rVert^2 = \lambda \mathbf{w}T^ \mathbf{w} = \lambda (X^T \alpha)^T X^T \alpha\] \[= \lambda \alpha^T XX^T \alpha\]

With those two things in mind, substituting $\mathbf{w} = X^T \alpha$ into every occurrence of $\mathbf{w}$ in $\mathcal{L}_{\text{ridge regression}}$ gives us:

\[\mathcal{L}_{\text{kernelizable ridgel}} = \lVert y - XX^T \alpha \rVert^2 + \lambda \alpha^T XX^T \alpha\] \[= \lVert y - K \alpha \rVert^2 + \lambda \alpha^T K \alpha\]

(This is why we used ridge instead of least squares or lasso regression: without that specific squared norm in the objective, we couldn’t write $\mathbf{w}$ as a linear combination of $X$, so we this derivation wouldn’t work.)

Then, with a little bit of calculus, you can find

\[\frac{\partial \mathcal{L}_{\text{kernelizable ridgel}}}{\partial \alpha} = -2K(y-K\alpha) + 2\lambda K\alpha\]

and set it to 0, which lets you easily solve for $\alpha$ that minimizes it:

\[\alpha = (\lambda I + K)^-1 y\]

Inference

$\alpha$ is our weight vector, but for how heavily we weight training datapoints (in terms of their kernel evaluations with test datapoints), rather than feature values. Inference is pretty simple: just build a kernel matrix $K_{\text{test}}$ between the training and test datapoints, and scale it by $\alpha$:

\[\mathbf{\hat{y}}_{\text{test}} = K_{\text{test}}^T \alpha\]

Kernel Ridge Implementation

The math was the hard part: the implementation is literally the formulas, translated into NumPy. There are code-level optimizations you can use to vectorize filling out the entries of K, of course, but the naive implementation is good enough for our purposes. I’ll also include a unit test to make sure it gets the same result as sklearn’s implementation.

import numpy as np
import scipy
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
from xgboost import XGBRegressor
import time
class KRR(sklearn.base.BaseEstimator, sklearn.base.RegressorMixin):
    def __init__(self, kernel_func=None, regularization_lambda=1.0):
        assert kernel_func is not None
        self.regularization_lambda = regularization_lambda
        self.kernel_func = kernel_func

    def fit(self, X, y):
        self.X = X
        n = X.shape[0]
        self.K = np.zeros((n, n))
        for i in range(n): 
            for j in range(n): 
                self.K[i, j] = self.kernel_func(X[i], X[j])
        self.alpha = np.linalg.inv(self.K + self.regularization_lambda * np.identity(n)) @ y
        return self

    def predict(self, X): # X is matrix of test points
        n_test = X.shape[0]
        n_train = self.K.shape[0]
        K_test = np.zeros((n_test, n_train))
        for i in range(n_test):
            for j in range(n_train):
                K_test[i, j] = self.kernel_func(X[i], self.X[j])
        return K_test @ self.alpha

Quick smoke test to make sure it gets close to a library implementation:

x = np.arange(0, 1, 0.01)
y = np.sin(x * 10)

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, random_state=69)
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)

krr_args = {
    'kernel': 'rbf',
    'gamma': 30.0,
}

sklearn_krr_preds = sklearn.kernel_ridge.KernelRidge(**krr_args).fit(x_train, y_train).predict(x_test)

rbf_gamma_30 = lambda x1, x2: np.exp(-30.0 * np.sum((x1 - x2)**2))
my_krr_preds = KRR(rbf_gamma_30, regularization_lambda=1.0).fit(x_train, y_train).predict(x_test)

np.allclose(sklearn_krr_preds, my_krr_preds)
True

KRR IRL Performance

I mentioned earlier that some datasets aren’t big enough for XGBoost to take advantage of, and the data-generating process is too complicated for manual feature engineering to beat intelligently choosing a learning algorithm with suitable inductive bias. An example of this is the NASA Airfoil Self-Noise dataset, which is a regression (duh!) problem trying to predict the amount of noise a wing generates in a wind tunnel. Modeling this can help engineers learn more about how to make wings, and offers a nice change of pace compared to Kaggle’s glut of churn prediction and other boring “shareholder value” problems.

# !pip install ucimlrepo
from ucimlrepo import fetch_ucirepo 
airfoil_self_noise = fetch_ucirepo(id=291) 
X = airfoil_self_noise.data.features 
y = airfoil_self_noise.data.targets 
# print(airfoil_self_noise.metadata) 
# print(airfoil_self_noise.variables) 
print(X.shape)
# X
(1503, 5)

All I’ll do for preprocessing and feature engineering (since there’s no missing features) is center the data and standardize it. We’ll use a RBF kernel, and tune In my experience, this has been important for kernel learning algorithms.

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.25, random_state=69)
scaler = sklearn.preprocessing.MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test) # data leakage if you scale all the X values before splitting into train and test sets!

I tuned hyperparameters off-camera, to avoid bloating up this blog post. The optimal hyperparameters came to me in a dream.

By dream I mean a quick random search that I stopped after a few minutes. I did all this locally, on an aging ThinkPad, so you would probably have better hyperparameters (and me too, if it wasn’t time for my dinner). I rounded to the closest integer for the same reason my truck’s volume control is always a multiple of 5.

You can trust me that I tuned them on the training data, though, to again avoid data leakage.

rbf_gamma = lambda x1, x2: np.exp(-1.0 * np.sum((x1 - x2)**2))
krr = KRR(kernel_func=rbf_gamma, regularization_lambda=0.1)
krr_rmse = sklearn.metrics.root_mean_squared_error(krr.fit(X_train, y_train).predict(X_test), y_test)
krr_rmse
3.6731030022588897
xgb_rmse = sklearn.metrics.root_mean_squared_error(XGBRegressor(
    eta=2.0,
    max_depth=4,
).fit(X_train, y_train).predict(X_test), y_test)
xgb_rmse
4.387397766113281

Conclusion

I’m not surprised Kernel Ridge was more performant than XGBoost, possibly for the inductive bias reasons I mentioned. What surprised me most was how fast XGBoost is. I suppose that’s because thousands and thousands of man-hours and many lines of optimized code have been put into its implementation, and a few minutes were spent on mine, optimized for readability and congruence with my derivation. Thanks for reading!