%matplotlib inline
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn import gaussian_process
from sklearn.gaussian_process import kernels
In this exercise we will write the code needed to draw and plot samples of $f$ from a Gaussian process prior with squared exponential (or, equivalently, RBF) kernel
$$ f \sim \mathcal{GP}(m, \kappa), \qquad m(x) = 0 \text{ and } \kappa(x, x') = \sigma^2_f \exp{\Big(-\frac{1}{2l^2} {\|x - x'\|}^2\Big)}. $$
To implement this, we choose a vector of $m$ test input points $\mathbf{x}_*$. We will choose $\mathbf{x}_*$ to contain sufficiently many points, such that it will appear as a continuous line on the screen. We then evaluate the $m \times m$ covariance matrix $\kappa(\mathbf{x}_*, \mathbf{x}_*)$ and thereafter generate samples from the multivariate normal distribution
$$ f(\mathbf{x}_*) \sim \mathcal{N}\big(m(\mathbf{x}_*), \kappa(\mathbf{x}_*, \mathbf{x}_*)\big). $$
Use np.linspace to construct a vector $\mathbf{x}_*$ with $m = 101$ elements equally spaced from -5 to 5.
# test input points
m = 101
xs = np.linspace(-5, 5, m)
Construct a mean vector $m(\mathbf{x}_*)$ with 101 elements all equal to zero and the $101 \times 101$ covariance matrix $\kappa(\mathbf{x}_*, \mathbf{x}_*)$. The expression for $\kappa(\cdot, \cdot)$ is given above. Let the hyperparameters be $\ell^2 = 2$ and $\sigma^2_f = 1$.
# mean vector
mxs = np.zeros(m)
# covariance matrix
l = np.sqrt(2)
sf2 = 1
Kxs = sf2 * np.exp(-1/(2 * l**2) * np.abs(xs[:, None] - xs[None, :])**2)
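As an optional sanity check (using the kernels module imported above), the hand-built covariance matrix should match sklearn's RBF kernel, which also uses $\sigma_f^2 = 1$, evaluated on the same inputs:
# compare with sklearn's RBF kernel (it expects 2D inputs)
print(np.allclose(Kxs, kernels.RBF(length_scale=l)(xs[:, None])))  # expected: True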
Use stats.multivariate_normal from scipy (you might need to use the option allow_singular=True) to draw 25 samples $f^{(1)}(\mathbf{x}_*), \ldots, f^{(25)}(\mathbf{x}_*)$ from the multivariate normal distribution $f(\mathbf{x}_*) \sim \mathcal{N}\big(m(\mathbf{x}_*), \kappa(\mathbf{x}_*, \mathbf{x}_*)\big)$.
fs = stats.multivariate_normal.rvs(
mean=mxs, cov=Kxs, size=25, random_state=1234
)
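For reference, the result is a 25 × 101 array with one sampled function per row, which is why the transpose fs.T is plotted below.
# each row is one sample of f evaluated at the 101 test inputs
print(fs.shape)  # (25, 101)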
Plot the samples $f^{(1)}(\mathbf{x}_*), \ldots, f^{(25)}(\mathbf{x}_*)$ versus the input vector $\mathbf{x}_*$.
plt.plot(xs, fs.T)
plt.show()
Try another value of $\ell$ and repeat (b)-(d). How do the two plots differ, and why?
# samples with l = 1/2
l = 1/2
Kxs = sf2 * np.exp(-1/(2 * l**2) * np.abs(xs[:, None] - xs[None, :])**2)
fs = stats.multivariate_normal.rvs(
mean=mxs, cov=Kxs, size=25, random_state=1234
)
plt.plot(xs, fs.T)
plt.show()
With the smaller length scale $\ell = 1/2$, the samples vary much more rapidly: the covariance between nearby inputs decays faster, so function values at different inputs are less strongly correlated and the sampled functions are wigglier than for $\ell = \sqrt{2}$.
In this exercise we will perform Gaussian process regression. That is, based on the $n$ observations $\mathcal{D} = \{x_i, f(x_i)\}^n_{i=1}$ and the prior belief $f \sim \mathcal{GP}\big(0, \kappa(x, x')\big)$, we want to find the posterior $p(f | \mathcal{D})$. (In the previous problem, we were only concerned with the prior $p(f)$, not conditioned on having observed the data $\mathcal{D}$.) We consider the same Gaussian process prior (same mean $m(x)$, covariance function $\kappa(x, x')$, and hyperparameters) as in the previous exercise.
Construct two vectors $\mathbf{x} = [-4, -3, -1, 0, 2]^\mathsf{T}$ and $\mathbf{y} = [-2, 0, 1, 2, -1]^\mathsf{T}$, which will be our training data (that is, $n = 5$).
# observations
n = 5
x = np.array([-4, -3, -1, 0, 2])
y = np.array([-2, 0, 1, 2, -1])
Keep $\mathbf{x}_*$ as in the previous problem. In addition to the $m \times m$ matrix $\kappa(\mathbf{x}_*, \mathbf{x}_*)$, now also compute the $n \times m$ matrix $\kappa(\mathbf{x}, \mathbf{x}_*)$ and the $n \times n$ matrix $\kappa(\mathbf{x}, \mathbf{x})$.
Hint: You might find it useful to define a function that returns $\kappa(x, x')$, taking $x$ and $x'$ as arguments.
# function that computes the kernel matrix κ(x, x')
def k(xs, ys, l, sf2):
return sf2 * np.exp(-1/(2 * l**2) * np.abs(xs[:, None] - ys[None, :])**2)
# set hyperparameters
l = np.sqrt(2)
sf2 = 1
# set test inputs
m = 101
xs = np.linspace(-5, 5, m)
# compute kernel matrices
Kx = k(x, x, l, sf2)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)
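A quick (optional) shape check confirms the matrix dimensions stated above:
# κ(x, x) is n x n, κ(x*, x*) is m x m, κ(x, x*) is n x m
print(Kx.shape, Kxs.shape, Kxxs.shape)  # (5, 5) (101, 101) (5, 101)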
Use the training data $\mathbf{x}$, $\mathbf{y}$ and the matrices constructed in (b) to compute the posterior mean $\boldsymbol{\mu}_{\mathrm{posterior}}$ and the posterior covariance $\mathbf{K}_{\mathrm{posterior}}$ for $\mathbf{x}_*$, by using the equations for the conditional multivariate normal distribution.
The posterior mean $\boldsymbol{\mu}_{\mathrm{posterior}}$ for $\mathbf{x}_*$ is given by
$$ \boldsymbol{\mu}_{\mathrm{posterior}}(\mathbf{x}_*) = \kappa(\mathbf{x}, \mathbf{x}_*)^\mathsf{T} \kappa(\mathbf{x}, \mathbf{x})^{-1} \mathbf{y} = {\Big(\kappa(\mathbf{x}, \mathbf{x})^{-1} \kappa(\mathbf{x}, \mathbf{x}_*)\Big)}^\mathsf{T} \mathbf{y}, $$
and the posterior covariance $\mathbf{K}_{\mathrm{posterior}}$ for $\mathbf{x}_*$ by
$$ \mathbf{K}_{\mathrm{posterior}}(\mathbf{x}_*, \mathbf{x}_*) = \kappa(\mathbf{x}_*, \mathbf{x}_*) - \kappa(\mathbf{x}, \mathbf{x}_*)^{\mathsf{T}}\kappa(\mathbf{x}, \mathbf{x})^{-1} \kappa(\mathbf{x}, \mathbf{x}_*). $$
# solve κ(x, x) A = κ(x, x*) instead of forming the inverse explicitly
A = np.linalg.solve(Kx, Kxxs)
# posterior mean and covariance for the test inputs
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A
In a similar manner as in (c) and (d) in the previous problem, draw 25 samples from the multivariate normal distribution $f(\mathbf{x}_*) \sim \mathcal{N}(\boldsymbol{\mu}_{\mathrm{posterior}}, \mathbf{K}_{\mathrm{posterior}})$ and plot these samples ($f^{(j)}(\mathbf{x}_*)$ vs. $\mathbf{x}_*$) together with the posterior mean ($\boldsymbol{\mu}_{\mathrm{posterior}}$ vs. $\mathbf{x}_*$) and the actual measurements ($\mathbf{y}$ vs. $\mathbf{x}$). How do the samples in this plot differ from the prior samples in the previous problem?
# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
mean=mu_post, cov=K_post, size=25, random_state=1234
)
# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title("l = sqrt(2), no noise")
plt.show()
All posterior samples pass through the observed data points (the prior samples do not necessarily do that). This is natural since the posterior distribution of $f(\mathbf{x}_*)$ is conditioned on the observations and the model contains no measurement noise, so every posterior sample must pass through them exactly.
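As a quick numerical sanity check (reusing the kernel function and training data defined above), we can confirm that the noise-free posterior mean evaluated at the training inputs reproduces the observations exactly:
# noise-free posterior mean at the training inputs: K(x, x) K(x, x)^{-1} y = y
mu_train = k(x, x, l, sf2) @ np.linalg.solve(k(x, x, l, sf2), y)
print(np.allclose(mu_train, y))  # expected: True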
Instead of plotting samples, plot a credibility region. Here, a credibility region is based on the (marginal) posterior variance. The 68% credibility region, for example, is the area between $\boldsymbol{\mu}_{\mathrm{posterior}} - \sqrt{\mathbf{K}^d_{\mathrm{posterior}}}$ and $\boldsymbol{\mu}_{\mathrm{posterior}} + \sqrt{\mathbf{K}^d_{\mathrm{posterior}}}$, where $\mathbf{K}^d_{\mathrm{posterior}}$ is a vector with the diagonal elements of $\mathbf{K}_{\mathrm{posterior}}$. What is the connection between the credibility regions and the samples you drew previously?
At each test input, the 68% credibility region contains, on average, 68% of the posterior samples, since it is constructed from the marginal posterior standard deviation at that point.
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
# plot credibility regions (1, 2, and 3 standard deviations)
plt.fill_between(
xs,
mu_post + i*np.sqrt(np.diag(K_post)),
mu_post - i*np.sqrt(np.diag(K_post)),
color=str(0.4 + 0.15*i), # '0' => black, '1' => white
)
plt.scatter(x, y)
plt.title("l = sqrt(2), no noise")
plt.show()
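As a rough empirical check of this connection (a sketch reusing the samples fs and the posterior computed above), we count the fraction of sampled function values that fall within one posterior standard deviation of the posterior mean:
# marginal posterior standard deviations (clipped to avoid tiny negative values from round-off)
std_post = np.sqrt(np.clip(np.diag(K_post), 0.0, None))
# skip the training inputs, where the posterior variance is (numerically) zero
mask = std_post > 1e-6
inside = np.abs(fs[:, mask] - mu_post[mask]) <= std_post[mask]
print(inside.mean())  # should be roughly 0.68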
Now, consider the setting where the measurements are corrupted with noise, $y_i = f(\mathbf{x}_i) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. Use $\sigma^2 = 0.1$ and repeat (c)-(e) with this modification of the model. What is the difference in comparison to the previous plot? What is the interpretation?
Now the posterior mean $\boldsymbol{\mu}_{\mathrm{posterior}}$ for $\mathbf{x}_*$ is given by
$$ \boldsymbol{\mu}_{\mathrm{posterior}}(\mathbf{x}_*) = \kappa(\mathbf{x}, \mathbf{x}_*)^\mathsf{T} [\kappa(\mathbf{x}, \mathbf{x}) + \sigma^2 I]^{-1} \mathbf{y} = {\Big([\kappa(\mathbf{x}, \mathbf{x}) + \sigma^2 I]^{-1} \kappa(\mathbf{x}, \mathbf{x}_*)\Big)}^\mathsf{T} \mathbf{y}, $$
and the posterior covariance $\mathbf{K}_{\mathrm{posterior}}$ for $\mathbf{x}_*$ by
$$ \mathbf{K}_{\mathrm{posterior}}(\mathbf{x}_*, \mathbf{x}_*) = \kappa(\mathbf{x}_*, \mathbf{x}_*) - \kappa(\mathbf{x}, \mathbf{x}_*)^{\mathsf{T}}[\kappa(\mathbf{x}, \mathbf{x}) + \sigma^2 I]^{-1} \kappa(\mathbf{x}, \mathbf{x}_*). $$
# compute kernel matrices, with noise added to the training covariance
s2 = 0.1
Kx = k(x, x, l, sf2) + s2 * np.eye(n)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)
# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A
# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
mean=mu_post, cov=K_post, size=25, random_state=1234
)
# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title("l = sqrt(2), with noise")
plt.show()
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
# plot credibility regions (1, 2, and 3 standard deviations)
plt.fill_between(
xs,
mu_post + i*np.sqrt(np.diag(K_post)),
mu_post - i*np.sqrt(np.diag(K_post)),
color=str(0.4 + 0.15*i), # '0' => black, '1' => white
)
plt.scatter(x, y)
plt.title("l = sqrt(2), with noise")
plt.show()
The posterior distribution does not pass exactly through all observations anymore, but the observations are now to some extent "explained" as noise.
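A small numerical illustration of this (reusing Kx, which includes the noise term): the posterior mean evaluated at the training inputs no longer reproduces $\mathbf{y}$ exactly.
# posterior mean at the training inputs: K(x, x) [K(x, x) + σ²I]^{-1} y
mu_train = k(x, x, l, sf2) @ np.linalg.solve(Kx, y)
print(np.abs(mu_train - y))  # nonzero residuals: part of each observation is attributed to noise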
Explore what happens with another length scale $\ell$.
# use length scale l = 1/2
l = 1/2
# compute kernel matrices, with noise
s2 = 0.1
Kx = k(x, x, l, sf2) + s2 * np.eye(n)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)
# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A
# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
mean=mu_post, cov=K_post, size=25, random_state=1234
)
# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title("l = 0.5, with noise")
plt.show()
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
# plot credibility regions (1, 2, and 3 standard deviations)
plt.fill_between(
xs,
mu_post + i*np.sqrt(np.diag(K_post)),
mu_post - i*np.sqrt(np.diag(K_post)),
color=str(0.4 + 0.15*i), # '0' => black, '1' => white
)
plt.scatter(x, y)
plt.title("l = 0.5, with noise")
plt.show()
The squared exponential kernel/covariance function gives samples that are smooth (infinitely differentiable). Other kernels make other assumptions. Now try the previous problems using the exponential kernel,
$$ \kappa(x, x') = \sigma^2_f \exp{\bigg(-\frac{1}{l} \|x - x' \|\bigg)}, $$
and the Matérn kernel with parameter $\nu = 3/2$,
$$ \kappa(x, x') = \sigma^2_f \bigg(1 + \frac{\sqrt{3}}{l} \|x - x' \|\bigg) \exp{\bigg(-\frac{\sqrt{3}}{l} \|x - x' \|\bigg)}. $$
For the exponential and the Matérn kernel, we define functions that compute the kernel matrix:
def k_exp(xs, ys, l, sf2):
return sf2 * np.exp(-1/l * np.abs(xs[:, None] - ys[None, :]))
def k_matern(xs, ys, l, sf2):
distmat = np.sqrt(3)/l * np.abs(xs[:, None] - ys[None, :])
return sf2 * (1 + distmat) * np.exp(-distmat)
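As an optional cross-check (not needed for the exercise), both functions can be compared against sklearn's Matern kernel: $\nu = 3/2$ matches k_matern, and $\nu = 1/2$ reduces to the exponential kernel k_exp.
# compare with sklearn's Matern kernel on the test inputs (2D input expected)
l_check = np.sqrt(2)
X_check = xs[:, None]
print(np.allclose(k_matern(xs, xs, l_check, 1), kernels.Matern(length_scale=l_check, nu=1.5)(X_check)))  # expected: True
print(np.allclose(k_exp(xs, xs, l_check, 1), kernels.Matern(length_scale=l_check, nu=0.5)(X_check)))  # expected: True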
We repeat the analysis above for both kernels:
def plot_gp_posterior(k, l=np.sqrt(2), noise=False):
# compute kernel matrices, no noise
sf2 = 1
if noise:
s2 = 0.1
Kx = k(x, x, l, sf2) + s2 * np.eye(n)
else:
Kx = k(x, x, l, sf2)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)
# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A
# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
mean=mu_post, cov=K_post, size=25, random_state=1234
)
# create grid of plots
fig, ax = plt.subplots(2, 1)
# plot samples, posterior mean and observations
ax[0].plot(xs, fs.T, color='0.6')
ax[0].plot(xs, mu_post)
ax[0].scatter(x, y)
ax[1].plot(xs, mu_post)
for j in [3, 2, 1]:
# plot credibility regions (1, 2, and 3 standard deviations)
ax[1].fill_between(
xs,
mu_post + j*np.sqrt(np.diag(K_post)),
mu_post - j*np.sqrt(np.diag(K_post)),
color=str(0.4 + 0.15*j), # '0' => black, '1' => white
)
ax[1].scatter(x, y)
return fig, ax
# list of kernels and corresponding names
kernel_dict = { "exponential": k_exp, "Matérn": k_matern }
# l=sqrt(2), no noise
for (name, k) in kernel_dict.items():
fig, ax = plot_gp_posterior(k)
fig.suptitle("l = sqrt(2), no noise\n({} kernel)".format(name))
plt.show()
# l=sqrt(2), with noise
for (name, k) in kernel_dict.items():
fig, ax = plot_gp_posterior(k, noise=True)
fig.suptitle("l = sqrt(2), with noise\n({} kernel)".format(name))
plt.show()
# l=0.5, with noise
for (name, k) in kernel_dict.items():
fig, ax = plot_gp_posterior(k, l=0.5, noise=True)
fig.suptitle("l = 0.5, with noise\n({} kernel)".format(name))
plt.show()
Repeat Exercise 8.2 using GaussianProcessRegressor from sklearn.gaussian_process, only plotting the credibility regions based on the posterior predictive variance (not drawing samples).
Some useful hints:
As is common in sklearn, you first have to construct an object from the model class (in this case GaussianProcessRegressor), and thereafter train it on data by using its member function fit().
To obtain predictions, use the member function predict(). You will have to pass either return_std=True or return_cov=True to the latter in order to obtain information about the posterior predictive variance.
The kernels are available in sklearn.gaussian_process.kernels, where the squared exponential/RBF kernel is available as RBF.
By default, fit() automatically optimizes the hyperparameters. To turn that feature off, you have to pass the argument optimizer=None to GaussianProcessRegressor.
Measurement noise can be modelled by adding a WhiteKernel to the kernel.
We start by defining the observations from Exercise 8.2:
x = np.array([-4, -3, -1, 0, 2])
y = np.array([-2, 0, 1, 2, -1])
We define the test inputs as in Exercise 8.2:
xs = np.linspace(-5, 5, 101)
We define an RBF kernel with length scale $\ell = \sqrt{2}$ (i.e., $\ell^2 = 2$).
Note that RBF uses $\sigma_f^2 = 1$; if you want to scale the kernel, you would have to multiply it with a ConstantKernel.
l = np.sqrt(2)
kernel = kernels.RBF(length_scale=l)
We construct a Gaussian process with this RBF kernel. We specify optimizer=None to avoid any hyperparameter tuning and random_state to ensure that the results are reproducible. Note that GaussianProcessRegressor always uses a Gaussian process with zero mean.
gp = gaussian_process.GaussianProcessRegressor(
kernel=kernel, optimizer=None, random_state=100
)
We update the Gaussian process regression model by fitting it to the observations, i.e., we compute the posterior Gaussian process. fit() expects a matrix of observations $\mathbf{x}$, and therefore we reshape x.
gp.fit(x[:, None], y)
We compute the mean and the standard deviation of the normally distributed function values for the test inputs.
mu_post, std_post = gp.predict(xs[:, None], return_std=True)
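As an optional cross-check, the sklearn posterior mean should agree with the manual computation from Exercise 8.2 up to the small diagonal jitter (alpha=1e-10) that GaussianProcessRegressor adds by default:
# manual noise-free posterior mean with sklearn's RBF kernel (cf. Exercise 8.2)
rbf = kernels.RBF(length_scale=l)
Kx_manual = rbf(x[:, None])                 # n x n training covariance
Kxxs_manual = rbf(x[:, None], xs[:, None])  # n x m cross-covariance
mu_manual = Kxxs_manual.T @ np.linalg.solve(Kx_manual, y)
print(np.max(np.abs(mu_post - mu_manual)))  # should be very small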
We plot the data and the posterior distribution for the test inputs.
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
# plot credibility regions (1, 2, and 3 standard deviations)
plt.fill_between(
xs, mu_post + i*std_post, mu_post - i*std_post,
color=str(0.4 + 0.15*i), # '0' => black, '1' => white
)
plt.scatter(x, y)
plt.title("l = sqrt(2), no noise")
plt.show()
We can add Gaussian measurement noise with variance $\sigma^2 = 0.1$ by constructing the kernel as a sum of an RBF kernel and a WhiteKernel.
l = np.sqrt(2)
s2 = 0.1
kernel = kernels.RBF(length_scale=l) + kernels.WhiteKernel(noise_level=s2)
We perform the same analysis again:
# construct Gaussian process (without hyperparameter tuning)
gp = gaussian_process.GaussianProcessRegressor(
kernel=kernel, optimizer=None, random_state=100
)
# fit model to observations
gp.fit(x[:, None], y)
# posterior mean and standard deviation of function values for test inputs
mu_post, std_post = gp.predict(xs[:, None], return_std=True)
# plot data, posterior mean and credibility regions
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
# plot credibility regions (1, 2, and 3 standard deviations)
plt.fill_between(
xs, mu_post + i*std_post, mu_post - i*std_post,
color=str(0.4 + 0.15*i), # '0' => black, '1' => white
)
plt.scatter(x, y)
plt.title("l = sqrt(2), with noise")
plt.show()
We repeat the study with length scale $\ell = 1/2$.
# construct kernel
l = 1/2
s2 = 0.1
kernel = kernels.RBF(length_scale=l) + kernels.WhiteKernel(noise_level=s2)
# construct GP regression model (without hyperparameter tuning)
gp = gaussian_process.GaussianProcessRegressor(
kernel=kernel, optimizer=None, random_state=100
)
# fit model to observations
gp.fit(x[:, None], y)
# posterior mean and standard deviation of function values for test inputs
mu_post, std_post = gp.predict(xs[:, None], return_std=True)
# plot data, posterior mean and credibility regions
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
# plot credibility regions (1, 2, and 3 standard deviations)
plt.fill_between(
xs, mu_post + i*std_post, mu_post - i*std_post,
color=str(0.4 + 0.15*i), # '0' => black, '1' => white
)
plt.scatter(x, y)
plt.title("l = 0.5, with noise")
plt.show()