%matplotlib inline
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn import gaussian_process
from sklearn.gaussian_process import kernels
In this exercise we will write the code needed to draw and plot samples of $f$ from a Gaussian process prior with squared exponential (or, equivalently, RBF) kernel
$$ f \sim \mathcal{GP}(m, \kappa), \qquad m(x) = 0 \text{ and } \kappa(x, x') = \sigma^2_f \exp{\Big(-\frac{1}{2l^2} {\|x - x'\|}^2\Big)}. $$
To implement this, we choose a vector of $m$ test input points $\mathbf{x}_*$. We will choose $\mathbf{x}_*$ to contain sufficiently many points, such that it will appear as a continuous line on the screen. We then evaluate the $m \times m$ covariance matrix $\kappa(\mathbf{x}_*, \mathbf{x}_*)$ and thereafter generate samples from the multivariate normal distribution
$$ f(\mathbf{x}_*) \sim \mathcal{N}\big(m(\mathbf{x}_*), \kappa(\mathbf{x}_*, \mathbf{x}_*)\big). $$
Use np.linspace to construct a vector $\mathbf{x}_*$ with $m = 101$ elements equally spaced from -5 to 5.
# test input points
m = 101
xs = np.linspace(-5, 5, m)
Construct a mean vector $m(\mathbf{x}_*)$ with 101 elements all equal to zero and the $101 \times 101$ covariance matrix $\kappa(\mathbf{x}_*, \mathbf{x}_*)$. The expression for $\kappa(\cdot, \cdot)$ is given above. Let the hyperparameters be $\ell^2 = 2$ and $\sigma^2_f = 1$.
# mean vector
mxs = np.zeros(m)
# covariance matrix
l = np.sqrt(2)
sf2 = 1
Kxs = sf2 * np.exp(-1/(2 * l**2) * np.abs(xs[:, None] - xs[None, :])**2)
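As an optional sanity check (using the kernels module imported above), the hand-built covariance matrix should match sklearn's RBF kernel, which also uses $\sigma_f^2 = 1$, evaluated on the same inputs:
# compare with sklearn's RBF kernel (it expects 2D inputs)
print(np.allclose(Kxs, kernels.RBF(length_scale=l)(xs[:, None])))  # expected: True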
Use stats.multivariate_normal from scipy (you might need to use the option allow_singular=True) to draw 25 samples $f^{(1)}(\mathbf{x}_*), \ldots, f^{(25)}(\mathbf{x}_*)$ from the multivariate normal distribution $f(\mathbf{x}_*) \sim \mathcal{N}\big(m(\mathbf{x}_*), \kappa(\mathbf{x}_*, \mathbf{x}_*)\big)$.
fs = stats.multivariate_normal.rvs(
mean=mxs, cov=Kxs, size=25, random_state=1234
)
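For reference, the result is a 25 × 101 array with one sampled function per row, which is why the transpose fs.T is plotted below.
# each row is one sample of f evaluated at the 101 test inputs
print(fs.shape)  # (25, 101)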
Plot the samples $f^{(1)}(\mathbf{x}_*), \ldots, f^{(25)}(\mathbf{x}_*)$ versus the input vector $\mathbf{x}_*$.
plt.plot(xs, fs.T)
plt.show()
Try another value of $\ell$ and repeat (b)-(d). How do the two plots differ, and why?
# samples with l = 1/2
l = 1/2
Kxs = sf2 * np.exp(-1/(2 * l**2) * np.abs(xs[:, None] - xs[None, :])**2)
fs = stats.multivariate_normal.rvs(
mean=mxs, cov=Kxs, size=25, random_state=1234
)
plt.plot(xs, fs.T)
plt.show()
With the smaller length scale $\ell = 1/2$, the samples vary much more rapidly: the covariance between nearby inputs decays faster, so function values at different inputs are less strongly correlated and the sampled functions are wigglier than for $\ell = \sqrt{2}$.
In this exercise we will perform Gaussian process regression. That is, based on the $n$ observations $\mathcal{D} = \{x_i, f(x_i)\}^n_{i=1}$ and the prior belief $f \sim \mathcal{GP}\big(0, \kappa(x, x')\big)$, we want to find the posterior $p(f | \mathcal{D})$. (In the previous problem, we were only concerned with the prior $p(f)$, not conditioned on having observed the data $\mathcal{D}$.) We consider the same Gaussian process prior (same mean $m(x)$, covariance function $\kappa(x, x')$, and hyperparameters) as in the previous exercise.
Construct two vectors $\mathbf{x} = [-4, -3, -1, 0, 2]^\mathsf{T}$ and $\mathbf{y} = [-2, 0, 1, 2, -1]^\mathsf{T}$, which will be our training data (that is, $n = 5$).
# observations
n = 5
x = np.array([-4, -3, -1, 0, 2])
y = np.array([-2, 0, 1, 2, -1])
Keep $\mathbf{x}_*$ as in the previous problem. In addition to the $m \times m$ matrix $\kappa(\mathbf{x}_*, \mathbf{x}_*)$, now also compute the $n \times m$ matrix $\kappa(\mathbf{x}, \mathbf{x}_*)$ and the $n \times n$ matrix $\kappa(\mathbf{x}, \mathbf{x})$.
Hint: You might find it useful to define a function that returns $\kappa(x, x')$, taking $x$ and $x'$ as arguments.
# function that computes the kernel matrix κ(x, x')
def k(xs, ys, l, sf2):
return sf2 * np.exp(-1/(2 * l**2) * np.abs(xs[:, None] - ys[None, :])**2)
# set hyperparameters
l = np.sqrt(2)
sf2 = 1
# set test inputs
m = 101
xs = np.linspace(-5, 5, m)
# compute kernel matrices
Kx = k(x, x, l, sf2)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)
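A quick (optional) shape check confirms the matrix dimensions stated above:
# κ(x, x) is n x n, κ(x*, x*) is m x m, κ(x, x*) is n x m
print(Kx.shape, Kxs.shape, Kxxs.shape)  # (5, 5) (101, 101) (5, 101)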
Use the training data $\mathbf{x}$, $\mathbf{y}$ and the matrices constructed in (b) to compute the posterior mean $\boldsymbol{\mu}_{\mathrm{posterior}}$ and the posterior covariance $\mathbf{K}_{\mathrm{posterior}}$ for $\mathbf{x}_*$, by using the equations for the conditional multivariate normal distribution.
The posterior mean $\boldsymbol{\mu}_{\mathrm{posterior}}$ for $\mathbf{x}_*$ is given by
$$ \boldsymbol{\mu}_{\mathrm{posterior}}(\mathbf{x}_*) = \kappa(\mathbf{x}, \mathbf{x}_*)^\mathsf{T} \kappa(\mathbf{x}, \mathbf{x})^{-1} \mathbf{y} = {\Big(\kappa(\mathbf{x}, \mathbf{x})^{-1} \kappa(\mathbf{x}, \mathbf{x}_*)\Big)}^\mathsf{T} \mathbf{y}, $$
and the posterior covariance $\mathbf{K}_{\mathrm{posterior}}$ for $\mathbf{x}_*$ by
$$ \mathbf{K}_{\mathrm{posterior}}(\mathbf{x}_*, \mathbf{x}_*) = \kappa(\mathbf{x}_*, \mathbf{x}_*) - \kappa(\mathbf{x}, \mathbf{x}_*)^{\mathsf{T}}\kappa(\mathbf{x}, \mathbf{x})^{-1} \kappa(\mathbf{x}, \mathbf{x}_*). $$
# solve κ(x, x) A = κ(x, x*) instead of forming the inverse explicitly
A = np.linalg.solve(Kx, Kxxs)
# posterior mean and covariance for the test inputs
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A
In a similar manner as in (c) and (d) in the previous problem, draw 25 samples from the multivariate normal distribution $f(\mathbf{x}_*) \sim \mathcal{N}(\boldsymbol{\mu}_{\mathrm{posterior}}, \mathbf{K}_{\mathrm{posterior}})$ and plot these samples ($f^{(j)}(\mathbf{x}_*)$ vs. $\mathbf{x}_*$) together with the posterior mean ($\boldsymbol{\mu}_{\mathrm{posterior}}$ vs. $\mathbf{x}_*$) and the actual measurements ($\mathbf{y}$ vs. $\mathbf{x}$). How do the samples in this plot differ from the prior samples in the previous problem?
# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
mean=mu_post, cov=K_post, size=25, random_state=1234
)
# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title("l = sqrt(2), no noise")
plt.show()
All posterior samples pass through the observed data points (the prior samples do not necessarily do that). This is natural since the posterior distribution of $f(\mathbf{x}_*)$ is conditioned on the observations and the model contains no measurement noise, so every posterior sample must pass through them exactly.
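As a quick numerical sanity check (reusing the kernel function and training data defined above), we can confirm that the noise-free posterior mean evaluated at the training inputs reproduces the observations exactly:
# noise-free posterior mean at the training inputs: K(x, x) K(x, x)^{-1} y = y
mu_train = k(x, x, l, sf2) @ np.linalg.solve(k(x, x, l, sf2), y)
print(np.allclose(mu_train, y))  # expected: True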
Instead of plotting samples, plot a credibility region. Here, a credibility region is based on the (marginal) posterior variance. The 68% credibility region, for example, is the area between $\boldsymbol{\mu}_{\mathrm{posterior}} - \sqrt{\mathbf{K}^d_{\mathrm{posterior}}}$ and $\boldsymbol{\mu}_{\mathrm{posterior}} + \sqrt{\mathbf{K}^d_{\mathrm{posterior}}}$, where $\mathbf{K}^d_{\mathrm{posterior}}$ is a vector with the diagonal elements of $\mathbf{K}_{\mathrm{posterior}}$. What is the connection between the credibility regions and the samples you drew previously?
At each test input, the 68% credibility region contains, on average, 68% of the posterior samples, since it is constructed from the marginal posterior standard deviation at that point.
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
# plot credibility regions (1, 2, and 3 standard deviations)
plt.fill_between(
xs,
mu_post + i*np.sqrt(np.diag(K_post)),
mu_post - i*np.sqrt(np.diag(K_post)),
color=str(0.4 + 0.15*i), # '0' => black, '1' => white
)
plt.scatter(x, y)
plt.title("l = sqrt(2), no noise")
plt.show()
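As a rough empirical check of this connection (a sketch reusing the samples fs and the posterior computed above), we count the fraction of sampled function values that fall within one posterior standard deviation of the posterior mean:
# marginal posterior standard deviations (clipped to avoid tiny negative values from round-off)
std_post = np.sqrt(np.clip(np.diag(K_post), 0.0, None))
# skip the training inputs, where the posterior variance is (numerically) zero
mask = std_post > 1e-6
inside = np.abs(fs[:, mask] - mu_post[mask]) <= std_post[mask]
print(inside.mean())  # should be roughly 0.68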
Now, consider the setting where the measurements are corrupted with noise, $y_i = f(\mathbf{x}_i) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. Use $\sigma^2 = 0.1$ and repeat (c)-(e) with this modification of the model. What is the difference in comparison to the previous plot? What is the interpretation?
Now the posterior mean $\boldsymbol{\mu}_{\mathrm{posterior}}$ for $\mathbf{x}_*$ is given by
$$ \boldsymbol{\mu}_{\mathrm{posterior}}(\mathbf{x}_*) = \kappa(\mathbf{x}, \mathbf{x}_*)^\mathsf{T} [\kappa(\mathbf{x}, \mathbf{x}) + \sigma^2 I]^{-1} \mathbf{y} = {\Big([\kappa(\mathbf{x}, \mathbf{x}) + \sigma^2 I]^{-1} \kappa(\mathbf{x}, \mathbf{x}_*)\Big)}^\mathsf{T} \mathbf{y}, $$
and the posterior covariance $\mathbf{K}_{\mathrm{posterior}}$ for $\mathbf{x}_*$ by
$$ \mathbf{K}_{\mathrm{posterior}}(\mathbf{x}_*, \mathbf{x}_*) = \kappa(\mathbf{x}_*, \mathbf{x}_*) - \kappa(\mathbf{x}, \mathbf{x}_*)^{\mathsf{T}}[\kappa(\mathbf{x}, \mathbf{x}) + \sigma^2 I]^{-1} \kappa(\mathbf{x}, \mathbf{x}_*). $$
# compute kernel matrices, with noise added to the training covariance
s2 = 0.1
Kx = k(x, x, l, sf2) + s2 * np.eye(n)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)
# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A
# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
mean=mu_post, cov=K_post, size=25, random_state=1234
)
# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title("l = sqrt(2), with noise")
plt.show()
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
# plot credibility regions (1, 2, and 3 standard deviations)
plt.fill_between(
xs,
mu_post + i*np.sqrt(np.diag(K_post)),
mu_post - i*np.sqrt(np.diag(K_post)),
color=str(0.4 + 0.15*i), # '0' => black, '1' => white
)
plt.scatter(x, y)
plt.title("l = sqrt(2), with noise")
plt.show()
The posterior distribution does not pass exactly through all observations anymore, but the observations are now to some extent "explained" as noise.
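A small numerical illustration of this (reusing Kx, which includes the noise term): the posterior mean evaluated at the training inputs no longer reproduces $\mathbf{y}$ exactly.
# posterior mean at the training inputs: K(x, x) [K(x, x) + σ²I]^{-1} y
mu_train = k(x, x, l, sf2) @ np.linalg.solve(Kx, y)
print(np.abs(mu_train - y))  # nonzero residuals: part of each observation is attributed to noise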
Explore what happens with another length scale $\ell$.
# use length scale l = 1/2
l = 1/2
# compute kernel matrices, with noise
s2 = 0.1
Kx = k(x, x, l, sf2) + s2 * np.eye(n)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)
# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A
# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
mean=mu_post, cov=K_post, size=25, random_state=1234
)
# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title("l = 0.5, with noise")
plt.show()
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
# plot credibility regions (1, 2, and 3 standard deviations)
plt.fill_between(
xs,
mu_post + i*np.sqrt(np.diag(K_post)),
mu_post - i*np.sqrt(np.diag(K_post)),
color=str(0.4 + 0.15*i), # '0' => black, '1' => white
)
plt.scatter(x, y)
plt.title("l = 0.5, with noise")
plt.show()
The squared exponential kernel/covariance function gives samples that are smooth (infinitely differentiable). Other kernels make other assumptions. Now try the previous problems using the exponential kernel,
$$ \kappa(x, x') = \sigma^2_f \exp{\bigg(-\frac{1}{l} \|x - x' \|\bigg)}, $$
and the Matérn kernel with parameter $\nu = 3/2$,
$$ \kappa(x, x') = \sigma^2_f \bigg(1 + \frac{\sqrt{3}}{l} \|x - x' \|\bigg) \exp{\bigg(-\frac{\sqrt{3}}{l} \|x - x' \|\bigg)}. $$
For the exponential and the Matérn kernel, we define functions that compute the kernel matrix:
def k_exp(xs, ys, l, sf2):
return sf2 * np.exp(-1/l * np.abs(xs[:, None] - ys[None, :]))
def k_matern(xs, ys, l, sf2):
distmat = np.sqrt(3)/l * np.abs(xs[:, None] - ys[None, :])
return sf2 * (1 + distmat) * np.exp(-distmat)
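As an optional cross-check (not needed for the exercise), both functions can be compared against sklearn's Matern kernel: $\nu = 3/2$ matches k_matern, and $\nu = 1/2$ reduces to the exponential kernel k_exp.
# compare with sklearn's Matern kernel on the test inputs (2D input expected)
l_check = np.sqrt(2)
X_check = xs[:, None]
print(np.allclose(k_matern(xs, xs, l_check, 1), kernels.Matern(length_scale=l_check, nu=1.5)(X_check)))  # expected: True
print(np.allclose(k_exp(xs, xs, l_check, 1), kernels.Matern(length_scale=l_check, nu=0.5)(X_check)))  # expected: True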
We repeat the analysis above for both kernels:
def plot_gp_posterior(k, l=np.sqrt(2), noise=False):
# compute kernel matrices, no noise
sf2 = 1
if noise:
s2 = 0.1
Kx = k(x, x, l, sf2) + s2 * np.eye(n)
else:
Kx = k(x, x, l, sf2)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)
# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A
# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
mean=mu_post, cov=K_post, size=25, random_state=1234
)
# create grid of plots
fig, ax = plt.subplots(2, 1)
# plot samples, posterior mean and observations
ax[0].plot(xs, fs.T, color='0.6')
ax[0].plot(xs, mu_post)
ax[0].scatter(x, y)
ax[1].plot(xs, mu_post)
for j in [3, 2, 1]:
# plot credibility regions (1, 2, and 3 standard deviations)
ax[1].fill_between(
xs,
mu_post + j*np.sqrt(np.diag(K_post)),
mu_post - j*np.sqrt(np.diag(K_post)),
color=str(0.4 + 0.15*j), # '0' => black, '1' => white
)
ax[1].scatter(x, y)
return fig, ax
# list of kernels and corresponding names
kernel_dict = { "exponential": k_exp, "Matérn": k_matern }
# l=sqrt(2), no noise
for (name, k) in kernel_dict.items():
fig, ax = plot_gp_posterior(k)
fig.suptitle("l = sqrt(2), no noise\n({} kernel)".format(name))
plt.show()
# l=sqrt(2), with noise
for (name, k) in kernel_dict.items():
fig, ax = plot_gp_posterior(k, noise=True)
fig.suptitle("l = sqrt(2), with noise\n({} kernel)".format(name))
plt.show()
# l=0.5, with noise
for (name, k) in kernel_dict.items():
fig, ax = plot_gp_posterior(k, l=0.5, noise=True)
fig.suptitle("l = 0.5, with noise\n({} kernel)".format(name))
plt.show()
Repeat Exercise 8.2 using GaussianProcessRegressor from sklearn.gaussian_process, only plotting the credibility regions based on the posterior predictive variance (not drawing samples).
Some useful hints:
As is common in sklearn, you first have to construct an object from the model class (in this case GaussianProcessRegressor), and thereafter train it on data by using its member function fit().
To obtain predictions, use the member function predict(). You will have to pass either return_std=True or return_cov=True to the latter in order to obtain information about the posterior predictive variance.
The kernels are available in sklearn.gaussian_process.kernels, where the squared exponential/RBF kernel is available as RBF.
By default, fit() automatically optimizes the hyperparameters. To turn that feature off, you have to pass the argument optimizer=None to GaussianProcessRegressor.
Measurement noise can be modelled by adding a WhiteKernel to the kernel.
We start by defining the observations from Exercise 8.2:
x = np.array([-4, -3, -1, 0, 2])
y = np.array([-2, 0, 1, 2, -1])
We define the test inputs as in Exercise 8.2:
xs = np.linspace(-5, 5, 101)
We define an RBF kernel with length scale $\ell = \sqrt{2}$ (i.e., $\ell^2 = 2$).
Note that RBF uses $\sigma_f^2 = 1$; if you want to scale the kernel, you would have to multiply it with a ConstantKernel.
l = np.sqrt(2)
kernel = kernels.RBF(length_scale=l)
We construct a Gaussian process with this RBF kernel. We specify optimizer=None to avoid any hyperparameter tuning and random_state to ensure that the results are reproducible. Note that GaussianProcessRegressor always uses a Gaussian process with zero mean.
gp = gaussian_process.GaussianProcessRegressor(
kernel=kernel, optimizer=None, random_state=100
)
We update the Gaussian process regression model by fitting it to the observations, i.e., we compute the posterior Gaussian process. fit() expects a matrix of observations $\mathbf{x}$, and therefore we reshape x.
gp.fit(x[:, None], y)
We compute the mean and the standard deviation of the normally distributed function values for the test inputs.
mu_post, std_post = gp.predict(xs[:, None], return_std=True)
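As an optional cross-check, the sklearn posterior mean should agree with the manual computation from Exercise 8.2 up to the small diagonal jitter (alpha=1e-10) that GaussianProcessRegressor adds by default:
# manual noise-free posterior mean with sklearn's RBF kernel (cf. Exercise 8.2)
rbf = kernels.RBF(length_scale=l)
Kx_manual = rbf(x[:, None])                 # n x n training covariance
Kxxs_manual = rbf(x[:, None], xs[:, None])  # n x m cross-covariance
mu_manual = Kxxs_manual.T @ np.linalg.solve(Kx_manual, y)
print(np.max(np.abs(mu_post - mu_manual)))  # should be very small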
We plot the data and the posterior distribution for the test inputs.
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
# plot credibility regions (1, 2, and 3 standard deviations)
plt.fill_between(
xs, mu_post + i*std_post, mu_post - i*std_post,
color=str(0.4 + 0.15*i), # '0' => black, '1' => white
)
plt.scatter(x, y)
plt.title("l = sqrt(2), no noise")
plt.show()
We can add Gaussian measurement noise with variance $\sigma^2 = 0.1$ by constructing the kernel as a sum of an RBF kernel and a WhiteKernel.
l = np.sqrt(2)
s2 = 0.1
kernel = kernels.RBF(length_scale=l) + kernels.WhiteKernel(noise_level=s2)
We perform the same analysis again:
# construct Gaussian process (without hyperparameter tuning)
gp = gaussian_process.GaussianProcessRegressor(
kernel=kernel, optimizer=None, random_state=100
)
# fit model to observations
gp.fit(x[:, None], y)
# posterior mean and standard deviation of function values for test inputs
mu_post, std_post = gp.predict(xs[:, None], return_std=True)
# plot data, posterior mean and credibility regions
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
# plot credibility regions (1, 2, and 3 standard deviations)
plt.fill_between(
xs, mu_post + i*std_post, mu_post - i*std_post,
color=str(0.4 + 0.15*i), # '0' => black, '1' => white
)
plt.scatter(x, y)
plt.title("l = sqrt(2), with noise")
plt.show()
We repeat the study with length scale $\ell = 1/2$.
# construct kernel
l = 1/2
s2 = 0.1
kernel = kernels.RBF(length_scale=l) + kernels.WhiteKernel(noise_level=s2)
# construct GP regression model (without hyperparameter tuning)
gp = gaussian_process.GaussianProcessRegressor(
kernel=kernel, optimizer=None, random_state=100
)
# fit model to observations
gp.fit(x[:, None], y)
# posterior mean and standard deviation of function values for test inputs
mu_post, std_post = gp.predict(xs[:, None], return_std=True)
# plot data, posterior mean and credibility regions
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
# plot credibility regions (1, 2, and 3 standard deviations)
plt.fill_between(
xs, mu_post + i*std_post, mu_post - i*std_post,
color=str(0.4 + 0.15*i), # '0' => black, '1' => white
)
plt.scatter(x, y)
plt.title("l = 0.5, with noise")
plt.show()