In [1]:
%matplotlib inline

import numpy as np
from scipy import stats, optimize
from matplotlib import pyplot as plt

from sklearn import gaussian_process
from sklearn.gaussian_process import kernels

import pickle
from urllib.request import urlopen

Gaussian Processes: Part 2

Exercise 9.1: Modeling CO₂ levels

The amount of carbon dioxide in the atmosphere has been measured continuously at the Mauna Loa observatory, Hawaii. In this problem, you should use a Gaussian process to model the data from 1958 to 2003, and see how well that model can be used for predicting the data from 2004-2019. The observatory presents its latest data on its homepage, but for your convenience you can use the data in the format available here. You can load the data with the following code snippet:

In [2]:
# download data
data = pickle.load(
    urlopen("https://github.com/gpschool/labs/raw/2019/.resources/mauna_loa")
)

# extract observations and test data
x = data['X'].flatten()
y = data['Y'].flatten()
xtest = data['Xtest'].flatten()
ytest = data['Ytest'].flatten()

Here, x and y contain your training data.

Start by exploring some simple kernels; thereafter you may have a look at pages 118-122 of the book Gaussian Processes for Machine Learning for some inspiration on how to design a more bespoke kernel for this problem.

We create a GP regression model with an RBF kernel plus measurement noise and fit it to the data, tuning the hyperparameters in the process.

In [3]:
kernel = 100 * kernels.RBF(length_scale=100) + kernels.WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(kernel=kernel)
gp.fit(x[:, None], y)
print(gp.kernel_)
316**2 * RBF(length_scale=52.4) + WhiteKernel(noise_level=4.51)
/home/david/.venvs/course-apml-public/lib64/python3.10/site-packages/sklearn/gaussian_process/kernels.py:430: ConvergenceWarning: The optimal value found for dimension 0 of parameter k1__k1__constant_value is close to the specified upper bound 100000.0. Increasing the bound and calling fit again may find a better value.
  warnings.warn(
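
The warning indicates that the constant (signal-variance) factor of the kernel hit its default upper bound of 1e5 during optimization. If desired, one could construct the same kernel with explicitly widened bounds and refit; the particular bound values below are arbitrary choices, so this is only a sketch.

In [ ]:
# the same kernel, but with explicitly widened bounds so that the optimizer is
# not stopped at the default upper bound of 1e5 for the signal variance
# (the bound values chosen here are arbitrary)
kernel_wide = (
    kernels.ConstantKernel(constant_value=100.0, constant_value_bounds=(1e-3, 1e7))
    * kernels.RBF(length_scale=100.0, length_scale_bounds=(1e-2, 1e4))
    + kernels.WhiteKernel(noise_level=1.0)
)
gp_wide = gaussian_process.GaussianProcessRegressor(kernel=kernel_wide)
# gp_wide.fit(x[:, None], y)  # refit to compare with the bounded fit above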

We plot the posterior mean and credibility regions for the years 1957 to 2040.

In [4]:
# posterior mean and standard deviation of function values for test inputs
xs = np.linspace(1957, 2040, 200)
mu_post, std_post = gp.predict(xs[:, None], return_std=True)

# plot data, posterior mean and credibility regions
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs, mu_post + i*std_post, mu_post - i*std_post,
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y, s=1)
plt.scatter(xtest, ytest, s=1)
plt.show()
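
To quantify how well the model predicts the held-out 2004-2019 data, one could for instance compute the root-mean-square error of the posterior mean on the test set (a minimal sketch):

In [ ]:
# root-mean-square error of the posterior mean on the held-out test data
mu_test = gp.predict(xtest[:, None])
rmse = np.sqrt(np.mean((mu_test - ytest) ** 2))
print(f"test RMSE: {rmse:.3f}")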

More elaborate models are implemented, e.g., here and in exercise 7 here (albeit using another GP package).
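
As a rough illustration of what such a bespoke kernel could look like in scikit-learn, following the structure suggested in the book, one might combine a long-term trend, a slowly decaying yearly periodicity, medium-term irregularities, and a noise term. This is only a sketch: the initial hyperparameter values below are guesses, and optimizing such a kernel can be slow.

In [ ]:
# sketch of a composite kernel in the spirit of the Mauna Loa example in the book;
# the initial hyperparameter values are rough guesses, not tuned values
long_term = 50.0**2 * kernels.RBF(length_scale=50.0)                  # smooth long-term trend
seasonal = 2.0**2 * kernels.RBF(length_scale=100.0) * kernels.ExpSineSquared(
    length_scale=1.0, periodicity=1.0                                 # yearly periodicity, slowly decaying
)
medium_term = 0.5**2 * kernels.RationalQuadratic(length_scale=1.0, alpha=1.0)
noise = 0.1**2 * kernels.RBF(length_scale=0.1) + kernels.WhiteKernel(noise_level=0.1)

kernel_composite = long_term + seasonal + medium_term + noise
gp_composite = gaussian_process.GaussianProcessRegressor(kernel=kernel_composite, normalize_y=True)
# gp_composite.fit(x[:, None], y)  # fitting all hyperparameters of this kernel takes a while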

Exercise 9.2: Learning hyperparameters I

Until now, we have made GP regression using predefined hyperparameters, such as the lengthscale $\ell$ and noise variance $\sigma^2$. In this exercise, we will estimate $\ell$ and $\sigma^2$ from the data by maximizing the marginal likelihood. The logarithm of the marginal likelihood for a Gaussian process observed with Gaussian noise is

$$ \log p(\mathbf{y} \,|\, \mathbf{x}; \ell, \sigma_f^2, \sigma^2) = \log \mathcal{N}(\mathbf{y} \,|\, 0, \mathbf{K}_y) = -\frac{1}{2} \mathbf{y}^\mathsf{T} \mathbf{K}_y^{-1} \mathbf{y} - \frac{1}{2} \log{|\mathbf{K}_y|} - \frac{n}{2}\log{(2\pi)} $$

where $\mathbf{K}_y = \kappa(\mathbf{x}, \mathbf{x}) + \sigma^2 \mathbf{I}$.

(a)

Write a function that takes $\mathbf{x}$, $\mathbf{y}$, $\ell$, $\sigma_f^2$, and $\sigma^2$ as inputs and produces the logarithm of the marginal likelihood as output for the squared exponential covariance function.

In [5]:
# function that computes the kernel matrix κ(x1, x2) for the squared exponential kernel
def k(x1, x2, l, sf2):
    return sf2 * np.exp(-1/(2 * l**2) * np.abs(x1[:, None] - x2[None, :])**2)

def log_marginal_likelihood(x, y, l, sf2, s2):
    n = len(x)
    Ky = k(x, x, l, sf2) + s2 * np.eye(n)
    _, logdet = np.linalg.slogdet(Ky)
    return -(y.T@(np.linalg.solve(Ky, y)) + logdet + n * np.log(2 * np.pi)) / 2
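
As a quick sanity check of this implementation (not part of the exercise), one can compare it with scikit-learn's log_marginal_likelihood() for a fixed set of hyperparameters; the two values should agree up to the small jitter that scikit-learn adds to the diagonal of the kernel matrix. The data and hyperparameter values below are just example choices.

In [ ]:
# sanity check against scikit-learn for fixed hyperparameters (example values)
x_check = np.array([-4.0, -3.0, -1.0, 0.0, 2.0])
y_check = np.array([-2.0, 0.0, 1.0, 2.0, -1.0])
l_check, sf2_check, s2_check = 0.5, 1.0, 0.1

kernel_check = sf2_check * kernels.RBF(length_scale=l_check) + kernels.WhiteKernel(noise_level=s2_check)
gp_check = gaussian_process.GaussianProcessRegressor(kernel=kernel_check, optimizer=None)
gp_check.fit(x_check[:, None], y_check)

print(gp_check.log_marginal_likelihood())                                       # scikit-learn
print(log_marginal_likelihood(x_check, y_check, l_check, sf2_check, s2_check))  # our implementation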

(b)

Consider the same data as before. Use $\sigma_f^2 = 1$ and $\sigma^2 = 0$ and compute the logarithm of the marginal likelihood for values of $\ell$ between 0.1 and 1 and plot it. What seems to be the maximal value of the marginal likelihood on this interval? Do GP regression based on this value of $\ell$.

In [6]:
# observations
n = 5
x = np.array([-4, -3, -1, 0, 2])
y = np.array([-2, 0, 1, 2, -1])

# fixed hyperparameters
sf2 = 1
s2 = 0

# compute values of the marginal likelihood
ls = np.linspace(0.1, 1, 101)
logliks = [log_marginal_likelihood(x, y, l, sf2, s2) for l in ls]

# plot values
plt.plot(ls, logliks)
plt.show()

The marginal likelihood seems to be maximized for a value of $\ell$ around 0.6. We can determine the maximum more precisely with scipy.optimize.minimize. Since the length scale $\ell$ is positive, we optimize $\log \ell$.

In [7]:
result = optimize.minimize(
    lambda log_l: -log_marginal_likelihood(x, y, np.exp(log_l), sf2, s2),
    np.log(np.array([0.6])),
)
result
Out[7]:
      fun: 9.29788863452336
 hess_inv: array([[0.21693984]])
      jac: array([0.])
  message: 'Optimization terminated successfully.'
     nfev: 12
      nit: 4
     njev: 6
   status: 0
  success: True
        x: array([-0.46100059])

The optimizer finds $\log \ell \approx -0.461$, i.e. $\ell \approx 0.63$. We use this optimal value of $\ell$ in our GP regression:

In [8]:
# test inputs
xs = np.linspace(-6, 4, 101)

# compute kernel matrices, without noise (σ2 = 0)
l = np.exp(result.x[0])
sf2 = 1
Kx = k(x, x, l, sf2)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)

# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A

# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
    mean=mu_post, cov=K_post, size=25, random_state=1234
)

# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, without noise")
plt.show()

plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs,
        mu_post + i*np.sqrt(np.diag(K_post)),
        mu_post - i*np.sqrt(np.diag(K_post)),
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, without noise")
plt.show()

We can also optimize both $\ell$ and $\sigma^2$ (we still assume $\sigma_f^2 = 1$). Again we perform optimization in log space.

In [9]:
# optimize hyperparameters
result = optimize.minimize(
    lambda params: -log_marginal_likelihood(x, y, np.exp(params[0]), sf2, np.exp(params[1])),
    np.log(np.array([np.sqrt(2), 0.1])),
)
print(result.message)

# compute kernel matrices, with noise
l = np.exp(result.x[0])
sf2 = 1
s2 = np.exp(result.x[1])
Kx = k(x, x, l, sf2) + s2 * np.eye(len(x))
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)

# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A

# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
    mean=mu_post, cov=K_post, size=25, random_state=1234
)

# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, with noise σ2 = {s2:.3f}")
plt.show()

plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs,
        mu_post + i*np.sqrt(np.diag(K_post)),
        mu_post - i*np.sqrt(np.diag(K_post)),
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, with noise σ2 = {s2:.3f}")
plt.show()
Optimization terminated successfully.

Exercise 9.3: Learning hyperparameters II

In this exercise we investigate a setting where the marginal likelihood has multiple local optima.

(a)

Now, consider the following data

$$ \mathbf{x} = [-5, -3, 0, 0.1, 1, 4.9, 5]^\mathsf{T}, \qquad \mathbf{y} = [0, -0.5, 1, 0.7, 0, 1, 0.7]^\mathsf{T} $$

and compute the log marginal likelihood for both $\ell$ and $\sigma^2$. Use a logarithmic 2D grid for values of $\ell$ spanning from $10^{-1}$ to $10^2$ and for $\sigma^2$ spanning from $10^{-2}$ to $10^0$. Visualize the marginal likelihood on that grid with a contour plot.

In [10]:
# observations
n = 7
x = np.array([-5, -3, 0, 0.1, 1, 4.9, 5])
y = np.array([0, -0.5, 1, 0.7, 0, 1, 0.7])

# function that computes the kernel matrix κ(x1, x2) for the squared exponential kernel
def k(x1, x2, l, sf2):
    return sf2 * np.exp(-1/(2 * l**2) * np.abs(x1[:, None] - x2[None, :])**2)

def log_marginal_likelihood(x, y, l, sf2, s2):
    n = len(x)
    Ky = k(x, x, l, sf2) + s2 * np.eye(n)
    _, logdet = np.linalg.slogdet(Ky)
    return -(y.T@(np.linalg.solve(Ky, y)) + logdet + n * np.log(2 * np.pi)) / 2    

# fixed hyperparameter
sf2 = 1

# values of ℓ and σ2 for the plot (a narrower range than the grid stated in the exercise)
ls = np.logspace(-1, 1.1, 101)
s2s = np.logspace(-2, -0.5, 101)
logliks = [[log_marginal_likelihood(x, y, l, sf2, s2) for l in ls] for s2 in s2s]

# plot log marginal likelihood values
plt.contour(ls, s2s, np.array(logliks), 500, vmin=-7)
plt.colorbar()
plt.show()

(b)

Find the hyperparameters $\ell$ and $\sigma^2$ that correspond to the maximal marginal likelihood. Perform GP regression on the data using these hyperparameters.

In [11]:
# optimize hyperparameters
result = optimize.minimize(
    lambda params: -log_marginal_likelihood(x, y, np.exp(params[0]), sf2, np.exp(params[1])),
    np.log(np.array([10, 0.2])),
)
print(result.message)

# test inputs
xs = np.linspace(-6, 6, 101)

# compute kernel matrices, with noise
l = np.exp(result.x[0])
sf2 = 1
s2 = np.exp(result.x[1])
Kx = k(x, x, l, sf2) + s2 * np.eye(len(x))
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)

# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A

# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
    mean=mu_post, cov=K_post, size=25, random_state=1234
)

# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, with noise σ2 = {s2:.3f}")
plt.show()

plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs,
        mu_post + i*np.sqrt(np.diag(K_post)),
        mu_post - i*np.sqrt(np.diag(K_post)),
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, with noise σ2 = {s2:.3f}")
plt.show()
Optimization terminated successfully.

(d)

Perform GP regression for the hyperparameters that correspond to other possible local optima of the marginal likelihood. What differences do you see in your posterior?

In [12]:
# optimize hyperparameters
result = optimize.minimize(
    lambda params: -log_marginal_likelihood(x, y, np.exp(params[0]), sf2, np.exp(params[1])),
    np.log(np.array([1, 0.05])),
)
print(result.message)

# compute kernel matrices, with noise
l = np.exp(result.x[0])
sf2 = 1
s2 = np.exp(result.x[1])
Kx = k(x, x, l, sf2) + s2 * np.eye(len(x))
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)

# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A

# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
    mean=mu_post, cov=K_post, size=25, random_state=1234
)

# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, with noise σ2 = {s2:.3f}")
plt.show()

plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs,
        mu_post + i*np.sqrt(np.diag(K_post)),
        mu_post - i*np.sqrt(np.diag(K_post)),
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, with noise σ2 = {s2:.3f}")
plt.show()
Optimization terminated successfully.
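
To compare the two local optima directly, one can re-run the two optimizations from the same starting points as above and print the hyperparameters and the log marginal likelihood attained at each optimum (a minimal sketch):

In [ ]:
# compare the two local optima by restarting from the two initial guesses used above
for x0 in (np.log(np.array([10, 0.2])), np.log(np.array([1, 0.05]))):
    res = optimize.minimize(
        lambda params: -log_marginal_likelihood(x, y, np.exp(params[0]), sf2, np.exp(params[1])),
        x0,
    )
    l_opt, s2_opt = np.exp(res.x)
    print(f"l = {l_opt:.3f}, s2 = {s2_opt:.3f}, log marginal likelihood = {-res.fun:.3f}")

Qualitatively, the optimum with the longer lengthscale and larger noise variance yields a smooth posterior that attributes much of the variation in the data to noise, whereas the optimum with the shorter lengthscale and smaller noise variance yields a posterior that follows the observations more closely.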

Exercise 9.4: Learning hyperparameters III

We repeat Exercise 9.3, but this time we let scikit-learn's fit() function automatically estimate the $\ell$ and $\sigma^2$ that maximize the marginal likelihood. Consider the same data as in Exercise 9.3(a) and use, as before, an RBF kernel together with measurement noise.

(a)

You still have to provide an initial value of the hyperparameters. Try $\ell = 1$ and $\sigma^2 = 0.1$. What hyperparameters do you get when optimizing? Plot the corresponding mean and credibility regions.

We define the data

In [13]:
x = np.array([-5, -3, 0, 0.1, 1, 4.9, 5])
y = np.array([0, -0.5, 1, 0.7, 0, 1, 0.7])

and the GP regression model with an RBF kernel plus measurement noise, using initial hyperparameters $\ell = 1$ and $\sigma^2 = 0.1$:

In [14]:
kernel = kernels.RBF(length_scale=1) + kernels.WhiteKernel(noise_level=0.1)
gp = gaussian_process.GaussianProcessRegressor(
    kernel=kernel, random_state=100
)

When we fit the GP regression model to the observations with fit(), the hyperparameters $\ell$ and $\sigma^2$ are optimized automatically. We see that the optimized kernel gp.kernel_ of the model has a length scale $\ell = 1$ and a noise variance $\sigma^2 = 0.032$.

In [15]:
gp.fit(x[:, None], y)
print(gp.kernel_)
RBF(length_scale=1) + WhiteKernel(noise_level=0.032)

We plot the posterior mean and credibility regions of the function values for 101 equally spaced test inputs between -6 and 6.

In [16]:
# posterior mean and standard deviation of function values for test inputs
xs = np.linspace(-6, 6, 101)
mu_post, std_post = gp.predict(xs[:, None], return_std=True)

# plot data, posterior mean and credibility regions
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs, mu_post + i*std_post, mu_post - i*std_post,
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.show()

(b)

Try instead to initialize with $\ell = 10$ and $\sigma^2 = 1$. What do you get now?

We repeat the analysis in part (a) but initialize the kernel with hyperparameters $\ell = 10$ and $\sigma^2 = 1$.

In [17]:
# initialize kernel and GP regression model
kernel = kernels.RBF(length_scale=10) + kernels.WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    kernel=kernel, random_state=100
)

# fit model to observations and display optimized hyperparameters
gp.fit(x[:, None], y)
print(gp.kernel_)

# posterior mean and standard deviation of function values for test inputs
xs = np.linspace(-6, 6, 101)
mu_post, std_post = gp.predict(xs[:, None], return_std=True)

# plot data, posterior mean and credibility regions
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs, mu_post + i*std_post, mu_post - i*std_post,
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.show()
RBF(length_scale=10.7) + WhiteKernel(noise_level=0.222)

(c)

Try to explain what happens by making a grid over different hyperparameter values, and inspect the marginal likelihood for each point in that grid. The GaussianProcessRegressor class has a member function log_marginal_likelihood() which you may use. (Do not forget to turn off the hyperparameter optimization!)

We implement a function that computes the marginal likelihood for a given lengthscale $\ell$ and noise variance $\sigma^2$.

In [18]:
def marginal_likelihood(l, s2):
    # construct kernel with measurement noise
    kernel = kernels.RBF(length_scale=l) + kernels.WhiteKernel(noise_level=s2)

    # construct GP regression model (without hyperparameter tuning!)
    gp = gaussian_process.GaussianProcessRegressor(kernel=kernel, optimizer=None)
    
    # compute posterior GP and evaluate marginal likelihood
    gp.fit(x[:, None], y)
    return np.exp(gp.log_marginal_likelihood())

Using this function we compute and plot the marginal likelihood for a logarithmically spaced grid of lengthscales $\ell$ between 0.1 and 100 and noise variances $\sigma^2$ between 0.01 and 1.

In [19]:
# compute marginal likelihood for a grid of lengthscales and noise variances
ls = np.logspace(-1, 2, 51)
s2s = np.logspace(-2, 0, 51)
marginal_likelihoods = [[marginal_likelihood(l, s2) for l in ls] for s2 in s2s]

# plot marginal likelihoods
plt.contour(ls, s2s, marginal_likelihoods, levels=20)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("l")
plt.ylabel("s2")
plt.show()
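
To connect this plot with parts (a) and (b), one can overlay the two optimized hyperparameter pairs printed by gp.kernel_ above and check that they lie near two different local maxima of the marginal likelihood (a sketch using those printed values):

In [ ]:
# overlay the two optima found in (a) and (b), using the printed gp.kernel_ values
plt.contour(ls, s2s, marginal_likelihoods, levels=20)
plt.scatter([1.0, 10.7], [0.032, 0.222], color="red", marker="x", zorder=3)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("l")
plt.ylabel("s2")
plt.show()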