In [1]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import scipy.io as spio

Bayesian linear regression

In this exercise session we consider the supervised regression problem of finding a function $f(x)$ that describes the relationship between a scalar input $x$ and a scalar output $y$:

$$ y = f(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \beta^{-1}). $$

We model this with a Bayesian linear regression model

$$ f(x) = \boldsymbol{\phi}(x)^{\mathsf{T}} \mathbf{w}, \qquad \mathbf{w} \sim \mathcal{N}(\mathbf{m}_0, \mathbf{S}_0), $$

where $\boldsymbol{\phi}(x)$ is a vector of input features. Note that the lecture used the notation $\mathbf{x}$ for the input features; here we write $\boldsymbol{\phi}(x)$ instead so as not to confuse it with the scalar input $x$.

The Bayesian linear regression model is then given by

$$ \begin{aligned} p(\mathbf{y} \,|\, \mathbf{w}) &= \mathcal{N}(\mathbf{y}; \boldsymbol{\Phi}\mathbf{w}, \beta^{-1}\mathbf{I}_N) \qquad && \text{(likelihood)}, \\ p(\mathbf{w}) &= \mathcal{N}(\mathbf{w}; \mathbf{m}_0, \mathbf{S}_0) \qquad && \text{(prior)}, \end{aligned} $$

where

$$ \boldsymbol{\Phi} = \begin{bmatrix} \boldsymbol{\phi}(x_1)^{\mathsf{T}} \\ \vdots \\ \boldsymbol{\phi}(x_N)^{\mathsf{T}} \end{bmatrix} \qquad \text{and} \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}. $$

Given a set of training data of inputs and outputs $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$, we are interested in finding the posterior of the weights $p(\mathbf{w} \,|\, \mathbf{y})$ and the predictive distribution $p(f(x_{\star}) \,|\, \mathbf{y})$ at an unseen input $x_{\star}$. For further information about the Bayesian linear regression model, see Lecture 3 and/or Christopher Bishop's book "Pattern Recognition and Machine Learning".
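
For reference, and consistent with the code used below (see also Bishop, Chapter 3), the posterior and the (noise-free) predictive distribution have the closed forms

$$ \begin{aligned} p(\mathbf{w} \,|\, \mathbf{y}) &= \mathcal{N}(\mathbf{w}; \mathbf{m}_N, \mathbf{S}_N), \qquad && \mathbf{S}_N = \left(\mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^{\mathsf{T}} \boldsymbol{\Phi}\right)^{-1}, \quad \mathbf{m}_N = \mathbf{S}_N \left(\mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^{\mathsf{T}} \mathbf{y}\right), \\ p(f(x_{\star}) \,|\, \mathbf{y}) &= \mathcal{N}\left(f(x_{\star}); \boldsymbol{\phi}(x_{\star})^{\mathsf{T}} \mathbf{m}_N, \ \boldsymbol{\phi}(x_{\star})^{\mathsf{T}} \mathbf{S}_N \boldsymbol{\phi}(x_{\star})\right). \end{aligned} $$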

Exercise 3.1: Understanding the code

Download the files `lindata.mat` and `nlindata.mat` and save them to the folder of this notebook. These datasets are borrowed from Philipp Hennig's course "Probabilistic machine learning", given at the University of Tübingen.

The following code cell loads the inputs, outputs, and noise precision parameter from `lindata.mat` and plots the feature vector

$$ \boldsymbol{\phi}(x)^{\mathsf{T}} = [1, x]. $$
In [2]:
# Load data from disk
# File should be in the same folder as the Jupyter notebook,
# otherwise you have to adjust the path
data = spio.loadmat("lindata.mat")
x = data["X"].flatten() # inputs
y = data["Y"].flatten() # outputs
beta = float(data["sigma"])**(-2) # measurement noise precision
N = x.size

# Define the feature vector
def Phi(a):  # Phi(a) = [1, a]
    return np.power(np.reshape(a, (-1, 1)), range(2))

# Plot the features
plt.plot(x, Phi(x), '-o')
plt.title('features')
plt.xlabel('x')
plt.ylabel('phi(x)')
plt.show()

We then compute the posterior distribution of the weights $\mathbf{w}$ of a Bayesian linear regression model using these features.

In [3]:
# Define the prior on the weights
# p(w) = N(w; m0, S0)
D = Phi(0).size  # number of features
m0 = np.zeros(D)
S0 = 10*np.eye(D) / D

# Compute the posterior distribution of the Bayesian linear regression model
# p(w | y) = N(w; mN, SN)
SN = np.linalg.inv(np.linalg.inv(S0) + beta * Phi(x).T @ Phi(x))
mN = SN @ (np.linalg.inv(S0) @ m0 + beta * Phi(x).T @ y)
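
As a side note, the explicit inverse of the posterior precision can be avoided. Below is a minimal sketch of an equivalent, numerically more stable computation, assuming the variables defined above (the helper name `posterior_solve` is ours, not part of the original code):

# A sketch of an equivalent computation that solves linear systems with the
# posterior precision rather than multiplying by its explicit inverse
def posterior_solve(m0, S0, beta, PhiX, y):
    A = np.linalg.inv(S0) + beta * PhiX.T @ PhiX        # posterior precision S_N^{-1}
    SN = np.linalg.solve(A, np.eye(A.shape[0]))         # posterior covariance S_N
    mN = np.linalg.solve(A, np.linalg.solve(S0, m0) + beta * PhiX.T @ y)  # posterior mean m_N
    return mN, SN

# e.g. mN, SN = posterior_solve(m0, S0, beta, Phi(x), y)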

We visualize the posterior distribution by plotting the functions $f$ corresponding to different samples of $\mathbf{w}$.

In [4]:
# Generate grid of new inputs x* for plotting
n = 100  # number of grid-points
xs = np.linspace(-8, 8, n)

# Visualize the posterior p(w | y) = N(w; mN, SN)
# For samples of w, f(x) = phi(x)^T w is evaluated at inputs xs
# Draw samples of w from the posterior
samples = 5
seed = 100
ws = stats.multivariate_normal(mean=mN, cov=SN, allow_singular=True).rvs(samples, random_state=seed)

# Compute corresponding values f(x*)
fs = Phi(xs) @ ws.T

# Plot the samples
plt.plot(xs, fs, 'gray') # samples
plt.scatter(x, y, zorder=3)
plt.title('posterior - samples')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

Next we plot the predictive mean and credibility regions of the predictive distribution.

In [5]:
# Compute the predictive distribution of the outputs y*
# p(y* | y) = N(y*; m*, S*)
mstar = Phi(xs) @ mN
Sstar = Phi(xs) @ SN @ Phi(xs).T + beta**(-1) * np.eye(n)

# Extract standard deviation of predictive distribution
stdpred= np.sqrt(np.diag(Sstar))

# Plot credibility regions
plt.plot(xs, mstar, 'black') # predictive mean
plt.fill_between(xs, mstar + 3*stdpred, mstar - 3*stdpred, color='lightgray')
plt.fill_between(xs, mstar + 2*stdpred, mstar - 2*stdpred, color='darkgray')
plt.fill_between(xs, mstar + 1*stdpred, mstar - 1*stdpred, color='gray')
plt.scatter(x, y, zorder=3)
plt.title('predictive distribution')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
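
The variance plotted above is that of the noisy outputs $y_{\star}$, i.e., it includes the measurement noise term $\beta^{-1}$. If one is instead interested in the uncertainty of the function values $f(x_{\star})$ themselves, that term is simply dropped; a minimal sketch using the variables defined above:

# Predictive standard deviation of f(x*) alone: omit the beta^{-1} * I noise term
stdf = np.sqrt(np.diag(Phi(xs) @ SN @ Phi(xs).T))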

(a)

Go through the code and make sure that you can map the code to the model and regression method explained in Lecture 3. Also run the code. Make sure you understand the figures.

(b)

Reduce the training data to only the first 5 data points. What impact does this have on the predictive distribution?

In [6]:
# Define the inputs and outputs
N = 5
x = data["X"][:N].flatten() # inputs
y = data["Y"][:N].flatten() # outputs

# We use the same feature vector and prior as above

# Compute the posterior distribution of the Bayesian linear regression model
# p(w | y) = N(w; mN, SN)
SN = np.linalg.inv(np.linalg.inv(S0) + beta * Phi(x).T @ Phi(x))
mN = SN @ (np.linalg.inv(S0) @ m0 + beta * Phi(x).T @ y)

# Compute the predictive distribution of the outputs y*
# p(y* | y) = N(y*; m*, S*)
mstar = Phi(xs) @ mN
Sstar = Phi(xs) @ SN @ Phi(xs).T + beta**(-1) * np.eye(n)

# Extract standard deviation of predictive distribution
stdpred= np.sqrt(np.diag(Sstar))

# Plot credibility regions
plt.plot(xs, mstar, 'black') # predictive mean
plt.fill_between(xs, mstar + 3*stdpred, mstar - 3*stdpred, color='lightgray')
plt.fill_between(xs, mstar + 2*stdpred, mstar - 2*stdpred, color='darkgray')
plt.fill_between(xs, mstar + 1*stdpred, mstar - 1*stdpred, color='gray')
plt.scatter(x, y, zorder=3)
plt.title('predictive distribution')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

With only five data points, the model becomes very uncertain in regions where there are no measurements.

Exercise 3.2: Feature vectors

(a)

Load nlindata.mat instead of lindata.mat and run the code for this data. Use all data, not only the first five data points as in Exercise 3.1 (b). Do you think the model performs well on this data?

In [7]:
# Load data from disk
# File should be in the same folder as the Jupyter notebook,
# otherwise you have to adjust the path
data = spio.loadmat("nlindata.mat")
x = data["X"].flatten() # inputs
y = data["Y"].flatten() # outputs
beta = float(data["sigma"])**(-2) # measurement noise precision
N = x.size

# Plot the features
# We use the same feature vector as above
plt.plot(x, Phi(x), '-o')
plt.title('features')
plt.xlabel('x')
plt.ylabel('phi(x)')
plt.show()
In [8]:
# Compute the posterior distribution of the Bayesian linear regression model
# p(w | y) = N(w; mN, SN)
# We use the same prior as above
SN = np.linalg.inv(np.linalg.inv(S0) + beta * Phi(x).T @ Phi(x))
mN = SN @ (np.linalg.inv(S0) @ m0 + beta * Phi(x).T @ y)

# Visualize the posterior p(w | y) = N(w; mN, SN)
# For samples of w, f(x) = phi(x)^T w is evaluated at inputs xs
# Draw samples of w from the posterior
ws = stats.multivariate_normal(mean=mN, cov=SN, allow_singular=True).rvs(samples, random_state=seed)

# Compute corresponding values f(x*)
fs = Phi(xs) @ ws.T

# Plot the samples
plt.plot(xs, fs, 'gray') # samples
plt.scatter(x, y, zorder=3)
plt.title('posterior - samples')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
In [9]:
# Compute the predictive distribution of the outputs y*
# p(y* | y) = N(y*; m*, S*)
mstar = Phi(xs) @ mN
Sstar = Phi(xs) @ SN @ Phi(xs).T + beta**(-1) * np.eye(n)

# Extract standard deviation of predictive distribution
stdpred= np.sqrt(np.diag(Sstar))

# Plot credibility regions
plt.plot(xs, mstar, 'black') # predictive mean
plt.fill_between(xs, mstar + 3*stdpred, mstar - 3*stdpred, color='lightgray')
plt.fill_between(xs, mstar + 2*stdpred, mstar - 2*stdpred, color='darkgray')
plt.fill_between(xs, mstar + 1*stdpred, mstar - 1*stdpred, color='gray')
plt.scatter(x, y, zorder=3)
plt.title('predictive distribution')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

For this data, a feature vector with only a constant and a linear term performs poorly: it cannot capture the nonlinear relationship between input and output.

(b)

In order to improve the performance, consider instead a feature vector with an additional quadratic term

$$ \boldsymbol{\phi}(x)^{\mathsf{T}} = [1, x, x^2]. $$

Change the code accordingly and run it.

Hint: Only a very minor modification of the code is required to accommodate this change.

Only the definition of the features Phi has to be changed. An additional column for the quadratic term is required.

In [10]:
# Define the feature vector with an additional quadratic term
def Phi(a):  # Phi(a) = [1, a, a**2]
    return np.power(np.reshape(a, (-1, 1)), range(3))

# Plot the features
plt.plot(x, Phi(x), '-o')
plt.title('features')
plt.xlabel('x')
plt.ylabel('phi(x)')
plt.show()
In [11]:
# We have to redefine the prior on the weights since the
# dimension of the feature vector changed
# p(w) = N(w; m0, S0)
D = Phi(0).size  # number of features
m0 = np.zeros(D)
S0 = 10*np.eye(D) / D

# Compute the posterior distribution of the Bayesian linear regression model
# p(w | y) = N(w; mN, SN)
SN = np.linalg.inv(np.linalg.inv(S0) + beta * Phi(x).T @ Phi(x))
mN = SN @ (np.linalg.inv(S0) @ m0 + beta * Phi(x).T @ y)

# Visualize the posterior p(w | y) = N(w; mN, SN)
# For samples of w, f(x) = phi(x)^T w is evaluated at inputs xs
# Draw samples of w from the posterior
ws = stats.multivariate_normal(mean=mN, cov=SN, allow_singular=True).rvs(samples, random_state=seed)

# Compute corresponding values f(x*)
fs = Phi(xs) @ ws.T

# Plot the samples
plt.plot(xs, fs, 'gray') # samples
plt.scatter(x, y, zorder=3)
plt.title('posterior - samples')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
In [12]:
# Compute the predictive distribution of the outputs y*
# p(y* | y) = N(y*; m*, S*)
mstar = Phi(xs) @ mN
Sstar = Phi(xs) @ SN @ Phi(xs).T + beta**(-1) * np.eye(n)

# Extract standard deviation of predictive distribution
stdpred= np.sqrt(np.diag(Sstar))

# Plot credibility regions
plt.plot(xs, mstar, 'black') # predictive mean
plt.fill_between(xs, mstar + 3*stdpred, mstar - 3*stdpred, color='lightgray')
plt.fill_between(xs, mstar + 2*stdpred, mstar - 2*stdpred, color='darkgray')
plt.fill_between(xs, mstar + 1*stdpred, mstar - 1*stdpred, color='gray')
plt.scatter(x, y, zorder=3)
plt.title('predictive distribution')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

(c)

Use step functions

$$ h(x) = \begin{cases} 1, \qquad x \geq 0,\\ 0, \qquad x < 0, \end{cases} $$

as features and change the code accordingly. Place a total of 9 such step functions, with their steps two units apart, between $x = -8$ and $x = 8$. The feature vector is then

$$ \boldsymbol{\phi}(x)^{\mathsf{T}} = [h(x-8), h(x - 6), \ldots, h(x + 8)]. $$

Again only the definition of the features Phi has to be changed.

In [13]:
# Define the feature vector with step functions
def Phi(a):  # Phi(a) = [h(a - 8), h(a - 6), ..., h(a + 8)]
    return (np.reshape(a, (-1, 1)) > np.linspace(-8, 8, 9))

# Plot the features
plt.plot(x, Phi(x), '-o')
plt.title('features')
plt.xlabel('x')
plt.ylabel('phi(x)')
plt.show()
In [14]:
# We have to redefine the prior on the weights since the
# dimension of the feature vector changed
# p(w) = N(w; m0, S0)
D = Phi(0).size  # number of features
m0 = np.zeros(D)
S0 = 10*np.eye(D) / D

# Compute the posterior distribution of the Bayesian linear regression model
# p(w | y) = N(w; mN, SN)
SN = np.linalg.inv(np.linalg.inv(S0) + beta * Phi(x).T @ Phi(x))
mN = SN @ (np.linalg.inv(S0) @ m0 + beta * Phi(x).T @ y)

# Visualize the posterior p(w | y) = N(w; mN, SN)
# For samples of w, f(x) = phi(x)^T w is evaluated at inputs xs
# Draw samples of w from the posterior
ws = stats.multivariate_normal(mean=mN, cov=SN, allow_singular=True).rvs(samples, random_state=seed)

# Compute corresponding values f(x*)
fs = Phi(xs) @ ws.T

# Plot the samples
plt.plot(xs, fs, 'gray') # samples
plt.scatter(x, y, zorder=3)
plt.title('posterior - samples')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
In [15]:
# Compute the predictive distribution of the outputs y*
# p(y* | y) = N(y*; m*, S*)
mstar = Phi(xs) @ mN
Sstar = Phi(xs) @ SN @ Phi(xs).T + beta**(-1) * np.eye(n)

# Extract standard deviation of predictive distribution
stdpred= np.sqrt(np.diag(Sstar))

# Plot credibility regions
plt.plot(xs, mstar, 'black') # predictive mean
plt.fill_between(xs, mstar + 3*stdpred, mstar - 3*stdpred, color='lightgray')
plt.fill_between(xs, mstar + 2*stdpred, mstar - 2*stdpred, color='darkgray')
plt.fill_between(xs, mstar + 1*stdpred, mstar - 1*stdpred, color='gray')
plt.scatter(x, y, zorder=3)
plt.title('predictive distribution')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

(d)

Can you come up with any other features that improve performance even further?

One can try, e.g., piecewise linear features:

In [16]:
# Define the feature vector with piecewise linear functions
def Phi(a):  # Phi(a) = [|a + 8| + 8, |a + 7| + 7, ..., |a - 8| - 8]
    return np.abs(np.reshape(a, (-1, 1)) - np.linspace(-8, 8, 17)) - np.linspace(-8, 8, 17)

# Plot the features
plt.plot(x, Phi(x), '-o')
plt.title('features')
plt.xlabel('x')
plt.ylabel('phi(x)')
plt.show()
In [17]:
# We have to redefine the prior on the weights since the
# dimension of the feature vector changed
# p(w) = N(w; m0, S0)
D = Phi(0).size  # number of features
m0 = np.zeros(D)
S0 = 10*np.eye(D) / D

# Compute the posterior distribution of the Bayesian linear regression model
# p(w | y) = N(w; mN, SN)
SN = np.linalg.inv(np.linalg.inv(S0) + beta * Phi(x).T @ Phi(x))
mN = SN @ (np.linalg.inv(S0) @ m0 + beta * Phi(x).T @ y)

# Visualize the posterior p(w | y) = N(w; mN, SN)
# For samples of w, f(x) = phi(x)^T w is evaluated at inputs xs
# Draw samples of w from the posterior
ws = stats.multivariate_normal(mean=mN, cov=SN, allow_singular=True).rvs(samples, random_state=seed)

# Compute corresponding values f(x*)
fs = Phi(xs) @ ws.T

# Plot the samples
plt.plot(xs, fs, 'gray') # samples
plt.scatter(x, y, zorder=3)
plt.title('posterior - samples')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
In [18]:
# Compute the predictive distribution of the outputs y*
# p(y* | y) = N(y*; m*, S*)
mstar = Phi(xs) @ mN
Sstar = Phi(xs) @ SN @ Phi(xs).T + beta**(-1) * np.eye(n)

# Extract standard deviation of predictive distribution
stdpred= np.sqrt(np.diag(Sstar))

# Plot credibility regions
plt.plot(xs, mstar, 'black') # predictive mean
plt.fill_between(xs, mstar + 3*stdpred, mstar - 3*stdpred, color='lightgray')
plt.fill_between(xs, mstar + 2*stdpred, mstar - 2*stdpred, color='darkgray')
plt.fill_between(xs, mstar + 1*stdpred, mstar - 1*stdpred, color='gray')
plt.scatter(x, y, zorder=3)
plt.title('predictive distribution')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

Exercise 3.3: Marginal likelihood

To get a quantitative measure of the performance of the proposed feature vectors, we want to compare them by computing the marginal likelihood $p(\mathbf{y})$ for each of the models. Refer to Exercise 2.12(a) for the expression of the marginal likelihood of the Bayesian linear regression model.
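
For reference, since $\mathbf{y} = \boldsymbol{\Phi}\mathbf{w} + \boldsymbol{\epsilon}$ is a linear-Gaussian model, the marginal likelihood is itself Gaussian,

$$ p(\mathbf{y}) = \mathcal{N}\left(\mathbf{y}; \boldsymbol{\Phi}\mathbf{m}_0, \ \boldsymbol{\Phi}\mathbf{S}_0\boldsymbol{\Phi}^{\mathsf{T}} + \beta^{-1}\mathbf{I}_N\right), $$

which is exactly the density that the helper function below evaluates (on a log scale).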

(a)

Extend the code to also compute the logarithm of the marginal likelihood. Which one of the three feature vectors in Exercise 3.2 gives the largest log marginal likelihood on the data nlindata.mat?

The following function computes the log marginal likelihood:

In [19]:
# Compute the log marginal likelihood
# p(y) = N(y; my, Sy)
# for the inputs X and outputs y in a Bayesian regression model
# p(y | w, X) = N(y; Xw, I/beta)
# p(w) = N(w; m0, S0)
def marginal_loglik(m0, S0, beta, X, y):
    N = X.shape[0]
    my = X @ m0
    Sy = X @ S0 @ X.T + beta**(-1) * np.eye(N)
    rv = stats.multivariate_normal(mean=my, cov=Sy, allow_singular=True)
    return rv.logpdf(y)
In [20]:
# Load data from disk
# File should be in the same folder as the Jupyter notebook,
# otherwise you have to adjust the path
data = spio.loadmat("nlindata.mat")
x = data["X"].flatten() # inputs
y = data["Y"].flatten() # outputs
beta = float(data["sigma"])**(-2) # measurement noise precision

# Feature vector and prior used in 3.2 (a)
def Phi(a):  # Phi(a) = [1, a]
    return np.power(np.reshape(a, (-1, 1)), range(2))
D = Phi(0).size  # number of features
m0 = np.zeros(D)
S0 = 10*np.eye(D) / D
print("Linear (3.2 (a)):\t\t{:8.1f}".format(marginal_loglik(m0, S0, beta, Phi(x), y)))

# Feature vector and prior used in 3.2 (b)
def Phi(a):  # Phi(a) = [1, a, a**2]
    return np.power(np.reshape(a, (-1, 1)), range(3))
D = Phi(0).size  # number of features
m0 = np.zeros(D)
S0 = 10*np.eye(D) / D
print("Quadratic (3.2 (b)):\t\t{:8.1f}".format(marginal_loglik(m0, S0, beta, Phi(x), y)))

# Feature vector and prior used in 3.2 (c)
def Phi(a):  # Phi(a) = [h(a - 8), h(a - 6), ..., h(a + 8)]
    return (np.reshape(a, (-1, 1)) > np.linspace(-8, 8, 9))
D = Phi(0).size  # number of features
m0 = np.zeros(D)
S0 = 10*np.eye(D) / D
print("Step function (3.2 (c)):\t{:8.1f}".format(marginal_loglik(m0, S0, beta, Phi(x), y)))

# Define the feature vector with piecewise linear functions
def Phi(a):  # Phi(a) = [|a + 8| + 8, |a + 7| + 7, ..., |a - 8| - 8]
    return np.abs(np.reshape(a, (-1, 1)) - np.linspace(-8, 8, 17)) - np.linspace(-8, 8, 17)
D = Phi(0).size  # number of features
m0 = np.zeros(D)
S0 = 10*np.eye(D) / D
print("Piecewise linear (3.2 (d)):\t{:8.1f}".format(marginal_loglik(m0, S0, beta, Phi(x), y)))
Linear (3.2 (a)):		  -861.3
Quadratic (3.2 (b)):		  -202.5
Step function (3.2 (c)):	  -382.6
Piecewise linear (3.2 (d)):	   -87.1

We see that the linear features perform worst in terms of marginal likelihood, while the piecewise linear features achieve the largest marginal likelihood.

(b)

Perform the same comparison on the data lindata.mat. What are your conclusions?

For this data, the linear features perform best in terms of marginal likelihood. Note that the model with the linear features is a special case of the models with the quadratic and with the piecewise linear features, so the more flexible models can fit this data at least as well. They nevertheless obtain a lower marginal likelihood, because a more flexible model spreads its prior probability mass over a wider variety of datasets and therefore assigns less probability to this particular, nearly linear, dataset. Hence we see that the marginal likelihood also penalizes model complexity.

In [21]:
# Load data from disk
# File should be in the same folder as the Jupyter notebook,
# otherwise you have to adjust the path
data = spio.loadmat("lindata.mat")
x = data["X"].flatten() # inputs
y = data["Y"].flatten() # outputs
beta = float(data["sigma"])**(-2) # measurement noise precision

# Feature vector and prior used in 3.2 (a)
def Phi(a):  # Phi(a) = [1, a]
    return np.power(np.reshape(a, (-1, 1)), range(2))
D = Phi(0).size  # number of features
m0 = np.zeros(D)
S0 = 10*np.eye(D) / D
print("Linear (3.2 (a)):\t\t{:8.1f}".format(marginal_loglik(m0, S0, beta, Phi(x), y)))

# Feature vector and prior used in 3.2 (b)
def Phi(a):  # Phi(a) = [1, a, a**2]
    return np.power(np.reshape(a, (-1, 1)), range(3))
D = Phi(0).size  # number of features
m0 = np.zeros(D)
S0 = 10*np.eye(D) / D
print("Quadratic (3.2 (b)):\t\t{:8.1f}".format(marginal_loglik(m0, S0, beta, Phi(x), y)))

# Feature vector and prior used in 3.2 (c)
def Phi(a):  # Phi(a) = [h(a - 8), h(a - 6), ..., h(a + 8)]
    return (np.reshape(a, (-1, 1)) > np.linspace(-8, 8, 9))
D = Phi(0).size  # number of features
m0 = np.zeros(D)
S0 = 10*np.eye(D) / D
print("Step function (3.2 (c)):\t{:8.1f}".format(marginal_loglik(m0, S0, beta, Phi(x), y)))

# Define the feature vector with piecewise linear functions
def Phi(a):  # Phi(a) = [|a + 8| + 8, |a + 7| + 7, ..., |a - 8| - 8]
    return np.abs(np.reshape(a, (-1, 1)) - np.linspace(-8, 8, 17)) - np.linspace(-8, 8, 17)
D = Phi(0).size  # number of features
m0 = np.zeros(D)
S0 = 10*np.eye(D) / D
print("Piecewise linear (3.2 (d)):\t{:8.1f}".format(marginal_loglik(m0, S0, beta, Phi(x), y)))
Linear (3.2 (a)):		   -57.9
Quadratic (3.2 (b)):		   -61.1
Step function (3.2 (c)):	   -71.1
Piecewise linear (3.2 (d)):	   -62.5

(c)

Can you come up with any other feature vectors and/or values for prior/likelihood precisions that give an even larger log marginal likelihood?

Open-ended question...
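
One option worth trying is Gaussian radial basis function (RBF) features placed on a grid over the input range. Below is a minimal sketch (the helper `Phi_rbf`, the centers, and the length scale are our choices, not part of the original exercise); its log marginal likelihood can be compared with the values above via `marginal_loglik`:

# Gaussian RBF features centered on a grid between -8 and 8 (a sketch)
def Phi_rbf(a, centers=np.linspace(-8, 8, 9), lengthscale=2.0):
    return np.exp(-0.5 * ((np.reshape(a, (-1, 1)) - centers) / lengthscale)**2)

D = Phi_rbf(0).size  # number of features
m0 = np.zeros(D)
S0 = 10*np.eye(D) / D
print("RBF features:\t\t\t{:8.1f}".format(marginal_loglik(m0, S0, beta, Phi_rbf(x), y)))

One could also treat the prior covariance scale and the noise precision $\beta$ as additional knobs and pick the values that maximize the log marginal likelihood.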