{"cells": [{"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": "import numpy as np\nimport pandas as pd\nimport sklearn.linear_model as skl_lm\nimport matplotlib.pyplot as plt\n\n# To get nicer plots\nfrom IPython.display import set_matplotlib_formats\nset_matplotlib_formats('svg') # Output as svg. Else you can try png\nfrom IPython.core.pylabtools import figsize\nfigsize(10, 6) # Width and hight\nnp.set_printoptions(precision=3);"}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "# 2.1 Problem 1.1 using matrix multiplications\nImplement the linear regression problems from Exercises 1.1(a), (b), (c), (d) and (e) in Python using matrix multiplications.\nA matrix\n$$\n\\textbf{X} = \\begin{bmatrix}\n 1 & 2 \\\\\n 1 & 3 \\\\ \n\\end{bmatrix}\n$$\ncan be constructed with numpy as `X=np.array([[1, 2], [1, 3]])` (Make sure that `numpy` has been imported. Here it is imported as `np`). The commands for matrix multiplication and transpose in `numpy` are `@` or `np.matmul` and `.T` or `np.transpose()` respectively. A system of linear equations $\\textbf{A}x=\\textbf{b}$ can be solved using `np.linalg.solve(A,b)`. A $k \\times k$ unit matrix can be constructed with `np.eye(k)`.\n"}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (a) \nAssume that you record a scalar input $x$ and a scalar output $y$. First, you record $x_1 = 2, y_1 = -1$, and thereafter $x_2 = 3, y_2 = 1$. Assume a linear regression model $y = \\theta_0 + \\theta_1 x + \\epsilon$ and learn the parameters with maximum likelihood $\\widehat{\\boldsymbol{\\theta}}$ with the assumption $\\epsilon \\sim \\mathcal{N}(0,\\sigma_\\epsilon^2)$. Use the model to predict the output for the test input $x_\\star = 4$, and plot the data and the model."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (b) \n\nNow, assume you have made a third observation $y_3 = 2$ for $x_3 = 4$ (is that what you predicted in [(a)](#2.1-a)?). Update the parameters $\\widehat{\\boldsymbol{\\theta}}$ to all 3 data samples, add the new model to the plot (together with the new data point) and find the prediction for $x_\\star = 5$."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (c) \nRepeat [(b)](#2.1-b), but this time using a model without intercept term, i.e., $y = \\theta_1x + \\epsilon$."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (d) \nRepeat [(b)](#2.1-b), but this time using Ridge Regression with $\\gamma=1$ instead."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (e) \nYou realize that there are actually _two_ output variables in the problem you are studying. In total, you have made the following observations:\n\n| sample | input $x$ | first output $y_1$ | second output $y_2$ |\n|:------:|:---------:|:------------------:|:-------------------:|\n| (1) | 2 | -1 | 0 |\n| (2) | 3 | 1 | 2 |\n| (3) | 4 | 2 | -1 |\n\nYou want to model this as a linear regression with multidimensional outputs (without regularization), i.e.,\n$$\\begin{align}\n y_1 &= \\theta_{01}+\\theta_{11}x + \\epsilon_1\\\\\n y_2 &= \\theta_{02}+\\theta_{12}x + \\epsilon_2\n\\end{align}$$\nBy introducing, for the general case of $p$ inputs and $q$ outputs, the matrices\n$$\\begin{align}\n \\underbrace{\\begin{bmatrix}\n y_{11} & \\cdots & y_{1q} \\\\\n y_{21} & \\cdots & y_{2q} \\\\\n \\vdots & & \\vdots \\\\\n y_{n1} & \\cdots & y_{nq}\n \\end{bmatrix}}_{\\boldsymbol{\\mathrm{Y}}}\n &=\n \\underbrace{\\begin{bmatrix}\n 1 & x_{11} & x_{12} & \\cdots & x_{1p} \\\\\n 1 & x_{21} & x_{22} & \\cdots & x_{2p} \\\\\n \\vdots & \\vdots & \\vdots & \\vdots \\\\\n 1 & x_{n1} & x_{n2} & \\cdots & x_{np} \\\\\n \\end{bmatrix}}_{\\boldsymbol{\\mathrm{X}}}\n \\underbrace{\\begin{bmatrix}\n \\theta_{01} & \\theta_{02} & \\cdots & \\theta_{0q} \\\\\n \\theta_{11} & \\theta_{12} & \\cdots & \\theta_{1q} \\\\\n \\theta_{21} & \\theta_{22} & \\cdots & \\theta_{2q} \\\\\n \\vdots & \\vdots & & \\vdots \\\\\n \\theta_{p1} & \\theta_{p2} & \\cdots & \\theta_{pq}\n \\end{bmatrix}}_{\\boldsymbol{\\mathrm{\\Theta}}} + \\boldsymbol{\\epsilon}\n\\end{align}$$\n\ntry to make an educated guess how the normal equations can be generalized to the multidemsional output case. (A more thorough derivation is found in problem 1.5). Use your findings to compute the least square solution $\\widehat{\\boldsymbol{\\mathrm{\\Theta}}}$ to the problem now including both the first output $y_1$ and the second output $y_2$."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "# 2.2 Problem 1.1 using the linear_model.LinearRegression() command\nImplement the linear regression problem from Exercises 1.1(b) and (c) using the command `LinearRegression()` from `sklearn.linear_model`. "}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (b)\n[See above.](#2.1-b)"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (c)\n[See above.](#2.1-c)"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "# 2.3 The Auto data set"}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (a)\nLoad the dataset `'data/auto.csv'`. Familiarize yourself with the dataset using `auto.info()`. The dataset: \n\n**Description**: Gas mileage, horsepower, and other information for 392 vehicles. \n**Format**: A data frame with 392 observations on the following 9 variables. \n\n- `mpg`: miles per gallon \n- `cylinders`: Number of cylinders between 4 and 8\n- `displacement`: Engine displacement (cu. inches)\n- `horsepower`: Engine horsepower\n- `weight`: Vehicle weight (lbs.)\n- `acceleration`: Time to accelerate from 0 to 60 mph (sec.)\n- `year`: Model year (modulo 100)\n- `origin`: Origin of car (1. American, 2. European, 3. Japanese)\n- `name`: Vehicle name \n*The orginal data contained 408 observations but 16 observations with missing values were removed.*\n"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": "# Load library\n# The null values are '?' in the dataset. `na_values=\"?\"` recognize the null values. \n# There are null values that will mess up the computation. Easier to drop them by `dropna()`.\n\n# url = 'data/auto.csv'\nurl = 'https://uu-sml.github.io/course-sml-public/data/auto.csv'\n\nauto = pd.read_csv(url, na_values='?').dropna()"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (b)\nDivide the data set randomly into two approximately equally sized subsets, `train` and `test` by generating the random indices using `np.random.choice()`.\n"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (c)\nPerform linear regression with `mpg` as the output and all other variables except name as input. How well (in terms of root-mean-square-error) does the model perform on test data and training data, respectively?\n"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (d)\nNow, consider the input variable `origin`. What do the different numbers represent? By running `auto.origin.sample(30)` we see the 30 samples of the variable and that the input variables is quantitative. Does it really makes sense to treat it as a quantitative input? Use `pd.get_dummies()` to split it into dummy variables and do the linear regression again.\n"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (e)\nTry obtain a better RMSE on test data by removing some inputs (explore what happens if you remove, e.g, `year`, `weight` and `acceleration`)"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (f)\nTry to obtain a better RMSE on test data by adding some transformations of inputs, such as \n$log(x)$, $\\sqrt{x}$, $x_1x_2$ etc.\n"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "# 2.4 Nonlinear transformations of input variables"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": "#Start by running the following code to generate your training data\nnp.random.seed(1)\nx_train = np.random.uniform(0, 10, 100)\ny_train = .4 \\\n - .6 * x_train \\\n + 3. * np.sin(x_train - 1.2) \\\n + np.random.normal(0, 0.1, 100)"}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (a) \nPlot the training output `y_train` versus the training input `x_train`. "}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (b) \nLearn a model on the form \n$$\ny= a + bx + c \\sin(x + \\phi) + \\epsilon, \\qquad \\epsilon \\sim \\mathcal{N}(0, 0,1^2) \\qquad (2.1)\n$$\n\nwhere all parameters $a$, $b$, $c$ and $\\phi$ are to be learned from the training data `x_train` and `y_train`. Refrain from using the` linear_model()` command, but implement the normal equations yourself as in problem 2.1. Hint: Even though (2.1) is not a linear regression model, you can use the fact that $c \\sin(x + \\phi) = c \\cos(\\phi) \\sin(x) + c \\sin(\\phi) \\cos(x)$ to transform it into one. \n"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (c) \nConstruct 100 test inputs `x_test` in the span from 0 to 10 by using the `np.linspace()` function. Predict the outputs corresponding to these inputs and plot them together with the training data."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (d) \nDo a least squares fit by instead using the `linear_model()` function in `Python`. Check that you get the same estimates as in (b)."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "# 2.5 Regularization"}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "In this exercise we will apply Ridge regression and Lasso for fitting a polynomial to a scalar data set. We will have a setting where we first generate synthetic training data from \n$$\ny = x^3 + 2x^2 + 6 + \\epsilon, \\qquad (2.2)\n$$\nand later try to learn model for the data. "}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (a) \nWrite a function that implements the polynomial [(2.2)](#2.2), i.e., takes $x$ as argument and returns $x^3 + 2x^2 + 6$. "}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (b) \nUse `np.random.seed()` to set the random seed. Use the function `np.linspace()` to construct a vector `x` with `n = 12` elements equally spaced from $-2.3$ to $1$. Then use your function from [(a)](#2.5-a) to construct a vector $\\textbf{y} = [y_1, ..., y_n]^T$ with 12 elements, where $y = x^3 + 2x^2 + 6 + \\epsilon$, with $\\epsilon \\sim \\mathcal{N(0, 1^2)}$. This is our training data."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (c) \nPlot the training data $\\mathcal{T} = \\{x_i, y_i\\}_{i=1}^{12}$ together with the \"true\" function."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (d)\nFit a straight line to the data with $y$ as output and $x$ as input and plot the predicted output $\\hat{y}_{\\star}$ for densely spaced $x_{\\star}$ values between $-2.3$ and $1$. Plot these predictions in the same plot window."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (e) \nFit an 11th degree polynomial to the data with linear regression. Plot the corresponding predictions."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## (f) \n\nUse the fucntion `sklearn.linear_model.Ridge` and `sklearn.linear_model.Lasso` to fit a 11th degree polynomial. Also inspect the estimated coefficients. Try different values of penalty term $\\alpha$. What do you observe?\n"}, {"cell_type": "code", "execution_count": null, "metadata": {"scrolled": false, "tags": []}, "outputs": [], "source": ""}], "metadata": {"@webio": {"lastCommId": null, "lastKernelId": null}, "celltoolbar": "Tags", "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6"}}, "nbformat": 4, "nbformat_minor": 2}