{"cells": [{"cell_type": "markdown", "metadata": {"tags": []}, "source": "# Bias-variance trade-off, model selection and cross validation \u2013 Computer exercises"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": "import pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\n\nimport sklearn.preprocessing as skl_pre\nimport sklearn.linear_model as skl_lm\nimport sklearn.discriminant_analysis as skl_da\nimport sklearn.neighbors as skl_nb\nimport sklearn.model_selection as skl_ms"}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## 6.1 Cross validation in $k$-NN\nIn this exercise we will return to the biopsy data set also used in Exercise 4.1 (Lesson 4). We will try to determine suitable value of $k$ in $k$-NN for this data. For simplicity, we will only consider the three attributes in columns V3, V4and V5 in this problem."}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "### (a) \nConsider all data as training data. Investigate how the training error varies with different values of $k$ (hint: use a for-loop). Which $k$ gives the best result? Is it a good choice of $k$?"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": "# Load the data\n# url = 'data/biopsy.csv'\nurl = 'https://uu-sml.github.io/course-sml-public/data/biopsy.csv'\nbiopsy = pd.read_csv(url, dtype={'ID': str}).dropna().reset_index(drop=True)"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "### (b) \nSplit the data randomly into a training and validation set, and see how well you perform on the validation set. (Previously, we have used the terminology \"training\" and \"test\" set. If the other set (not the training set) is used to make design decisions, such as choosing $k$, it is really not a test set, but rather a \"validation\" set. Hence the\nterminology.) Which $k$ gives the best result?"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "### (c) \nPerform [(b)](#6.1-b) 10 times for different validation sets and average the result. Which $k$ gives the best result?"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "### (d) \nPerform 10-fold cross-validation by first randomly permute the data set, divide the data set into 10 equally sized parts and loop through them by taking one part as validation set and the rest as training set each time. Which $k$ gives the best result?"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## 6.2 Cross validation for model choice\n\nIn this problem we will consider the data sets data/pima_tr.csv and data/pima_te.csv. Your task is to do as good prediction as possible for the test set pima_te, but you are only allowed to look at the true output in pima_te once (like in the real life, where you design and implement a method, and then hand it over to the ultimate test, namely the user). Hence, you will have to use pima_tr for both deciding which model to use and training the model.\n\nThe data set describes the prevalence of diabetes in women at least 21 years old of Pima Indian heritage, living near Phoenix, Arizona, USA. The data set describes, for each individual, whether she has diabetes or not, her age, the diabetes pedigree function (a summary of the diabetes history in her family), BMI, skin thickness, blood pressure, plasma glucose concentration and number of pregnancies.\n\nThe data frame contains the following columns: \nnpreg number of pregnancies. \nglu plasma glucose concentration in an oral glucose tolerance test. \nbp diastolic blood pressure (mm Hg). \nskin triceps skin fold thickness (mm). \nbmi body mass index (weight in kg/(height in m)\\^2). \nped diabetes pedigree function. \nage age in years. \ntype Yes or No, for diabetic according to WHO criteria.\n"}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "### (a)\nLoad the library and familiarize yourself with pima_tr"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": "# Load the datasets\n# url = 'data/pima_tr.csv'\nurl = 'https://uu-sml.github.io/course-sml-public/data/pima_tr.csv'\npima_tr = pd.read_csv(url)\n\n# url = 'data/pima_tr.csv'\nurl = 'https://uu-sml.github.io/course-sml-public/data/pima_te.csv'\npima_te = pd.read_csv(url)"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "### (b)\nSee how well you can fit the pima_tr with logistic regression, LDA, QDA and k-NN (k = 2). The output is whether an individual has diabetes or not, and the input the remaining variables. What error rate does each method have? Is it a good indicator of which method is preferable?"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "### (c)\nInstead of [(b)](#6.2-b), perform 10-fold cross-validation by first randomly permute pima_tr and divide it in 10 parts. Then, in a loop with one of the 10 parts held out as validation data, fit logistic regression, LDA, QDA and k-NN (k = 2) to the training data and evaluate the performance on the validation data. Plot your results in a box plot with the error rates. Feel free to play around with the choice of inputs and other settings to improve the performance. Which method does this suggest us to use?"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "### d) \nNow, decide which method to choose and train it on the entire data set pima_tr and predict pima_te. How well do you perform?"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "### (e) \nNow, since we are in a simulated environment, we can cheat and break the rule that we were only allowed to look at the true output in pima_te once. That is, explore how well the other methods do when you train them on pima_tr and predict pima_te. Did you make the \"right\" choice in [(d)](#6.4-d)?"}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## 6.3 Implementing problem 5.3\nVerify your theoretical findings from problem 5.3 by repeating the experiment $N$ times and approximating all expected values with sums. Let $\\sigma^2=1$."}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "### a)\nGenerate training data ($n=1$), estimate $\\theta_0$ and compute $\\widehat y(x_\\star;\\mathcal{T})$. Repeat $N$ times and store the results in a vector. Choose the regularization parameter yourself."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "### b)\nEstimate $\\bar{f}(x_\\star)=\\mathbb{E}_\\mathcal{T}[y(x_\\star;\\mathcal{T})]$ from your vector of $\\widehat y(x_\\star,\\mathcal{T})$. Compare your result to your theoretical findings in 5.3b."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "### c) \nEstimate the square bias $\\mathbb{E}_\\star[(\\bar{f}(x_\\star)-f_0(x_\\star))^2]$ using your result from b) and your knowledge about the true $f_0(x)$. Compare your result to your theoretical findings in 5.3c."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "### d) \nEstimate the variance $\\mathbb{E}_\\star[\\mathbb{E}_\\mathcal{T}[(\\widehat y (x_\\star;\\mathcal{T}) - \\bar f(x_\\star))^2]]$ using your vector of $\\widehat y(x_\\star;\\mathcal{T})$ from a) and your result from b). Compare your result to your theoretical findings in 5.3d."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "### e) \nEstimate the expected new data error $\\bar E_\\text{new} = \\mathbb{E}_\\mathcal{T}[E_\\text{new}] = \\mathbb{E}_\\mathcal{T}[\\mathbb{E}_\\star[(y(x_\\star;\\mathcal{T})-\\bar{f}(x_\\star))^2]]$ by, for each $\\widehat y(x_\\star;\\mathcal{T})$ in your vector from a), simulate $N$ copies of $y_\\star$. Compare your result to your theoretical findings in 5.3f."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "### f) \nMake a loop over different values for the regularization parameter $\\lambda$ and plot bias, variance and $\\bar{E}_\\text{new}$as a function of $\\lambda$. Also plot your theoretical findings from 5.3 in the same plot."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}, {"cell_type": "markdown", "metadata": {"tags": []}, "source": "## 6.4 Implementing problem 5.5\n\nDesign an experiment (similarly to 6.3) where you numerically confirm the results from problem 5.5."}, {"cell_type": "code", "execution_count": null, "metadata": {"tags": []}, "outputs": [], "source": ""}], "metadata": {"celltoolbar": "Tags", "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10"}}, "nbformat": 4, "nbformat_minor": 2}