L0: Introduction to Python

In this exercise session, we will review the fundamentals of three important Python libraries: Numpy, Pandas and Matplotlib. Throughout the course, you will need to get familiar with several Python libraries that provide convenient functionality for machine learning purposes. It is good to get into the habit of using the available documentation to your advantage. Some efficient ways of doing this are described below:

help() function

In Python, the help()-function can be used to display the documentation for a module, function, or object. When called with no arguments it opens an interactive help session. When called with a specific object as an argument, it displays the documentation for that object. For example, you can use help(print) to view the documentation for the built-in print function, or help(str) to view the documentation for the str class. Additionally, you can use the dir()-function to get all methods and properties of the object passed as an argument to it. It can be used to check all the attributes of a module or class. For example, dir(str) will give the methods and properties of the str class.
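
As a quick illustration of the two functions described above, the following sketch looks up the documentation of the built-in len function and lists the public methods of the str class. The variable name public_str_methods is chosen only for this illustration.

```python
# Show the documentation of a built-in function
help(len)

# dir() returns the attribute names of an object as a list of strings;
# here we keep only the names that do not start with an underscore
public_str_methods = [name for name in dir(str) if not name.startswith('_')]
print(public_str_methods)
```

Filtering out the underscore names is only a convenience; dir(str) on its own also lists the special methods such as __len__.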

SHIFT + TAB

In Jupyter notebook, shift+tab is a keyboard shortcut that can be used to access the documentation for the function or object that appears immediately before the cursor. When you press shift+tab, a small pop-up window will appear that contains information about the function or object, including its arguments and their types. Pressing shift+tab multiple times will cycle through different levels of documentation. If nothing is selected it will show the tip of the current cell. When running notebooks on Google Colab, you can trigger the documentation by clicking the function and then hovering over it with the cursor.

Before getting started, we make sure that the libraries are properly imported in our current environment. Do this by running the cell below.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

0.1 Numpy Fundamentals

Basic data structures

(a)

Vectors, matrices and tensors can be represented as numpy arrays. Numpy arrays are often initialized from regular Python lists. For example, the Python list [1, 2, 3] can be converted into a 1D numpy array using the command np.array([1, 2, 3]). Create this numpy array in the cell below and print its shape. You can find the shape of a numpy array A using np.shape(A).

You can create 1D arrays with n elements using for example np.zeros(n), np.ones(n), np.arange(a, b), np.random.rand(n), np.linspace(a, b, n)... Try this out and figure out what the different functions do.

(b)

Similarly, a 2D numpy array can be created from a nested Python list (a list of lists). Convert the nested lists [ [1, 2, 3] ] and [ [1], [2], [3] ] into numpy arrays and print their shapes. Which one represents a column vector, and which one represents a row vector?

(c)

Create a 2D numpy array to represent the matrix D and inspect its shape. $$ \textbf{D} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \\ \end{bmatrix} $$

We can create higher-dimensional ndarrays in a similar way. You can create ndarrays of shape (n, m) using for example np.zeros((n, m)), np.ones((n, m)), np.random.rand(n, m). You can also use np.eye(n) to create an identity matrix of shape (n, n). Try this out and make sure you understand what the different functions do.

In [3]:
# enter your code here
# a)
A = np.array([1, 2, 3])
print(f'The shape of A is: {np.shape(A)}')

A1 = np.zeros(3)
A2 = np.ones(4)
A3 = np.arange(1, 10)
A4 = np.linspace(0, 10, 5)
A5 = np.random.rand(3)

print(f'Array A1: {A1}, shape {np.shape(A1)}')
print(f'Array A2: {A2}, shape {np.shape(A2)}')
print(f'Array A3: {A3}, shape {np.shape(A3)}')
print(f'Array A4: {A4}, shape {np.shape(A4)}')
print(f'Array A5: {A5}, shape {np.shape(A5)}')

# b)
B = np.array([[1, 2, 3]])       # ROW VECTOR (1x3 matrix)
C = np.array([[1], [2], [3]])   # COLUMN VECTOR (3x1 matrix)
print(f'Shape of B: {np.shape(B)}, shape of C: {np.shape(C)}')

# c)
D = np.array([[1, 2, 3], [4, 5, 6], [7, 8 , 9], [10, 11, 12]])
print(f'The shape of the matrix D is: {np.shape(D)}')

E1 = np.zeros((3,2)) # Return a new array of given shape filled with zeros
E2 = np.ones((4,2)) # Return a new array of given shape filled with ones
E3 = np.random.rand(2,2) # Create an array of the given shape and populate it with random samples from a uniform distribution
E4 = np.eye(2) # Identity matrix of the given dimension

print(f'Array E1:\n {E1}, shape {np.shape(E1)}')
print(f'Array E2:\n {E2}, shape {np.shape(E2)}')
print(f'Array E3:\n {E3}, shape {np.shape(E3)}')
print(f'Array E4:\n {E4}, shape {np.shape(E4)}')
The shape of A is: (3,)
Array A1: [0. 0. 0.], shape (3,)
Array A2: [1. 1. 1. 1.], shape (4,)
Array A3: [1 2 3 4 5 6 7 8 9], shape (9,)
Array A4: [ 0.   2.5  5.   7.5 10. ], shape (5,)
Array A5: [0.08569126 0.40596271 0.39062273], shape (3,)
Shape of B: (1, 3), shape of C: (3, 1)
The shape of the matrix D is: (4, 3)
Array E1:
 [[0. 0.]
 [0. 0.]
 [0. 0.]], shape (3, 2)
Array E2:
 [[1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]], shape (4, 2)
Array E3:
 [[0.8694118  0.92093279]
 [0.09714469 0.20863833]], shape (2, 2)
Array E4:
 [[1. 0.]
 [0. 1.]], shape (2, 2)

Slicing and Indexing

(d)

We can access elements and slice numpy arrays easily. The cell below defines a 1D array F and a 2D array G. Try the commands: F[0], F[-1], F[:-2], G[2,3], G[:,2], G[-1,:3] and figure out what they mean. Make sure you understand how to index and slice numpy arrays of different dimensions.

(e)

It is also easy to assign new values to specific elements or entire rows and columns of numpy arrays. For example, F[0] = 5 replaces the first value of F with 5. Figure out how to replace the last column of G with the array F.

In [4]:
# (d)
F = np.arange(4)
G = np.array([[1, 4, 5, 6, 1], [2, 5, 6, 6, 1], [2, 3, 1, 1, 1], [8, 12, 14, 20, 1]])

# enter your code here
print(f'The array F: {F}')
print(f'Shape of F: {np.shape(F)}')
print(f'F[0] picks out the first element in F: {F[0]}')
print(f'F[-1] picks out the last element in F: {F[-1]}')
print(f'F[:-2] creates a sliced 1D array containing all but the last 2: {F[:-2]}\n')

print(f'The array G:\n {G}')
print(f'Shape of G: {np.shape(G)}')
print(f'G[2,3] picks out the element in row 3, column 4: {G[2,3]} (remember that we start counting rows and columns from 0)')
print(f'G[:,2] picks out the third column in G and stores it as a 1D array: {G[:,2]}, shape: {np.shape(G[:,2])}')
print(f'G[-1,:3] picks out the first three elements of the last row in G and stores them as a 1D array: {G[-1,:3]}, shape: {np.shape(G[-1, :3])}')
# Note that G[-1,:3] is the same as G[3,:3] as 3 is the last row

# (e)
G[:, -1] = F
print(f'Updated array G with F in the last column:\n {G}')
The array F: [0 1 2 3]
Shape of F: (4,)
F[0] picks out the first element in F: 0
F[-1] picks out the last element in F: 3
F[:-2] creates a sliced 1D array containing all but the last 2: [0 1]

The array G:
 [[ 1  4  5  6  1]
 [ 2  5  6  6  1]
 [ 2  3  1  1  1]
 [ 8 12 14 20  1]]
Shape of G: (4, 5)
G[2,3] picks out the element in row 3, column 4: 1 (remember that we start counting rows and columns from 0)
G[:,2] picks out the third column in G and stores it as a 1D array: [ 5  6  1 14], shape: (4,)
G[-1,:3] picks out the first three elements of the last row in G and stores them as a 1D array: [ 8 12 14], shape: (3,)
Updated array G with F in the last column:
 [[ 1  4  5  6  0]
 [ 2  5  6  6  1]
 [ 2  3  1  1  2]
 [ 8 12 14 20  3]]

Reshaping and Stacking

(f)

If we have access to a 1D array H, there are multiple ways of converting this array into a 2D array, i.e. adding a dimension to the numpy array. This can be done using the function np.reshape(H, (n, m)), where (n, m) is the desired shape of the array. Alternatively, we can write H.reshape(n, m). Convert the array H given below into 2D arrays of different shapes (n,m) and inspect the result. What is the requirement on n and m? Can you use the reshape-function to convert H into a column vector and a row vector?
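
The requirement on n and m can be checked directly: the product n*m has to equal the number of elements in the array. The sketch below uses a hypothetical 10-element array H_example (not the array H from the exercise) and shows one shape that works and one that does not.

```python
import numpy as np

H_example = np.arange(10)   # hypothetical 1D array with 10 elements

# Works: 2 * 5 equals the number of elements in the array
print(np.shape(H_example.reshape(2, 5)))

# Fails: 3 * 4 = 12 does not match the 10 elements in the array
reshape_failed = False
try:
    H_example.reshape(3, 4)
except ValueError as err:
    reshape_failed = True
    print(f'Reshape failed: {err}')
```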

(g)

If we want to reshape a 1D array into a column- or row vector and we do not know the size of the array, we can use -1 in place of the unknown size, i.e. np.reshape(array, (-1, 1)). Use this method to convert the array M into a column- and a row vector, and confirm the dimensions using np.shape().

We can also use np.newaxis to expand the dimension of an array M: M2 = M[np.newaxis, :]. Use this method to convert the array M into a column and a row vector as well. This can be important for example when a function requires a 2D array and you have access to your data in a 1D array.

(h)

We can also stack numpy arrays to create new arrays, using for example np.vstack() and np.hstack(). In the example below, F and G are stacked vertically and horizontally. Inspect the results to understand how to stack numpy arrays. Then, create a new array X, a 3x3 diagonal matrix with 4's along the diagonal, and add a row and a column of ones using the appropriate functions. Confirm the shape of the resulting arrays. Remember that np.eye() creates an identity matrix, which you can scale to obtain the diagonal matrix.

In [5]:
# f)
H = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

# enter your code here
n = 5
m = 2
H2 = np.reshape(H, (m, n))
H3 = np.reshape(H, (n, m))
m = 10
n = 1
H4 = np.reshape(H, (m, n))
H5 = np.reshape(H, (n, m))
print(f'With (m, n) = (2, 5), H is converted into the 2D array:\n {H2}')
print(f'With (m, n) = (5, 2), H is converted into the 2D array:\n {H3}')
print(f'With (m, n) = (10, 1), H is converted into the 2D array:\n {H4}') # column vector
print(f'With (m, n) = (1, 10), H is converted into the 2D array:\n {H5}')   # row vector

# g)
M = np.linspace(1, 17, 100)

# enter your code here
print(f'Original shape of M: {np.shape(M)}')

M1_column_vector = M.reshape(-1, 1)
M1_row_vector = M.reshape(1, -1)
M2_column_vector = M[:, np.newaxis]
M2_row_vector = M[np.newaxis, :]

print(f' New shape: np.reshape, column vector shape {np.shape(M1_column_vector)}')
print(f' New shape: np.reshape, row vector shape {np.shape(M1_row_vector)}')
print(f' New shape: np.newaxis, column vector shape {np.shape(M2_column_vector)}')
print(f' New shape: np.newaxis, row vector shape {np.shape(M2_row_vector)}')

# h)
F = np.zeros(4)
G = np.array([[2, 2, 2, 2], [3, 3, 3, 3], [4, 4, 4, 4], [5, 5, 5, 5]])
G_rowstack = np.vstack((G, F))  # adds a row of zeros to G at the bottom (vstack --> vertical addition)
F = F.reshape(-1,1)             # to add a column, we need to reshape F into the appropriate 2D array
G_colstack = np.hstack((G, F))  # adds a column of zeros to G at the right (hstack --> horizontal addition)
print(f'Row extended G:\n {G_rowstack}')
print(f'Column extended G:\n {G_colstack}')

# enter your code here
X = 4*np.eye(3)
c = np.ones(3)
X_row = np.vstack((X, c))  # adds a row of ones to X
c = c.reshape(-1,1)        # to add a column, we need to reshape c into the appropriate 2D array
X_col = np.hstack((X, c))  # adds a column of ones to X
print(f'Row extended X:\n {X_row}')
print(f'Column extended X:\n {X_col}')
With (m, n) = (2, 5), H is converted into the 2D array:
 [[ 10  20  30  40  50]
 [ 60  70  80  90 100]]
With (m, n) = (5, 2), H is converted into the 2D array:
 [[ 10  20]
 [ 30  40]
 [ 50  60]
 [ 70  80]
 [ 90 100]]
With (m, n) = (10, 1), H is converted into the 2D array:
 [[ 10]
 [ 20]
 [ 30]
 [ 40]
 [ 50]
 [ 60]
 [ 70]
 [ 80]
 [ 90]
 [100]]
With (m, n) = (1, 10), H is converted into the 2D array:
 [[ 10  20  30  40  50  60  70  80  90 100]]
Original shape of M: (100,)
 New shape: np.reshape, column vector shape (100, 1)
 New shape: np.reshape, row vector shape (1, 100)
 New shape: np.newaxis, column vector shape (100, 1)
 New shape: np.newaxis, row vector shape (1, 100)
Row extended G:
 [[2. 2. 2. 2.]
 [3. 3. 3. 3.]
 [4. 4. 4. 4.]
 [5. 5. 5. 5.]
 [0. 0. 0. 0.]]
Column extended G:
 [[2. 2. 2. 2. 0.]
 [3. 3. 3. 3. 0.]
 [4. 4. 4. 4. 0.]
 [5. 5. 5. 5. 0.]]
Row extended X:
 [[4. 0. 0.]
 [0. 4. 0.]
 [0. 0. 4.]
 [1. 1. 1.]]
Column extended X:
 [[4. 0. 0. 1.]
 [0. 4. 0. 1.]
 [0. 0. 4. 1.]]

Aggregation and Linear Algebra

(i)

There is a sea of useful numpy functions that you may want to become familiar with. For example, you can find the minimum and maximum element of a numpy array z using np.min(z) and np.max(z). If z is a matrix, you can find the minimum and maximum across rows and columns (i.e. across each axis of the 2D array) using np.min(z, axis=0) and np.max(z, axis=1). You can also find the sum of an entire array, or the sum across columns or rows, using np.sum(). Find the minimum and maximum element of the matrix Z defined below, as well as the sum across the columns of the matrix. $$ \textbf{Z} = \begin{bmatrix} 10 & 0 & 0 \\ 1 & 11 & 1 \\ 2 & 2 & 12 \\ \end{bmatrix} $$
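
The meaning of the axis argument is easy to mix up, so here is a minimal sketch on a hypothetical 2x3 matrix W (not the matrix Z from the exercise). With axis=0 the operation runs down the rows and returns one value per column; with axis=1 it runs across the columns and returns one value per row.

```python
import numpy as np

W = np.array([[1, 2, 3],
              [4, 5, 6]])      # hypothetical 2x3 example matrix

col_min = np.min(W, axis=0)    # minimum of each column -> one value per column
row_max = np.max(W, axis=1)    # maximum of each row    -> one value per row
col_sum = np.sum(W, axis=0)    # sum of each column
row_sum = np.sum(W, axis=1)    # sum of each row
total = np.sum(W)              # sum of all elements

print(f'axis=0 minimum: {col_min}, axis=1 maximum: {row_max}')
print(f'axis=0 sum: {col_sum}, axis=1 sum: {row_sum}, total sum: {total}')
```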

(j)

Arithmetic operations on numpy arrays are straightforward. For example, you may add two arrays A and B of appropriate size simply through A+B, or np.add(A, B). Many useful linear algebra operations are also available in numpy. For example, you can find the transpose of a matrix Z defined as a numpy array using np.transpose(Z) (or simply Z.T). You can find the inverse using np.linalg.inv(Z). Matrix multiplication of two matrices A and B can be performed using np.matmul(A, B) (or simply A@B, where the @-operator implements np.matmul). Note that $A*B$ returns the elementwise multiplication of A and B.
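
The difference between elementwise and matrix multiplication can be seen in a small sketch. The matrices P and Q below are hypothetical examples chosen for illustration only.

```python
import numpy as np

P = np.array([[1, 2], [3, 4]])   # hypothetical example matrices
Q = np.array([[0, 1], [1, 0]])

elementwise = P * Q              # multiplies matching entries
matrix_product = P @ Q           # same as np.matmul(P, Q)
P_transposed = P.T               # swaps rows and columns
P_inverse = np.linalg.inv(P)     # P_inverse @ P is the identity (up to rounding)

print(f'Elementwise product:\n {elementwise}')
print(f'Matrix product:\n {matrix_product}')
print(f'Transpose:\n {P_transposed}')
print(f'Inverse times P:\n {P_inverse @ P}')
```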

Another useful function is the linear system solver. A linear system of the form $Z\cdot x=b$ can be solved efficiently using np.linalg.solve($Z$, $b$). Solve the following linear system both using the matrix inverse and np.linalg.solve:

$$ \begin{bmatrix} 10 & 0 & 0 \\ 1 & 11 & 1 \\ 2 & 2 & 12 \\ \end{bmatrix} x = \begin{bmatrix} 2\\ 1\\ 10\\ \end{bmatrix} $$
In [7]:
# enter your code here
# i)
Z = np.array([[10, 0, 0], [1, 11, 1], [2, 2, 12]])
min_Z = np.min(Z)
max_Z = np.max(Z)
sum_col_Z = np.sum(Z, axis=1)   # axis=1 adds up the entries across the columns, giving one sum per row
print(f'The minimum element of Z is: {min_Z}')
print(f'The maximum element of Z is: {max_Z}')
print(f'The column-wise sum of Z is: {sum_col_Z}')

# j)
b = np.array([2, 1, 10])
x = np.linalg.solve(Z, b)
x2 = np.linalg.inv(Z)@b
print(f'The solution using np.linalg.solve is: {x}')
print(f'The solution using np.linalg.inv is: {x2}')
The minimum element of Z is: 0
The maximum element of Z is: 12
The column-wise sum of Z is: [10 13 16]
The solution using np.linalg.solve is: [0.2 0.  0.8]
The solution using np.linalg.inv is: [ 2.00000000e-01 -1.38777878e-17  8.00000000e-01]
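
The two solutions above differ only by floating-point rounding (a value of about -1.4e-17 instead of exactly 0). One way to confirm that they agree, and that they actually solve the system, is to compare with a tolerance using np.allclose. This is a small sketch; the names x_solve and x_inv are chosen for this illustration.

```python
import numpy as np

Z = np.array([[10, 0, 0], [1, 11, 1], [2, 2, 12]])
b = np.array([2, 1, 10])

x_solve = np.linalg.solve(Z, b)     # solves Z x = b directly
x_inv = np.linalg.inv(Z) @ b        # forms the inverse first, then multiplies

# Compare up to a small numerical tolerance instead of using ==
print(f'Solutions agree: {np.allclose(x_solve, x_inv)}')
# Substitute the solution back into the system
print(f'Z @ x equals b: {np.allclose(Z @ x_solve, b)}')
```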

0.2 Pandas Fundamentals

Pandas dataframes can be used to store data tables, and contain functionality to analyze, explore and manipulate the data in these tables. Numpy arrays can be converted into dataframes, but in this course we will mostly load datasets from csv files to pandas dataframes.

(a)

We begin by exploring the auto-dataset. Run the cell below to load the dataset and store it in a pandas dataframe called 'Auto'. We also print the number of rows and columns in the dataset. The dataset contains information about a number of vehicles. The following features are observed:

  • mpg: miles per gallon
  • cylinders: Number of cylinders between 4 and 8
  • displacement: Engine displacement (cu. inches)
  • horsepower: Engine horsepower
  • weight: Vehicle weight (lbs.)
  • acceleration: Time to accelerate from 0 to 60 mph (sec.)
  • year: Model year (modulo 100)
  • origin: Origin of car (1. American, 2. European, 3. Japanese)
  • name: Vehicle name

To get an overview of the data, we can use the command Auto.info(). Using Auto.describe() we get summaries of some important statistics for each numeric column in the dataset. With Auto.head() we can take a look at the first five rows of the data. Use these functions and get an overview of the data. What information can we get from the dataset, and how many samples have we collected? Each row in the dataframe is a sample (measurement point) that we can use to train our machine learning models.

In [8]:
url = 'https://github.com/uu-sml/course-sml-public/raw/master/data/auto.csv'
Auto = pd.read_csv(url)
print(f'Auto.shape: {Auto.shape}')

# enter your code here
Auto.info()
Auto.describe()
Auto.head()
Auto.shape: (397, 9)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           397 non-null    float64
 1   cylinders     397 non-null    int64  
 2   displacement  397 non-null    float64
 3   horsepower    397 non-null    object 
 4   weight        397 non-null    int64  
 5   acceleration  397 non-null    float64
 6   year          397 non-null    int64  
 7   origin        397 non-null    int64  
 8   name          397 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.0+ KB
Out[8]:
mpg cylinders displacement horsepower weight acceleration year origin name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino

(b)

If we only need a subset of the dataframe, we can create a new dataframe containing this subset. For example, we can create a dataframe X containing only the weight and acceleration features by running the cell below. Explore the new dataframe, and check the shape using X.shape.

In [11]:
X = Auto[["weight", "acceleration"]]

# enter your code here
print(f'X.shape: {X.shape}')
X.info()
X.describe()
X.head()
X.shape: (397, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   weight        397 non-null    int64  
 1   acceleration  397 non-null    float64
dtypes: float64(1), int64(1)
memory usage: 6.3 KB
Out[11]:
weight acceleration
0 3504 12.0
1 3693 11.5
2 3436 11.0
3 3433 12.0
4 3449 10.5

(c)

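We can also slice the dataframe by integer position using Auto.iloc. For example, we can pick out the last column of the dataframe by running the cell below. Note that selecting a single column this way returns a pandas Series rather than a dataframe. Explore the new object X2 as you did in (b). Then create a new dataframe containing multiple columns, again selecting by position.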
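
The following sketch illustrates position-based slicing on a small hypothetical dataframe called toy (a stand-in for Auto, with made-up values). It shows that a single integer column index gives a Series, while a list or a slice of column indices gives a dataframe.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

last_as_series = toy.iloc[:, -1]     # single integer  -> pandas Series, shape (3,)
last_as_frame = toy.iloc[:, [-1]]    # list of integers -> DataFrame, shape (3, 1)
first_two = toy.iloc[:2, 0:2]        # first two rows and first two columns

print(type(last_as_series), last_as_series.shape)
print(type(last_as_frame), last_as_frame.shape)
print(first_two)
```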

In [12]:
X2 = Auto.iloc[:, -1]

# enter your code here
print(f'X2.shape: {X2.shape}')
X2.info()
X2.describe()
X2.head()

# new dataframe
X3 = Auto.iloc[200:, :3] # pick out the first three columns, and leave out the first 200 rows.
print(f'X3.shape: {X3.shape}')
X3.info()
X3.describe()
X2.shape: (397,)
<class 'pandas.core.series.Series'>
RangeIndex: 397 entries, 0 to 396
Series name: name
Non-Null Count  Dtype 
--------------  ----- 
397 non-null    object
dtypes: object(1)
memory usage: 3.2+ KB
X3.shape: (197, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 200 to 396
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           197 non-null    float64
 1   cylinders     197 non-null    int64  
 2   displacement  197 non-null    float64
dtypes: float64(2), int64(1)
memory usage: 4.7 KB
Out[12]:
mpg cylinders displacement
count 197.000000 197.000000 197.000000
mean 27.301015 5.020305 163.807107
std 7.763882 1.498160 82.616728
min 13.000000 3.000000 70.000000
25% 20.200000 4.000000 98.000000
50% 27.200000 4.000000 135.000000
75% 33.500000 6.000000 225.000000
max 46.600000 8.000000 400.000000

(d)

In the course, we will often divide a dataset randomly into a train and a test set. This means that we want a random subset of the entries to go in each sliced dataset. In the cell below, we use numpy's random number generator to generate a numpy array containing the indices of 80% of the entries in the Auto dataset, chosen randomly. The total number of samples is N, and 80% then corresponds to n samples that should go in our train set. We use np.random.choice() to pick n of the N indices at random; they are returned in a numpy array. Then, we use Auto.index.isin() on this array. This function returns a boolean array whose element is True if the corresponding index in Auto is found among the random indices, and False otherwise.

Inspect both the array of random indices, random_index, and the boolean array, train_samples. Finally, we create a boolean array for the test set, which is True in each element where the train boolean array is False, and vice versa. Make sure you understand what is happening in every line of the code.

In [13]:
N = Auto.shape[0]                                               #  total number of samples in the dataset
n = round(0.8*N)                                                #  total number of samples in the train dataset
random_index = np.random.choice(N, size = n, replace = False)   #  replace=False is needed so that the same index does not appear twice in the final list
train_samples = Auto.index.isin(random_index) # boolean array containing True if the sample has been chosen or False otherwise
test_samples = ~train_samples # complementary boolean array
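
To see what the boolean arrays look like, it can help to repeat the same steps on a small example. The sketch below uses a hypothetical 10-row dataframe toy instead of Auto, and a seeded generator from np.random.default_rng (an assumption made here so that the result is reproducible; the cell above uses np.random.choice directly).

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row stand-in for the Auto dataframe
toy = pd.DataFrame({'value': np.arange(10) * 10})

rng = np.random.default_rng(0)            # seeded generator (assumption, for reproducibility)
N = toy.shape[0]                          # total number of samples
n = round(0.8 * N)                        # number of samples in the train set
random_index = rng.choice(N, size=n, replace=False)

train_mask = toy.index.isin(random_index) # True where the row index was drawn
test_mask = ~train_mask                   # True exactly where train_mask is False

print(f'Random indices: {random_index}')
print(f'Train mask: {train_mask}')
print(f'Test mask:  {test_mask}')
```

Every row ends up in exactly one of the two sets, since the test mask is the elementwise negation of the train mask.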

(e)

Create new dataframes containing only the rows selected for the train set and the test set, respectively. This can be done by passing the corresponding boolean array to Auto.iloc. Inspect the train and test sets. Are the shapes as you expect?

In [15]:
# enter your code here
train_set = Auto.iloc[train_samples]
test_set = Auto.iloc[test_samples]

test_set.info()
train_set.info()
print(f'Train set shape: {train_set.shape}')
print(f'Test set shape: {test_set.shape}')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 79 entries, 3 to 395
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           79 non-null     float64
 1   cylinders     79 non-null     int64  
 2   displacement  79 non-null     float64
 3   horsepower    79 non-null     object 
 4   weight        79 non-null     int64  
 5   acceleration  79 non-null     float64
 6   year          79 non-null     int64  
 7   origin        79 non-null     int64  
 8   name          79 non-null     object 
dtypes: float64(3), int64(4), object(2)
memory usage: 6.2+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 318 entries, 0 to 396
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           318 non-null    float64
 1   cylinders     318 non-null    int64  
 2   displacement  318 non-null    float64
 3   horsepower    318 non-null    object 
 4   weight        318 non-null    int64  
 5   acceleration  318 non-null    float64
 6   year          318 non-null    int64  
 7   origin        318 non-null    int64  
 8   name          318 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 24.8+ KB
Train set shape: (318, 9)
Test set shape: (79, 9)

0.3 Matplotlib Fundamentals

(a)

Matplotlib is an extensive Python library for data visualization. It is used along with numpy and is often used in machine learning to create visualizations of data, such as line plots, scatter plots, bar plots, histograms, and 3D plots. These visualizations can be useful for understanding the behavior of the data and the performance of machine learning models. It is also used to visualize the performance of a model during the training process, as well as its predictions on the test data set.

There are tools for a variety of different kinds of plots. Check the documentation (https://matplotlib.org/) for information on design choices and different plotting options.

In the cell below you can find a simple example on how to plot numpy arrays.

In [16]:
dinosaur_fossils = np.array([5, 15, 34, 9, 122, 420, 850])
year = np.array([1820, 1860, 1900, 1940, 1980, 2000, 2020])
plt.figure(1)
plt.plot(year, dinosaur_fossils, 'g-*', label='Fossil Count by Year')
plt.legend()
plt.title('Dinosaur Fossils Found over Time')
plt.xlabel('Year')
plt.ylabel('Fossil Count')
#plt.savefig('dinosaur_fossil.png')   # you can use this command to save a figure to the main project folder
plt.show()

(b)

In the cell below, we have returned to the Auto dataframe. Auto.groupby('year') returns a Pandas DataFrameGroupBy object that groups the rows by model year. Auto.groupby('year').mean() takes the mean of each numeric feature column within each model year. Inspect the resulting dataframe and plot the mean acceleration as a function of time. You can convert a dataframe A to a numpy array using A.to_numpy().
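
As an illustration of how the grouping works, the sketch below applies the same steps to a small hypothetical dataframe toy with made-up values. The argument numeric_only=True is used here so that the text column is skipped when taking the mean.

```python
import pandas as pd

# Hypothetical toy data with two model years
toy = pd.DataFrame({
    'year': [70, 70, 71, 71],
    'acceleration': [10.0, 12.0, 14.0, 16.0],
    'name': ['car a', 'car b', 'car c', 'car d'],
})

# One row per model year; numeric_only=True leaves out the text column 'name'
toy_means = toy.groupby('year').mean(numeric_only=True)
print(toy_means)

# The group labels end up in the index of the result
toy_years = toy_means.index.to_numpy()
toy_acceleration = toy_means['acceleration'].to_numpy()
print(toy_years, toy_acceleration)
```

Reading the x-axis values from the index of the grouped dataframe avoids having to hard-code the range of model years.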

In [17]:
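year_data = Auto.groupby('year').mean(numeric_only=True)   # numeric_only=True skips the text columns; newer pandas versions may raise an error without it

# enter your code here
year_data.info()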

mean_acceleration = year_data[['acceleration']].to_numpy()
model_year = np.arange(70, 83, 1) # values for the x-axis

plt.figure(2)
plt.plot(model_year[:, np.newaxis], mean_acceleration, 'bo-', label = 'mean acceleration')
plt.title('Mean Acceleration over Time')
plt.xlabel('Year')
plt.ylabel('Mean Acceleration [s]')
plt.legend()
# plt.savefig('mean_acc.png')
plt.show()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13 entries, 70 to 82
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           13 non-null     float64
 1   cylinders     13 non-null     float64
 2   displacement  13 non-null     float64
 3   weight        13 non-null     float64
 4   acceleration  13 non-null     float64
 5   origin        13 non-null     float64
dtypes: float64(6)
memory usage: 728.0 bytes

(c) Subplot examples

In matplotlib, a figure can contain multiple subplots, which are organized in a grid-like pattern. You can create a new figure and add subplots to it using the plt.figure() and plt.subplot() functions. The figure() function creates a new figure, and the subplot() function is used to add subplots to the figure.

In the following cell, there is an example that creates a figure with 2 rows and 2 columns of subplots, and then plots a sine wave in each subplot. Inspect the code and, if you wish, play around with the plt.subplots-command, for example by plotting data from the Auto-dataframe, if you want further practice.

In [18]:
# Create a new figure with 2x2 subplots
fig, axs = plt.subplots(2, 2)

# Generate data
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# Plot a sine wave in each subplot
axs[0, 0].plot(x, y)
axs[0, 1].plot(x, y)
axs[1, 0].plot(x, y)
axs[1, 1].plot(x, y)

# Add labels and titles
axs[0, 0].set_title('Sine wave 1')
axs[0, 1].set_title('Sine wave 2')
axs[1, 0].set_title('Sine wave 3')
axs[1, 1].set_title('Sine wave 4')

# Show the figure
plt.subplots_adjust(wspace= 0.5, hspace= 0.5) # function that allows us to adjust the spacing between subplots in a figure
plt.show()

You can also use plt.subplots(nrows, ncols, sharex = True, sharey = True), which creates a figure with nrows x ncols subplots that share their x and y axes. In the following example, we create a figure with 2 rows and 1 column of subplots. The sharex and sharey arguments are set to True when creating the subplots with plt.subplots(). This means that the x-axis of the first subplot (ax1) will be shared with the x-axis of the second subplot (ax2), and the y-axis of the first subplot (ax1) will be shared with the y-axis of the second subplot (ax2).

Note that this way of creating the subplots is useful when you want to compare two plots that share the same axis scales, as it ensures that the x and y axis will be consistent across the subplots, regardless of the data that is being plotted.

In [20]:
# Create a new figure with 2x1 subplots
fig, (ax1, ax2) = plt.subplots(2, 1, sharex = True, sharey =True)

# Generate data
x = np.linspace(0, 2 * np.pi, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Plot the data
ax1.plot(x, y1, label='sin(x)')
ax2.plot(x, y2, label='cos(x)')

# Add labels and titles
ax1.set_title('Sine wave')
ax2.set_title('Cosine wave')

# Add a legend
ax1.legend()
ax2.legend()

# Show the figure
plt.show()

(d)

Create a bar chart that displays the population of four different cities: New York, Los Angeles, Chicago, and Houston. Use the function plt.bar(). Check the documentation if you need information on how this function is used. You have access to the following data:

New York: 8.4 million

Los Angeles: 4.0 million

Chicago: 2.7 million

Houston: 2.3 million

Make sure to add appropriate titles and labels to your figure.

In [21]:
# enter your code here
# Data
cities = ['New York', 'Los Angeles', 'Chicago', 'Houston']
population = [8.4, 4.0, 2.7, 2.3]

# Create the bar chart
plt.bar(cities, population)

# Add labels and titles
plt.xlabel('City')
plt.ylabel('Population (millions)')
plt.title('Population of major cities in the US')

# Show the chart
plt.show()

(e)

Create a plot that shows the number of registered cars in Denmark, Norway, and Sweden from 2000 to 2020. Be sure to include axis labels and a legend. Use the following simulated data:

denmark_cars = [390000, 390000, 410000, 425000, 430000, 450000, 450000, 450000, 450000, 450000, 460000, 470000, 480000, 490000, 490000, 500000, 510000, 520000, 530000, 550000, 560000]

norway_cars = [200000, 200000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 270000, 290000, 300000, 370000, 320000, 330000, 340000, 350000, 360000, 370000, 380000]

sweden_cars = [300000, 310000, 310000, 300000, 300000, 350000, 360000, 370000, 380000, 390000, 400000, 410000, 420000, 410000, 440000, 450000, 460000, 470000, 440000, 490000, 500000]

In [12]:
# enter your code here
# Data
years = np.arange(2000, 2021)
denmark_cars = [390000, 390000, 410000, 425000, 430000, 450000, 450000, 450000,
                450000, 450000, 460000, 470000, 480000, 490000, 490000, 500000,
                510000, 520000, 530000, 550000, 560000]
norway_cars = [200000, 200000, 200000, 210000, 220000, 230000, 240000, 250000,
               260000, 270000, 270000, 290000, 300000, 370000, 320000, 330000,
               340000, 350000, 360000, 370000, 380000]
sweden_cars = [300000, 310000, 310000, 300000, 300000, 350000, 360000, 370000,
               380000, 390000, 400000, 410000, 420000, 410000, 440000, 450000,
               460000, 470000, 440000, 490000, 500000]

# Plotting
plt.plot(years, denmark_cars, label='Denmark')
plt.plot(years, norway_cars, label='Norway')
plt.plot(years, sweden_cars, label='Sweden')
plt.xticks(range(years[0],years[-1]+1,5))
plt.xlabel('Year')
plt.ylabel('Number of registered cars')
plt.legend()
plt.show()

(f) Advanced example

This problem is intended to show that much more advanced plots can be created with matplotlib, compared to the basic applications we have seen so far.

Example: Create a 3D surface plot of the function $z = -\left(\sin(x)\cdot \cos(y)\cdot \exp\left(\left|1 - \sqrt{x^2 + y^2}/\pi\right|\right)\right)$ over the range $-5 \leq x \leq 5$ and $-5 \leq y \leq 5$. Use a color map to indicate the value of $z$ and include proper axis labels and a colorbar.

In [24]:
from mpl_toolkits.mplot3d import Axes3D # library to create a 3D plot

# Data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
x, y = np.meshgrid(x, y)
z = -(np.sin(x) * np.cos(y) * np.exp(np.abs(1 - np.sqrt(x**2 + y**2)/np.pi)))

# Plotting
fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
surf = ax.plot_surface(x, y, z, cmap = 'coolwarm')
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
ax.set_zlabel('$z$')   # label the vertical axis as well
cbar = fig.colorbar(surf, shrink = 0.5, aspect = 5)
plt.show()

In this graph, $x_1$ and $x_2$ could be the parameters of our machine learning model and $z$ (the vertical axis) could be the cost function we want to optimize. Later in the course, we will see exactly what the cost function is and we will also study the gradient descent method which will allow us to find a local minimum of a differentiable function.

0.4 Conditional statements and for-loops

This section contains some examples on how to use if-statements, for-loops and function definitions in Python.

(a)

As an example, suppose that the maximum score of the SML exam is 50, and that the limits for the grades 3, 4, and 5 are clearly specified (23, 33 and 43 points in the code below). The following exam scores are listed for four students: 47, 33, 24 and 22. The following code assigns the correct grade to each student:

In [25]:
def assign_grade(exam_score):
    grade = 0
    if exam_score >= 43:
        grade = 5
    elif exam_score >= 33:
        grade = 4
    elif exam_score >= 23:
        grade = 3
    else:
        grade = 'U' # Failed!
    return grade

# We have 4 scores: 47, 33, 24, 22
score1 = 47
grade1 = assign_grade(score1)
print(f'Student 1: score {score1}, grade {grade1}')

score2 = 33
grade2 = assign_grade(score2)
print(f'Student 2: score {score2}, grade {grade2}')

score3 = 24
grade3 = assign_grade(score3)
print(f'Student 3: score {score3}, grade {grade3}')

score4 = 22
grade4 = assign_grade(score4)
print(f'Student 4: score {score4}, grade {grade4}')
Student 1: score 47, grade 5
Student 2: score 33, grade 4
Student 3: score 24, grade 3
Student 4: score 22, grade U

(b)

The grades can also be assigned using a for-loop:

In [26]:
# Alternative 1: Loop over list
print('Loop over list:')
scores = [47, 33, 24, 22]
cnt = 0
for score in scores:
    grade = assign_grade(score)
    print(f'Student {cnt+1}: score {score}, grade {grade}')
    cnt+=1

# Alternative 2: Loop using range
print('Loop using range:')
num_students = len(scores)
for i in range(num_students):
    grade = assign_grade(scores[i])
    print(f'Student {i+1}: score {scores[i]}, grade {grade}')
Loop over list:
Student 1: score 47, grade 5
Student 2: score 33, grade 4
Student 3: score 24, grade 3
Student 4: score 22, grade U
Loop using range:
Student 1: score 47, grade 5
Student 2: score 33, grade 4
Student 3: score 24, grade 3
Student 4: score 22, grade U
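
A third option, not used in the cell above, is the built-in enumerate function, which returns the position and the element at the same time so that no manual counter is needed. The sketch below repeats the grading function from (a) to be self-contained; the list results is introduced only for this illustration.

```python
def assign_grade(exam_score):
    # Same grade limits as in (a)
    if exam_score >= 43:
        return 5
    elif exam_score >= 33:
        return 4
    elif exam_score >= 23:
        return 3
    return 'U'  # Failed!

scores = [47, 33, 24, 22]
results = []

# start=1 makes the student numbering begin at 1 instead of 0
for student_number, score in enumerate(scores, start=1):
    grade = assign_grade(score)
    results.append((student_number, score, grade))
    print(f'Student {student_number}: score {score}, grade {grade}')
```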

(c)

If we want to save the converted grades to a list, there is a short syntax (called list comprehension) to do that with one line:

In [28]:
grades = [assign_grade(score) for score in scores]

print('List of scores: ', scores)
print('List of grades: ', grades)
List of scores:  [47, 33, 24, 22]
List of grades:  [5, 4, 3, 'U']