"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
},
"language_info": {
   "codemirror_mode": {
   "name": "ipython",
   "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
},
"colab": {
   "name": "assignment_10.ipynb",
   "provenance": [],
   "collapsed_sections": []
}
},
"cells": [
{
   "cell_type": "code",
   "metadata": {
   "id": "sR-faGKaT2Hp",
   "colab_type": "code",
   "colab": {}
   },
   "source": [
   "#-*- coding: utf-8 -*-"
   ],
   "execution_count": 0,
   "outputs": []
},
{
   "cell_type": "markdown",
   "metadata": {
   "id": "G5rUG6pLT2Hu",
   "colab_type": "text"
   },
   "source": [
   "<img align=\"right\" style=\"max-width: 200px; height: auto\" src=\"https://i.imgur.com/iNjt9Ic.png\">\n",
   "\n",
   "# Exercise 10 - \"Machine Learning II: <br /> Supervised Learning\"\n",
   "\n",
   "Fundamentals and Methods of Computer Science, University of St. Gallen, Autumn Term 2019"
   ]
},
{
   "cell_type": "markdown",
   "metadata": {
   "id": "g73A6G8aT2Hv",
   "colab_type": "text"
   },
   "source": [
   "## Introduction\n",
   "In this assignment we will continue where we left off with the previous one. Last week, you have learned to evaluate performance of Machine Learning models, now we will actually train such models. In this exercise we will cover training and evaluation of two common **classifiers**: Naive Bayes and k-Nearest Neighbors (kNN). Next week in our last exercise, we will train **Neural Networks** to classify images.\n",
   "\n",
   "<img align=\"center\" style=\"max-width: 800px; height: auto\" src=\"https://i.imgur.com/iv6NSf0.png\">"
   ]
},
{
   "cell_type": "markdown",
   "metadata": {
   "id": "ZzORWC1QT2Hw",
   "colab_type": "text"
   },
   "source": [
   "Before we start let's watch another motivational video:"
   ]
},
{
   "cell_type": "code",
   "metadata": {
   "id": "Ji8Quf0gT2Hw",
   "colab_type": "code",
   "colab": {}
   },
   "source": [
   "from IPython.display import YouTubeVideo\n",
   "# Google AI: \"Detecting cancer in real-time with machine learning\"\n",
   "YouTubeVideo('9Mz84cwVmS0', width=1024, height=576)"
   ],
   "execution_count": 0,
   "outputs": []
},
{
   "cell_type": "markdown",
   "metadata": {
   "id": "m7PWqdeDT2H0",
   "colab_type": "text"
   },
   "source": [
   "## What classifiers are we going to use?\n",
   "\n",
   "The **Naive Bayes (NB)** classifier belongs to the family of simple \"probabilistic classifiers\" based on applying Bayes' theorem with a strong (naive) independence assumptions between the features. Naive Bayes has been studied extensively since the 1950s and remains an accessible (baseline) method for text categorization as well as other domains.\n",
   "\n",
   "\n",
   "The **k-Nearest Neighbors (kNN)** is a simple, easy to understand, versatile, but powerful machine learning algorithm. Until recently (prior to the advent of deep learning approaches) it was used in a variety of applications such as finance, healthcare, political science, handwriting detection, image recognition and video recognition. In Credit ratings, financial institutes will predict the credit rating of customers. "
   ]
},
{
   "cell_type": "markdown",
   "metadata": {
   "id": "L-aHiOs7T2H1",
   "colab_type": "text"
   },
   "source": [
   "## Exercise structure\n",
   "The following exercise is structured according to the following tasks:\n",
   "\n",
   "**Task 1:** Gaussian Naive Bayes Classification - 4 points\n",
   "> 1.1 Calculation of the Prior Probabilities $P(y)$ of each Class \n",
   "> 1.2 Calculation of the Evidence $P(x)$ of each Feature \n",
 1.3 Calculation of the likelihood">
   "> 1.3 Calculation of the Likelihood $P(x|y)$ of each Feature \n",
   "> 1.4 Calculation of the Posterior Probabilities $P(y|x)$ of sample $x$ belonging to the given classes \n",
   "\n",
   "**Task 2:** k-Nearest-Neighbors Classification - 4 points\n",
   "> 2.1 Dataset Pre-Processing \n",
   "> 2.2 Distance Between (potential) Neighbors \n",
   "> 2.3 Choosing the Class from the Neighbors \n",
   "> 2.4 k-Nearest-Neighbor (kNN) Classification \n",
   "> 2.5 kNN Performance over Different `k` Values\n",
   "\n",
   "**Task 3:** Supervised Learning - Understanding (Multiple Choice) - 2 points"
   ]
},
{
   "cell_type": "markdown",
   "metadata": {
   "id": "v-wKdNd2T2H1",
   "colab_type": "text"
   },
   "source": [
   "## Setup of the Assignment Environment"
   ]
},
{
   "cell_type": "markdown",
   "metadata": {
   "id": "paJtVtlBT2H2",
   "colab_type": "text"
   },
   "source": [
   "Similar to the previous labs, we need to import a couple of Python libraries that allow for data analysis and data visualization. In this assignment will use [pandas](https://pandas.pydata.org/), [numpy](https://numpy.org/), [sklearn](https://scikit-learn.org/), [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/) libraries. Let's import the libraries by the execution of the statements below:"
   ]
},
{
   "cell_type": "code",
   "metadata": {
   "id": "qXENPFXQT2H3",
   "colab_type": "code",
   "colab": {}
   },
   "source": [
   "# import the numpy, scipy and pandas data science library\n",
   "import pandas as pd\n",
   "import numpy as np\n",
   "from scipy.stats import norm as nm\n",
   "\n",
   "# import sklearn data and data pre-processing libraries\n",
   "#import sklearn\n",
   "from sklearn import datasets\n",
   "\n",
   "# import sklearn naive.bayes and k-nearest neighbor classifier library\n",
   "from sklearn.naive_bayes import GaussianNB\n",
   "from sklearn.neighbors import KNeighborsClassifier\n",
   "\n",
   "# import sklearn classification evaluation library\n",
   "from sklearn import metrics\n",
   "from sklearn.metrics import classification_report, confusion_matrix\n",
   "from sklearn.model_selection import train_test_split\n",
   "\n",
   "# import matplotlib data visualization library\n",
   "import matplotlib.pyplot as plt\n",
   "import seaborn as sns\n",
   "\n",
   "from IPython.display import display # displays data nicely in Notebook environment\n",
   "\n",
   "#Enable inline Jupyter notebook plotting:\n",
   "plt.ion()\n",
   "\n",
   "# set grading variable to False to test your code it will be set to True during grading\n",
   "if __name__ == '__main__':\n",
   " grading = False\n",
   "else:\n",
   " grading = True"
   ],
   "execution_count": 0,
   "outputs": []
},
{
   "cell_type": "markdown",
   "metadata": {
   "id": "nb2lEzqCT2H5",
   "colab_type": "text"
   },
   "source": [
   "## Task 1: Gaussian \"Naive Bayes\" (NB) Classification"
   ]
},
{
   "cell_type": "markdown",
   "metadata": {
   "id": "Jia3zlPYT2H5",
   "colab_type": "text"
   },
   "source": [
   "One popular (and remarkably simple) algorithm is the **Naive Bayes Classifier**. Note that one natural way to address a given classification task is via the probabilistic question: **What is the most likely class $\\hat{y}$ given a set of observations $x$?** Formally, we wish to output a prediction for $y$ by calculating its posterior probabilities $P(y|x)$ given the expression:"
   ]
},
{
   "cell_type": "markdown",
   "metadata": {
   "id": "5k6UFsVYT2H6",
   "colab_type": "text"
   },
   "source": [
   "$$\\hat{y} = \\arg \\max_{y} P(y|x)$$"
   ]
},
{
   "cell_type": "markdown",
   "metadata": {
   "id": "AFHNKfeiT2H7",
   "colab_type": "text"
   },
   "source

Q1. Why do we need Python virtual environments? Please write no more tha...