{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# for Google Colab\n",
    "import os\n",
    "if 'COLAB_JUPYTER_IP' in os.environ:\n",
    "    !git clone https://github.com/bokulich-lab/DataVisualizationBook.git book\n",
    "\n",
    "    from book.utils import utils\n",
    "    utils.ensure_packages('book/requirements.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2023-05-05T10:30:23.334001Z",
     "iopub.status.busy": "2023-05-05T10:30:23.333085Z",
     "iopub.status.idle": "2023-05-05T10:30:25.556341Z",
     "shell.execute_reply": "2023-05-05T10:30:25.555330Z",
     "shell.execute_reply.started": "2023-05-05T10:30:23.333926Z"
    },
    "jupyter": {
     "outputs_hidden": false
    },
    "tags": [
     "remove-output"
    ]
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import numpy as np\n",
    "\n",
    "# this is to silence pandas' warnings\n",
    "import warnings\n",
    "warnings.simplefilter(action='ignore')\n",
    "%config InlineBackend.figure_format='svg'\n",
    "\n",
    "FONT_FAMILY = 'DejaVu Sans'\n",
    "FONT_SCALE = 1.3\n",
    "\n",
    "# for Google Colab\n",
    "if 'COLAB_JUPYTER_IP' in os.environ:\n",
    "    data_dir = 'book/chapters/data'\n",
    "else:\n",
    "    data_dir = '../data'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data pre-processing\n",
    "\n",
    "All the exercises are based on the Kaggle [**cereals**](https://www.kaggle.com/code/hiralmshah/nutrition-data-analysis-from-80-cereals) dataset, which has 77 records and 16 columns containing nutritional information on different brands of breakfast cereals. The columns are:\n",
    "- **name** - name of the cereal\n",
    "- **mfr** - manufacturer of the cereals. You can find the association of the letter in the dataset with the real name in the `manufacturers_df` we have loaded below.\n",
    "- **type** - hot or cold, the preferred way of eating\n",
    "- **calories** - amount of calories\n",
    "- **fat** - grams of fat\n",
    "- **sodium** - milligrams of sodium\n",
    "- **fiber** - amount in grams\n",
    "- **carbo** - amount of carbohydrates in grams\n",
    "- **sugars** - amount in gram\n",
    "- **potass** - amount in milligrams\n",
    "- **vitamins** - vitamins and minerals (0, 25, 100) as a percentage of the Recommended Dietary Intake\n",
    "- **shelf** - shelf they appear in supermarket (1, 2 or 3 from the floor)\n",
    "- **weight** - weight in ounces\n",
    "- **cups** - number of cups\n",
    "- **rating** - rating of the cereals\n",
    "\n",
    "````{admonition} Note\n",
    "All the values are expresed per 100g portion.\n",
    "````"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>mfr</th>\n",
       "      <th>type</th>\n",
       "      <th>calories</th>\n",
       "      <th>protein</th>\n",
       "      <th>fat</th>\n",
       "      <th>sodium</th>\n",
       "      <th>fiber</th>\n",
       "      <th>carbo</th>\n",
       "      <th>sugars</th>\n",
       "      <th>potass</th>\n",
       "      <th>vitamins</th>\n",
       "      <th>shelf</th>\n",
       "      <th>weight</th>\n",
       "      <th>cups</th>\n",
       "      <th>rating</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>100% Bran</td>\n",
       "      <td>N</td>\n",
       "      <td>C</td>\n",
       "      <td>70</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>130</td>\n",
       "      <td>10.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>6</td>\n",
       "      <td>280</td>\n",
       "      <td>25</td>\n",
       "      <td>3</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.33</td>\n",
       "      <td>68.402973</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>100% Natural Bran</td>\n",
       "      <td>Q</td>\n",
       "      <td>C</td>\n",
       "      <td>120</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>15</td>\n",
       "      <td>2.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>8</td>\n",
       "      <td>135</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.00</td>\n",
       "      <td>33.983679</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>All-Bran</td>\n",
       "      <td>K</td>\n",
       "      <td>C</td>\n",
       "      <td>70</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>260</td>\n",
       "      <td>9.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>5</td>\n",
       "      <td>320</td>\n",
       "      <td>25</td>\n",
       "      <td>3</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.33</td>\n",
       "      <td>59.425505</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>All-Bran with Extra Fiber</td>\n",
       "      <td>K</td>\n",
       "      <td>C</td>\n",
       "      <td>50</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>140</td>\n",
       "      <td>14.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0</td>\n",
       "      <td>330</td>\n",
       "      <td>25</td>\n",
       "      <td>3</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.50</td>\n",
       "      <td>93.704912</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Almond Delight</td>\n",
       "      <td>R</td>\n",
       "      <td>C</td>\n",
       "      <td>110</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>200</td>\n",
       "      <td>1.0</td>\n",
       "      <td>14.0</td>\n",
       "      <td>8</td>\n",
       "      <td>-1</td>\n",
       "      <td>25</td>\n",
       "      <td>3</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.75</td>\n",
       "      <td>34.384843</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        name mfr type  calories  protein  fat  sodium  fiber  \\\n",
       "0                  100% Bran   N    C        70        4    1     130   10.0   \n",
       "1          100% Natural Bran   Q    C       120        3    5      15    2.0   \n",
       "2                   All-Bran   K    C        70        4    1     260    9.0   \n",
       "3  All-Bran with Extra Fiber   K    C        50        4    0     140   14.0   \n",
       "4             Almond Delight   R    C       110        2    2     200    1.0   \n",
       "\n",
       "   carbo  sugars  potass  vitamins  shelf  weight  cups     rating  \n",
       "0    5.0       6     280        25      3     1.0  0.33  68.402973  \n",
       "1    8.0       8     135         0      3     1.0  1.00  33.983679  \n",
       "2    7.0       5     320        25      3     1.0  0.33  59.425505  \n",
       "3    8.0       0     330        25      3     1.0  0.50  93.704912  \n",
       "4   14.0       8      -1        25      3     1.0  0.75  34.384843  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load main dataset\n",
    "cereals_df = pd.read_csv(f'{data_dir}/cereal.csv', sep=',')\n",
    "cereals_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>letter</th>\n",
       "      <th>company_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A</td>\n",
       "      <td>American Home Food Products</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>G</td>\n",
       "      <td>General Mills</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>K</td>\n",
       "      <td>Kelloggs</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>N</td>\n",
       "      <td>Nabisco</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>P</td>\n",
       "      <td>Post</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q</td>\n",
       "      <td>Quaker Oats</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>R</td>\n",
       "      <td>Ralston Purina</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  letter                 company_name\n",
       "0      A  American Home Food Products\n",
       "1      G                General Mills\n",
       "2      K                     Kelloggs\n",
       "3      N                      Nabisco\n",
       "4      P                         Post\n",
       "5      Q                  Quaker Oats\n",
       "6      R               Ralston Purina"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load dataset that maps manufacturer letter codes to their names\n",
    "manufacturers_df = pd.read_csv(f'{data_dir}/manufacturers.csv', index_col=0)\n",
    "manufacturers_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# merge the two datasets\n",
    "cereals = pd.merge(\n",
    "    cereals_df, manufacturers_df, left_on=cereals_df.mfr, right_index=True\n",
    ")\n",
    "# remove duplicated column\n",
    "cereals.drop('key_0', axis=1, inplace=True)\n",
    "cereals.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 1\n",
    "Plot the **number of products per manufacturer** by displaying the manufacturer's name instead of the letter that appears in the `cereals_df` dataframe. All the data you need is found in the `cereals` DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# write your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2\n",
    "Plot the **distribution of ratings per company** checking at the same time if there are any **outliers**. You can find the necessary data in the `data` DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = cereals[['company_name', 'rating']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# write your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3\n",
    "Find and visualize the **ratings per product**. You will find the necessary data in the `data` DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = cereals[['name', 'rating']].groupby('name').mean().reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# write your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 4\n",
    "Find if there is a **correlation between any of the numerical features** we have in the dataset. Again, you will find the data needed in the `data` DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = cereals[['calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo', 'sugars', 'potass', 'rating']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# write your code here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 5\n",
    "Your next task is to find and visualize these correlations in a more **quantitative** way. The data will be ready for you in the `data` dataframe, you will only have to find the correct visualization method and supply the correct arguments to the function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = cereals[['fiber', 'potass', 'sugars', 'calories','rating']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# write your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 6\n",
    "Using a scatterplot, show how the **potassium amount changes w.r.t. the fiber amount and the rating**. Notice that this requires you to plot three numerical variables at the same time. The data to be used is ready for you in the `data` DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = cereals[['potass', 'fiber', 'rating']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# write your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 7\n",
    "Using a scatterplot, plot **the potassium amount w.r.t. to the fiber amount, the sugar amount and the rating**. Notice that this will required you to find a visualization allowing to display four variables at once. The data to be used is ready for you in the `data` DataFrame. \n",
    "\n",
    "You might find some useful information [here](https://matplotlib.org/stable/gallery/lines_bars_and_markers/scatter_with_legend.html#sphx-glr-gallery-lines-bars-and-markers-scatter-with-legend-py) and [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html#matplotlib.pyplot.scatter)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = cereals[['potass', 'fiber', 'sugars', 'rating']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# write your code here"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.12"
  },
  "vscode": {
   "interpreter": {
    "hash": "3ebf5285390510d61c6f267955324cfe7a2b5dc9b1b60afc1c88049367e6b2cd"
   }
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}