{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Environment setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# for Google Colab\n",
"import os\n",
"if 'COLAB_JUPYTER_IP' in os.environ:\n",
" !git clone https://github.com/bokulich-lab/DataVisualizationBook.git book\n",
"\n",
" from book.utils import utils\n",
" utils.ensure_packages('book/requirements.txt')"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2023-05-05T10:30:23.334001Z",
"iopub.status.busy": "2023-05-05T10:30:23.333085Z",
"iopub.status.idle": "2023-05-05T10:30:25.556341Z",
"shell.execute_reply": "2023-05-05T10:30:25.555330Z",
"shell.execute_reply.started": "2023-05-05T10:30:23.333926Z"
},
"jupyter": {
"outputs_hidden": false
},
"tags": [
"remove-output"
]
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import numpy as np\n",
"\n",
"# this is to silence pandas' warnings\n",
"import warnings\n",
"warnings.simplefilter(action='ignore')\n",
"%config InlineBackend.figure_format='svg'\n",
"\n",
"FONT_FAMILY = 'DejaVu Sans'\n",
"FONT_SCALE = 1.3\n",
"\n",
"# for Google Colab\n",
"if 'COLAB_JUPYTER_IP' in os.environ:\n",
" data_dir = 'book/chapters/data'\n",
"else:\n",
" data_dir = '../data'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data pre-processing\n",
"\n",
"All the exercises are based on the Kaggle [**cereals**](https://www.kaggle.com/code/hiralmshah/nutrition-data-analysis-from-80-cereals) dataset, which has 77 records and 16 columns containing nutritional information on different brands of breakfast cereals. The columns are:\n",
"- **name** - name of the cereal\n",
"- **mfr** - manufacturer of the cereals. You can find the association of the letter in the dataset with the real name in the `manufacturers_df` we have loaded below.\n",
"- **type** - hot or cold, the preferred way of eating\n",
"- **calories** - amount of calories\n",
"- **fat** - grams of fat\n",
"- **sodium** - milligrams of sodium\n",
"- **fiber** - amount in grams\n",
"- **carbo** - amount of carbohydrates in grams\n",
"- **sugars** - amount in gram\n",
"- **potass** - amount in milligrams\n",
"- **vitamins** - vitamins and minerals (0, 25, 100) as a percentage of the Recommended Dietary Intake\n",
"- **shelf** - shelf they appear in supermarket (1, 2 or 3 from the floor)\n",
"- **weight** - weight in ounces\n",
"- **cups** - number of cups\n",
"- **rating** - rating of the cereals\n",
"\n",
"````{admonition} Note\n",
"All the values are expresed per 100g portion.\n",
"````"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>name</th><th>mfr</th><th>type</th><th>calories</th><th>protein</th><th>fat</th><th>sodium</th><th>fiber</th><th>carbo</th><th>sugars</th><th>potass</th><th>vitamins</th><th>shelf</th><th>weight</th><th>cups</th><th>rating</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>100% Bran</td><td>N</td><td>C</td><td>70</td><td>4</td><td>1</td><td>130</td><td>10.0</td><td>5.0</td><td>6</td><td>280</td><td>25</td><td>3</td><td>1.0</td><td>0.33</td><td>68.402973</td></tr>\n",
"    <tr><th>1</th><td>100% Natural Bran</td><td>Q</td><td>C</td><td>120</td><td>3</td><td>5</td><td>15</td><td>2.0</td><td>8.0</td><td>8</td><td>135</td><td>0</td><td>3</td><td>1.0</td><td>1.00</td><td>33.983679</td></tr>\n",
"    <tr><th>2</th><td>All-Bran</td><td>K</td><td>C</td><td>70</td><td>4</td><td>1</td><td>260</td><td>9.0</td><td>7.0</td><td>5</td><td>320</td><td>25</td><td>3</td><td>1.0</td><td>0.33</td><td>59.425505</td></tr>\n",
"    <tr><th>3</th><td>All-Bran with Extra Fiber</td><td>K</td><td>C</td><td>50</td><td>4</td><td>0</td><td>140</td><td>14.0</td><td>8.0</td><td>0</td><td>330</td><td>25</td><td>3</td><td>1.0</td><td>0.50</td><td>93.704912</td></tr>\n",
"    <tr><th>4</th><td>Almond Delight</td><td>R</td><td>C</td><td>110</td><td>2</td><td>2</td><td>200</td><td>1.0</td><td>14.0</td><td>8</td><td>-1</td><td>25</td><td>3</td><td>1.0</td><td>0.75</td><td>34.384843</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name mfr type calories protein fat sodium fiber \\\n",
"0 100% Bran N C 70 4 1 130 10.0 \n",
"1 100% Natural Bran Q C 120 3 5 15 2.0 \n",
"2 All-Bran K C 70 4 1 260 9.0 \n",
"3 All-Bran with Extra Fiber K C 50 4 0 140 14.0 \n",
"4 Almond Delight R C 110 2 2 200 1.0 \n",
"\n",
" carbo sugars potass vitamins shelf weight cups rating \n",
"0 5.0 6 280 25 3 1.0 0.33 68.402973 \n",
"1 8.0 8 135 0 3 1.0 1.00 33.983679 \n",
"2 7.0 5 320 25 3 1.0 0.33 59.425505 \n",
"3 8.0 0 330 25 3 1.0 0.50 93.704912 \n",
"4 14.0 8 -1 25 3 1.0 0.75 34.384843 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# load main dataset\n",
"cereals_df = pd.read_csv(f'{data_dir}/cereal.csv', sep=',')\n",
"cereals_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>letter</th><th>company_name</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>A</td><td>American Home Food Products</td></tr>\n",
"    <tr><th>1</th><td>G</td><td>General Mills</td></tr>\n",
"    <tr><th>2</th><td>K</td><td>Kelloggs</td></tr>\n",
"    <tr><th>3</th><td>N</td><td>Nabisco</td></tr>\n",
"    <tr><th>4</th><td>P</td><td>Post</td></tr>\n",
"    <tr><th>5</th><td>Q</td><td>Quaker Oats</td></tr>\n",
"    <tr><th>6</th><td>R</td><td>Ralston Purina</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letter company_name\n",
"0 A American Home Food Products\n",
"1 G General Mills\n",
"2 K Kelloggs\n",
"3 N Nabisco\n",
"4 P Post\n",
"5 Q Quaker Oats\n",
"6 R Ralston Purina"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# load dataset that maps manufacturer letter codes to their names\n",
"manufacturers_df = pd.read_csv(f'{data_dir}/manufacturers.csv', index_col=0)\n",
"manufacturers_df"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# merge the two datasets\n",
"cereals = pd.merge(\n",
" cereals_df, manufacturers_df, left_on=cereals_df.mfr, right_index=True\n",
")\n",
"# remove duplicated column\n",
"cereals.drop('key_0', axis=1, inplace=True)\n",
"cereals.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 1\n",
"Plot the **number of products per manufacturer** by displaying the manufacturer's name instead of the letter that appears in the `cereals_df` dataframe. All the data you need is found in the `cereals` DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# write your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 2\n",
"Plot the **distribution of ratings per company** checking at the same time if there are any **outliers**. You can find the necessary data in the `data` DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"data = cereals[['company_name', 'rating']]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# write your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 3\n",
"Find and visualize the **ratings per product**. You will find the necessary data in the `data` DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"data = cereals[['name', 'rating']].groupby('name').mean().reset_index()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# write your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 4\n",
"Find if there is a **correlation between any of the numerical features** we have in the dataset. Again, you will find the data needed in the `data` DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"data = cereals[['calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo', 'sugars', 'potass', 'rating']]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# write your code here\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 5\n",
"Your next task is to find and visualize these correlations in a more **quantitative** way. The data will be ready for you in the `data` dataframe, you will only have to find the correct visualization method and supply the correct arguments to the function."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"data = cereals[['fiber', 'potass', 'sugars', 'calories','rating']]"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# write your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 6\n",
"Using a scatterplot, show how the **potassium amount changes w.r.t. the fiber amount and the rating**. Notice that this requires you to plot three numerical variables at the same time. The data to be used is ready for you in the `data` DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"data = cereals[['potass', 'fiber', 'rating']]"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# write your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 7\n",
"Using a scatterplot, plot **the potassium amount w.r.t. to the fiber amount, the sugar amount and the rating**. Notice that this will required you to find a visualization allowing to display four variables at once. The data to be used is ready for you in the `data` DataFrame. \n",
"\n",
"You might find some useful information [here](https://matplotlib.org/stable/gallery/lines_bars_and_markers/scatter_with_legend.html#sphx-glr-gallery-lines-bars-and-markers-scatter-with-legend-py) and [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html#matplotlib.pyplot.scatter)."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"data = cereals[['potass', 'fiber', 'sugars', 'rating']]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# write your code here"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
},
"vscode": {
"interpreter": {
"hash": "3ebf5285390510d61c6f267955324cfe7a2b5dc9b1b60afc1c88049367e6b2cd"
}
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}