{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Environment setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# for Google Colab\n", "import os\n", "if 'COLAB_JUPYTER_IP' in os.environ:\n", " !git clone https://github.com/bokulich-lab/DataVisualizationBook.git book\n", "\n", " from book.utils import utils\n", " utils.ensure_packages('book/requirements.txt')" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2023-05-05T10:30:23.334001Z", "iopub.status.busy": "2023-05-05T10:30:23.333085Z", "iopub.status.idle": "2023-05-05T10:30:25.556341Z", "shell.execute_reply": "2023-05-05T10:30:25.555330Z", "shell.execute_reply.started": "2023-05-05T10:30:23.333926Z" }, "jupyter": { "outputs_hidden": false }, "tags": [ "remove-output" ] }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import seaborn as sns\n", "import numpy as np\n", "\n", "# this is to silence pandas' warnings\n", "import warnings\n", "warnings.simplefilter(action='ignore')\n", "%config InlineBackend.figure_format='svg'\n", "\n", "FONT_FAMILY = 'DejaVu Sans'\n", "FONT_SCALE = 1.3\n", "\n", "# for Google Colab\n", "if 'COLAB_JUPYTER_IP' in os.environ:\n", " data_dir = 'book/chapters/data'\n", "else:\n", " data_dir = '../data'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data pre-processing\n", "\n", "All the exercises are based on the Kaggle [**cereals**](https://www.kaggle.com/code/hiralmshah/nutrition-data-analysis-from-80-cereals) dataset, which has 77 records and 16 columns containing nutritional information on different brands of breakfast cereals. The columns are:\n", "- **name** - name of the cereal\n", "- **mfr** - manufacturer of the cereals. 
The mapping from each letter code to the manufacturer's full name is in the `manufacturers_df` DataFrame loaded below.\n", "- **type** - preferred way of eating: hot (H) or cold (C)\n", "- **calories** - number of calories\n", "- **protein** - grams of protein\n", "- **fat** - grams of fat\n", "- **sodium** - milligrams of sodium\n", "- **fiber** - grams of dietary fiber\n", "- **carbo** - grams of carbohydrates\n", "- **sugars** - grams of sugars\n", "- **potass** - milligrams of potassium\n", "- **vitamins** - vitamins and minerals (0, 25, 100) as a percentage of the Recommended Dietary Intake\n", "- **shelf** - supermarket shelf on which the cereal is displayed (1, 2 or 3, counting from the floor)\n", "- **weight** - weight in ounces\n", "- **cups** - number of cups\n", "- **rating** - rating of the cereal\n", "\n", "````{admonition} Note\n", "All the values are expressed per 100 g portion.\n", "````" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n", "<thead>\n",
"<tr><th></th><th>name</th><th>mfr</th><th>type</th><th>calories</th><th>protein</th><th>fat</th><th>sodium</th><th>fiber</th><th>carbo</th><th>sugars</th><th>potass</th><th>vitamins</th><th>shelf</th><th>weight</th><th>cups</th><th>rating</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>100% Bran</td><td>N</td><td>C</td><td>70</td><td>4</td><td>1</td><td>130</td><td>10.0</td><td>5.0</td><td>6</td><td>280</td><td>25</td><td>3</td><td>1.0</td><td>0.33</td><td>68.402973</td></tr>\n",
"<tr><th>1</th><td>100% Natural Bran</td><td>Q</td><td>C</td><td>120</td><td>3</td><td>5</td><td>15</td><td>2.0</td><td>8.0</td><td>8</td><td>135</td><td>0</td><td>3</td><td>1.0</td><td>1.00</td><td>33.983679</td></tr>\n",
"<tr><th>2</th><td>All-Bran</td><td>K</td><td>C</td><td>70</td><td>4</td><td>1</td><td>260</td><td>9.0</td><td>7.0</td><td>5</td><td>320</td><td>25</td><td>3</td><td>1.0</td><td>0.33</td><td>59.425505</td></tr>\n",
"<tr><th>3</th><td>All-Bran with Extra Fiber</td><td>K</td><td>C</td><td>50</td><td>4</td><td>0</td><td>140</td><td>14.0</td><td>8.0</td><td>0</td><td>330</td><td>25</td><td>3</td><td>1.0</td><td>0.50</td><td>93.704912</td></tr>\n",
"<tr><th>4</th><td>Almond Delight</td><td>R</td><td>C</td><td>110</td><td>2</td><td>2</td><td>200</td><td>1.0</td><td>14.0</td><td>8</td><td>-1</td><td>25</td><td>3</td><td>1.0</td><td>0.75</td><td>34.384843</td></tr>\n",
"</tbody>\n", "</table>
\n", "
" ], "text/plain": [ " name mfr type calories protein fat sodium fiber \\\n", "0 100% Bran N C 70 4 1 130 10.0 \n", "1 100% Natural Bran Q C 120 3 5 15 2.0 \n", "2 All-Bran K C 70 4 1 260 9.0 \n", "3 All-Bran with Extra Fiber K C 50 4 0 140 14.0 \n", "4 Almond Delight R C 110 2 2 200 1.0 \n", "\n", " carbo sugars potass vitamins shelf weight cups rating \n", "0 5.0 6 280 25 3 1.0 0.33 68.402973 \n", "1 8.0 8 135 0 3 1.0 1.00 33.983679 \n", "2 7.0 5 320 25 3 1.0 0.33 59.425505 \n", "3 8.0 0 330 25 3 1.0 0.50 93.704912 \n", "4 14.0 8 -1 25 3 1.0 0.75 34.384843 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# load main dataset\n", "cereals_df = pd.read_csv(f'{data_dir}/cereal.csv', sep=',')\n", "cereals_df.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n", "<thead>\n",
"<tr><th></th><th>letter</th><th>company_name</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>A</td><td>American Home Food Products</td></tr>\n",
"<tr><th>1</th><td>G</td><td>General Mills</td></tr>\n",
"<tr><th>2</th><td>K</td><td>Kelloggs</td></tr>\n",
"<tr><th>3</th><td>N</td><td>Nabisco</td></tr>\n",
"<tr><th>4</th><td>P</td><td>Post</td></tr>\n",
"<tr><th>5</th><td>Q</td><td>Quaker Oats</td></tr>\n",
"<tr><th>6</th><td>R</td><td>Ralston Purina</td></tr>\n",
"</tbody>\n", "</table>
\n", "
" ], "text/plain": [ " letter company_name\n", "0 A American Home Food Products\n", "1 G General Mills\n", "2 K Kelloggs\n", "3 N Nabisco\n", "4 P Post\n", "5 Q Quaker Oats\n", "6 R Ralston Purina" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# load dataset that maps manufacturer letter codes to their names\n", "manufacturers_df = pd.read_csv(f'{data_dir}/manufacturers.csv', index_col=0)\n", "manufacturers_df" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# merge the two datasets on the manufacturer letter code\n", "# (`letter` is a regular column of manufacturers_df, not its index)\n", "cereals = pd.merge(\n", " cereals_df, manufacturers_df, left_on='mfr', right_on='letter'\n", ")\n", "# remove the duplicated key column\n", "cereals = cereals.drop(columns='letter')\n", "cereals.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1\n", "Plot the **number of products per manufacturer**, displaying the manufacturer's name instead of the letter that appears in the `cereals_df` DataFrame. All the data you need is found in the `cereals` DataFrame." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# write your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2\n", "Plot the **distribution of ratings per company**, checking at the same time whether there are any **outliers**. You can find the necessary data in the `data` DataFrame." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "data = cereals[['company_name', 'rating']]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# write your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3\n", "Find and visualize the **ratings per product**. You will find the necessary data in the `data` DataFrame." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "data = cereals[['name', 'rating']].groupby('name').mean().reset_index()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# write your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4\n", "Find out whether there is a **correlation between any of the numerical features** in the dataset. Again, the data you need is in the `data` DataFrame." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "data = cereals[['calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo', 'sugars', 'potass', 'rating']]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# write your code here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 5\n", "Your next task is to find and visualize these correlations in a more **quantitative** way. The data is ready for you in the `data` DataFrame; you only have to find the right visualization method and supply the correct arguments to the function." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "data = cereals[['fiber', 'potass', 'sugars', 'calories', 'rating']]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# write your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 6\n", "Using a scatterplot, show how the **potassium amount changes w.r.t. the fiber amount and the rating**. Notice that this requires you to plot three numerical variables at the same time. The data to be used is ready for you in the `data` DataFrame." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "data = cereals[['potass', 'fiber', 'rating']]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# write your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 7\n", "Using a scatterplot, plot **the potassium amount w.r.t. the fiber amount, the sugar amount and the rating**. Notice that this requires you to find a visualization that can display four variables at once. The data to be used is ready for you in the `data` DataFrame. \n", "\n", "You might find some useful information [here](https://matplotlib.org/stable/gallery/lines_bars_and_markers/scatter_with_legend.html#sphx-glr-gallery-lines-bars-and-markers-scatter-with-legend-py) and [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html#matplotlib.pyplot.scatter)." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "data = cereals[['potass', 'fiber', 'sugars', 'rating']]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# write your code here" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" }, "vscode": { "interpreter": { "hash": "3ebf5285390510d61c6f267955324cfe7a2b5dc9b1b60afc1c88049367e6b2cd" } }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }