Exercises
7.1. Exercises
7.1.1. Environment setup
# for Google Colab
import os
if 'COLAB_JUPYTER_IP' in os.environ:
    !git clone https://github.com/bokulich-lab/DataVisualizationBook.git book
    from book.utils import utils
    utils.ensure_packages('book/requirements.txt')
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
# silence all warnings, including those raised by pandas
import warnings
warnings.simplefilter(action='ignore')
%config InlineBackend.figure_format='svg'
FONT_FAMILY = 'DejaVu Sans'
FONT_SCALE = 1.3
# for Google Colab
if 'COLAB_JUPYTER_IP' in os.environ:
    data_dir = 'book/chapters/data'
else:
    data_dir = '../data'
7.1.2. Data pre-processing
All the exercises are based on the Kaggle cereals dataset, which has 77 records and 16 columns containing nutritional information on different brands of breakfast cereals. The columns are:
name - name of the cereal
mfr - manufacturer of the cereal. The mapping from each letter code to the company's real name is in the manufacturers_df DataFrame loaded below.
type - hot or cold, the preferred way of eating
calories - number of calories
protein - grams of protein
fat - grams of fat
sodium - milligrams of sodium
fiber - grams of fiber
carbo - grams of carbohydrates
sugars - grams of sugars
potass - milligrams of potassium
vitamins - vitamins and minerals (0, 25 or 100) as a percentage of the Recommended Dietary Intake
shelf - display shelf in the supermarket (1, 2 or 3, counting from the floor)
weight - weight in ounces
cups - number of cups
rating - rating of the cereal
Note
All the values are expressed per 100 g portion.
# load main dataset
cereals_df = pd.read_csv(f'{data_dir}/cereal.csv', sep=',')
cereals_df.head()
 | name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
1 | 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.00 | 33.983679 |
2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
3 | All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.50 | 93.704912 |
4 | Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
# load dataset that maps manufacturer letter codes to their names
manufacturers_df = pd.read_csv(f'{data_dir}/manufacturers.csv', index_col=0)
manufacturers_df
letter | company_name |
---|---|
A | American Home Food Products |
G | General Mills |
K | Kelloggs |
N | Nabisco |
P | Post |
Q | Quaker Oats |
R | Ralston Purina |
# merge the two datasets
cereals = pd.merge(
    cereals_df, manufacturers_df, left_on=cereals_df.mfr, right_index=True
)
# remove duplicated column
cereals.drop('key_0', axis=1, inplace=True)
cereals.head()
 | name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating | company_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.00 | 0.33 | 68.402973 | Nabisco |
20 | Cream of Wheat (Quick) | N | H | 100 | 3 | 0 | 80 | 1.0 | 21.0 | 0 | -1 | 0 | 2 | 1.00 | 1.00 | 64.533816 | Nabisco |
63 | Shredded Wheat | N | C | 80 | 2 | 0 | 0 | 3.0 | 16.0 | 0 | 95 | 0 | 1 | 0.83 | 1.00 | 68.235885 | Nabisco |
64 | Shredded Wheat 'n'Bran | N | C | 90 | 3 | 0 | 0 | 4.0 | 19.0 | 0 | 140 | 0 | 1 | 1.00 | 0.67 | 74.472949 | Nabisco |
65 | Shredded Wheat spoon size | N | C | 90 | 3 | 0 | 0 | 3.0 | 20.0 | 0 | 120 | 0 | 1 | 1.00 | 0.67 | 72.801787 | Nabisco |
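The merge above passes a Series (cereals_df.mfr) as left_on, which is why pandas adds a key_0 column that then has to be dropped. As a minimal sketch of an alternative, assuming two small made-up stand-ins for the real DataFrames (toy_cereals and toy_manufacturers are hypothetical names), passing the column name as a string performs the same join without creating the extra column:

```python
import pandas as pd

# Toy stand-ins for cereals_df and manufacturers_df (illustrative values only)
toy_cereals = pd.DataFrame({
    'name': ['Cereal A', 'Cereal B', 'Cereal C'],
    'mfr': ['K', 'N', 'K'],
})
toy_manufacturers = pd.DataFrame(
    {'company_name': ['Kelloggs', 'Nabisco']},
    index=pd.Index(['K', 'N'], name='letter'),
)

# Join on the column *name*: no helper key column is created,
# so there is nothing to drop afterwards
toy_merged = pd.merge(
    toy_cereals, toy_manufacturers, left_on='mfr', right_index=True, how='left'
)
print(toy_merged)
```

The how='left' argument is an addition in this sketch: it keeps every cereal even if its letter code had no match in the lookup table.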
7.1.3. Exercises
7.1.3.1. Exercise 1
Plot the number of products per manufacturer, labeling each manufacturer with its full name rather than the letter code used in the cereals_df DataFrame. All the data you need is in the cereals DataFrame.
# write your code here
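One possible approach, not the reference solution, is a count plot over the company_name column. The sketch below assumes a small made-up DataFrame (toy, a hypothetical stand-in for cereals) so that it runs on its own; the same calls should carry over to the real data.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for the merged `cereals` DataFrame (made-up rows)
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'company_name': ['Kelloggs', 'Nabisco', 'Kelloggs', 'Post', 'Kelloggs', 'Post'],
})

# Number of products per manufacturer, most frequent first
counts = toy['company_name'].value_counts()

# Horizontal bars leave room for long company names
fig, ax = plt.subplots(figsize=(6, 3))
sns.countplot(data=toy, y='company_name', order=list(counts.index), ax=ax)
ax.set_xlabel('Number of products')
ax.set_ylabel('Manufacturer')
fig.tight_layout()
```

Ordering the bars by count is a design choice here; it makes the ranking readable at a glance.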
7.1.3.2. Exercise 2
Plot the distribution of ratings for each company and check at the same time whether there are any outliers. The necessary data is in the data DataFrame.
data = cereals[['company_name', 'rating']]
# write your code here
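A box plot is one common way to show a distribution per group together with its outliers. The sketch below is only an illustration on made-up numbers (toy is a hypothetical stand-in for data). It also flags outliers numerically with the 1.5 × IQR rule, which is the default whisker rule in seaborn's box plot.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for `data` (made-up ratings; 90.0 is a deliberate outlier)
toy = pd.DataFrame({
    'company_name': ['Alpha'] * 5 + ['Beta'] * 5,
    'rating': [30.0, 32.0, 34.0, 36.0, 90.0, 40.0, 42.0, 44.0, 46.0, 48.0],
})

# One box per company; points beyond the whiskers are drawn individually
fig, ax = plt.subplots(figsize=(6, 3))
sns.boxplot(data=toy, x='rating', y='company_name', ax=ax)
fig.tight_layout()

# The same check in numbers: per-company quartiles, broadcast back to each row
grouped = toy.groupby('company_name')['rating']
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1
is_outlier = (toy['rating'] < q1 - 1.5 * iqr) | (toy['rating'] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]
print(outliers)
```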
7.1.3.3. Exercise 3
Find and visualize the rating of each product. The necessary data is in the data DataFrame.
data = cereals[['name', 'rating']].groupby('name').mean().reset_index()
# write your code here
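With one value per product, a sorted horizontal bar chart is one reasonable option. This is a sketch on made-up values (toy is a hypothetical stand-in for data), not the intended solution:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for `data`: one mean rating per product (made-up values)
toy = pd.DataFrame({
    'name': ['Cereal A', 'Cereal B', 'Cereal C', 'Cereal D'],
    'rating': [40.0, 75.0, 55.0, 30.0],
})

# Sort so the best-rated product appears at the top of the chart
ordered = toy.sort_values('rating', ascending=False)

# Scale the figure height with the number of products to keep labels legible
fig, ax = plt.subplots(figsize=(6, 0.4 * len(ordered) + 1))
sns.barplot(data=ordered, x='rating', y='name', color='steelblue', ax=ax)
fig.tight_layout()
```

Scaling the figure height matters more on the real dataset, where 77 product names would otherwise overlap.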
7.1.3.4. Exercise 4
Find out whether any of the numerical features in the dataset are correlated with each other. Again, the data you need is in the data DataFrame.
data = cereals[['calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo', 'sugars', 'potass', 'rating']]
# write your code here
7.1.3.5. Exercise 5
Your next task is to visualize these correlations in a more quantitative way. The data is ready for you in the data DataFrame; you only have to find the right visualization method and supply the correct arguments to the function.
data = cereals[['fiber', 'potass', 'sugars', 'calories','rating']]
# write your code here
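One way to put numbers on the correlations is to compute the correlation matrix and draw it as an annotated heatmap. The sketch below assumes a tiny made-up DataFrame (toy, with hypothetical columns a, b and c) in which the correlations are exact, so the output is easy to verify by eye.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for `data`: b rises with a, c falls with a (made-up values)
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],
    'c': [4.0, 3.0, 2.0, 1.0],
})

# Pearson correlation between every pair of columns (the pandas default)
corr = toy.corr()

# Fix the color scale to [-1, 1] so that 0 maps to the neutral middle color
fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(corr, annot=True, fmt='.2f', vmin=-1, vmax=1, cmap='coolwarm', ax=ax)
fig.tight_layout()
```

Pinning vmin and vmax is a design choice: without it the color scale adapts to the data and weak correlations can look strong.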
7.1.3.6. Exercise 6
Using a scatterplot, show how the amount of potassium changes with respect to the amount of fiber and the rating. Note that this requires plotting three numerical variables at the same time. The data is ready for you in the data DataFrame.
data = cereals[['potass', 'fiber', 'rating']]
# write your code here
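A third numerical variable can be encoded as color. As a hedged sketch on made-up values (toy is a hypothetical stand-in for data), matplotlib's scatter accepts the values through c and a colorbar provides the scale:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for `data` (made-up values)
toy = pd.DataFrame({
    'fiber': [0.0, 1.0, 3.0, 5.0, 9.0],
    'potass': [25, 60, 110, 190, 320],
    'rating': [30.0, 35.0, 48.0, 60.0, 85.0],
})

# x and y carry two variables; the marker color carries the third
fig, ax = plt.subplots(figsize=(5, 4))
points = ax.scatter(toy['fiber'], toy['potass'], c=toy['rating'], cmap='viridis')
fig.colorbar(points, ax=ax, label='rating')
ax.set_xlabel('fiber (g)')
ax.set_ylabel('potass (mg)')
fig.tight_layout()
```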
7.1.3.7. Exercise 7
Using a scatterplot, plot the amount of potassium with respect to the amount of fiber, the amount of sugar and the rating. Note that this requires a visualization that can display four variables at once. The data is ready for you in the data DataFrame.
You might find some useful information here and here.
data = cereals[['potass', 'fiber', 'sugars', 'rating']]
# write your code here
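Four variables fit into a single scatterplot when two of them are mapped to marker properties. The sketch below is one possible encoding, on made-up values (toy is a hypothetical stand-in for data): position for fiber and potass, color for rating, and marker size for sugars.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for `data` (made-up values)
toy = pd.DataFrame({
    'fiber': [0.0, 1.0, 3.0, 5.0, 9.0],
    'potass': [25, 60, 110, 190, 320],
    'sugars': [12, 9, 6, 3, 0],
    'rating': [30.0, 35.0, 48.0, 60.0, 85.0],
})

# hue encodes the rating, size encodes the sugar amount
fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(
    data=toy, x='fiber', y='potass',
    hue='rating', size='sugars', sizes=(30, 250),
    palette='viridis', ax=ax,
)
fig.tight_layout()
```

Which variable goes to color and which to size is a judgment call; size differences are harder to read precisely, so the variable of lesser interest is usually the better candidate for size.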