7.1. Exercises

7.1.1. Environment setup

# for Google Colab
import os
if 'COLAB_JUPYTER_IP' in os.environ:
    !git clone https://github.com/bokulich-lab/DataVisualizationBook.git book

    from book.utils import utils
    utils.ensure_packages('book/requirements.txt')
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np

# silence all warnings (this also hides pandas' warnings)
import warnings
warnings.simplefilter(action='ignore')
%config InlineBackend.figure_format='svg'

FONT_FAMILY = 'DejaVu Sans'
FONT_SCALE = 1.3

# for Google Colab
if 'COLAB_JUPYTER_IP' in os.environ:
    data_dir = 'book/chapters/data'
else:
    data_dir = '../data'

7.1.2. Data pre-processing

All the exercises are based on the Kaggle cereals dataset, which has 77 records and 16 columns containing nutritional information on different brands of breakfast cereals. The columns are:

  • name - name of the cereal

  • mfr - manufacturer of the cereal, encoded as a single letter. The manufacturers_df loaded below maps each letter to the company name.

  • type - hot (H) or cold (C), the preferred way of eating

  • calories - amount of calories

  • protein - grams of protein

  • fat - grams of fat

  • sodium - milligrams of sodium

  • fiber - amount in grams

  • carbo - amount of carbohydrates in grams

  • sugars - amount in grams

  • potass - amount in milligrams

  • vitamins - vitamins and minerals (0, 25, 100) as a percentage of the Recommended Dietary Intake

  • shelf - display shelf in the supermarket (1, 2 or 3, counting from the floor)

  • weight - weight in ounces

  • cups - number of cups

  • rating - rating of the cereals

Note

All the values are expressed per 100 g portion.

# load main dataset
cereals_df = pd.read_csv(f'{data_dir}/cereal.csv', sep=',')
cereals_df.head()
                        name  mfr  type  calories  protein  fat  sodium  fiber  carbo  sugars  potass  vitamins  shelf  weight  cups     rating
0                  100% Bran    N     C        70        4    1     130   10.0    5.0       6     280        25      3     1.0  0.33  68.402973
1          100% Natural Bran    Q     C       120        3    5      15    2.0    8.0       8     135         0      3     1.0  1.00  33.983679
2                   All-Bran    K     C        70        4    1     260    9.0    7.0       5     320        25      3     1.0  0.33  59.425505
3  All-Bran with Extra Fiber    K     C        50        4    0     140   14.0    8.0       0     330        25      3     1.0  0.50  93.704912
4             Almond Delight    R     C       110        2    2     200    1.0   14.0       8      -1        25      3     1.0  0.75  34.384843
# load dataset that maps manufacturer letter codes to their names
manufacturers_df = pd.read_csv(f'{data_dir}/manufacturers.csv', index_col=0)
manufacturers_df
                       company_name
letter
A       American Home Food Products
G                     General Mills
K                          Kelloggs
N                           Nabisco
P                              Post
Q                       Quaker Oats
R                    Ralston Purina
# merge the two datasets
cereals = pd.merge(
    cereals_df, manufacturers_df, left_on=cereals_df.mfr, right_index=True
)
# drop the helper 'key_0' column that merge adds when left_on is a Series
cereals.drop('key_0', axis=1, inplace=True)
cereals.head()
                         name  mfr  type  calories  protein  fat  sodium  fiber  carbo  sugars  potass  vitamins  shelf  weight  cups     rating  company_name
0                   100% Bran    N     C        70        4    1     130   10.0    5.0       6     280        25      3    1.00  0.33  68.402973       Nabisco
20     Cream of Wheat (Quick)    N     H       100        3    0      80    1.0   21.0       0      -1         0      2    1.00  1.00  64.533816       Nabisco
63             Shredded Wheat    N     C        80        2    0       0    3.0   16.0       0      95         0      1    0.83  1.00  68.235885       Nabisco
64     Shredded Wheat 'n'Bran    N     C        90        3    0       0    4.0   19.0       0     140         0      1    1.00  0.67  74.472949       Nabisco
65  Shredded Wheat spoon size    N     C        90        3    0       0    3.0   20.0       0     120         0      1    1.00  0.67  72.801787       Nabisco
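After a merge like the one above, it can be worth checking that no product was lost and that every letter code found a company name. The sketch below does this on a toy subset (four rows copied from the cereals_df preview); the names cereals_toy, manufacturers_toy and merged_toy are made up for illustration. It also assumes a left join keyed on the column name 'mfr', which differs from the cell above but avoids creating the 'key_0' column in the first place.

```python
import pandas as pd

# Toy subset: four rows taken from the cereals_df preview above.
cereals_toy = pd.DataFrame({
    "name": ["100% Bran", "100% Natural Bran", "All-Bran", "Almond Delight"],
    "mfr": ["N", "Q", "K", "R"],
    "rating": [68.402973, 33.983679, 59.425505, 34.384843],
})

# Lookup table indexed by letter, shaped like manufacturers_df.
manufacturers_toy = pd.DataFrame(
    {"company_name": ["Kelloggs", "Nabisco", "Quaker Oats", "Ralston Purina"]},
    index=pd.Index(["K", "N", "Q", "R"], name="letter"),
)

# Passing the column *name* as left_on joins without a helper 'key_0' column.
# how="left" keeps every cereal, even if its letter had no match.
merged_toy = cereals_toy.merge(
    manufacturers_toy, left_on="mfr", right_index=True, how="left"
)

# Sanity checks: same number of rows, and no missing company names.
print(len(merged_toy) == len(cereals_toy))
print(merged_toy["company_name"].notna().all())
print(merged_toy.columns.tolist())
```

If the second check printed False, some letter in mfr would be absent from the lookup table.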

7.1.3. Exercises

7.1.3.1. Exercise 1

Plot the number of products per manufacturer, labelling each manufacturer by its full name rather than the letter code used in cereals_df. All the data you need is in the cereals DataFrame.

# write your code here
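One possible starting point, not the only valid answer: a count plot with the company name on the categorical axis. The sketch runs on a toy stand-in (nine rows copied from the tables printed earlier, called toy here as an assumption), so the bar heights will not match the full dataset; swapping toy for cereals should carry over.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for `cereals`: nine rows copied from the tables above.
toy = pd.DataFrame({
    "name": [
        "100% Bran", "100% Natural Bran", "All-Bran",
        "All-Bran with Extra Fiber", "Almond Delight",
        "Cream of Wheat (Quick)", "Shredded Wheat",
        "Shredded Wheat 'n'Bran", "Shredded Wheat spoon size",
    ],
    "company_name": [
        "Nabisco", "Quaker Oats", "Kelloggs", "Kelloggs", "Ralston Purina",
        "Nabisco", "Nabisco", "Nabisco", "Nabisco",
    ],
})

# Sort the bars from the most to the fewest products.
counts = toy["company_name"].value_counts()
order = counts.index.tolist()

fig, ax = plt.subplots(figsize=(6, 3))
# Horizontal bars leave room for long company names.
sns.countplot(data=toy, y="company_name", order=order, ax=ax)
ax.set_xlabel("Number of products")
ax.set_ylabel("Manufacturer")
fig.tight_layout()
```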

7.1.3.2. Exercise 2

Plot the distribution of ratings per company and, at the same time, check whether there are any outliers. The necessary data is in the data DataFrame.

data = cereals[['company_name', 'rating']]
# write your code here
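A box plot is one common way to show a distribution and its outliers together: by default seaborn draws points lying more than 1.5 times the interquartile range beyond the box as individual markers. The sketch below is a hedged illustration on a toy subset (nine ratings copied from the tables above, stored in a hypothetical toy DataFrame); with so few rows per company the boxes are only indicative.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for `data`: nine rows copied from the tables above.
toy = pd.DataFrame({
    "company_name": [
        "Nabisco", "Quaker Oats", "Kelloggs", "Kelloggs", "Ralston Purina",
        "Nabisco", "Nabisco", "Nabisco", "Nabisco",
    ],
    "rating": [
        68.402973, 33.983679, 59.425505, 93.704912, 34.384843,
        64.533816, 68.235885, 74.472949, 72.801787,
    ],
})

# Order the boxes by median rating, highest first.
medians = toy.groupby("company_name")["rating"].median()
order = medians.sort_values(ascending=False).index.tolist()

fig, ax = plt.subplots(figsize=(6, 3))
# Points beyond the whiskers (1.5 * IQR by default) are drawn as outliers.
sns.boxplot(data=toy, x="rating", y="company_name", order=order, ax=ax)
ax.set_xlabel("Rating")
ax.set_ylabel("Manufacturer")
fig.tight_layout()
```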

7.1.3.3. Exercise 3

Find and visualize the ratings per product. You will find the necessary data in the data DataFrame.

data = cereals[['name', 'rating']].groupby('name').mean().reset_index()
# write your code here
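With one rating per product, a sorted horizontal bar chart is a reasonable candidate. The following is a sketch under assumptions: it uses five rows from the cereals_df preview in a hypothetical toy DataFrame, and a single bar colour chosen arbitrarily. The full dataset has 77 products, so the figure height would need to grow accordingly.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for `data`: five rows from the cereals_df preview above.
toy = pd.DataFrame({
    "name": [
        "100% Bran", "100% Natural Bran", "All-Bran",
        "All-Bran with Extra Fiber", "Almond Delight",
    ],
    "rating": [68.402973, 33.983679, 59.425505, 93.704912, 34.384843],
})

# Sorting first makes the ranking readable at a glance.
toy_sorted = toy.sort_values("rating", ascending=False)

fig, ax = plt.subplots(figsize=(6, 3))
# String values on y and numbers on x give horizontal bars.
sns.barplot(data=toy_sorted, x="rating", y="name", color="steelblue", ax=ax)
ax.set_xlabel("Rating")
ax.set_ylabel("Product")
fig.tight_layout()
```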

7.1.3.4. Exercise 4

Check whether any of the numerical features in the dataset are correlated. Again, the data you need is in the data DataFrame.

data = cereals[['calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo', 'sugars', 'potass', 'rating']]
# write your code here

7.1.3.5. Exercise 5

Your next task is to find and visualize these correlations in a more quantitative way. The data is ready for you in the data DataFrame; you only have to find the right visualization method and supply the correct arguments to the function.

data = cereals[['fiber', 'potass', 'sugars', 'calories','rating']]
# write your code here
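One reading of "more quantitative" is to compute correlation coefficients and show them as an annotated heatmap. The sketch assumes Pearson correlation (the default of DataFrame.corr) and a diverging colour map fixed to the range -1 to 1; both are choices made here, not stated in the exercise. The toy DataFrame holds seven rows copied from the tables above.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for `data`: seven rows copied from the tables above.
toy = pd.DataFrame({
    "fiber": [10.0, 2.0, 9.0, 14.0, 3.0, 4.0, 3.0],
    "potass": [280, 135, 320, 330, 95, 140, 120],
    "sugars": [6, 8, 5, 0, 0, 0, 0],
    "calories": [70, 120, 70, 50, 80, 90, 90],
    "rating": [68.402973, 33.983679, 59.425505, 93.704912,
               68.235885, 74.472949, 72.801787],
})

# Pairwise Pearson correlation coefficients between all columns.
corr = toy.corr()

fig, ax = plt.subplots(figsize=(5, 4))
# Fixing vmin/vmax centres the diverging palette on zero correlation.
sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1,
            cmap="coolwarm", ax=ax)
fig.tight_layout()
```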

7.1.3.6. Exercise 6

Using a scatterplot, show how the potassium amount changes with respect to the fiber amount and the rating. Note that this requires plotting three numerical variables at the same time. The data to use is ready for you in the data DataFrame.

data = cereals[['potass', 'fiber', 'rating']]
# write your code here
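A third numerical variable can be encoded as colour. The sketch below maps rating to the hue of each point; which variable goes on which axis, and the viridis palette, are assumptions made for illustration. It runs on a toy subset of seven rows copied from the tables above.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for `data`: seven rows copied from the tables above.
toy = pd.DataFrame({
    "potass": [280, 135, 320, 330, 95, 140, 120],
    "fiber": [10.0, 2.0, 9.0, 14.0, 3.0, 4.0, 3.0],
    "rating": [68.402973, 33.983679, 59.425505, 93.704912,
               68.235885, 74.472949, 72.801787],
})

fig, ax = plt.subplots(figsize=(5, 4))
# x and y carry two variables; a numeric hue adds the third as colour.
sns.scatterplot(data=toy, x="fiber", y="potass", hue="rating",
                palette="viridis", ax=ax)
ax.set_xlabel("Fiber (g)")
ax.set_ylabel("Potassium (mg)")
fig.tight_layout()
```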

7.1.3.7. Exercise 7

Using a scatterplot, plot the potassium amount with respect to the fiber amount, the sugar amount and the rating. Note that this requires a visualization that can display four variables at once. The data to use is ready for you in the data DataFrame.

You might find some useful information here and here.

data = cereals[['potass', 'fiber', 'sugars', 'rating']]
# write your code here
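A scatter plot offers more channels than position alone: colour and marker size can each carry one extra variable. The sketch below assigns rating to hue and sugars to size; that assignment and the size range (30, 200) are assumptions chosen for illustration, and other mappings are equally defensible. The toy DataFrame again holds seven rows copied from the tables above.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for `data`: seven rows copied from the tables above.
toy = pd.DataFrame({
    "potass": [280, 135, 320, 330, 95, 140, 120],
    "fiber": [10.0, 2.0, 9.0, 14.0, 3.0, 4.0, 3.0],
    "sugars": [6, 8, 5, 0, 0, 0, 0],
    "rating": [68.402973, 33.983679, 59.425505, 93.704912,
               68.235885, 74.472949, 72.801787],
})

fig, ax = plt.subplots(figsize=(6, 4))
# Position encodes fiber and potass, colour encodes rating,
# marker area encodes sugars (scaled into the `sizes` range).
sns.scatterplot(data=toy, x="fiber", y="potass", hue="rating",
                size="sugars", sizes=(30, 200), palette="viridis", ax=ax)
ax.set_xlabel("Fiber (g)")
ax.set_ylabel("Potassium (mg)")
fig.tight_layout()
```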