7.1. Exercises

7.1.1. Environment setup

# for Google Colab
import os
if 'COLAB_JUPYTER_IP' in os.environ:
    !git clone https://github.com/bokulich-lab/DataVisualizationBook.git book

    from book.utils import utils
    utils.ensure_packages('book/requirements.txt')

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np

# silence all warnings, including those raised by pandas
import warnings
warnings.simplefilter(action='ignore')
%config InlineBackend.figure_format='svg'

FONT_FAMILY = 'DejaVu Sans'
FONT_SCALE = 1.3

# for Google Colab
if 'COLAB_JUPYTER_IP' in os.environ:
    data_dir = 'book/chapters/data'
else:
    data_dir = '../data'

7.1.2. Data pre-processing

All the exercises are based on the Kaggle cereals dataset, which has 77 records and 16 columns containing nutritional information on different brands of breakfast cereals. The columns are:

name - name of the cereal
mfr - manufacturer of the cereal, given as a one-letter code. The manufacturers_df DataFrame loaded below maps each code to the company name.
type - whether the cereal is eaten hot (H) or cold (C)
calories - number of calories
protein - grams of protein
fat - grams of fat
sodium - milligrams of sodium
fiber - grams of fiber
carbo - grams of carbohydrates
sugars - grams of sugars
potass - milligrams of potassium
vitamins - vitamins and minerals (0, 25 or 100) as a percentage of the Recommended Dietary Intake
shelf - supermarket display shelf (1, 2 or 3, counting from the floor)
weight - weight in ounces
cups - number of cups
rating - rating of the cereal

Note

All values are expressed per 100 g portion.

# load main dataset
cereals_df = pd.read_csv(f'{data_dir}/cereal.csv', sep=',')
cereals_df.head()

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
0	100% Bran	N	C	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
1	100% Natural Bran	Q	C	120	3	5	15	2.0	8.0	8	135	0	3	1.0	1.00	33.983679
2	All-Bran	K	C	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
3	All-Bran with Extra Fiber	K	C	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
4	Almond Delight	R	C	110	2	2	200	1.0	14.0	8	-1	25	3	1.0	0.75	34.384843

# load dataset that maps manufacturer letter codes to their names
manufacturers_df = pd.read_csv(f'{data_dir}/manufacturers.csv', index_col=0)
manufacturers_df

	company_name
letter
A	American Home Food Products
G	General Mills
K	Kelloggs
N	Nabisco
P	Post
Q	Quaker Oats
R	Ralston Purina

# merge the two datasets on the manufacturer letter code;
# passing the column name (rather than the Series) creates no extra 'key_0' column
cereals = pd.merge(
    cereals_df, manufacturers_df, left_on='mfr', right_index=True
)
cereals.head()

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating	company_name
0	100% Bran	N	C	70	4	1	130	10.0	5.0	6	280	25	3	1.00	0.33	68.402973	Nabisco
20	Cream of Wheat (Quick)	N	H	100	3	0	80	1.0	21.0	0	-1	0	2	1.00	1.00	64.533816	Nabisco
63	Shredded Wheat	N	C	80	2	0	0	3.0	16.0	0	95	0	1	0.83	1.00	68.235885	Nabisco
64	Shredded Wheat 'n'Bran	N	C	90	3	0	0	4.0	19.0	0	140	0	1	1.00	0.67	74.472949	Nabisco
65	Shredded Wheat spoon size	N	C	90	3	0	0	3.0	20.0	0	120	0	1	1.00	0.67	72.801787	Nabisco
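
Because pd.merge defaults to an inner join, a merge like the one above silently drops any cereal whose letter code has no entry in manufacturers_df. The following is a minimal sketch of how one might check for that. It uses a few rows copied from the tables above as stand-ins for the CSV files, and the names merged and n_unmatched are illustrative only.

```python
import pandas as pd

# A few rows copied from the tables above; they only stand in for the CSV files.
cereals_df = pd.DataFrame({
    'name': ['100% Bran', '100% Natural Bran', 'All-Bran', 'Almond Delight'],
    'mfr': ['N', 'Q', 'K', 'R'],
    'rating': [68.402973, 33.983679, 59.425505, 34.384843],
})
manufacturers_df = pd.DataFrame(
    {'company_name': ['Kelloggs', 'Nabisco', 'Quaker Oats', 'Ralston Purina']},
    index=pd.Index(['K', 'N', 'Q', 'R'], name='letter'),
)

# how='left' keeps every cereal, even if its code is unknown.
# validate='many_to_one' raises if a letter appears twice in manufacturers_df.
merged = pd.merge(
    cereals_df, manufacturers_df,
    left_on='mfr', right_index=True,
    how='left', validate='many_to_one',
)

# With a left join, unmatched manufacturer codes show up as missing names.
n_unmatched = int(merged['company_name'].isna().sum())
print(merged[['name', 'mfr', 'company_name']])
print('unmatched rows:', n_unmatched)
```

If n_unmatched is zero, the inner join and the left join return the same rows, so nothing was lost.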

7.1.3. Exercises

7.1.3.1. Exercise 1

Plot the number of products per manufacturer, displaying each manufacturer’s name instead of the letter code used in the cereals_df DataFrame. All the data you need is in the cereals DataFrame.

# write your code here
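
One possible approach, as a minimal sketch rather than the reference solution: count the rows per company and draw the counts as horizontal bars. The small DataFrame below holds only the manufacturer names of the nine rows displayed in the tables above and stands in for the full cereals DataFrame. The name counts is illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for `cereals`: manufacturers of the nine rows shown in the tables above.
cereals = pd.DataFrame({
    'company_name': ['Nabisco', 'Quaker Oats', 'Kelloggs', 'Kelloggs', 'Ralston Purina',
                     'Nabisco', 'Nabisco', 'Nabisco', 'Nabisco'],
})

# Count products per manufacturer; value_counts sorts from most to fewest.
counts = cereals['company_name'].value_counts()

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(counts.index, counts.values)
ax.invert_yaxis()  # put the largest count at the top
ax.set_xlabel('Number of products')
ax.set_ylabel('Manufacturer')
fig.tight_layout()
```

Horizontal bars are chosen here because company names are long and would overlap on a horizontal axis.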

7.1.3.2. Exercise 2

Plot the distribution of ratings per company and, at the same time, check whether there are any outliers. The necessary data is in the data DataFrame.

data = cereals[['company_name', 'rating']]

# write your code here
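
A box plot is one chart type that shows a distribution and flags outliers at the same time. The sketch below assumes that choice and is not the only valid answer. It uses the ratings of the nine rows displayed in the tables above as a stand-in for data, and it also lists the outliers explicitly with the same 1.5 × IQR rule that box plots use by default. The names grouped, q1, q3, iqr and outliers are illustrative.

```python
import pandas as pd
import seaborn as sns

# Stand-in for `data`: ratings of the nine rows shown in the tables above.
data = pd.DataFrame({
    'company_name': ['Nabisco', 'Quaker Oats', 'Kelloggs', 'Kelloggs', 'Ralston Purina',
                     'Nabisco', 'Nabisco', 'Nabisco', 'Nabisco'],
    'rating': [68.402973, 33.983679, 59.425505, 93.704912, 34.384843,
               64.533816, 68.235885, 74.472949, 72.801787],
})

# One box per company; points beyond 1.5 * IQR from the box are drawn individually.
ax = sns.boxplot(data=data, x='rating', y='company_name')
ax.set_xlabel('Rating')
ax.set_ylabel('Manufacturer')

# The same 1.5 * IQR rule, computed per company so the outliers can be listed.
grouped = data.groupby('company_name')['rating']
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1
outliers = data[(data['rating'] < q1 - 1.5 * iqr) | (data['rating'] > q3 + 1.5 * iqr)]
print(outliers)
```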

7.1.3.3. Exercise 3

Find and visualize the ratings per product. You will find the necessary data in the data DataFrame.

data = cereals[['name', 'rating']].groupby('name').mean().reset_index()

# write your code here
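
With one value per product, a sorted horizontal bar chart is a reasonable candidate. The following sketch assumes that choice and uses the five rows shown by cereals_df.head() above as a stand-in for data. The name ranked and the figure-height formula are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in for `data`: the five products shown by cereals_df.head() above.
data = pd.DataFrame({
    'name': ['100% Bran', '100% Natural Bran', 'All-Bran',
             'All-Bran with Extra Fiber', 'Almond Delight'],
    'rating': [68.402973, 33.983679, 59.425505, 93.704912, 34.384843],
})

# Sorting first turns the chart into a ranking that is easy to read.
ranked = data.sort_values('rating', ascending=False)

# Scale the figure height with the number of products so labels do not overlap.
fig, ax = plt.subplots(figsize=(6, 0.3 * len(ranked) + 1))
sns.barplot(data=ranked, x='rating', y='name', color='steelblue', ax=ax)
ax.set_xlabel('Rating')
ax.set_ylabel('')
fig.tight_layout()
```

With all 77 products, the height formula gives a tall figure, which keeps each product name on its own readable line.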

7.1.3.4. Exercise 4

Find out whether any of the numerical features in the dataset are correlated. Again, the data you need is in the data DataFrame.

data = cereals[['calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo', 'sugars', 'potass', 'rating']]

# write your code here

7.1.3.5. Exercise 5

Your next task is to quantify and visualize these correlations. The data is ready for you in the data DataFrame; you only have to find the correct visualization method and supply the correct arguments to the function.

data = cereals[['fiber', 'potass', 'sugars', 'calories','rating']]

# write your code here
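
One way to put numbers on the relationships is to compute the pairwise Pearson correlation coefficients and draw them as an annotated heatmap. The sketch below assumes that method. As a stand-in for data it uses the seven displayed rows whose potass value is not -1, and the name corr is illustrative.

```python
import pandas as pd
import seaborn as sns

# Stand-in for `data`: the seven rows shown above whose potass value is not -1.
data = pd.DataFrame({
    'fiber': [10.0, 2.0, 9.0, 14.0, 3.0, 4.0, 3.0],
    'potass': [280, 135, 320, 330, 95, 140, 120],
    'sugars': [6, 8, 5, 0, 0, 0, 0],
    'calories': [70, 120, 70, 50, 80, 90, 90],
    'rating': [68.402973, 33.983679, 59.425505, 93.704912,
               68.235885, 74.472949, 72.801787],
})

# Pairwise Pearson correlation coefficients (the default of DataFrame.corr).
corr = data.corr()

# Fixing the colour range to [-1, 1] centres the diverging palette on zero.
ax = sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
                 vmin=-1, vmax=1, square=True)
```

A diverging palette with a fixed range makes positive and negative coefficients of the same magnitude look equally strong.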

7.1.3.6. Exercise 6

Using a scatterplot, show how the potassium amount changes w.r.t. the fiber amount and the rating. Notice that this requires you to plot three numerical variables at the same time. The data to be used is ready for you in the data DataFrame.

data = cereals[['potass', 'fiber', 'rating']]

# write your code here
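
A scatterplot has only two axes, so the third variable needs another visual channel. The sketch below maps rating to colour, which is one option among several. It uses the seven displayed rows whose potass value is not -1 as a stand-in for data.

```python
import pandas as pd
import seaborn as sns

# Stand-in for `data`: the seven rows shown above whose potass value is not -1.
data = pd.DataFrame({
    'potass': [280, 135, 320, 330, 95, 140, 120],
    'fiber': [10.0, 2.0, 9.0, 14.0, 3.0, 4.0, 3.0],
    'rating': [68.402973, 33.983679, 59.425505, 93.704912,
               68.235885, 74.472949, 72.801787],
})

# x and y carry fiber and potassium; colour (hue) carries the rating.
ax = sns.scatterplot(data=data, x='fiber', y='potass',
                     hue='rating', palette='viridis')
ax.set_xlabel('Fiber (g)')
ax.set_ylabel('Potassium (mg)')
```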

7.1.3.7. Exercise 7

Using a scatterplot, plot the potassium amount w.r.t. the fiber amount, the sugar amount and the rating. Notice that this will require you to find a visualization that can display four variables at once. The data to be used is ready for you in the data DataFrame.

You might find some useful information here and here.

data = cereals[['potass', 'fiber', 'sugars', 'rating']]

# write your code here
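
For four variables, marker size can serve as a second extra channel next to colour. The sketch below assumes the mapping rating to colour and sugars to size, though other assignments are just as valid. It uses the seven displayed rows whose potass value is not -1 as a stand-in for data, and the size range (30, 300) is an arbitrary choice.

```python
import pandas as pd
import seaborn as sns

# Stand-in for `data`: the seven rows shown above whose potass value is not -1.
data = pd.DataFrame({
    'potass': [280, 135, 320, 330, 95, 140, 120],
    'fiber': [10.0, 2.0, 9.0, 14.0, 3.0, 4.0, 3.0],
    'sugars': [6, 8, 5, 0, 0, 0, 0],
    'rating': [68.402973, 33.983679, 59.425505, 93.704912,
               68.235885, 74.472949, 72.801787],
})

# x/y: fiber and potassium; colour: rating; marker size: sugars.
ax = sns.scatterplot(data=data, x='fiber', y='potass',
                     hue='rating', size='sugars',
                     sizes=(30, 300), palette='viridis')
ax.set_xlabel('Fiber (g)')
ax.set_ylabel('Potassium (mg)')
```

Size is harder to read precisely than position or colour, so it is usually best reserved for the variable where a rough impression is enough.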

Data Visualization for Food Scientists
