{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Importance of data visualization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we dive into visualizing domain-specific data (e.g., food data), we must start with the bare fundamentals. To demonstrate the importance of visualization for data interpretation we can use a collection of four datasets known as **Anscombe's quartet** (after the English statistician [Francis Anscombe](https://en.wikipedia.org/wiki/Frank_Anscombe)). Those datasets comprise 11 data points with particular properties: while the points themselves differ between the sets, their summary statistics are (nearly) exactly the same, i.e.: all those datasets have the same mean, standard deviation and regression line with the same parameters and R2 metric. And yet, the shape of the data differs widely between those, as we will see below. This demonstrates beautifully why data visualization is essential to fully comprehend any analysis, as a complement to statistical and numerical methods." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Environment setup\n", "\n", "Here we will import all the required modules and preconfigure some variables, if necessary." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# for Google Colab\n", "import os\n", "if 'COLAB_JUPYTER_IP' in os.environ:\n", " !git clone https://github.com/bokulich-lab/DataVisualizationBook.git book\n", "\n", " from book.utils import utils\n", " utils.ensure_packages('book/requirements.txt')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-05-05T10:28:57.174588Z", "iopub.status.busy": "2023-05-05T10:28:57.173727Z", "iopub.status.idle": "2023-05-05T10:28:57.322199Z", "shell.execute_reply": "2023-05-05T10:28:57.308738Z", "shell.execute_reply.started": "2023-05-05T10:28:57.174529Z" }, "tags": [] }, "outputs": [], "source": [ "import seaborn as sns\n", "\n", "from scipy import stats\n", "\n", "%config InlineBackend.figure_format='svg'\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the dataset and analyze it" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *seaborn* library provides a method to read in this dataset directly:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-04-28T12:31:23.355001Z", "iopub.status.busy": "2023-04-28T12:31:23.354484Z", "iopub.status.idle": "2023-04-28T12:31:23.366288Z", "shell.execute_reply": "2023-04-28T12:31:23.365332Z", "shell.execute_reply.started": "2023-04-28T12:31:23.354954Z" } }, "outputs": [], "source": [ "quartet = sns.load_dataset(\"anscombe\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2023-04-28T12:31:24.716126Z", "iopub.status.busy": "2023-04-28T12:31:24.715562Z", "iopub.status.idle": "2023-04-28T12:31:24.739926Z", "shell.execute_reply": "2023-04-28T12:31:24.738569Z", "shell.execute_reply.started": "2023-04-28T12:31:24.716075Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datasetxy
0I10.08.04
1I8.06.95
2I13.07.58
3I9.08.81
4I11.08.33
\n", "
" ], "text/plain": [ " dataset x y\n", "0 I 10.0 8.04\n", "1 I 8.0 6.95\n", "2 I 13.0 7.58\n", "3 I 9.0 8.81\n", "4 I 11.0 8.33" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quartet.head() # display the first five records" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now explore the summary metrics of the data. Let's look at the mean, standard deviation, and linear regression calculations. To achieve this, we can use pandas' _groupby_ method, which allows us to \"group\" the data points according to the indicated variable - in our case, \"dataset\". We can then use the resulting object to calculate some summary statistics on all the groups." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2023-04-28T12:31:26.100665Z", "iopub.status.busy": "2023-04-28T12:31:26.100189Z", "iopub.status.idle": "2023-04-28T12:31:26.106975Z", "shell.execute_reply": "2023-04-28T12:31:26.105798Z", "shell.execute_reply.started": "2023-04-28T12:31:26.100627Z" } }, "outputs": [], "source": [ "quartet_grouped = quartet.groupby('dataset')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-04-28T12:31:27.307538Z", "iopub.status.busy": "2023-04-28T12:31:27.306890Z", "iopub.status.idle": "2023-04-28T12:31:27.333704Z", "shell.execute_reply": "2023-04-28T12:31:27.332322Z", "shell.execute_reply.started": "2023-04-28T12:31:27.307445Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xy
dataset
I9.07.500909
II9.07.500909
III9.07.500000
IV9.07.500909
\n", "
" ], "text/plain": [ " x y\n", "dataset \n", "I 9.0 7.500909\n", "II 9.0 7.500909\n", "III 9.0 7.500000\n", "IV 9.0 7.500909" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quartet_grouped.mean()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2023-04-28T12:31:29.726331Z", "iopub.status.busy": "2023-04-28T12:31:29.725757Z", "iopub.status.idle": "2023-04-28T12:31:29.742040Z", "shell.execute_reply": "2023-04-28T12:31:29.739927Z", "shell.execute_reply.started": "2023-04-28T12:31:29.726283Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xy
dataset
I3.3166252.031568
II3.3166252.031657
III3.3166252.030424
IV3.3166252.030579
\n", "
" ], "text/plain": [ " x y\n", "dataset \n", "I 3.316625 2.031568\n", "II 3.316625 2.031657\n", "III 3.316625 2.030424\n", "IV 3.316625 2.030579" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quartet_grouped.std()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2023-04-28T12:31:31.946563Z", "iopub.status.busy": "2023-04-28T12:31:31.945814Z", "iopub.status.idle": "2023-04-28T12:31:31.964679Z", "shell.execute_reply": "2023-04-28T12:31:31.963167Z", "shell.execute_reply.started": "2023-04-28T12:31:31.946517Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset I:\ty=0.5x+3.0\tR-coeff=0.667\tCorrelation-coeff=0.816\n", "Dataset II:\ty=0.5x+3.001\tR-coeff=0.666\tCorrelation-coeff=0.816\n", "Dataset III:\ty=0.5x+3.002\tR-coeff=0.666\tCorrelation-coeff=0.816\n", "Dataset IV:\ty=0.5x+3.002\tR-coeff=0.667\tCorrelation-coeff=0.817\n" ] } ], "source": [ "# fit a linear regression model for each dataset\n", "\n", "for ds in quartet['dataset'].unique():\n", " dataset = quartet[quartet['dataset'] == ds]\n", " res = stats.linregress(dataset['x'], dataset['y'])\n", " print(\n", " f'Dataset {ds}:'\n", " f'\\ty={round(res.slope, 3)}x+{round(res.intercept, 3)}'\n", " f'\\tR-coeff={round(res.rvalue**2, 3)}'\n", " f'\\tCorrelation-coeff={round(res.rvalue, 3)}'\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the numbers are nearly identical. 
Without visualizing the data it would be rather difficult to tell the datasets apart.\n", "\n", "Let's try to look at the data using a scatter plot, including a regression line (automatically calculated by the `lmplot` function):" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2023-04-28T12:44:34.641826Z", "iopub.status.busy": "2023-04-28T12:44:34.641249Z", "iopub.status.idle": "2023-04-28T12:44:35.763908Z", "shell.execute_reply": "2023-04-28T12:44:35.762325Z", "shell.execute_reply.started": "2023-04-28T12:44:34.641779Z" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "[Figure: 2x2 grid of scatter plots with fitted regression lines, one panel per dataset (I, II, III, IV); axes: x and y; SVG rendered by Matplotlib v3.4.3]
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "with sns.plotting_context(\"notebook\", font_scale=1.2), sns.axes_style('white'):\n", " g = sns.lmplot(\n", " x=\"x\", y=\"y\", col=\"dataset\", data=quartet, ci=None,\n", " col_wrap=2, scatter_kws={\"s\": 100, 'color': '#0D62A3'},\n", " line_kws={'linewidth': 5, 'color': '#A30905', 'alpha': 0.5},\n", " )\n", " g.set(xlim=(2, 22))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, that is interesting. Each of those looks entirely differently - let's think about how we could interpret each of those datasets after visual inspection:\n", "\n", " 1. resembles a typical linear relationship where the y variable is correlated with x (with a lot of [Gaussian noise](https://en.wikipedia.org/wiki/Gaussian_noise))\n", " 2. there is a clear [correlation](https://en.wikipedia.org/wiki/Correlation) between variables but not a linear one\n", " 3. the relationship is evidently linear, with a single outlier (which is lowering the correlation coefficient from 1 to 0.816)\n", " 4. there does not seem to be a relationship between the two variables, however the outlier again skews the correlation coefficient significantly\n", "\n", "As you can see, according to the linear regression model parameters themselves those datasets look very similar, if not the same. It is only when we visualize them using a simple scatter plot we can see that those datasets differ significantly.\n", "\n", "````{admonition} Remember\n", ":class: tip\n", "Data visualization allows us to get the first glimpse into the properties of the dataset, without the need to calculate many different summary statistics and other metrics. \n", "````" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next sections, we will explore different visualization methods that will help us in the analysis of different datasets." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.16" }, "vscode": { "interpreter": { "hash": "52634da84371cba311ea128a5ea7cdc41ff074b781779e754b270ff9f8153cee" } }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }