Visualization

When used as an abstract noun, visualization refers to the process of making abstract data and relationships visible. The result of this process is commonly called a (or the) visualization (visualization as a concrete noun).

In our daily lives, we often use standardized charts for visualization, which all feature different characteristics. Examples of charts are bar charts, pie charts, and line charts. Upon closer inspection, however, charts are only concrete instances of more abstract objects. These objects are called graphics or graphs, and they can be composed of a limited set of elements, using a limited set of rules (if you are interested in the conceptual foundations, read Bertin, Sémiologie graphique, or Wilkinson, The Grammar of Graphics). In programming jargon, graphics also go by the name of plots, and the process of making a plot is consequently referred to as plotting.


Visualization in Python

There are many libraries for data visualization in Python, each with its own strengths and weaknesses. Which library you should use depends both on the requirements of your project and on your personal preferences. The following libraries might be of interest (named in alphabetical order and without further comments):

Some questions you might ask when choosing your visualization library are:

  • Do the data structures in which I store my data already provide native access to a certain library?
  • Should the graphics be 2D or 3D? Interactive or static? Which types of graphics am I likely to need?
  • How much customization should be possible - or even required? At which level of abstraction do I want to customize my graphics?
  • How well is the library documented? Which learning and support resources are available?
  • What does the API look like? Do I find the syntax intuitive?
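The first question in the list can be made concrete with pandas, whose DataFrame objects offer a plot method that delegates to matplotlib by default. The following is a minimal sketch under assumptions: the data and the column names are made up for illustration, and the Agg backend is selected only so that the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical mock data; the column names are invented for this sketch
df = pd.DataFrame({"year": [2019, 2020, 2021, 2022],
                   "visitors": [120, 95, 140, 180]})

# DataFrame.plot hands the data over to matplotlib (the default pandas plotting backend)
ax = df.plot(x="year", y="visitors", kind="line", title="Visitors per year")

# The return value is a regular matplotlib Axes object and can be customized further
ax.set_ylabel("visitors")
```

If your data already lives in such a structure, a library with this kind of native access saves you the step of unpacking the data before plotting.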

Specifically: matplotlib

matplotlib is one of the oldest and most widespread plotting libraries for Python. It was modeled on the (commercial) MATLAB software, which is popular among natural scientists. Although the syntax is somewhat odd, once you have understood the library's fundamental principles, you will be able to produce decent graphics (and even animations) in little time.

matplotlib is designed for 2D plotting without interactions, but there is also a 3D extension. If we are using matplotlib in Jupyter Notebook, we can also use widgets to include interactive elements.
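As a rough illustration of the 3D extension, the sketch below draws a helix using the mplot3d toolkit that ships with matplotlib. The data is made up, and the Agg backend is chosen only so that the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3D projection on older versions)

# Mock data: points along a helix
t = np.linspace(0, 4 * np.pi, 200)
x, y, z = np.cos(t), np.sin(t), t

fig = plt.figure()
ax3d = fig.add_subplot(projection="3d")   # request 3D axes instead of the default 2D axes
ax3d.plot(x, y, z, color="k")             # same plot call as in 2D, with an additional z argument
ax3d.set_xlabel("x")
ax3d.set_ylabel("y")
ax3d.set_zlabel("z")
```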

If we want to view graphics within our Jupyter Notebook, we can use one of two modes:

  • %matplotlib inline for static inline plotting, and
  • %matplotlib notebook for inline plotting with basic interactions.

We will be using %matplotlib notebook. In this mode, we will always be drawing into the last open figure (unless we have explicitly closed it or called %matplotlib notebook again). With a right click on a single graphic, we can open a context menu which allows us to save the graphic.
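The idea of drawing into the most recently opened figure is part of pyplot's notion of a current figure. The sketch below shows this behavior in a plain script rather than in notebook mode (an assumption made so that it runs anywhere); the figure names are chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

fig_a = plt.figure()              # first figure
fig_b = plt.figure()              # second figure, which is now the current one

plt.plot([1, 2, 3], [1, 4, 9])    # pyplot functions draw into the current figure (fig_b)
n_axes_a = len(fig_a.axes)        # fig_a is still empty
n_axes_b = len(fig_b.axes)        # fig_b received the axes created by plt.plot

plt.figure(fig_a.number)          # explicitly make fig_a the current figure again
plt.plot([1, 2, 3], [3, 2, 1])    # this line now ends up in fig_a
current_is_a = plt.gcf() is fig_a
n_axes_a_after = len(fig_a.axes)

plt.close("all")                  # close all figures so that later plots start fresh
open_figures = plt.get_fignums()
```

Keeping references such as fig_a and fig_b around makes it explicit which figure a plotting call is meant for.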

In the following, we demonstrate some important syntax elements for using matplotlib within the Jupyter Notebook. For many of the customizations shown, matplotlib offers multiple ways to achieve them (as is often the case in programming, too!). Most matplotlib functions have many parameters, the majority of which are optional. Overall, the style of matplotlib is not very pythonic, which means that even experienced matplotlib users will regularly find themselves looking for the right syntax. Luckily, most questions that come up have already been asked by someone else, so the classical research methods (documentation, search engine, StackOverflow) will carry you relatively far.
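Before the full example, here is a minimal sketch of what "multiple ways" can look like in practice: the same dashed red line is drawn once through methods on an Axes object and once through pyplot functions that act on the current axes. The data and titles are made up, and the Agg backend is used only so that the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

xs = [0, 1, 2, 3]
ys = [0, 1, 4, 9]

# Way 1: object-oriented interface, calling methods on an explicit Axes object
fig_explicit, ax_explicit = plt.subplots()
ax_explicit.plot(xs, ys, color="r", linestyle="--")   # optional parameters passed by name
ax_explicit.set_title("Squares")
ax_explicit.set_xlabel("x")

# Way 2: pyplot interface, where functions act on the current figure and axes
plt.figure()
plt.plot(xs, ys, "r--")            # the format string "r--" also means red and dashed
plt.title("Squares")
plt.xlabel("x")
ax_implicit = plt.gca()            # fetch the axes that pyplot has been drawing into
```

Both variants lead to equivalent plots; the full example below mixes the two styles.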

# Import and notebook mode setup
import matplotlib.pyplot as plt
%matplotlib notebook

# Random numbers
import numpy as np

# General display options
# Use plt.rcParams.keys() to look up parameters that may be set globally
plt.rcParams['figure.figsize'] = (8,12)           # Standard size of a graphic
plt.rcParams['font.size'] = 8                     # Standard font size
plt.rcParams['font.family'] = 'Times New Roman'   # Standard font family

# Create graphic and subplots - returns figure object and axes
fig, (ax1, ax2, ax3) = plt.subplots(3, sharex=True, sharey=True)     

# Mock Data
xs1, ys1 = np.random.randint(1,200,10), np.random.randint(1,100,10)
xs2, ys2 = np.random.randint(1,200,10), np.random.randint(1,100,10)

# Scatter plot in the first subplot
ax1.scatter(xs1, ys1, color='k', marker='d')   # x and y coordinates passed separately
ax1.scatter(xs2, ys2, color='r', marker='*')   # color and marker are optional,
                                               # named parameters
ax1.set_title('Scatter Chart', fontsize=12)    # Set subplot title and adjust font size

# Line chart in the second subplot
ax2.plot(xs1, ys1, color='k')
ax2.plot(xs2, ys2, color='r')
ax2.set_title('Line Chart')

# Bar chart in the third subplot
ax3.bar(xs1, ys1, color='k', alpha=0.5, label='Dataset 1') # alpha determines opacity
ax3.bar(xs2, ys2, color='r', alpha=0.5, label='Dataset 2') # label is used for the legend
ax3.set_title('Bar Chart')

# Adjusting the axes
plt.xticks(range(0,201,10))                     # Marks on the x axis
plt.yticks(range(0,101,10))                     # Marks on the y axis
for ax in (ax1, ax2, ax3):                      # Iterate over all subplots
    ax.xaxis.set_ticks_position('both')         # x ticks on both sides
    ax.yaxis.set_ticks_position('both')         # y ticks on both sides
    ax.set_xlim(0,200)                          # Adjust x min and x max
    ax.set_ylim(0,100)                          # Adjust y min and y max

# Legend in subplot 3
ax3.legend(loc='upper left')                    # Place the legend (same as loc=2)

# Global title
plt.suptitle("Classic Charts", fontsize=14)

# Adjusting the layout
plt.tight_layout()                              # Tighten the layout
                                                # (important when using subplots)
plt.subplots_adjust(top=0.925, hspace=0.2)      # Further adjustments to improve
                                                # heading positions

plt.show()                                      # Show the graphic inline
                                                # (not needed in notebook mode)

Classic Charts
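Because the mock data above is drawn at random, the figure looks different on every run. If a reproducible figure is wanted, one option is a seeded random number generator. The sketch below uses NumPy's Generator interface; the seed value 42 is arbitrary, and the numbers it yields differ from those of np.random.randint as used above.

```python
import numpy as np

# A seeded generator yields the same sequence of numbers on every run
rng = np.random.default_rng(seed=42)

# Integers from 1 (inclusive) to 200 and 100 (exclusive), as in the mock data above
xs1 = rng.integers(1, 200, size=10)
ys1 = rng.integers(1, 100, size=10)

# A second generator with the same seed reproduces the same values
rng_again = np.random.default_rng(seed=42)
same_values = np.array_equal(xs1, rng_again.integers(1, 200, size=10))
```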


Specifically: seaborn

seaborn builds on matplotlib to offer a high-level API which allows us to create good-looking standard charts with greater ease. The official tutorial is instructive, and you'll learn some fundamentals of visualization and exploratory statistics without even noticing. seaborn adds most value when our data is available in the form of pandas data structures. Like matplotlib, pandas is part of the SciPy open source software ecosystem. Getting familiar with this ecosystem is essential for anyone aiming to do serious data analysis in Python.
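To give a first impression of how seaborn works with pandas data structures, the sketch below passes a small DataFrame to a seaborn function and refers to its columns by name. The data and the column names are invented for illustration, and the Agg backend is used only so that the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical mock data with two numeric columns and one categorical column
df = pd.DataFrame({
    "length": [1.0, 1.5, 2.0, 4.0, 4.5, 5.0],
    "width":  [0.5, 0.7, 0.6, 1.5, 1.4, 1.8],
    "group":  ["a", "a", "a", "b", "b", "b"],
})

# Columns are addressed by name; hue colors the points by category
ax = sns.scatterplot(data=df, x="length", y="width", hue="group")
```

seaborn takes the axis labels from the column names, so there is less to configure by hand than with plain matplotlib.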

Here, we only demonstrate the power of seaborn using an example from the documentation. The dataset used therein (iris) consists of measurements of the flowers of three different species of iris. This dataset is often used to illustrate data mining and machine learning concepts; it is a 'classic' for the data science community and thus provided by seaborn as an example dataset. To create a similar visualization using only matplotlib, we would need much more code.

# Imports with standard abbreviations
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np                # Not needed in the following but often relevant
import pandas as pd               # Not needed in the following but often relevant

# Notebook mode
%matplotlib notebook
sns.set(style="ticks")            # Select one of several standard styles
iris = sns.load_dataset("iris")   # Load the iris dataset into a pandas DataFrame
iris.head()                       # Inspect the first rows of the DataFrame
                                  # (method on DataFrame objects)
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
g = sns.PairGrid(iris, hue="species")  # Grid to show pairwise correlations
                                       # between variables
g.map_diag(plt.hist)                   # Histograms on the diagonal
g.map_offdiag(plt.scatter);            # Scatter plots off the diagonal

Graphical Summary of the Iris Dataset
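For comparison, a comparable grid built with matplotlib alone might look roughly like the sketch below. To stay self-contained, it uses random mock data with three made-up variables and two made-up groups rather than the iris dataset, and it selects the Agg backend so that it runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical mock data: two groups, 30 observations each, three variables
rng = np.random.default_rng(0)
variables = ["a", "b", "c"]
groups = {"group 1": rng.normal(0, 1, size=(30, 3)),
          "group 2": rng.normal(2, 1, size=(30, 3))}

n = len(variables)
fig, axes = plt.subplots(n, n, figsize=(6, 6), sharex="col")

for row in range(n):
    for col in range(n):
        ax = axes[row, col]
        for label, values in groups.items():
            if row == col:
                # Histogram of one variable on the diagonal
                ax.hist(values[:, col], alpha=0.5, label=label)
            else:
                # Scatter plot of two different variables off the diagonal
                ax.scatter(values[:, col], values[:, row], s=8, label=label)
        if row == n - 1:
            ax.set_xlabel(variables[col])   # label only the bottom row
        if col == 0:
            ax.set_ylabel(variables[row])   # label only the left column

axes[0, 0].legend(fontsize=6)
fig.tight_layout()
```

The loops, the labeling, and the legend all have to be written by hand here, whereas PairGrid takes care of them.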

