Matplotlib for Data Analysis

By James Olayinka

Jul 31

Introduction

Data visualization is a crucial aspect of data analysis, as it allows us to explore and communicate insights effectively. In this comprehensive guide, I will use the Titanic dataset to demonstrate various data visualization techniques using Matplotlib, a widely used plotting library in Python.

In this article, I will cover the following sub-topics to improve your understanding of the Matplotlib library.

Installing Matplotlib and Loading the Dataset
Line Plots
Bar Plots
Histograms
Scatter Plots
Box Plots
Pie Charts
Heatmaps
Subplots
Advanced Techniques

Let’s get started…

Installing Matplotlib and Loading the Dataset

Before getting started, make sure that Matplotlib is installed. If you are using a Python distribution like Anaconda, Matplotlib should already be installed. If not, you can install it using pip (the leading ! runs the command from a notebook cell; omit it in a terminal):

!pip install matplotlib

I will be using the popular Titanic dataset in this tutorial, which contains information about the passengers who were aboard the Titanic when it sank.

Once you've downloaded the dataset, load it into a Pandas DataFrame named df. The examples in the rest of this article also assume that matplotlib.pyplot has been imported as plt and NumPy as np.
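A minimal sketch of the loading step follows. The file name titanic.csv is an assumption (use whatever path you saved the download to), load_passengers is a hypothetical helper name, and the inline sample rows are made up purely so that the snippet runs on its own. The column names match the ones used throughout this article.

```python
import io

import matplotlib.pyplot as plt  # used as plt by the plotting code in this article
import numpy as np               # used as np by some later examples
import pandas as pd

# Made-up sample rows with a few of the columns used in this article
# (NOT real data), so the snippet runs without the downloaded file.
SAMPLE_CSV = """Survived,Pclass,Sex,Age,Fare
0,3,male,22,7.25
1,1,female,38,71.28
1,3,female,,7.92
0,2,male,35,13.00
"""


def load_passengers(source):
    """Read passenger data from a file path or a file-like object."""
    return pd.read_csv(source)


# With the real download this would be load_passengers('titanic.csv'),
# where 'titanic.csv' is an assumed file name.
df = load_passengers(io.StringIO(SAMPLE_CSV))

print(df.shape)
print(df.isna().sum())  # missing values per column
```

Checking for missing values right after loading is worthwhile because some plotting and fitting functions do not handle NaN values gracefully.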

Line Plots

Line plots are useful for visualizing trends and patterns over continuous data.

Let's create a basic line plot of passenger ages.

# Creating a basic line plot

plt.plot(df['Age'])
plt.xlabel('Passenger Index')
plt.ylabel('Age')
plt.title('Passenger Ages')
plt.show()

The resulting chart will look like this…

Customizing Line Styles and Colors

Matplotlib provides options to customize the line styles and colors in line plots.

Let's see an example of how to change the line style and color.


plt.plot(df['Age'], linestyle='--', color='r')
plt.xlabel('Passenger Index')
plt.ylabel('Age')
plt.title('Passenger Ages')
plt.show()

The resulting chart will look like this…

Adding Labels, Titles, and Legends

Labels, titles, and legends are essential components of a line plot to provide context and improve understanding.

plt.plot(df['Age'], label='Age')
plt.xlabel('Passenger Index')
plt.ylabel('Age')
plt.title('Passenger Ages')
plt.legend()
plt.show()

The resulting chart will look like this…

Bar Plots

Bar plots are great for comparing categorical data or aggregating data.

Vertical and Horizontal Bar Plots

Let's create a vertical bar plot of survival counts and a horizontal bar plot of passenger class counts.


# Vertical Bar Plot
survived_counts = df['Survived'].value_counts()
plt.bar(survived_counts.index, survived_counts.values)
plt.xlabel('Survived')
plt.ylabel('Count')
plt.title('Survival Counts')
plt.show()

# Horizontal Bar Plot
class_counts = df['Pclass'].value_counts()
plt.barh(class_counts.index, class_counts.values)
plt.xlabel('Count')
plt.ylabel('Passenger Class')
plt.title('Passenger Class Counts')
plt.show()

The resulting chart will look like this…

Grouped Bar Plots

Grouped bar plots are useful for comparing categorical data across different groups or categories.

# Calculate survival counts based on passenger class
survival_counts = df.groupby('Pclass')['Survived'].value_counts().unstack()

# Create a grouped bar plot
survival_counts.plot(kind='bar', stacked=False)
plt.xlabel('Passenger Class')
plt.ylabel('Count')
plt.title('Survival Counts by Passenger Class')
plt.legend(title='Survived', labels=['No', 'Yes'])
plt.show()

The resulting chart will look like this…
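The pandas plot call above handles the bar positions for you. As a rough sketch of how the same grouping could be done with plt.bar alone, the example below shifts each series half a bar width to either side of the group position. The counts and the variable names are illustrative placeholders, not figures from the dataset.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder counts per passenger class (illustrative only, NOT real data).
classes = ['1st', '2nd', '3rd']
not_survived = [8, 10, 37]
survived = [14, 9, 12]

x = np.arange(len(classes))  # one group position per class
width = 0.4                  # width of each bar

fig, ax = plt.subplots()
# Shift each series by half a bar width so the pair sits side by side.
bars_no = ax.bar(x - width / 2, not_survived, width, label='No')
bars_yes = ax.bar(x + width / 2, survived, width, label='Yes')

ax.set_xticks(x)
ax.set_xticklabels(classes)
ax.set_xlabel('Passenger Class')
ax.set_ylabel('Count')
ax.set_title('Survival Counts by Passenger Class')
ax.legend(title='Survived')
```

Call plt.show() afterwards to display the figure.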

Stacked Bar Plots

Stacked bar plots are useful for comparing the contribution of different categories within a group.

# Calculate survival counts based on passenger class
survival_counts = df.groupby('Pclass')['Survived'].value_counts().unstack()

# Create a stacked bar plot
survival_counts.plot(kind='bar', stacked=True)
plt.xlabel('Passenger Class')
plt.ylabel('Count')
plt.title('Survival Distribution by Passenger Class')
plt.legend(title='Survived', labels=['No', 'Yes'])
plt.show()

The resulting chart will look like this…
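For comparison, here is a sketch of stacking with plt.bar directly: the second layer is lifted by passing the first layer's heights as the bottom argument. The counts are made-up placeholders, not figures from the dataset.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder counts per passenger class (illustrative only, NOT real data).
classes = ['1', '2', '3']
not_survived = np.array([8, 10, 37])
survived = np.array([14, 9, 12])

fig, ax = plt.subplots()

# The first layer sits on the x-axis.
layer_no = ax.bar(classes, not_survived, label='No')
# The second layer starts where the first one ends.
layer_yes = ax.bar(classes, survived, bottom=not_survived, label='Yes')

ax.set_xlabel('Passenger Class')
ax.set_ylabel('Count')
ax.set_title('Survival Distribution by Passenger Class')
ax.legend(title='Survived')
```

With more than two layers, the bottom for each new layer would be the running sum of all the layers beneath it.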

Histograms

Histograms allow us to visualize the distribution of numerical data.

Basic Histogram

Let's create a basic histogram of passenger ages.

# Creating a basic histogram
plt.hist(df['Age'], bins=20)
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Passenger Age Distribution')
plt.show()

The resulting chart will look like this…

Customizing Bins and Histogram Appearance

Matplotlib allows us to customize the number of bins and the appearance of a histogram to better represent the data. Here, the bars get a fill color and a black outline so that adjacent bins are easier to tell apart.

# Create a histogram with a custom color and bar outlines
plt.hist(df['Age'], bins=20, color='skyblue', edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Passenger Age Distribution')
plt.show()

The resulting chart will look like this…
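The bins argument also accepts an explicit sequence of bin edges, which is handy when the bins should line up with meaningful boundaries. The sketch below uses ten-year age bands; the ages are randomly generated stand-ins, not values from the dataset.

```python
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated stand-in ages (NOT real data), clipped to the 0-80 range.
rng = np.random.default_rng(seed=0)
ages = np.clip(rng.normal(loc=30, scale=14, size=200), 0, 80)

# Bin edges every 10 years: 0, 10, ..., 80, which gives eight bins.
edges = np.arange(0, 81, 10)

fig, ax = plt.subplots()
counts, used_edges, patches = ax.hist(ages, bins=edges,
                                      color='skyblue', edgecolor='black')
ax.set_xticks(edges)  # put a tick on every bin boundary
ax.set_xlabel('Age')
ax.set_ylabel('Count')
ax.set_title('Passenger Age Distribution (10-year bins)')
```

ax.hist returns the per-bin counts and the edges it used, so the same numbers can be reused in a table or annotation.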

Overlaying Multiple Histograms

Sometimes, it is useful to overlay multiple histograms to compare distributions between different groups or categories.

# Create histograms for different survival outcomes
survived = df[df['Survived'] == 1]['Age']
not_survived = df[df['Survived'] == 0]['Age']

plt.hist(survived, bins=20, color='skyblue', edgecolor='black', alpha=0.5, label='Survived')
plt.hist(not_survived, bins=20, color='salmon', edgecolor='black', alpha=0.5, label='Not Survived')

plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Passenger Age Distribution by Survival')
plt.legend()
plt.show()

The resulting chart will look like this…
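One thing to watch when overlaying: if each call picks its own 20 bins, the bars of the two groups do not line up, and the larger group dominates on a count scale. A possible variant, sketched here with randomly generated stand-in ages, shares one set of bin edges and passes density=True so that each histogram integrates to 1.

```python
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated stand-in ages (NOT real data); group sizes differ on purpose.
rng = np.random.default_rng(seed=1)
survived_ages = rng.normal(loc=28, scale=15, size=120).clip(0, 80)
not_survived_ages = rng.normal(loc=31, scale=14, size=200).clip(0, 80)

# One set of 20 equal-width bins shared by both groups.
shared_edges = np.linspace(0, 80, 21)

fig, ax = plt.subplots()
dens_survived, _, _ = ax.hist(survived_ages, bins=shared_edges, density=True,
                              color='skyblue', edgecolor='black',
                              alpha=0.5, label='Survived')
dens_not, _, _ = ax.hist(not_survived_ages, bins=shared_edges, density=True,
                         color='salmon', edgecolor='black',
                         alpha=0.5, label='Not Survived')

ax.set_xlabel('Age')
ax.set_ylabel('Density')
ax.set_title('Passenger Age Distribution by Survival')
ax.legend()
```

The y-axis now shows density rather than count, so the plot compares the shapes of the two distributions rather than the group sizes.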

Scatter Plots

Scatter plots help us understand the relationship between two variables.

Basic Scatter Plots

Let's create a scatter plot to explore the relationship between passenger age and fare.

# Creating a basic scatter plot
plt.scatter(df['Age'], df['Fare'])
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Age vs Fare')
plt.show()

The resulting chart will look like this…

Customizing Marker Styles and Colors

Matplotlib allows us to customize the marker styles and colors to enhance the scatter plot's appearance and highlight different data points.

Let's recreate the scatter plot of passenger ages and fares, this time setting the marker style and color explicitly.

# Create a scatter plot with an explicit marker style and color
plt.scatter(df['Age'], df['Fare'], marker='o', color='blue')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Age vs Fare')
plt.show()

The resulting chart will look like this…

Adding a Regression Line

Sometimes, it is useful to visualize the overall trend or relationship between two variables using a regression line on a scatter plot. Matplotlib allows us to add a regression line using the plot() function.

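Let's add a regression line to the scatter plot of passenger ages and fares. The Age column contains missing values, and np.polyfit cannot fit through NaN values, so the rows without an age are dropped first.

# Keep only the rows where both Age and Fare are present
data = df[['Age', 'Fare']].dropna()

# Create a scatter plot
plt.scatter(data['Age'], data['Fare'], marker='o', color='blue', alpha=0.5)
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Age vs Fare')

# Add a regression line (least-squares fit of a degree-1 polynomial)
coefficients = np.polyfit(data['Age'], data['Fare'], deg=1)
x = np.linspace(data['Age'].min(), data['Age'].max(), 100)
y = np.polyval(coefficients, x)
plt.plot(x, y, color='red')

plt.show()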

Box Plots

Box plots provide a summary of the distribution of data and help identify outliers.

Basic Box Plot

Let's create a basic box plot of passenger fares.

# Creating a basic box plot
plt.boxplot(df['Fare'])
plt.ylabel('Fare')
plt.title('Passenger Fare Distribution')
plt.show()

The resulting chart will look like this…
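With its default settings, plt.boxplot extends the whiskers to the most extreme data points that lie within 1.5 times the interquartile range (IQR) of the box, and draws anything beyond that as an individual point. The sketch below reproduces that rule with NumPy on a handful of made-up fares, so you can see which values would be flagged.

```python
import matplotlib.pyplot as plt
import numpy as np

# A few made-up fares (NOT real data) with two obviously large values.
fares = np.array([7.25, 7.9, 8.05, 10.5, 13.0, 15.5, 26.0, 30.0, 71.3, 263.0])

# Quartiles and the 1.5 * IQR fences used by the default whisker rule.
q1, q3 = np.percentile(fares, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones drawn as outlier points.
outliers = fares[(fares < lower_fence) | (fares > upper_fence)]
print('Fences:', lower_fence, upper_fence)
print('Outliers:', outliers)

# The box plot itself reports the same points under the 'fliers' key.
fig, ax = plt.subplots()
result = ax.boxplot(fares)
plotted_fliers = result['fliers'][0].get_ydata()
ax.set_ylabel('Fare')
ax.set_title('Passenger Fare Distribution')
```

The whis parameter of boxplot changes the 1.5 multiplier if a stricter or looser rule suits your data better.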

Grouped Box Plots

Grouped box plots are useful for comparing the distribution of a numerical variable across different groups or categories.

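Let's create a grouped box plot to compare the fares of passengers in different passenger classes. Note that Matplotlib 3.9 renamed the labels parameter of boxplot to tick_labels; on those versions labels still works but emits a deprecation warning.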

# Create a grouped box plot
plt.boxplot([df[df['Pclass'] == 1]['Fare'],
             df[df['Pclass'] == 2]['Fare'],
             df[df['Pclass'] == 3]['Fare']],
            labels=['1st Class', '2nd Class', '3rd Class'])
plt.xlabel('Passenger Class')
plt.ylabel('Fare')
plt.title('Fare Distribution by Passenger Class')
plt.show()

The resulting chart will look like this…

Customizing Box Appearance

Matplotlib provides several options to customize the appearance of box plots, including the colors, line styles, and marker styles.

Here's an example of customizing the box appearance:

# Customizing box appearance
boxprops = dict(color='blue', linewidth=2)
whiskerprops = dict(color='red', linestyle='--')
medianprops = dict(color='green', linewidth=2)
flierprops = dict(marker='o', markersize=5, markerfacecolor='black')

# Create a grouped box plot
plt.boxplot([df[df['Pclass'] == 1]['Fare'],
             df[df['Pclass'] == 2]['Fare'],
             df[df['Pclass'] == 3]['Fare']],
            labels=['1st Class', '2nd Class', '3rd Class'],
            boxprops=boxprops, whiskerprops=whiskerprops,
            medianprops=medianprops, flierprops=flierprops)
plt.xlabel('Passenger Class')
plt.ylabel('Fare')
plt.title('Fare Distribution by Passenger Class')
plt.show()

The resulting chart will look like this…

Pie Charts

Pie charts are useful for displaying proportions or percentages.

Basic Pie Chart

Let's create a basic pie chart to visualize the proportion of male and female passengers.

# Creating a basic pie chart
sex_counts = df['Sex'].value_counts()
plt.pie(sex_counts.values, labels=sex_counts.index, autopct='%1.1f%%')
plt.title('Passenger Gender Distribution')
plt.show()

The resulting chart will look like this…

Exploding Slices and Customizing Colors

Pie charts are a popular way to represent proportions or percentages of a whole. Matplotlib allows us to customize various aspects of a pie chart, including exploding slices and customizing colors.

Let's create a pie chart to visualize the distribution of passenger classes in the Titanic dataset and explore exploding slices and custom color options.

# Count the number of passengers in each class
class_counts = df['Pclass'].value_counts()

# Define custom colors for each slice
colors = ['#FF9999', '#66B2FF', '#99FF99']

# Define the extent to which slices are exploded
explode = (0.1, 0, 0)

# Create a pie chart
plt.pie(class_counts, labels=class_counts.index, colors=colors,
        explode=explode, autopct='%1.1f%%', shadow=True, startangle=90)

plt.axis('equal')
plt.title('Passenger Class Distribution')
plt.show()

The resulting chart will look like this…

Heatmaps

Heatmaps are effective for visualizing correlations and patterns in large datasets. Let's create a basic heatmap to visualize the correlation between variables.

Basic Heatmap

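# Creating a basic heatmap
# Correlate numeric columns only; on pandas 2.0+, df.corr() raises an error for text columns such as Sex
correlation_matrix = df.select_dtypes(include='number').corr()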
plt.imshow(correlation_matrix, cmap='coolwarm', interpolation='nearest')
plt.colorbar()
plt.xticks(range(len(correlation_matrix)), correlation_matrix.columns, rotation=90)
plt.yticks(range(len(correlation_matrix)), correlation_matrix.columns)
plt.title('Correlation Matrix')
plt.show()

The resulting chart will look like this…

Customizing Colormap and Axes Labels

Heatmaps are an effective way to visualize data in a matrix form, with colors representing the values. Matplotlib allows us to customize the colormap, axes labels, and other aspects of a heatmap.

Let's create a heatmap to visualize the correlation matrix of the numeric variables in the Titanic dataset and explore customizing the colormap and axes labels.

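# Compute correlation matrix from the numeric columns only
# (on pandas 2.0+, df.corr() raises an error for text columns such as Sex)
corr_matrix = df.select_dtypes(include='number').corr()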

# Create a heatmap
plt.imshow(corr_matrix, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(label='Correlation')
plt.xticks(np.arange(len(corr_matrix.columns)), corr_matrix.columns, rotation=45)
plt.yticks(np.arange(len(corr_matrix.columns)), corr_matrix.columns)
plt.title('Correlation Matrix of Numeric Variables')
plt.show()

The resulting chart will look like this…

Adding Annotations and Gridlines

Annotations and gridlines can enhance the readability and interpretability of a heatmap. Here's an example of adding annotations and gridlines to the heatmap:

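# Compute correlation matrix from the numeric columns only
# (on pandas 2.0+, df.corr() raises an error for text columns such as Sex)
corr_matrix = df.select_dtypes(include='number').corr()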

# Create a heatmap with annotations and gridlines
plt.imshow(corr_matrix, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(label='Correlation')
plt.xticks(np.arange(len(corr_matrix.columns)), corr_matrix.columns, rotation=45)
plt.yticks(np.arange(len(corr_matrix.columns)), corr_matrix.columns)
plt.title('Correlation Matrix of Numeric Variables')

# Add annotations
for i in range(len(corr_matrix.columns)):
    for j in range(len(corr_matrix.columns)):
        plt.text(j, i, f'{corr_matrix.iloc[i, j]:.2f}', ha='center', va='center', color='white')

# Add gridlines
plt.grid(visible=True, which='both', color='white', linestyle='-', linewidth=0.5)

plt.show()

The resulting chart will look like this…
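One caveat with plt.grid on an imshow heatmap: gridlines are drawn at the tick positions, which sit at the cell centers, so the lines cut through the annotations. A common workaround, sketched here on a small made-up correlation matrix, is to place minor ticks on the cell edges and draw the grid on those instead.

```python
import matplotlib.pyplot as plt
import numpy as np

# Small made-up correlation matrix (NOT computed from the dataset).
labels = ['Age', 'Fare', 'Pclass']
corr = np.array([[1.0, 0.1, -0.4],
                 [0.1, 1.0, -0.5],
                 [-0.4, -0.5, 1.0]])
n = len(labels)

fig, ax = plt.subplots()
image = ax.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(image, ax=ax, label='Correlation')

# Major ticks sit at the cell centers and carry the labels.
ax.set_xticks(np.arange(n))
ax.set_xticklabels(labels, rotation=45)
ax.set_yticks(np.arange(n))
ax.set_yticklabels(labels)

# Minor ticks sit on the cell edges; draw the grid only on those.
ax.set_xticks(np.arange(-0.5, n, 1), minor=True)
ax.set_yticks(np.arange(-0.5, n, 1), minor=True)
ax.grid(which='minor', color='white', linestyle='-', linewidth=2)
ax.tick_params(which='minor', bottom=False, left=False)

# Annotate every cell with its value.
for i in range(n):
    for j in range(n):
        ax.text(j, i, f'{corr[i, j]:.2f}', ha='center', va='center', color='black')

ax.set_title('Correlation Matrix of Numeric Variables')
```

Black annotation text is used here because the coolwarm colormap is light near zero, where white text would be hard to read.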

Subplots

Subplots allow us to display multiple plots in a single figure.

Multiple Subplots

Let's create a figure with multiple subplots to showcase different visualizations.

# Creating a figure with multiple subplots
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))

# Line plot
axes[0, 0].plot(df['Age'])
axes[0, 0].set_xlabel('Passenger Index')
axes[0, 0].set_ylabel('Age')
axes[0, 0].set_title('Passenger Ages')

# Bar plot
axes[0, 1].bar(survived_counts.index, survived_counts.values)
axes[0, 1].set_xlabel('Survived')
axes[0, 1].set_ylabel('Count')
axes[0, 1].set_title('Survival Counts')

# Histogram
axes[1, 0].hist(df['Age'], bins=20)
axes[1, 0].set_xlabel('Age')
axes[1, 0].set_ylabel('Count')
axes[1, 0].set_title('Passenger Age Distribution')

# Scatter plot
axes[1, 1].scatter(df['Age'], df['Fare'])
axes[1, 1].set_xlabel('Age')
axes[1, 1].set_ylabel('Fare')
axes[1, 1].set_title('Age vs Fare')

plt.tight_layout()
plt.show()

The resulting chart will look like this…

Customizing Subplot Layouts

Matplotlib provides flexibility in customizing subplot layouts, allowing you to arrange multiple subplots in various configurations. You can control the number of rows and columns, adjust spacing between subplots, and more.

Let's create a figure with multiple subplots to visualize different variables from the Titanic dataset and explore customizing the subplot layout.

# Create a figure and subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

# First subplot: Histogram of fares
axs[0, 0].hist(df['Fare'], bins=30, color='blue')
axs[0, 0].set_xlabel('Fare')
axs[0, 0].set_ylabel('Frequency')
axs[0, 0].set_title('Distribution of Fares')

# Second subplot: Bar plot of passenger classes
class_counts = df['Pclass'].value_counts()
axs[0, 1].bar(class_counts.index, class_counts.values, color='green')
axs[0, 1].set_xlabel('Passenger Class')
axs[0, 1].set_ylabel('Count')
axs[0, 1].set_title('Passenger Class Distribution')

# Third subplot: Scatter plot of age and fare
axs[1, 0].scatter(df['Age'], df['Fare'], marker='o', color='red', alpha=0.5)
axs[1, 0].set_xlabel('Age')
axs[1, 0].set_ylabel('Fare')
axs[1, 0].set_title('Age vs Fare')

# Fourth subplot: Pie chart of survival proportions
survival_counts = df['Survived'].value_counts()
axs[1, 1].pie(survival_counts, labels=['Did not survive', 'Survived'], autopct='%1.1f%%', colors=['orange', 'purple'])
axs[1, 1].set_title('Survival Proportions')

# Adjust spacing between subplots
plt.tight_layout()

# Show the figure
plt.show()

The resulting chart will look like this…
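Beyond tight_layout(), plt.subplots accepts a gridspec_kw dictionary for uneven column widths or row heights, and fig.subplots_adjust sets the gaps between subplots explicitly. Here is a small sketch with randomly generated stand-in fares, not values from the dataset.

```python
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated stand-in fares (NOT real data).
rng = np.random.default_rng(seed=2)
fares = rng.exponential(scale=30, size=150)

# One row, two columns; the left column is three times as wide as the right.
fig, axs = plt.subplots(1, 2, figsize=(10, 4),
                        gridspec_kw={'width_ratios': [3, 1]})

axs[0].hist(fares, bins=30, color='blue')
axs[0].set_xlabel('Fare')
axs[0].set_ylabel('Frequency')
axs[0].set_title('Distribution of Fares')

axs[1].boxplot(fares)
axs[1].set_ylabel('Fare')
axs[1].set_title('Fare Summary')

# Set the horizontal gap explicitly, as a fraction of the average axes width.
fig.subplots_adjust(wspace=0.3)
```

Use either subplots_adjust or tight_layout() on a given figure; calling both means the later call overrides the spacing set by the earlier one.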

Sharing Axes and Legends between Subplots

Matplotlib allows sharing axes and legends between subplots to enhance the visual coherence and avoid redundancy.

Let's update the code to share the x-axis between the two Age plots in the left column and to give the figure a single legend for the pie chart. Passing sharex='col' to plt.subplots would also link the bar plot to the pie chart, whose axis limits would then clip the bars, so only the left column is shared here.

# Create a figure and subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

# Share the x-axis (Age) between the two subplots in the left column
axs[1, 0].sharex(axs[0, 0])

# First subplot: Histogram of ages (rows with a missing age are dropped)
axs[0, 0].hist(df['Age'].dropna(), bins=30, color='blue')
axs[0, 0].tick_params(labelbottom=False)  # the shared axis is labelled on the plot below
axs[0, 0].set_ylabel('Frequency')
axs[0, 0].set_title('Distribution of Ages')

# Second subplot: Bar plot of passenger classes
class_counts = df['Pclass'].value_counts()
axs[0, 1].bar(class_counts.index, class_counts.values, color='green')
axs[0, 1].set_xlabel('Passenger Class')
axs[0, 1].set_ylabel('Count')
axs[0, 1].set_title('Passenger Class Distribution')

# Third subplot: Scatter plot of age and fare
axs[1, 0].scatter(df['Age'], df['Fare'], marker='o', color='red', alpha=0.5)
axs[1, 0].set_xlabel('Age')
axs[1, 0].set_ylabel('Fare')
axs[1, 0].set_title('Age vs Fare')

# Fourth subplot: Pie chart of survival proportions
survival_counts = df['Survived'].value_counts()
wedges, texts, autotexts = axs[1, 1].pie(survival_counts, autopct='%1.1f%%', colors=['orange', 'purple'])
axs[1, 1].set_title('Survival Proportions')

# Add a single figure-level legend instead of labelling each wedge
fig.legend(wedges, ['Did not survive', 'Survived'], loc='lower right')

# Adjust spacing between subplots
plt.tight_layout()

# Show the figure
plt.show()

The resulting chart will look like this…

Advanced Techniques

Matplotlib offers advanced techniques for specialized visualizations. Let's explore some of these techniques briefly.

3D Plots

Matplotlib provides functionality for creating 3D plots to visualize data in three-dimensional space.

Let's create a simple 3D scatter plot to visualize the relationship between age, fare, and survival status in the Titanic dataset.

# Create a 3D scatter plot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')

# Plot the data points
ax.scatter(df['Age'], df['Fare'], df['Survived'], c=df['Survived'], cmap='coolwarm')

# Set labels for each axis
ax.set_xlabel('Age')
ax.set_ylabel('Fare')
ax.set_zlabel('Survived')

# Show the plot
plt.show()

The resulting chart will look like this…

Polar Plots

Polar plots draw data on polar axes, where each point has an angle and a radius. Radar charts (also called spider charts) are a common example; they are useful for comparing several attributes or categories arranged around a circle.

Let's create a polar plot to compare the survival rates across different passenger classes in the Titanic dataset.

# Group the data by passenger class and compute the survival rate
class_survival_rate = df.groupby('Pclass')['Survived'].mean()

# Create a polar plot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, polar=True)

# Compute the angles and lengths of the spokes
angles = np.linspace(0, 2 * np.pi, len(class_survival_rate), endpoint=False)
lengths = class_survival_rate.values

# Plot the data
ax.plot(angles, lengths, marker='o')

# Set labels for each spoke
ax.set_xticks(angles)
ax.set_xticklabels(class_survival_rate.index)

# Set the radial axis limits
ax.set_ylim(0, 1)

# Set a title for the plot
ax.set_title('Survival Rates by Passenger Class')

# Show the plot
plt.show()

The resulting chart will look like this…
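Note that ax.plot on polar axes connects the points in order and stops at the last one, so the outline stays open between the last class and the first. If you want the closed outline typical of radar charts, one option is to repeat the first point at the end, as in this sketch (the rates are made-up placeholders, not values computed from the dataset).

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder survival rates per passenger class (illustrative only, NOT real data).
class_labels = ['1', '2', '3']
rates = np.array([0.6, 0.45, 0.25])

# One spoke per class, evenly spaced around the circle.
angles = np.linspace(0, 2 * np.pi, len(rates), endpoint=False)

# Repeat the first point at the end so the line returns to where it started.
closed_angles = np.append(angles, angles[0])
closed_rates = np.append(rates, rates[0])

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, polar=True)
ax.plot(closed_angles, closed_rates, marker='o')
ax.fill(closed_angles, closed_rates, alpha=0.25)  # shade the enclosed area

ax.set_xticks(angles)
ax.set_xticklabels(class_labels)
ax.set_ylim(0, 1)
ax.set_title('Survival Rates by Passenger Class')
```

With only three categories a bar plot is usually easier to read; the polar layout tends to pay off when there are more spokes to compare.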

Conclusion

In this comprehensive guide, we have covered various data visualization techniques using Matplotlib with the Titanic dataset. You should now have a solid understanding of how to create line plots, bar plots, histograms, scatter plots, box plots, pie charts, heatmaps, and subplots. Additionally, we explored advanced techniques that open doors to even more specialized visualizations. Experiment with different options and customize your visualizations to effectively communicate insights from your data using Matplotlib.

Keep in mind that this is just a brief introduction to Matplotlib, and there are many more advanced features and functions available that can help you with more complex data analysis tasks. To learn more about Matplotlib, be sure to check out the official documentation and the many other resources available online.

If you want to get started with data analytics and improve your skills, you can check out our Learning Track.
