
20/08/2021

Data Visualization

Data visualization is an important skill in applied statistics and machine learning. It is helpful when exploring and getting to know a dataset, and it can help identify patterns, corrupt data, outliers, and much more.

This tutorial is divided into 6 parts; they are:
  • Introduction to Matplotlib
  • Line Plot
  • Bar Chart
  • Histogram Plot
  • Box and Whisker Plot
  • Scatter Plot

1. Introduction to Matplotlib

There are many excellent plotting libraries in Python and I recommend exploring them in order to create presentable graphics. For quick and dirty plots intended for your own use, I recommend using the Matplotlib library.

# import the pyplot interface
from matplotlib import pyplot

...
# create a plot
pyplot.plot(...)

...
# display the plot
pyplot.show()

...
# save plot to file (call this before pyplot.show(), which can leave the figure empty)
pyplot.savefig('my_image.png')
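
The three steps above can be combined into one short script. This is a minimal sketch rather than a definitive recipe: the data values, the title and axis labels, the temporary output directory, and the choice of the non-interactive "Agg" backend are all assumptions made so that it runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
from matplotlib import pyplot

# create a plot from a few hand-picked values (illustrative data only)
pyplot.plot([0, 1, 2, 3], [0, 1, 4, 9])
pyplot.title("y = x squared")
pyplot.xlabel("x")
pyplot.ylabel("y")

# save the figure to a file in the system temporary directory
out_path = os.path.join(tempfile.gettempdir(), "my_image.png")
pyplot.savefig(out_path)

# release the figure once it has been written
pyplot.close()
saved_ok = os.path.exists(out_path)
print(saved_ok)
```

For interactive use, drop the matplotlib.use("Agg") line and call pyplot.show() after pyplot.savefig().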

2. Line Plot

A line plot is generally used to present observations collected at regular intervals. The x-axis represents the regular interval, such as time. The y-axis shows the observations, ordered by the x-axis and connected by a line.

Line plots are useful for presenting time series data as well as any sequence data where there is an ordering between observations.

# example of a line plot
from numpy import sin
from matplotlib import pyplot
# consistent interval for x-axis
x = [x*0.1 for x in range(100)]
# function of x for y-axis
y = sin(x)
# create line plot
pyplot.plot(x, y)
# show line plot
pyplot.show()

-----Result-----

Example of a line plot of data


3. Bar Chart

A bar chart is generally used to present relative quantities for multiple categories. 

The x-axis represents the categories, which are spaced evenly.

The y-axis represents the quantity for each category and is drawn as a bar from the baseline to the appropriate level on the y-axis.


# example of a bar chart
from random import seed
from random import randint
from matplotlib import pyplot
# seed the random number generator
seed(1)
# names for categories
x = ['red', 'green', 'blue']
# quantities for each category
y = [randint(0, 100), randint(0, 100), randint(0, 100)]
# create bar chart
pyplot.bar(x, y)
# show bar chart
pyplot.show()


-----Result-----

Example of a bar chart of data
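
In practice the quantity for each category is often a count of how many times that category occurs. The sketch below shows one way to get from raw observations to bar heights; the list of observations is made up for illustration, and the "Agg" backend is an assumption so the script runs without a display.

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
from matplotlib import pyplot

# hypothetical raw observations, one category label per observation
observations = ['red', 'blue', 'red', 'green', 'red', 'green']

# count how often each category occurs
counts = Counter(observations)
names = list(counts.keys())
values = list(counts.values())

# one bar per category, with height equal to the count
bars = pyplot.bar(names, values)
pyplot.ylabel('count')
heights = [bar.get_height() for bar in bars]
print(names, heights)
# call pyplot.show() here to display the chart interactively
```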


4. Histogram Plot

A histogram plot is generally used to summarize the distribution of a data sample. 

The x-axis represents discrete bins or intervals for the observations. 

For example, observations with values between 1 and 10 may be split into five bins: the values [1,2] would be allocated to the first bin, [3,4] to the second bin, and so on. The y-axis represents the frequency, or count, of observations in the dataset that belong to each bin.
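
The binning described above can be checked without drawing anything. This is a small sketch using NumPy's histogram function on the integers 1 to 10; the choice of exactly these ten observations is an assumption made to mirror the example in the text.

```python
from numpy import arange, histogram

# the integers 1..10, one observation each
observations = arange(1, 11)

# split the range 1..10 into five equal-width bins and count observations per bin
counts, edges = histogram(observations, bins=5, range=(1, 10))

print(counts)  # number of observations in each bin
print(edges)   # the six bin edges that delimit the five bins
```

Each bin receives two observations. The same bins argument can be passed to pyplot.hist(), for example pyplot.hist(x, bins=5), to control how many bars are drawn.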

# example of a histogram plot
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# random numbers drawn from a Gaussian distribution
x = randn(1000)
# create histogram plot
pyplot.hist(x)
# show histogram plot
pyplot.show()

-----Result-----

Example of a histogram plot of data

Running the example, we can see that the shape of the bars shows the bell-shaped curve of the Gaussian distribution.


5. Box and Whisker Plot

The boxplot is a graphical technique that displays the distribution of variables. 

It helps us see the location, skewness, spread, tail length and outlying points.

The boxplot is a graphical representation of the five-number summary:
  • The minimum.
  • Q1 (the first quartile, or the 25% mark).
  • The median.
  • Q3 (the third quartile, or the 75% mark).
  • The maximum.
Boxplots are useful to summarize the distribution of a data sample as an alternative to the histogram. 

They give a quick idea of the range of common values (the box) and the range of sensible values (the whiskers).

Because we are not looking at the shape of the distribution explicitly, this method is often used when the data has an unknown or unusual distribution, such as a non-Gaussian distribution.
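
The five-number summary listed above can also be computed directly. The following is a minimal sketch using NumPy's percentile function on a Gaussian sample; the sample itself and the dictionary name summary are illustrative choices, not part of any library.

```python
from numpy import percentile
from numpy.random import seed, randn

# seed the random number generator
seed(1)
# random numbers drawn from a Gaussian distribution
x = randn(1000)

# quartiles are the 25th, 50th and 75th percentiles
q1, median, q3 = percentile(x, [25, 50, 75])

# collect the five numbers in order
summary = {
    'min': x.min(),
    'q1': q1,
    'median': median,
    'q3': q3,
    'max': x.max(),
}
for name, value in summary.items():
    print(f'{name}: {value:.3f}')
```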

# example of a box and whisker plot
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# three Gaussian samples with standard deviations of 1, 5 and 10
x = [randn(1000), 5 * randn(1000), 10 * randn(1000)]
# create box and whisker plot
pyplot.boxplot(x)
# show box and whisker plot
pyplot.show()


-----Result-----

Example of a box and whisker plot of data


We can see that the same scale is used on the y-axis for all three samples, making the first plot look squashed and the last plot look spread out.

In this case, we can see the black box for the middle 50% of the data, the orange line for the median, the lines for the whiskers summarizing the range of sensible data, and finally dots for the possible outliers.
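
To make the idea of "possible outliers" concrete, the sketch below flags points that fall more than 1.5 times the interquartile range (IQR) beyond the box, which matches Matplotlib's default whisker setting (whis=1.5). The helper name find_outliers and the small data sample are hypothetical, chosen for illustration.

```python
from numpy import array, percentile

def find_outliers(sample, whis=1.5):
    """Return the values lying more than whis * IQR outside the quartiles."""
    q1, q3 = percentile(sample, [25, 75])
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return sample[(sample < lower) | (sample > upper)]

# illustrative data: nine ordinary values and one extreme value
data = array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
outliers = find_outliers(data)
print(outliers)
```

Here only the value 100 lies beyond the upper limit, so it is the single point that would be drawn as a dot.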


6. Scatter Plot

A scatter plot is generally used to summarize the relationship between two paired data samples. 

Paired data samples means that two measures were recorded for a given observation, such as the weight and height of a person. 

The x-axis represents observation values for the first sample, and the y-axis represents the observation values for the second sample.
Each point on the plot represents a single observation.

Scatter plots are useful for showing the association or correlation between two variables. A dataset may have more than two measures (variables or columns) for a given observation. A scatter plot matrix is a chart containing scatter plots for each pair of variables in a dataset with more than two variables.
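
A scatter plot matrix can be sketched with a grid of Matplotlib subplots. This is one possible layout, not a standard implementation: the three variables a, b and c are made up, the histograms on the diagonal are a common convention rather than a requirement, and the "Agg" backend is an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
from matplotlib import pyplot
from numpy.random import seed, randn

# seed the random number generator
seed(1)
# three hypothetical variables; b and c both depend on a
a = 20 * randn(200) + 100
b = a + 10 * randn(200)
c = -a + 30 * randn(200)
data = {'a': a, 'b': b, 'c': c}
names = list(data)

n = len(names)
fig, axes = pyplot.subplots(n, n, figsize=(8, 8))
for i, row_name in enumerate(names):
    for j, col_name in enumerate(names):
        ax = axes[i, j]
        if i == j:
            # diagonal: distribution of a single variable
            ax.hist(data[row_name])
        else:
            # off-diagonal: scatter plot for one pair of variables
            ax.scatter(data[col_name], data[row_name], s=5)
        # label only the outer edge of the grid
        if i == n - 1:
            ax.set_xlabel(col_name)
        if j == 0:
            ax.set_ylabel(row_name)
fig.tight_layout()
# call pyplot.show() here to display the matrix interactively
```

If the data is already in a pandas DataFrame, pandas.plotting.scatter_matrix produces a similar chart in one call.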

# example of a scatter plot
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# first variable
x = 20 * randn(1000) + 100
# second variable
y = x + (10 * randn(1000) + 50)
# create scatter plot
pyplot.scatter(x, y)
# show scatter plot
pyplot.show()


-----Result-----

Example of a scatter plot of data
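
The strength of the association seen in a scatter plot can be summarized with a single number. As a rough sketch, the snippet below rebuilds the same two variables and computes their Pearson correlation coefficient with NumPy's corrcoef function; using Pearson correlation here is an illustrative choice, not something the plot itself requires.

```python
from numpy import corrcoef
from numpy.random import seed, randn

# seed the random number generator
seed(1)
# first variable
x = 20 * randn(1000) + 100
# second variable, equal to the first plus Gaussian noise
y = x + (10 * randn(1000) + 50)

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation of x and y
r = corrcoef(x, y)[0, 1]
print(f'Pearson correlation: {r:.3f}')
```

A value close to 1 indicates a strong positive linear relationship, which agrees with the upward-sloping cloud of points in the scatter plot.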


