03/11/2021

Data Preparation - Part 1 - Load and Explore Time Series Data

The Pandas library in Python provides excellent, built-in support for time series data. Once loaded, Pandas also provides tools to explore and better understand your dataset. In this lesson, you will discover how to load and explore your time series dataset. 

After completing this tutorial, you will know:
  • How to load your time series dataset from a CSV file using Pandas.
  • How to peek at the loaded data and query using date-times.
  • How to calculate and review summary statistics.

A. Daily Female Births Dataset

In this lesson, we will use the Daily Female Births Dataset as an example. This dataset describes the number of daily female births in California in 1959.


B. Load Time Series Data

Pandas represents a time series dataset as a Series. A Series is a one-dimensional array with a time label for each row.

# load dataset using read_csv()
from pandas import read_csv
# squeeze('columns') converts the one-column DataFrame into a Series
# (the squeeze=True argument of read_csv() was removed in pandas 2.0)
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
print(type(series))
print(series.head())

Note the arguments to the read_csv() function and the squeeze() call that follows it. Together they ensure the data is loaded as a Series.
  • header=0: We specify that the column names are on the first row (row 0) of the file.
  • parse_dates=True: We give the function a hint that data in the first column contains dates that need to be parsed.
  • index_col=0: We hint that the first column contains the index information for the time series.
  • squeeze('columns'): We indicate that we only have one data column and that we are interested in a Series and not a DataFrame. Older versions of Pandas accepted squeeze=True as a read_csv() argument for the same purpose.

<class 'pandas.core.series.Series'>
Date
1959-01-01 35
1959-01-02 32
1959-01-03 30
1959-01-04 31
1959-01-05 44
Name: Births, dtype: int64
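
If you want to confirm that the dates really were parsed into a time index, a quick check like the following may help. This is only a sketch: it reads the five rows printed above from an in-memory string instead of the CSV file, and the header names Date and Births are inferred from that printed output rather than taken from the file itself.

```python
from io import StringIO
from pandas import DatetimeIndex, Series, read_csv

# Stand-in for daily-total-female-births.csv (assumption: same layout as the file).
# The values are the five rows shown in the output above.
csv_text = """Date,Births
1959-01-01,35
1959-01-02,32
1959-01-03,30
1959-01-04,31
1959-01-05,44
"""

frame = read_csv(StringIO(csv_text), header=0, index_col=0, parse_dates=True)
series = frame.squeeze('columns')  # one data column -> Series

# The object should be a Series whose index holds parsed dates, not strings
print(isinstance(series, Series))
print(isinstance(series.index, DatetimeIndex))
print(series.index.min(), series.index.max())
```

If the second check prints False, the dates were most likely left as plain strings, and date-based queries on the index will not behave as expected.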


C. Exploring Time Series Data

Pandas also provides tools to explore and summarize your time series data. 

1. Peek at the Data

For example, you can print the first 10 rows of data as follows.

# summarize first few lines of a file
from pandas import read_csv
# squeeze('columns') replaces the squeeze=True argument removed in pandas 2.0
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
print(series.head(10))

Date
1959-01-01 35
1959-01-02 32
1959-01-03 30
1959-01-04 31
1959-01-05 44
1959-01-06 29
1959-01-07 45
1959-01-08 43
1959-01-09 38
1959-01-10 27
Name: Births, dtype: int64


You can also use the tail() function to get the last n records of the dataset.
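
As a rough illustration of tail(), the sketch below uses the ten rows printed above as a stand-in for the full file (an assumption made so the example runs without the CSV on disk; the header names are inferred from the printed output).

```python
from io import StringIO
from pandas import read_csv

# Stand-in data: the first ten rows of the dataset as printed above
csv_text = """Date,Births
1959-01-01,35
1959-01-02,32
1959-01-03,30
1959-01-04,31
1959-01-05,44
1959-01-06,29
1959-01-07,45
1959-01-08,43
1959-01-09,38
1959-01-10,27
"""

series = read_csv(StringIO(csv_text), header=0, index_col=0, parse_dates=True).squeeze('columns')

# tail(n) returns the last n observations, in their original order
last_three = series.tail(3)
print(last_three)
```

With the real file, the same call would return the last observations of December rather than January 8 to 10.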

2. Querying By Time

You can slice, dice, and query your series using the time index. For example, you can access all observations in January as follows:

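# query a dataset using a date-time index
from pandas import read_csv
# squeeze('columns') replaces the squeeze=True argument removed in pandas 2.0
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# a partial date string selects every observation in that month
print(series.loc['1959-01'])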

Date
1959-01-01 35
1959-01-02 32
1959-01-03 30
1959-01-04 31
1959-01-05 44
1959-01-06 29
1959-01-07 45
1959-01-08 43
1959-01-09 38
1959-01-10 27
1959-01-11 38
1959-01-12 33
1959-01-13 55
1959-01-14 47
1959-01-15 45
1959-01-16 37
1959-01-17 50
1959-01-18 43
1959-01-19 41
1959-01-20 52
1959-01-21 34
1959-01-22 53
1959-01-23 39
1959-01-24 32
1959-01-25 37
1959-01-26 43
1959-01-27 39
1959-01-28 35
1959-01-29 44
1959-01-30 38
1959-01-31 24
Name: Births, dtype: int64
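
The same time index also supports date ranges. The following is a minimal sketch, assuming the first ten rows of the January output above stand in for the full file; it selects a window of days by slicing with two date strings.

```python
from io import StringIO
from pandas import read_csv

# Stand-in data: the first ten January rows shown above
csv_text = """Date,Births
1959-01-01,35
1959-01-02,32
1959-01-03,30
1959-01-04,31
1959-01-05,44
1959-01-06,29
1959-01-07,45
1959-01-08,43
1959-01-09,38
1959-01-10,27
"""

series = read_csv(StringIO(csv_text), header=0, index_col=0, parse_dates=True).squeeze('columns')

# Slicing a date-time index with strings includes both endpoints
window = series.loc['1959-01-03':'1959-01-05']
print(window)
```

Unlike ordinary Python slices, label-based slices on the index keep the end date, so this window contains three days.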


3. Descriptive Statistics

The describe() function creates an eight-number summary of the loaded time series: the count, mean, standard deviation, minimum, 25th percentile, median (the 50% row), 75th percentile, and maximum of the observations.

# calculate descriptive statistics
from pandas import read_csv
# squeeze('columns') replaces the squeeze=True argument removed in pandas 2.0
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
print(series.describe())

count 365.000000
mean 41.980822
std 7.348257
min 23.000000
25% 37.000000
50% 42.000000
75% 46.000000
max 73.000000
Name: Births, dtype: float64
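
The summary returned by describe() is itself a Series, so individual statistics can be looked up by label or computed directly. The sketch below assumes the first ten rows of the dataset (as printed earlier in this lesson) stand in for the full file, so its numbers will differ from the full-year summary above.

```python
from io import StringIO
from pandas import read_csv

# Stand-in data: the first ten rows of the dataset, not the full year
csv_text = """Date,Births
1959-01-01,35
1959-01-02,32
1959-01-03,30
1959-01-04,31
1959-01-05,44
1959-01-06,29
1959-01-07,45
1959-01-08,43
1959-01-09,38
1959-01-10,27
"""

series = read_csv(StringIO(csv_text), header=0, index_col=0, parse_dates=True).squeeze('columns')

summary = series.describe()
print(summary['50%'])               # the 50% row of the summary is the median
print(series.median())              # same value, computed directly
print(series.mean())
print(series.max() - series.min())  # range of the observations
```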

