Machine learning: Data Preparation - Part 1 - Load and Explore Time Series Data

The Pandas library in Python provides excellent, built-in support for time series data. Once your data is loaded, Pandas also provides tools to explore and better understand it. In this lesson, you will discover how to load and explore your time series dataset.

After completing this lesson, you will know:

How to load your time series dataset from a CSV file using Pandas.
How to peek at the loaded data and query using date-times.
How to calculate and review summary statistics.

A. Daily Female Births Dataset

In this lesson, we will use the Daily Female Births Dataset as an example. This dataset describes the number of daily female births in California in 1959.

B. Load Time Series Data

Pandas represents a univariate time series as a Series. A Series is a one-dimensional array with a label for each row; for a time series, that label is a date-time.
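To make the idea of a labelled one-dimensional array concrete, here is a minimal sketch that builds a tiny Series by hand. The name `sample` and the three-row excerpt are assumptions made for illustration; the values are the first three observations of the births data.

```python
import pandas as pd

# Hypothetical three-day excerpt, built by hand rather than loaded from the CSV file.
index = pd.to_datetime(['1959-01-01', '1959-01-02', '1959-01-03'])
sample = pd.Series([35, 32, 30], index=index, name='Births')

print(type(sample.index))  # the row labels are held in a DatetimeIndex
print(sample.index[0])     # each label is a timestamp
print(sample.values)       # the observations are a one-dimensional array
```

Loading the data from a file should give the same kind of object, only with one row per day of the year.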

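# load dataset using read_csv()
from pandas import read_csv
# note: the squeeze=True argument to read_csv() was removed in pandas 2.0;
# calling squeeze('columns') on the loaded DataFrame has the same effect
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
print(type(series))
print(series.head())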

Note the arguments to the read_csv() function. We provide it a number of hints to ensure the data is loaded as a Series.

header=0: We specify that the column names are on the first line of the file (row 0).
parse_dates=True: We give the function a hint that the index column contains dates that need to be parsed.
index_col=0: We hint that the first column contains the index information for the time series.
squeeze=True: We hint that we only have one data column and that we are interested in a Series and not a DataFrame. This argument was deprecated in pandas 1.4 and removed in pandas 2.0; on those versions, call squeeze('columns') on the result of read_csv() instead.

Date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
Name: Births, dtype: int64
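If you do not have the CSV file to hand, the loading step can be tried on an in-memory copy of the first few rows. This is only a sketch: the `csv_text` string and the names `frame` and `series` are assumptions, and the file layout (a quoted `Date` column followed by `Births`) is inferred from the output above rather than checked against the real file.

```python
from io import StringIO
from pandas import read_csv

# Assumed layout of the file: a header row, then one date and one count per line.
csv_text = '''"Date","Births"
"1959-01-01",35
"1959-01-02",32
"1959-01-03",30
'''

# Without squeezing, read_csv() returns a DataFrame with a single column.
frame = read_csv(StringIO(csv_text), header=0, index_col=0, parse_dates=True)
# squeeze('columns') collapses that single column into a Series.
series = frame.squeeze('columns')

print(type(frame).__name__)
print(type(series).__name__)
print(series.index.dtype)  # a datetime dtype confirms the dates were parsed
```

Comparing the two type names is a quick way to confirm that the squeeze step did what you expected before moving on.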

C. Exploring Time Series Data

Pandas also provides tools to explore and summarize your time series data.

1. Peek at the Data

For example, you can print the first 10 rows of data as follows.

# summarize first few lines of a file
from pandas import read_csv
# note: squeeze('columns') replaces the squeeze=True argument removed in pandas 2.0
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
print(series.head(10))

Date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
1959-01-06    29
1959-01-07    45
1959-01-08    43
1959-01-09    38
1959-01-10    27
Name: Births, dtype: int64

You can also use the tail() function to get the last n records of the dataset.
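As a small illustration of tail(), the sketch below rebuilds the ten rows printed above as a stand-in Series (the names `sample` and `last_three` are assumptions) and takes the last few records.

```python
import pandas as pd

# Stand-in data: the first ten observations of the births series.
index = pd.date_range('1959-01-01', periods=10, freq='D')
sample = pd.Series([35, 32, 30, 31, 44, 29, 45, 43, 38, 27], index=index, name='Births')

last_three = sample.tail(3)  # the last 3 records
print(last_three)
print(sample.tail())         # with no argument, tail() returns the last 5 records
```

On the full dataset the same call would show the final days of December 1959.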

2. Querying By Time

You can slice, dice, and query your series using the time index. For example, you can access all observations in January as follows:

# query a dataset using a date-time index
from pandas import read_csv
# note: squeeze('columns') replaces the squeeze=True argument removed in pandas 2.0
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
print(series['1959-01'])

Date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
1959-01-06    29
1959-01-07    45
1959-01-08    43
1959-01-09    38
1959-01-10    27
1959-01-11    38
1959-01-12    33
1959-01-13    55
1959-01-14    47
1959-01-15    45
1959-01-16    37
1959-01-17    50
1959-01-18    43
1959-01-19    41
1959-01-20    52
1959-01-21    34
1959-01-22    53
1959-01-23    39
1959-01-24    32
1959-01-25    37
1959-01-26    43
1959-01-27    39
1959-01-28    35
1959-01-29    44
1959-01-30    38
1959-01-31    24
Name: Births, dtype: int64
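Querying by month is one option; the time index also supports date ranges and single days. The following sketch uses a hypothetical ten-row stand-in (`sample`) rather than the full file, and shows a range slice with .loc and a lookup of one day.

```python
import pandas as pd

# Stand-in data: the first ten observations of the births series.
index = pd.date_range('1959-01-01', periods=10, freq='D')
sample = pd.Series([35, 32, 30, 31, 44, 29, 45, 43, 38, 27], index=index, name='Births')

# Label-based slices on a date-time index include both end points.
three_days = sample.loc['1959-01-03':'1959-01-05']
# A Timestamp label returns the single observation for that day.
one_day = sample.loc[pd.Timestamp('1959-01-07')]

print(three_days)
print(one_day)
```

Note that, unlike ordinary Python slices, the end date of the range is part of the result.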

3. Descriptive Statistics

The describe() function creates an 8-number summary of the loaded time series: the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum of the observations.

# calculate descriptive statistics
from pandas import read_csv
# note: squeeze('columns') replaces the squeeze=True argument removed in pandas 2.0
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
print(series.describe())

count    365.000000
mean      41.980822
std        7.348257
min       23.000000
25%       37.000000
50%       42.000000
75%       46.000000
max       73.000000
Name: Births, dtype: float64
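The result of describe() is itself a Series, so individual statistics can be read by label or computed directly. The sketch below uses a hypothetical ten-row stand-in (`sample`), so its numbers describe only those ten days and will differ from the full-year summary above.

```python
import pandas as pd

# Stand-in data: the first ten observations of the births series.
index = pd.date_range('1959-01-01', periods=10, freq='D')
sample = pd.Series([35, 32, 30, 31, 44, 29, 45, 43, 38, 27], index=index, name='Births')

summary = sample.describe()

# Each entry of the summary matches the corresponding Series method.
print(summary['mean'], sample.mean())
print(summary['50%'], sample.median())
print(sample.quantile([0.25, 0.75]))  # the 25th and 75th percentiles
```

Reading a value such as summary['mean'] is convenient when a later step in your script needs the number rather than the printed report.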


03/11/2021
