It is critical that any data preparation performed on a training dataset is also performed on a new dataset in the future. This may include a test dataset when evaluating a model or new data from the domain when using a model to make predictions.
Typically, the model fit on the training dataset is saved for later use. The correct approach to preparing new data for the model in the future is to also save any data preparation objects, such as data scaling methods, to file along with the model.
In this tutorial, you will discover how to save a model and data preparation object to file for later use. After completing this tutorial, you will know:
- The challenge of correctly preparing test data and new data for a machine learning model.
- The solution of saving the model and data preparation objects to file for later use.
- How to save and later load and use a machine learning model and data preparation model on new data.
This tutorial is divided into three parts; they are:
- Challenge of Preparing New Data for a Model
- Save Data Preparation Objects
- Worked Example of Saving Data Preparation
A. Challenge of Preparing New Data for a Model
The best practice when using scaling techniques is to fit the scaler on the training dataset, then apply it to both the training and test datasets. Or, when working with a final model, to fit the scaler on the training dataset and apply the transform to the training dataset and to any new data in the future.
It is critical that any data preparation or transformation applied to the training dataset is also applied to the test dataset or to any other dataset in the future. This is straightforward when all of the data and the model are in memory, but becomes challenging when the model is saved to file and used later.
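As a minimal sketch of the failure mode (using toy numbers, not the tutorial's dataset), consider what happens if a scaler is naively re-fit on new data instead of reusing the scaler fit on the training data: the same raw value maps to a different scaled value, so the model receives inputs on a different scale than it was trained on.
# sketch of the problem: re-fitting a scaler on new data gives a
# different mapping than reusing the scaler fit on the training data
from numpy import array
from sklearn.preprocessing import MinMaxScaler
train = array([[0.0], [10.0]])
new = array([[2.0], [6.0]])
# the scaler fit on the training data maps 2.0 to 0.2
scaler = MinMaxScaler().fit(train)
print(scaler.transform(new))
# a scaler naively re-fit on the new data maps 2.0 to 0.0 instead
wrong_scaler = MinMaxScaler().fit(new)
print(wrong_scaler.transform(new))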
B. Save Data Preparation Objects
The solution is to save the data preparation object to file along with the model. For example, it is common to use the pickle framework (built into Python) to save machine learning models for later use, such as saving a final model.
Later, the model and the data preparation object can be loaded and used.
It is convenient to save entire objects to file, such as the model object and the data preparation object.
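An alternative worth noting, not used in the worked example below, is to combine the data preparation object and the model into a single scikit-learn Pipeline, so that only one object needs to be saved and loaded. A minimal sketch, assuming pipeline.pkl as the filename:
# sketch of an alternative: bundle the scaler and model in a Pipeline
# and save a single combined object
from pickle import dump, load
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# the pipeline fits the scaler and the model together on the training data only
pipeline = Pipeline([('scaler', MinMaxScaler()), ('model', LogisticRegression(solver='lbfgs'))])
pipeline.fit(X_train, y_train)
# save the single combined object
dump(pipeline, open('pipeline.pkl', 'wb'))
# later: load it and predict; the scaling is applied automatically
loaded = load(open('pipeline.pkl', 'rb'))
yhat = loaded.predict(X_test)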
C. Worked Example of Saving Data Preparation
In this section, we will demonstrate preparing a dataset, fitting a model on the dataset, saving the model and data transform object to file, and later loading the model and transform and using them on new data.
1. Define a Dataset
The example below creates a test dataset with 100 examples, two input features, and two class labels (0 and 1). The dataset is then split into training and test sets and the min and max values of each variable are then reported.
Importantly, the random state is set when creating the dataset and when splitting the data so that the same dataset is created and the same split of data is performed each time that the code is run.
# example of creating a test dataset and splitting it into train and test sets
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# prepare dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the scale of each input variable
for i in range(X_test.shape[1]):
    print('>%d, train: min=%.3f, max=%.3f, test: min=%.3f, max=%.3f' % (i, X_train[:, i].min(), X_train[:, i].max(), X_test[:, i].min(), X_test[:, i].max()))
-----Result-----
>0, train: min=-11.856, max=0.526, test: min=-11.270, max=0.085
>1, train: min=-6.388, max=6.507, test: min=-5.581, max=5.926
Running the example reports the min and max values for each variable in both the train and test datasets.
2. Scale the Dataset
We will use the MinMaxScaler to scale each input variable to the range 0-1.
# example of scaling the dataset
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
# prepare dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define scaler
scaler = MinMaxScaler()
# fit scaler on the training dataset
scaler.fit(X_train)
# transform both datasets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# summarize the scale of each input variable
for i in range(X_test.shape[1]):
    print('>%d, train: min=%.3f, max=%.3f, test: min=%.3f, max=%.3f' % (i, X_train_scaled[:, i].min(), X_train_scaled[:, i].max(), X_test_scaled[:, i].min(), X_test_scaled[:, i].max()))
-----Result-----
>0, train: min=0.000, max=1.000, test: min=0.047, max=0.964
>1, train: min=0.000, max=1.000, test: min=0.063, max=0.955
Running the example shows that, as expected, the training variables are scaled exactly to the range 0-1, while the test values fall close to, but not exactly within, that range, because the scaler was fit on the training dataset only.
3. Save Model and Data Scaler
Next, we can fit a model on the training dataset and save both the model and the scaler object to file.
# example of fitting a model on the scaled dataset
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from pickle import dump
# prepare dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# split data into train and test sets
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.33, random_state=1)
# define scaler
scaler = MinMaxScaler()
# fit scaler on the training dataset
scaler.fit(X_train)
# transform the training dataset
X_train_scaled = scaler.transform(X_train)
# define model
model = LogisticRegression(solver='lbfgs')
model.fit(X_train_scaled, y_train)
# save the model
dump(model, open('model.pkl', 'wb'))
# save the scaler
dump(scaler, open('scaler.pkl', 'wb'))
You should have two files in your current working directory:
- The model object: model.pkl
- The scaler object: scaler.pkl
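Note that pickle is only one option: the joblib library, often recommended for persisting scikit-learn models that hold large NumPy arrays, follows the same save-both-objects pattern. A minimal sketch, assuming the model and scaler fitted in the code above and hypothetical model.joblib and scaler.joblib filenames:
# sketch: joblib as a drop-in alternative to pickle for the same pattern
from joblib import dump, load
dump(model, 'model.joblib')
dump(scaler, 'scaler.joblib')
# later, load both objects before preparing and scoring new data
model = load('model.joblib')
scaler = load('scaler.joblib')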
4. Load Model and Data Scaler
In this case, we will assume that the training dataset is not available, and that only new data or the test dataset is available.
We will load the model and the scaler, then use the scaler to prepare the new data and use the model to make predictions.
# load model and scaler and make predictions on new data
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from pickle import load
# prepare dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# split data into train and test sets
_, X_test, _, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# load the model
model = load(open('model.pkl', 'rb'))
# load the scaler
scaler = load(open('scaler.pkl', 'rb'))
# check scale of the test set before scaling
print('Raw test set range')
for i in range(X_test.shape[1]):
    print('>%d, min=%.3f, max=%.3f' % (i, X_test[:, i].min(), X_test[:, i].max()))
# transform the test dataset
X_test_scaled = scaler.transform(X_test)
print('Scaled test set range')
for i in range(X_test_scaled.shape[1]):
    print('>%d, min=%.3f, max=%.3f' % (i, X_test_scaled[:, i].min(), X_test_scaled[:, i].max()))
# make predictions on the test set
yhat = model.predict(X_test_scaled)
# evaluate accuracy
acc = accuracy_score(y_test, yhat)
print('Test Accuracy:', acc)
-----Result-----
Raw test set range
>0, min=-11.270, max=0.085
>1, min=-5.581, max=5.926
Scaled test set range
>0, min=0.047, max=0.964
>1, min=0.063, max=0.955
Test Accuracy: 1.0
Running the example loads the model and scaler, uses the scaler to correctly prepare the test set, and makes predictions that achieve 100 percent accuracy on this simple test problem.
This provides a template that you can use to save both your model and scaler object (or objects) to file on your own projects.