The Python programming language is often associated with data science and its applications. This is in part due to Scikit-learn, one of Python's excellent machine learning (ML) packages. Scikit-learn is an incredibly well-documented library with a wealth of resources, and as such learning its fundamentals can get you surprisingly far.
This series of blog posts is designed to take you beyond understanding the core basics of machine learning in Python to the stage where you are able to build your own ML pipeline. You will then be able to start addressing various ML problems and improve your pipeline as you gain more experience, streamlining your code base more and more.
Though creating such a pipeline seems like an imposing task, I'm hoping this guide will teach you how and leave you with an appetite to develop your abilities further. It is important to note here that the pipeline you will learn to create in this blog series focuses on the modelling and feature selection parts of ML. That is, we shall not cover how to clean data and get a flat data structure ready to feed your ML algorithms.
This quickstart guide to machine learning projects will be divided as follows:
Let's get started and learn how to model data sets using the Scikit-learn package!
The Scikit-learn package strongly adheres to the principles of the Zen of Python. This is encapsulated in the consistent and simple interfaces of its estimator API and other classes. In fact, we can make use of almost any Scikit-learn estimator by following just four main steps:
Throughout this blog series we shall work through a basic example using the breast cancer data set readily available in the Scikit-learn package. The main motivation for using this data set is that it requires no cleaning. This allows us to delve straight into the modelling aspects.
First let's create a module (simply a .py file) called clean.py in which we can import our dataset. Note that any precursory cleaning and integration steps you might need for a different dataset should be included in this module as well.
We can load our data into memory by importing it from the Scikit-learn datasets module as follows:
# clean.py
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer()
features = breast_cancer.data
response = breast_cancer.target
Note that the load_breast_cancer function creates a sklearn.utils.Bunch object that has the feature and response variables of the dataset located in its data and target attributes respectively.
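To get a feel for what the Bunch object holds, a quick inspection along the following lines can help. This is only a sketch: the printed attributes (feature_names, target_names) are those exposed by this bundled dataset, and the choice of what to print is purely illustrative.

```python
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer()
features = breast_cancer.data      # 2-D array: one row per sample
response = breast_cancer.target    # 1-D array: one label per sample

# Basic sanity checks before any modelling
print(type(breast_cancer))
print(features.shape)                    # (n_samples, n_features)
print(response.shape)                    # (n_samples,)
print(breast_cancer.feature_names[:3])   # first few feature names
print(breast_cancer.target_names)        # the class labels
```

Checking the shapes at this point is a cheap way to confirm that the features and the response line up row for row.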
Before we can start to model our data, we need to split it into our training and test datasets (we shall not discuss the use of an additional third set called the validation set here). A fundamental concept of supervised machine learning is the use of a (larger) labelled training set to teach our model how to recognise patterns in the data and a (smaller) labelled test set to determine whether the model is doing a good job of recognising these patterns.
Scikit-learn has a function available in the model_selection module called train_test_split to help us with this exact task. We simply feed in our feature and response arrays, specify our test size (below we use 70% of our data to train on, 30% to test on), and choose a random number seed (for consistency between runs).
# clean.py
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(
    features, response, test_size=0.3, random_state=5)
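As a quick check that the split behaves as described, the sketch below repeats the split in a self-contained form and prints the resulting sizes. The variable n_total is introduced here purely for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

breast_cancer = datasets.load_breast_cancer()
features = breast_cancer.data
response = breast_cancer.target

train_x, test_x, train_y, test_y = train_test_split(
    features, response, test_size=0.3, random_state=5)

n_total = len(response)  # illustrative helper, not part of clean.py

# Every row ends up in exactly one of the two sets
print(train_x.shape, test_x.shape)
print(len(train_y) + len(test_y) == n_total)

# The test fraction should be close to the requested 0.3
print(round(len(test_y) / n_total, 3))
```

Because random_state is fixed, rerunning this should give the same partition each time, which is what makes results comparable between runs.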
Next, let's look at creating our first model to predict breast cancer using a stochastic gradient descent classifier (though we could easily consider many other classification models such as the random forest classifier). We first need to import the right estimator class from Scikit-learn. A quick Google search for "scikit-learn stochastic gradient descent" leads us to the relevant documentation page. The main header contains exactly what we need.
Let's create a new module, model.py, and import our estimator class.
# model.py
from sklearn.linear_model import SGDClassifier
Having imported the estimator class, we can go about using it with the following four steps: instantiate, fit, test and predict.
Once we have imported our estimator class, we need to instantiate it (create an instance of the class). However, it will not yet be able to make predictions as it has not yet learned from the data. This is the difference between an estimator instance and an actual model. This concept can be strange to grasp, particularly if this is your first ML project or tutorial.
So let's try and clear up the relationship between instances of estimators and models with the following analogy:
To summarise, an instance of an estimator class can be considered the untrained potential of that class, while an actual model is a fitted (and therefore parametrised) instance whose potential has been fulfilled.
And remember, estimators such as our SGDClassifier are simply Python classes with bound data and methods - nothing too fancy to worry about.
Moving on to the actual code, having imported the model class we can then create an instance of this class.
# model.py
sgdc = SGDClassifier()
Note that here we could specify arguments (representing hyper-parameters) in the function call, but for simplicity we shall use the defaults.
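For comparison, here is a rough sketch of what passing hyper-parameters at instantiation might look like. The particular values chosen (alpha, max_iter, random_state) are arbitrary illustrations, not recommendations, and the names default_sgdc and custom_sgdc are made up for this example.

```python
from sklearn.linear_model import SGDClassifier

# Default hyper-parameters, as used in this post
default_sgdc = SGDClassifier()

# Hypothetical alternative: stronger regularisation, more iterations,
# and a fixed seed so repeated fits give the same result
custom_sgdc = SGDClassifier(alpha=0.001, max_iter=2000, random_state=5)

# get_params returns the hyper-parameters an instance was created with
print(default_sgdc.get_params()["alpha"])
print(custom_sgdc.get_params()["alpha"])
```

Neither instance has seen any data yet; both are still just configured estimators waiting to be fitted.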
We now have an instance of our estimator. To use this as a means of prediction we need to train it on our training data. This step is known as fitting.
For estimator classes in Scikit-learn, the fit method takes (at least) two parameters, denoted X and y:
X is the training set involving all the data except for the response variable (usually a Pandas DataFrame or a Numpy array).
y is the response variable contained in the training set (usually a Pandas Series or a Numpy array).
This is the part of the code that often takes the longest to run, as this is where the underlying ML algorithm of the estimator derives the model parameters (via some method such as minimising a cost function using an optimization algorithm like gradient descent). Note that in our simple use case there is very little data so this step is actually pretty quick.
In our code base, the data that we will fit our classifier on is imported (and cleaned) in the clean.py module, so we need to make it available in the model.py module. Since we have defined the training set as train_x and train_y, we can import the module and fit our estimator using:
# model.py
import clean

sgdc.fit(X=clean.train_x, y=clean.train_y)
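The difference between an unfitted estimator instance and a fitted model can be seen directly in code. The following self-contained sketch (which reloads the data rather than importing clean.py, and fixes random_state as an added assumption for repeatability) shows that predicting before fitting raises an error, while fitting populates learned attributes such as coef_.

```python
from sklearn import datasets
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

breast_cancer = datasets.load_breast_cancer()
train_x, test_x, train_y, test_y = train_test_split(
    breast_cancer.data, breast_cancer.target, test_size=0.3, random_state=5)

sgdc = SGDClassifier(random_state=5)  # seed added here for repeatability

# An unfitted estimator has no parameters, so it cannot predict yet
try:
    sgdc.predict(test_x)
    raised_before_fit = False
except NotFittedError:
    raised_before_fit = True

# Fitting derives the model parameters from the training data
sgdc.fit(X=train_x, y=train_y)

print(raised_before_fit)
print(hasattr(sgdc, "coef_"))   # learned weights now exist
print(sgdc.coef_.shape)         # one weight per input feature
```

In Scikit-learn, attributes ending in an underscore (like coef_) are set only during fitting, which makes them a handy marker of a trained model.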
Now that our model has been fitted, we can test its performance with the score method. Each estimator has a default scoring metric - in this case it determines our model's accuracy.
As with the fit method from the previous section, the score method always takes (at least) two values - X and y. This time we'll be using our test_x and test_y data sets instead.
# model.py
accuracy = sgdc.score(X=clean.test_x, y=clean.test_y)
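To make the default metric less of a black box, the sketch below checks that, for this classifier, score gives the same number as computing accuracy by hand from the predictions. It is self-contained (reloading the data instead of importing clean.py) and fixes random_state on the classifier as an added assumption, so the exact score may differ from the figure quoted in this post.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

breast_cancer = datasets.load_breast_cancer()
train_x, test_x, train_y, test_y = train_test_split(
    breast_cancer.data, breast_cancer.target, test_size=0.3, random_state=5)

sgdc = SGDClassifier(random_state=5)  # seed added here for repeatability
sgdc.fit(X=train_x, y=train_y)

# Default scoring metric of the classifier
accuracy = sgdc.score(X=test_x, y=test_y)

# The same quantity computed explicitly: fraction of correct predictions
manual_accuracy = accuracy_score(test_y, sgdc.predict(test_x))

print(accuracy, manual_accuracy)
```

Other metrics from sklearn.metrics can be swapped in for accuracy_score in the same way when accuracy alone is not informative enough.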
We can then use this accuracy metric to give us some indication as to whether we have a good model or not. If we judge that the accuracy score is not good enough, then we can change our model. This could be done by selecting different input features, cleaning and aggregating the features in a different way, instantiating our estimator with different hyper-parameters, or otherwise. We'll see a bit more on how to go about selecting our features in the second part of this blog post series.
Using our example dataset, we see that our accuracy is 92%! This is exceptionally high for a first run - almost as if this data set was explicitly designed for this exercise... suspicious.
Now that our model has been fitted and its performance tested, we can use it to make predictions. Here we would normally input new data that we wanted to get predicted values for (separate from the training and testing sets). However, as our example data lacks this information, let's reuse our test set just to demonstrate the predict method's functionality.
# model.py
predictions = sgdc.predict(clean.test_x)
The output of the predict method is an array containing a prediction for each row of the input data set. We have now successfully obtained predictions using our model!
Note that during the testing stage, in addition to using the score method, we could have used the predict method for further testing and analysis. For example, we could use the output values to compute a confusion matrix. This can be useful in determining how our model fares when predicting specific outcomes, such as the more common response values in a multinomial classification task. If our model only predicts the most common outcomes correctly, while completely failing on the less common ones, we should ask ourselves if this is really an adequate model to achieve our aims. In that case the model could simply be picking the modal value for every prediction, which is hardly insightful. In the third part of this blog series we shall consider comparing our model against benchmarks using dummy estimators to automatically check for these sorts of issues.
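The confusion matrix mentioned above can be sketched as follows, using confusion_matrix from sklearn.metrics. As before, this is a self-contained illustration that reloads the data and fixes random_state on the classifier (an added assumption), so the counts will depend on that seed.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

breast_cancer = datasets.load_breast_cancer()
train_x, test_x, train_y, test_y = train_test_split(
    breast_cancer.data, breast_cancer.target, test_size=0.3, random_state=5)

sgdc = SGDClassifier(random_state=5)  # seed added here for repeatability
sgdc.fit(X=train_x, y=train_y)
predictions = sgdc.predict(test_x)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(test_y, predictions)
print(breast_cancer.target_names)
print(cm)

# The diagonal holds the correct predictions for each class
print(cm.trace() / cm.sum())
```

A model that only ever predicts the modal class would show up here as one column holding all the counts, which accuracy alone would not reveal.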
Hopefully you are now a bit more comfortable using the Scikit-learn library to fit basic models to your data. Though we have only demonstrated this process with a single estimator class, by following the four steps above you will be able to use almost any estimator from the Scikit-learn library. If you have the time, see this for yourself with the following exercise.
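One way to see how uniform the estimator interface is would be to wrap the steps in a small helper and pass different estimators through it. The sketch below is only an illustration: fit_and_score is a hypothetical helper name, and the two classifiers used (a decision tree and k-nearest neighbours) are arbitrary choices that leave the random forest exercise for you.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

breast_cancer = datasets.load_breast_cancer()
train_x, test_x, train_y, test_y = train_test_split(
    breast_cancer.data, breast_cancer.target, test_size=0.3, random_state=5)


def fit_and_score(estimator):
    """Hypothetical helper: fit an already instantiated estimator, then test it."""
    estimator.fit(X=train_x, y=train_y)
    return estimator.score(X=test_x, y=test_y)


# Instantiate a couple of different estimators with the same interface
estimators = {
    "decision tree": DecisionTreeClassifier(random_state=5),
    "k-nearest neighbours": KNeighborsClassifier(),
}

scores = {name: fit_and_score(est) for name, est in estimators.items()}
for name, score in scores.items():
    print(name, round(score, 3))
```

The helper never needs to know which class it was given, which is exactly the consistency the estimator API is designed to provide.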
It is also worth noting that it is not just the predictive models themselves which follow the steps of instantiate, fit, test and predict, but also almost all of the algorithms in the Scikit-learn library (though there are slight variations on the scoring and prediction steps). We shall see this in practice when we utilise the Recursive Feature Elimination algorithm in the following blog post during the feature selection stage. See you then!
Repeat the exact same steps as above using a random forest classifier. Do you get a better result?
Hint: Remember to import the random forest model's estimator class from its associated module.