Written by Lancelot Rossert, Lewis Fogden

Thu 26 April 2018, in category Data science

The Python programming language is often associated with Data Science and its applications. This is in part due to Scikit-learn, one of Python's excellent machine learning (ML) packages. Scikit-learn is an incredibly well documented library with a wealth of resources, and as such learning its fundamentals can get you surprisingly far.

This series of blog posts is designed to take you beyond the core basics of machine learning in Python to the stage where you are able to build your own ML pipeline. You will then be able to start addressing various ML problems and improve your pipeline as you gain more experience, streamlining your code base more and more.

Though creating such a pipeline may seem like an imposing task, I'm hoping this guide will teach you how and leave you with an appetite to develop your abilities further. It is important to note that the pipeline you will learn to create in this blog series focuses on the modelling and feature selection parts of ML. That is, we shall not go into how to clean data and get a flat data structure ready to feed your ML algorithms.

This quickstart guide to machine learning projects will be divided as follows:

- In this blog post we shall look at the fundamentals of the Scikit-learn library and how to use this (readers who are familiar with this library may choose to dive straight into the next post).
- Next we shall look at automating feature selection using an algorithm called Recursive Feature Elimination.
- And finally we shall look at how to test the results of our model to determine if we have found a good model or whether more work needs to be done to improve it.

Let's get started and learn how to model data sets using the Scikit-learn package!

The Scikit-learn package strongly adheres to the principles of the Zen of Python. This is encapsulated in the consistent and simple interfaces of its estimator API and other classes. In fact, we can make use of almost any Scikit-learn estimator by following just four main steps: `instantiate`, `fit`, `score` and `predict`.

Throughout this blog series we shall work through a basic example using the breast cancer data set readily available in the Scikit-learn package. The main motivation for using this data set is that it requires no cleaning. This allows us to delve straight into the modelling aspects.

First let's create a module (simply a `.py` file) called `clean.py` in which we can import our dataset. Note that any precursory cleaning and integration steps you might need for a different dataset should be included in this module as well.

We can load our data into memory by importing it from the Scikit-learn `datasets` module as follows:

```
# clean.py
from sklearn import datasets
breast_cancer = datasets.load_breast_cancer()
features = breast_cancer.data
response = breast_cancer.target
```

Note that the `load_breast_cancer` method creates a `sklearn.utils.Bunch` object that has the feature and response variables of the dataset located in its `data` and `target` attributes respectively.
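If you want to see this for yourself, a quick inspection of the `Bunch` object shows the attributes we are using (the shapes below are those of this particular dataset):

```python
# Inspect the Bunch object returned by load_breast_cancer
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer()

print(type(breast_cancer).__name__)   # Bunch
print(breast_cancer.data.shape)       # (569, 30) - 569 samples, 30 features
print(breast_cancer.target.shape)     # (569,) - one response per sample
print(breast_cancer.target_names)     # ['malignant' 'benign']
```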

Before we can start to model our data, we need to split it into our training and test datasets (we shall not discuss the use of an additional third set called the validation set here). A fundamental concept of supervised machine learning is the use of a (larger) labelled training set to teach our model how to recognise patterns in the data and a (smaller) labelled test set to determine whether the model is doing a good job of recognising these patterns.

Scikit-learn has a function available in the `model_selection` module called `train_test_split` to help us with this exact task. We simply feed in our features and response, specify our test size (below we use 70% of our data to train on, 30% to test on), and choose a random number seed (for consistency between runs).

```
# clean.py
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(
    features, response, test_size=0.3, random_state=5)
```

Next, let's look at creating our first model to predict breast cancer using a stochastic gradient descent classifier (though we could easily consider many other classification models such as the random forest classifier). We first need to import the right estimator class from Scikit-learn. A quick Google search for *scikit-learn stochastic gradient descent* leads us to the documentation page, whose main header contains exactly what we need.

Let's create a new module, `model.py`, and import our estimator class.

```
# model.py
from sklearn.linear_model import SGDClassifier
```

Having imported the estimator class we can go about using it with the following four steps:

- **Instantiate**: Create an instance of the estimator class
- **Fit**: The instance of the estimator learns from the data, creating a parametrised model
- **Score**: Test whether the model achieves the required accuracy
- **Predict**: Use the model to make predictions on new data
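As a preview, here is what those four steps look like end to end - a condensed sketch of the code we shall build up over the rest of this post:

```python
# The four steps in one place: instantiate, fit, score, predict
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

data = load_breast_cancer()
train_x, test_x, train_y, test_y = train_test_split(
    data.data, data.target, test_size=0.3, random_state=5)

sgdc = SGDClassifier()                       # 1. instantiate
sgdc.fit(X=train_x, y=train_y)               # 2. fit on the training set
accuracy = sgdc.score(X=test_x, y=test_y)    # 3. score on the test set
predictions = sgdc.predict(test_x)           # 4. predict on (new) data
```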

Once we have imported our estimator class, we need to instantiate it (create an instance of the class). However, it will not yet be able to make predictions as it has not yet learned from the data. This is the difference between an estimator instance and an actual model. This concept can be strange to grasp, particularly if this is your first ML project or tutorial.

So let's try and clear up the relationship between instances of estimators and models with the following analogy:

- You are reading this blog to learn how to model using Scikit-learn. That is your goal. Comparatively the model's goal is to be able to predict responses, given some input data.
- You begin as a keen learner who does not yet know how to use the Scikit-learn library; just as the model starts off as an instance of a class that is ready to learn how to make predictions but does not yet know how.
- You then read the information in this blog and learn how to use the Scikit-learn library. Similarly the model is fitted with training data and learns how to predict responses.
- At this stage you are no longer the keen learner any more but someone who has learnt. The instance of the estimator is now parametrised and capable of making predictions - it is an actual model.
- Now that you have learned how to use the Scikit-learn library, your knowledge can be tested. You can be asked questions and someone can determine whether your knowledge is adequate. Equivalently we can test the model's predictive power on the test set and determine whether it is an adequate model or not.
- If you have learned how to use the Scikit-learn package well you can use what you've learned to achieve your goal. Just as the model can now predict responses on new input data.
- However if you do not brush up on any changes to the Scikit-learn library you will only retain your current knowledge. Moreover this might become dated and you may not be able to use the Scikit-learn package as well as you once did. Similarly if new data comes in but the model does not learn from the new data (re-train) then it will stay in its current form. This means that predictions on new data in the future might not be as accurate due to feature drift.

To summarise, an instance of an estimator class can be considered the untrained potential of that class, while an actual model is a fitted (and therefore parametrised) instance whose potential has been fulfilled.

And remember, estimators such as our `SGDClassifier` are simply Python classes with bound data and methods - nothing too fancy to worry about.

Moving on to the actual code, having imported the model class we can then create an instance of this class.

```
# model.py
sgdc = SGDClassifier()
```

Note that here we could specify arguments (representing hyper-parameters) in the function call, but for simplicity we shall use the defaults.
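For illustration, instantiating with explicit hyper-parameters might look like the following (the values here are arbitrary examples for demonstration, not tuned recommendations):

```python
# Instantiating with explicit hyper-parameters instead of the defaults
from sklearn.linear_model import SGDClassifier

sgdc_tuned = SGDClassifier(
    loss="hinge",    # the loss function to optimise (hinge gives a linear SVM)
    alpha=0.0001,    # regularisation strength
    max_iter=1000,   # maximum number of passes over the training data
    random_state=5,  # seed for reproducible shuffling between runs
)
```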

We now have an instance of our estimator. To use this as a means of prediction we need to train it on our training data. This step is known as fitting.

For estimator classes in Scikit-learn, the `fit` method takes (at least) two parameters - denoted `X` and `y`. `X` is the training set involving all the data except for the response variable (usually a Pandas DataFrame or NumPy array). `y` is the response variable contained in the training set (usually a Pandas Series or a NumPy array).

This is the part of the code that often takes the longest to run, as this is where the underlying ML algorithm of the estimator derives the model parameters (via some method such as minimising a cost function using an optimization algorithm like gradient descent). Note that in our simple use case there is very little data so this step is actually pretty quick.

As in our code base the data that we will fit our classifier on has been imported (and cleaned) in the `clean.py` module, we need to make it available in the `model.py` module. Since we have defined the training set as `train_x` and `train_y`, we can import it using:

```
# model.py
import clean
sgdc.fit(X=clean.train_x, y=clean.train_y)
```

Now that our model has been fitted we can test its performance with the `score` method. Each estimator has a default scoring metric - in this case it determines our model's accuracy.

As with the `fit` method from the previous section, the `score` method always takes (at least) two values - `X` and `y`. This time we'll be using our `test_x` and `test_y` data sets instead.

```
# model.py
accuracy = sgdc.score(X=clean.test_x, y=clean.test_y)
```

We can then use this accuracy metric to give us some indication as to whether we have a good model or not. If we judge that the accuracy score is not good enough, then we can change our model. This could be done by selecting different input features, cleaning and aggregating the features in a different way, instantiating our estimator with different hyper-parameters, or otherwise. We'll see a bit more on how to go about selecting our features in the 2nd part of this blog post series.

Using our example dataset, we see that our accuracy is 92%! This is exceptionally high for a first run - almost as if this data set was explicitly designed for this exercise... suspicious.

Now that our model has been fitted and its performance tested, we can use it to make predictions. Here we would normally input new data that we wanted to get predicted values for (separate from the training and testing sets). However, as our example data lacks this information, let's reuse our test set just to demonstrate the `predict` method's functionality.

```
# model.py
predictions = sgdc.predict(clean.test_x)
```

The output of the `predict` method is an array containing a prediction for each row of the input data set. We have now successfully obtained predictions using our model!

Note that during the testing stage, in addition to using the `score` method, we could have used the `predict` method for further testing and analysis - for example, by using the output values to compute a confusion matrix. This can be useful in determining how our model fares when predicting specific outcomes, such as the more common response values in a multinomial classification task. If our model only predicts the most common outcomes correctly, while completely failing on the less common ones, we should ask ourselves whether this is really an adequate model to achieve our aims. In that case the model could simply be picking the modal value for every prediction, which is hardly insightful. In the third part of this blog series we shall compare our model against benchmarks using dummy estimators to automatically check for these sorts of issues.
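As a sketch of that idea, Scikit-learn's `confusion_matrix` function cross-tabulates actual against predicted classes. Here is a self-contained example that repeats our earlier steps:

```python
# Build the model as before, then cross-tabulate actual vs predicted classes
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix

data = load_breast_cancer()
train_x, test_x, train_y, test_y = train_test_split(
    data.data, data.target, test_size=0.3, random_state=5)

sgdc = SGDClassifier().fit(train_x, train_y)
predictions = sgdc.predict(test_x)

# Rows are actual classes, columns are predicted classes;
# the off-diagonal entries count the misclassifications
matrix = confusion_matrix(y_true=test_y, y_pred=predictions)
print(matrix)
```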

Hopefully you are now a bit more comfortable using the Scikit-learn library to fit basic models to your data. Though we have only demonstrated this process with a single estimator class, by following the four steps above you will be able to use almost any estimator from the Scikit-learn library. If you have the time, see this for yourself with the following exercise.

It is also worth noting that it is not just the predictive models themselves which follow the steps of instantiate, fit, score and predict but almost all of the algorithms in the Scikit-learn library (though there are slight variations on the scoring and prediction steps). We shall see this in practice when we utilise the Recursive Feature Elimination algorithm in the following blog post during the feature selection stage. See you then!

Repeat the exact same steps as above using a random forest classifier. Do you get a better result?

*Hint: Remember to import the random forest model's estimator class from its associated module*
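If you get stuck, here is one possible sketch of the exercise, written as a self-contained script rather than split across `clean.py` and `model.py` (try it yourself first!):

```python
# Exercise sketch: the same four steps with a random forest classifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
train_x, test_x, train_y, test_y = train_test_split(
    data.data, data.target, test_size=0.3, random_state=5)

rfc = RandomForestClassifier(random_state=5)   # instantiate
rfc.fit(X=train_x, y=train_y)                  # fit
accuracy = rfc.score(X=test_x, y=test_y)       # score
print(f"Random forest accuracy: {accuracy:.2f}")
```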