Your First Machine Learning Pipeline - Part III

Written By Lancelot Rossert, Edited By Mae Semakula-Buuza
Thu 24 May 2018, in category Data science

Machine Learning, Python



Previously in the "Your First Machine Learning Pipeline Series", we learned how to run a model on breast cancer data and select a feature set using the RFE algorithm.

After following the first two blog posts of this series, you’ve probably reached a stage where you’re comfortable with manipulating data and engineering new features.

In this blog post, we will learn how to analyse the results generated by our machine learning (ML) algorithms. The two key concepts that we will be exploring are dummy estimators, which serve as a benchmark to compare our models against, and confusion matrices, which show which responses our models are better at predicting.

We shall explore various scoring metrics such as accuracy, precision and recall in a little more detail while taking into account the two concepts above.

Concept 1: Dummy Classifiers

"Your First Machine Learning Pipeline – Part I" helped us understand how to create a model that could predict a testing dataset with ~90% accuracy. We treated 90% accuracy as the threshold above which a model is good enough. However, accuracy alone is not sufficient to indicate whether the model is actually good.

Let me present an example of why accuracy may not be the best indicator of a good model. Let’s imagine that 90% of the responses in our data are identical. Would a model with 90% accuracy then be considered a good model? No. The reason is pretty simple: the model is doing no better than if we always predicted the mode (the most common value).
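To make the arithmetic concrete, here is a minimal sketch using made-up labels (the 90/10 split is purely illustrative and is not the breast cancer data). A "model" that always predicts the mode scores 90% accuracy without learning anything:

```python
from collections import Counter

# Hypothetical responses: 90 of the 100 labels are identical
y_true = [1] * 90 + [0] * 10

# "Model" that ignores the features and always predicts the mode
mode = Counter(y_true).most_common(1)[0][0]
y_pred = [mode] * len(y_true)

# Accuracy = correct predictions / total predictions
mode_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(mode_accuracy)  # 0.9
```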

So before we can decide whether our model is any good, we need a benchmark to compare it to. The first question to consider is this: is it possible to always predict the mode of the dataset? Yes, it is. A classifier that does exactly that will be our benchmark: our model only adds predictive value if it scores better than always predicting the mode.

The Scikit-learn package gives us access to a variety of classifiers that can make predictions using simple rules – the one that we will be using now is the DummyClassifier.

Here, we will be using the same methodology for building a ML model as found in "Your First Machine Learning Pipeline - Part I". Feel free to navigate there if you’re not too familiar with it or need a refresher.

The DummyClassifier can be utilised to make predictions as follows:

from sklearn.dummy import DummyClassifier

# instantiate and fit
# (assumes the training features are stored as clean.breast_cancer_data_train,
# mirroring the test-set name used below)
dummy = DummyClassifier(strategy='most_frequent').fit(
    X=clean.breast_cancer_data_train, y=clean.breast_cancer_response_train)

# test dummy score
dummy_accuracy = dummy.score(X=clean.breast_cancer_data_test, y=clean.breast_cancer_response_test)

# we can even make predictions with the dummy
dummy_pred = dummy.predict(clean.breast_cancer_data_test)


Our DummyClassifier generates an accuracy score of ~77%, which is considerably lower than the ~90% accuracy of our actual model. Because our model clearly outperforms this benchmark, we can say that it is adding real predictive value.
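If you don't have the `clean` module from Part I to hand, the same benchmark can be sketched end to end. This version assumes scikit-learn's bundled breast cancer dataset and a 70/30 split with an arbitrary `random_state`. Both are stand-ins, so the score will not necessarily match the ~77% quoted above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled copy of the data, not the cleaned data from Part I
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The dummy ignores X and always predicts the most frequent training label
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
dummy_accuracy = dummy.score(X_test, y_test)

# Its accuracy is simply the share of that label in the test set
mode_label = dummy.predict(X_test[:1])[0]
print(mode_label, round(dummy_accuracy, 3), round((y_test == mode_label).mean(), 3))
```

The last line prints the same number twice, which is the point: the benchmark's accuracy is fixed by the class balance alone.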

Concept 2: Confusion Matrices

We have just seen that dummy estimators and benchmarks are a quick and intuitive way of determining whether a model is adding predictive value.

Confusion matrices, on the other hand, are a bit more challenging to get your head around. They break a classifier's predictions down by actual and predicted class, which makes them really powerful in describing a model's predictive strengths and limitations.

A confusion matrix looks like the following:

[Figure: confusion matrix]

The y-axis (the rows) represents the class that each response actually belongs to, whereas the x-axis (the columns) corresponds to the class that we predicted each response to belong to.

The main diagonal of the matrix holds the correct predictions: True Negatives (TN) and True Positives (TP).

The reverse diagonal, on the other hand, holds the incorrect predictions: False Positives (FP) and False Negatives (FN).
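The four cells can be read straight off scikit-learn's output. Here is a minimal sketch with made-up labels: for a binary problem with labels 0 and 1, `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns, so flattening it with `ravel()` yields TN, FP, FN and TP in that order.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true = [0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows = actual class, columns = predicted class (labels sorted as 0, 1)
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)  # 2 1 1 2
```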

All this information on confusion matrices can be difficult to digest, so let’s make it concrete by referring back to the DummyClassifier case. It always predicts the mode, which happens to be 1 here.


In the image above, only the left entries (labelled TN and FN in the figure) contain nonzero values. This is because the left column holds the responses predicted as category 1, and the dummy never predicts anything else. The DummyClassifier predicted 131 responses correctly, where the actual response was 1, and 39 incorrectly, where it predicted 1 but the actual response was 0.
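These counts can be reproduced without the original data. The sketch below rebuilds the label arrays from the figures quoted above (131 responses of class 1, 39 of class 0) and assumes the label order `[1, 0]`, so that class 1 occupies the first row and the first column, matching the left-hand entries described above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Rebuild the test responses from the counts quoted in the text
y_true = np.array([1] * 131 + [0] * 39)

# The dummy always predicts the mode, which is 1
y_pred_dummy = np.ones_like(y_true)

# Assumed label order [1, 0]: class 1 is the first row and the first column
con_mtx_dummy = confusion_matrix(y_true, y_pred_dummy, labels=[1, 0])
print(con_mtx_dummy)
# [[131   0]
#  [ 39   0]]
```

Every prediction lands in the left column, and 131 / 170 ≈ 0.77 recovers the dummy's accuracy.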

Confusion matrix – How do I code this?

The confusion_matrix function can be imported from the Scikit-learn library as follows:

from sklearn.metrics import confusion_matrix

We can use this to obtain the confusion matrices for the Stochastic Gradient Descent Model and Dummy Classifier:

#### Confusion matrix
y_true = clean.breast_cancer_response_test # ground truth target values
labels = y_true.unique()

# Confusion matrix for Stochastic Gradient Descent Model (as in
y_pred = model.breast_cancer_response_pred
# labels must be passed by keyword in recent scikit-learn versions
con_mtx_sgdc = confusion_matrix(y_true, y_pred, labels=labels)

# Confusion matrix for Dummy Classifier
y_pred_dummy = dummy_pred
con_mtx_dummy = confusion_matrix(y_true, y_pred_dummy, labels=labels)

This produces one confusion matrix for each model. They are easier to read when plotted, so it might be worth using a Python package such as plotly to visualise them.

Scoring metrics

So far, all of our analysis has been based on the accuracy metric. It is the most intuitive metric, as it measures how often the model's predictions match the true values.


# Equation 1 - Accuracy
                (TP + TN)
Accuracy = -------------------
           (TP + FN + FP + TN)

Accuracy is the proportion of predictions that are correct (TP + TN) out of all entries (TP + FN + FP + TN), as shown in Equation 1. It can be calculated from the confusion matrix as: (top left entry + bottom right entry) / the sum of all entries.
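As a quick check of Equation 1, the sketch below applies it to the dummy's matrix, typed in by hand from the counts quoted earlier (131 correct, 39 incorrect). The diagonal holds the correct predictions whichever class is treated as positive.

```python
import numpy as np

# Dummy classifier matrix, typed in from the counts quoted earlier
con_mtx_dummy = np.array([[131, 0],
                          [39, 0]])

# Equation 1: (top left + bottom right) / sum of all entries
accuracy = np.trace(con_mtx_dummy) / con_mtx_dummy.sum()
print(round(accuracy, 4))  # 0.7706
```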

Accuracy is not the only scoring metric that is used when evaluating your results. Shown below are examples of other scoring metrics that can be used.


# Equation 2 - Precision
               TP
Precision = ---------
            (TP + FP)

Equation 2. Precision.

Precision is the proportion of correct positive predictions (TP) out of all predicted positive observations (TP + FP), as shown in Equation 2. It can be calculated from the confusion matrix by taking the top left entry / the sum of all the left entries.

Linking this back to the breast cancer dataset, this is the proportion of cases predicted as benign (TP + FP) that actually are benign (TP).
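Here is a minimal sketch of Equation 2 on made-up labels. It treats 1 (benign) as the positive class and orders the matrix with `labels=[1, 0]` so that the top left entry is TP, as the description above assumes. `precision_score` is scikit-learn's built-in equivalent and should agree with the hand calculation.

```python
from sklearn.metrics import confusion_matrix, precision_score

# Made-up labels for illustration only: 1 = benign (positive), 0 = malignant
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

# labels=[1, 0] puts the positive class first, so the top-left entry is TP
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fp = cm[0, 0], cm[1, 0]

# Equation 2: top-left entry / sum of the left column
precision = tp / (tp + fp)
print(precision, precision_score(y_true, y_pred, pos_label=1))  # 0.8 0.8
```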

Sensitivity or recall:

# Equation 3 - Recall
            TP
Recall = ---------
         (TP + FN)

Equation 3. Sensitivity or recall.

Recall is the proportion of correct positive predictions (TP) out of all observations that are actually positive (TP + FN), as shown in Equation 3. Within the confusion matrix, you simply take the top left entry and divide it by the sum of the top entries.

This is the proportion of benign breast cancers that are predicted correctly out of all of those that are actually benign.
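Recall also shows what accuracy hides about the dummy. The sketch below rebuilds the labels from the counts quoted earlier (131 of class 1, 39 of class 0) and uses scikit-learn's `recall_score`, where `pos_label` chooses which class counts as positive.

```python
import numpy as np
from sklearn.metrics import recall_score

# Labels rebuilt from the counts quoted earlier; the dummy always predicts 1
y_true = np.array([1] * 131 + [0] * 39)
y_pred_dummy = np.ones_like(y_true)

# Equation 3 with class 1 as positive: TP / (TP + FN) = 131 / (131 + 0)
recall_class_1 = recall_score(y_true, y_pred_dummy, pos_label=1)

# With class 0 as positive, no positives are ever found: 0 / (0 + 39)
recall_class_0 = recall_score(y_true, y_pred_dummy, pos_label=0)
print(recall_class_1, recall_class_0)  # 1.0 0.0
```

The dummy finds every class 1 case but none of the class 0 cases, even though its accuracy is ~77%.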

Exercise 3

In the healthcare industry, an incorrect diagnosis can have dire consequences. For example, if a patient's breast cancer is misdiagnosed as benign when it is actually malignant, their condition is likely to worsen because they will not receive the correct treatment.

So, where am I going with this example? Well… in this case, the accuracy metric may not be the most appropriate to use when evaluating this diagnosis.

Task: Re-do the RFE feature selection using Recall as the scoring metric.

Does the number of features in your optimised dataset change at all?

Hint: You will need to set the RFECV scoring parameter to scoring='recall' when creating an instance of the RFECV class.
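One way the hint might look in code is sketched below. It rests on assumptions and is not the series' own pipeline: it uses scikit-learn's bundled breast cancer data, standardises the features up front, and takes `SGDClassifier` as the estimator. The `cv=5` and `random_state` values are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: bundled dataset and SGDClassifier stand in for the Part I/II setup
X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # SGD is sensitive to feature scale

# Recursive feature elimination with cross-validation, scored on recall
selector = RFECV(
    estimator=SGDClassifier(random_state=0),
    step=1,
    cv=5,
    scoring="recall",
)
selector.fit(X_scaled, y)

print(selector.n_features_)  # number of features kept
print(selector.support_)     # boolean mask of the selected features
```

Note that scoring='recall' measures recall for the class labelled 1 by default, so it is worth checking which diagnosis that label stands for in your data.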


You’ve finally completed the "Your First Machine Learning Pipeline Series", so give yourself a pat on the back! But, what exactly have we learned throughout this series?

Of course, there’s a great deal of refinement that can be done within this process to improve the pipeline, but hopefully, you are now familiar with the basics of it all.

Although it was not covered in this blog series, it is important to note that cleaning your datasets prior to modelling is a key aspect of Data Science. Make sure that you familiarise yourself with the best practices of cleansing datasets so that you can effectively optimise your ML pipeline.

Wait… I’m eager for more!

Once you have learned the basics, there is no substitute for learning through practice. Therefore, I encourage you to try your hand at Kaggle competitions, which enable you to utilise and manipulate a variety of datasets, apply ML techniques (especially the ones that we learnt in this series) and check your results by referring to solutions provided by contributors. All in all, this will allow you to perfect your ML pipeline.

With this, I bid you goodbye. Happy (machine) learning!