This blog will cover how to stack and ensemble multiple machine learning models into a single predictive model in Python.
Stacking predictive models is a technique in which you use a collection of classifiers to output a set of predictions, and then train a second classifier on those predictions. The goal of stacking is to reduce generalisation error. The idea is that the final model gains more information about the problem space by training on the outputs of multiple initial models, as opposed to training on the raw features alone. The concept was first introduced by Wolpert in a 1992 paper. The success of stacking approaches can be seen on Kaggle's leaderboards, the tops of which are dominated by stacked models.
In a more practical sense, what makes stacking attractive in machine learning projects is that it can lead to significant accuracy gains without being overly complex to implement. In comparison to extensive feature engineering or hyperparameter tuning, stacking allows for a (relatively) reliable and easy boost to accuracy. So, if stacking is an easy way to make your models better, why not just do it for every project? In other words, this is sounding way too good to be true, so why wouldn't you stack models?
Firstly, stacking is inevitably going to increase the time it takes to run your models, potentially by a considerable amount. Therefore, for any project where runtime performance is a major concern, stacking may not be appropriate. Stacking also adds a layer of complexity: instead of having a single model to look after, you now have many, and that will invariably add to the complexity and workload of your project. These are things you will need to consider if you plan to incorporate stacking into your approach.
Let’s examine a scenario in which we want to build a 2-level stacking classifier. By "2-level classifier", what we mean is a collection of classifiers that all train on the same dataset and output predictions as level 1. Those predictions are then fed into the single classifier that makes up level 2. This allows the second-level learner to get a more complete view of the feature space than it would normally get.
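As an aside, scikit-learn ships a built-in `StackingClassifier` that implements exactly this 2-level architecture. Here is a minimal sketch of it (my own illustration on synthetic data, separate from the hand-rolled approach this post walks through):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic binary classification data for illustration
X, y = make_classification(n_samples=500, random_state=0)

# Level 1: base learners that all train on the raw features
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]

# Level 2: a meta-learner trained on the base learners' cross-validated predictions
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X, y)
score = stack.score(X, y)
```

Building the machinery by hand, as we do below, gives you more control over the out-of-fold predictions and lets you inspect each base learner's output directly.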
So, in order to build level 1, we start by making a selection of base learners. For our base learners we are going to use Random Forest, Extra Trees, AdaBoost, Gradient Boosting and a Support Vector Classifier.
Our first step is to write a set of helper functions that will allow us to set up all our models and use a KFold object to handle our training and testing splits.
```python
"""
Create a class to extend the Sci-kit Learn classifier.
This reduces the amount of code that we will have to write later on.
"""
import re

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

NFOLDS = 5
KF = KFold(n_splits=NFOLDS)


class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        """Get the name of the model for labelling purposes and set a random seed."""
        self.name = re.search("('.+')>$", str(clf)).group(1)
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        """Fit with training data."""
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        """Make a prediction."""
        return self.clf.predict(x)

    def feature_importances(self, x, y):
        """Refit and get the feature importances."""
        return self.clf.fit(x, y).feature_importances_


def get_oof(clf, x_train, y_train, x_test):
    """Get out-of-fold predictions for a given classifier."""
    ntrain = x_train.shape[0]
    ntest = x_test.shape[0]

    # Initialise the correct sized dfs that we will need to store our results.
    oof_train_df = pd.DataFrame(index=np.arange(ntrain), columns=[clf.name])
    oof_test_df = pd.DataFrame(index=np.arange(ntest), columns=[clf.name])
    oof_test_skf_df = pd.DataFrame(index=np.arange(ntest), columns=np.arange(NFOLDS))

    # Loop through our kfold object
    for i, (train_index, test_index) in enumerate(KF.split(x_train)):
        # Use kfold object indexes to select the fold for train/test split
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        # Train the model on the in-fold rows
        clf.train(x_tr, y_tr)

        # Predict on the held-out fold of the training set
        oof_train_df.loc[test_index, clf.name] = clf.predict(x_te)

        # Predict on the full test set
        oof_test_skf_df[i] = clf.predict(x_test)

    # Take the mean of the 5 folds' test-set predictions
    oof_test_df[clf.name] = oof_test_skf_df.mean(axis=1)

    # Returns both the training and testing predictions.
    return oof_train_df, oof_test_df
```
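The key property `get_oof` relies on is that `KFold.split` partitions the row indices: every training row is held out in exactly one fold, so the out-of-fold predictions end up covering the whole training set. A quick standalone check of that property (my own illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
x = np.arange(20)  # 20 dummy rows

# Collect the held-out (test) indices from every fold
held_out = np.concatenate([test_idx for _, test_idx in kf.split(x)])

# Each of the 20 row indices appears exactly once across the folds
covers_all = bool((np.sort(held_out) == np.arange(20)).all())
```

This is why each base learner's out-of-fold column is a valid "prediction feature" for the level-2 model: no row's prediction was made by a model that saw that row during training.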
Now, we can use our helper functions to initialise our level 1 classifiers and then return our results as a dataframe. Note that here I assume that you have already successfully imported all of your classifiers (e.g. from sklearn.ensemble import RandomForestClassifier) and created dictionaries of their respective hyperparameters (e.g. rf_params).
```python
# Choose a random seed
SEED = 127

# Initialize classifier objects
rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)
et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)
svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)

# Create our out-of-fold train and test predictions.
# These base results will be used as new features.
et_train_df, et_test_df = get_oof(et, X_train, y_train, X_test)     # Extra Trees
rf_train_df, rf_test_df = get_oof(rf, X_train, y_train, X_test)     # Random Forest
ada_train_df, ada_test_df = get_oof(ada, X_train, y_train, X_test)  # AdaBoost
gb_train_df, gb_test_df = get_oof(gb, X_train, y_train, X_test)     # Gradient Boost
svc_train_df, svc_test_df = get_oof(svc, X_train, y_train, X_test)  # Support Vector Classifier

# Concatenate the training and test predictions into level-2 feature sets
x_train = pd.concat([et_train_df, rf_train_df, ada_train_df,
                     gb_train_df, svc_train_df], axis=1)
x_test = pd.concat([et_test_df, rf_test_df, ada_test_df,
                    gb_test_df, svc_test_df], axis=1)
```
Now, we can use the results from our level 1 classifier as inputs for our level 2 classifier. For our level 2 learner, we are going to use an XGBoost model.
```python
import xgboost as xgb

# Create an XGBoost classifier with the previous results as our input
gbm = xgb.XGBClassifier(n_estimators=200).fit(x_train, y_train)
predictions = gbm.predict(x_test)
probs = gbm.predict_proba(x_test)
```
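To sanity-check the level-2 learner, you would score its predictions against held-back labels. Here is a self-contained sketch of that final step, using scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost and random data as a stand-in for the level-1 prediction features (both swaps are mine, made so the snippet runs on its own):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the level-1 prediction features
X, y = make_classification(n_samples=400, n_features=5, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Stand-in for xgb.XGBClassifier(n_estimators=200)
gbm = GradientBoostingClassifier(n_estimators=200, random_state=1).fit(x_train, y_train)
predictions = gbm.predict(x_test)
probs = gbm.predict_proba(x_test)  # one column of probabilities per class

acc = accuracy_score(y_test, predictions)
```

With real level-1 features you would evaluate on a genuinely held-out set, since the meta-learner can otherwise overfit the base learners' predictions.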
So, how can we improve this approach from what we have now?
Some of the key takeaways for stacking methods like this are:
You may want to include models that learn in different ways and exploit the feature space differently. The idea is that the ensemble will then generalise better, as some models will be better at solving different parts of the problem, and you extract as much information from the feature set as possible.
We can also produce multiple stackers if we want: we don't need to limit ourselves to a single, relatively simple learner at the second level. If we wanted, we could replicate this process with several different models and approaches, giving us several different ensembles from which we can produce an averaged result at the end.
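A hedged sketch of that averaging idea: train several second-level learners on the same level-1 features and average their predicted probabilities. The model choices and the synthetic stand-in features here are illustrative, not from the original post:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the level-1 out-of-fold prediction features
X, y = make_classification(n_samples=400, random_state=2)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# Several different second-level learners instead of a single one
meta_models = [
    GradientBoostingClassifier(random_state=2),
    RandomForestClassifier(random_state=2),
    LogisticRegression(max_iter=1000),
]

# Average the positive-class probabilities across the second-level learners
avg_probs = np.mean(
    [m.fit(x_tr, y_tr).predict_proba(x_te)[:, 1] for m in meta_models], axis=0
)
avg_preds = (avg_probs > 0.5).astype(int)
```

Averaging smooths out the idiosyncrasies of any single meta-model, at the cost of yet more training time.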
So, with relatively little effort and some easily repeatable code, we can now produce an ensemble model that will often perform very well in comparison to single classifiers. Note that there are myriad other ways to stack and ensemble models that haven't been covered here.
I hope you enjoyed exploring the topic of ensembling models with me! For some extra reading around the subject, I would highly recommend starting with this guide; it covers many more methods and includes additional reading resources.