Alteryx's R Random Forest Output: Explained

By Nicole Cruise, Mon 06 March 2017, in category Data science

Alteryx, R

  

The Forest Model

Using the sample Alteryx workflow, Forest Model, this article explains the R-generated output. If you would like to view the data and output yourself, open Alteryx and go to:

Help > Sample Workflow > Predictive Analytics > 7.Forest Model

and run the sample workflow.

The source data set contains personal and financial data that can be used to determine if a client will default on their loan based on Current Account Balance, Duration with the Bank, Credit History, etc.

The Basic Summary

image1.png

Record 2 above shows the call to the randomForest model. The arguments represent the fields fed to the algorithm and describe the type of model to be fitted.

formula - Default is the target variable (the field you want to predict). The fields following the ~ are the predictor variables used to predict it.

data - The data frame containing the variables in the model.

ntree - The number of trees to create within the model. This number should generally not be set too small; the Percentage Error for Different Number of Trees plot later on shows why. The default is 500, based on the findings of Breiman.

image2.png

In record 3, the type of forest, the number of trees, and the number of variables tried at each split are given. There are two types of random forest - classification and regression:

Regression estimates or predicts a continuous response, i.e. when you want to predict a number. E.g. predicting a sales figure for next month.

Classification identifies a class, i.e. when you want to predict a categorical variable (Yes/No). E.g. whether a client is going to renew a subscription or not.

Number of trees - The number of trees in the forest.

Number of variables tried at each split - The number of predictor variables considered at each split within a tree. At each split within each tree, the model only considers a random subset of variables, so as to reduce the bias towards more influential variables.
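The randomForest documentation gives the default for this value as the square root of the number of predictors for classification, and a third of the predictors for regression. A quick sketch of that rule (in Python, purely for illustration):

```python
import math

def default_mtry(n_predictors, classification=True):
    """Default number of variables tried at each split, mirroring
    randomForest's documented defaults: floor(sqrt(p)) for
    classification, floor(p/3) for regression."""
    if classification:
        return math.floor(math.sqrt(n_predictors))
    return math.floor(n_predictors / 3)

# With 20 predictors, a classification forest tries 4 variables per split,
# while a regression forest would try 6.
print(default_mtry(20))         # classification: floor(sqrt(20)) = 4
print(default_mtry(20, False))  # regression: floor(20 / 3) = 6
```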

image3.png

By its nature, a random forest iteratively uses different bootstrap samples of the data (bootstrap aggregation) to build multiple decision trees. Each tree is then tested on the data that was not used to build it. The average error across all these iterations is the Out of Bag (OOB) error, as given in record 4.
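Because each bootstrap sample draws n records with replacement, roughly a third of the records (about 1/e, or 36.8%) are left out of any given tree and act as its test set. A small sketch of that effect (Python, for illustration only):

```python
import random

random.seed(42)
n = 10_000
rows = range(n)

# One bootstrap sample: n rows drawn with replacement.
in_bag = set(random.choices(rows, k=n))

# Rows never drawn are "out of bag" for this tree and can be
# used to test it without a separate holdout set.
oob_fraction = 1 - len(in_bag) / n
print(f"{oob_fraction:.3f}")  # close to 1/e, roughly 0.368
```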

image4.png

Record 5 shows the confusion matrix. It is used to describe the performance of the forest model (or any other classification model). The table below explains each of its elements:

image5.png

The column targets (Yes/No) are the forest's predictions and the row targets are the actual values. The classification error measures the percentage of times the model predicted incorrectly, so from the above confusion matrix the model is better at predicting ‘No’ than ‘Yes’.
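The per-class classification error can be recomputed directly from the matrix counts. A sketch with made-up counts (not the workflow's actual numbers), in Python for illustration:

```python
def class_errors(confusion):
    """Per-class error rate from a confusion matrix given as
    {actual: {predicted: count}}. Error = misclassified / row total."""
    errors = {}
    for actual, row in confusion.items():
        total = sum(row.values())
        correct = row.get(actual, 0)
        errors[actual] = (total - correct) / total
    return errors

# Hypothetical counts: rows are actual classes, columns are predictions.
matrix = {"No": {"No": 190, "Yes": 20},
          "Yes": {"No": 55, "Yes": 35}}
print(class_errors(matrix))
```

With these invented counts the ‘No’ row errs on 20 of 210 records while the ‘Yes’ row errs on 55 of 90, which is the same "better at No than Yes" pattern described above.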

Analysing the Plots

image6.png

The plot above helps you decide how many trees to have in your model. On the y-axis is the error of the model and the x-axis is the number of trees used.

As you can see, with between 0 and 50 trees the error remains quite high, but it drops and flattens out at around 100 trees. There is an additional drop for both classes at around 500 trees, so it could be worth adding more trees to this model to see if the error decreases further.

There will, of course, come a point at which each additional tree only adds time and computational cost without improving overall model performance.

image7.png

The variable importance plot given above shows how important each variable is when classifying the data. The predictor variables are on the y-axis, with the mean decrease Gini on the x-axis. The mean decrease Gini measures how much each variable contributes to the purity of the nodes in a tree.
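The Gini impurity of a node is 1 minus the sum of the squared class proportions, and a split's contribution is the impurity decrease it achieves; mean decrease Gini averages this over every split a variable makes across the forest. A minimal sketch of the impurity decrease for a single split (illustrative, not the randomForest internals):

```python
def gini(counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_decrease(parent, left, right):
    """Impurity decrease from splitting a parent node into two children,
    each child weighted by its share of the parent's records."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Hypothetical node: 50 'No' / 50 'Yes' (impurity 0.5),
# split into two noticeably purer children.
parent = [50, 50]
left, right = [40, 10], [10, 40]
print(round(gini_decrease(parent, left, right), 3))
```

Variables that repeatedly produce purer child nodes accumulate a larger total decrease, which is why they rank higher in the importance plot.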

As with the number of trees in the forest, reducing the number of variables in the model can cut computational time and cost without decreasing model accuracy. However, it is important not to have too few predictor variables, as the model might then be unable to separate the classes correctly.