An Introduction to Random Forests

Shivendra Prasad
Nov 11, 2020

Starting off, let me clear up a notion many first-timers and non-technical people might have. Random forests are not luscious trees spread over a vast stretch of land, seemingly placed randomly in clusters. In fact, random forests don’t even involve trees, at least not biological ones.

Random forests are an ensemble learning method used in the ever-expanding and much sought-after field of machine learning. They operate by constructing a multitude of decision trees (we’ll get to those too!) at training time and outputting either the class that is the mode of the classes predicted by the individual trees, or their average prediction. Predicting the mode solves a classification problem, while predicting the average solves a regression problem.

Decision Trees

In the field of machine learning, decision trees are tree-structured models that provide a simple way of making classifications. A node splits into child nodes based on a condition, and the program continues its execution through the conditions and outputs of the child nodes until a prediction is reached.

The major drawback of a single decision tree is that it is often not very accurate on unseen data. In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. the model ends up with low bias but high variance, which degrades the test results.

An example of a decision tree.
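To make the overfitting point concrete, here is a minimal sketch in scikit-learn, using synthetic data that I generate purely for illustration (it is not the dataset used later in this post). Typically, the unrestricted tree scores close to 100% on its training split but noticeably lower on the test split, while the shallow tree is more balanced.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic data, used only to illustrate the bias/variance trade-off
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (3, None):  # a shallow tree vs. an unrestricted (deep) tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print("max_depth =", depth, "train:", tree.score(X_tr, y_tr), "test:", tree.score(X_te, y_te))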

This is where the effectiveness of random forests is seen. Whereas a single decision tree struggles with highly irregular training sets, random forests tackle the problem by building an ensemble of decision trees, each grown with different root and child nodes. Averaging over this ensemble reduces the variance of the model, making it more accurate than a lone decision tree, albeit at the cost of a slightly higher bias.

How Do Random Forests Work?

Random forests work through two important steps: decision tree learning and bagging.

Decision Tree Learning

Decision tree learning involves growing many randomized trees, each built from different values. Though not the same thing, creating a forest gives an effect loosely similar to a K-fold validation set, in the sense that the model is fed multiple different combinations of the training set.

Bagging

Bootstrap aggregating, or bagging for short, is the training procedure applied to the trees in the forest. It works in the following way:

Given a training set X = x1, …, xn with responses Y = y1, …, yn, bagging repeatedly selects a random sample of the training set with replacement and fits a tree to each sample, as follows:

For t = 1, …, T:

  1. Sample, with replacement, n training examples from X, Y; call these Xt, Yt.
  2. Train a classification or regression tree ft on Xt, Yt.

An example of how bagging is performed is given below.

Bagging
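As a rough sketch of the sampling step in code (the helper name and the assumption that X and Y are NumPy arrays are mine, not part of the implementation shown later in this post):

import numpy as np

def bootstrap_sample(X, Y):
    # draw n row indices with replacement: some rows repeat, others are left out
    idx = np.random.randint(0, len(X), size=len(X))
    return X[idx], Y[idx]

# for t = 1, ..., T: draw one bootstrap sample per tree and fit tree f_t on it
# X_t, Y_t = bootstrap_sample(X, Y)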

After all the decision trees have been trained on their randomly selected samples, results are predicted either by averaging the predictions of the decision tree regressors or by majority voting among the decision tree classifiers. Which aggregation method is required depends on the type of decision tree we initialized.

Pictorial representation of the working of Random Forests.

Output Calculation

In random forests, we use either the majority-voting method or the averaging method to predict the values. Each has its own methodology and output.

First, let’s start with the averaging method. In this method, decision tree regressors are taken as the inputs and an output is predicted by each tree. The predictions of all the trees for a single example are then summed and divided by the total number of trees, and the resulting average is the output for that example. For instance, if three trees predict 4.0, 5.0 and 6.0, the forest outputs (4.0 + 5.0 + 6.0) / 3 = 5.0.

Next is the maximum voting method. Here, decision tree classifiers are used for predicting the outputs. This is a rather simple concept: each tree votes for the class it predicts to be correct. Once all the trees have voted, we count how many times each class occurs and assign the most frequent one as the definitive output for that example. If, say, seven trees vote for class A and three for class B, the output is class A; hence the name majority voting.

Maximum/Majority Voting Algorithm

Why and When to Choose Random Forests?

  • It is highly accurate compared with most current learning algorithms.
  • It runs efficiently on large databases.
  • It can handle thousands of input variables without variable deletion.
  • It gives estimates of which variables are important in the classification (illustrated in the sketch after this list).
  • It generates an internal, unbiased estimate of the generalization error as the forest building progresses (the out-of-bag error, also shown in the sketch below).
  • It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
  • It has methods for balancing error in class-imbalanced data sets.
  • Generated forests can be saved for future use on other data.
  • Prototypes are computed that give information about the relation between the variables and the classification.
  • It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) give interesting views of the data.
  • The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
  • It offers an experimental method for detecting variable interactions.
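As a brief illustration of two of these points, scikit-learn’s random forest exposes the variable-importance estimates as feature_importances_ and the internal generalization estimate as the out-of-bag score. The following is only a minimal sketch on synthetic data, not a benchmark:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data, purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print("Variable importances:", rf.feature_importances_)   # one score per input feature
print("Out-of-bag accuracy estimate:", rf.oob_score_)      # internal generalization estimate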

Code for Random Forest

No technical blog is complete without some code or a formula. Code is the reason many people, myself included at times, open blogs. Getting a quick insight into, or an easy understanding of, an entire concept through a few lines of code is what makes the field of computer science beautiful.

Choosing Process

Creating n decision trees for the forest-

import random
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

n = int(input())                      # number of trees in the forest
trees = []
random_x_train = [None for i in range(n)]
random_x_test = [None for i in range(n)]
for i in range(n):
    # pick a random subset of features for this tree (see choose() below)
    random_x_train[i], random_x_test[i] = choose(x_train, x_test)
    dt = DecisionTreeClassifier()
    dt.fit(random_x_train[i], y_train)
    trees.append(dt)

Choosing a random subset of features from the training and testing sets-

def choose(x_train, x_test):
    # always keep the first four features, then add a random number of extra ones
    minrand = 4
    rand = random.randint(minrand, len(x_train.columns) - 1)
    if rand > minrand:
        minrand = rand
    feat = [0, 1, 2, 3]
    temp_train = pd.DataFrame(x_train.iloc[:, :4], columns=x_train.iloc[:, :4].columns)
    temp_test = pd.DataFrame(x_test.iloc[:, :4], columns=x_test.iloc[:, :4].columns)
    columns = x_train.columns
    while len(feat) != minrand:
        ind = random.randint(0, len(x_train.columns) - 1)
        if ind not in feat:
            temp_train[columns[ind]] = x_train.iloc[:, ind]
            temp_test[columns[ind]] = x_test.iloc[:, ind]
            feat.append(ind)
    return temp_train, temp_test

Mode for Classifier-

def mode(p):
    # p is a list of prediction arrays, one per tree
    fp = []
    for i in range(len(p[0])):
        n = {}                            # vote counts for example i
        for j in range(len(p)):
            if p[j][i] not in n:
                n[p[j][i]] = 1
            else:
                n[p[j][i]] += 1
        m = max(n, key=n.get)             # class with the most votes
        fp.append(m)
    return fp

Average for Regressor-

def average(p):
    # p is a list of prediction arrays, one per tree
    fp = []
    for i in range(len(p[0])):
        avg = 0
        for j in range(len(p)):
            avg += p[j][i]
        fp.append(avg / len(p))           # mean prediction across all trees
    return fp

Maximum Voting-

print("Max Voting")
pred=[]
for j in range(n):
pred.append(trees[j].predict(random_x_test[j]))
fp=mode(pred)
print("Final Prediction",fp)

Average Voting-

print("Average Voting")
pred=[]
for j in range(n):
pred.append(trees[j].predict(random_x_test[j]))
fp=average(pred)
print("Final Prediction",fp)

Results

Through these user-defined functions, I was able to implement my own random forest structure. However, this is just for understanding. Packages such as scikit-learn provide much more refined and optimized implementations of the same idea, and using them is as simple as putting butter on bread.
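For instance, here is a minimal scikit-learn sketch, assuming the same x_train, x_test and y_train variables used above plus a corresponding y_test, which I introduce here only for scoring:

from sklearn.ensemble import RandomForestClassifier

# x_train, x_test, y_train are the variables used earlier; y_test is assumed to exist
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(x_train, y_train)
print("Training accuracy:", rf.score(x_train, y_train))
print("Testing accuracy:", rf.score(x_test, y_test))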

However, I wanted to highlight the concept of random forests through this post since I feel it’s a rather intriguing method of learning.

With a dataset of mine, by using a decision tree classifier, I was able to achieve a training and testing accuracy of 85% and 83% respectively. But on applying the random forest classifier, my accuracies boosted up to 96% and 95% respectively.

I know that isn’t perfect and we should all strive for the golden 100% mark, but seeing such a radical boost was initially shocking. Having seen how much more beneficial a forest is than a single tree, I can now appreciate the applications of such a structure.

Conclusion

Random forests present us with a vast opportunity for handling data. They are easy to understand, learn well, and, for the most part, prevent overfitting.

Nothing can ever achieve 100% perfection. That’s a line that has been taught to me ever since I started my journey in the field of machine learning. But there is plenty of motivation to keep striving for it.

Also, with all this talk about trees and forests, don’t forget about the actual trees and forests which exist to help us survive on this earth. Saving our environment is mandatory and we need to do our best at it.

Phew! This, being my first go at writing a blog, has been a daunting process. But in the end, it’s nice to be able to share something which might prove to be helpful to someone, someday.
