Hello ppl,
A random forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each tree in the forest spits out a class prediction, and the class with the most votes becomes the model's prediction.
In data-science speak, the reason that the random forest model works so well is this: a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.
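To make the committee idea concrete, here is a minimal sketch (mine, not from the quoted article) of a majority vote over individual tree predictions; the tree outputs are made up purely for illustration.

import numpy as np

# Hypothetical class predictions (0 or 1) from five individual trees
# for four test samples: rows = trees, columns = samples.
tree_predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
])

# The forest's prediction for each sample is the class with the most votes.
votes = np.apply_along_axis(np.bincount, 0, tree_predictions, minlength=2)
forest_prediction = votes.argmax(axis=0)
print(forest_prediction)  # -> [0 1 1 0]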
To understand the random forest algorithm, it is important to first understand how decision trees work. The idea behind decision trees is covered in this post.
The simple idea behind decision trees is this: given multiple features, we choose a particular feature and a threshold for it. Based on this condition, we split the samples into two groups, one with the feature value below the threshold and the other with it above. The process is repeated until a stopping criterion is met. At every step the quality of the split is assessed, usually with either the Gini impurity or the entropy measure, both of which are also explained in the decision tree post.
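As a quick refresher (this sketch is mine, not taken from the decision tree post), the Gini impurity and entropy of a set of class labels can be computed as below; a candidate split is then scored by the weighted impurity of the two groups it produces.

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum of p * log2(p) over the classes present.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Example: labels of the two groups produced by one candidate split.
left, right = np.array([0, 0, 0, 1]), np.array([1, 1, 1, 0, 1])
n = len(left) + len(right)
weighted_gini = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
print(round(weighted_gini, 3))  # lower means a purer (better) split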
Decision trees are highly specific to the training data and tend to overfit. A single decision tree uses all the features together to classify the data, whereas the random forest algorithm builds multiple decision trees, each working with a different subset of the features. The important point is that individual trees can share features, i.e. the feature subsets need not be mutually exclusive.
Let us try to understand this with an example. Suppose we have 10 features to describe the data. With all 10 features, any one feature might dominate in the training data and could influence the test results adversely. So what we do is create multiple decision trees (say 5), some with and some without that particular feature. This reduces the influence of that particular feature on the classification, and the aggregated results from all the trees contribute to the final prediction.
Having said this, the question arises of how to distribute the data and features across the individual trees. We use the concept of bagging, or bootstrap aggregation, for this: each tree is trained on a random sample of the training data drawn with replacement, and each tree additionally considers only a random subset of the features when splitting. Because this sampling is repeated independently for every tree, all of the features remain available for every individual tree to choose from.
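Here is a rough, hand-rolled sketch of these two sources of randomness (rows drawn with replacement, a random feature subset per tree) using made-up data and scikit-learn's DecisionTreeClassifier; it only illustrates the idea and is not a substitute for RandomForestClassifier.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

trees, feature_subsets = [], []
for _ in range(5):
    # Bagging: draw a bootstrap sample of the rows (with replacement).
    rows = rng.integers(0, len(X), size=len(X))
    # Feature randomness: each tree sees only a random subset of the features.
    feats = rng.choice(X.shape[1], size=4, replace=False)
    trees.append(DecisionTreeClassifier().fit(X[rows][:, feats], y[rows]))
    feature_subsets.append(feats)

# Majority vote of the five trees.
all_preds = np.array([t.predict(X[:, f]) for t, f in zip(trees, feature_subsets)])
ensemble_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)
print("ensemble training accuracy:", (ensemble_pred == y).mean())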
Data
The data we are going to use is the breast cancer dataset from Kaggle (link in References). The task with this particular dataset is to classify a tumor as malignant or benign based on the given features. There are 30 features describing the cancer data; some examples are radius, texture, and perimeter.
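The loading step is not shown in this post; the sketch below assumes the Kaggle file has been downloaded locally as data.csv with an 'id' column, a 'diagnosis' column ('M'/'B') and 30 numeric feature columns.

import pandas as pd

# Assumed layout of the Kaggle CSV: 'id', 'diagnosis' ('M' = malignant,
# 'B' = benign) and 30 numeric feature columns such as radius and texture.
df = pd.read_csv("data.csv")
y = (df["diagnosis"] == "M").astype(int)  # 1 = malignant, 0 = benign
X = df.drop(columns=["id", "diagnosis"]).dropna(axis=1, how="all")
print(X.shape)  # should report 30 feature columns

# Alternatively, the same Wisconsin breast cancer data ships with scikit-learn:
# from sklearn.datasets import load_breast_cancer
# X, y = load_breast_cancer(return_X_y=True)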
Coding
We are going to implement this using scikit-learn. The important parameters to look at are:
n_estimators: the number of trees in the forest. Default is 10.
criterion: "gini" or "entropy", same as in the decision tree classifier.
min_samples_split: the minimum number of samples required at a node to split it. Default is 2.
There are many more parameters to play with in the scikit-learn RandomForestClassifier. In the code below, the pre-processing is done using a min-max scaler and the number of trees is fixed to 10.
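The original code is not reproduced here, so the snippet below is a reconstruction of the described setup (min-max scaling, 10 trees, both split criteria). The built-in copy of the dataset, the train/test split and the random seed are my assumptions, so the accuracies will not match the reported numbers exactly.

from sklearn.datasets import load_breast_cancer  # stand-in for the Kaggle CSV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)  # assumed split

# Pre-processing: scale every feature to the [0, 1] range.
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 10 trees, once with Gini impurity and once with entropy as the criterion.
for criterion in ("gini", "entropy"):
    clf = RandomForestClassifier(n_estimators=10, criterion=criterion,
                                 random_state=0).fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print("Accuracy using", criterion, acc * 100)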
Result
Accuracy using gini 97.2027972027972
Accuracy using entropy 98.6013986013986
We can clearly see that the results are better than those of the plain decision tree algorithm.
References
https://towardsdatascience.com/understanding-random-forest-58381e0602d2
https://medium.com/machine-learning-101/chapter-5-random-forest-classifier-56dc7425c3e1