Wednesday, August 5, 2020

Random Forest Algorithm for Breast Cancer Classification

Hello ppl, 

Random forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction, and the class with the most votes becomes our model's prediction.

In data science speak, the reason that the random forest model works so well is:

A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.
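To make the voting mechanism concrete, here is a minimal sketch of the majority vote over a handful of hypothetical tree predictions (the labels are made up for illustration):

from collections import Counter

# Hypothetical predictions from five individual trees for one sample
# (1 = malignant, 0 = benign)
tree_predictions = [1, 0, 1, 1, 0]

# The forest's prediction is the class with the most votes
forest_prediction = Counter(tree_predictions).most_common(1)[0][0]
print(forest_prediction)  # -> 1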

To understand the random forest algorithm, it is important to first understand how decision trees work. The idea of decision trees is covered in this post.

The basic idea of decision trees is that, given multiple features, we choose a particular feature and a threshold for it. Based on this condition, we split the samples into two groups: one with values of that feature below the threshold and the other with values above it. This process is repeated until a stopping criterion is met. At every step the quality of the split is assessed, usually with either the Gini impurity or the entropy measure, both of which are also explained in the decision tree post.
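As a quick illustration of how a split is scored, here is a small sketch of the Gini impurity measure (the label sets are made up; entropy would be used the same way):

import numpy as np

def gini_impurity(labels):
    # Gini impurity: 1 - sum over classes of p_k^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))  # 0.0  (pure node)
print(gini_impurity([0, 0, 1, 1]))  # 0.5  (worst case for two classes)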

Decision trees are highly specific to the training data and tend to overfit. A single decision tree uses all the features to classify the data, whereas the random forest algorithm builds multiple decision trees, each of which sees a different subset of the features. The important point is that individual trees can share features, i.e. the feature subsets need not be mutually exclusive.

Let us try to understand this with an example.

Suppose we have 10 features to describe the data. With all 10 features, any one feature might dominate in the training data and adversely influence the test results. So what we do is create multiple decision trees (say 5), some with and some without that particular feature. This reduces the influence of that feature on the classification, and the averaged results from all the trees contribute to the final prediction.

Having said this, the question arises of how the individual trees get their data and features. Here the concept of bagging, or bootstrap aggregation, comes in: each tree is trained on a random sample of the training data drawn with replacement, and on top of that each tree only considers a random subset of the features when splitting. Since every tree draws from the full feature pool, the same feature remains available to several trees.
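A rough sketch of what this sampling looks like for a single tree; the dataset sizes here are hypothetical, and scikit-learn does all of this internally:

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 10  # hypothetical dataset size

# Bootstrap sample: draw row indices with replacement for one tree
row_idx = rng.choice(n_samples, size=n_samples, replace=True)

# Random feature subset considered at a split (sqrt(n_features) is a common choice)
feat_idx = rng.choice(n_features, size=int(np.sqrt(n_features)), replace=False)

print(len(set(row_idx)), feat_idx)  # roughly 63 unique rows, plus 3 feature indices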

Data

The data we are going to use is the breast cancer dataset from Kaggle (link in References). The task with this particular dataset is to classify a tumor as malignant or benign based on the given features. There are 30 features describing the cancer data; some examples are radius, texture, and perimeter.
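The Kaggle file comes as a CSV; the same Wisconsin diagnostic data also ships with scikit-learn, which gives a convenient stand-in if you just want to follow along:

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target  # 569 samples, 30 features
print(X.shape)                 # (569, 30)
print(data.feature_names[:3])  # ['mean radius' 'mean texture' 'mean perimeter']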

Coding

We are going to implement this using scikit-learn. The important parameters to look at are:

n_estimators: the number of trees in the forest (the default was 10 in older versions of scikit-learn and is 100 from version 0.22 onwards).

criterion: "gini" or "entropy", the same as for the decision tree classifier.

min_samples_split: the minimum number of samples required to split an internal node. The default is 2.

There are many more parameters to play with in the scikit-learn RandomForestClassifier. In the code below, the pre-processing is done using a min-max scaler and the number of trees is fixed to 10.
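Since the original code was shared as an image, here is a minimal sketch of the pipeline described above. It uses the copy of the dataset bundled with scikit-learn rather than the Kaggle CSV, and the train/test split ratio and random seed are my assumptions, so the exact numbers may differ slightly from the results below.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the data and hold out a test set
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Min-max scaling, fitted on the training split only
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Ten trees, once with Gini impurity and once with entropy as the split criterion
for criterion in ("gini", "entropy"):
    clf = RandomForestClassifier(n_estimators=10, criterion=criterion, random_state=0)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print("Accuracy using", criterion, acc * 100)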


Result

Accuracy using gini: 97.2027972027972

Accuracy using entropy: 98.6013986013986

We can clearly see that the results are better than those of the single decision tree algorithm.


References

https://towardsdatascience.com/understanding-random-forest-58381e0602d2

https://medium.com/machine-learning-101/chapter-5-random-forest-classifier-56dc7425c3e1

