Hello ppl,
In this post, we are going to look at the decision tree algorithm, which is the basic building block of the random forest algorithm.
Data
The data we are going to use is the breast cancer dataset from Kaggle (link in References). The task with this particular dataset is to classify a tumor as malignant or benign based on the given features. There are 30 features describing the cancer data; some examples are radius, texture, and perimeter.
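If you just want to poke at the data quickly, the same Wisconsin breast cancer data also ships with scikit-learn; the small sketch below uses that built-in copy, while the rest of this post works with the Kaggle CSV.

```python
from sklearn.datasets import load_breast_cancer

# scikit-learn's copy of the Wisconsin breast cancer data:
# 569 samples, 30 numeric features, 2 classes (malignant / benign).
data = load_breast_cancer()
print(data.feature_names[:3])   # 'mean radius', 'mean texture', 'mean perimeter'
print(data.target_names)        # ['malignant' 'benign']
print(data.data.shape)          # (569, 30)
```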
Algorithm
The decision tree algorithm is a supervised learning algorithm. It works like a tree graph with multiple nodes growing from a single root node. At every node there is a question, and based on the answer to that question, the samples are split into child nodes.
For example, let us assume we are separating people who could buy a particular product (class 1), and the features describing the data (each person) could be salary, average expenditure, savings and so on. We can initially split the people based on a particular salary threshold. With the samples split into two groups, we can split each group again with another question.
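A tiny sketch of that first question, with invented salary numbers, just to make the idea of splitting at a node concrete:

```python
# Hypothetical data: (salary, bought_product) pairs; the numbers are made up for illustration.
people = [(25_000, 0), (48_000, 1), (60_000, 1), (30_000, 0), (75_000, 1)]

threshold = 40_000  # the "question" asked at the root node: salary < 40000?
left = [p for p in people if p[0] < threshold]     # samples that fail the condition
right = [p for p in people if p[0] >= threshold]   # samples that satisfy the condition

print(len(left), len(right))   # 2 3 -- the node's samples split into two child nodes
```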
The questions are, in general, found by the CART algorithm, which we will look at in detail in another post. The usual implementation of the search is a greedy algorithm that considers every value of every feature in every sample as a candidate threshold and splits the data into two. For example, if we have 10 samples, each with 4 features, then we compare and choose the best of 40 candidate conditions.
The goodness of a split can be measured using the Gini impurity measure or the entropy measure. The feature and threshold that give the best measure are chosen as the condition for that node.
The stopping criterion for the algorithm can be a minimum number of samples in every node, a maximum depth of the tree, or all samples being fully separated.
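A rough sketch of that greedy search (not the scikit-learn implementation), assuming the data is a small NumPy array X of shape (samples, features) with binary labels y; every observed feature value is tried as a threshold, and the split with the lowest weighted Gini impurity (defined in the next section) wins.

```python
import numpy as np

def gini(labels):
    """Gini impurity of one node: 1 - sum(p_k^2) over the class proportions."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy search: try every value of every feature as a candidate threshold."""
    best = (None, None, float("inf"))   # (feature index, threshold, weighted impurity)
    n = len(y)
    for feature in range(X.shape[1]):
        for threshold in X[:, feature]:   # 10 samples x 4 features -> 40 candidates
            left = y[X[:, feature] < threshold]
            right = y[X[:, feature] >= threshold]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best[2]:
                best = (feature, threshold, score)
    return best

# Tiny invented example: 10 samples, 4 features, as in the text above.
rng = np.random.default_rng(0)
X = rng.random((10, 4))
y = (X[:, 2] > 0.5).astype(int)   # class depends only on feature 2
print(best_split(X, y))           # feature 2 should come out as the best split
```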
Gini measure
P1 = number of samples of class 1 / total number of samples in the node
P2 = number of samples of class 2 / total number of samples in the node
There are two ways of representing the measure:
Gini = 2 * P1 * P2
or
Gini = 1 - (P1^2 + P2^2)
The two forms are equivalent because P1 + P2 = 1. The Gini impurity measure has a minimum of 0 (a pure node) and a maximum of 0.5 (an even mix of the two classes).
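A quick numeric check that the two forms agree and stay between 0 and 0.5, taking P1 and P2 as the two class proportions in a node:

```python
def gini_product(p1, p2):
    return 2 * p1 * p2            # Gini = 2 * P1 * P2

def gini_complement(p1, p2):
    return 1 - (p1**2 + p2**2)    # Gini = 1 - (P1^2 + P2^2)

for p1 in (0.0, 0.1, 0.5, 0.9, 1.0):
    p2 = 1 - p1
    print(p1, round(gini_product(p1, p2), 3), round(gini_complement(p1, p2), 3))
# Both forms give 0 for a pure node (p1 = 0 or 1) and peak at 0.5 when p1 = p2 = 0.5.
```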
Entropy measure
In the entropy measure, we calculate the information gain to evaluate the goodness of a split. Entropy is the degree of randomness of the elements, or in other words, a measure of impurity. Mathematically, it can be calculated from the class probabilities as:
Entropy = -sum(p_i * log2(p_i)), where p_i is the proportion of samples of class i in the node.
Now, say we split the data into two; we can then calculate the entropy of each split.
Information Gain = Entropy(total_data) - sum(weighted_average * Entropy(split_data))
The weighted average is the ratio of the number of samples in split_data to the number of samples in total_data.
For example, let node1 have 9 samples with entropy 0.66. After applying a condition, the left child has 5 samples with entropy 0.45 and the right child has 4 samples with entropy 0.29. We can calculate the information gain as:
Information Gain = 0.66 - [(5/9) * 0.45 + (4/9) * 0.29] = 0.66 - [0.25 + 0.13] = 0.28
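The same worked example in code; a minimal sketch that takes the entropies 0.66, 0.45 and 0.29 from the text as given:

```python
import math

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    n = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * math.log2(p)
    return result

def information_gain(parent_entropy, n_parent, children):
    """children is a list of (n_samples, entropy) pairs, one per child node."""
    weighted = sum(n / n_parent * e for n, e in children)
    return parent_entropy - weighted

# The node from the text: 9 samples with entropy 0.66, split into
# children of 5 samples (entropy 0.45) and 4 samples (entropy 0.29).
print(round(information_gain(0.66, 9, [(5, 0.45), (4, 0.29)]), 2))   # 0.28

# entropy() on a toy label list: 5 samples of class 0 and 4 of class 1.
print(round(entropy([0] * 5 + [1] * 4), 2))   # 0.99
```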
Coding
Having covered the theory necessary to understand the concept, we are going to use the decision tree classifier from scikit-learn for classification. The analysis of the data is itself an important topic to be covered in a separate post, so here we do only basic pre-processing on the data.
Pre-processing steps
1) Check for empty cells or NaN values and remove the empty cells.
2) Normalize the data; both z-score standardization and min-max scaling (scaling by the difference between the maximum and minimum values) were tried, and both methods give similar results.
In the given code, these particular operations are specific to the breast cancer dataset.
The raw data, after being split into train and test sets, is pre-processed using the min-max method, as sketched below. The results of the decision tree based on both the Gini and entropy measures are given.
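A minimal sketch of that pipeline, assuming the Kaggle CSV layout (an 'id' column, a 'diagnosis' column with M/B labels and a trailing empty column) saved locally as data.csv, which is a placeholder path; the exact accuracies will depend on the train/test split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Placeholder path; 'id' and 'diagnosis' follow the Kaggle CSV layout.
df = pd.read_csv("data.csv")
df = df.dropna(axis=1, how="all")                 # drop fully empty columns
y = (df["diagnosis"] == "M").astype(int)          # malignant = 1, benign = 0
X = df.drop(columns=["id", "diagnosis"])

# Split first, then fit the min-max scaler on the training data only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Train one tree per criterion, both with the depth of 6 used in this post.
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=6, random_state=42)
    tree.fit(X_train, y_train)
    acc = accuracy_score(y_test, tree.predict(X_test))
    print(f"Accuracy using {criterion}: {acc * 100:.2f}")
```

Scaling is not strictly required for decision trees (they are insensitive to monotonic feature scaling), but it is kept here to match the pre-processing steps described above.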
Results
Accuracy using Gini: 89.51%
Accuracy using entropy: 94.41%
In this case, the entropy measure outperforms the Gini measure for the specified tree depth of 6.
References
https://www.kaggle.com/sarahvch/breast-cancer-wisconsin-prognostic-data-set
https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/
https://medium.com/machine-learning-101/chapter-3-decision-trees-theory-e7398adac567
Further reading on the random forest algorithm is given in this post.
Comments
Very good.
A great article. Which method is more reliable for estimating the goodness of a split, the Gini measure or entropy?
Reply: Both can be used interchangeably. The only difference is that entropy is computationally more intensive than Gini (because of the log).
Ok, thanks.
I didn't fully understand... but I'm so proud of you :) !!!!