Tuesday, August 4, 2020

Simple explanation of Decision tree algorithm with Breast cancer classification dataset

Hello people,

In this post, we are going to see the usage of the decision tree algorithm, which is the basic building block of the Random Forest algorithm.

 Data

The data we are going to use is the breast cancer dataset from Kaggle (link in Reference). The requirement with this particular dataset is to classify a tumor as malignant or benign based on the given features. There are 30 features describing the cancer data; some examples are radius, texture, and perimeter.

Algorithm

The decision tree algorithm is a supervised learning algorithm. It works like a tree graph with multiple nodes growing from a single root node. At every node, a question is asked, and based on the answer, the samples are split into child nodes.

For example, let us assume we are separating people who could buy a particular product (class 1) from those who could not, and the features describing each data point (person) could be salary, average expenditure, savings, and so on. We can initially split the people based on a particular salary threshold. With the samples split into two groups, we can split each group again with another question.

The questions are in general obtained by the CART algorithm, which we will look at in detail in another post. The general implementation of the search is a greedy algorithm that considers every parameter value of every sample as a threshold and splits the data into two. For example, if we have 10 samples each with 4 parameters, then we compare and choose the best of 40 conditions.
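As a small illustration of that counting, here is a sketch (with made-up data) of the candidate conditions such a greedy search would enumerate; how each candidate is scored is described next.

import numpy as np

# Toy stand-in data: 10 samples, 4 parameters each (values are arbitrary)
rng = np.random.default_rng(0)
X = rng.random((10, 4))

# Every (parameter, sample value) pair is one candidate question:
# "is parameter j <= this value?"
candidates = [(j, X[i, j])
              for j in range(X.shape[1])
              for i in range(X.shape[0])]
print(len(candidates))  # 40 conditions for 10 samples x 4 parameters

# Applying one candidate condition splits the node into two child nodes
j, threshold = candidates[0]
left, right = X[X[:, j] <= threshold], X[X[:, j] > threshold]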

The goodness of a split can be measured using the Gini impurity measure or the entropy measure. The parameter and threshold that give the best measure are chosen as the condition for that node.

The stopping criteria for the algorithm can be a minimum number of samples in every node, a maximum depth of the tree, or all samples being fully separated into pure nodes.

Gini measure

P1 = number of samples of class 1 / total number of samples in the node

P2 = number of samples of class 2 / total number of samples in the node

There are two equivalent ways of representing the measure:

Gini = 2 * P1 * P2

or

Gini = 1 - (P1^2 + P2^2)

The minimum and maximum of the Gini impurity measure are 0 and 0.5 respectively (for a two-class problem).
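A minimal sketch of the two equivalent formulas, using the class counts that P1 and P2 are built from:

def gini(n_class1, n_class2):
    # Gini impurity of a node, given its class counts
    total = n_class1 + n_class2
    p1, p2 = n_class1 / total, n_class2 / total
    return 2 * p1 * p2  # identical to 1 - (p1**2 + p2**2)

print(gini(5, 5))   # 0.5 -> perfectly mixed node (worst case)
print(gini(10, 0))  # 0.0 -> pure node (best case)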

Entropy measure

In the entropy measure, we calculate the information gain to evaluate the goodness of a split. Entropy is the degree of randomness of the elements, or in other words, it is a measure of impurity. Mathematically, it can be calculated from the probabilities of the items as:

Entropy = - Σ p(i) * log(p(i)), summed over all items i

For example,
if the items are the die faces observed over a set of throws, say 1, 1, 2, 3,
the entropy is
p(1) = 0.5
p(2) = 0.25
p(3) = 0.25
entropy = - (0.5 * log(0.5)) - (0.25 * log(0.25)) - (0.25 * log(0.25)) = 0.45 (using log base 10)
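The same calculation as a short sketch; note that the 0.45 comes from a base-10 logarithm, while the more common base-2 convention would give 1.5 for the same data (the choice of base only rescales the value).

import math

# Die faces observed over the throws: 1, 1, 2, 3
counts = {1: 2, 2: 1, 3: 1}
total = sum(counts.values())

# entropy = -sum(p(i) * log(p(i)))
entropy = -sum((c / total) * math.log10(c / total) for c in counts.values())
print(round(entropy, 2))  # 0.45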

Now, say we split the data into two; we can then calculate the entropy of each split.

Information Gain = Entropy(Total_data) - sum([weighted average] * entropy(split_data))

The weighted average is the ratio of the number of samples in split_data to the number of samples in Total_data.

For example, let node 1 have 9 samples with entropy 0.66. After applying a condition, the left child has 4 samples with entropy 0.45 and the right child has 5 samples with entropy 0.29. We can calculate the information gain as,

Information Gain = 0.66 - [0.2 + 0.16] = 0.3
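The same arithmetic as a sketch:

# Worked example above: parent node of 9 samples with entropy 0.66,
# children of 4 and 5 samples with entropies 0.45 and 0.29
parent_entropy = 0.66
children = [(4, 0.45), (5, 0.29)]  # (number of samples, entropy)
n_total = sum(n for n, _ in children)

weighted_child_entropy = sum((n / n_total) * e for n, e in children)
information_gain = parent_entropy - weighted_child_entropy
print(round(weighted_child_entropy, 2), round(information_gain, 2))  # 0.36 0.3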

Coding

Having covered the theory necessary to understand the concept, we are going to use the decision tree classifier from scikit-learn for the classification. The analysis of the data is itself an important topic to be covered in a separate post, so here we are doing only basic pre-processing on the data.

Pre-processing steps

1) Check for empty cells or NaN values and remove them.

2) Normalize the data; both z-score standardization and min-max scaling (using the difference between the maximum and minimum values) were tried, and both give similar results.

In the given code, these particular operations are specific to the breast cancer dataset.

The raw data, after splitting into training and test sets, is pre-processed using the min-max method. The results of the decision tree based on both the Gini and entropy measures are given below.
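The full code accompanying the post is not reproduced here, so the following is a minimal sketch of an equivalent pipeline. It uses scikit-learn's built-in copy of the Wisconsin breast cancer data as a stand-in for the Kaggle CSV (which would be loaded with pandas instead), fits the min-max scaler on the training split only, and trains a depth-6 tree with each criterion. The random_state values are illustrative, and the exact accuracies depend on the split.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Stand-in for the Kaggle CSV: the same breast cancer data ships with scikit-learn
X, y = load_breast_cancer(return_X_y=True)

# Split first, then scale, so the test data does not leak into the scaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Train one tree per split criterion at the depth used in the post
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=6, random_state=42)
    tree.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, tree.predict(X_test)) * 100
    print(f"Accuracy using {criterion}: {accuracy:.2f}")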

Results

Accuracy using gini: 89.51 %

Accuracy using entropy: 94.41 %

In this case, the entropy measure outperforms the Gini measure for the specified tree depth of 6.

References

5 comments:

1. A great article. Which method is more reliable for estimating the goodness of the split, the Gini measure or entropy?

2. Both can be used interchangeably. The only difference is that entropy is computationally more intensive than Gini (because of the log).

3. I didn't fully understand.....but so proud of you :) !!!!

