When we talk about decision trees, two terms often come up: "Entropy" and "Information gain". Most machine learning aspirants and engineers have at least some familiarity with them. The main question is "why are they important?" If we can answer that one question, the rest is a cakewalk. Let's begin with a simple decision tree.
Decision Tree
A simple decision tree has one root and two nodes. The biological trees we see outside have their roots in the ground and grow upwards, against gravity; a decision tree is the opposite, with its root at the top and the tree growing downwards. When we have to say whether an animal is a cat or a dog, we can simply look at it and say it, but if a machine needs to make the same decision, we need to train it with certain features, like pointed ears, a pointed nose, etc. Depending on the features we provide and the training we give, the machine will be able to classify the animal.
The above figure shows a simple decision tree where we are trying to classify whether an animal is a dog or a cat. So far so good; here comes the confusion: which feature should be used for the main split, i.e. the root split? Suppose we have an animal dataset with several features, such as sound, size, weight, height, the food the animal eats, where it lives, and so on. If we pick the feature "food" as the root split, can we deliver the best results? Well, it depends: if you have a cat and feed it only vegetarian food, it may eventually start avoiding non-veg, so the answer depends on the food we provide, and the same goes for other animals. So taking food as the root split would not be the best idea. Then how do we decide? Here comes "Entropy", which helps us choose the best split. Let's not just display a formula, substitute the values, and declare the best split; if that were enough there would be no need for this article. Let's first understand what entropy is, and then slowly work through the underlying mathematics.
Entropy
Entropy measures how surprising the outcome of an event is, on average, per trial. Let's take the classic example of flipping a coin. Suppose we toss a coin 10 times and get 8 heads in a row, a tail on the 9th toss, and a head again on the 10th. Here the 9th flip is the surprising one. If we calculate the probabilities,
The probability of heads in 10 tosses = 9/10
The probability of tails in 10 tosses = 1/10
To estimate the surprise of an outcome we take the inverse of its probability (because something that rarely happens is more surprising than something that happens often). So,
Surprise(H) = 1/(9/10)
Surprise(T) = 1/(1/10)
There is a problem with using just the inverse: if an event is certain (probability 1), it carries no surprise at all, yet 1/1 = 1, not 0. To fix this we take the logarithm of the inverse, so a certain event gives log(1) = 0 surprise.
Surprise(H) = log(1/(9/10)) = log 1 - log(9/10) = log 1 - (log 9 - log 10) ≈ 0.05
Surprise(T) = log(1/(1/10)) = log 1 - log 1 + log 10 = 1
(The logarithms here are taken to base 10.)
Now the entropy is the average surprise: Entropy = Surprise(H) × P(H) + Surprise(T) × P(T)
E = 0.05 × 9/10 + 1 × 1/10 = 0.045 + 0.1 = 0.145, which is the entropy.
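That arithmetic is easy to reproduce in a few lines of Python. The sketch below (the helper names surprise and entropy are mine, and the coin probabilities come from the example above) uses base-10 logarithms to match the numbers in the text.

```python
import math

def surprise(p, base=10):
    """Surprise of an outcome with probability p: log(1/p)."""
    return math.log(1 / p, base)

def entropy(probs, base=10):
    """Average surprise: sum of p * log(1/p) over all outcomes."""
    return sum(p * surprise(p, base) for p in probs)

p_heads, p_tails = 9 / 10, 1 / 10
print(round(surprise(p_heads), 2))            # ~0.05
print(round(surprise(p_tails), 2))            # 1.0
print(round(entropy([p_heads, p_tails]), 3))  # ~0.141 (the 0.145 above comes from rounding 0.046 up to 0.05)
```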
The final step brings us to the formula for entropy that we find everywhere,
E = -(P(+ve) log(P(+ve)) + P(-ve) log(P(-ve)))
The negative sign appears because the surprise is log(1/p), and log(1/p) = log 1 - log p = -log p; pulling that minus sign out of the sum gives the leading negative in the formula.
Entropy is used to check the purity at each node and decide whether a split is pure or impure. A pure split means the resulting node contains only "Yes" or only "No"; an impure split is the opposite, for example a 50–50 mix of "Yes" and "No".
Consider the above decision tree (for ease of computation I have used a very basic DT). The root has 10 yes and 5 no, and after the split we can see the results. If we now calculate the entropy of the left node using the above formula,
E = -(P(+ve) log(P(+ve)) + P(-ve) log(P(-ve))), here +ve is yes and -ve is no
E = -6/8 log(6/8) - 2/8 log(2/8)
Here we use 8 in the denominators because the left node has 8 samples in total. Computing this gives ≈ 0.24 (with base-10 logs). The process continues until we reach a pure split, and that pure node is called a leaf. A leaf node contains either only "Yes" or only "No". For a pure split the entropy is 0, and for a completely impure split (e.g. 2 yes / 2 no) the entropy is 1 when the logarithm is taken to base 2.
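A small sketch makes the pure/impure behaviour concrete. The function below (the name node_entropy is my own) computes the entropy of a node from its yes/no counts; base 10 reproduces the 0.24 above, while base 2 gives 0 for a pure node and 1 for a perfect 50–50 mix.

```python
import math

def node_entropy(n_yes, n_no, base=2):
    """Entropy of a node from its class counts: -sum(p * log(p))."""
    total = n_yes + n_no
    entropy = 0.0
    for count in (n_yes, n_no):
        if count == 0:          # a pure node contributes no surprise
            continue
        p = count / total
        entropy -= p * math.log(p, base)
    return entropy

print(round(node_entropy(6, 2, base=10), 2))  # 0.24, the left node above
print(node_entropy(4, 0))                     # 0.0, pure split
print(node_entropy(2, 2))                     # 1.0, completely impure split (base 2)
```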
Now we have understood what entropy is, so part of the question is answered. The main part, however, is deciding which feature to split on, and for that we use Information Gain.
Information gain
To decide which feature should be given priority for a split we use Information Gain. It measures how much the entropy drops after a split: we take the entropy of the parent node and subtract the weighted average entropy of its children. The feature whose split produces the highest information gain is chosen as the best split.
The formula for calculating IG is: IG = E(S) - Σ (|Sv| / |S|) × E(Sv). Here E(S) is the entropy of the node being split, E(Sv) is the entropy of a child node after the split, |Sv| is the number of samples in that child, and |S| is the total number of samples at the node.
If we consider the same example again, |Sv|/|S| for the left child of the root split is 8/15, since that child holds 6 yes and 2 no (8 samples) out of 15 in total. Sv is simply the set of samples that land in a child after the split.
If we substitute the values in the formula,
IG for root node = E(root) - 8/15 × E(left split) - 7/15 × E(right split), where E(root) = -10/15 log(10/15) - 5/15 log(5/15) ≈ 0.28, E(left split) ≈ 0.24 as computed above, and the right split holds the remaining 4 yes and 3 no, giving E(right split) ≈ 0.30. Putting these together (base-10 logs throughout) gives an IG of roughly 0.008.
This IG is calculated for the DT with some feature x as the root node; then the IG is calculated again with some feature y as the root node, the two values are compared, and whichever split has the highest IG is the structure used to build the DT.
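Here is a minimal Python sketch of that calculation (the helper names are mine; the counts come from the example above). It subtracts the weighted child entropies from the parent entropy, reproducing the ≈ 0.008 figure.

```python
import math

def node_entropy(n_yes, n_no, base=10):
    """Entropy of a node from its class counts (base 10 to match the text)."""
    total = n_yes + n_no
    return -sum((c / total) * math.log(c / total, base)
                for c in (n_yes, n_no) if c > 0)

def information_gain(parent, children, base=10):
    """IG = E(parent) - sum(|Sv|/|S| * E(Sv)) over the child nodes."""
    total = sum(parent)
    ig = node_entropy(*parent, base=base)
    for child in children:
        ig -= (sum(child) / total) * node_entropy(*child, base=base)
    return ig

# Root: 10 yes / 5 no, splitting into (6 yes, 2 no) and (4 yes, 3 no)
print(round(information_gain((10, 5), [(6, 2), (4, 3)]), 3))  # ~0.008
```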
We have now understood why we use Entropy and Information Gain in a DT, but this is not the end of the story yet. There is one more approach to deciding on the best split: "GINI Impurity". Many will have heard this term even more often than Entropy, because ML engineers who solve a problem with a DT usually prefer GINI Impurity over Entropy. But why?
GINI Impurity
GINI Impurity measures how likely a randomly chosen sample from a node would be misclassified if it were labelled at random according to the class distribution in that node.
The general formula is GI = 1 - Σ P(i)², the sum running over the classes; for our two-class case we can rewrite it as
GI = 1 - [P(+ve)² + P(-ve)²]
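To make the formula concrete, here is a small Python sketch (the function name is my own, not from the article) that computes the GINI impurity of a node from its yes/no counts; a pure node gives 0 and a perfect 50–50 node gives 0.5.

```python
def gini_impurity(n_yes, n_no):
    """GI = 1 - [P(+)^2 + P(-)^2] for a two-class node."""
    total = n_yes + n_no
    p_yes, p_no = n_yes / total, n_no / total
    return 1 - (p_yes ** 2 + p_no ** 2)

print(round(gini_impurity(6, 2), 3))  # 0.375 for the left node above
print(gini_impurity(4, 0))            # 0.0, pure split
print(gini_impurity(2, 2))            # 0.5, completely impure split
```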
But why do we prefer GINI impurity over Entropy when both serve the same purpose? Consider the above diagram, which plots entropy and GINI impurity against the probability of the positive class. Look at the Entropy formula and how much computation it involves: for every split we need the probability of each class and the logarithm of that probability, which costs extra time and memory. With GINI impurity we only square the probabilities, so the computation is cheaper. The ranges also differ: entropy (with base-2 logs) runs from 0 to 1, while GINI impurity runs from 0 to 0.5, so for a completely impure split the GINI value is 0.5.
We can still use entropy instead of GINI impurity; it depends on the problem we are trying to answer and how deep the DT is going to be.
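In practice, switching between the two criteria is a one-parameter change in most libraries. Below is a minimal sketch using scikit-learn (assuming it is installed; the toy dataset is made up purely for illustration) that trains the same small tree once with GINI and once with entropy.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy two-class dataset, purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=42)
    tree.fit(X, y)
    print(criterion, round(tree.score(X, y), 3))
```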
Conclusion
These formulas look easy when we just read them, but if we don't work through them on our own we will keep getting confused by the terms. Once we understand the underlying math, it becomes much easier to find a solution. Decision trees are widely used algorithms, but we need to be careful while building them because they are prone to overfitting. That is why a random forest is often used instead of a single DT.