Everyone learning or working with machine learning algorithms eventually comes across the term Boosting. Some struggle to understand the concept, while others wonder: why do we need it, and what happens if we ignore it, since there are so many other options out there? But honestly, every technique used in the field of ML has a unique ideology behind it, and using it appropriately will ideally improve performance. So let's decode what boosting is and also look at one of the popular boosting methodologies, AdaBoost.
Boosting
People using boosting (AdaBoost, gradient boosting) often think that these are machine learning algorithms, but they are not. Boosting is a methodology applied on top of an already existing machine learning algorithm, most commonly decision trees, because they produce the best results (we can also use it on other ML models). For this reason it is also referred to as meta-learning.
F(x) = Σt αt * ht(x)
Here, ht(x) = stump (weak learner) built at round t
αt ∈ R, the weight (performance) assigned to that stump
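To make the formula concrete, here is a minimal sketch (in NumPy, with made-up stump outputs and α values) of how the stumps' outputs are combined into the final prediction:

```python
import numpy as np

# Hypothetical outputs of three stumps for one sample, encoded as -1 / +1
stump_outputs = np.array([+1, -1, +1])   # h_1(x), h_2(x), h_3(x)
alphas = np.array([0.9, 0.4, 0.7])       # hypothetical performance of each stump

# F(x) = sum over t of alpha_t * h_t(x); the sign of F(x) gives the final class
F_x = np.sum(alphas * stump_outputs)     # 0.9 - 0.4 + 0.7 = 1.2
print(np.sign(F_x))                      # 1.0 -> the ensemble predicts class +1
```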
Stump
A stump is the smallest possible decision tree: it has only one node (a single split) and two leaves.
On its own, a stump is a weak learner that is too simple to perform well. The boosting methodology uses a series of stumps (weak learners) that individually produce a result, and adds them together to create a strong model.
Imagine a weak, powerless person fighting a powerful man. Who wins? The answer is simple: the stronger man. But if a group of weak people fights against the powerful man together, he will most likely lose. This is exactly what happens in boosting: it builds many weak learners and ensembles them to form a strong model that can produce efficient results.
AdaBoost
AdaBoost is a boosting methodology that uses several stumps in series, where each stump learns adaptively based on the findings of the previous stump; this is why it is called adaptive boosting.
How does adaptive boosting work?
We will divide the process of adaptive boosting into steps and discuss each one; a short end-to-end sketch follows the list.
- Sample the data (x1, x2, x3, …, xn, these are the data points) and set a weight for each data point
- Create a stump
- Calculate the total error
- Update the weights
- Create a new dataset
- Repeat the process from step one
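Before walking through the steps by hand, here is a rough end-to-end sketch of the same idea with scikit-learn, purely for orientation (depending on your scikit-learn version, the stump is passed as estimator in newer releases or base_estimator in older ones):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=42)   # toy dataset

stump = DecisionTreeClassifier(max_depth=1)   # a stump: one split, two leaves
model = AdaBoostClassifier(estimator=stump, n_estimators=50, random_state=42)
model.fit(X, y)
print(model.score(X, y))                      # training accuracy of the ensemble
```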
Step 1: Sampling and adding initial weights
Consider the above dataset, which has 10 data points; initially we add a weight to each data row.
Initial weight = 1/n
Where n is the number of rows; as there are 10 rows, each initial weight is 1/10.
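A tiny sketch of this initialization (assuming NumPy):

```python
import numpy as np

n = 10                          # number of rows in the dataset
weights = np.full(n, 1 / n)     # every row starts with weight 1/10 = 0.1
print(weights.sum())            # 1.0 (the weights sum to one)
```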
Why are we adding weights to the data points?
While solving any classification problem, the common issue we encounter is misclassification, so we try to solve it by adding coefficients (weights) that might reduce the error and help classify the points correctly.
The same idea is used here; remember, boosting is a methodology, not an algorithm in itself.
Step 2: Construct a stump (weak learner)
AdaBoost uses entropy to choose the best stump. Once the first stump is chosen, it predicts an output, say y1:
f(x) → y1 (output)
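As a sketch, the stump can be built with scikit-learn's DecisionTreeClassifier restricted to depth 1; the 10-row toy dataset below is made up for illustration and chosen so that exactly one row ends up misclassified, matching this walk-through:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up 10-row dataset: one feature, binary labels
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
weights = np.full(10, 1 / 10)                 # initial weights from Step 1

# max_depth=1 -> a stump (one node, two leaves); entropy drives the split choice
stump = DecisionTreeClassifier(max_depth=1, criterion="entropy")
stump.fit(X, y, sample_weight=weights)

y1 = stump.predict(X)                         # the stump's output y1
print((y1 != y).sum())                        # 1 -> exactly one misclassified row
```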
Step 3: Calculate the total error
Let's say there is one misclassification; we now need to find the total error caused by the misclassifications.
The total error is the sum of the sample weights of the misclassified records. In our case there is only one misclassified record, so Total Error (TE) = 1/10.
After calculating the TE, we need to find the performance of the stump, which is alpha (α):
Performance of stump (α) = 1/2 * ln((1 - TE) / TE)
α = 1/2 * ln((1 - 1/10) / (1/10)) = 1/2 * ln(9)
α ≈ 1.09
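Continuing the sketch above, the total error and α can be computed directly from the sample weights; the exact value is 1/2 * ln(9) ≈ 1.0986, which is rounded to 1.09 here:

```python
import numpy as np

weights = np.full(10, 1 / 10)
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])   # toy labels from Step 2
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # stump output; the 4th row is wrong

total_error = weights[y_pred != y_true].sum()        # TE = 0.1
alpha = 0.5 * np.log((1 - total_error) / total_error)
print(total_error, round(alpha, 4))                  # 0.1 1.0986
```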
Step 4: Update the weights
We need to update the weights separately for the incorrectly and the correctly classified points.
For an incorrect classification,
New sample weight = initial sample weight * e^(α)
New sample weight = 1/10 * e^(1.09) = 1/10 * 2.97
New sample weight ≈ 0.297
For a correct classification,
New sample weight = initial sample weight * e^(-α)
New sample weight = 1/10 * e^(-1.09) = 1/10 * 0.336
New sample weight ≈ 0.0336
[Note: the sum of all sample weights should always equal one.]
The new weights no longer sum to one, so we normalize them: add up all the new weights and divide each new sample weight by that total.
Here the total is 0.297 + 9 * 0.0336 = 0.5994. Dividing a correctly classified row's weight, 0.0336, by 0.5994 gives ≈ 0.056, which is what we see in the table; the misclassified row becomes 0.297 / 0.5994 ≈ 0.495. Continue the process for each weight, and the new weights once again sum to 1.
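A small sketch of the update and normalization, reproducing the numbers above (α is the rounded 1.09 from Step 3, and the 4th row is the misclassified one):

```python
import numpy as np

alpha = 1.09
weights = np.full(10, 1 / 10)
misclassified = np.zeros(10, dtype=bool)
misclassified[3] = True                       # the 4th row was misclassified

# Increase the weight of the wrong row, decrease the weights of the correct rows
new_w = np.where(misclassified,
                 weights * np.exp(alpha),     # 0.1 * 2.97 ≈ 0.297
                 weights * np.exp(-alpha))    # 0.1 * 0.336 ≈ 0.0336

new_w = new_w / new_w.sum()                   # normalize so the weights sum to 1
print(new_w.round(3))                         # ≈ 0.496 on the 4th row, ≈ 0.056 elsewhere
print(round(new_w.sum(), 6))                  # 1.0
```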
Step 5: Creating a new dataset
We need to create a new dataset based on the latest weights; for this, we use bucketing.
The process of bucketing is simple: we assign each row a range. The first row's range runs from 0 up to its normalized weight, the next row's range starts where the previous one ended and spans that row's weight, and so on until the last row, as shown below.
In the new dataset, the misclassified records will appear more often. The algorithm now runs for 10 iterations (as there are 10 rows), and in each iteration it draws a random number between 0 and 1 and selects the row whose range contains that number. For instance, if the first draw is 0.321, it falls into the 4th row's range (the misclassified row), so that row is copied into the new dataset. This continues for 10 iterations; because the misclassified row has a larger weight and therefore a wider range, the majority of the rows in the new dataset will be copies of it.
This new dataset becomes the input for the next stump; since the previously misclassified points now appear more often, there is a higher chance that the next stump will classify them correctly.
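Here is a sketch of the bucketing step in NumPy: the cumulative sum of the normalized weights gives each row's range, and every random draw in [0, 1) selects the row whose range it falls into. This is equivalent to weighted sampling with replacement (for example, np.random.choice with p=weights).

```python
import numpy as np

rng = np.random.default_rng(0)

# Normalized weights from Step 4 (the misclassified 4th row carries about half the mass)
weights = np.array([0.056, 0.056, 0.056, 0.496, 0.056,
                    0.056, 0.056, 0.056, 0.056, 0.056])
weights = weights / weights.sum()      # guard against rounding drift

buckets = np.cumsum(weights)           # upper edge of each row's range
draws = rng.random(10)                 # one random number per iteration
chosen_rows = np.searchsorted(buckets, draws)

print(chosen_rows)   # rows with wider ranges (index 3 here) are expected to appear most often
```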
Step 6: Repeating the process
These five steps are repeated for each new stump until the data points are correctly classified (or until a chosen number of stumps has been built). In other words, AdaBoost builds a number of weak learners in series and adds them up to form a strong model that produces better results; a compact sketch of the whole training loop follows.
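Putting the six steps together, here is a compact from-scratch sketch (assuming NumPy and scikit-learn, with labels encoded as -1/+1). For simplicity it passes the updated weights straight to the next stump via sample_weight, a common equivalent of physically building the resampled dataset from Step 5; the function names are only illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, n_stumps=10):
    """Train a series of stumps on (X, y); y must be encoded as -1 / +1."""
    n = len(y)
    w = np.full(n, 1 / n)                          # Step 1: initial weights
    stumps, alphas = [], []
    for _ in range(n_stumps):
        stump = DecisionTreeClassifier(max_depth=1, criterion="entropy")
        stump.fit(X, y, sample_weight=w)           # Step 2: build a stump
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)   # Step 3: total error
        alpha = 0.5 * np.log((1 - err) / err)      # performance of the stump
        w = w * np.exp(-alpha * y * pred)          # Step 4: up-weight the mistakes
        w = w / w.sum()                            # normalize, then loop (Steps 5 and 6)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # F(x) = sum over t of alpha_t * h_t(x); the sign is the final class
    F = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(F)
```

For example, mapping the toy labels from Step 2 to -1/+1 (y = 2 * y - 1) and calling adaboost_train(X, y) builds the whole series in one go.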
AdaBoost on Testing data
Adaptive boosting implements these steps during training, but what about testing? How does AdaBoost work on test data?
The process is similar to the voting used in bagging:
- The test data point is passed to the first stump
- The stump processes the data and produces an output
- The same is repeated for every stump in the series
Suppose the first stump's output is 1; the next stump follows the same process and also outputs 1, as do the third and fourth stumps, while the fifth and sixth output 0, and so on. The algorithm then combines these outputs by a vote: bagging uses a simple majority, while AdaBoost weights each stump's vote by its performance α (as in F(x) = Σt αt * ht(x)). In this example more stumps, carrying more total weight, vote for 1, so the final result is 1.
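A small sketch of that vote with made-up numbers: the stump outputs match the scenario above, while the α values are purely hypothetical. AdaBoost weights each vote by α, whereas plain bagging would simply count the ones and zeros.

```python
import numpy as np

outputs = np.array([1, 1, 1, 1, 0, 0])              # hypothetical outputs of six stumps
alphas = np.array([1.09, 0.8, 0.6, 0.5, 0.4, 0.3])  # hypothetical stump performances

votes_for_one = alphas[outputs == 1].sum()           # 2.99
votes_for_zero = alphas[outputs == 0].sum()          # 0.70
print(1 if votes_for_one > votes_for_zero else 0)    # final result: 1
```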
Conclusion
Adaptive boosting is a very effective methodology that can be used with many ML models, but it works especially well on decision trees, where it yields good results. We need to remember that constructing too many stumps can result in overfitting, though that tends to happen only when the data is very large or contains many outliers. So before using boosting methodologies, we need to examine the data and check whether a boosting technique will actually yield better results, because adaptive boosting learns slowly and takes time to train the model; if we use boosting where it is not required, we would be unnecessarily increasing the time complexity.