Understanding Gradient Descent

Krishna
4 min read · Jan 12, 2022


Introduction

Often, while building a linear regression model, we come across the term ‘gradient descent’. People who are new to machine learning get confused by it, and some skip over the concept behind it. To understand how it works, one should be familiar with three concepts:

  1. The slope of a straight line
  2. The intercept of a straight line
  3. Cost function

Slope-intercept form

y = mx + b

This equation probably looks familiar: it is the equation of a line in slope-intercept form. Here, m is the slope and b is the intercept.
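
To make this concrete, here is a minimal Python sketch; the function name and sample values are made up for illustration:

# Evaluate y = m*x + b for a given slope m and intercept b.
def line(x, m, b):
    return m * x + b

print(line(2, 3, 1))  # slope 3, intercept 1: y = 3*2 + 1 = 7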

What is the slope?

The slope of a line is a value that describes the direction and steepness of the line.

Why do we need to calculate slope?

Consider an example: a friend calls and invites you to a party, but you don’t know exactly where the place is. Your next question would naturally be, ‘Hey, can you please help me with directions?’ In the same way, to know the direction a line is moving, we find its slope.

How to Calculate a slope?

It is the change in y divided by the change in x. Given two points (x1, y1) and (x2, y2) on the line:

m = (y2 − y1) / (x2 − x1)
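
As a quick sketch in Python (the two points here are made-up values):

# Slope between two points (x1, y1) and (x2, y2):
# the change in y divided by the change in x.
def slope(x1, y1, x2, y2):
    return (y2 - y1) / (x2 - x1)

print(slope(0, 1, 2, 5))  # (5 - 1) / (2 - 0) = 2.0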

What is the y-intercept of a line (b)?

It is the point where the line crosses the y-axis.

(Note: there is both a y-intercept and an x-intercept; here we are only talking about the y-intercept.)

Cost function

Linear regression is a machine learning algorithm built to predict a target by training on a set of features. For instance, suppose you were asked to build a model that predicts house rents in your city, so that someone who wants to relocate can easily find a house that fits their needs. You decide to build a linear regression model to predict the rents, but how would you know whether the predictions are correct? For this purpose, you use an error metric to measure the difference between the actual rents and the predicted rents. This error metric is what we call a cost function.

In linear regression, the mean squared error (MSE) is commonly used to find the cost.

MSE = (1/n) * Σ (aᵢ − pᵢ)²

n is the number of values.
aᵢ is the actual rent price.
pᵢ is the rent price the model predicted.
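
Here is a minimal Python sketch of the MSE computation, using made-up rent values:

# Mean squared error between actual and predicted values.
def mse(actual, predicted):
    n = len(actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n

actual = [10000, 12000, 15000]     # actual rents (illustrative numbers)
predicted = [11000, 11500, 14000]  # the model's predictions
print(mse(actual, predicted))      # 750000.0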

If the error is small, the model is good; if it is too large, we need to adjust the model’s parameters (m and b) to reduce it. For this, we use gradient descent.

Gradient descent

Imagine walking down a steep road on a foggy day. Because of the fog, you cannot see how steep the road is or how far you need to walk, but to reach your destination you must go downhill. If you try to run, you might break a leg, so the only safe option is to step down slowly. After your first step, you gauge the steepness and the direction, which is nothing but the slope, and based on this value you plan your next step. You continue this process until you reach the bottom. Now apply the same idea, replacing yourself with the cost. This is a simple way to picture gradient descent.

Mathematically, gradient descent is an optimization algorithm used to minimize the cost function. To find the minimum cost, we iteratively compute the partial derivatives of the cost with respect to ‘m’ and ‘b’ over the data points; together these form the gradient. The gradient tells us the slope of the cost function at the current position, and based on this value we update our parameters. The size of each update step is controlled by the learning rate. If the learning rate is too high, the model can overshoot and never settle on an optimal value, whereas a learning rate that is too small can cause the process to get stuck or take very long to converge.

Consider the cost function over all the data points:

MSE (cost function)

f(m, b) = (1/N) * Σ (yᵢ − (m·xᵢ + b))²,  (where i = 1 to N, m is the slope and b is the intercept)

Partial derivative with respect to m

df/dm = (1/N) * Σ −2xᵢ(yᵢ − (m·xᵢ + b))

Partial derivative with respect to b

df/db = (1/N) * Σ −2(yᵢ − (m·xᵢ + b))

New m = m − (learning rate) * df/dm

New b = b − (learning rate) * df/db

The values of m and b are recalculated at each step and updated to the new m and b; this process continues until the algorithm finds the minimum cost.
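
Putting the update rules together, here is a minimal gradient descent sketch in plain Python. The data, learning rate, and epoch count are illustrative choices, not fixed values:

# Fit y = m*x + b by gradient descent on the MSE cost.
def gradient_descent(xs, ys, lr=0.01, epochs=5000):
    m, b = 0.0, 0.0  # start from an arbitrary guess
    n = len(xs)
    for _ in range(epochs):
        # Partial derivatives of the MSE with respect to m and b.
        dm = sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
        db = sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
        # Step against the gradient, scaled by the learning rate.
        m -= lr * dm
        b -= lr * db
    return m, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # points on the line y = 2x + 1
m, b = gradient_descent(xs, ys)
print(m, b)  # approaches m = 2, b = 1

Try changing lr to a large value such as 0.5: the updates overshoot and the values blow up, which is exactly the too-high learning rate problem described above.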

Conclusion

Gradient descent is one of the most efficient algorithms for reducing the cost function. One should be careful when choosing the learning rate: it must be neither too large nor too small for the algorithm to find the minimum cost.

