Activation Functions

Krishna
11 min read · Oct 29, 2022

We start building a neural network because we have a problem that needs to be solved, but we already have many machine learning algorithms that can be used for the same purpose, so why build a neural network at all? One of the main reasons we use NNs is to mimic the human brain. A human brain can only be replaced by another human brain, so if we train a network that thinks, acts, and analyses a problem like a human brain does, we can reduce human effort by letting these networks take over. This can increase efficiency and reduce the time needed to solve a problem. Before building a NN, we need to understand how a human neuron works; later, a similar implementation can be made using an artificially created neuron.

Neuron

The above figure shows an actual neuron in the human brain. The job of this neuron is to receive some data and pass on the processed data. To make this clear, imagine an antenna that receives a signal, decodes it, and transmits it; the same process happens inside a neuron. But how do neurons receive data and transmit it? For this purpose each neuron has two elements, dendrites and an axon: dendrites act as receivers and the axon as the transmitter. Why are they needed? Imagine a group of people working together on a construction site, each from a different country and speaking a different language. How do they communicate? They need some common language that helps them exchange their ideas, and this is what the dendrites and axon provide. Each neuron works on a piece of data, and once it completes its job, it needs to communicate with other neurons and pass the data along; these two elements handle that communication between neurons.

An artificial neuron should work in the same way.

So, a similar model is followed while building a neural network. A single neuron is called a node; it receives inputs, processes them, and sends an output. The above figure shows a node with five inputs and one output. The inputs are independent variables, and they belong to one particular instance; you can think of this as a row from the dataset (a single row represents the data of a particular person, animal, or thing). Each input has a weight attached to it; these weights help the neural network learn by letting the neuron decide which data should be passed on and which should be held back. We will discuss why weights are added to the inputs in detail with examples further in the article. Inside the neuron, all the inputs multiplied by their respective weights are added together (sum(input * weight)), and this sum is then sent into an activation function. (There are many types of activation functions, but for now we will only give an overview of what an activation function does and how it works.)

Now that we have an understanding of neurons and nodes and how they work, let us talk about activation functions. Before we discuss activation functions further, let us understand what a function does.

Function

In the above image, we provide an input x and produce an output y. What is happening inside the function? We are finding the relation between x and y, where x is an independent variable, which means it does not depend on anything, and y is a dependent variable, which means it needs x to determine its value. Consider the rent of a flat: the rent is based on the area of the flat, the location of the flat, and the facilities available around it, so the rent depends on all these factors, but not vice versa.

Rent = y and [Area, Location, Facilities] = x

f(x) = y ===> f([Area, Location, Facilities]) = y

So, this is basically what a function does. Now let us understand why an activation function is used and what importance it has in producing the output, but before that let us briefly discuss nodes, weights, and bias.

Node and weights

A node in a neural network layer is a single neuron; it can take multiple inputs and then generates an output. Each input has a weight attached to it (initially this weight is a random value). The reason behind adding weights is simple; let me explain with an example.

Consider house data, where we want to predict the price of a house from features related to the house, such as area, number of bedrooms, location, locality, quality, etc. When we provide these features as input to a neuron, how does it know which feature is important? As humans we can interpret them depending on our requirements, but how would an automated machine understand? This is where weights come in: each input has a weight, and the weight decides how important that feature is. If a feature is very important, for example the area of the house, a higher weight is assigned to it so that the neuron understands that, yes, this needs to be processed. If some features, such as a swimming pool or a garden, are not an essential part of a house (it depends on the customer), a smaller weight is assigned to them. This is how weights help the neural network weigh the features and predict the output.

Y = x1*w1 + x2*w2 + … + xn*wn

Now, a doubt might arise: what if the weights are zero? Then adding the weights doesn't make any sense, so how do we resolve this?

So here comes another interesting attribute: bias. We add a bias to the weighted input to make sure the result is not stuck at zero; the weights and the bias can be positive or negative, depending on which way we want to adjust the output.

Y = (x1*w1 + b1) + (x2*w2 + b2) + … + (xn*wn + bn)

Here the bias can be understood as a threshold: the x*w value should surpass the value of b for the neuron to produce a meaningful signal, and the weights (w) are adjusted accordingly during training.

This is what happens inside a node. Now, this output is passed into an activation function, but why?
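To make this concrete, here is a minimal sketch of the computation inside a node, assuming Python with NumPy; the feature values, weights, and bias below are made-up numbers, and a single bias term is used (which amounts to summing the per-input biases in the formula above).

```python
import numpy as np

# One instance (think of it as one row of a housing dataset):
# area, bedrooms, location score, swimming pool, garden
x = np.array([120.0, 3.0, 1.0, 0.0, 2.0])

# One weight per input; at the start of training these are random values
w = np.array([0.8, 0.5, 0.6, 0.1, 0.2])

b = -40.0  # bias term

# Weighted sum plus bias: Y = x1*w1 + x2*w2 + ... + xn*wn + b
y = np.dot(x, w) + b
print(y)  # this raw value is what gets passed into an activation function
```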

What is this activation function?

We already discussed how a function works, and an activation function serves the same purpose: we have an independent variable, a weight attached to it, and a bias term, and we need to find the relation between these terms and the output or target y. In addition, we also need to set a limit on the output; if there is no limit, it becomes hard to interpret the results. To solve this, we make sure the results fall within a certain range, which can be read as probabilities, especially for classification problems, and this is achieved through an activation function. The output produced by a neuron is passed into an activation function, and depending on what type of activation function we are using, it produces an output.

There are many types of activation functions, and each function has its importance; depending on the type of problem we are trying to solve, we use an appropriate activation function.

Linear Function

f(x) = b*x, where b is a constant (don't confuse this b with the bias term)

This is a simple activation function, which produces a linear output. If we have an input of, say, 4 and a constant of 2,

f(4) = 2*4 ==> 8,

and if we have a constant of 2 and an input of -2,

f(-2) = 2*-2 = -4.

So, here the range is (-∞, ∞). This function cannot be used for solving a complex problem, because it will not learn the underlying pattern; it simply multiplies the input by the constant and produces an output, which can lead to bad performance. Imagine asking a 2nd-grade student for help with an engineering maths problem: almost all they can do is addition, subtraction, and multiplication, without understanding the purpose and intention of the problem.
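A quick sketch of the linear activation with the constant b = 2 used in the examples above (again, not to be confused with the bias term):

```python
def linear(x, b=2.0):
    # Linear activation: f(x) = b*x, where b is a constant
    return b * x

print(linear(4))   # 2*4    -> 8
print(linear(-2))  # 2*(-2) -> -4
```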

Sigmoid Function

This is one of the most popular activation functions, used when we need to introduce non-linearity. Its output ranges between (0, 1).

Sigmoid function ==> f(x) = 1/(1 + e^(-x))

As the value ranges between 0 and 1, the mid-value 0.5 is considered the threshold for classification problems. For example, if we are trying to classify male and female, if the output of the sigmoid function is more than 0.5 then it is a male and if it is less than 0.5 it is a female. We can understand the sigmoid function’s output as the probability of a particular case being true.
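A minimal sketch of the sigmoid and the 0.5 threshold described above, assuming NumPy; the raw score fed into it is a made-up value:

```python
import numpy as np

def sigmoid(x):
    # Sigmoid activation: f(x) = 1 / (1 + e^(-x)), output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

score = 1.3             # raw output of a neuron (made-up value)
p = sigmoid(score)      # squashed into (0, 1), read as a probability
print(p)                # ~0.786
print("male" if p > 0.5 else "female")
```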

Tanh

The tanh function is similar to the sigmoid function; the main difference is that its values range between -1 and 1, which means it is symmetric around zero. It is also called the hyperbolic tangent function.

Tanh ==> f(x) = 2/(1 + e^(-2x)) - 1 = (e^x - e^(-x))/(e^x + e^(-x))

There is an advantage to using the tanh function instead of the sigmoid function. During the backpropagation stage, we need to take the partial derivatives of the cost function (for gradient descent) and also the partial derivative of the activation function. If we have many hidden layers in the network, then during backpropagation, by the time we reach the first layer, the derivative of the activation function becomes very small, which affects the ability of the network to adjust its weights. With the tanh function the derivatives are larger (its values range between -1 and 1, and its maximum derivative is 1 compared to 0.25 for the sigmoid), so the gradients shrink more slowly.

The above image shows a comparison of the derivative values of sigmoid and tanh. One more advantage of using tanh over sigmoid is that its outputs are centred closer to zero, so using the tanh function effectively normalizes the data before sending it into the next layer. But the sigmoid and tanh functions share a disadvantage: if the input x to f(x) is very high or very low, the partial derivative becomes almost zero, so by the time we reach the first layer during backpropagation the value becomes negligible. This is called the vanishing gradient problem. To deal with it, we use ReLU.
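To see the effect described above, here is a small sketch, assuming NumPy, that evaluates the derivatives of sigmoid and tanh at a few sample inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # peaks at 0.25 when x = 0

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2    # peaks at 1.0 when x = 0

for x in [0.0, 1.0, 3.0]:
    print(x, round(d_sigmoid(x), 4), round(d_tanh(x), 4))
# Both derivatives shrink towards zero as |x| grows (vanishing gradient);
# near zero, tanh's derivative is much larger than sigmoid's (1.0 vs 0.25).
```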

ReLU

The most popular and commonly used activation function is the rectified linear unit. The main advantage of using the ReLU function over the ones above is that it does not activate all the neurons at the same time. If the input to the ReLU function is less than zero, the output is set to zero and that particular neuron is deactivated; if the input is greater than zero, the actual value is given out as the output (for example, if the value is 0.5, then 0.5 is sent out as the output) and the neuron is activated.

f(x) = x, if x > 0
f(x) = 0, if x <= 0

Because of this property of ReLU, the vanishing gradient problem is largely avoided: even when the value of x is high or low, by the time we reach the first layer during backpropagation the network can still see the change in the output and adjust the weights.
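A minimal sketch of ReLU, assuming NumPy:

```python
import numpy as np

def relu(x):
    # ReLU: f(x) = x for x > 0, and 0 otherwise
    return np.maximum(0.0, x)

print(relu(np.array([-3.0, -0.5, 0.0, 0.5, 4.0])))
# -> [0.  0.  0.  0.5 4. ]  negative inputs are zeroed, so those neurons stay deactivated
```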

There are two other variants of the ReLU function: the Leaky ReLU and the exponential linear unit (ELU).

Leaky ReLU

This is not so commonly used, because it depends on the problem we are solving. The ReLU function outputs exactly zero for all negative inputs, so the Leaky ReLU keeps a small slope on that side by multiplying x by a small constant, 0.01, when the value is less than 0.

f(x) = x, if x > 0
f(x) = 0.01*x, if x < 0

ELU

This is also not so commonly used, because again it depends on the problem we are solving. Instead of a fixed small slope, the exponential linear unit adds an exponential component to x when the value is less than 0.

f(x) = x, if x > 0
f(x) = a*(e^x - 1), if x < 0
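A short sketch of both variants, assuming NumPy; alpha = 0.01 is the Leaky ReLU constant mentioned above, and a = 1.0 is a common default choice for the ELU, not a value from the article:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: x for x > 0, alpha*x otherwise (keeps a small slope for negative inputs)
    return np.where(x > 0, x, alpha * x)

def elu(x, a=1.0):
    # ELU: x for x > 0, a*(e^x - 1) otherwise (smooth exponential curve for negative inputs)
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.5, 2.0])
print(leaky_relu(x))  # [-0.03 -0.01  0.5   2.  ]
print(elu(x))         # approx [-0.95 -0.632  0.5   2.  ]
```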

SoftMax

If we are solving a binary classification problem we can use the sigmoid, but what if we are solving a multi-class classification problem? In this case, the softmax activation function is useful. Consider the example of classifying Lions, Tigers, Leopards, and Cheetahs. We have four classes here, and we cannot use a sigmoid function because it can only give out the probability for two classes, so using softmax we can classify these animals.
If we get the output as [3, 0.9, 1.2, 2.1], we can clearly say that 3 is the highest value, so the image must be a Lion, but these output values can also be scaled to fall between 0 and 1. This can be done using the softmax function.

[3, 0.9, 1.2, 2.1]: for each value we take the exponent and divide it by the sum of the exponents,

Lion = e^3 / (e^3 + e^0.9 + e^1.2 + e^2.1)

Tiger = e^0.9 / (e^3 + e^0.9 + e^1.2 + e^2.1)

Leopard = e^1.2 / (e^3 + e^0.9 + e^1.2 + e^2.1)

Cheetah = e^2.1 / (e^3 + e^0.9 + e^1.2 + e^2.1)

This gives out the probabilities, and we consider the class with the highest probability. Without softmax, if we directly use max([3, 0.9, 1.2, 2.1]), we get 3 as the output, which is still a Lion, but this only gives us the highest value, not the probability distribution. If we need the probability of one class relative to the others, then softmax is the best activation function, as it normalizes the output.
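Here is a minimal sketch of the softmax calculation above, assuming NumPy; subtracting the maximum score is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    # Exponentiate each score and divide by the sum of the exponents
    e = np.exp(z - np.max(z))   # subtracting the max keeps the exponents from overflowing
    return e / e.sum()

scores = np.array([3.0, 0.9, 1.2, 2.1])   # raw outputs for Lion, Tiger, Leopard, Cheetah
probs = softmax(scores)
print(probs)        # approx [0.59 0.07 0.10 0.24] -- Lion gets the highest probability
print(probs.sum())  # 1.0
```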

We have discussed when and why we use an activation function and the types of activation functions. But there are some cases where we don't need to use any activation function for the output neuron. When we are dealing with prediction problems, such as house price prediction, flight fare prediction, or stock prediction, we just need the raw value to be produced as output instead of squashing it with a non-linear activation function. But remember, we still use activation functions in the hidden layers, just not for the output neuron when solving a regression problem.
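As an illustration of that last point, here is a minimal sketch, assuming TensorFlow/Keras, of a small regression network with ReLU in the hidden layers and no activation on the output neuron; the layer sizes and input dimension are arbitrary choices for the example:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(5,)),                    # 5 input features (e.g. house attributes)
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1)                       # no activation: the raw value is the predicted price
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```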

Conclusion

Activation functions are easy to understand, but in practice choosing one requires a good understanding of the data we use for training and a strong understanding of the problem we are solving. If we are not clear about this, the network may not produce accurate results, so the activation function is also a hyperparameter that needs to be tuned.

About me

I'm currently working as a data science intern to get some practical knowledge of what I have learnt in my Master's. I prefer to share my knowledge through my blogs, which can benefit new aspiring data science students. If you like my work, please feel free to contact me through LinkedIn (Gopi Krishna Duvvada).
