Stats for Data Science (Part-2)

Krishna · Jul 31, 2022

In part 1, we introduced the basic math required to start a career in data science, but honestly, those are concepts most of us already studied in the 9th or 10th grade. Today, let me introduce some very interesting and important concepts in statistics that will be used throughout your study of machine learning and deep learning.

Probability
Consider this example: you are waiting for a train at a station, you are tired, and you cannot stand for the whole journey. So you decide to flip a coin to check whether the seats on the train are occupied or not (if it is heads, the train will have free seats; otherwise it won't). When you flip the coin, what are you really doing? You are testing how likely it is that the seats on the train are free, and letting the coin decide how sure you are. This is probability: estimating the likelihood of an event happening.
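To make this concrete, here is a minimal sketch (not from the article) that estimates the probability of heads by simulating many coin flips; the number of flips is just an illustrative choice.

```python
import random

# Estimate P(heads) as the long-run relative frequency of heads.
random.seed(42)            # fixed seed so the run is reproducible
n_flips = 100_000          # illustrative number of flips
heads = sum(random.random() < 0.5 for _ in range(n_flips))

print(f"Estimated P(heads) = {heads / n_flips:.3f}")   # close to 0.5
```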

Probability Distribution Function

Let me continue the above example. If I ask you what the probability of having exactly one unoccupied seat on the train is,
P(Seats = 1)
can you estimate the likelihood? Obviously not! That one seat could be unoccupied because it is damaged, because somebody spilled water on it, or for any other reason you cannot pin down exactly. It is hard to estimate any event with 100% accuracy until it has already happened. Now let me ask you: what is the probability of having 1, 2, or more than 2 unoccupied seats on the train? I guess this is a lot easier to estimate, because now you have some room: if not 2 seats, there will be at least one.
P(Seats >= 1)

The image above shows a probability distribution function; the shaded area is the likelihood of having more than one unoccupied seat on the train. If we take the integral over that shaded region, we get its area, which gives us the probability of having a seat.

Note: if you ask for the probability of having exactly one unoccupied seat, the answer read off a continuous curve is zero (a single point has no area), and the whole area under the curve must equal one. As a rough intuition: if the train has 10 seats, one unoccupied seat corresponds to 1/10 of them, three unoccupied seats to 3/10, and if every seat is unoccupied you get 10/10 = 1.
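As a rough illustration of reading probabilities off a curve, here is a hedged sketch that pretends the number of unoccupied seats follows a continuous distribution (a normal curve with an assumed mean of 3 and standard deviation of 2, both made up for this example) and integrates its PDF with scipy.

```python
from scipy.stats import norm
from scipy.integrate import quad

# Sketch only: model "unoccupied seats" with an assumed continuous curve
# so that probabilities become areas under the PDF.
seats = norm(loc=3, scale=2)

# P(Seats >= 1): the shaded area under the curve to the right of 1.
p_at_least_one, _ = quad(seats.pdf, 1, float("inf"))
print(f"P(Seats >= 1) ~ {p_at_least_one:.3f}")   # same as 1 - seats.cdf(1)

# P(Seats == exactly 1) for a continuous variable is the area of a single
# point, which is zero -- the point made in the note above.
p_exactly_one, _ = quad(seats.pdf, 1, 1)
print(f"P(Seats == 1) = {p_exactly_one:.1f}")    # 0.0
```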

Binomial Distribution

I guess the train example has given you an intuition of what a PDF (probability distribution function) is, so to stay in sync let me continue with the same example for the binomial distribution as well.
You are waiting for the train and estimating the probability of finding a free seat by flipping a coin. Now let us focus on the coin itself. Suppose you flip the coin 3 times and use the result to check the probability of having an unoccupied seat on the train.
Case 1: Getting one head
You flip the coin three times and say that if you get a head on the first flip (and tails on the other two, the sequence HTT) you will have an unoccupied seat. What is the probability of that?
P(HTT) = 1/2 (for the head) * 1/2 (for a tail) * 1/2 (for a tail) ==> 1/8.
So any one specific sequence of three flips, such as HTT, has a probability of 1/8. (Note: the probability of actually finding an unoccupied seat would again depend on the number of seats available on the train, so for simplicity I am not considering that factor.)
Case 2: Getting 2 heads
Now say you will take the seat only if you get two heads in the three flips. What is the probability of that? One particular sequence with two heads, say HHT, has probability
P(HHT) = 1/2 (for a head) * 1/2 (for another head) * 1/2 (for the tail) ==> 1/8,
and the same 1/8 holds for any other specific sequence, such as HTH or THH.
If you observe carefully, there is a subtlety here: we want 2 heads but we have not pinned down their positions. If you get a head on the first flip, the second head could come on the second flip or the third, and you don't know which. So the number of arrangements is
==> 3 (positions for the first head) * 2 (positions left for the second head) / 2 (because the two heads are interchangeable)
==> 6/2 ==> 3, so there are 3 ways of placing two heads in 3 flips (HHT, HTH, THH).

==> 3 * 1/8, the 3 arrangements multiplied by the probability of any one of them, because each specific sequence has a probability of 1/8.
==> 3/8, the chance of getting exactly 2 heads when flipping a coin three times.
So, if you know some permutations and combinations, we can rewrite the counting above as
==> 3! / (2! * 1!). Let me decode this:
3! counts all the orderings of the 3 flips (3 * 2 * 1 = 6).
2! is for the 2 heads: swapping the two heads with each other gives the same sequence, so we divide by 2! to avoid double counting.
1! is for the 1 tail, for the same reason (dividing by 1! changes nothing here, but it keeps the formula general; I have another example below that gives you more intuition).
So 3! / (2! * 1!) = 6 / 2 = 3, the same 3 arrangements we listed.
Using the same idea, suppose you decide to flip the coin 5 times and still want exactly 2 heads.
The count is 5! / (3! * 2!) (here 3! plays the role that 1! played above, since there are now 3 tails):
==> (5 * 4 * 3 * 2 * 1) / (3 * 2 * 1) ==> 5 * 4 ==> 20, then 20 / 2! ==> 20 / 2 ==> 10,
so there are 10 ways of getting 2 heads when you flip a coin 5 times. I hope this gives you a better feel for what we did above.

So, to put it all together: for 2 heads in 3 flips there are 3! / (2! * 1!) = 3 arrangements, each with probability 1/8, giving 3/8.
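If you want to sanity-check the counting, here is a small sketch using Python's built-in math.comb and math.factorial; the numbers match the arrangements listed above.

```python
from math import comb, factorial

# Number of ways to place 2 heads in 3 flips.
print(comb(3, 2))                                      # 3  -> HHT, HTH, THH
print(factorial(3) // (factorial(2) * factorial(1)))   # same thing, 3!/(2!*1!)

# And for 2 heads in 5 flips, as in the second example:
print(comb(5, 2))                                      # 10
```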

The graph above shows a binomial distribution: we are estimating the probability of getting a certain number of heads while flipping a coin, so each possible count of heads gets its own bar and we end up with the shape above (it varies with the number of flips and the probability of heads). If we turn the calculation above into a general formula, we get P(X = k) = n! / (k! * (n - k)!) * p^k * (1 - p)^(n - k), where n is the number of flips, k is the number of heads we want, and p is the probability of a head on a single flip (1/2 for a fair coin).
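As a quick check of that formula, the sketch below computes P(2 heads in 3 flips) both by writing the formula out and via scipy.stats.binom; the fair-coin value p = 0.5 is the only assumption.

```python
from math import comb
from scipy.stats import binom

n, k, p = 3, 2, 0.5          # 3 flips, 2 heads, fair coin

# Binomial formula written out: C(n, k) * p^k * (1 - p)^(n - k)
manual = comb(n, k) * p**k * (1 - p)**(n - k)
print(manual)                 # 0.375, i.e. 3/8

# The same value from scipy's binomial distribution.
print(binom.pmf(k, n, p))     # 0.375
```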

Where is this used in data science? Well, not everything we learn has a very direct application; some concepts act as catalysts and some help at intermediate stages. The binomial distribution (BD) is used when an experiment has 2 mutually exclusive outcomes (mutually exclusive means the two events cannot happen at the same time; for example, you cannot bowl and bat simultaneously while playing cricket).

Normal Distribution (also called the Gaussian distribution)

We now have a clear understanding of the mean, the sample mean, the population mean, variance, standard deviation, and the different types of random variables (part 1). Now let me introduce the most popular and most important concept in the field of statistics: the normal distribution.

You have probably seen a bell-shaped curve before; the normal distribution is usually drawn as this curve. It describes how our data is distributed. In part 1 we discussed what sample and population data are, and when we plot such data it often spreads out in this way: the midpoint of the curve is the mean of the data, moving to the right we add one standard deviation to the mean, and moving to the left we subtract one standard deviation from the mean. But why?

Mean and standard deviation

The mean tells us the average amount of whatever quantity we are measuring (the quantity can be anything: age, height, and so on), and the standard deviation tells us how far the data spreads out around that mean (from mean - std, through the mean, to mean + std), as in the quick sketch below.
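Here is a minimal sketch of computing a mean and standard deviation with NumPy; the ages are made up purely for illustration.

```python
import numpy as np

# Made-up ages for this sketch; any numeric quantity would do.
ages = np.array([21, 23, 22, 25, 30, 27, 24, 26, 29, 23])

mean = ages.mean()
std = ages.std(ddof=1)        # sample standard deviation

print(f"mean = {mean:.1f}, std = {std:.1f}")
print(f"most values fall roughly between {mean - std:.1f} and {mean + std:.1f}")
```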
This is what the normal distribution represents. Now you might wonder how likely it is that your data is normally distributed: much of the time real-world data is approximately normal, but it can also be left skewed or right skewed (we will talk about these terms in the sections below).

If we recall the example from part 1, we considered the average height of students in a class. Bringing that example back, suppose we have 100 students and we calculate the mean:
Mean of students' heights = (5'4 + 5'3 + 5'8 + 5'9 + 6'0 + ... + 5'4 + 5'9) / 100 ==> 5'7.
So 5'7 would be the average height of a student, and on the curve above the mean sits at 5'7. As we move to the right we add one standard deviation, meaning we move away from the mean, and the curve gets narrower: there are comparatively few students much taller than 5'7. Similarly, moving to the left we subtract a standard deviation and the curve narrows again, indicating that there are also only a few students much shorter than the mean.
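To reproduce that arithmetic, here is a small sketch that converts the handful of feet-and-inches heights listed above (not all 100 students) into inches and averages them.

```python
# Only the heights visible in the example above, not the full class.
heights = ["5'4", "5'3", "5'8", "5'9", "6'0", "5'4", "5'9"]

def to_inches(h):
    """Convert a height written as feet'inches into total inches."""
    feet, inches = h.split("'")
    return int(feet) * 12 + int(inches)

mean_in = sum(to_inches(h) for h in heights) / len(heights)
print(f"average height ~ {int(mean_in // 12)}'{round(mean_in % 12)}")   # roughly 5'7
```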

This is the kind of information we can read off a normal distribution curve, and it is often the first, most basic step a data analyst or data scientist takes to understand how the data is distributed. Building on this, there are a few more concepts:

  1. Right skewed ND
  2. Left Skewed ND
  3. Empirical rule

Right skewed ND

From the image we can see that the tail points toward the positive side of the axis, which is what right-skewed (or positively skewed) data looks like. Consider an example you may have heard many times: income. When we plot the incomes of people in a particular state or country, most incomes fall between, say, $20,000 and $50,000, and very few incomes (the billionaires) stretch far to the right (well beyond $50,000), so in these cases we get a right-skewed curve.
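A hedged sketch of what that looks like numerically: generate made-up incomes with a few very large values and measure the skew with scipy.stats.skew; all the numbers below are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Made-up incomes: most between $20k and $50k, a handful of very high earners.
incomes = np.concatenate([
    rng.uniform(20_000, 50_000, size=990),    # the bulk of the population
    rng.uniform(1e6, 1e7, size=10),           # a few billionaire-style outliers
])

print(f"skewness = {skew(incomes):.2f}")      # positive -> right (positively) skewed
print(f"mean = {incomes.mean():,.0f}, median = {np.median(incomes):,.0f}")
```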

Left Skewed

The figure above shows what a left-skewed distribution means: if the tail of the curve points toward the left side of the axis, we call it left skewed or negatively skewed. Consider this example: people are asked to walk 100 meters within 10 minutes. If we plot the distance each person manages to cover in that time, most of them reach the full 100 meters, and only a few (perhaps elderly or physically challenged people) cover less. So most of the data piles up on the right, toward the high end, and only a few values trail off to the left.

Empirical Rule

This is a standard rule, established after many experiments with distributed data. The rule says:
If the data is normally distributed (especially once it is standardized to the standard normal distribution), then about 68% of the data lies within one standard deviation of the mean (from μ - std to μ + std), about 95% lies within two standard deviations (from μ - 2std to μ + 2std), and about 99.7% lies within three standard deviations (from μ - 3std to μ + 3std).
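You can verify these percentages empirically; the sketch below draws a large simulated normal sample with NumPy and checks the fraction of points within 1, 2, and 3 standard deviations of the mean.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=0, scale=1, size=1_000_000)   # simulated normal data

mu, sigma = data.mean(), data.std()
for k in (1, 2, 3):
    frac = np.mean(np.abs(data - mu) <= k * sigma)
    print(f"within {k} std: {frac:.3f}")            # roughly 0.683, 0.954, 0.997
```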

The empirical rule helps us spot and remove outliers from the data (we will talk about outliers in part 3 of this series).

To be continued…

Conclusion

Data science is a vast field, and the easiest way to learn its concepts is to break them into micro subjects. This article should give you an understanding of the data distributions used in statistics. Nobody starts out as an expert in any field, but we can get better by practicing and solving different problems. Many people understand the concepts but neglect to practice, and data science is one field where we need to work on a lot of different use cases.
