Basics of Dealing with Data

Krishna
10 min read · Apr 13, 2023


About Data

The first thing every data science aspirant thinks is, "We will build models and automate processes using data," but the real question is, "How will we get the data?" Before answering this question, let us understand what data is.

Data is information, but is everything available information? The answer is "No". Information is something that makes sense and provides value. Consider the example "Cat there hungry some food they provide is": does this statement make sense? The answer is "No"; it is just data without any meaning. Now let me rephrase it: "The cat is hungry, provide some food." This now makes perfect sense. We have not added any new data, we have only rearranged the existing data, and it now provides information. In a nutshell, good data is data that makes sense and provides information.

Data should make some sense, but do we always get data that is clean and directly provides information? The answer is "No". We usually get very raw data that is hard to interpret, but if we apply some techniques to clean and organize it properly, it will provide good information from which we can pull further insights through visualization.

Data Collection

Now that we have a brief understanding of data, let us look at how to collect it. Data collection is one of the most important and hectic tasks for a data analyst/data scientist, because data is rarely readily available; we need to use many methods to collect it and then refine it. While working in a firm's data team, you will deal with two kinds of data,

  1. Internal data: Data that the firm has previously collected, or data that describes the firm's own activities.
  2. External data: Data that must be provided by the client. This data is hard to get because we need to organize many meetings with the client, covering the following points,

a. Why do we need the data?

b. What will we be doing with the data?

c. Data security assurance

d. An overview of the results

e. Provide some previous results using sample raw data.

We need to build trust with the clients and use the data accurately to produce good results.

What if we need to collect the data ourselves?

For this, we first have to understand what problem we are solving. For example, if we are developing a movie recommendation system, it doesn't make sense to collect sports-related data. Once we understand what we are trying to solve, we can use the following methods to collect the data,

  1. Use APIs to pull the data from platforms like YouTube, Instagram, etc.
  2. Use web scraping to collect data; in Python we can use the Beautiful Soup library for scraping (see the sketch after this list).
  3. For samples, we can get readily available data from Kaggle, Bloomberg data, etc. (This is only for internal projects; these datasets are not recommended for real-time purposes.)
  4. Conduct surveys or polls.
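For instance, here is a minimal scraping sketch, assuming the page at the example URL exposes its data as plain HTML tables (the URL and table layout are illustrative, not from the original article):

```python
# Minimal web-scraping sketch with requests, Beautiful Soup, and pandas.
# The URL and the assumption that the page contains <table> elements are illustrative.
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Data"  # example page assumed to contain tables

html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Find every table on the page, then turn the first one into a DataFrame.
tables = soup.find_all("table")
print(f"Found {len(tables)} tables")

first_table = pd.read_html(StringIO(str(tables[0])))[0]
print(first_table.head())
```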

Reading Data

We have covered a few ways to collect the data necessary for analysis/research; now the question is, "How do we read the data?"

Well, those who are familiar with Python File Input/Output methods can easily read and write files. But would that be appropriate for analysis? The answer is "No". File Input/Output methods are great for basic file handling, but for data analysis the better approach is to use libraries that can convert the files into a "DataFrame". Let's first understand what a DataFrame is.

A DataFrame is a table where the data is organized in rows and columns and can be restructured and filtered according to the problem we are trying to solve.

For reading CSV files, we make use of "Pandas", a Python library that helps with data manipulation and with structuring data into DataFrames.
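As a minimal sketch, assuming the passenger data lives in a file called "train_passengers.csv" (the file name is a placeholder):

```python
import pandas as pd

# "train_passengers.csv" is a placeholder name for the train passenger dataset.
df = pd.read_csv("train_passengers.csv")

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first five rows of the DataFrame
```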

In the article's example, reading the CSV this way into a Jupyter Notebook produces a DataFrame of train passenger data. The data will not always come as CSV files; sometimes we need to pull data from web pages, and for that we can make use of Pandas' HTML reading method.

Before we jump into reading HTML files using pandas, we need to consider a few points,

  1. Not every table on a website is exposed as an HTML table.
  2. Accessing HTML tables with pandas is not always ideal; instead, we can use web scraping with "Beautiful Soup".
  3. The most efficient way is to use APIs.

So, Pandas is not always a great choice for working with HTML files, but for internal analysis, sample testing, and so on, we can use Pandas' HTML reader. Consider a Wikipedia article about data,

When we pass the URL to pandas, it searches the HTML page for tables; in the article's example it finds 21 tables, and to access a particular table we specify its index in the returned list.
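A minimal sketch of this, using the Wikipedia article on data as the source; the number of tables returned may differ from the 21 mentioned in the original screenshot:

```python
import pandas as pd

url = "https://en.wikipedia.org/wiki/Data"

# read_html returns a list of DataFrames, one per HTML table found on the page.
tables = pd.read_html(url)
print(len(tables))        # how many tables were found

# Access a particular table by its index in the list.
first_table = tables[0]
print(first_table.head())
```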

With pandas, we can also read Excel files (in general we don't use pandas for Excel files, because we can manipulate them directly in MS Excel), but we need to remember one point: pandas cannot read advanced formulas, macros, or visualizations in an Excel file; it just reads the raw data.
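A minimal sketch, assuming a workbook named "sales.xlsx" with a sheet called "Q1" (both names are hypothetical):

```python
import pandas as pd

# Requires an Excel engine such as openpyxl to be installed.
df_excel = pd.read_excel("sales.xlsx", sheet_name="Q1")

print(df_excel.head())  # only the raw cell values are read, not formulas, macros, or charts
```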

Now comes the most important part: cleaning the data. Once we have read our data into DataFrames, we need to check for any kind of unwanted values and remove them.

Cleaning the data

Cleaning the data has two phases,

  1. Identify the null/missing values and find a way to either fill them up or remove them directly.
  2. Identify the outliers and remove them.

Let's start with identifying the missing values. When we read a CSV file, or convert a TXT file into a CSV and read it in, there are often some values missing. For example, suppose we are analyzing survey data about the next elections, conducted in a particular city with a targeted audience. If a few respondents were unwilling to answer the questions, their response is recorded as "Not interested", but sometimes, due to technical issues, the response is not recorded at all and is left empty. In a few cases the data can be accidentally erased. These issues cause missing values in the data, and the absence of data creates null values.

To handle this, we first need to analyze the following,

  1. How many missing values are there in the data?
  2. Which features have the most missing values?
  3. Are these missing values replaceable by any kind of data manipulation?
  4. What impact does it create if we drop these missing values?

Once we answer these questions we can proceed with cleaning the data,

First we count the number of missing values in the dataset. It is hard to judge the severity just by looking at raw counts, so I have also created a bar plot showing the percentage of missing values in each column.
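A minimal sketch of that check, assuming the movie dataset has been read into a DataFrame called df (the file name is a placeholder):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("movies.csv")  # placeholder name for the movie dataset

# Count of missing values in each column.
print(df.isnull().sum())

# Percentage of missing values per column, shown as a bar plot.
missing_pct = df.isnull().mean() * 100
missing_pct.sort_values(ascending=False).plot(kind="bar", ylabel="% missing")
plt.tight_layout()
plt.show()
```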

Now, there are two dependent columns here, budget and gross, and both have missing values, so we cannot simply drop all those rows because that would impact the analysis. We need to understand that dropping null values is not always the ideal approach, because we would lose a lot of data, so we should first consider other options, such as filling the missing values with the average of that particular column. But this is not ideal in every situation either, because if there are many null values and they are all replaced with averages, most of the data ends up inside one narrow band and the insights drawn from the analysis could be misleading.
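To make the two basic options concrete, here is a small sketch using the assumed column names budget and gross from the movie example and the df DataFrame read above:

```python
# Option 1: drop rows where budget or gross is missing (loses data).
df_dropped = df.dropna(subset=["budget", "gross"])

# Option 2: fill the missing values with the column average
# (can squeeze too much of the data into one narrow band).
df_filled = df.copy()
df_filled["budget"] = df_filled["budget"].fillna(df_filled["budget"].mean())
df_filled["gross"] = df_filled["gross"].fillna(df_filled["gross"].mean())
```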

Then what can we do instead of dropping or filling with the average?

  1. Look for a dependent feature and use a formula to derive the value from it. Consider the example of the movie data: if there are many null values in the budget, we can look at the revenue/gross column and use it to calculate profits; based on the profits we can estimate an average budget and fill that in.
  2. Use some kind of regression model to replace it. Train the model on the existing data, make the column with the most missing values the target, and predict the missing values; a sketch follows after this list. (This will not always produce very accurate results; the method depends entirely on the problem and the data we are dealing with.)
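A minimal sketch of the regression idea, assuming scikit-learn and the movie columns used above, with gross predicting the missing budget values:

```python
from sklearn.linear_model import LinearRegression

# Rows where both columns are known are used for training.
train = df.dropna(subset=["budget", "gross"])
# Rows where budget is missing but gross is known are the ones to impute.
to_impute = df[df["budget"].isnull() & df["gross"].notnull()]

model = LinearRegression()
model.fit(train[["gross"]], train["budget"])

# Predict and fill in the missing budgets.
df.loc[to_impute.index, "budget"] = model.predict(to_impute[["gross"]])
```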

Once we are done working with the null values, the next step is to deal with outliers. What is an outlier?

The values that do not match or fit with the rest of the data are outliers.

But why do we need to remove the outliers? Outliers mainly affect machine learning models in the form of overfitting.

"When there is high variance, overfitting happens; if the model is overfitted, then it fails to make correct predictions on any new data."

During visual analysis, outliers can also lead to misinterpretations, so to overcome this we look for outliers and remove them.

How to remove Outliers?

There are three ways of removing outliers,

  1. Visualize the data; if there are only a few outliers, we can use some pandas filtering techniques and drop them directly, which saves time. But most of the time the outliers will not be that easy to spot.
  2. The second and most popular method is the Interquartile Range (IQR) method. Before we discuss how to find the IQR, let me introduce the "five-number summary" of the data,

a. The minimum of the data

b. The first quartile of the data (Q1)

c. The median of the data

d. The third quartile of the data (Q3)

e. The maximum of the data

These five numbers give an overview of the dataset. For example, the range of the data can be calculated directly from the five-number summary,

Range = Maximum - Minimum

The IQR is also calculated similarly,

IQR = Q3 - Q1

Now how to detect the outliers using the IQR?

  1. Calculate the interquartile range (IQR = Q3 - Q1).
  2. Multiply the IQR by 1.5 (1.5 * IQR).
  3. Subtract this value from the first quartile to get the lower bound (Q1 - 1.5 * IQR).
  4. Add the same value to the third quartile to get the upper bound (Q3 + 1.5 * IQR).
  5. Any data point that falls outside these bounds is considered an outlier.

Outliers lie below Q1 - (1.5 * IQR) or above Q3 + (1.5 * IQR).

The five-number summary is visualized with a box plot.

3. The third method to find outliers is the empirical rule. The empirical rule says that roughly 68% of the data lies within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. Data that lies beyond three standard deviations from the mean is considered an outlier. A sketch of both the IQR and empirical-rule checks follows below.
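A minimal sketch of both checks on a single numeric column, again assuming the movie data's gross column; the thresholds follow the rules described above:

```python
col = df["gross"].dropna()

# --- IQR method ---
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = col[(col < lower) | (col > upper)]
print(f"IQR method flags {len(iqr_outliers)} outliers")

# --- Empirical rule: more than three standard deviations from the mean ---
mean, std = col.mean(), col.std()
rule_outliers = col[(col - mean).abs() > 3 * std]
print(f"Empirical rule flags {len(rule_outliers)} outliers")

# Keep only the rows that fall inside the IQR bounds.
df_no_outliers = df[df["gross"].between(lower, upper)]
```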

Manipulating Data

We have now understood what data is, how to collect it, and how to clean and organize it. After the cleaning process there are often scenarios where we need to break the dataset down further; in short, we sometimes need to manipulate the data for better understanding. For example, suppose we are dealing with bank data and the dataset contains the columns 'Age of the customer', 'Employment', 'Salary', 'Loan', 'Balance', and 'Period'. This dataset does not tell us how urgent each loan is; even though the bank knows why the customer took the loan, the data doesn't specify it. To add such a column we manipulate the given data: if the loan amount is greater than a certain value then, considering the period, we can categorize it as 'High', 'Medium', or 'Low'. This helps us decide whether to contact the customer immediately and ask for repayment. (Note: this example is just for simple understanding; in reality, the datasets could be more complicated.)

In these cases, we can make use of the pandas “Apply” method.

We are simplifying the data using what already exists, and the apply method can also be used to fill in missing values, depending on the data. Consider the movie example from the missing-values section: if we have the gross feature, we can calculate the profit by subtracting the budget from the gross, and estimate the missing budget values using the average of that figure. To replace the missing values, we again use the apply method.
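A minimal sketch of apply on the bank example described above; the column names, thresholds, and priority labels are illustrative assumptions:

```python
import pandas as pd

bank = pd.DataFrame({
    "Salary": [40_000, 85_000, 60_000],
    "Loan":   [5_000, 90_000, 30_000],
    "Period": [12, 60, 24],  # months
})

def loan_priority(row):
    """Categorize repayment priority from loan size and period (assumed rule)."""
    if row["Loan"] > 50_000 and row["Period"] > 36:
        return "High"
    if row["Loan"] > 20_000:
        return "Medium"
    return "Low"

# apply with axis=1 runs the function once per row and builds a new column.
bank["Priority"] = bank.apply(loan_priority, axis=1)
print(bank)
```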

Conclusion

Nobody is an expert in handling data; every day a new method evolves for shaping data to the problem we are trying to solve. The fundamentals of handling data remain the same, and this article has discussed the basic methods of handling and manipulating data. Moving forward, we will explore newer and more advanced methods to groom the data according to our needs.
