Introduction to Data Science

Krishna · May 11, 2024


The world has always been driven by data, though until the early 1900s much of society showed little interest in understanding and using it. The upheavals of the two World Wars served as catalysts for numerous inventions, some employed for destructive purposes and others contributing to advancements for the greater good. Peter Naur notably introduced the term “data science” in his 1974 publication, “Concise Survey of Computer Methods”. Naur’s work illustrated how data could be harnessed to understand market behaviour, and the approach was subsequently adopted across various sectors for forecasting and informed decision-making, to the benefit of organisations. Later advancements expanded the field from the analysis of gathered data to endeavours such as replicating human behaviour, particularly through advances in neural computing.

In the early 2000s, data science, both as an educational pursuit and a profession, was widely perceived as one of the most challenging fields. This was primarily due to the scarcity of user-friendly visualisation tools and the limited range of libraries serving the burgeoning world of machine learning. The field remained somewhat obscure until the mid-2000s. With subsequent advancements, and notable contributions from the Python and Microsoft ecosystems, the path to proficiency in this domain has become considerably smoother. Yet, amidst these advancements, a prevalent misconception persists: many believe that data science is primarily about programming and the intricate construction of algorithms. Allow me to dispel this notion. Data science is, in fact, a rich amalgamation of disciplines. It blends mathematics, logical reasoning, domain expertise, market acumen, a penchant for data exploration, computational prowess, and a modicum of programming skill, along with various other areas, making it a multifaceted discipline tailored to its specific field of application.

Many individuals often struggle to discern the disparities between data analysis, data engineering, and data science. Allow me to illuminate each of these in brief.

All data-related endeavours commence with two primary processes:

1. Understanding the problem or circumstance and devising a pseudo-algorithm outlining the approach to tackle the problem.

2. Data collection and governance, which warrants a closer examination (I will elaborate on data governance separately).

Now, let’s delve into comprehending each of these:

Data Analysis: To illustrate, consider the analogy of filmmaking. In filmmaking, the initial requisite is a plot, which encapsulates the essence of the entire film. Subsequently, characters are delineated, their relationships established, and the narrative arc crafted to weave a cohesive story. Upon this foundation, storyboarding ensues, visually mapping out scenes as per the screenplay’s direction. The screenplay dictates the flow of the narrative, whether linear, non-linear, or spatially related. As the screenplay progresses, collaboration between writers and directors ensues, culminating in decisions regarding scene selection, length, and character attributes. This iterative process eventually yields a finalised script, initiating the production phase, inclusive of budget allocation and casting, among other aspects.

Transposing this analogy to data, the plot equates to the problem or scenario necessitating a nuanced approach for comprehension. Data analysts and managers spearhead the collection phase, akin to crafting the story in filmmaking. Attributes are assigned to each data point, and relationships between them are established, akin to constructing characters and plot arcs. Subsequently, data analysts leverage tools such as Power BI, Tableau, or Python to visualise the data, analogous to storyboarding. Various plot iterations are crafted and transformed into dashboards or utilised for further data refinement. Collaborative discussions among stakeholders ensue based on these visualisations, facilitating informed decision-making and bottleneck identification. This iterative cycle culminates in data pattern recognition, aiding in refining strategies and maintaining data integrity. Documentation of these stages and attribute definition for future utilisation wraps up the process.

1. Understanding the problem and requirements: Data analysis begins with a thorough comprehension of the problem or objectives at hand.

2. Data gathering: The next step involves collecting relevant data from various sources.

3. Organising and cleaning the data: Once collected, the data must be organised and cleaned to ensure accuracy and consistency.

4. Establishing relationships between data points: Data analysts then establish relationships between different data points to uncover insights and correlations.

5. Creating sample visuals from the data: Sample visuals are generated from the data to provide an initial glimpse into potential patterns and trends.

6. Producing unique data visuals and engaging in discussions: Based on these sample visuals, data analysts engage in discussions with both internal stakeholders and clients to refine the analysis further.

7. Understanding patterns and identifying bottlenecks: Through a comprehensive examination of the visuals, patterns and potential bottlenecks are identified, informing decision-making.

8. Making decisions based on results and continuous data maintenance: Decisions are made based on the insights gleaned, and the data is continuously updated and maintained to ensure relevance and accuracy.

9. Documenting all stages and defining attributes: Finally, all stages of the data analysis process are documented, and attributes are defined for future reference and use.

These steps provide a high-level overview of the typical data analysis process. While they may vary from one organisation to another, they are guided by tried and tested principles rather than strict rules.
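To make steps 2 through 5 concrete, here is a minimal sketch in Python using pandas and matplotlib. The file sales.csv and its columns are hypothetical, invented purely for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Step 2: gather data (here, a hypothetical CSV of monthly sales)
df = pd.read_csv("sales.csv")  # assumed columns: month, region, revenue

# Step 3: organise and clean -- drop duplicates, fix types, handle gaps
df = df.drop_duplicates()
df["month"] = pd.to_datetime(df["month"])
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
df = df.dropna(subset=["revenue"])

# Step 4: establish relationships -- e.g. revenue aggregated by region
by_region = df.groupby("region")["revenue"].sum().sort_values()

# Step 5: create a sample visual to anchor stakeholder discussions
by_region.plot(kind="barh", title="Revenue by region")
plt.tight_layout()
plt.show()
```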

Now, let’s delve into the role of data engineers.

Data Engineering: The initial stages of understanding requirements, as well as the collection and cleaning of data, remain consistent, forming a common and essential phase of the data science journey. What distinguishes a data engineer is their focus on developing data pipelines and scripts: ensuring the uninterrupted flow of data once visualisations are created and decisions are made or in progress. Data engineers must guarantee that data flows smoothly from its source, is properly structured, and reaches its endpoint without disruption. Any anomalies in the flow must be promptly tracked and addressed to maintain its integrity.

To illustrate this, let’s return to the filmmaking analogy. After the story is crafted, the writers continue to collaborate with dialogue writers, directors, cinematographers, and casting directors. Any modifications to the story development require discussion with the original writers, as they are the creators of the narrative.

Similarly, data engineers design the data flow, ensuring its continuity and integrity. They write SQL queries to clean extraneous data at the outset, define data types, and manage dynamic data storage. In many industries, data is dynamic and updated regularly by users, necessitating cloud resources for maintenance. Data engineers are proficient in cloud management, data extraction, cleaning, and storage processes. They ensure that processed data retains its pattern and accessibility for machine learning or further analysis, contributing to forecasting and data categorisation.
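As a rough sketch of such a script, the example below extracts raw records, enforces data types, and loads the cleaned rows into a database, reporting how many rows were rejected so anomalies can be tracked. It substitutes Python’s built-in sqlite3 for a production warehouse, and the file, table, and column names are all hypothetical.

```python
import sqlite3

import pandas as pd

def run_pipeline(source_csv: str, db_path: str) -> int:
    """Extract raw rows, clean them, and load them into a database table."""
    # Extract: read the raw export (hypothetical file and columns)
    raw = pd.read_csv(source_csv)  # assumed columns: order_id, amount, placed_at

    # Transform: enforce data types and drop rows that cannot be repaired
    raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")
    raw["placed_at"] = pd.to_datetime(raw["placed_at"], errors="coerce")
    clean = raw.dropna(subset=["order_id", "amount", "placed_at"])

    # Load: append the cleaned rows so downstream dashboards and models
    # always read from one well-typed table
    with sqlite3.connect(db_path) as conn:
        clean.to_sql("orders_clean", conn, if_exists="append", index=False)

    # Report how many rows were rejected so anomalies can be tracked
    return len(raw) - len(clean)

rejected = run_pipeline("orders.csv", "warehouse.db")
print(f"{rejected} malformed rows rejected")
```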

In essence, data engineering is crucial for maintaining the smooth flow and integrity of data throughout the data science process, enabling effective analysis and decision-making.

Now, let’s delve into the role of a data scientist.

Data Scientist: I found it somewhat disheartening to realise that merely possessing a master’s degree in data science does not automatically confer the title of data scientist. It takes a combination of exceptional skill and perhaps a stroke of luck to secure a position at a reputable startup. Nonetheless, the journey has been intriguing, and I may consider penning a separate blog to share the lessons learned along the way. For now, let’s focus on understanding the essence of a data scientist’s role.

While the initial stages of requirement gathering and data preprocessing remain consistent, the pivotal role of a data scientist truly begins with understanding the problem at hand. In many cases, they are the first to identify issues or propose requirements that could streamline existing processes within an organisation. This ability stems from their extensive exposure to diverse datasets and a keen understanding of where potential challenges may arise.

Consider this example: to assess whether an individual has programming experience, instead of asking numerous questions, a data scientist might task them with building a small application. Observing how they define algorithms, organise code, and employ constructs such as try/except blocks for error handling or reusable functions for modularity quickly reveals an experienced programmer.
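To illustrate the kind of habits such a task reveals, here is a deliberately small, hypothetical Python example: a reusable function with explicit error handling, the sort of structure an experienced programmer tends to write without being prompted.

```python
def safe_divide(numerator: float, denominator: float, default: float = 0.0) -> float:
    """Reusable helper: divide two numbers, falling back to a default on error."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # Handle the expected failure mode explicitly instead of crashing
        return default

# A modular function like this can be reused anywhere a ratio is computed
clicks, impressions = 42, 0
conversion_rate = safe_divide(clicks, impressions)
print(conversion_rate)  # 0.0 rather than an unhandled ZeroDivisionError
```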

A data scientist possesses a deep understanding of these intricacies. However, their primary challenge often lies in building and deploying models. The initial step involves selecting appropriate machine learning algorithms to address the requirements. This entails testing various algorithms with sample data, creating ensembles, and correlating outputs with expected solutions. Once the models are chosen, collaboration with data engineers is crucial to establish the data flow — ensuring the model receives the requisite data. A data scientist must also be adept at data analysis, engineering, cloud services utilisation, and drawing domain-specific conclusions.
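As a rough illustration of that selection step, the sketch below cross-validates two candidate algorithms and combines them into a simple voting ensemble. It assumes scikit-learn is available and uses one of its bundled toy datasets purely as sample data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Test a few candidate algorithms against the same sample data
candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")

# Combine the candidates into a simple voting ensemble and compare
ensemble = VotingClassifier(estimators=list(candidates.items()), voting="soft")
print(f"ensemble: mean accuracy {cross_val_score(ensemble, X, y, cv=5).mean():.3f}")
```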

Configuration of the models and their integration into the workflow is another vital aspect. Rigorous testing is imperative before finalising results. While model development may be relatively swift with adequate knowledge and resources, thorough testing is paramount: a single erroneous solution can jeopardise an entire organisation’s endeavours. For instance, Google Glass was anticipated to sell strongly based on extensive monitoring of human behaviour and predictive analyses. However, the outcome fell short of expectations, underscoring the critical importance of thorough testing and validation.

In summary, whether in data analysis, engineering, or science, each role contributes to the same overarching process, with the degree of expertise varying accordingly. While salaries and workloads may fluctuate based on specific circumstances, data scientists typically command the highest compensation, with engineers and analysts following closely behind, their earnings influenced by the depth of their knowledge.

Let’s take a quick look at the skills needed for each of these roles:

Data Analysis:

  1. A basic understanding of statistics and probability.
  2. A programming language (preferably Python or R).
  3. Knowledge of a visualisation tool (Power BI or Tableau).
  4. Knowledge of the domain you intend to work in.
  5. Basic knowledge of SQL and databases.
  6. Basic knowledge of cloud services (Azure, AWS, etc.).
  7. The ability to extract information from data.

Data Engineering:

  1. A good understanding of statistics and probability.
  2. Proficiency in a programming language (preferably Python or R).
  3. Very good knowledge of a visualisation tool (Power BI or Tableau).
  4. Very good knowledge of the domain you intend to work in.
  5. The ability to write SQL queries and solid experience with databases.
  6. Experience using cloud services (Azure, AWS, etc.).
  7. The ability to extract information from data.
  8. Knowledge of machine learning models.

Data Science:

  1. Experience working with statistics and probability problems.
  2. Very good knowledge of machine learning and neural networks.
  3. A good understanding of how each model works behind the scenes.
  4. Experience working with SQL.
  5. Experience working with cloud services.
  6. Strong experience with a programming language (preferably Python or R).
  7. Very good knowledge of a visualisation tool (Power BI or Tableau).
  8. The ability to anticipate problems and proactively design solutions.

The key is not to attempt to master every aspect of the field at once, but rather to start with small steps, maintain consistency, and strive to understand the process deeply. As Bruce Lee famously said, “I fear not the man who has practiced 10,000 kicks once, but I fear the man who has practiced one kick 10,000 times.” Success in this field hinges on dedication and continuous learning. Luck may not always be on our side, but with knowledge and perseverance, we can steadily progress towards our goals. Remember, pretending will only get you so far; true mastery requires genuine effort and commitment.


Written by Krishna

Machine learning | Statistics | Neural Networks | Data Visualisation. Data science aspirant, writer of fictional stories, sharing my knowledge through blogs.
