Understanding Data Science from a Beginner’s Lens
You might think data science is just a buzzword, and when you look around you find everybody using it blindly.
So what comes to your mind when somebody says the words “data scientist”?
A Microsoft Excel user (I can’t reject it, but believe me, it’s more than that)? If you are a computer nerd, you might link it with machine learning, and if you are a maths nerd, you might link it with statistics.
But what is it really?
What might be the questions that data science answers?
This might be a very broad question. Let’s just quickly glimpse through two of the data models to uncover the idea:
That’s too many technical terms, let’s start from the start. 🙂
But what is Data Science and who is a data scientist?
Data Science is the study of data. Let me help you understand it further. Data in its raw form is meaningless. But when handled with care it becomes more valuable than gold. In today’s world, all the big economies are fighting over this data and its storage. Data is the new oil of the twenty-first century (fingers crossed that no wars are fought over it, unlike oil in, you know, the Middle East).
Now let’s break it into a more understandable form. A Physicist is a person who exposes the reality of matter and energy, and a Chemist is a person who exposes the structure of chemical reactions and compounds. Similarly, a Data Scientist is a person who exposes unique patterns in data.
The journey of a Data Scientist is unlike that of any other profession. Data science does not belong to any single discipline; it is a combination of several, including but not limited to Statistics and Machine Learning. People only look at the fancy word, but let’s dive deeper to see why there is such a buzz around it, and along the way learn some little-known facts and gossip from the community that can change people (literally).
What is Data?
Data are facts or statistics collected over a period, produced in structured as well as unstructured form. That is the technical definition, so let’s break it down. It means everything around you is facts and statistics, from your breath to your speech (pun intended). Humans produce enormous amounts of data. But only when these facts and statistics are collected and tabulated do they become Smart Data.
In Data Science we care more about the data we are getting than anything else in the world. If you ask a Data Scientist for their last wish, they will probably ask for well-structured data with enough data points and, if they don’t already have one, a job at a FAANG company.
Types of Data:
Numerical (numbers, whole or decimal. Eg: the temperature of the day, 35 °C)
Categorical (more like words or characters drawn from a fixed set. Eg: favourite colors of individuals, Red, Blue or Green)
Image (Eg: an image of a dog)
Text (Eg: a paragraph from your favorite novel)
These are the basic forms of data, and all other forms of data can more or less be converted into one of them.
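To make the four forms a little more concrete, here is a minimal sketch of how each one might be held in plain Python. Every value and every variable name below is made up for illustration; real projects usually rely on dedicated libraries for images and tables.

```python
# Illustrative values only; none of them come from a real dataset.

# Numerical: a measurement stored as a number.
temperature_c = 35.0

# Categorical: one label out of a fixed set of allowed labels.
colour_choices = ("Red", "Blue", "Green")
favourite_colour = "Blue"

# Image: a grid of pixel brightness values (0 = black, 255 = white).
# This is a tiny 2 x 3 greyscale picture, not an actual dog.
tiny_image = [
    [0, 128, 255],
    [255, 128, 0],
]

# Text: a plain string.
sentence = "It was a dark and stormy night."

# One row of a dataset can mix several of these forms.
record = {
    "temperature_c": temperature_c,
    "favourite_colour": favourite_colour,
    "photo": tiny_image,
    "note": sentence,
}

# Show which Python type holds each form.
for name, value in record.items():
    print(f"{name}: {type(value).__name__}")
```

Notice that the image is already just a grid of numbers, which is a hint of why everything eventually ends up numerical.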
The machine learning side of Data Science can further be divided into two forms:
Supervised Learning: We know the outcomes beforehand for the data we learn from, like every IPL final with MI in it 🙂
Unsupervised Learning: We don’t know the outcomes beforehand, so the algorithm has to look for structure in the data on its own, using certain assumptions. Like those rare times when CSK makes it to the final and defeats MI.
We will specifically talk about Supervised Learning in this article.
Supervised Learning can further be divided into:
Regression (where the outcome is a continuous value, like: how much money is RCB going to spend on players?). The outcome can be any number within a range, not just an integer.
Classification (where the outcome is one of a set of classes, like the important question for RCB: are they ever gonna win?). The classes can be labelled with integers or with categories, but only the defined values are allowed and no new values outside them. For example, when we train a model to recognize a dog and a cat and then pass it a picture of a kangaroo, it will not classify it as a kangaroo. The machine only knows about dogs and cats, so it will classify the pic as one of the two, say a dog. It’s like a newborn baby who tries to classify the entire world using only the few classes they know.
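Here is a small sketch of the two flavours in plain Python. The numbers, the two features (weight and height) and the nearest-centroid rule are all assumptions chosen to keep the example tiny; they are not how any real cricket or pet model works.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Regression: the answer is a number on a continuous scale.
# Made-up data: players bought vs. money spent.
players_bought = [1, 2, 3, 4]
money_spent = [10.0, 20.0, 30.0, 40.0]
slope, intercept = fit_line(players_bought, money_spent)
predicted_spend = slope * 5 + intercept  # prediction for 5 players


# Classification: the answer is one of the classes seen in training.
# Made-up features: (weight in kg, height in cm).
training = {
    "dog": [(30.0, 60.0), (25.0, 55.0)],
    "cat": [(4.0, 25.0), (5.0, 23.0)],
}


def centroid(points):
    """Average point of a class."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))


centroids = {label: centroid(points) for label, points in training.items()}


def classify(sample):
    """Return the label whose centroid is closest to the sample."""
    def squared_distance(label):
        cx, cy = centroids[label]
        return (sample[0] - cx) ** 2 + (sample[1] - cy) ** 2
    return min(centroids, key=squared_distance)


kangaroo = (60.0, 150.0)
print(predicted_spend)     # a continuous value
print(classify(kangaroo))  # always "dog" or "cat", never "kangaroo"
```

The classifier has no way to answer "kangaroo" because that label never appeared in its training data, which is exactly the newborn-baby behaviour described above.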
In a study published in 2014, Facebook experimented on nearly 700,000 users, manipulating their news feeds to determine whether it would affect their emotions. One sample (this just means a group drawn from all the people in the experiment) was shown a more positive feed, while the other sample was shown a more negative one. It was observed that the people shown the positive feed tended to write positive posts, and the ones shown the negative feed tended to write negative posts. Facebook was heavily criticized for this experiment, with critics arguing that it breached ethical guidelines on “informed consent”. Just another day of Facebook spying.
This brings us to our next question.
Is Data Science going to kill us?
At least not for now, but we should be concerned for later. As the Facebook experiment above illustrates, being aware of your data is crucial now. You might not even know when you’re being manipulated. But this is not just about stopping the big companies. Data Science is the future, and enormous money is being spent in this field. Everything now produces data, and people want specialists to handle it, but the number of data scientists being produced doesn’t match the demand, and neither do their qualifications.
Data is being produced exponentially, while data scientists are being produced linearly.
This imbalance is very significant.
We are all smart people here, we know the concept of “Demand and supply” in Economics, right?
Whenever there is a lot of demand and a small supply, the people who are demanding are ready to pay huge sums for the supply. So there is a lot of money in Data Science, but most of the Data Scientists being produced are not skilled enough to take advantage of it.
In 2006, Netflix announced a competition that changed Data Science forever. It had a whopping prize of $1 million for the top achiever. The competition involved building a more accurate recommendation system for Netflix. This is how “Netflix and chill” came to be a powerhouse in Data Science. More such competitions take place almost every day now (of course not with the same amount of money, but beggars can’t be choosers). I guess the sooner you get involved with Data Science, the higher your chance of being successful, and you also get to say, “Don’t ask me to chill, I practically invented it”.
What does a Data Scientist do?
There are 5 pillars of Data Science:
Data Gathering (this is a statistical concept and is the most important part of Data Science)
Data Analysis
Data Cleaning
Data Modelling
Predicting Outcomes and Future Insights
For understanding, let’s imagine that we are building a house from scratch and learn the process in a more practical form.
Let’s tackle each, one by one:
1. Data Gathering
This is the 1st step in building a house: the step where we gather information, like choosing the place to live by comparing different plots of land. We have to find the perfect place and skilled workers, otherwise it becomes a burden and we start hating the very thing we swore to create.
So this is the data collection part, where we collect all the data we are going to work on. This is more of a statistical concept.
Just imagine a native Hindi (an Indian language) speaking student with no understanding of English being asked to write their paper in English. This is an analogy for bad data practices. A particular format should be followed when gathering data, otherwise it’s just gibberish and you just lost your job! Collecting data is also a piece of art, and not like modern art (just kidding).
Data Gathering Fact:
To gather personal data about people, researchers require a permit, and without it no medical study can be conducted.
And you know who is the best person for collecting human data?
If you said nurses, something is wrong with you, but the answer is correct. Most medical surveys are conducted through nurses. They deal with patients directly, know a lot about them, and have enough medical understanding to fill in a survey properly.
2. Data Analysis
This is the step where we make the blueprint of our house: everything we want in it, swimming pool or no swimming pool.
This is the part where your creativity comes in. Here we find insights from the data presented to us. Initially, certain common steps can be followed, but after those it comes down to only one thing: do you have the caliber to decipher the patterns in the data?
Data Analysis is more than just representing data through beautiful graphs and figures. It helps you and others visualize what the data looks like and what can be inferred from it. All the basic and advanced inferences are found at this step, and they are the stepping stones for everything that follows.
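As a rough idea of what those first common steps can look like, here is a minimal sketch that summarises a handful of made-up temperature readings using only Python’s standard library. The cities, the numbers and the field names are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up readings; in practice these would come from the gathered data.
records = [
    {"city": "Delhi", "temp_c": 35},
    {"city": "Delhi", "temp_c": 37},
    {"city": "Mumbai", "temp_c": 31},
    {"city": "Mumbai", "temp_c": 33},
    {"city": "Mumbai", "temp_c": 29},
]

# Overall summary: a first look at the centre and spread of the data.
temps = [r["temp_c"] for r in records]
overall = {
    "mean": mean(temps),
    "median": median(temps),
    "min": min(temps),
    "max": max(temps),
}

# Group by a categorical column to compare the groups.
temps_by_city = defaultdict(list)
for r in records:
    temps_by_city[r["city"]].append(r["temp_c"])
city_means = {city: mean(values) for city, values in temps_by_city.items()}

print(overall)
# A crude text "bar chart": one # per degree.
for city, avg in city_means.items():
    print(f"{city:<8}{'#' * round(avg)} {avg}")
```

Even this tiny summary already suggests a pattern (one city runs hotter than the other), which is the kind of inference this step is about.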
Data Analysis Fact:
Data Analysis using graphs is usually done to explain findings to non-technical people. A person who has mastered this skill is regarded as a Data Analyst. They deal more with analysing data than with directly modelling it, and they are the person who communicates the data cycle to non-technical people.
3. Data Cleaning
This is the part where the hard structure is made: pouring the concrete and building a solid frame. While building this structure, several important rules have to be followed; for example, the house should be earthquake resistant and should pass all the basic criteria.
In this step, we clean the data to convert it into a form that is accepted by the machine (our computer). The computer only accepts data in numerical form, so all the other forms of data are converted into numerical form for modelling.
This is the step where the real difference between a Grand Master and a newbie is visible. It tests the patience of the modeller, as steps 3 and 4 are repeated in a loop with step 5 as the checking criterion.
You know the last wish of a Data Scientist right?
Add Cleaned Data to that. Now even God has started laughing, because even he hates probability: “God doesn’t play dice” (famous Albert Einstein quote).
Data Cleaning is not just about converting data into one form; it is the part where we prepare the input we feed to our machine, which is the basis of machine learning. Only when we feed the right input will we receive the output we want.
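Here is a minimal sketch of turning a few messy rows into purely numerical input. It assumes two common choices that this article does not prescribe: one-hot encoding for the categorical column and filling a missing number with the column mean. The rows and names are made up.

```python
# Made-up raw rows: one categorical column, one numerical column with a gap.
raw_rows = [
    {"colour": "Red", "temp_c": 35.0},
    {"colour": "Blue", "temp_c": None},  # missing reading
    {"colour": "Green", "temp_c": 31.0},
]


def one_hot(value, categories):
    """Encode a label as a list of 0s with a single 1 at its position."""
    return [1 if value == category else 0 for category in categories]


# Fix the category order so every row is encoded the same way.
categories = sorted({row["colour"] for row in raw_rows})

# Assumption: replace a missing temperature with the mean of the known ones.
known_temps = [row["temp_c"] for row in raw_rows if row["temp_c"] is not None]
fill_value = sum(known_temps) / len(known_temps)

cleaned_rows = []
for row in raw_rows:
    temp = row["temp_c"] if row["temp_c"] is not None else fill_value
    cleaned_rows.append(one_hot(row["colour"], categories) + [temp])

print(categories)
for numbers in cleaned_rows:
    print(numbers)
```

Every cleaned row is now a plain list of numbers of the same length, which is the shape most models expect as input.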
Data here is usually split into 3 sets:
Training Dataset: To train the model.
Validation Dataset: To check the trained model on data held back from what we already have, so we can tune it and compare alternatives.
Test Dataset: To test the final model on data it has never seen, standing in for the new data that will only become available later.
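The split itself can be sketched in a few lines. The 70/15/15 proportions, the fixed seed and the function name `split_dataset` are assumptions for illustration; the right proportions depend on how much data you have.

```python
import random


def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle a copy of the rows and cut it into train, validation and test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left over
    return train, val, test


rows = list(range(100))  # stand-in for 100 cleaned data points
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))
```

Shuffling before cutting matters: if the rows happen to be sorted (say, by date or by class), an unshuffled split would train and test on very different kinds of data.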
In data science, machine learning, and especially AI, the interior functioning of the model becomes a black box (something even the data scientists building it don’t fully understand). This is contrary to the belief that the rules are set by the data scientists. Instead, the data scientists just set up the inputs, and the machine sets up its own rules upon encountering those inputs.
For example, Google Translate reportedly went from roughly 500,000 lines of code, with coders manually feeding in all the cases for a particular language, to around 500 lines of machine learning code with huge amounts of input data being fed to it. Isn’t it scary to learn that these machines might know more about us than we do about ourselves?
4. Data Modelling
The structure is ready, now comes the interior design (with unparalleled control of the females of the house). People think this is the most important step. But they forget that the house can survive for a long time only when the base is strong. All the steps before this one build up to this moment.
In Data Modelling we use multiple models, ranging from Linear Regression to Tree-based models. I’ll not go into detail about them in this article, but this is the algorithm-building part of the process. Algorithms are often given less importance than the data being fed in, but that should not be the case; we have to strike a balance between the two. Once the input is prepared, we need to choose the best algorithm for that input. This is sometimes more like trial and error, but there is a process called Cross-validation, which repeatedly trains each model on one part of the data and scores it on the held-out part, and can be used to compare different models. Let’s keep it general for the time being.
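To show the idea of Cross-validation without any library, here is a minimal sketch that compares two very simple models on made-up data: a baseline that always predicts the average, and a straight-line fit. The data, the choice of 5 folds and the helper names are assumptions for illustration; in practice you would usually shuffle the data first and use a library implementation.

```python
def fit_mean(xs, ys):
    """Baseline model: always predict the average of the training outcomes."""
    average = sum(ys) / len(ys)
    return lambda x: average


def fit_line(xs, ys):
    """Straight-line model fitted by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def cross_val_mse(fit, xs, ys, k=5):
    """Average mean squared error over k folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        # Every k-th point, starting at `fold`, is held out for scoring.
        held_out = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        model = fit(train_x, train_y)
        errors = [(model(xs[i]) - ys[i]) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Made-up data: roughly y = 2x + 1 with a small wobble.
xs = list(range(20))
ys = [2 * x + 1 + (0.5 if x % 2 else -0.5) for x in xs]

scores = {
    "mean baseline": cross_val_mse(fit_mean, xs, ys),
    "straight line": cross_val_mse(fit_line, xs, ys),
}
best_model = min(scores, key=scores.get)  # lower error is better
print(scores)
print(best_model)
```

Because every point is held out exactly once, each model is always scored on data it did not train on, which makes the comparison fairer than scoring on the training data.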
Data Modelling Gossip:
Recently, a data science competition, the Deepfake Detection Challenge, was held on Kaggle, another data science site. But it was struck by controversy (intense music plays): the top performer used a dataset from YouTube to make a synthetic dataset. Facebook stated, “You may notice that the top rankings have changed. Unfortunately, the top two teams in the preliminary standings used external data sources in their winning submissions that were not allowed under the rules of this competition.”
The rules of the competition do allow teams to use data other than the official competition data to develop and test models and submissions. Teams however must “(i) ensure the External Data is available to use by all participants of the competition for purposes of the competition at no cost to the other participants and (ii) post such access to the External Data for the participants to the official competition forum before the Entry Deadline.”
Though the data they acquired was publicly available from a YouTube repository, they didn’t have permission from each individual in that dataset to use their faces. Several Kagglers and Data Science experts came out in their support, since they were using the dataset under a public license and, in their view, had not violated the competition guidelines.
This is a tricky topic even for an experienced practitioner.
There is nothing informative for a beginner here, I just included it for fun. And also to point out that Facebook gets in a lot of trouble, even when it tries to do something good.
5. Predicting Outcomes and Future Insights
Now our home is ready, the only thing left is to invite people and learn their views. Everybody has a different opinion of your house, good or bad.
Their views can be used to make the interior design better, or to improve a certain look of the house, but the opposite effect can also be observed. This is the final step: we measure the performance of our model using a specific metric, such as accuracy, to see how well it did.
After this step, all the other steps can be performed in a loop to improve the model.
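Which metric to use depends on the kind of problem. As a minimal sketch, here are two common ones written out by hand: accuracy for classification and root mean squared error (RMSE) for regression. The true values and predictions below are made up.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of the prediction error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Classification: 3 of the 4 made-up predictions are right.
true_labels = ["dog", "cat", "dog", "cat"]
predicted_labels = ["dog", "cat", "cat", "cat"]
model_accuracy = accuracy(true_labels, predicted_labels)

# Regression: made-up true and predicted amounts.
true_values = [10.0, 20.0, 30.0]
predicted_values = [12.0, 18.0, 30.0]
model_rmse = rmse(true_values, predicted_values)

print(model_accuracy)
print(round(model_rmse, 3))
```

Higher is better for accuracy and lower is better for RMSE, so it helps to be clear about which direction "improvement" means before looping back through the earlier steps.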
Amazon uses data science for its customer help desk. Connecting the right customers to the right help is important. Most customers might have a problem with their delivery, but a few might call about an issue with their Kindle. Because most customer problems are related to delivery, the person on the phone might not be qualified to handle that particular issue. Amazon resolved this by building a data set of callers from their purchase details, including whether or not they owned a Kindle. That way the caller has to answer fewer questions but can still get the best service. This helped Amazon improve customer satisfaction, which is one of the factors responsible for making it the mammoth it is now.
Now all the top companies in the world use Data Science in some form or other, just like Amazon, and the companies that adopted it the quickest are at the top of the list.
This brings us to the end of the article. It is a gentle introduction to the field of data science without any mathematics, using gossip and facts to make the article a more enriching experience for new readers.
My aim was to make you understand the presence and functioning of Data Science and Data Scientist, the buzzword of the 21st century.
Any positive or negative feedback is welcome. [Write in the comments what you thought about this article.]
Hope this article inspires people to take up Data Science!
Author: Ayan Dogra
Linkedin Link: https://www.linkedin.com/in/ayan-dogra-4516311a4
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.