Big Data Learning Path for all Engineers and Data Scientists out there

Saurabh.jaju2 09 Oct, 2019 • 10 min read

Introduction

The field of big data is vast, and getting started with it and its related technologies can be daunting. The technologies are numerous, and it can be overwhelming to decide where to begin.

That is why I wrote this article. It gives you a guided path to start learning big data and to help you land a job in the big data industry. The biggest challenge is identifying the right role for your interests and skill set.

To tackle this problem, I explain each big data role in detail and consider the different job roles of engineers and computer science graduates.

I have tried to answer the questions you have, or will encounter, while learning big data. To help you choose a path according to your interests, I have added a tree map that will help you identify the right one.

 

Table of Contents

  1. How to get started?
  2. What roles are up for grabs in the big data industry?
  3. What is your profile, and where do you fit in?
  4. Mapping roles to profiles
  5. How to be a big data engineer?
    • What is the big data jargon?
    • Systems and architecture you need to know
    • Learn to design solutions and technologies
  6. Big Data Learning Path
  7. Resources

 

1. How to get started?

One of the very first questions that people ask me when they want to start studying Big data is, “Do I learn Hadoop, Distributed computing, Kafka, NoSQL or Spark?”

Well, I always have one answer: “It depends on what you actually want to do”.

So, let’s approach this problem in a methodical way. We are going to go through this learning path step by step.

 

2. What roles are up for grabs in the big data industry?

There are many roles in the big data industry, but broadly speaking they fall into two categories:

  • Big Data Engineering
  • Big Data Analytics

These fields are interdependent but distinct.

Big data engineering revolves around the design, deployment, acquisition and maintenance (storage) of large amounts of data. The systems that big data engineers design and deploy make relevant data available to various consumer-facing and internal applications.

Big data analytics, in contrast, revolves around using the large amounts of data held in the systems that big data engineers design. It involves analyzing trends and patterns and developing classification, prediction and forecasting systems.

In brief, big data analytics involves advanced computations on the data, whereas big data engineering involves designing and deploying the systems and setups on top of which those computations are performed.

 

3. What is your profile, and where do you fit in?

Now that we know what categories of roles are available in the industry, let us identify which profile suits you, so that you can work out where you may fit in.

Broadly, based on your educational background and industry experience we can categorize each person as follows:

  • Educational Background

(This includes interests and doesn’t necessarily point towards your college education).

  1. Computer Science
  2. Mathematics

 

  • Industry Experience
  1. Fresher
  2. Data Scientist
  3. Computer Engineer (working on data-related projects)

 Thus, by using the above categories you can define your profile as follows:

Eg 1: “I am a computer science grad with no work experience and fairly solid math skills.”

You have an interest in computer science or mathematics, but with no prior experience you will be considered a Fresher.

Eg 2: “I am a computer science grad working as a database developer.”

Your interest is in computer science, and you fit the role of a Computer Engineer (data-related projects).

Eg 3: “I am a statistician working as a data scientist.”

You have an interest in mathematics and fit the role of a Data Scientist.

So, go ahead and define your profile.

(The profiles we define here are essential in finding your learning path in the big data industry).

 

4. Mapping roles to profiles

Now that you have defined your profile, let’s go ahead and map it to the roles you should target.

 

4.1 Big Data Engineering roles

If you have good programming skills and understand the basics of how computers interact over the internet, but have no interest in mathematics and statistics, you should go for big data engineering roles.

 

4.2 Big data Analytics roles

If you are good at programming and your education and interests lie in mathematics and statistics, you should go for big data analytics roles.

 

5. How to be a big data Engineer?

Let us first define what a big data engineer needs to know and learn to be considered for a position in the industry. The first step is to identify your needs. You can’t just start studying big data without doing so; otherwise, you would just be shooting in the dark.

To define your needs, you must know the common big data jargon. So let’s find out what big data actually means.

 

5.1 The Big Data jargon

A Big data project has two main aspects –  data requirements and the processing requirements.

  • 5.1.1 Data Requirements jargon

Structure: As you know, data can be stored either in tables or in files. If data is stored in a predefined data model (i.e. it has a schema), it is called structured data. If it is stored in files and does not have a predefined model, it is called unstructured data. (Types: Structured/Unstructured)

Size:  With size we assess the amount of data. (Types: S/M/L/XL/XXL/Streaming)

Sink Throughput: Defines at what rate data can be accepted into the system. (Types: H/M/L)

Source Throughput: Defines at what rate data can be updated and transformed into the system. (Types: H/M/L)

 

  • 5.1.2 Processing Requirements jargon

Query time: The time that a system takes to execute queries. (Types: Long/ Medium /Short)

Processing time: Time required to process data (Types: Long/Medium/Short)

Precision: The accuracy of data processing (Types: Exact/ Approximate)
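To make the jargon concrete, here is a minimal sketch of how these terms could be recorded for a project. The class name `RequirementsProfile` and its field names are my own illustrative choices, not part of any big data tool; only the allowed values come from the lists above.

```python
from dataclasses import dataclass

# Allowed values, copied from the "Types" given in the text.
SIZES = ("S", "M", "L", "XL", "XXL", "Streaming")
LEVELS = ("H", "M", "L")
DURATIONS = ("Long", "Medium", "Short")
PRECISIONS = ("Exact", "Approximate")


@dataclass(frozen=True)
class RequirementsProfile:
    """Hypothetical container for the data and processing requirements."""
    structured: bool         # data with a predefined schema is present
    unstructured: bool       # schema-less files are present
    size: str                # one of SIZES
    sink_throughput: str     # one of LEVELS
    source_throughput: str   # one of LEVELS
    query_time: str          # one of DURATIONS
    processing_time: str     # one of DURATIONS
    precision: str           # one of PRECISIONS

    def __post_init__(self):
        # Reject values outside the scales used in the text.
        checks = [
            (self.size, SIZES),
            (self.sink_throughput, LEVELS),
            (self.source_throughput, LEVELS),
            (self.query_time, DURATIONS),
            (self.processing_time, DURATIONS),
            (self.precision, PRECISIONS),
        ]
        for value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"{value!r} is not one of {allowed}")

    def structure(self) -> str:
        """Describe the structure requirement in words."""
        if self.structured and self.unstructured:
            return "Structured and unstructured"
        if self.structured:
            return "Structured"
        if self.unstructured:
            return "Unstructured"
        raise ValueError("a project must have some kind of data")


# Example values are made up purely for illustration.
profile = RequirementsProfile(
    structured=True, unstructured=True, size="L",
    sink_throughput="H", source_throughput="M",
    query_time="Medium", processing_time="Short", precision="Exact",
)
print(profile.structure())
```

Writing the requirements down in one place like this makes it easier to compare candidate technologies against them later.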

 

5.2 Systems and architecture you need to know

Scenario 1: Design a system for analyzing sales performance of a company by creating a  data lake from multiple data sources like customer data, leads data, call center data, sales data, product data, weblogs etc.

 

5.3 Learn to design solutions and technologies

Solution for Scenario 1: Data Lake for sales data

(This is my personal solution. You may come up with a more elegant one; if you do, please share it below.)

So, how does a data engineer go about solving the problem?

A point to remember: a big data system must not only integrate data from various sources seamlessly and keep it available at all times. It must also make analyzing the data and using it to build applications (an intelligent dashboard, in this case) easy, fast and always available.

Defining the end goal:

  1. Create a data lake by integrating data from multiple sources.
  2. Automate updates of the data at regular intervals (probably weekly in this case).
  3. Keep the data available for analysis (round the clock, perhaps even daily).
  4. Provide an architecture for easy access and seamless deployment of an analytics dashboard.

Now that we know what our end goals are, let us try to formulate our requirements in more formal terms.

 

  • 5.3.1 Data related Requirements

Structure: Most of the data is structured and has a defined data model. But data sources like weblogs, customer interactions/call center data, image data from the sales catalog and product advertising data are unstructured. The availability of, and need for, image and multimedia advertising data may vary from company to company.

Conclusion: Both Structured and unstructured data

Size: L or XL (choice Hadoop)

Sink throughput: High

Quality: Medium (Hadoop & Kafka)

Completeness: Incomplete

 

  • 5.3.2 Processing related Requirements

Query Time: Medium to Long

Processing Time: Medium to Short

Precision: Exact

As multiple data sources are being integrated, it is important to note that different data will enter the system at different rates. For example, the weblogs will be available in a continuous stream with a high level of granularity.
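As a rough illustration of the point about arrival rates, the sketch below sorts the scenario's sources into streaming and batch ingestion. Only the continuous weblog stream and the weekly refresh are taken from the text; the other intervals, the one-minute threshold and the function name `ingestion_mode` are assumptions made for the example.

```python
from datetime import timedelta

# Hypothetical arrival intervals for some of the sources in Scenario 1.
# Weblogs (continuous) and the weekly refresh come from the text;
# the call center figure is a placeholder.
ARRIVAL_INTERVALS = {
    "weblogs": timedelta(seconds=1),
    "call center data": timedelta(days=1),
    "sales data": timedelta(weeks=1),
    "customer data": timedelta(weeks=1),
}


def ingestion_mode(interval, stream_threshold=timedelta(minutes=1)):
    """Return 'stream' for sources that arrive faster than the threshold."""
    return "stream" if interval <= stream_threshold else "batch"


plan = {name: ingestion_mode(gap) for name, gap in ARRIVAL_INTERVALS.items()}
for name, mode in plan.items():
    print(f"{name}: {mode}")
```

In a real project the threshold would follow from how fresh the dashboard needs to be, not from a fixed constant.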

Based on the above analysis of our requirements for the system we can recommend the following big data setup.

 

6. Big Data Learning Path

Now that you have an understanding of the big data industry, the different roles and what is required of a big data practitioner, let’s look at the path you should follow to become a big data engineer.

As we know, the big data domain is littered with technologies, so it is crucial that you learn the ones that are relevant to and aligned with your big data job role. This is a bit different from conventional domains like data science and machine learning, where you start somewhere and endeavor to cover everything in the field.

Below you will find a tree which you should traverse in order to find your own path. Even though some of the technologies in the tree are marked as a data scientist’s forte, it is always good to know all the technologies down to the leaf nodes if you embark on a path. The tree is derived from the lambda architecture paradigm.

With the help of this tree map, you can select a path that matches your interests and goals and then start your journey to learn big data.
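Since the tree is derived from the lambda architecture, it may help to see the idea in miniature. The toy class below is a single-process sketch under my own naming (`LambdaPipeline`, `ingest`, `run_batch`, `query` are all hypothetical): a batch view is recomputed from the full event log, a speed view counts events that arrived since the last batch run, and queries merge the two. Real systems would use the technologies in the tree for each layer.

```python
from collections import Counter


class LambdaPipeline:
    """Toy lambda architecture: batch view plus speed view, merged on query."""

    def __init__(self):
        self.master = []              # append-only log of all events
        self.batch_view = Counter()   # recomputed from the whole log
        self.speed_view = Counter()   # events seen since the last batch run

    def ingest(self, key):
        """Record an event in the log and update the real-time view."""
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Rebuild the batch view from scratch and reset the speed view."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, key):
        """Serve a count by combining the batch and speed views."""
        return self.batch_view[key] + self.speed_view[key]


pipeline = LambdaPipeline()
pipeline.ingest("page_a")
pipeline.ingest("page_a")
pipeline.run_batch()          # slow, complete recomputation
pipeline.ingest("page_a")     # arrives after the batch run
print(pipeline.query("page_a"))
```

The design choice to notice is that the batch layer never updates in place; it always recomputes from the full log, while the speed layer only covers the gap until the next batch run.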

One essential skill that any engineer who wants to deploy applications must have is Bash scripting. You must be very comfortable with Linux and Bash scripting; this is an essential requirement for working with big data.

At the core, most big data technologies are written in Java or Scala. But don’t worry: if you do not want to code in these languages, you can choose Python or R, because most big data technologies now support Python and R extensively.

Thus, you can start with any of the above-mentioned languages. I would recommend choosing either Python or Java.

Next, you need to be familiar with working on the cloud, because nobody is going to take you seriously if you haven’t worked with big data on the cloud. Try practicing with small datasets on AWS, SoftLayer or any other cloud provider. Most of them have a free tier so that students can practice. You can skip this step for the time being if you like, but be sure to work on the cloud before you go for any interview.

Next, you need to learn about distributed file systems. The most popular DFS out there is the Hadoop Distributed File System (HDFS). At this stage you can also study a NoSQL database that you find relevant to your domain. The diagram below helps you select a NoSQL database to learn based on the domain you are interested in.

The path up to this point covers the mandatory basics that every big data engineer must know.

Now is the point at which you decide whether you would like to work with data streams or with large volumes of data at rest. This is the choice between two of the four V’s that are used to define big data (Volume, Velocity, Variety and Veracity).

So let’s say you have decided to work with data streams to develop real-time or near-real-time analysis systems. Then you should take the Kafka path; otherwise, take the MapReduce path. And thus you follow the path that you create. Do note that in the MapReduce path you do not need to learn both Pig and Hive; studying one of them is sufficient.
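To give a feel for the programming model behind the MapReduce path, here is a minimal single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own, and the shuffle step that a framework performs across machines is simulated with a dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "Big streams"]))
```

Because each map call looks at one line and each reduce call at one key, both phases can be spread over many machines, which is what makes the model suit large volumes of data at rest.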

 

In summary, here is the way to traverse the tree:

  1. Start at the root node and traverse the tree depth-first.
  2. Stop at each node and check out the resources given in the link.
  3. If you have decent knowledge and are reasonably confident at working with the technology then move to the next node.
  4. At every node try to complete at least 3 programming problems.
  5. Move on to the next node.
  6. Reach the leaf node.
  7. Start with the alternative path.

Did the last step (#7) baffle you? Well, truth be told, no application has only stream processing or only slow, delayed batch processing of data. Thus, you technically need to be a master at executing the complete lambda architecture.
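The traversal steps listed above amount to a depth-first, pre-order walk of the tree. The sketch below shows that walk on a simplified, hypothetical version of the tree that I pieced together from the text (Bash, a language, cloud, HDFS, then the Kafka and MapReduce branches); the real infographic has more nodes and may order them differently.

```python
# Simplified, hypothetical learning tree: each node maps to its children.
LEARNING_TREE = {
    "Bash scripting": ["Python or Java"],
    "Python or Java": ["Cloud"],
    "Cloud": ["HDFS"],
    "HDFS": ["Kafka", "MapReduce"],
    "Kafka": ["Spark Streaming"],
    "MapReduce": ["Hive or Pig"],
    "Spark Streaming": [],
    "Hive or Pig": [],
}


def learning_order(tree, root):
    """Return nodes in depth-first, pre-order sequence starting at root."""
    order = []

    def visit(node):
        order.append(node)              # study this node first
        for child in tree.get(node, []):
            visit(child)                # then go as deep as possible

    visit(root)
    return order


for step, topic in enumerate(learning_order(LEARNING_TREE, "Bash scripting"), 1):
    print(step, topic)
```

Note how the walk finishes the whole Kafka branch before it starts on MapReduce, which is exactly the "reach the leaf node, then start with the alternative path" rule.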

Also, note that this is not the only way you can learn big data technologies. You can create your own path as you go along. But this is a path which can be used by anybody.

If you want to enter the big data analytics world you could follow the same path but don’t try to perfect everything.

For a data scientist capable of working with big data, you need to add a couple of machine learning pipelines to the tree and concentrate on those pipelines more than on the tree itself. But we can discuss ML pipelines later.

Add a NoSQL database of choice based on the type of data you are working with in the above tree.

As you can see there are loads of NoSQL databases to choose from. So it always depends on the type of data that you would be working with.

And to give a definitive answer on which type of NoSQL database you need, you have to take into account your system requirements such as latency, availability, resilience and accuracy, and of course the type of data you are dealing with.
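As a starting point for that choice, the sketch below maps a rough description of your data to one of the four common NoSQL families, each with a well-known example. The data-shape labels and the function name `suggest_family` are my own shorthand, and this is only a rule of thumb: latency, availability and the other requirements mentioned above still have to be weighed separately.

```python
# Common NoSQL families with well-known example databases.
NOSQL_FAMILIES = {
    "key-value": ["Redis"],
    "document": ["MongoDB"],
    "wide-column": ["Apache Cassandra", "Apache HBase"],
    "graph": ["Neo4j"],
}

# Hypothetical shorthand for data shapes, mapped to a family.
SHAPE_TO_FAMILY = {
    "lookups by key": "key-value",
    "nested records": "document",
    "wide sparse tables": "wide-column",
    "relationships": "graph",
}


def suggest_family(data_shape):
    """Return (family, examples) for a known data shape, else None."""
    family = SHAPE_TO_FAMILY.get(data_shape)
    if family is None:
        return None
    return family, NOSQL_FAMILIES[family]


print(suggest_family("relationships"))
```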

 

7. Resources

1. Bash Scripting

2. Python

3. Java

4. Cloud

5. HDFS

6. Apache Zookeeper

7. Apache Kafka

8. SQL

9. Hive

10. Pig

11. Apache Storm

12. Amazon Kinesis

13. Apache Spark

14. Apache Spark Streaming

 

End Notes

I hope you enjoyed reading this article. With the help of this learning path, you will be able to embark upon your journey in the big data industry. I have covered most of the major concepts you will need to land a job.

If you have any doubts or questions, feel free to post them below.


Responses From Readers


Syed Ishrathullah 24 Mar, 2017

Great article..thanks .. I am based in the UK and work in IT Security as an INformation Security Manager.. How do you think IT security will integrate with Big Data. In what way.. and what skills will be needed ?

Parag k 25 Mar, 2017

Thanks for details .. most helpful document with reference link

venkatesh 25 Mar, 2017

Thanks for the great article It helped a lot to understand

Taran Bhagat 25 Mar, 2017

Very useful article. I'm a civil engineering professor. Also fond of construction project management. How can Big data analytics or big data engineering help me. ?

anup@AV 28 Mar, 2017

very good article Sourabh ..gives lot of clarity.

Syed 28 Mar, 2017

Hi Saurabh, A couple of points. 1) You havent mentioned SAS. Any reason why as it holds good sway in Big data anaytics ? 2) I also wanted your view on the timelines needed to pursue these courses . for instance ..someone like me who is an Info Sec Manager wanting to get into Big Data will try following your article ( as i loved it ) and go for Bash Scripting Review -- ( 1 week since by background in Comp Science) Python - ( 1 month) Hadoop ( 2 months) AWS ( 2 months) Kafka ( 2 months) Spark ( 2 months) What;s your view pls? Regards Syed

vaishnavi 31 Mar, 2017

hii i am it engineer 2016 passed out also completed two months hadoop course from one institute.now i'm thinking for post graduation program in big data and analytics, should i go for it or should wait and get some experience

Akash 08 Apr, 2017

Please give any other resource for bash scripting.

Hari Harikrishnan 23 Apr, 2017

This is brilliant article...tying it back to a lambda article and showing the path...nicely done! A must-read for anyone who is thinking "big data". I am going to look around here to see if you've a similar one for algorithms (ML/stats) too.

Ashwini K 05 May, 2017

Thank for the article. I am finding the learning paths very useful.

Prakhar Gupta 17 May, 2017

Hi Saurabh, Great article, extremely informative! I am currently working in the Data Warehousing domain and looking to jump into the big ocean of Big Data. Do you suggest learning from a professional training provider or self-learning? Also, can any one provide any pointers on a good training provider (weekend class) in Delhi NCR?

Joseph kiphizi 24 May, 2017

Good article...It helped me alot on how to start learning big data From Arusha Tanzania.

Maqsood 05 Jun, 2017

wow! what a great article

Jay Shah 08 Jun, 2017

You guys have done amazing job, especially Saurabh! This article is amazing. I would like to share my little background as I live in New Jersey and came to US back in 2005. I just got my masters from NJIT in information systems, Informatics. After my undergrad, I started working for Merck & Co. and it has been 2 years with this great company. I would like to understand as many sources are provided and I am a self-learner. There are many jobs for data engineering but it is really hard to tackle all the resources at once. How to get my foot in the door for data engineer? as I have been playing around with Python, Full MEAN Stack, AWS, Haddop within Hive and Spark.

Ahsan Hameed 12 Jul, 2017

Good article. Can you provide me the name of the trainer in SanFrancisco Bay Area and silicon valley area.

Evi 24 Jul, 2017

Amazing!!!!! Thank you!!!!

rakesh kumar 25 Jul, 2017

As a fresher, what is the scope of beginning a career in BIG DATA using in India? Register here: BIG DATA-Spark with Scala

Siva 06 Aug, 2017

Hi, Thanks for the article, it's extremely helpful. I'm right now working in analytics platform in an IT company. So I have some 6 months experience in hdfs , hive , oozie and other normal data related tools. So what do you think should be my next step. Im not good at statistics or maths. And I want to know whether big data scientist and data engineer are same. Can I do a post graduation in big data ? Thanks in advance

Anjali 08 Sep, 2017

Heya i’m for the first time here. I found this article and blog is very interesting helpful. The comments on this blog show the trust people have in you. Thanks, find It really useful & it helped me out a lot.

amar 20 Sep, 2017

We were looking for such kind of information about Big data that how to get start with big data .The information shared over here is just what we needed .It sounds helpful for learning and future perspective as well.

singh 23 Sep, 2017

Hi Sourabh, Thank for sharing the knowledgeable article. I have been preparing to become Big data analytics. can you share any resource for learning Big Data on AWS? I have completed big data fundamental session looking for intermediate that resource you have shared that's price is bit high. can you share any alternate way? Thanks in advance.

Indicium 19 Nov, 2017

Great article, thanks!

maze 29 Dec, 2017

Nice blog. Big data is becoming an effective basis for competition in pretty much every industry.

Mounika 06 Jan, 2018

Great Article Answered almost all the questions of a reader Actually I have decided to learn Big data and didn't know where to start with.....Thanks for explaining which role best suits whom.

Reshma 16 Jan, 2018

Hello Sir, I am working as SEO Executive/ Digital marketing. I want to change my Career path. I am not getting which one to choose a career in big data or advanced excel in analytics. I have 2 yrs work exp in SEO, But I am not enjoying my work. If I choose big data, what are possibilities of getting the job because I will enter this field as fresher? Can you please guide me.

Suman 25 Jan, 2018

hi Kunal, I work as a BA in data migration project of US health insurance. I am interested towards health insurance or health care analytics domain. I am browsing and found that Big data has a huge impact in health care/health insurance. I dont want to go for a data engineer role rather want to go for the big data analytics role. Please suggest which course will be good. My preference: > https://www.jigsawacademy.com/full-stack-big-data-analytics/ > https://www.edureka.co/?utm_source=google-search&utm_medium=cpc&utm_campaign=Brand-Search-IN&gclid=Cj0KCQiA-qDTBRD-ARIsAJ_10yLtqQCiT68MiKsPlcfHd2uX7BvDXcXbof6fPU9Bx7zLgzcv1XPzuGIaAtuDEALw_wcB Could you please suggest which course is good for analytics role but not big data engineer?

R-Algo Engineering Big Data 20 Feb, 2018

Awesome article! With the variety of majors that focus on big data, choosing a degree that focuses on big data can be tough. I know some great data scientists that majored in mathematics using R for the course of their undergrad. Seems more degrees are starting to focus on big data and data engineering and the career path seems endless.

Rajan Vishwakarma 12 Mar, 2018

Data Scientists are professionals who can analyze and explain complex digital data. They organize varying data elements with various techniques including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, visualization, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.

Anwar 21 Apr, 2018

Nice Article...Thanks
