Learning Path : Step by Step Guide for Beginners to Learn SparkR

Shashwat Srivastava 04 Jul, 2016 • 5 min read

Introduction

Lately, I’ve been reading the book Data Scientists at Work to draw some inspiration from successful data scientists. Among other things, I found that most of them emphasized the evolution of Spark and its incredible computational power.

This piqued my interest to know more about Spark. Since then, I’ve done extensive research on the topic, gathering every bit of information I could find.

Fortunately, Spark has extensive packages for different programming languages. Being an R user, my natural inclination towards SparkR is, I think, justified.

After I finished the research, I realized there was no structured learning path available for SparkR. I even connected with folks who are keen to learn SparkR, but none of them had come across such a path. Have you faced the same difficulty? If yes, here’s your answer.

This inspired me to create this step by step learning path. I’ve listed the best resources available on SparkR. If you complete all 7 steps thoroughly, you can expect to acquire an intermediate level of proficiency in SparkR. However, the journey from intermediate to expert level will require hours of practice. You knew that, right? Let’s begin!


 

Step 1: What is Spark? Why do we need it?

Spark is an Apache project promoted as “lightning fast cluster computing”. Its astonishing computing speed makes it up to 100x faster than Hadoop MapReduce in memory, and 10x faster on disk. For large-scale data processing, Spark has become the first choice of many data scientists and engineers today.

Amazon, eBay, Yahoo, Facebook: everyone is using Spark for data processing on insanely large data sets. Apache Spark has one of the fastest growing big data communities, with more than 750 contributors from 200+ companies worldwide. According to the 2015 Data Science Salary Survey by O’Reilly, Apache Spark skills added $11,000 to the median salary.

To explore the amazing world of Spark in detail, you can refer to this article.

You can also watch this video to learn more about the value that Spark has added to the business world:

However, if you are more of a reader, you can skip the video and check this recommended blog.

Interesting Read: Apache Spark officially sets a new record in large-scale sorting

 

Step 2: What is SparkR?

Being R users, let’s focus our attention on SparkR.

R is one of the most widely used programming languages in data science. With its simple syntax and ability to run complex algorithms, it is probably the first choice of language for beginners.

But R suffers from a limitation: its data processing capacity is bounded by the memory of a single node, which limits the amount of data you can work with. Now you know why R runs out of memory when you attempt to process large data sets. To overcome this memory problem, we can use SparkR.

Along with R, Apache Spark provides APIs for various languages such as Python, Scala, Java, and SQL. These APIs act as a bridge connecting these tools with Spark.
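To make that bridge concrete, here is a minimal sketch of SparkR in action. It assumes Spark 2.x, where sparkR.session() is the entry point; on Spark 1.x you would use sparkR.init() and sparkRSQL.init() instead.

```r
# A sketch, assuming Spark 2.x is installed and SparkR is on the library path.
library(SparkR)
sparkR.session(master = "local[*]", appName = "SparkRIntro")

# A local R data.frame becomes a distributed SparkDataFrame
df <- createDataFrame(faithful)
head(df)   # operations on df now run through Spark, not base R

sparkR.session.stop()
```

The key point: df above is a Spark DataFrame, not an R data.frame, so its operations are distributed across the cluster rather than confined to one node’s memory.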

For a detailed view of SparkR, this video is a must-watch:

Note: SparkR has a limitation. Currently, it only supports linear predictive models. Therefore, if you were excited to run a boosting algorithm on SparkR, you may have to wait until the next version is rolled out.

 

Step 3 : Setting up your Machine

If you are still reading, I presume this new technology has sparked a curiosity in you and that you are determined to complete this journey. So, let’s move on to setting up the machine:

To install SparkR, we first need to install Spark itself, since SparkR runs on top of it.

The following resources will help you install Spark on your respective OS:

  1. Windows
  2. Ubuntu
  3. Mac OS

Once Spark is installed, it takes just a few extra steps to initiate SparkR. The following resources will help you initiate SparkR locally:

  1. Windows
  2. Ubuntu
  3. Mac OS
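If the installation went through, initiating SparkR from a plain R console looks roughly like this. The SPARK_HOME path below is a placeholder for your own install location, and the sketch assumes Spark 2.x (on 1.x the last line would be sparkR.init()).

```r
# Point R at the Spark installation, then load the SparkR package that
# ships inside it. Replace the path with your actual Spark directory.
Sys.setenv(SPARK_HOME = "/path/to/spark")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

sparkR.session(master = "local[*]")   # Spark 2.x entry point
```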

 

Step 4 : Getting the Basics Right

Start with R: I assume you already know R if you are interested in working with big data. However, if R is not your domain, this course by DataCamp will help you get started.

Exercise: Install the swirl package in R and complete its full set of exercises.

Database handling with SQL: SQL is widely used in SparkR to implement functions easily with simple commands. This reduces the lines of code you have to write and speeds up operations. If you are not familiar with SQL, you should take this course by Codecademy.

Exercise:  Practice 1 and Practice 2

 

Step 5 : Data Exploration with SparkR and SQL

Once your basics are in place, it’s time to learn to work with SparkR and SQL.

SparkR enables us to use a number of data exploration operations through a combination of R and SQL, the most common ones being select, collect, groupBy, summarize, subset and arrange. You can learn these operations with this article.

Exercise: Do this exercise by the AMPLab at UC Berkeley

Dataset used in above exercise: Download
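The operations listed above can be sketched on R’s built-in faithful dataset as follows. This assumes Spark 2.x; on Spark 1.x, registerTempTable() plays the role of createOrReplaceTempView(), and sql() takes a sqlContext argument.

```r
library(SparkR)
sparkR.session(master = "local[*]")

df <- createDataFrame(faithful)   # distributed SparkDataFrame

# select columns and subset rows, SQL-style
head(select(df, df$eruptions))
head(filter(df, df$waiting > 70))

# group, aggregate, and sort
counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(arrange(counts, desc(counts$count)))

# collect() brings a (small) result back as a local R data.frame
local_counts <- collect(counts)

# the same data queried through plain SQL via a temporary view
createOrReplaceTempView(df, "faithful")
head(sql("SELECT waiting FROM faithful WHERE eruptions > 4"))
```

Note how R-style functions and raw SQL interoperate on the same SparkDataFrame; this mix is exactly what makes the SQL basics from Step 4 pay off.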

 

Step 6 : Building Predictive Models (Linear) on SparkR

As mentioned above, SparkR currently supports only linear modeling algorithms such as regression. However, this constraint should only last so long; I expect an updated version supporting non-linear models to be rolled out soon.

SparkR implements linear modeling using the glm function. Spark itself also ships with a machine learning library known as MLlib (for more info on MLlib, click here), which supports non-linear modeling.

Learn and Practice: To build your first linear regression model on SparkR, follow this link. To build a logistic regression model, follow this link.
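For orientation, a linear and a logistic model via SparkR’s glm look roughly like this. The sketch assumes Spark 2.x and uses the iris dataset; note that createDataFrame() replaces dots in column names with underscores (Sepal.Length becomes Sepal_Length).

```r
library(SparkR)
sparkR.session(master = "local[*]")

df <- createDataFrame(iris)   # columns renamed: Sepal.Length -> Sepal_Length

# family = "gaussian" gives ordinary linear regression
lin_model <- glm(Sepal_Length ~ Sepal_Width + Petal_Length,
                 data = df, family = "gaussian")
summary(lin_model)

# family = "binomial" gives logistic regression on a 0/1 response
df$is_setosa <- ifelse(df$Species == "setosa", 1, 0)
log_model <- glm(is_setosa ~ Petal_Length, data = df, family = "binomial")
head(predict(log_model, df))
```

The formula interface mirrors base R’s glm, but the model is fitted by Spark on the distributed SparkDataFrame rather than in local memory.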

 

Step 7 : Integrating SparkR with Hive for Faster Computation

SparkR works even faster with Apache Hive for database management.

Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis. Integrating Hive with SparkR helps run queries even faster and more efficiently.

If you want to step into big data, using Hive is a real advantage for efficient data processing. You can install Hive by following the links below for your respective OS:

  1. For Windows
  2. For Ubuntu
  3. For Mac OS

After you’ve installed Hive successfully, you can start integrating it with SparkR using the steps demonstrated in this video. Alternatively, if you are more comfortable reading, the same content is available in text format on this blog.
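Once Hive is set up, the integration on the SparkR side is small. This is a sketch assuming Spark 2.x built with Hive support; the table name src is purely illustrative.

```r
library(SparkR)
sparkR.session(master = "local[*]", enableHiveSupport = TRUE)

# DDL and queries go straight to the Hive metastore via sql()
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
results <- sql("SELECT key, value FROM src")
head(results)
```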

For a quick overview of SparkR, you can also refer to its official documentation.

 

End Notes

I hope that I have made the learning path clear enough to accelerate your journey into data science using SparkR.

SparkR is often seen as an intermediate step for moving into big data with R. I learned SparkR because I used to find it immensely difficult to work on large data sets in R. SparkR provided me with a convenient and cost-free way to continue my learning.

In addition, SparkR can give an R user a head start in transitioning into the big data industry. It is far more powerful than what I have explored so far.

Did you find this article helpful? Have you worked on SparkR? Do share your suggestions / experience in the comments section below.

Want to apply your analytical skills and test your potential? Participate in our hackathons to compete with data scientists from all over the world.

I am a keen learner who thrives on learning new and interesting aspects of science. Data and AI amaze me as they bring forth new challenges. The idea of how fast the world is going to change, looking at the current trends in machine learning, is mind-boggling and at the same time a revolution in itself. Music, books and data science keep me busy.


Responses From Readers


subro 01 Jul, 2016

Thank you, Great learning Path, BTW what else books do you read on datascience

Dox vK 01 Jul, 2016

After reading this article I did some research myself and found the following link that I thought deserves a read: https://databricks.com/blog/2016/06/15/an-introduction-to-writing-apache-spark-applications-on-databricks.html

Vinod Pathak 01 Jul, 2016

What are your Views on sparklyr? I think this is also a new way to handle big data. It gives dplyr capability and several machine learning algorithms. Check out this link : http://spark.rstudio.com/

mirriam ndunge 01 Jul, 2016

this is informative

Harneet 01 Jul, 2016

Can we integrate spark with Python as done in SparkR

KarthiAru 03 Jul, 2016

How does this performance compared to sparklyr http://spark.rstudio.com?

Neil Dewar 03 Jul, 2016

Shashwat, thank you for an informative post - I'm sure it will draw a lot of R users into trying SparkR. I would add a couple of comments to it. (1) In terms of Machine Learning functionality, SparkR currently provides two functions: Generalized Linear Models (GLM) and Naive Bayes. (2) There are numerous quotes out there that the data scientist spends 50-80% of their time in data preparation. The SparkR data exploration (and manipulation) functions that you mention are highly valuable in preparing large data sets for analysis. (3) SparkR presents a full set of R functionality to the user. SparkR's limitations that you mention apply only to SparkR DataFrames, which get processed multi-threaded on a cluster. If you use any R functions on R data structures, they are processed single-threaded, and are no more efficient than if you ran them on your local machine. (4) Probably the most important thing I've learned about SparkR is that SparkR creates, manipulates, and processes Spark DataFrames, which are not the same as R data.frames. The user needs to stay constantly aware of whether their variable is a Spark DataFrame or an R data.frame. Most of the error messages you get as you explore are likely to relate to erroneously trying to perform SparkR functions on R data structures, or R functions on Spark DataFrames. Let's hope that the Spark team rapidly ports other functions from the Spark MLlib library to work on DataFrames (and not just RDDs) - once that happens they will surely start to become available in SparkR!

Prabeesh K. 04 Jul, 2016

nice blog. well said

Raghu 05 Jul, 2016

Great Work Bro, I am gonna try this now. I am a beginner in this field and many a times I end up here for my questions.

skjainmiah 11 Jul, 2016

Nice to hear about SparkR. I am going to study an MSc in Data Science; what suggestions would you give me to get a head start in my career?

Harneet 16 Jul, 2016

Hi All, there is a tool called "Zeppelin" which has interpreters for Spark, like R, Python etc. I am trying to install it as per the instructions available over the net but unable to do it successfully. Has anyone installed it? If yes, can you please help me. Regards, Harneet.

Philip 25 May, 2017

A nice article to start working with sparkr

Vijay 18 Sep, 2017

Can anyone please suggest me the learning path for PySpark?

Prasanta Panja 04 Feb, 2018

Hi, the Step 3 link to initiate SparkR for Windows does not work. Probably the steps below can be used. Set SPARK_HOME in R properly; SPARK_HOME should be set without the 'bin' part. For example, if the bin path within the Spark folder is "C:/Spark2.2.1/spark-2.2.1-bin-hadoop2.6/bin" then it should be set as "C:/Spark2.2.1/spark-2.2.1-bin-hadoop2.6/". So the steps go like this.
Step 1: Sys.setenv(SPARK_HOME="C:/Spark2.2.1/spark-2.2.1-bin-hadoop2.6/")
Step 2: sc <- spark_connect(master = "local")
Example of how to use: trainDataTbl <- copy_to(sc, trainData) ## Here trainData is an R dataframe and trainDataTbl is a Spark dataframe.
spark_kmeans % ml_kmeans(centers = 3, iter.max = 10, features = c("City_Category","Monthly_Income")) ## ml_kmeans is part of the spark library.
Thanks & Regards, Prasanta Panja
