Learning Path : Step by Step Guide for Beginners to Learn SparkR

Shashwat Srivastava 04 Jul, 2016 • 5 min read

Introduction

Lately, I’ve been reading the book Data Scientists at Work to draw some inspiration from successful data scientists. Among other things, I found that most of them emphasized the evolution of Spark and its incredible computational power.

This piqued my interest to know more about Spark. Since then, I’ve done extensive research on the topic, gathering every bit of information I could find.

Fortunately, Spark has extensive packages for different programming languages. Being an R user, my natural inclination towards SparkR is, I think, justified.

After I finished the research, I realized there was no structured learning path available for SparkR. I even connected with folks who are keen to learn SparkR, but none of them had come across such a path. Have you faced the same difficulty? If yes, here’s your answer.

This inspired me to create this step by step learning path. I’ve listed the best resources available on SparkR. If you complete all 7 steps thoroughly, you can expect to acquire an intermediate level of proficiency in SparkR. However, the journey from intermediate to expert level will require hours of practice. You knew that, right? Let’s begin!


 

Step 1: What is Spark? Why do we need it?

Spark is an Apache project promoted as “lightning fast cluster computing”. Its astonishing computing speed makes it up to 100x faster than Hadoop MapReduce in memory, and 10x faster on disk. For large-scale data processing, Spark has become the first choice of many data scientists and engineers today.

Amazon, eBay, Yahoo, Facebook: everyone is using Spark for data processing on insanely large data sets. Apache Spark has one of the fastest growing big data communities, with more than 750 contributors from 200+ companies worldwide. According to the 2015 Data Science Salary Survey by O’Reilly, Apache Spark skills added $11,000 to the median salary.

To explore the amazing world of Spark in detail, you can refer to this article.

You can also watch this video to learn more about the value that Spark has added to the business world:

However, if you are more of a reader, you can skip the video and check this recommended blog.

Interesting Read: Apache Spark officially sets a new record in large-scale sorting

 

Step 2: What is SparkR?

Being R users, let’s focus our attention on SparkR.

R is one of the most widely used programming languages in data science. With its simple syntax and ability to run complex algorithms, it is probably the first choice of language for beginners.

But R suffers from a limitation: its data processing capacity is bounded by the memory of a single node, which limits the amount of data you can work with. Now you know why R runs out of memory when you attempt to process large data sets. To overcome this memory problem, we can use SparkR.

Along with R, Apache Spark provides APIs for various languages such as Python, Scala, Java, and SQL. These APIs act as a bridge connecting these tools with Spark.
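To make that bridge concrete, here is a minimal sketch of SparkR in action. It assumes Spark 2.x, where sparkR.session() is the entry point; on Spark 1.x you would use sparkR.init() and sparkRSQL.init() instead.

```r
# A sketch, assuming Spark 2.x is installed and SparkR is on the library path.
library(SparkR)
sparkR.session(master = "local[*]", appName = "SparkRIntro")

# A local R data.frame becomes a distributed SparkDataFrame
df <- createDataFrame(faithful)
head(df)   # operations on df now run through Spark, not base R

sparkR.session.stop()
```

The key point: df above is a Spark DataFrame, not an R data.frame, so its operations are distributed across the cluster rather than confined to one node’s memory.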

For a detailed view of SparkR, this video is a must-watch:

Note: SparkR has a limitation. Currently, it only supports linear predictive models. Therefore, if you were excited to run a boosting algorithm on SparkR, you may have to wait until the next version is rolled out.

 

Step 3 : Setting up your Machine

If you are still reading, I presume this new technology has sparked a curiosity in you and that you are determined to complete this journey. So, let’s move on to setting up the machine:

To install SparkR, we first need to install Spark itself, since SparkR runs on top of it.

The following resources will help you install Spark on your respective OS:

  1. Windows
  2. Ubuntu
  3. Mac OS

Once Spark is installed, it takes just a few extra steps to initiate SparkR. The following resources will help you initiate SparkR locally:

  1. Windows
  2. Ubuntu
  3. Mac OS
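If the installation went through, initiating SparkR from a plain R console looks roughly like this. The SPARK_HOME path below is a placeholder for your own install location, and the sketch assumes Spark 2.x (on 1.x the last line would be sparkR.init()).

```r
# Point R at the Spark installation, then load the SparkR package that
# ships inside it. Replace the path with your actual Spark directory.
Sys.setenv(SPARK_HOME = "/path/to/spark")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

sparkR.session(master = "local[*]")   # Spark 2.x entry point
```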

 

Step 4 : Getting the Basics Right

Start with R: I assume you already know R if you are interested in working with big data. However, if R is not your domain, this course by DataCamp will help you get started.

Exercise: Install the swirl package in R and complete its full set of exercises.

Database handling with SQL: SQL is widely used in SparkR to implement functions easily with simple commands. This reduces the lines of code you have to write and speeds up operations. If you are not familiar with SQL, you should take this course by Codecademy.

Exercise:  Practice 1 and Practice 2

 

Step 5 : Data Exploration with SparkR and SQL

Once your basics are in place, it’s time to learn to work with SparkR and SQL.

SparkR enables us to use a number of data exploration operations through a combination of R and SQL, the most common ones being select, collect, groupBy, summarize, subset and arrange. You can learn these operations with this article.

Exercise: Do this exercise by the AMPLab at UC Berkeley

Dataset used in above exercise: Download
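The operations listed above can be sketched on R’s built-in faithful dataset as follows. This assumes Spark 2.x; on Spark 1.x, registerTempTable() plays the role of createOrReplaceTempView(), and sql() takes a sqlContext argument.

```r
library(SparkR)
sparkR.session(master = "local[*]")

df <- createDataFrame(faithful)   # distributed SparkDataFrame

# select columns and subset rows, SQL-style
head(select(df, df$eruptions))
head(filter(df, df$waiting > 70))

# group, aggregate, and sort
counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(arrange(counts, desc(counts$count)))

# collect() brings a (small) result back as a local R data.frame
local_counts <- collect(counts)

# the same data queried through plain SQL via a temporary view
createOrReplaceTempView(df, "faithful")
head(sql("SELECT waiting FROM faithful WHERE eruptions > 4"))
```

Note how R-style functions and raw SQL interoperate on the same SparkDataFrame; this mix is exactly what makes the SQL basics from Step 4 pay off.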

 

Step 6 : Building Predictive Models (Linear) on SparkR

As mentioned above, SparkR currently supports only linear modeling algorithms such as regression. However, this constraint should only last so long; I expect an updated version supporting non-linear models to be rolled out soon.

SparkR implements linear modeling using the glm function. Spark itself also ships with a machine learning library known as MLlib (for more info on MLlib, click here), which supports non-linear modeling.

Learn and Practice: To build your first linear regression model on SparkR, follow this link. To build a logistic regression model, follow this link.
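For orientation, a linear and a logistic model via SparkR’s glm look roughly like this. The sketch assumes Spark 2.x and uses the iris dataset; note that createDataFrame() replaces dots in column names with underscores (Sepal.Length becomes Sepal_Length).

```r
library(SparkR)
sparkR.session(master = "local[*]")

df <- createDataFrame(iris)   # columns renamed: Sepal.Length -> Sepal_Length

# family = "gaussian" gives ordinary linear regression
lin_model <- glm(Sepal_Length ~ Sepal_Width + Petal_Length,
                 data = df, family = "gaussian")
summary(lin_model)

# family = "binomial" gives logistic regression on a 0/1 response
df$is_setosa <- ifelse(df$Species == "setosa", 1, 0)
log_model <- glm(is_setosa ~ Petal_Length, data = df, family = "binomial")
head(predict(log_model, df))
```

The formula interface mirrors base R’s glm, but the model is fitted by Spark on the distributed SparkDataFrame rather than in local memory.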

 

Step 7 : Integrating SparkR with Hive for Faster Computation

SparkR works even faster with Apache Hive for database management.

Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis. Integrating Hive with SparkR helps run queries even faster and more efficiently.

If you want to step into big data, using Hive is a real advantage for efficient data processing. You can install Hive by following the links below for your respective OS:

  1. For Windows
  2. For Ubuntu
  3. For Mac OS

After you’ve installed Hive successfully, you can start integrating it with SparkR using the steps demonstrated in this video. Alternatively, if you are more comfortable reading, the same content is available in text format on this blog.
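Once Hive is set up, the integration on the SparkR side is small. This is a sketch assuming Spark 2.x built with Hive support; the table name src is purely illustrative.

```r
library(SparkR)
sparkR.session(master = "local[*]", enableHiveSupport = TRUE)

# DDL and queries go straight to the Hive metastore via sql()
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
results <- sql("SELECT key, value FROM src")
head(results)
```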

For a quick overview of SparkR, you can also refer to its official documentation.

 

End Notes

I hope that I have made the learning path clear enough to accelerate your journey into data science using SparkR.

SparkR is often seen as an intermediate step for moving into big data with R. I learned SparkR because I used to find it immensely difficult to work on large data sets in R. SparkR provided me with a convenient and cost-free way to continue my learning.

In addition, SparkR can give an R user a head start in transitioning into the big data industry. It is far more powerful than what I have explored so far.

Did you find this article helpful? Have you worked on SparkR? Do share your suggestions / experience in the comments section below.

Want to apply your analytical skills and test your potential? Participate in our hackathons to compete with data scientists from all over the world.

I am a keen learner who thrives on learning new and interesting aspects of science. Data and AI amaze me as they bring forth new challenges. The idea of how fast the world is going to change, looking at the current trends in machine learning, is mind-boggling and at the same time a revolution in itself. Music, books and data science keep me busy.


Responses From Readers


subro 01 Jul, 2016

Thank you, Great learning Path, BTW what else books do you read on datascience

Dox vK 01 Jul, 2016

After reading this article I did some research myself and found the following link that I thought deserves a read: https://databricks.com/blog/2016/06/15/an-introduction-to-writing-apache-spark-applications-on-databricks.html

Vinod Pathak 01 Jul, 2016

What are your Views on sparklyr? I think this is also a new way to handle big data. It gives dplyr capability and several machine learning algorithms. Check out this link : http://spark.rstudio.com/

mirriam ndunge 01 Jul, 2016

this is informative

Harneet 01 Jul, 2016

Can we integrate spark with Python as done in SparkR

KarthiAru 03 Jul, 2016

How does this performance compared to sparklyr http://spark.rstudio.com?

Neil Dewar 03 Jul, 2016

Shashwat, thank you for an informative post - I'm sure it will draw a lot of R users into trying SparkR. I would add a couple of comments to it. (1) In terms of Machine Learning functionality, SparkR currently provides two functions: Generalized Linear Models (GLM) and Naive Bayes. (2) There are numerous quotes out there that the data scientist spends 50-80% of their time in data preparation. The SparkR data exploration (and manipulation) functions that you mention are highly valuable in preparing large data sets for analysis. (3) SparkR presents a full set of R functionality to the user. SparkR's limitations that you mention apply only to SparkR DataFrames, which get processed multi-threaded on a cluster. If you use any R functions on R data structures, they are processed single-threaded, and are no more efficient than if you ran them on your local machine. (4) Probably the most important thing I've learned about SparkR is that SparkR creates, manipulates, and processes Spark DataFrames, which are not the same as R data.frames. The user needs to stay constantly aware of whether their variable is a Spark DataFrame or an R data.frame. Most of the error messages you get as you explore are likely to relate to erroneously trying to perform SparkR functions on R data structures, or R functions on Spark DataFrames. Let's hope that the Spark team rapidly ports other functions from the Spark MLlib library to work on DataFrames (and not just RDDs) - once that happens they will surely start to become available in SparkR!

Prabeesh K. 04 Jul, 2016

nice blog. well said

Raghu 05 Jul, 2016

Great Work Bro, I am gonna try this now. I am a beginner in this field and many a times I end up here for my questions.

skjainmiah 11 Jul, 2016

Nice to hear about SparkR. I am going to study an MSc in Data Science; what suggestions would you give me to get a head start in my career?

Harneet 16 Jul, 2016

Hi All, there is a tool called "Zeppelin" which has interpreters for Spark, like R, Python etc. I am trying to install it as per the instructions available over the net but unable to do it successfully. Has anyone installed it? If yes, can you please help me. Regards, Harneet.

Philip 25 May, 2017

A nice article to start working with sparkr

Vijay 18 Sep, 2017

Can anyone please suggest me the learning path for PySpark?

Prasanta Panja 04 Feb, 2018

Hi, the Step 3 link to initiate SparkR for Windows does not work. Probably the steps below can be used. Set SPARK_HOME in R properly; SPARK_HOME should be set without the 'bin' part. For example, if the bin path within the Spark folder is "C:/Spark2.2.1/spark-2.2.1-bin-hadoop2.6/bin" then it should be set as "C:/Spark2.2.1/spark-2.2.1-bin-hadoop2.6/". So the steps go like this.
Step 1: Sys.setenv(SPARK_HOME="C:/Spark2.2.1/spark-2.2.1-bin-hadoop2.6/")
Step 2: sc <- spark_connect(master = "local")
Example of how to use: trainDataTbl <- copy_to(sc, trainData) ## Here trainData is an R dataframe and trainDataTbl is a Spark dataframe.
spark_kmeans % ml_kmeans(centers = 3, iter.max = 10, features = c("City_Category","Monthly_Income")) ## ml_kmeans is part of the spark library.
Thanks & Regards, Prasanta Panja
