Learning Path: Step by Step Guide for Beginners to Learn SparkR
Introduction
Lately, I’ve been reading the book Data Scientists at Work to draw some inspiration from successful data scientists. Among other things, I found that most of them emphasized the evolution of Spark and its incredible computational power.
This piqued my interest in Spark. Since then, I’ve done extensive research on the topic, digging up every bit of information I could find.
Fortunately, Spark has extensive packages for different programming languages. Being an R user, I think my natural inclination toward SparkR is justified.
After I finished my research, I realized there is no structured learning path available for SparkR. I even connected with folks keen to learn SparkR, but none of them had come across such a learning path. Have you faced the same difficulty? If yes, here’s your answer.
This inspired me to create this step by step learning path. I’ve listed the best resources available on SparkR. If you complete these 7 steps thoroughly, you should acquire an intermediate level of proficiency in SparkR. However, your journey from intermediate to expert level will require hours of practice. You knew that, right? Let’s begin!
Step 1: What is Spark? Why do we need it?
Spark is an Apache project promoted as “lightning fast cluster computing”. Its astonishing speed makes it up to 100x faster than Hadoop MapReduce in memory, and up to 10x faster on disk. For large-scale data processing, Spark has become the first choice of many data scientists and engineers today.
You’ll see Amazon, eBay, Yahoo and Facebook all using Spark for data processing on insanely large data sets. Apache Spark has one of the fastest growing big data communities, with more than 750 contributors from 200+ companies worldwide. According to O’Reilly’s 2015 Data Science Salary Survey, Apache Spark skills added $11,000 to the median salary.
To explore the amazing world of Spark in detail, you can refer to this article.
You can also watch this video to learn more about the value that Spark has added to the business world:
However, if you are more of a reader, you can skip the video and check out this recommended blog.
Interesting Read: Apache officially sets a new record in large scale sorting
Step 2: What is SparkR?
Being R users, let’s focus our attention on SparkR.
R is one of the most widely used programming languages in data science. With its simple syntax and ability to run complex algorithms, it is probably the first choice of language for beginners.
But R suffers from one limitation: its data processing capacity is bounded by the memory of a single node, which limits the amount of data you can process. Now you know why R runs out of memory when you attempt to work on large data sets. SparkR overcomes exactly this problem.
Along with R, Apache Spark provides APIs for various languages such as Python, Scala, Java and SQL. These APIs act as a bridge connecting these tools with Spark.
For a detailed view of SparkR, this is a must watch video:
Note: SparkR has a limitation. Currently, it only supports linear predictive models. Therefore, if you were excited to run a boosting algorithm on SparkR, you might have to wait until the next version is rolled out.
Step 3: Setting up your Machine
If you are still reading, I presume this new technology has sparked a curiosity in you and that you are determined to complete this journey. So, let’s move on with setting up the machine:
To install SparkR, we first need to install Spark on our systems, since SparkR runs on top of it.
The following resources will help you install Spark on your respective OS:
Once you are done with the Spark installation, it takes just a few extra steps to initiate SparkR. The following resources will help you initiate SparkR locally:
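For orientation, here is a minimal sketch of what initiating SparkR from a plain R session looks like, assuming a Spark 1.6-era installation (the API current at the time of writing) and that the SPARK_HOME path below matches your own setup:

```r
# Point R at the SparkR library bundled with the Spark installation
# (the SPARK_HOME value is an assumption -- adjust it to your system)
Sys.setenv(SPARK_HOME = "/usr/local/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

library(SparkR)

# Start a local Spark context and the SQL context used in the later steps
sc <- sparkR.init(master = "local[2]", appName = "SparkR-learning-path")
sqlContext <- sparkRSQL.init(sc)
```

When you are done with a session, sparkR.stop() shuts the context down.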
Step 4: Getting the Basics Right
Start with R: I assume you already know some R if you are interested in working with Big Data. However, if R is not your domain, this course by DataCamp will help you get started with R.
Exercise: Install the swirl package in R and complete its full set of exercises.
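If it helps, the whole exercise setup is three lines of R:

```r
# Install swirl from CRAN, load it, and launch the interactive lessons
install.packages("swirl")
library(swirl)
swirl()
```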
Database handling with SQL: SQL is widely used in SparkR to implement functionality with simple commands, which cuts down the number of lines of code you have to write and speeds up common operations (see the sketch after the exercise below). If you are not familiar with SQL, you should do this course by Codecademy.
Exercise: Practice 1 and Practice 2
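To see why SQL earns its place here, below is a small sketch of querying a Spark DataFrame with plain SQL from SparkR. It assumes the sqlContext created in Step 3; mtcars is just a built-in R data set used for illustration:

```r
# Turn a local R data.frame into a Spark DataFrame and register it as a table
df <- createDataFrame(sqlContext, mtcars)
registerTempTable(df, "cars")

# One SQL statement replaces several chained DataFrame operations
result <- sql(sqlContext, "SELECT cyl, AVG(mpg) AS avg_mpg
                           FROM cars GROUP BY cyl ORDER BY avg_mpg DESC")
head(collect(result))
```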
Step 5: Data Exploration with SparkR and SQL
Once your basics are in place, it’s time to learn to work with SparkR and SQL.
SparkR enables a number of data exploration operations using a combination of R and SQL. The most common ones are select, collect, groupBy, summarize, subset and arrange. You can learn these operations with this article.
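As an illustration, here is a sketch of these operations on R’s built-in faithful data set, again assuming the Spark 1.6 sqlContext from Step 3:

```r
# Create a Spark DataFrame from R's built-in faithful data set
df <- createDataFrame(sqlContext, faithful)

# select a column, then collect() to bring the result back as an R data.frame
head(collect(select(df, df$eruptions)))

# groupBy + summarize: average waiting time for each eruption duration
agg <- summarize(groupBy(df, df$eruptions), avg_wait = avg(df$waiting))

# subset rows matching a condition, and arrange results in descending order
long_eruptions <- subset(df, df$eruptions > 3)
head(collect(arrange(agg, desc(agg$avg_wait))))
```

Note that select, subset and arrange return Spark DataFrames; only collect() pulls data back into your R session.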
Exercise: Do this exercise by AMPLab, Berkeley
Dataset used in the above exercise: Download
Step 6: Building Predictive Models (Linear) on SparkR
As mentioned above, SparkR currently supports only linear modeling algorithms such as regression. However, this constraint should be short-lived; I expect an updated version supporting non-linear models to be rolled out soon.
SparkR implements linear modeling using the glm function. Spark itself, meanwhile, has a machine learning library known as MLlib (for more info on MLlib, click here), which supports non-linear modeling.
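As a sketch of what model building looks like (reusing the faithful DataFrame from Step 5), SparkR’s glm deliberately mirrors base R’s interface:

```r
# Fit a Gaussian GLM (i.e. a linear regression) on a Spark DataFrame
model <- glm(waiting ~ eruptions, data = df, family = "gaussian")
summary(model)

# Predictions come back as a Spark DataFrame; collect() to inspect locally
predictions <- predict(model, df)
head(collect(predictions))

# For logistic regression, switch the family; the column names below
# (label, x1, x2) are hypothetical placeholders for a binary-label data set
# model <- glm(label ~ x1 + x2, data = df, family = "binomial")
```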
Learn and Practice: To build your first linear regression model on SparkR, follow this link. To build a logistic regression model, follow this link.
Step 7: Integrating SparkR with Hive for Faster Computation
SparkR works even faster with Apache Hive for database management.
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Integrating Hive with SparkR would help running queries even faster and more efficiently.
If you want to step into big data, using Hive is a great advantage for efficient data processing. You can install Hive by following the links given for your respective OS:
After you’ve installed Hive successfully, you can start integrating it with SparkR using the steps demonstrated in this video. Alternatively, if you are more comfortable reading, the same material is available in text format on this blog.
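For orientation, here is a minimal sketch of switching to a Hive-backed context in the Spark 1.6 API, assuming the sc context from Step 3 (the table name is a hypothetical placeholder):

```r
# Initialize a HiveContext instead of a plain SQLContext
hiveContext <- sparkRHive.init(sc)

# Run HiveQL directly against Hive-managed tables
# ("my_table" is a hypothetical name -- substitute one of your own tables)
results <- sql(hiveContext, "SELECT COUNT(*) AS n FROM my_table")
head(collect(results))
```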
For a quick overview of SparkR, you can also follow its official documentation.
End Notes
I hope that I have made the learning path clear enough to accelerate your journey into data science using SparkR.
SparkR is often seen as an intermediate step in switching to Big Data using R. I learned SparkR because I used to face immense difficulty working on large data sets in R. SparkR provided me a convenient and cost-free way to continue my learning.
In addition, SparkR can give an R user a head start in transitioning into the big data industry. It is far more powerful than what I have explored so far.
Did you find this article helpful? Have you worked on SparkR? Do share your suggestions / experience in the comments section below.
20 thoughts on "Learning Path: Step by Step Guide for Beginners to Learn SparkR"
subro says: July 01, 2016 at 6:40 am
Thank you, great learning path. BTW, what other books do you read on data science?

Dox vK says: July 01, 2016 at 12:21 pm
After reading this article I did some research myself and found the following link that I thought deserves a read: https://databricks.com/blog/2016/06/15/an-introduction-to-writing-apache-spark-applications-on-databricks.html

Vinod Pathak says: July 01, 2016 at 1:27 pm
What are your views on sparklyr? I think this is also a new way to handle big data. It gives dplyr capability and several machine learning algorithms. Check out this link: http://spark.rstudio.com/

mirriam ndunge says: July 01, 2016 at 1:28 pm
This is informative.

Harneet says: July 01, 2016 at 6:48 pm
Can we integrate Spark with Python as done in SparkR?

Shashwat Srivastava says: July 02, 2016 at 6:18 am
Thanks for sharing the article. It would definitely be helpful for those wanting to write Spark applications.

Shashwat Srivastava says: July 02, 2016 at 6:23 am
Yes Harneet, we can integrate Spark with Python using Spark's Python API, PySpark. To get a brief idea about it, you can follow this link: https://spark.apache.org/docs/0.9.0/python-programming-guide.html

KarthiAru says: July 03, 2016 at 2:27 am
How does this performance compare to sparklyr (http://spark.rstudio.com)?

Neil Dewar says: July 03, 2016 at 2:03 pm
Shashwat, thank you for an informative post - I'm sure it will draw a lot of R users into trying SparkR. I would add a couple of comments to it. (1) In terms of machine learning functionality, SparkR currently provides two functions: Generalized Linear Models (GLM) and Naive Bayes. (2) There are numerous quotes out there that the data scientist spends 50-80% of their time in data preparation. The SparkR data exploration (and manipulation) functions that you mention are highly valuable in preparing large data sets for analysis. (3) SparkR presents a full set of R functionality to the user. SparkR's limitations that you mention apply only to SparkR DataFrames, which get processed multi-threaded on a cluster. If you use any R functions on R data structures, they are processed single-threaded, and are no more efficient than if you ran them on your local machine. (4) Probably the most important thing I've learned about SparkR is that SparkR creates, manipulates, and processes Spark DataFrames, which are not the same as R data.frames. The user needs to stay constantly aware of whether their variable is a Spark DataFrame or an R data.frame. Most of the error messages you get as you explore are likely to relate to erroneously trying to perform SparkR functions on R data structures, or R functions on Spark DataFrames. Let's hope that the Spark team rapidly ports other functions from the Spark MLlib library to work on DataFrames (and not just RDDs) - once that happens they will surely start to become available in SparkR!

Neil Dewar says: July 03, 2016 at 2:07 pm
Databricks is a great resource for people wanting to learn Spark. If you don't want to go through a local installation of Spark, Hive etc., you can use Spark in the cloud by signing up for a free Databricks Community Edition account. Once you have the account, you will have access to Databricks' great training materials. Here's a link to the Community Edition sign-up: https://databricks.com/try-databricks?x=1&utm_expid=115457034-1.KtU7kcGAQAu9HxHVJLjLHg.1

Prabeesh K. says: July 04, 2016 at 10:14 am
Nice blog. Well said.

Shashwat Srivastava says: July 04, 2016 at 12:12 pm
Thanks Subro. You can check out some of the best books on data science in one of our blogs: http://www.analyticsvidhya.com/blog/2014/06/books-data-scientists-or-aspiring-ones/

Shashwat Srivastava says: July 04, 2016 at 12:18 pm
Hi Vinod, sparklyr is quite a new package and is very useful for implementing Spark packages. There is no doubt that it is very useful for beginners in Spark. In fact, it is still developing, with more and more features getting embedded into it. Thanks for mentioning it. I'm sure it will be utilised by those interested in Spark.

Shashwat Srivastava says: July 04, 2016 at 12:24 pm
Hi Neil, thanks for sharing the information. It is because of enthusiastic community members like you that we are constantly inspired to explore further. I am sure the details you shared above will be extremely useful for the community.

Shashwat Srivastava says: July 04, 2016 at 1:23 pm
Thanks Prabeesh.

Neil Dewar says: July 04, 2016 at 2:11 pm
Thank you for your kind words Shashwat. Hot off the press ... I just saw that Spark 1.6.2 was released, and the SparkR page on the Spark website now includes the words: "SparkR also supports distributed machine learning using MLlib." I haven't had a chance to research it yet, but it looks like SparkR may now have access to the full suite of machine learning tools available in other Spark languages.

Shashwat Srivastava says: July 04, 2016 at 2:37 pm
Although the Spark website may have such a mention, the official SparkR documentation hasn't confirmed the same. However, one thing we can be sure of is the implementation of MLlib on SparkR in its future versions. Regards

Raghu says: July 05, 2016 at 7:25 am
Great work bro, I am gonna try this now. I am a beginner in this field and many a time I end up here for my questions.

skjainmiah says: July 11, 2016 at 4:00 pm
Nice to hear about SparkR. I am going to study an MSc in Data Science. What suggestions would you give me to get a head start in my career?

Harneet says: July 16, 2016 at 9:16 am
Hi all, there is a tool called "Zeppelin" which has interpreters for Spark languages like R, Python etc. I am trying to install it as per the instructions available over the net but am unable to do it successfully. Has anyone installed it? If yes, can you please help me? Regards, Harneet.