A Beginner’s Guide to Tidyverse – The Most Powerful Collection of R Packages for Data Science

avcontentteam 14 Jun, 2020

Introduction

Data scientists are often said to spend close to 70% (if not more) of their time cleaning, massaging and preparing data. That’s no secret: surveys of practitioners regularly report figures in that range. I can attest to it as well. It is simply the most time-consuming aspect of a data science project.

Unfortunately, it is also among the least interesting things we do as data scientists. There is no getting around it, though. It is an inevitable part of our role. We simply cannot build powerful and accurate models without ensuring our data is well prepared.

So how can we make this phase of our job interesting?

Welcome to the wonderful world of Tidyverse! It is the most powerful collection of R packages for preparing, wrangling and visualizing data. Tidyverse has completely changed the way I work with messy data – it has actually made data cleaning and massaging fun!

Source: tidyverse.org

If you’re a data scientist and have not yet come across Tidyverse, this article will blow your mind. I will show you the top R packages bundled within Tidyverse that make data preparation an enjoyable experience. We’ll also look at code snippets for each package to help you get started.

You can also check out my pick of the top eight useful R packages you should incorporate into your data science work.

What is Tidyverse?
Core R Packages in Tidyverse
1. Data Wrangling and Transformation
  - dplyr
  - tidyr
  - stringr
  - forcats
2. Data Import and Management
  - readr
  - tibble
3. Functional Programming
  - purrr
4. Data Visualization and Exploration
  - ggplot2
Some More Tidyverse Packages

What is Tidyverse?

Tidyverse is a collection of essential R packages for data science. The packages under the tidyverse umbrella help us import, tidy, transform and explore data. There are a whole host of things you can do with your data, such as subsetting, transforming, visualizing, etc.

Tidyverse was created by the great Hadley Wickham and his team with the aim of providing all these utilities to clean and work with data.

Let’s now look at some versatile Tidyverse libraries that the majority of data scientists use to manage and streamline their data workflows.

Core R Packages in Tidyverse

Ready to explore the tidyverse? Go ahead and install it directly from within RStudio:

install.packages("tidyverse")

We’ll be working on the food demand forecasting challenge in this article. I have taken a random 10% sample from the train file for faster computation. You can take the entire dataset if you want (and if your machine can support it!).

Let’s begin!

Data Wrangling and Transformation

dplyr

dplyr is one of my all-time favorite packages. It is simply the most useful package in R for data manipulation. One of its greatest advantages is the pipe operator “%>%”, which lets you chain functions together so that the output of one step becomes the input of the next. From filtering to grouping the data, this package does it all.

Here are the main functions dplyr offers:

select(): Select columns from your dataset
filter(): Keep only the rows that meet your criteria
group_by(): Group observations by one or more variables. The data itself does not change; the grouping is stored alongside it so that later functions such as summarise() work group by group
summarise(): Collapse each group (or the whole dataset) into a single row of summary statistics
arrange(): Sort the rows by one or more columns, in ascending or descending order
left_join(), right_join(), full_join() and inner_join(): Combine two datasets on a shared key
mutate(): Create new columns while preserving the existing variables

Let’s look at an example to understand how to use these different functions in R.

Open up the food forecasting dataset we downloaded earlier. We have 2 other files apart from the training set. We can join them with our train file to add more features. Let’s use dplyr and merge all the files. Again, I’m just using 10% of the overall data to make the computation faster.
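The join itself is not shown here, so the following is only a minimal sketch of what it could look like. The two small data frames are made-up stand-ins for samples of the train file and the center file (the column names follow the output below, the values for the centers are invented), and the object names train_sample, centers and joined_data are my own.

```r
library(dplyr)

# Hypothetical stand-in for a sample of the train file
train_sample <- data.frame(
  id         = c(1448490, 1446016, 1107611),
  center_id  = c(55, 55, 24),
  num_orders = c(40, 162, 54)
)

# Hypothetical stand-in for a sample of the center file;
# center 24 is deliberately missing, as it might be after random sampling
centers <- data.frame(
  center_id   = 55,
  center_type = "TYPE_C",
  op_area     = 2.0
)

# Keep every row of train_sample and attach center details where a match exists
joined_data <- left_join(train_sample, centers, by = "center_id")
head(joined_data)
```

Rows whose center_id has no match in the second table get NA in the new columns, which is the same pattern you can see in the output from the real files.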

Output:

       id week center_id meal_id checkout_price base_price emailer_for_promotion homepage_featured
1 1448490    1        55    2631         243.50     242.50                     0                 0
2 1446016    1        55    2290         311.43     310.43                     0                 0
3 1313873    1        55    2306         243.50     340.53                     0                 0
4 1440008    1        55    1962         582.03     612.13                     1                 0
5 1107611    1        24    1770         340.53     486.03                     0                 0
6 1298505    1        24    1198         147.50     191.09                     0                 0
  num_orders city_code region_code center_type op_area
1         40        NA          NA        <NA>      NA
2        162        NA          NA        <NA>      NA
3         28        NA          NA        <NA>      NA
4        231        NA          NA        <NA>      NA
5         54        NA          NA        <NA>      NA
6        148        NA          NA        <NA>      NA

Note: We see a lot of NAs here. This is because we randomly chose samples from each of the three files and then merged them. If you use the whole dataset, you will not observe this amount of missing values.

Next, let’s use three dplyr functions together to summarise the data. Here, we’ll keep only the rows where the ‘center_type’ variable is ‘TYPE_A’ and calculate the mean of the ‘num_orders’ variable for those centers:
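A minimal sketch of such a pipeline, assuming the three functions are select(), filter() and summarise(). The data frame is a tiny made-up example, so the number it produces is not the one from the real dataset.

```r
library(dplyr)

# Hypothetical joined data: order counts with the center type attached
joined_data <- data.frame(
  center_type = c("TYPE_A", "TYPE_A", "TYPE_B", NA),
  num_orders  = c(100, 200, 50, 40)
)

avg_orders <- joined_data %>%
  select(center_type, num_orders) %>%    # keep only the columns we need
  filter(center_type == "TYPE_A") %>%    # keep only rows for TYPE_A centers
  summarise(avg_A = mean(num_orders))    # collapse to a single mean

avg_orders
```

Note that filter() also drops the row where center_type is NA, because the comparison does not evaluate to TRUE for it.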

Here, %>% is called the pipe operator. It passes the result of one function on as the first argument of the next, which comes in handy when we want to chain several functions together.

Output:

   avg_A
1 286.3757

Go ahead and try out the other functions. Trust me, they will completely change the way you do data preparation.

tidyr

The tidyr package complements dplyr perfectly. It boosts the power of dplyr for data manipulation and pre-processing. Below are the main functions tidyr offers:

gather(): “Gathers” multiple columns from your dataset and converts them into key-value pairs (newer versions of tidyr recommend pivot_longer() for this)
spread(): Takes a key column and a value column and “spreads” them into multiple columns (pivot_wider() in newer versions)
separate(): As the name suggests, splits a single column into several columns
unite(): The opposite of separate(). It combines two or more columns into one

Let’s see a quick example of how to use tidyr. We’ll unite two binary variables and create only one column for both:
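A minimal sketch of the unite() call, assuming the two binary variables are emailer_for_promotion and homepage_featured and that the new column is called email_home, as the output suggests. The data frame train_sample is a hypothetical two-row stand-in for the real data.

```r
library(tidyr)

# Hypothetical rows carrying the two binary flags from the train file
train_sample <- data.frame(
  id                    = c(1448490, 1440008),
  emailer_for_promotion = c(0, 1),
  homepage_featured     = c(0, 0)
)

# Paste the two flags into one column called email_home, separated by "_";
# the two original columns are dropped by default (remove = TRUE)
united <- unite(train_sample, email_home,
                emailer_for_promotion, homepage_featured, sep = "_")
united
```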

Output:

    id week center_id meal_id checkout_price base_price email_home num_orders city_code region_code
1 1448490    1        55    2631         243.50     242.50        0_0         40        NA          NA
2 1446016    1        55    2290         311.43     310.43        0_0        162        NA          NA
3 1313873    1        55    2306         243.50     340.53        0_0         28        NA          NA
4 1440008    1        55    1962         582.03     612.13        1_0        231        NA          NA
5 1107611    1        24    1770         340.53     486.03        0_0         54        NA          NA
6 1298505    1        24    1198         147.50     191.09        0_0        148        NA          NA
  center_type op_area
1        <NA>      NA
2        <NA>      NA
3        <NA>      NA
4        <NA>      NA
5        <NA>      NA

Here’s another example of how tidyr works:
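The data frame used in this example is not shown, so here is a sketch that rebuilds one consistent with the output and then spreads it. The pivot_wider() call is an addition of mine showing the equivalent in tidyr 1.0.0 and later.

```r
library(tidyr)

# Long format: one row per (variable1, variable2) pair
data <- data.frame(
  variable1 = rep(c("A", "B", "C"), each = 3),
  variable2 = rep(c("factor1", "factor2", "factor3"), times = 3),
  num       = 1:9
)
head(data)

# One column per value of variable2, filled with the values of num
wide <- spread(data, variable2, num)
wide

# Equivalent call with the newer interface
wide_new <- pivot_wider(data, names_from = variable2, values_from = num)
```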

Output:

  variable1 variable2 num
1         A   factor1   1
2         A   factor2   2
3         A   factor3   3
4         B   factor1   4
5         B   factor2   5
6         B   factor3   6
> spread(data,variable2,num)
  variable1 factor1 factor2 factor3
1         A       1       2       3
2         B       4       5       6
3         C       7       8       9

We easily converted the factor variables into a table that can be swiftly interpreted without much pre-processing.

stringr

Dealing with string variables is a tricky challenge. They can often trip up our final analysis because we skipped over them initially, thinking they won’t affect our model. That’s a mistake.

stringr is my go-to package in R for such situations. It plays a big role in processing raw data into a cleaner and an easily understandable format. stringr contains a variety of functions that make working with string data really easy.

Some basic functions that you can perform with the stringr package are:

str_sub(): Extract substrings from a character vector
str_trim(): Trim whitespace from the start and end of a string
str_length(): Return the number of characters in a string
str_to_lower()/str_to_upper(): Convert a string to lower case or upper case

There are many more functions inside the stringr package. Let’s look at a couple of functions:
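A minimal sketch of the case-conversion calls. The value of x is taken from the output that follows.

```r
library(stringr)

x <- "Analytics Vidhya 001"

# Convert every letter to lower case or upper case; digits are left untouched
str_to_lower(x)
str_to_upper(x)
```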

Output:

> str_to_lower(x)
[1] "analytics vidhya 001"
> str_to_upper(x)
[1] "ANALYTICS VIDHYA 001"

Combine two strings:
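One way to do this is with str_c(), stringr's counterpart to paste0(). This is a sketch with example strings of my own.

```r
library(stringr)

first  <- "Analytics"
second <- "Vidhya"

# sep is inserted between the pieces being joined
combined <- str_c(first, second, sep = " ")
combined
```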

forcats

The forcats package is dedicated to dealing with categorical variables or factors. Anyone who has worked with categorical data knows what a nightmare they can be. forcats feels like a godsend.

It is quite frustrating when a factor appears where we least expect it. Tibbles help here because they never convert character columns to factors behind our backs. forcats fills in the remaining pieces, with helpers for counting, reordering and recoding levels, so we can access the power of factors with minimum effort.

Use the following example to experiment with factors in your data:
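The two-column layout of the output (f and n) is what fct_count() returns, so this sketch assumes that function was used. The factor below is a small made-up example, not the real center_type column.

```r
library(forcats)

# Hypothetical center types, including a missing value
center_type <- factor(c("TYPE_A", "TYPE_B", "TYPE_A", "TYPE_C", NA, "TYPE_A"))

# Count how often each level occurs; returns a tibble with columns f and n
counts <- fct_count(center_type)
counts

# Reorder the levels by frequency, most common first
by_freq <- fct_infreq(center_type)
levels(by_freq)
```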

Output:

# A tibble: 4 x 2
  f          n
  <fct>  <int>
1 TYPE_A  1890
2 TYPE_B   569
3 TYPE_C   537
4 NA     42657

Data Import and Management

Source: effiasoft.com

readr

We have plenty of ways to read data in R. So why use the readr package? The readr package solves the problem of parsing a flat file into a tibble. This provides an improvement over the standard file importing methods and significantly improves the computation speed.

You can easily read a .CSV file in the following way:

read_delim("filename.csv",delim=",")

Use this function and you’ll automatically see the difference in the time RStudio takes to read in huge data files.
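For comma-separated files, read_csv() is a shorthand for the read_delim() call above. The sketch below writes a tiny made-up file to a temporary location so that it can be run without the competition data.

```r
library(readr)

# Write a small, invented CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,week,num_orders",
             "1448490,1,40",
             "1446016,1,162"), tmp)

# Column types are guessed from the data and the result is a tibble
orders <- read_csv(tmp)
orders
```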

tibble

We work with dataframes in R all the time. It’s one of the first things we learn about R: convert your data into a dataframe before proceeding with any data science steps.

A tibble is a modern take on the dataframe. It truly stands out when we’re trying to detect anomalies in our dataset. How? A tibble does less and complains more: it does not change variable names or types, and it warns you when you try to access a variable that does not exist, where a plain dataframe would silently return NULL.

Together with its compact print() method, which shows only the first ten rows and the columns that fit on screen, the tibble package makes it easy to handle big datasets containing complex objects. Such features enable us to catch inherent data issues early on, producing cleaner code and data.
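To make the point about variable names concrete, here is a small sketch comparing a base dataframe with a tibble. The column name and values are invented for illustration.

```r
library(tibble)

# Base R rewrites a column name that is not a valid R identifier...
df <- data.frame(`checkout price` = c(243.5, 311.43))
names(df)

# ...whereas a tibble keeps it exactly as typed
tb <- tibble(`checkout price` = c(243.5, 311.43))
names(tb)
```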

library(tibble)
# as_tibble() replaces the deprecated as.tibble()
data <- as_tibble(train)
head(data)

Notice how the data type is mentioned along with the column names. This is a very useful way to present data. Using the above example we can easily see how R gives a “tibble” output to the users:

Output:

# A tibble: 456,548 x 9
       id  week center_id meal_id checkout_price base_price emailer_for_pro~ homepage_featur~
    <int> <int>     <int>   <int>          <dbl>      <dbl>            <int>            <int>
 1 1.38e6     1        55    1885           137.       152.                0                0
 2 1.47e6     1        55    1993           137.       136.                0                0
 3 1.35e6     1        55    2539           135.       136.                0                0
 4 1.34e6     1        55    2139           340.       438.                0                0
 5 1.45e6     1        55    2631           244.       242.                0                0
 6 1.27e6     1        55    1248           251.       252.                0                0
 7 1.19e6     1        55    1778           183.       184.                0                0
 8 1.50e6     1        55    1062           182.       183.                0                0
 9 1.03e6     1        55    2707           193.       192.                0                0
10 1.05e6     1        55    1207           326.       384.                0                1
# ... with 456,538 more rows, and 1 more variable: num_orders <int>

The train file that we converted to the tibble format now gives us a clearer look at the data types and number of variables. Looks pretty neat and tidy, right?

Functional Programming

purrr

The purrr package in R provides a complete toolkit for enhancing R’s functional programming. We can use the functions provided by purrr to replace many loops with just one line of code.

Which function do you typically use to check the mean of every column in your data? Most data scientists using R tend to lean on the summary() function. It gives us the descriptive statistics for each column.

An even better way to compute just the mean values, without using any ugly loops, is to use the “map” family of functions. Let’s see how we can do that using our training set:

map_dbl(train,~mean(.x))

Output:

                  id                  week             center_id               meal_id 
         1.250096e+06          7.476877e+01          8.210580e+01          2.024337e+03 
       checkout_price            base_price emailer_for_promotion     homepage_featured 
         3.322389e+02          3.541566e+02          8.115247e-02          1.091999e-01 
           num_orders 
         2.618728e+02
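The same pattern on a tiny made-up data frame, in case you want to try it without the training set. The name train_sample and its values are hypothetical.

```r
library(purrr)

train_sample <- data.frame(week = c(1, 2, 3), num_orders = c(40, 162, 28))

# map_dbl() applies the function to each column and returns a named numeric vector
col_means <- map_dbl(train_sample, mean)
col_means

# The formula shorthand makes it easy to pass extra arguments such as na.rm
col_means_na <- map_dbl(train_sample, ~ mean(.x, na.rm = TRUE))
```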

Data Visualization and Exploration

ggplot2

I’m sure you must have heard of ggplot2. It is far and away the best visualization package I have ever used. Data scientists love using ggplot2 to produce their charts and visualizations. It’s such a useful and popular package that it has even been ported to Python!

There is so much we can do with this package. Whether it’s building box plots, density plots, violin plots, tile plots, time series plots – you name it and ggplot2 has a function for it.

Let’s see a few examples of how to create some informative plots with ggplot2 in R.

‘num_orders’ is the target variable in our food forecasting dataset. Let’s look at its distribution by generating a density chart:
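A minimal sketch of a density plot with ggplot2. The data here is simulated to be right-skewed and only stands in for the real num_orders column; the fill colour and labels are arbitrary choices.

```r
library(ggplot2)

# Simulated, right-skewed stand-in for num_orders (not the real data)
set.seed(42)
orders <- data.frame(num_orders = round(rlnorm(1000, meanlog = 5, sdlog = 1)))

density_plot <- ggplot(orders, aes(x = num_orders)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  labs(x = "num_orders", y = "Density")

density_plot
```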

As you can see above, the dependent variable is right-skewed.

Now, how about drawing up a violin plot? It’s a nice alternative to boxplots for detecting outliers:
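A sketch of a violin plot, again on simulated data. Splitting num_orders by center_type is an assumption of mine; any categorical variable can go on the x axis.

```r
library(ggplot2)

# Simulated stand-in data: 200 skewed order counts per center type
set.seed(42)
orders <- data.frame(
  center_type = rep(c("TYPE_A", "TYPE_B", "TYPE_C"), each = 200),
  num_orders  = round(rlnorm(600, meanlog = 5, sdlog = 1))
)

# One violin per center type; the long upper tails hint at outliers
violin_plot <- ggplot(orders, aes(x = center_type, y = num_orders)) +
  geom_violin(fill = "steelblue", alpha = 0.5)

violin_plot
```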

Woah. There are plenty of outliers in our data. Don’t you love how a simple visualization offers up so many insights?

Next, plot a scatterplot to check the relationship between the checkout price and the base price:
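A sketch of the scatterplot call. The prices are simulated so that the two columns are roughly linearly related; they are not taken from the real dataset.

```r
library(ggplot2)

# Simulated prices: checkout price is the base price plus some noise
set.seed(42)
base_price <- runif(300, min = 100, max = 600)
prices <- data.frame(
  base_price     = base_price,
  checkout_price = base_price + rnorm(300, sd = 30)
)

scatter_plot <- ggplot(prices, aes(x = checkout_price, y = base_price)) +
  geom_point(alpha = 0.4)

scatter_plot
```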

Interestingly, there seems to be a pretty strong linear relationship between the two variables. We can certainly dig deeper into this when we’re working on this challenge to understand how these variables affect our overall model building strategy.

The power of visualization never ceases to amaze me.

Some More Tidyverse Packages

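These packages belong to the wider tidyverse but are not part of its core, so they are not attached when you call library(tidyverse) and each one has to be loaded explicitly. (lubridate is the exception in recent releases: it has been attached by default since tidyverse 2.0.0.) I have also provided the installation commands in this section in case a package is not already on your machine.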

Importing Data

readxl: This package is very useful when you want to import Excel sheets in R:

install.packages("readxl")
library(readxl)
data <- read_xlsx("filename.xlsx")

haven: For importing SPSS, STATA and SAS data:

install.packages("haven")
library(haven)
dat = read_sas("path to file", "path to formats catalog")

googledrive: For importing Google Drive files:

Data Wrangling

lubridate: The best R package for working with date-time data. lubridate provides a series of functions that are a permutation of the letters “m”, “d” and “y” to represent the ordering of month, day and year:
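A minimal sketch of three of these parsers. The input strings are my own, chosen so that the result matches the dates shown in the output.

```r
library(lubridate)

# The function name states the order of the components in the input string
dates <- c(
  ymd("2019-01-11"),   # year, month, day
  mdy("09-12-2018"),   # month, day, year
  dmy("01-04-2019")    # day, month, year
)
dates
```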

Output:

"2019-01-11" "2018-09-12" "2019-04-01"

hms: This package works similarly to lubridate, but only with time-of-day values:
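The format of the output below ("9H 10M 1S") appears to be what lubridate's own hms() parser prints, so this sketch shows that call first and then the closest equivalent from the hms package. The input strings are assumptions of mine, and the functions are namespace-qualified because both packages use the name hms.

```r
# Parse "hours:minutes:seconds" strings with lubridate's hms() parser
times <- lubridate::hms(c("09:10:01", "09:10:02", "09:10:03"))
times

# The hms package stores the same kind of value as a time of day,
# internally the number of seconds since midnight
time_of_day <- hms::as_hms("09:10:01")
time_of_day
```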

Output:

"9H 10M 1S" "9H 10M 2S" "9H 10M 3S"

Pretty awesome!

End Notes

Tidyverse is the most popular collection of R packages, which isn’t all that surprising given how useful and easy to use they are. If you aren’t using the Tidyverse packages, you’re missing out on a chance to save time and make your work much more efficient.

Have you used these R packages before? Are there any other packages you feel should be incorporated into Tidyverse? I want to hear your thoughts, feedback, and experience with Tidyverse. Let me know in the comments section below!

And if you get stuck at any point while using these packages, I’ll be happy to help you out.

We have also summarised the use of every package under tidyverse in a cheatsheet, which you can access here.

Responses From Readers

Alex Rosental 13 May, 2019

Perfect timing Akshat!i am now starting my first 10k row assignment flying solo with no help from our instructor. He works for xtol I let you know later this week. You call it training set. Should I do my 80/20 split before tydyverse? Tks Alex Rosental 35 yr experience MsChE

Akshat Arora 13 May, 2019

Hi Alex! You should do the split after all the pre-processing in order to maintain the similar nature of the train and the test set. Plus, these packages will help in EDA, which will then aid you in feature engineering. Moreover, there should be the same number of features in the train and test set it is advisable that you use tidyverse before splitting.

sebastian 13 May, 2019

very good post, keep going!

Alain 13 May, 2019

Nice article, though there is an error when you mention "some basic functions that you can perform with the stringr package are: substr, paste, strsplit, tolower/toupper". Functions in the stringr package starts with str_ like in: str_sub, str_split, str_to_lower/str_to_upper. There is actually no function replacement for paste, nor paste0 on the stringr package.

Akshat Arora 14 May, 2019

Hi Alain! Thank you for going through the article. These errors have been rectified, thank you for the feedback :)

imran 14 May, 2019

Hi there , i'm beginner in R programming I see the output but i didnt see how the coding Can u advice me where can i see this Thanks

Akshat Arora 14 May, 2019

Hi Imran! Thank you for reading through the article. I'm not sure what you mean since I have attached the outputs with the codes above. If there is a specific output that you don't understand, feel free to reach out to me.

Robert Feyerharm 17 May, 2019

Thanks for posting Akshat, very helpful! I'm running R programs written by other data scientists at my new job and dplyr functions and pipe operators are used quite a lot. BTW, regarding your statement that "Data scientists spend close to 70% (if not more) of their time cleaning, massaging and preparing data" that will depend on the company/institution. My prior employer had separate data mgmt. dept.'s that built the datasets & did data prep before delivering the final dataset to the people in the analytics dept. At my current job with Cigna, data scientists are expected to do everything.

Akshat Arora 20 May, 2019

Thank you for the feedback Robert! I agree with you, this statement was based on my experience with different people. Although, this case may be different for different companies.

Paul 25 Jul, 2019

Based on your introduction in this article. Am a bit left on where to get the other two files besides the sample 10% for training.....\ Open up the food forecasting dataset we downloaded earlier. We have 2 other files apart from the training set. We can join them with our train file to add more features. Let’s use dplyr and merge all the files. Again, I’m just using 10% of the overall data to make the computation faster. library(dplyr) joined_data <- left_join(data,fc,by="center_id") where do i get object fc data from so that i can practice along. From Uganda
