data.table() vs data.frame() – Learn to work on large data sets in R

avcontentteam 05 Jul, 2020

8 min read

Introduction

R users (mostly beginners) struggle helplessly while dealing with large data sets. They get haunted by repetitive warnings, error messages of insufficient memory usage. Most of them come to an immediate conclusion, that their machine specification isn’t powerful enough. It’s time to upgrade the RAM or work on a new machine. Have you ever thought this way?

If you have seriously worked on data sets, I’m sure you would have. Even, I did too when I participated in The Black Friday. The data set contained more than 400,000 rows. I was totally clueless. Honestly, it was frustrating to see RStudio taking hours to execute one line of code. As we say, ‘necessity is the mother of all inventions’. I was in need of a solution.

After 2 hours of internet research, I came across an interesting set of R packages and APIs, specially made to work with large data sets without compromising with execution speed. One such package is data.table.

In this article, I’ve shared a smart approach which you should use when you work on large data sets. As you scroll down, you will come across the type of changes you can make to improve your R coding. It’s time to write codes which are fast and short. Consider it as a quick tutorial on data.table package.

Note: This article is best suited to beginners in data science using R who mainly work on data sets using data.frame() .If you are already a proficient user of data.table, this might not interest you.

data.table vs data.frame - learn to work on large data sets

Why does your machine fail to work with large data sets?

It’s important to understand the factors which deters your R code performance. Many a times, the incompetency of your machine is directly correlated with the type of work you do while running R code. Below are some practices which impedes R’s performance on large data sets:

Using read.csv function to load large files.
Using Google chrome: Opening several tabs in chrome consumes a significant amount of system’s memory. This can be checked using Shift + Esc key in chrome browser. (same applies to Mozilla web browser as well)
Machine Specification: R reads entire data set into RAM at once. That is, R objects live in memory entirely. If you are still working on a 2GB RAM machine, you are technically disabled. With 2GB RAM, there isn’t enough free RAM space available which could seamlessly work with large data. Hence, It is strongly recommend to work on atleast 4GB RAM machines.
Working in hot temperature: The processing speed slows the once the machine get heats up. During extreme summers, it evolves into a serious issue.

Note: My system specification is Intel(R) Core (TM) i5-3230M CPU @ 2.60GHz, 2 Core(s), 4 Logical Processors with 8GB RAM.

What is data.table ?

The package data.table is written by Matt Dowle in year 2008.

Think of data.table as an advanced version of data.frame. It inherits from data.frame and works perfectly even when data.frame syntax is applied on data.table. This package is good to use with any other package which accepts data.frame.

The syntax of data.table is quite similar to SQL. Therefore, if you’ve worked on SQL you would quickly understand it. The general form of syntax is:

DT[i, j, by]

where:

DT is referred to the data table.
i <=> where: refers to the row indexing takes place i.e. put the row condition here.
j <=> select: refers to the column indexing takes place i.e. put the conditions (to filter, to summarise) on columns here.
by <=> group_by: refers to any categorical variable i.e. put the variable on the basis of which the grouping shall be executed.

For example:

#creating a dummy data table
DT <- data.table( ID = 1:50,
Capacity = sample(100:1000, size = 50, replace = F),
Code = sample(LETTERS[1:4], 50, replace = T),
State = rep(c("Alabama","Indiana","Texas","Nevada"), 50))

#simple data.table command
> DT[Code == "C", mean(Capacity), State]

Let’s see how does this command work. After the data table is created, I asked data table to filter the rows whose code is C. Then I asked it to calculate the mean capacity of the rows which have code C for every state separately. It’s not necessary that you always mention all the three parts of the syntax. Try doing the following commands at your end :

DT[Code == "D"]
DT[, mean(Capacity), by = State]
DT[Code == "A", mean(Capacity)]

Write your answers in the comments! Let’s see how quickly you are getting this concept.

Why should you use data.table instead of data.frame?

After I delved deeper into data.table, I found several aspects at which data.table package outperforms data.frame. Therefore, I would recommend every R beginner to use data.table as much as they can. There is a lot to explore. The earlier you start, the better you’ll become. You should use data.table because:

1. It provides blazing fast speed when it comes to loading data. With the fread function in data.table package, loading large data sets need just few seconds. For example: I checked the loading time using a data set which contains 439,541 rows. Let’s see how fast is fread –

> system.time(dt <- read.csv("data.csv"))
user system elapsed
11.46 0.21 11.69

> system.time(dt <- fread("data.csv"))
user system elapsed
0.66 0.00 0.66

> dim(dt)
[1] 439541 18

As you saw, loading data with fread is 16x faster than the base function read.csv. fread() is faster than read.csv() because, read.csv() tries to first read rows into memory as character and then tries to convert them into integer and factor as data types. On the other hand, fread() simply reads everything as character.

2. It is even faster than the popular dplyr, plyr packages used for data manipulation. data.table provides enough room for tasks such as aggregating, filtering, merging, grouping and other related tasks. For example:

> system.time(dt %>% group_by(Store_ID) %>% filter(Gender == "F") %>% summarise(sum(Transaction_Amount), mean(Var2))) #with dplyr
user system elapsed
0.13 0.02 0.21

> system.time(dt[Gender == "F", .(sum(Transaction_Amount), mean(Var2)), by = Store_ID])
user system elapsed
0.02 0.00 0.01

data.table has processed this task 20x faster than dplyr. It happened because it avoids allocating memory to the intermediate steps such as filtering. Also, dplyr creates deep copies of the entire data frame where as data.table does a shallow copy of the data frame. Shallow copy means that the data is not physically copied in system’s memory. It’s just a copy of column pointers (names). Deep copy copies the entire data to another location in the memory. Hence, with memory efficiency, the speed of computation is enhanced.

3. Not just reading files, writing the files using data.table is much faster than write.csv(). This packages provides fwrite() function enabled with parallelised fast writing ability. So, next time you get to write 1 million rows, try this function.

4. In built features such as automatic indexing, rolling joins, overlapping range joins further enhances the user experience while working on large data sets.

Therefore, you see there is nothing wrong with data.frame, it just lacks the wide range of features and operations that data.table is enabled with.

Important Data Manipulation Commands

The idea of this tutorial is to provide you handy commands which can speed up your modeling process. Actually, there is so much to explore in this packages, chances are you might get puzzled from where to start, which command to stick with and when to use a particular command. Here, I provide answer to some of the most common questions which you come across while doing data exploration / manipulation.

The data set used below can be download from here: download. The data set contains 1714258 rows of 12 columns. It will be interesting to see, how long does the data.table takes in loading this data. Time for action!

Note: The data set contains uneven distribution of observations i.e. blank columns and NA values. The reason of taking this data is to check the performance of data.table on large data sets.

#set working directory
> setwd(".../AV/desktop/Data/")

#load data
> DT <- fread("GB_full.csv")
Read 1714258 rows and 12 (of 12) columns from 0.189 GB file in 00:00:07

It took only 7 seconds to read this file. Do try at your end.

1. How to subset rows & columns?

#subsetting rows
> sub_rows <- DT[V4 == "England" & V3 == "Beswick"]

#subsetting columns
> sub_columns <- DT[,.(V2,V3,V4)]

In a data table, columns are referred to as variables. Therefore, we don’t need to refer to variables as DT$column name, column name alone works just fine. If you do DT[,c(V2,V3,V4)], it would return a vector of values. Using .() symbol, wraps the variables within list() and returns data table. In fact, every data table or data frame is a compilation of list of equal length and different data types. Isn’t it?

Subsetting data can be done even faster setting keys in data table. Keys are nothing but supercharged rownames. A part of it has been demonstrated below.

2. How to order variables in ascending or descending order?

#ordering columns > dt_order <- DT[order(V4, -V8)]

Order function is data table is much faster than base function order(). Reason being, order in data table uses radix order sort which impart additional boost. - sign results in descending order.

3. How to add / update / delete a column or values in a data set?

#add a new column
> DT[, V_New := V10 + V11]

We did not assign the results back to DT. This is because, := operator modifies the input object by reference. It results in shallow copies in R which leads to better performance with less memory usage. The result is return invisibly.

#update row values
> DT[V8 == "Aberdeen City", V8 := "Abr City"]

With this line of code, we’ve updated Aberdeen City to Abr City in column V8.

#delete a column
> DT [,c("V6","V7") := NULL ]

Check View(DT). We see that the data contains blank columns in the data set. It can be removed using the code above. In fact, all the three steps can be done in command as well. This is known as chaining of commands.

> DT[V8 == "Aberdeen City", V8 := "Abr City"][, V_New := V10 + V11][,c("V6","V7") := NULL]

4. How to compute functions on variables based on grouping a column?

Let’s calculate mean of V10 variable on the bases of V4 (showing country).

#compute the average > DT[, .(average = mean(V1o)), by = V4]

#compute the count
> DT[, .N, by = V4]

.N is a special variable in data.table used to calculate the count of values in a variable. If you wish to obtain the order of the variable specified with by option, you can replace by with keyby. keyby automatically orders the grouping variable in ascending order.

5. How to use keys for subsetting data ?

keys in data table delivers incredibly fast results. We usually set keys on column names which can be of any type i.e. numeric, factor, integer, character. Once a key is set of a variable, it reorders the column observations in increasing order. Setting a key is helpful, specially when you know that you need to make multiple computations on one variable.

#setting a key
> setkey(DT, V4)

Once, the key is set, we can subset any value from the key. For example:

#subsetting England from V4
> DT[.("England")]

Once the key is set, we no longer need to provide the column name again and again. If we were to look for multiple values in a column, we can write it as:

> DT[.(c("England", "Scotland"))]

Similarly, we can set multiple keys as well. This can be done using:

> setkey(DT, V3, V4)

We can again, subset value from these two columns simultaneously using:

> DT[.("Shetland South Ward","Scotland")]

There are several other modifications which can be done in the 5 steps demonstrated above. These 5 steps illustrated above will help you to perform the basic data manipulation tasks using data.table. To learn more, I would suggest you to start using this package in your every day R work. You’d face various hurdles and that’s where your learning curve would accelerate. You can also check the official data.table guide here.

End Notes

This article is written to provide you a path using which you can easily deal with large data sets. No longer, you need to spend money on upgrading your machines, instead it’s time to upgrade your knowledge of dealing with such situations. Apart from data.table, there are several other packages for parallel computing available. But, I don’t see any need to any other package for data manipulation, once you become proficient with data.table.

In this article, I discussed about some important aspects which every beginner in R must know while working on large data sets. After data manipulation, the very next hurdle which comes is model building. With large data sets, packages like caret, random forest, xgboost takes a lot of time in computation. Has it occurred to you?

I plan to provide an interesting solution in my post next week! Do let me know your pain points in dealing with large data stets. Did you like reading this article? Which other package do you use when dealing with large data sets? Drop your suggestions / opinions in the comments.

You can test your skills and knowledge. Check out Live Competitions and compete with best Data Scientists from all over the world.

add a new column in R data manipulation data.table delete a column dplyr large data sets parallel computing r packages

avcontentteam 05 Jul, 2020

Advanced Libraries Machine Learning Programming R

Responses From Readers

Mohamed Salem 03 May, 2016

Really..thank you.. Keep it up

Show 1 reply

Analytics Vidhya Content Team 03 May, 2016

Welcome Salem! :)

Sowmiyan 03 May, 2016

Nice Article Manish. Eagerly looking forward to your next article on packages to be used while modelling on large data sets

Show 1 reply

Analytics Vidhya Content Team 03 May, 2016

Thanks Sowmiyan! Do come back next week.

james 03 May, 2016

excellent. I will change my codes to DT even though my datasets are small. Make it a habit man !

Show 1 reply

Analytics Vidhya Content Team 03 May, 2016

Thanks James! Even I did the same after I learnt about this package. Usually, new users in R tend to juggle among lots of R packages for data manipulation i.e. which one to use, when to use etc. data.table just take away all that confusion. Way to go. All the best!

Ivan 03 May, 2016

Awesome article!

Ivan Oboth 03 May, 2016

Rearly awesome read, can't wait for next week's article! Keep them coming!

Show 1 reply

Analytics Vidhya Content Team 03 May, 2016

Thanks Ivan! Next week is coming. :D

Anon 03 May, 2016

Nice timing: I'd been meaning to try this out for sometime. Windows 7 64-bit | i5-4300 CPU | 8 GB RAM > system.time(dt <- fread("train.csv")) Read 37670293 rows and 24 (of 24) columns from 3.791 GB file in 00:04:11 user system elapsed 191.85 19.35 251.31 And it also gives you an update of the percentage of data read every couple of seconds. Quite helpful.

Show 1 reply

Analytics Vidhya Content Team 03 May, 2016

Thanks Anon! With such a large data set, you can unleash the full power of this package once your start performing data manipulation steps. It'll be interesting to see how good data.table performs at such huge data set.

Ben Reid 03 May, 2016

Great article. For users who aren't ready to learn data.table syntax and ONLY want a fast way to read data, read_csv() from Hadley Wickham's readr package is almost as fast as fread() and much faster than read.csv(). But overall, the investment in learning data.table syntax is well worth it if you work with datasets over 1/2 GB.

Vamshi 03 May, 2016

Hi All, I need help of you all guys. Actually, I am working on retail data set so here the problem is imputation by groups. Manufacture>category>sub.brand>>brand>'units' Here i need to impute data in units by Manufacture>category>sub.brand>>brand>. Please Help me.

Nidhi 03 May, 2016

Thanks for this article. Can u give some information as how to deal with large datasets in python using pandas?

JimmyGao 04 May, 2016

Awesome article.This is what I need .Thank you very much.

Surya Prakash 04 May, 2016

Nice Article..Very well explained

Manasa 05 May, 2016

Hi. Thanks Manish. What is the the best way to create a 100000*3000 matrix in R. I have tried library matrix but I am still struggling with memory issues with a 16GB RAM

jangorecki 06 May, 2016

Very nice article. You may be interested in this fresh presentation which mentions about moving data.table algorithms into distributed H2O cluster on big data: http://library.fora.tv/2016/05/03/big_joins_scalable_data_munging_and_datatable

Shaul Abergil 06 May, 2016

Thanks! really helpful

Sakshi Bansal 10 May, 2016

Outstanding article...Thank you

Subro 10 May, 2016

Hi Manish, Great article, Can it read Excel files??

Show 1 reply

Analytics Vidhya Content Team 11 May, 2016

Hi Subro, fread doesn't support excel files. The best option is to convert excel file into .csv and used fread. Or, if you have a powerful machine, you can use readxl or XLConnect package.

Meziane 18 May, 2016

Hi Manish, Great article as usual ! I've just noticed that in the point 4, when you compute the mean of the variable v10, a bracket ')' is missing. May I am wrong ? Thank you for your job.

Show 1 reply

Analytics Vidhya Content Team 19 May, 2016

Hey Meziane, Thanks a lot for letting me know. Added the bracket.

Akshay 06 Aug, 2016

Hi Manish (im posting my question again because somehow the major part of my question in the previous post got deleted) in subsetting your data section, you used .() special symbol: "sub_columns <- DT[,.(V2,V3,V4)]" and you said because do DT[,c(V2,V3,V4)], it would return a vector of values. but in the updating row values section, you used c() instead: "sub_columns DT [,c(“V6″,”V7”) := NULL ]" so why do you use c() instead of .() here to delete colukn 6 & 7? thank you!

Show 1 reply

Analytics Vidhya Content Team 07 Aug, 2016

Hi Akshay, .() is a symbol for list. So, when you use list inside a DT[] frame, you are not required to quote the variable names. But, using c() vector symbol requires you to quote the variable names in vector. For some reason DT[,.(V6,V7) := NULL] wouldn't work. However, if you were to remove just one variable, DT[,V6 := NULL] would definitely work. Therefore, I've used c() to remove the variables.

ROBERT 06 Sep, 2016

Hi Manish. First of all very interesting article, congrats! I have a question about functions in data.table: Using ddply you can apply a own function (called DRIVERS for example) to the dataset by a column call "week" with: kpiweeks<-ddply (dataset, "week", DRIVERS) DRIVERS function return a dataframe with 8 new KPI based on dataset so kpiweeks will return a dataframe with the 8 new KPI for each week in dataset. Can we apply this function with data.table? Thank you!

Show 1 reply

Analytics Vidhya Content Team 06 Sep, 2016

Hi Robert, In data.table, you can use .SD and lapply function and perform this command like this: colnames <- names(kpiweeks) kpiweeks[,(colnames) := lapply(.SD,DRIVERS), .SDcols = colnames]

Vadim 10 Sep, 2016

On my 5 years old laptop (8G RAM) reads from archive just 9 sec. > DT = fread("unzip -cq GB_full.csv.zip") Read 1716366 rows and 12 (of 12) columns from 0.187 GB file in 00:00:09 Great package!

Animesh 27 Sep, 2016

awesome article and explained in very simple way. Many thanks for writing!

Yvette 20 Oct, 2016

This is a great resource! Thank you very much for sharing

Ruthger Righart 21 Oct, 2016

Great article!

Pengchuan 25 Oct, 2016

DT[,.(“England”)] is incorrect. DT[,.(England)] seems to be incorrect, either. Should be DT["England"] or DT["England",] ? Also, neither DT[.(“England”, “Scotland”)] nor DT[,.(England,Scotland)] is correct. Do we have a data.table way to select the columns with V4==“England” or V4==“Scotland” except the data.frame way DT[V4==“England” | V4==“Scotland”]?

Show 1 reply

Analytics Vidhya Content Team 25 Oct, 2016

Thanks Pengchuan for your comment. I feel sorry for the inconvenience caused to you. I've updated the correct codes in the article. To select "England" from a column (set as key), you should write: DT[.("England")] To select two values, say, "England" and "Scotland", we need to pass a vector in the list DT[.(c("England","Scotland"))] To select two values from two different columns (set as key), we write: DT[.("Column_1_Value","Column_2_Value")]

Johnny Chiu 09 Nov, 2016

Really useful article! Thanks a lot!

monika 29 Jun, 2017

I am totally beginner in R. really its a very useful article...

Vikas 19 Jul, 2017

Hi Manish Thank you for this great article. Can you please help me with syntax of fread() and how to use it.

LOKESH KUMAR D 27 Feb, 2018

Hi Thanks for sharing the great article.I use h20 package for model building.

Christoph Dobes 14 Mar, 2018

Liked the article very much! Easy to read and motivating to continue on the issue! Thanks! Christoph

data.table() vs data.frame() – Learn to work on large data sets in R

Introduction

Why does your machine fail to work with large data sets?

What is data.table ?

Why should you use data.table instead of data.frame?