R Programming Concepts Made Easy!
- How is Type Casting done in R?
- Discussing the Date data type in R
- Understanding the concept of Encoding in R
While most concepts stay broadly similar across the tools and technologies we learn for data science, a few are specific to a particular tool or language, R for example. Dates are easy to handle in Excel and SQL, and in Python we import a module to work with them. In R, dates bring in a whole new concept: the STRP codes, i.e. the conversion codes used by R's strptime function to describe how a date is written.
Once you are done with the basics of R and R programming, which we discussed in the article Get to know all about the R Language, it is the right time to look at some of the more complicated topics covered in this article.
Table of Contents
- Type Casting in R
- Dealing with Dates in R
- The Date data type in R
- Concept of Encoding in R
- Curse of Dimensionality
- Label Encoding
- One Hot Encoding
Type Casting in R
Does the name ‘Type Casting’ suggest anything about the concept? It does! ‘Type’ refers to the data type, and ‘casting’ refers to converting a value from one data type to another. Essentially, type casting is the process of changing the data type of an object in R to another data type.
For your reference: some of the commonly used data types in R are numeric, logical, character, etc.
Suppose we have an object “demo” of some data type. To see this object as another data type, say “new_datatype”, we write the command as.new_datatype(demo) and we are done. Here new_datatype is a placeholder for a real type name, as in as.character(demo) or as.numeric(demo).
Let’s try type casting! a is an object holding the number 100, hence its class is numeric.
To display this object a as a character we can write as.character(a).
And we get the value of the numeric object a as a character, i.e. “100” (anything in quotes is considered a character in R).
But if we check the data type of a again, it still comes out as “numeric”.
Why is this so? We didn’t save the result of the as.character(a) command into any object, so a itself is unchanged.
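The steps above can be sketched as follows. This is a minimal example assuming a was created with a <- 100; the name a_chr is introduced here only for illustration.

```r
a <- 100
class(a)                  # "numeric"

as.character(a)           # "100" is displayed, but a itself is untouched
class(a)                  # still "numeric"

# Save the result into an object to keep the converted value
a_chr <- as.character(a)  # a_chr is a hypothetical name
class(a_chr)              # "character"
```

Assigning the result back to a itself (a <- as.character(a)) would change the class of a in place.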
Time for another example! b is an object holding the text value ‘ABC’, hence its class is character.
To display this object b as a number we can write the command as.numeric(b), but instead of a number we get NA (a missing value) along with a warning that NAs were introduced by coercion!
It seems that text in R cannot be converted to numbers. Let’s take another example!
c is an object holding the text value ‘500’, hence its class is character.
To display this object c as a number we can write the command as.numeric(c), and we get the number 500.
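Both examples can be reproduced with the short sketch below. The names b_num and c_num are hypothetical, and suppressWarnings() is used only to keep the output clean; without it R prints a warning for b.

```r
b <- "ABC"
c <- "500"   # the article's object name; the function c() keeps working

# as.numeric() returns NA (with a warning) when the text is not a number
b_num <- suppressWarnings(as.numeric(b))
b_num          # NA

# Text that looks like a number converts cleanly
c_num <- as.numeric(c)
c_num          # 500
class(c_num)   # "numeric"
```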
This might seem confusing, as some text can be converted into numbers while other text cannot, but it will be clear by the end of this section!
Type casting obeys some rules: not every value of one data type can be transformed into another data type. The data types follow a precedence that determines which conversions always work.
Considering the most commonly used data types in R (character, numeric and logical), the precedence from highest to lowest is: character, numeric, logical.
Type casting can always be done from a lower type to a higher one (logical to numeric to character), but not, in general, the other way round.
That is, for an arbitrary object the class generally cannot be converted from character to numeric or logical, or from numeric to logical, without producing NA or losing information.
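One way to see this precedence in action is to mix types inside a single vector, since R then promotes every element to the highest type present. This is a small illustrative sketch rather than part of the original walkthrough, and the names mix1, mix2 and mix3 are hypothetical.

```r
# Combining values of different types: the highest type wins
mix1 <- c(TRUE, 5)         # logical promoted to numeric
mix2 <- c(5, "five")       # numeric promoted to character
mix3 <- c(TRUE, 5, "x")    # everything promoted to character

class(mix1)   # "numeric"
class(mix2)   # "character"
class(mix3)   # "character"

mix1          # 1 5  (TRUE became 1)
```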
However there are certain special cases that need to be taken into consideration.
Case 1 : character to numeric
Consider objects a, b and c of character type and try converting them to numeric.
Conclusion: this type of conversion is possible only if the text stored in quotes is itself a number, decimal numbers included. Any other text becomes NA, with a warning.
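A quick sketch of Case 1. The three text values are made up for illustration, since the original values of a, b and c are not shown here.

```r
# Hypothetical text values: an integer, a decimal, and non-numeric text
txt <- c("123", "45.67", "12 apples")

# Non-numeric text turns into NA (R also raises a warning, silenced here)
num <- suppressWarnings(as.numeric(txt))
num          # 123.00  45.67     NA
class(num)   # "numeric"
```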
Case 2 : character to logical
Consider objects a, b and c of character type and try converting them to logical.
Conclusion: this type of conversion is possible only if the text is one of these: T, F, True, False, true, false, TRUE, FALSE. Any other text becomes NA.
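A quick sketch of Case 2, using a hypothetical vector txt that mixes accepted spellings with two strings ("yes" and "1") that as.logical() does not recognize:

```r
txt <- c("T", "TRUE", "true", "True",
         "F", "FALSE", "false", "False",
         "yes", "1")

flags <- as.logical(txt)
flags
# TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE NA NA
```

Note that the text "1" gives NA even though the number 1 converts to TRUE.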
Case 3 : numeric to character
Consider the numbers 99, -99 and 98765432198765.
There are no conditions on this one: we can always convert the numeric data type to the character data type. In fact, any of these data types can be converted to character.
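A short sketch of Case 3 with the three numbers mentioned above (the names nums and chr are hypothetical):

```r
nums <- c(99, -99, 98765432198765)

chr <- as.character(nums)
chr          # "99" "-99" "98765432198765"
class(chr)   # "character"
```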
Case 4 : numeric to logical
Conclusion: for 0 we get FALSE and for any non-zero number we get TRUE.
This conversion is pretty direct but leads to data loss, as we cannot convert the resulting logical value (TRUE/FALSE) back into the original number.
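The data loss is easy to see in a small sketch. The values in nums are chosen arbitrarily for illustration.

```r
nums <- c(0, 1, -3.5, 99)

flags <- as.logical(nums)
flags               # FALSE TRUE TRUE TRUE

# Converting back only recovers 0 and 1, not the original numbers
as.numeric(flags)   # 0 1 1 1
```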
Case 5 : logical to character
There are no conditions on this one either: we can always convert the logical data type (i.e. TRUE, FALSE) to the character data type, giving the text “TRUE” or “FALSE”.
Case 6 : logical to numeric
This conversion is always possible. The logical value TRUE is saved as 1 and logical value FALSE is saved as 0.
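Cases 5 and 6 can be checked together in one sketch (flags is a hypothetical example vector). The last line shows a common use of the logical-to-numeric rule: summing a logical vector counts its TRUE values.

```r
flags <- c(TRUE, FALSE, TRUE)

as.character(flags)   # "TRUE" "FALSE" "TRUE"
as.numeric(flags)     # 1 0 1

# Because TRUE counts as 1, sum() gives the number of TRUE values
sum(flags)            # 2
```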
Dealing with Dates in R
Unlike other data types in R such as character, numeric, integer, factor and logical, Date is not a naturally occurring data type; it is a derived data type.
While in tools like Excel we can simply write dates like 20/06/2017, and in tools like SQL dates are written as “20/06/2017”, neither way of writing produces a date in R.
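Let’s try creating such objects in R: x, assigned 20/06/2017 without quotes, and y, assigned “20/06/2017” in quotes.
These objects x and y store the following data: x holds the result of the division 20 / 06 / 2017 (roughly 0.00165), and y holds the text “20/06/2017”.
Let’s check their data types: x is numeric and y is character.
None of them is a Date!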
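A minimal sketch of this experiment, assuming x is written without quotes and y with quotes:

```r
x <- 20/06/2017      # R reads this as arithmetic: 20 divided by 6, then by 2017
y <- "20/06/2017"    # quotes make it plain text

x          # a small number, roughly 0.00165
class(x)   # "numeric"
class(y)   # "character"
```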
The Date Data Type in R
However, we can convert these into proper dates using the STRP codes, i.e. the format codes understood by R's strptime function. There is one prerequisite: to convert a value into a proper date in R, with Date as its data type, the value should be of character type. In layman’s terms, we convert strings into dates in R.
Why do we need this process?
To make the computer understand that the characters in the string (which can be digits or letters) are actually date components. For example, for the object ‘y’, which holds “20/06/2017”:
We want the computer to understand that 20 is the day number, 06 is the month number and 2017 is the year with the century. For this we write a format using these codes and then use the as.Date() function!
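Some of the STRP codes for your reference are:
- %d : Day Number (01 to 31)
- %m : Month Number (01 to 12)
- %y : Year without Century (e.g. 17)
- %Y : Year with Century (e.g. 2017)
- %b : Abbreviated Month Name (e.g. Jun)
- %B : Full Month Name (e.g. June)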
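One way to get a feel for these codes is to run them in the other direction: take a known date and ask format() to print individual components. This is a small sketch; the date in d is the same one used for the object y, written in the ISO layout that as.Date() reads without a format.

```r
d <- as.Date("2017-06-20")   # ISO layout (year-month-day) needs no format string

format(d, "%d")         # "20"    day number
format(d, "%m")         # "06"    month number
format(d, "%y")         # "17"    year without century
format(d, "%Y")         # "2017"  year with century
format(d, "%d/%m/%Y")   # "20/06/2017"

# Month and weekday names (%b, %B, %a, %A) depend on the system locale
format(d, "%B")
```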
Now let’s convert our object y into a date.
Since y is of character data type, we are good to go. Reading off the date components and writing the corresponding codes, we get:
- 20 : Day Number : %d
- 06 : Month Number : %m
- 2017 : Year with Century : %Y
- Delimiter : /
Finally, use the function as.Date(object_name, format).
For the object y the format is going to be ‘%d/%m/%Y’.
And we get our date y_date corresponding to the character object y we had. Trust me, this is a date, and we can check it with class(y_date)!
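Put together, the conversion looks like the sketch below. The last line is an extra check added here for illustration: date arithmetic only works on a real Date, not on text.

```r
y <- "20/06/2017"

# Describe the layout of the text with format codes, then convert
y_date <- as.Date(y, format = "%d/%m/%Y")

y_date          # "2017-06-20"  (R prints dates as year-month-day)
class(y_date)   # "Date"

y_date + 1      # "2017-06-21"
```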
This is how we can convert any character object holding a date into the Date data type. But there is another way to do this, using a third-party package. R has a package called lubridate, which contains a lot of functions that help us work with dates. Let’s try it out.
We create an object demo with the data ‘01/23/18’ in it and use the function mdy (month, day, year) from the lubridate package to type cast the character data type of demo into the Date data type. (Don’t forget to install the package and then load it first.)
This is how we get the string ‘01/23/18’ as a date, namely 2018-01-23.
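A sketch of the lubridate route, next to the base R equivalent for comparison. The requireNamespace() guard is an addition made here so the snippet also runs on a machine where lubridate is not installed; in an interactive session one would normally just call library(lubridate). The names demo_base and demo_date are hypothetical.

```r
demo <- "01/23/18"

# Base R equivalent: spell out the layout with format codes
demo_base <- as.Date(demo, format = "%m/%d/%y")
demo_base              # "2018-01-23"

# lubridate: the function name itself (month-day-year) describes the layout
# install.packages("lubridate")   # run once if the package is missing
if (requireNamespace("lubridate", quietly = TRUE)) {
  demo_date <- lubridate::mdy(demo)
  print(class(demo_date))   # "Date"
  print(demo_date)          # "2018-01-23"
}
```

lubridate offers sibling functions such as dmy() and ymd() for other component orders.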
Concept of Encoding in R
What is our motive for learning R? Performing statistical analysis or running machine learning algorithms could be an answer. To run machine learning algorithms, the data needs to be processed into the required format, which is numeric, so that we only put numeric variables into our model.
Don’t worry, we won’t be turning the categorical variables into genuinely numeric variables, because there is no way to do that. Instead we will give the categorical variables a numeric representation, and this is what is called ENCODING.
This process of encoding can be categorized into 2 parts:
- Label Encoding : for the ordinal categorical variables, i.e. the variables in which the categories can be ordered.
- One-Hot Encoding : for the nominal categorical variables, i.e. the variables where the categories are at the same level and cannot be ordered.
But before moving on to the actual encoding in R, one needs to be familiar with the concepts of binning, multicollinearity and the Curse of Dimensionality, which we will refer to in the process of encoding, mainly in one-hot encoding.
Binning
Binning is the process of grouping the values of a variable in a data set into bins on the basis of some criteria, so as to reduce the number of distinct values and, after encoding, the dimensionality. Binning can be done on both categorical and numerical variables. For a numerical variable, values falling in a certain range can be binned together into categories, and this is how we convert a numerical variable into a categorical variable. For a categorical variable, the number of categories can be reduced by clubbing together some of the existing categories.
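For a numerical variable, base R's cut() function is one way to do the binning. The ages, break points and labels below are made up purely for illustration.

```r
# Hypothetical ages
age <- c(22, 35, 47, 58, 63)

# Bin into three ranges: (0, 30], (30, 50], (50, Inf)
age_group <- cut(age,
                 breaks = c(0, 30, 50, Inf),
                 labels = c("Young", "Middle-aged", "Senior"))

as.character(age_group)
# "Young" "Middle-aged" "Middle-aged" "Senior" "Senior"
class(age_group)   # "factor"
```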
Multicollinearity
For our purpose of performing encoding in R, just know that ‘a derived variable in a dataset causes multicollinearity’ and you will be good to go. As an example, take profit calculation: if our data has all 3 variables cost, revenue and profit, then since profit = revenue - cost, the profit column causes multicollinearity.
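The profit example can be made concrete with a small sketch. The figures are invented, and checking the matrix rank with qr() is just one simple way to expose the redundancy: three columns that carry only two columns' worth of independent information.

```r
# Hypothetical figures
cost    <- c(10, 20, 30, 40)
revenue <- c(15, 32, 41, 70)
profit  <- revenue - cost   # derived variable: 5 12 11 30

X <- cbind(cost, revenue, profit)

ncol(X)      # 3 columns ...
qr(X)$rank   # ... but rank 2, because profit is fully determined by the other two
```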
The Curse of Dimensionality
As the name suggests, it has something to do with dimensions, and by dimensions we mean the columns of a dataset, which represent the features of the input data stored in its rows. The dimensionality of a dataset is the number of columns characterizing it. The curse of dimensionality sets in when the number of columns grows large relative to the number of rows: the data become sparse in the feature space, machine learning models built on them tend to give unreliable results, and hence we need to take care of this.
So we are all set to learn about encoding now. Let’s discuss the two types one by one:
Label Encoding
As previously mentioned, label encoding gives the ordinal categorical variables a numeric representation.
The ordinal categorical variables have the property that we can actually order the categories (values) within the variable (a column of a dataset), and following that well-defined order we assign natural numbers to them.
Let us take a dataset and actually perform label encoding on it! Consider a data frame holding the data of employees at an organization:
Let’s read the dataset as a data frame in R.
Can you identify the ordinal variable in the employee dataset? Yes, “Designation”.
We can easily encode this variable by assigning values in ascending order of designation:
- Intern – 1
- Analyst – 2
- Senior Analyst – 3
- Manager – 4
The factor data type does the work of assigning levels within the Designation variable: we list the levels in ascending order of designation, and each level then corresponds to its position in that list.
We create a new variable Designation_Encoded to hold the label-encoded Designation column.
Now let’s have a look at our data frame!
The data frame has a new column Designation_Encoded containing the numeric representation of the original Designation column, and hence we are done!
We can now drop the Designation column from our data frame.
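The whole label-encoding walkthrough can be sketched as below. The employee names and rows are hypothetical stand-ins, since the article's dataset is not reproduced here; only the four designation levels and their order come from the text above.

```r
# Hypothetical employee data (the article's actual dataset is not shown)
employees <- data.frame(
  Name        = c("Asha", "Ben", "Chitra", "Dev"),
  Designation = c("Analyst", "Intern", "Manager", "Senior Analyst"),
  stringsAsFactors = FALSE
)

# Levels listed in ascending order of designation
designation_levels <- c("Intern", "Analyst", "Senior Analyst", "Manager")

# factor() attaches the ordering; as.integer() returns each level's position
employees$Designation_Encoded <- as.integer(
  factor(employees$Designation, levels = designation_levels)
)
employees$Designation_Encoded   # 2 1 4 3

# Drop the original text column once it has been encoded
employees$Designation <- NULL
names(employees)                # "Name" "Designation_Encoded"
```

Spelling out the levels matters: without the levels argument, factor() orders the categories alphabetically, which would not match the seniority order.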
One-Hot Encoding
Since the ordinal categorical variables have been taken care of, now it’s time to look at the second type of categorical variables, the nominal categorical variables. Giving the nominal categorical variables a numeric representation is known as one-hot encoding.
Since in the case of nominal variables we cannot explicitly order the categories (values) in a variable (a column of a dataset), we go for a new concept: dummy variable creation.
What do we need to do here? Just identify the distinct categories within a variable and create a dummy variable for each of them: a column that is 1 for the rows in that category and 0 otherwise.
I did a survey of some corporate employees and recorded their responses on what factors drive them to work harder in their job. Sharing the responses:
Let’s read the dataset as a data frame in R.
‘Response’ is a nominal categorical variable here, so let’s perform one-hot encoding on it!
There are 5 different categories, so creating a dummy variable for each would introduce 5 new columns into the dataset and push us towards the ‘Curse of Dimensionality’. So instead of creating the dummy variables right away, we first go for ‘Binning’ the categories in the ‘Response’ variable.
Using ifelse we binned the categories and reduced them to two: Monetary and Non-Monetary.
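The binning step might look like the sketch below. The five response categories are invented placeholders, because the survey data itself is not reproduced here; only the two target bins, Monetary and Non-Monetary, come from the article.

```r
# Hypothetical survey responses with 5 distinct categories
survey <- data.frame(
  Response = c("Salary Hike", "Recognition", "Bonus",
               "Learning", "Work Culture", "Bonus"),
  stringsAsFactors = FALSE
)
length(unique(survey$Response))   # 5

# Assumed grouping: which categories count as monetary
monetary <- c("Salary Hike", "Bonus")

# ifelse() relabels every row as one of the two bins
survey$Response <- ifelse(survey$Response %in% monetary,
                          "Monetary", "Non-Monetary")

sort(unique(survey$Response))     # "Monetary" "Non-Monetary"
```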
Now we just need to create 2 dummy variables!
Wait! We need to install the package fastDummies first.
Two dummy variables, Response_Monetary and Response_Non-Monetary, have been created. Why do we need 2 such variables? We don’t! Each one is fully determined by the other (whenever one is 1, the other is 0), so, as we just read under Multicollinearity, we need to drop one of them to avoid multicollinearity.
And we are done! The “Response” variable has been one-hot encoded. We can further drop the “Response” variable from the data frame.
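A sketch of the dummy-variable step, starting from already-binned hypothetical data. Two routes are shown: a base R stand-in that builds the single 0/1 column by hand, and a call to dummy_cols() from fastDummies. Which fastDummies function the article used is an assumption here, and the requireNamespace() guard is added so the snippet also runs where the package is not installed.

```r
# Hypothetical, already-binned responses
survey <- data.frame(
  Response = c("Monetary", "Non-Monetary", "Monetary"),
  stringsAsFactors = FALSE
)

# Base R stand-in: with two categories, one 0/1 column carries all the information
survey$Response_Monetary <- as.integer(survey$Response == "Monetary")
survey$Response_Monetary   # 1 0 1

# fastDummies route (assumed function: dummy_cols)
# install.packages("fastDummies")   # run once if the package is missing
if (requireNamespace("fastDummies", quietly = TRUE)) {
  encoded <- fastDummies::dummy_cols(
    survey["Response"],
    select_columns     = "Response",
    remove_first_dummy = TRUE   # keep only one of the two dummies
  )
  print(encoded)
}

# Drop the original text column once it has been encoded
survey$Response <- NULL
```

select_columns also accepts a vector of several column names.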
So this is how we perform one-hot encoding. As an exercise, try creating dummy variables for multiple variables simultaneously.
So we are at the end of this article, and I hope that by now you are well aware of how to perform type casting in R, how to type cast a character value storing a date into a proper Date data type, and finally how to pre-process the data to be fed into a model with encoding.
Where there are multiple ways to perform a particular task, it is always good to know about the various options available.
Hope you liked my article on R Programming Concepts. Read more articles on our website.
You can connect with me on LinkedIn: https://www.linkedin.com/in/ayushi-gupta25/