Learn everything about Analytics

9 Challenges on Data Merging and Subsetting in R & Python (for beginners)

SHARE
, / 15
Log in or Register to save this content for later.

Introduction

Juggling with multiple data sets is a common task for a data scientist. And, it’s immensely important for a beginner or intermediate to learn this skill.

I got the idea of writing this article from the past data science competitions. Many a times, people end up getting undesirable NA values while combining or subsetting data sets. If it still happens with you, don’t freak out. You just have to practice these challenges well and you won’t get any NAs again.

After you finish these challenges successfully, it is expected that you will become proficient at manipulating data frames, merging multiple data sets to perform several basic operations on data frame(s). For your convenience, I’ve used R and Python to demonstrate the operations. Also, I’ve given 4 practice exercises.

P.S – For fun, I’ve used a dummy data set from a popular TV Series named ‘Game of Thrones’. If you are crazy about it, great! If not, you will still find it easy to understand.

Data set for every challenge is available for download here.

data merging and subsetting in R and Python

 

Table of Contents

  • Challenge 1 : Adding More Observations
  • Challenge 2 : Dropping Observations
  • Challenge 3 : Adding Column(s) Horizontally
  • Challenge 4: Adding Column(s) based on common attribute
  • Challenge 5 : Adding Column(s) based on observation serial (index)
  • Challenge 6: Removing Duplicate Observations
  • Challenge 7: Dropping Columns
  • Challenge 8: Modifying Value(s) of a DataFrame
  • Challenge 9: Renaming Column Name(s)

Before starting with the challenges, make sure you’ve downloaded the data.

 

Challenge 1 : Adding more observations

i) In Structured Data Set

We have a dummy data set from Game of Thrones named house. It contains information about various clans (houses). Think of houses as families. Let’s say, two new houses have emerged whose information is contained in the data set house_extra.

Task: To include the houses in house_extra in the data set house (i.e merging the two data sets)

Data : house
HouseRegion
StarkThe North
TargaryenSlaver’s Bay
LannisterThe Westerlands
BaratheonThe Stormlands
TyrellThe Reach
Data : house_extra
HouseRegion
TullyThe Riverlands
GreyjoyThe Iron Islands

You can make a row wise addition of house_extra to house data set by using the following code :

Python CodePython-

#row wise addition does vertical addition of new rows
> house = house.append(house_extra)
or
#axis=0 can also add new rows to data set
> house = pd.concat([house,house_extra],axis=0)

R CodeR_120

#using base function merge
> house <- merge(house,house_extra,all=TRUE,sort=FALSE)

The output will be :

HouseRegion
StarkThe North
TargaryenSlaver’s Bay
LannisterThe Westerlands
BaratheonThe Stormlands
TyrellThe Reach
TullyThe Riverlands
GreyjoyThe Iron Islands

 

ii) In Unstructured Data Set

Now there might be a case where you have an unstructured data. Think of unstructured data as a data without any matrix or data frame. Still, could we add new observations to it?

Suppose you have a new house in the house data frame. The new house is “Redwyne” which is present in “The Reach” region. Now, we want to add this new observation in our existing house data. Let’s see how to do it.

Python CodePython-

> house.loc[len(house)]=['Redwyne','The Reach']

R CodeR_120

> house <- rbind(house,data.frame(House=c("Redwyne"),Region=c("The Reach")))

The output will be :

Data : house
HouseRegion
StarkThe North
TargaryenSlaver’s Bay
LannisterThe Westerlands
BaratheonThe Stormlands
TyrellThe Reach
TullyThe Riverlands
GreyjoyThe Iron Islands
RedwyneThe Reach

As you can see, a new house (Redwyne) has been added to the house data frame with its region.

 

Important Points

1. At times there are situations when we are required to add observations from a data set which has a new column than the existing one. Confused ? Let’s do some changes in data set and understand this point:

Now, the house data set contains two columns (House, Region). Another data set, house_new contains columns (House, Region  and Religion). Keep in mind, Religion column is not available in other data set. We are asked to combine these data sets. How can we do?

Data : house
HouseRegion
StarkThe North
TargaryenSlaver’s Bay
LannisterThe Westerlands
BaratheonThe Stormlands
TyrellThe Reach
Data : house_new
HouseRegionReligion
TullyThe RiverlandsSeven Gods
GreyjoyThe Iron IslandsSea God

Let’s find the solution.

Python Code

> house=house.append(house_new)Python-
or
> house=pd.concat([house,house_new],axis=0)

R CodeR_120

> house <- merge(house,house_new,all=TRUE,sort=FALSE)

We can see that it assigned NaN to the elements of house data frame because “Religion” variable was not present in the house set.

HouseRegionReligion
StarkThe NorthNaN
TargaryenSlaver’s BayNaN
LannisterThe WesterlandsNaN
BaratheonThe StormlandsNaN
TyrellThe ReachNaN
TullyThe RiverlandsSeven Gods
GreyjoyThe Iron IslandsSea God

 

2. This point is only applicable for python users.

Continuing from point 1, after adding the new houses to the old data set, we end up getting repeated index values. This is definitely a problem.

Data : house
HouseRegionReligion
0StarkThe NorthNaN
1TargaryenSlaver’s BayNaN
2LannisterThe WesterlandsNaN
3BaratheonThe StormlandsNaN
4TyrellThe ReachNaN
0TullyThe RiverlandsSeven Gods
1GreyjoyThe Iron IslandsSea God

Now, if we try to access the  first element of the data frame, what will we get? Take a minute to think about it.

> house.ix[0]

The output will be:

HouseRegionReligion
0StarkThe NorthNaN
0TullyThe RiverlandsSeven Gods

We see two elements in the output. Why is that ?

This is because, after merging the two data frames, the index of new observations haven’t changed according to the new data set. So, if we try to access the first element of the data frame, the result will be same as above. To handle this problem, we can treat index also. It can be done by :

#ignore_index does not takes the old index in considerationPython-
>house_data=pd.concat([house,house_new],axis=0,ignore_index=True)

Now the output after vertically adding the data frames will be :

HouseRegionReligion
0StarkThe NorthNaN
1TargaryenSlaver’s BayNaN
2LannisterThe WesterlandsNaN
3BaratheonThe StormlandsNaN
4TyrellThe ReachNaN
5TullyThe RiverlandsSeven Gods
6GreyjoyThe Iron IslandsSea God
As we see that the index is no longer the old index of individual data frames. Now, we can access any row without any problem.

Exercise 1 : What will be the output of the append or concat operations on house and house_new if there is an extra variable present in house data frame?
Exercise 2 : Write the code for adding house data frame to house_new(house_new observations will be on top)?
Provide your answers in comments below.

 

Challenge 2 : Dropping Observations

Dropping Rows

Suppose we have a data set candidates which contains information about heirs (successors) of each house (family). They are sorted on the basis of age in descending order within the same house. (There is no order between the houses).

Data : candidates
HouseName
LannisterJamie Lannister
StarkRobb Stark
StarkArya Stark
LannisterCersi Lannister
TargaryenDaenerys Targaryen
BaratheonRobert Baratheon
MormontJorah Mormont

Now we want to remove the top two rows of the candidates data frame :

Python CodePython-

# 0 and 1 are the index of the rows we want to remove
> candidates=candidates.drop([0,1])

R CodeR_120

# Note: In R, index starts from 1 and not 0.
> candidates=candidates[-c(1:2),]

Output will be :

HouseName
StarkArya Stark
LannisterCersi Lannister
TargaryenDaenerys Targaryen
BaratheonRobert Baratheon
MormontJorah Mormont

 

Dropping rows based on conditions

In the TV Series, Robb Stark was killed at a wedding. Since he’s no more alive, he can’t be an heir to the Stark House. We should remove him from the data set.

Data : candidates
HouseName
LannisterJamie Lannister
StarkRobb Stark
StarkArya Stark
LannisterCersi Lannister
TargaryenDaenerys Targaryen
BaratheonRobert Baratheon
MormontJorah Mormont

Task: Prepare the new guest list with Robb Stark’s entry removed. This is how we do it :

Python CodePython-

# We have taken all candidates except "Rob Stark"
> candidates[candidates.Name!= "Robb Stark"]

R CodeR_120

> candidates[which(candidates$Name!="Robb Stark"),]

Output will be :

HouseName
LannisterJamie Lannister
StarkArya Stark
LannisterCersi Lannister
TargaryenDaenerys Targaryen
BaratheonRobert Baratheon
MormontJorah Mormont

 

Challenge 3 : Adding Column(s) Horizontally

At times, a data set is provided in different files. Each file contains some unique information. We are required to merge them in such a way that we can extract maximum information.

In such cases, how can we decide what kind of merging technique we should apply?

The answer is, it depends on the requirement of problem statement. Below are the different types of merge operations and insights on how to decide which merging technique to apply in various situations.

Sometimes the problem statement is straight forward.

Let’s take this case. The data frames required to combine are shown below. Military strength of each house is given in the same sequence as the sequence of houses in house data set. In such situation, we simply need to map the indexes with one another.

Data : house
HouseRegion
StarkThe North
TargaryenSlaver’s Bay
LannisterThe Westerlands
BaratheonThe Stormlands
TyrellThe Reach
Data : military
Military_Strength
20000
110000
60000
40000
30000

To get more information about each house’s military strength, we’ll simply add military data set to the house data set:

Python CodePython-

> house=pd.concat([house,military],axis=1)

R CodeR_120

> house=cbind(house,military)

The output will be:

HouseRegionMilitary_Strength
StarkThe North20000
TargaryenSlaver’s Bay110000
LannisterThe Westerlands60000
BaratheonThe Stormlands40000
TyrellThe Reach30000

Wasn’t that easy ? Actually, since the index alignment was similar that’s why we were able to merge these two data sets but this is always not the case. In fact, this would rarely happen in any data science competition. Challenge 4 shows the real trouble.

 

Challenge 4: Adding Column(s) based on common attribute

Now, how to merge the data sets if their indexes are not aligned?

In such situations, there is always a common attribute (key or keys) that aids us in combining data sets. But, you need to find the common attribute(s) present in the data frames. They could be column(s) or index(s).
Hint : Most of the times, common attribute will be some sort of ID. Keep an eye for it.

There can be different ways to merge the data depending upon the type of question asked. Here we have the house data set  and candidates data set. To show you different variations in applying these operations, I am going to solve different questions and situations around the data sets.

Think about the question for a while before diving into the solution.

Data : candidates
HouseName
LannisterJamie Lannister
StarkRobb Stark
StarkArya Stark
LannisterCersi Lannister
TargaryenDaenerys Targaryen
BaratheonRobert Baratheon
MormontJorah Mormont
Data : house
HouseRegionMilitary_Strength
StarkThe North20000
TargaryenSlaver’s Bay110000
LannisterThe Westerlands60000
BaratheonThe Stormlands40000
TyrellThe Reach30000

Question : Which of the candidate has the largest army ?

Just by looking at the heir’s name or his/her house name from the candidates data set will not answer this question. To answer this question, we have to extract the information about military strength from their corresponding houses.

How can we do that? Their index are not aligned, is there any common attribute between them?

Yes, both the data frames have “House” column. We will now see how can we merge the above data set on the basis of a common column.

Python CodePython-

> house = pd.merge(candidates,house,on="House",how="left")
#or
> house = pd.merge(candidates,house,left_on="House",right_on="House",how="left")

R CodeR_120

>house <- merge(candidates,house,by='House',all.x=TRUE,sort=FALSE)

The output will be:

HouseNameRegionMilitary_Strength
LannisterJamie LannisterThe Westerlands60000
StarkRobb StarkThe North20000
StarkArya StarkThe North20000
LannisterCersi LannisterThe Westerlands60000
TargaryenDaenerys TargaryenSlaver’s Bay110000
BaratheonRobert BaratheonThe Stormlands40000
MormontJorah MormontNaNNaN

By looking at the above data frame we can say that Daenerys Targaryen has the largest army. But, why did Jorah Mormont get NaN ?

There are a few things to notice here (refer above for codes) :

  • In first code we used on/by="House" which is the common attribute in both the data frames.
  • In case we merge data on the basis of a particular key, but they have different column name we can use left_on/by.x & right_on/by.y as shown in the second code.
  • In both the codes, we used how='left'/all.x=TRUE which uses the key (“House”) from the left frame only. We used left because we want information about all the candidates.
  • House Tyrell has no entry in the merged data frame because there was no candidate from house Tyrell, left merge took care of it.
  • House Mormont don’t have any information in house data frame, so candidate Jorah Mormont is assigned with NaN in the merged data frame.

Left merge is used here because we want information about the candidates. So, in the code above, candidates was the left data set and it uses “House” key from left frame only.

images

Left Merge

Note : I have used “merge” in headings and explanations. It is the general term that I will be using  for combining or joining data set. Where as “merge” in the code is the syntax that is being used for merging, joining or combining.

 

Question : List all the houses along with their military strength and the rightful heir

Now, we have to deliver information about each house. So, military strength is already present in house data set. All we have to do is to find the heir of each house who is present in candidates data set.

In simple words, we want to extract candidate’s name from  candidates data frame and place it with corresponding house. Let’s see.

Python CodePython-

> house = pd.merge(house,candidates,on="House",how="left")

This will provide all the candidates from each house along with their military strength with older candidates at the top. In candidates data frame, an older member of a particular house is placed above the others. We can also perform a right merge operation to do the same thing.

> house=pd.merge(candidates,house,on="House",how="right")

R CodeR_120

> house <- merge(candidates,house,by="House",all.y=TRUE,sort=FALSE)

The output will be:

HouseNameRegionMilitary_Strength
LannisterJamie LannisterThe Westerlands60000
LannisterCersi LannisterThe Westerlands60000
StarkRobb StarkThe North20000
StarkArya StarkThe North20000
TargaryenDaenerys TargaryenSlaver’s Bay110000
BaratheonRobert BaratheonThe Stormlands40000
TyrellNaNThe Reach30000

Here how="right"/all.y=True uses the key from right frame only.

images (1)

Right Merge


Question : List the houses that have atleast one heir?

Sometimes we just need the part of the combined data where information from both data set is present . In simple words, we can say that we want the intersection between the available information or elements of house & candidates data frame.

Rather than removing the observations having missing values we can directly merge the data set by using the following code :

Python CodePython-

> merge_inner = pd.merge(candidates,house,on="House",how="inner")

R CodeR_120

> merge_inner <- merge(candidates,house,by="House",all=FALSE,sort=FALSE)

The output will be:

HouseNameRegionMilitary_Strength
LannisterJamie LannisterThe Westerlands60000
LannisterCersi LannisterThe Westerlands60000
StarkRobb StarkThe North20000
StarkArya StarkThe North20000
TargaryenDaenerys TargaryenSlaver’s Bay110000
BaratheonRobert BaratheonThe Stormlands40000

Here how="inner"/all=FALSE uses intersection of keys from both frames.

download

Intersection merge

 

Question : List all the available information about houses and heirs?

We want all data about the candidates and houses together, regardless of availability of information. In simple words, we can say that we want the union of all the information of house & candidates data frames.

We  can do that with following codes :

Python CodePython-

> merge_outer = pd.merge(candidates,house,on="House",how="outer")

R CodeR_120

> merge_outer <- merge(candidates,house,all=TRUE,sort=FALSE)

HouseNameRegionMilitary_Strength
LannisterJamie LannisterThe Westerlands60000
LannisterCersi LannisterThe Westerlands60000
StarkRobb StarkThe North20000
StarkArya StarkThe North20000
TargaryenDaenerys TargaryenSlaver’s Bay110000
BaratheonRobert BaratheonThe Stormlands40000
MormontJorah MormontNaNNaN
TyrellNaNThe Reach30000

Here how="outer"/all=True uses union of key from both frames.

download (1)

Union Merge

 

Challenge 5 : Adding Column(s) based on observation serial (index)

Sometimes the common attribute or the key is index in both the data frames or index in one and a column in the other. We are going to handle a similar problem where both the keys are index. Later, you have to find the solution for the other case in the exercise given. Suppose the data sets are :

Data : candidates
Name
House
LannisterJamie Lannister
StarkRobb Stark
StarkArya Stark
LannisterCersi Lannister
TargaryenDaenerys Targaryen
BaratheonRobert Baratheon
MormontJorah Mormont
Data : house
RegionMilitary_Strength
House
StarkThe North20000
TargaryenSlaver’s Bay110000
LannisterThe Westerlands60000
BaratheonThe Stormlands40000
TyrellThe Reach30000

We can merge these two data frames by using the following codes :

Python CodePython-

> house = candidates.join(house,how='left')
#or
#right_index and left_index enables the merging on the index.
#Here also how will have four options 'inner','outer','right' and 'left'
> house = pd.merge(candidates, house, right_index=True, left_index=True,how ='left')

Join function is a convenient method for combining two data frames on the basis of index (by default). But, we can also merge if one of the keys is a column by using ‘on’ parameter.

R CodeR_120

> house<-merge(candidates,house,sort=FALSE,all.x=TRUE)

NameRegionMilitary_Strength
House
BaratheonRobert BaratheonThe Stormlands40000
LannisterJamie LannisterThe Westerlands60000
LannisterCersi LannisterThe Westerlands60000
MormontJorah MormontNaNNaN
StarkRobb StarkThe North20000
StarkArya StarkThe North20000
TargaryenDaenerys TargaryenSlaver’s Bay110000

Caution:

What if the data frames you are merging have a same column name other than the common attribute? What will happen? You can handle that very easily. Let’s see

Suppose you have the following data frames :

Data : house
HouseRegionMilitary_Strength
StarkThe North20000
TargaryenSlaver’s Bay110000
LannisterThe Westerlands60000
BaratheonThe Stormlands40000
TyrellThe Reach30000
Data : candidates
HouseNameRegion
LannisterJamie LannisterWestros
StarkRobb StarkNorth
StarkArya StarkWestros
LannisterCersi LannisterWestros
TargaryenDaenerys TargaryenMereene
BaratheonRobert BaratheonWestros
MormontJorah MormontMereene

If you run the merge code, it will add some suffix by default (_x,_y) but you can add your own suffix by using the following code :

Python CodePython-

> house = pd.merge(candidates,house, on='House', how='left', suffixes=('_left', '_right'))

R CodeR_120

> house <-merge(candidates,house,by="House",all.x=TRUE,
sort=FALSE,suffixes=c("_left","_right")

The output will be :

HouseNameRegion_leftRegion_rightMilitary_Strength
LannisterJamie LannisterWestrosThe Westerlands60000
StarkRobb StarkNorthThe North20000
StarkArya StarkWestrosThe North20000
LannisterCersi LannisterWestrosThe Westerlands60000
TargaryenDaenerys TargaryenMereeneSlaver’s Bay110000
BaratheonRobert BaratheonWestrosThe Stormlands40000
MormontJorah MormontMereeneNaNNaN

 

Exercise 3 : Comment the code that will left merge the following data set on the basis of “House” :

 

Data : house
RegionMilitary_Strength
House
StarkThe North20000
TargaryenSlaver’s Bay110000
LannisterThe Westerlands60000
BaratheonThe Stormlands40000
TyrellThe Reach30000
Data : candidates
Name
House
LannisterJamie Lannister
StarkRobb Stark
StarkArya Stark
LannisterCersi Lannister
TargaryenDaenerys Targaryen
BaratheonRobert Baratheon
MormontJorah Mormont

 

Challenge 6: Removing Duplicate Observations

Let me start this section with a question.

Again, you have two data sets house & candidates as shown below. You are asked about the details of the next heir for each house with his/her military strength (in this case next heir will be the older child).

Data : house
HouseRegionMilitary_Strength
StarkThe North20000
TargaryenSlaver’s Bay110000
LannisterThe Westerlands60000
BaratheonThe Stormlands40000
TyrellThe Reach30000
Data : candidates
HouseName
LannisterJamie Lannister
StarkRobb Stark
StarkArya Stark
LannisterCersi Lannister
TargaryenDaenerys Targaryen
BaratheonRobert Baratheon
MormontJorah Mormont

Now, lets apply merge operation using the “house” key from candidates frame :

Python CodePython-

#using left join
> heir = pd.merge(candidates,house,on="House",how='left')

R CodeR_120

> heir = merge(candidates,house,by="House",all.x=TRUE,sort=FALSE)

The output will be :

HouseNameRegionMilitary_Strength
LannisterJamie LannisterThe Westerlands60000
StarkRobb StarkThe North20000
StarkArya StarkThe North20000
LannisterCersi LannisterThe Westerlands60000
TargaryenDaenerys TargaryenSlaver’s Bay110000
BaratheonRobert BaratheonThe Stormlands40000
MormontJorah MormontNaNNaN

As you can see, we have repetition here which is not needed for the question I asked. You have to understand the structure of data sets and the method being applied for merging. Otherwise you can end up with a data set that you think is ready for analysis, but it is not the required one and can impact the output.

Just take a minute and think about the possible solutions for this problem.

To tackle this problem we can apply several methods. Here we are going to use the following method :

1. Removing the duplicates : We can remove the redundant entries from the candidates data frame by keeping the first entry of the candidate from the top. We are using the first entry because candidates are sorted age wise in descending order. So, the older child will be at higher position in the data frame. Below are the codes to implement it :
#keep = 'first' will keep the first occurrence
#keep = 'last' will keep the last occurrence
#keep = False will drop all the duplicates
# make sure to use inplace=True to save the modified data frame

Python CodePython-

> candidates.drop_duplicates(subset=['House'],keep='first',inplace=True)

R CodeR_120

> candidates <- candidates[!duplicated(candidates[,1]),]

Now we can merge the data frames :

#In python
> pd.merge(candidates,house,on="House",how='left')

#In R
> merge(candidates,house,by="House",all.x=TRUE,sort=FALSE)

The output will be :

HouseNameRegionMilitary_Strength
LannisterJamie LannisterThe Westerlands60000
StarkRobb StarkThe North20000
TargaryenDaenerys TargaryenSlaver’s Bay110000
BaratheonRobert BaratheonThe Stormlands40000
MormontJorah MormontNaNNaN

As we can see “Arya Stark” from house “Stark” and “Cersi Lannister” from house “Lannister” are removed in the resultant data set thus providing the rightful heirs from each house.

2. Aggregating the duplicates : Let understand this using an example.

Suppose if there are two members from a house then each member will have the military strength shown separately. As given in house data set, “Starks” have 20,000 soldiers. Then, both “Arya Stark” and “Robb Stark” gets 20,000 soldiers individually.

But, while forming the list of the rightful heirs from each house, the oldest heir gets the first right to claim the throne / title. Hence, we don’t need just one heir for a house.

How can we remove this redundancy? By simply removing the repetition as done in previous method will end up getting us wrong data, thus we have to add up the military strength for the members of same house. For different situation this technique will differ.

 

Challenge 7: Dropping Columns

After merging the data sets, we ended up with a data set consisting a large number of columns. Some of the columns are trivial or consist of information which is already present in other variables (co-related).

Suppose we have the data set :

Data : merged_candidates
HouseNameRegion_leftRegion_rightMilitary_Strength
LannisterJamie LannisterWestrosThe Westerlands60000
StarkRobb StarkNorthThe North20000
StarkArya StarkWestrosThe North20000
LannisterCersi LannisterWestrosThe Westerlands60000
TargaryenDaenerys TargaryenMereeneSlaver’s Bay110000
BaratheonRobert BaratheonWestrosThe Stormlands40000
MormontJorah MormontMereeneNaNNaN

The data frame shown above has two similar kind of variables Region_left and Region_right. Out of these two Region_right is insignificant we want to remove this variable :

Python CodePython-

# axis=1 will remove the mentioned columns
> merged_candidates = merged_candidates.drop('Region_right',axis=1)

R CodeR_120

> merged_candidates <- subset(merged_candidates,select=-c(Region_Right))

The output will be :

HouseNameRegion_leftMilitary_Strength
LannisterJamie LannisterWestros60000
StarkRobb StarkNorth20000
StarkArya StarkWestros20000
LannisterCersi LannisterWestros60000
TargaryenDaenerys TargaryenMereene110000
BaratheonRobert BaratheonWestros40000
MormontJorah MormontMereeneNaN

 

Challenge 8: Modifying Value(s) of a DataFrame

Modifying a particular element

Our work is not finished yet. There are some irregularities that we still have to take care of. A data set might contain incorrect information. In our merged data set from last section candidate “Arya Stark” is in “North” region but this data shows “Westros” as her region.

Data : merged
HouseNameRegion_leftMilitary_Strength
LannisterJamie LannisterWestros60000
StarkRobb StarkNorth20000
StarkArya StarkWestros20000
LannisterCersi LannisterWestros60000
TargaryenDaenerys TargaryenMereene110000
BaratheonRobert BaratheonWestros40000
MormontJorah MormontMereeneNaN

We have to change the region of “Arya Stark” to “North” :

Python CodePython-

# Arya Stark's region will be replaced from "Westros" to "North"
>merged[merged['Name']=="Arya Stark"]=merged[merged['Name']=="Arya Stark"].replace("Westros","North")

R CodeR_120

> merged$Region_left[which(merged$Name=="Arya Stark")] <- "North"

The output will look like :

HouseNameRegion_leftMilitary_Strength
LannisterJamie LannisterWestros60000
StarkRobb StarkNorth20000
StarkArya StarkNorth20000
LannisterCersi LannisterWestros60000
TargaryenDaenerys TargaryenMereene110000
BaratheonRobert BaratheonWestros40000
MormontJorah MormontMereeneNaN

After this operation region of “Arya Stark” is changed to “North”.

 

Modifying elements on the basis of a condition

In our data set, region “Westros” is a more general term consisting of many kingdoms so we have to change all the entries “Westros” to “Kings Landing”. Let’s see how to do it

Python CodePython-

# All the Westros are replaced by Kings Landing
> merged.replace("Westros","Kings Landing",inplace=True)

R CodeR_120

> merged$Region_left[which(merged$Region_left=="Westros")]="Kings Landing"

The output will be:

HouseNameRegion_leftMilitary_Strength
LannisterJamie LannisterKings Landing60000
StarkRobb StarkNorth20000
StarkArya StarkNorth20000
LannisterCersi LannisterKings Landing60000
TargaryenDaenerys TargaryenMereene110000
BaratheonRobert BaratheonKings Landing40000
MormontJorah MormontMereeneNaN

 

Challenge 9: Renaming Column Name(s)

We are just finishing up with all the challenges, lastly there might be a scenario where you want to change a column name. Let’s say in our data set, we want to change the column name from “Region_left” to “Region” .

Before changing the column name our data set was :

Data : merged
HouseNameRegion_leftMilitary_Strength
LannisterJamie LannisterKings Landing60000
StarkRobb StarkNorth20000
StarkArya StarkNorth20000
LannisterCersi LannisterKings Landing60000
TargaryenDaenerys TargaryenMereene110000
BaratheonRobert BaratheonKings Landing40000
MormontJorah MormontMereeneNaN

Changing the column name using the following code

Python CodePython-

> merged.columns = merged.columns.str.replace("Region_left","Region")

R CodeR_120

> colnames(merged)[which(names(merged)=="Region_left")] <- "Region"

The output will be:

HouseNameRegionMilitary_Strength
LannisterJamie LannisterKings Landing60000
StarkRobb StarkNorth20000
StarkArya StarkNorth20000
LannisterCersi LannisterKings Landing60000
TargaryenDaenerys TargaryenMereene110000
BaratheonRobert BaratheonKings Landing40000
MormontJorah MormontMereeneNaN

 

End Notes

If you have reached this line, I would like to applaud you for the patience and persistence you’ve shown in traversing over these challenges. I’ve considered all types of situations which could arise while merging, joining and subsetting data set. Hence, working on these challenges will make your knowledge comprehensive enough to deal with any situation.

For best results, make sure you do these 9 challenges and 4 exercises given. If there is anything else, you think could be made better, feel free to drop your suggestions.

Did you like reading this article ? Do you follow a different approach / package / library to perform these talks. I’d love to interact with you in comments.

You can test your skills and knowledge. Check out Live Competitions and compete with best Data Scientists from all over the world.

This is article is quiet old now and you might not get a prompt response from the author. We would request you to post this comment on Analytics Vidhya Discussion portal to get your queries resolved.

15 Comments

  • Dr.D.K.Samuel says:

    Really useful, Thanks

  • Nitesh says:

    In Challenge 2 : Dropping Observations, I think instead of following statement:

    > candidates=candidates[-c(1:2)]

    It should be

    > candidates = candidates[-c(1:2),]

    Reason: Without comma, given statement in article will remove columns instead rows which is the objective.

  • Nitesh says:

    “Dropping rows based on conditions”, In given statement, comma is missing which will help in eliminating the mentioned name record.

    It should be : candidates[which(candidates$Name!=”Robb Stark”),]

    instead of : candidates[which(candidates$Name!=”Robb Stark”)]

  • JC Martel says:

    Great post. Thank you for comparing the R and Python code and describing the steps.

  • Krishna says:

    Great post, Syed! Really helpful for R & Python beginners like me!

  • D Wald says:

    I just shared with my team. I think this will help to demystify these simple operations to the uninitiated.

  • RnMe says:

    Challenge 5
    Exercise 3

    house <- read.csv("house.csv")
    candidates <- read.csv("candidates.csv")
    library(dplyr)
    house <- rename(house,House=X)
    candidates <- rename(candidates,House=X)
    out <- merge(house,candidates,by="House",all.x=TRUE)

    • NSS says:

      @RnMe

      I am afraid you are wrong. While reading house and candidates dataset, remember House won’t be read as a column. So rename function will throw an error.

      Keep trying.

      Regards,
      Neeraj

  • SSajan says:

    Here’s what I got for Challenge 5, exercise 3.
    #Open CSV Files in R
    candidates <- read.csv("candidates.csv")
    house <- read.csv("house.csv")

    #View data set
    View(candidates)
    View(house)

    #Rename Columns
    colnames(house)<-c("House", "Region", "Military_Strength")
    colnames(candidates)<-c("House","Name")

    #Remove first observation
    candidates=candidates[-c(1:1),]
    house=house[-c(1:1),]

    #View Dataset
    View(house)
    View(candidates)

    #Left Merge the data sets
    house<-merge(candidates, house, by="House", all.x = TRUE, sort=FALSE)

    #View Dataset
    View(house)

    Is this the best way, or is there a more efficient method?

    Also, thanks for the great tutorial! This is great for beginners. One bit of feedback: For the left and right merge exercises, make it explicit that the all.x refers to the left data frame and the all.y refers to the right data frame.

  • hitesh says:

    Can we have the solutions please for both Python and R?

Leave A Reply

Your email address will not be published.

Join 100000+ Data Scientists in our Community

Receive awesome tips, guides, infographics and become expert at:




 P.S. We only publish awesome content. We will never share your information with anyone.

Subscribe!