Combining datasets in SAS – simplified!
One of the most common task, every analyst performs multiple times in a project is combining data sets. There are various ways to combine datasets in SAS, which are different from the way datasets are combined in SQL. Due to this, a lot of people with experience in SQL remain confused about how they can combine datasets in SAS.
For the same reason, it is also one of the most commonly asked question in SAS related interviews. Which method to use entirely depends on the business need and business scenario. Let’s look at some test cases to make it clear:
- I have two Sales datasets, one is YTD (01-Jan-14 to 30-Nov-14) and another one for MTD (01-Dec-14 to 18-Dec-14). Now, to prepare annual report till date, we need to do append MTD sales data to YTD dataset. This is an example of Appending a dataset to another.
- A sales oriented company has four regions N, E, S and W. Each region has their own manpower data set. Now to prepare a data set at company level, we need to combine all regional data sets and create a new data set All. This is an example of Concatenating.
- In an Insurance Company, all Front line managers are assigned a Sales target (this is stored in a dataset Target). Their actual performance is stored in another data set called Sales. Now to validate performance against target, we need to perform horizontal mapping between the datasets Target and Sales based on the common field in both the tables and create a third data set called Performance. In Performance data set, there is actual sales and target sales mapped at employee level. This is an example of Merging.
In SAS, we can perform all these operation. Let’s look at these methodology one by one. In this article, we would go into detail of Appending and Concatenation only and will discuss Merging in the next article.
To append one data set at the end of another, we use the APPEND procedure. Append procedure adds the observations of one data set at the end of other data set.
PROC Append BASE= <Data-Set> DATA=<Data-Set>;
Here BASE is the data set to which observation of data set DATA is added.
- Append procedure only works for two data sets.
- Observations of BASE data set are not read in the operation.
Example – 1 Here we have three data sets YTD, Apr and May.
Question -1:- Append Apr data set to YTD:-
Above code has appended Apr data set to YTD.
Question -2:- Append May data set to YTD:-
Proc Append Base=YTD Data=May;
Above code will not append May data set to YTD and you will get an error and a warning message, like shown below:
Above error and warning comes because the structure of these two datasets is not similar. May has one additional variable Profit. Now to append May data set to YTD, we need to use FORCE option in PROC append procedure.
Proc Append Base=YTD Data=May Force;
This will append May dataset to YTD dataset..
Concatenate copies all observations from two or more data sets into a new data set. SET statement in DATA step is used to perform concatenation.
Set DataSet-1 DataSet-2….DataSet-n;
<additional SAS Statements;>
- Any number of datasets can be read
- Input datasets are read and they appear in output dataset in the same order as they appear in SET statement
- If the input dataset do not have similar table structure, then the output dataset will contain unique variables of all the input datasets with missing values for observations which are not present in any of the datasets.
- If Input datasets have common variables then all instances of common variable in all data sets must have the same type attribute else SAS stops processing. Data step will produce an error message stating that the variables are incompatible.
- If the length attribute of common variable is different, SAS takes the length from the first data set that contains the common variable.
Example -2 Here we have two datasets, one for Jan and another one for Feb. Here Feb dataset has one additional variable Profit.
Above you can see that output dataset YTD has 2 observations and 3 variables.
Example -3 We have three data sets Jan, Feb and Mar. All three data sets have different structure. Variable month has different length, Sales is available in Jan and Feb and the field Profit is present Feb and Mar only.
In output dataset, you can see the following observation:-
- Values of variable Month for Feb and Mar dataset are truncated to length of first dataset Jan.
- Jan has missing value (., Period) for Profit because variable Profit is not available with Jan data set.
Above, we have done concatenation based on position of data sets in SET statement. There is another method to perform concatenation, if all input data sets have one or more common variable and this concatenation is based on that common variable. it is called INTERLEAVING. In this case, we specify the common variable using By statement.
Set DataSet-1 DataSet-2….DataSet-n;
<additional SAS Statements;>
Before using this step, input data sets must be sorted on common variable(s).
Example -4 We have two data sets First and Second with a common variable Code. Now as we mentioned before both data sets must be sorted, we have used PROC Sort procedure to sort.
Question 5- Concatenate these two data sets and create a data set THIRD.
Above command has created a dataset THIRD, which has data stored and sorted on the variable CODE.
Let’s look at how the above code works:-
- First, it takes the value of common variable from first observation of both datasets and compares which one is lower. SAS then copies that observation to output dataset and moves the record pointer to next observation of same input dataset and keeps the dataset record pointer for second dataset at first position only. In current example, It picks value 1 from dataset First and 2 from Second (for variable Code), then compares and copies the observation of data set First to THIRD and moves the record pointer of first to next observation.
- Now, above process is repeated again by comparing second observation of First dataset to First observation of Second dataset till the EOF reached for both data sets. In current example, It has Code value 4 from dataset First and 2 for Second then compare and copy observation of dataset Second to THIRD and move record pointer of Second to next observation.
In this article, we looked at various methods to combine data sets namely Appending, Concatenating and Interleaving. Individual methods are explained in detail with examples along with the business scenarios that go with these options. We have also explained, how these methods works and what would be the output if we go with any of these methods. In one of the future posts, we will discuss another way to combine datasets – MERGING.
Hope you found this article useful. We have simplified this topic and have tried to present it in a very simple and lucid manner. If you need any more help with combining data sets, please feel free to ask your questions through comments below.