One of the most common operation for any analyst is merging datasets. As per my estimate, an analyst spends at least 10 – 20% of his productive time joining and merging datasets. If you spend so much time doing joins / merges, it is extremely critical that you join datasets in most efficient manner. This is the thought behind this post.
Traditionally, databases have been designed in a manner where tables capture details of individual functional area.
Example below shows two tables, one capturing patient details in a clinic (from one time registrations) and second table showing their appointment details.
In order to analyze things like:
- Which customer has walked in how many times in last month?
- Which kind of customers have walked in more last month?
- What are the common reasons for people walking in?
we need to join the two tables.
I’ll cover various ways in which you can do this in SAS:
This is the most common approach used in SAS. In order to use data step command, we need to sort the datasets first and then merge using the common key:The control statement defines the kind of merge. By specifying “if a and b”, values present in both the tables will be picked.
If you are used to writing SQL, PROC SQL might be the easiest way to learn joins in SAS
This is one of the latest ways I have learnt, but the most efficient one. Using this method, we convert the smaller file into a format.
The first step creates a dataset format1 from patient_details. PROC FORMAT then converts it into a format. Finally we use
This way to join datasets typically takes 30 – 40% lower computation time compared to the two approaches mentioned above.
Since this might look advanced SAS, I will devote one more post explaining formats in more details.
In the meanwhile, if you know of any other way to join tables, please let me know.