In the previous article, we discussed the Hadoop ecosystem. We also spoke about the two most heavily used Hadoop tools, PIG and HIVE. Both languages have their followers, and in general there is no clear preference between the two. However, when the team using these tools is more programming oriented, PIG is sometimes chosen over HIVE because it gives more freedom while coding. When the team is not very programming savvy, HIVE is probably the better option, given its similarity to SQL. Queries in PIG are written in PIG Latin. In this article, we will introduce you to PIG Latin using a simple practical example.
Installation of PIG
The PIG engine runs on the client machine. It is simply an interpreter that converts your simple code into complex MapReduce operations. These MapReduce jobs are then handled by the distributed Hadoop network. Note that the network will not even know that the query was executed from a PIG engine. PIG stays at the user interface and is meant to make it easier for the user to code.
Run the following steps in your shell to install PIG:
If you are thinking of running PIG on Windows, you should just set up a virtual machine running Linux and work on that. You can use VMware Player or Oracle VirtualBox to start one.
After installing the PIG package, you can start with the grunt shell.
Once you see “grunt>”, you can start coding on PIG.
You are the lead of analytics at a retail store named XYZ. XYZ maintains a record of all customers who shop at the store. Your task for this exercise is to create a new column called tax on sale, which is 5% of the sale. Then filter out customers for whom the tax amount is less than $35. Once this subsetting is done, choose the two customers with the lowest customer numbers. Following is a sample table for the retail store, which is saved as a .csv file.
Writing a query in PIG Latin
Let’s build this query step by step. Here are the steps to follow:
Step 1 : Load the dataset into a format PIG understands, in temporary storage that the PIG query can reference directly
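A minimal sketch of such a load statement follows. The file name retail.csv, the alias sales, and the field names custno and sale are hypothetical. The sketch also assumes a comma-separated file with no header row whose columns are customer number, sale, and age, in that order.

```pig
-- Hypothetical file and field names; adjust them to match your csv.
-- Assumed column order in the file: customer number, sale, age.
-- Only the first two columns are declared, so the trailing age column
-- is left out of the relation.
sales = LOAD 'retail.csv' USING PigStorage(',') AS (custno:int, sale:double);
```

Typing DESCRIBE sales; at the grunt prompt prints the schema PIG has recorded for the relation.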
Note that the load command leaves out the variable age. While working with Big Data, you need to be very specific about the variables you use, so make sure you pick only those variables that matter to your code.
Step 2 : Create a new table with tax values.
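One way to add the tax column is a FOREACH ... GENERATE statement. This sketch assumes a relation named sales with the hypothetical fields custno and sale is already loaded in the session. The column name tax_on_sale is likewise an assumption.

```pig
-- Assumes sales(custno:int, sale:double) already exists in the session.
-- sale * 0.05 computes 5% of the sale; AS names the new column.
tax = FOREACH sales GENERATE custno, sale, sale * 0.05 AS tax_on_sale;
```

Typing DUMP tax; at the grunt prompt runs the statement and prints the resulting rows.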
This command generates a new table called tax, which has all three columns. The table will now look something like the one below:
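With the tax column in place, the task description calls for two further operations before the final store: filtering on the tax amount, then picking the two lowest customer numbers. The sketch below is one way to write them. It assumes a relation named tax with the hypothetical fields custno, sale, and tax_on_sale, and it reads "filter out" as dropping the rows whose tax is below 35.

```pig
-- Assumes tax(custno, sale, tax_on_sale) already exists in the session.
-- Keep only rows whose tax is at least 35 (rows below 35 are dropped).
filtered = FILTER tax BY tax_on_sale >= 35.0;

-- Sort by customer number, lowest first, then keep the first two rows.
sorted = ORDER filtered BY custno ASC;
top2 = LIMIT sorted 2;
```

If the intent is instead to keep only the rows below 35, the condition would flip to tax_on_sale < 35.0.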
The output of this command will be the yellow cells in the following table:
Step 5 : Store the temporary file into a permanent csv file
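A sketch of the store statement, assuming the two selected rows sit in a relation named top2 and using a hypothetical output path. Note that STORE writes a directory of part files rather than a single file, and the output directory must not already exist.

```pig
-- Assumes top2 holds the two selected customers.
-- 'top2_customers' is a hypothetical output directory.
STORE top2 INTO 'top2_customers' USING PigStorage(',');
```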
At this step our task is complete, and you will get the required customer numbers with all their details. Following is the complete code, which can be executed at once:
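As an illustration, the whole flow can be sketched as a single script. All file names, aliases, and field names here are hypothetical. The script assumes a comma-separated file with no header row whose columns are customer number, sale, and age, and it treats "filter out" as dropping rows with tax below 35.

```pig
-- retail_tax.pig (hypothetical script name)

-- Step 1: load customer number and sale; the trailing age column is left out.
sales = LOAD 'retail.csv' USING PigStorage(',') AS (custno:int, sale:double);

-- Step 2: add the tax column, 5% of the sale.
tax = FOREACH sales GENERATE custno, sale, sale * 0.05 AS tax_on_sale;

-- Step 3: drop rows whose tax is below 35.
filtered = FILTER tax BY tax_on_sale >= 35.0;

-- Step 4: take the two lowest customer numbers.
sorted = ORDER filtered BY custno ASC;
top2 = LIMIT sorted 2;

-- Step 5: write the result out as comma-separated part files.
STORE top2 INTO 'top2_customers' USING PigStorage(',');
```

Saved to a file, the script can be run in one go with pig -x local retail_tax.pig for local mode, or pig retail_tax.pig to run against the cluster.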
In this article, we learned how to write basic code in PIG Latin. We have restricted this article to simple filter and sort statements; we will talk about more complex merge and other statements in a few of the coming articles.
Did you find the article useful? Share with us any practical application of PIG you have encountered in your work. Do let us know your thoughts about this article in the box below.