Sapan Diwakar

Software developer

An introduction to Pig Latin

Pig, originally developed at Yahoo Research around 2006, is a platform for creating and executing map-reduce jobs on very large data sets. Its language is called Pig Latin. Pig Latin follows a hybrid approach, combining the high-level declarative style of SQL with the low-level procedural style of map-reduce. Writing a program in Pig Latin is like writing a query plan. The language is built around parallelism: it provides only operations that can be easily parallelized (e.g. equi-joins, grouping, filtering) and leaves non-parallelizable logic to user-defined functions, which can be used with any of the Pig Latin operations. Programs are then compiled into Hadoop map-reduce jobs.

In addition, Pig introduces a novel debugging environment: it generates a small sandbox data set that closely resembles the real data, so the results of a Pig Latin program can be previewed quickly instead of running it on the original big data, which might take much longer. The most important data types that Pig Latin introduces are Atom, Tuple, Bag and Map, which can be used with operations like LOAD, FILTER, COGROUP, FOREACH, GROUP, JOIN and STORE.
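
To see how these types nest, consider a single hypothetical value (the names and values here are invented for illustration). An Atom is a simple value such as 'alice'; a Tuple is a sequence of fields; a Bag is a collection of tuples; and a Map associates keys with values:

('alice', {('lakers'), ('iPod')}, ['age' -> 20])  

Here the first field is an atom, the second is a bag containing two tuples, and the third is a map from the key 'age' to the atom 20. Fields of a tuple can themselves be bags or maps, which is what makes the data model fully nested.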

Pig also makes it very easy to define map-reduce jobs, as opposed to raw Hadoop, where defining such jobs was difficult and required writing mostly non-reusable code. It is implemented on top of Hadoop, while preserving the architectural option of plugging in another map-reduce implementation at some later point in time.

The map-reduce model has its own set of limitations. Its one-input, two-stage data flow is extremely rigid. To perform tasks with a different data flow, e.g. joins or n-stage pipelines, inelegant workarounds have to be devised.

For example, suppose we have a table containing the url, category and pagerank of each page indexed by Google. A SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category would be:

SELECT category, AVG(pagerank)  
FROM urls WHERE pagerank > 0.2  
GROUP BY category HAVING COUNT(*) > 1000000  

An equivalent Pig Latin program is the following:

good_urls = FILTER urls BY pagerank > 0.2;  
groups = GROUP good_urls BY category;  
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;  
output = FOREACH big_groups GENERATE  
            category, AVG(good_urls.pagerank);

This reads just like a query plan: we first filter the urls by pagerank, keeping only those with a rank greater than 0.2. We then group those urls by category and keep the big groups that contain more than 10^6 good urls. The final step produces the output. Note that although Pig Latin programs supply an explicit sequence of operations, it is not necessary that the operations be executed in that order.

Pig is designed to support ad-hoc data analysis. It doesn't need database tables to run queries; it can run directly on files. The only thing Pig Latin needs is a function that can parse the input data and supply it in the desired format, which means we don't need to import the file into any database and spend time building indexes. Similarly, the output of a Pig program can be formatted in the manner of the user's choosing, according to a user-provided function that converts tuples into a byte sequence.
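
For example, writing a result back out could look like the following sketch, where myStore() stands in for a hypothetical user-defined serialization function and query_results is an assumed relation name:

STORE query_results INTO 'myoutput' USING myStore();  

If no USING clause is given, Pig falls back to a default plain-text representation, so the custom function is only needed when a specific output format is required.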

The reasons that conventional database systems do require importing data into system-managed tables are to guarantee that the transactions executed on them are consistent, to enable efficient lookups using indexes, and to record and curate the data so that users can make sense of it. Since Pig is used for read-only analysis, it requires neither a complex database to guarantee consistency nor indexes for efficient reads.

Another difference between Pig and traditional databases is the data model, which is fully nested in Pig as compared to the flat data model of traditional databases. Let's begin with the basics of Pig Latin:

We can load data using LOAD. Here we read a file query_log.txt using a user-defined function myLoad(), which outputs three fields: userId, queryString and timestamp.

queries = LOAD 'query_log.txt' USING myLoad()  
              AS (userId, queryString, timestamp);

After loading the data, we can process it on a per-tuple basis using FOREACH. Here we process each tuple of the bag queries and produce an output tuple using a user-defined function expandQuery():

expanded_queries = FOREACH queries GENERATE  
                       userId, expandQuery(queryString);  
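
If expandQuery() returns a bag of expansions for each query, each output tuple will contain a nested bag. Pig Latin's FLATTEN can unnest that bag so that every expansion becomes its own output tuple (same assumed field names as above):

expanded_queries = FOREACH queries GENERATE  
                       userId, FLATTEN(expandQuery(queryString));  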

We have already seen FILTER, which removes unwanted data, and GROUP, which groups data by a field. GROUP is actually a special case of the more general COGROUP operation provided by Pig Latin, which groups together the tuples from one or more data sets that are related in some way.
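
In fact, applied to a single data set the two keywords are interchangeable: COGROUP over one input is exactly GROUP. Using the good_urls relation from earlier, the following produces the same result as GROUP good_urls BY category:

groups = COGROUP good_urls BY category;  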

grouped_data = COGROUP students BY course, professors BY course;  

Although COGROUP gives a lot of flexibility, there are cases where we want just an equi-join between two data sets, which can be achieved using JOIN.

joined_data = JOIN students BY course, professors BY course;  

The basic difference between JOIN and COGROUP is that COGROUP maps each group key to a set/bag of matching tuples from each data set, whereas the JOIN result contains pairs of tuples from both data sets that match on the join attribute.
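
To make this concrete on hypothetical data, suppose students contains the tuples (alice, CS101) and (bob, CS101), and professors contains (carol, CS101). Then the two results above would look roughly like this (values invented for illustration):

grouped_data: (CS101, {(alice, CS101), (bob, CS101)}, {(carol, CS101)})  
joined_data:  (alice, CS101, carol, CS101)  
              (bob, CS101, carol, CS101)  

COGROUP keeps the two inputs separated in their own bags per key, which lets a user-defined function process the whole group at once; JOIN flattens the cross-product of the matching bags into individual output tuples.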