Apache Pig group example

posted on Nov 20th, 2016

Apache Pig

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMSs. Pig Latin can be extended using User Defined Functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

Prerequisites

1) A machine with Ubuntu 14.04 LTS operating system

2) Apache Hadoop 2.6.4 pre-installed (How to install Hadoop on Ubuntu 14.04)

3) Apache Pig pre-installed (How to install Pig on Ubuntu 14.04)

Pig Group Example

The group operator is used to group the data in one or more relations by a key. It collects the tuples that share the same key value into a bag.
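
To illustrate the shape of the output (a rough sketch using a hypothetical relation R, not part of this example): each grouped tuple holds the key in a field called group, followed by a bag of the matching input tuples.

-- hypothetical relation R with schema (k:int, v:chararray) and rows (1,a), (1,b), (2,c)
G = GROUP R BY k;
-- DUMP G would print one tuple per distinct key:
-- (1,{(1,a),(1,b)})
-- (2,{(2,c)})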

Step 1 - Change the directory to /usr/local/pig/bin

$ cd /usr/local/pig/bin

Step 2 - Enter the Grunt shell in MapReduce mode.

$ ./pig -x mapreduce

Step 3 - Create a student_details.txt file.

student_details.txt

Step 4 - Add the following lines to the student_details.txt file.

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

Step 5 - Copy student_details.txt from the local file system to HDFS. In my case, the student_details.txt file is stored in the /home/hduser/Desktop/PIG directory.

$ hdfs dfs -copyFromLocal /home/hduser/Desktop/PIG/student_details.txt /user/hduser/pig/

Step 6 - Load Data.

student_details = LOAD 'hdfs://localhost:9000/user/hduser/pig/student_details.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
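
To verify the schema of the loaded relation, you can describe it (not part of the original steps, but a quick sanity check):

DESCRIBE student_details;
-- expected output, roughly:
-- student_details: {id: int,firstname: chararray,lastname: chararray,age: int,phone: chararray,city: chararray}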

Step 7 - Group data by age as a key.

group_data = GROUP student_details BY age;

DUMP group_data;

DESCRIBE group_data;

ILLUSTRATE group_data;
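
A grouped relation is usually consumed with FOREACH. As a minimal sketch (the alias count_by_age is mine, not from the original post), this counts the students in each age group:

count_by_age = FOREACH group_data GENERATE group AS age, COUNT(student_details) AS total;

DUMP count_by_age;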

Step 8 - Group data by multiple keys.

group_multiple = GROUP student_details BY (age, city);

DUMP group_multiple;
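
When grouping on multiple keys, the group field is a tuple, so its parts can be projected individually. A minimal sketch (the alias flat_groups is mine, not from the original post):

flat_groups = FOREACH group_multiple GENERATE group.age, group.city, COUNT(student_details);

DUMP flat_groups;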

Step 9 - Group by all.

group_all = GROUP student_details ALL;

DUMP group_all;
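
GROUP ... ALL puts the entire relation into a single group, which is handy for global aggregates. A minimal sketch (the alias total is mine, not from the original post) that should print (8) for the eight rows above:

total = FOREACH group_all GENERATE COUNT(student_details);

DUMP total;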

Grouping Two Relations using Cogroup

The cogroup operator works in much the same way as the group operator. The only difference between the two is that the group operator is normally used with a single relation, while the cogroup operator is used in statements involving two or more relations.
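
As a rough sketch of the output shape (using hypothetical relations A and B, not part of this example), each cogrouped tuple holds the key followed by one bag per input relation; a bag is empty when that relation has no rows for the key.

-- hypothetical relations A (k:int, x:chararray) and B (k:int, y:chararray)
C = COGROUP A BY k, B BY k;
-- DUMP C would print tuples shaped like:
-- (1,{(1,p)},{(1,q),(1,r)})
-- (2,{(2,s)},{})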

Step 10 - Create an employee_details.txt file.

employee_details.txt

Step 11 - Add the following lines to the employee_details.txt file.

001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai

Step 12 - Create a student_details.txt file.

student_details.txt

Step 13 - Add the following lines to the student_details.txt file.

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

Step 14 - Copy student_details.txt and employee_details.txt from the local file system to HDFS. In my case, the student_details.txt and employee_details.txt files are stored in the /home/hduser/Desktop/PIG directory.

$ hdfs dfs -copyFromLocal /home/hduser/Desktop/PIG/student_details.txt /user/hduser/pig/
$ hdfs dfs -copyFromLocal /home/hduser/Desktop/PIG/employee_details.txt /user/hduser/pig/

Step 15 - Load student_details and employee_details data.

student_details = LOAD 'hdfs://localhost:9000/user/hduser/pig/student_details.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

employee_details = LOAD 'hdfs://localhost:9000/user/hduser/pig/employee_details.txt' USING PigStorage(',') AS (id:int, name:chararray, age:int, city:chararray);

Step 16 - Cogroup by student age and employee age as keys.

cogroup_data = COGROUP student_details BY age, employee_details BY age;

DUMP cogroup_data;
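
Like a grouped relation, a cogrouped relation is usually consumed with FOREACH. A minimal sketch (the alias age_counts is mine, not from the original post) that counts students and employees per age:

age_counts = FOREACH cogroup_data GENERATE group AS age, COUNT(student_details) AS students, COUNT(employee_details) AS employees;

DUMP age_counts;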
