Apache Pig eval function examples

posted on Nov 20th, 2016

Apache Pig

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMSs. Pig Latin can be extended using User Defined Functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

Prerequisites

1) A machine running the Ubuntu 14.04 LTS operating system

2) Apache Hadoop 2.6.4 preinstalled (How to install Hadoop on Ubuntu 14.04)

3) Apache Pig preinstalled (How to install Pig on Ubuntu 14.04)

Pig Eval Functions Examples

Apache Pig provides several kinds of built-in functions: eval, load/store, math, string, bag, and tuple functions. The eval functions covered in this post are listed below.

Function                    Description

1) AVG - To compute the average of the numerical values within a bag.

2) MAX - To calculate the highest value for a column (numeric values or chararrays) in a single-column bag.

3) MIN - To get the minimum (lowest) value (numeric or chararray) for a certain column in a single-column bag.

4) COUNT - To count the number of tuples in a bag.

5) DIFF - To compare two bags (fields) in a tuple. DIFF() takes two fields of a tuple as input. If the bags match, it returns an empty bag. If they do not match, it returns a bag containing the tuples that exist in one field (bag) but not in the other.

6) SUBTRACT - To subtract one bag from another. SUBTRACT() takes two bags as input and returns a bag containing the tuples of the first bag that are not in the second bag.

7) IsEmpty - To check whether a bag or map is empty.

8) PluckTuple - To select columns by prefix, for example to tell apart the columns of the two input schemas after a join. You first define the function with a string prefix, then apply it to keep only the columns of a relation whose names begin with that prefix.

Step 1 - Change the directory to /usr/local/pig/bin

$ cd /usr/local/pig/bin

Step 2 - Start the Grunt shell in MapReduce mode.

$ ./pig -x mapreduce

Step 3 - Create a student_gpa.txt file.

student_gpa.txt

Step 4 - Add the following lines to the student_gpa.txt file. Save and close.

001,Rajiv,Reddy,21,9848022337,Hyderabad,89
002,siddarth,Battacharya,22,9848022338,Kolkata,78
003,Rajesh,Khanna,22,9848022339,Delhi,90
004,Preethi,Agarwal,21,9848022330,Pune,93
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75
006,Archana,Mishra,23,9848022335,Chennai,87
007,Komal,Nayak,24,9848022334,trivendram,83
008,Bharathi,Nambiayar,24,9848022333,Chennai,72

Step 5 - Copy student_gpa.txt from the local file system to HDFS. In my case, the student_gpa.txt file is stored in the /home/hduser/Desktop/PIG/ directory.

$ hdfs dfs -copyFromLocal /home/hduser/Desktop/PIG/student_gpa.txt /user/hduser/pig/

Step 6 - Load student data.

student_details = LOAD 'hdfs://localhost:9000/user/hduser/pig/student_gpa.txt'
USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray,
age:int, phone:chararray, city:chararray, gpa:int);
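
To make the schema in Step 6 concrete, here is a rough Python model of what the LOAD statement does with one input line: split on the comma, then cast each field to its declared type. This is only an illustration, not Pig code. The names SCHEMA and parse_line are made up for this sketch, and None stands in for the null that Pig produces when a field cannot be cast.

```python
# Hypothetical model of the Step 6 schema; names here are illustrative only.
SCHEMA = [("id", int), ("firstname", str), ("lastname", str),
          ("age", int), ("phone", str), ("city", str), ("gpa", int)]

def parse_line(line, delimiter=","):
    """Split one input line and cast each field to its declared type."""
    record = {}
    fields = line.rstrip("\n").split(delimiter)
    for (name, caster), raw in zip(SCHEMA, fields):
        try:
            record[name] = caster(raw)
        except ValueError:
            record[name] = None  # stand-in for a Pig null on a failed cast
    return record

row = parse_line("001,Rajiv,Reddy,21,9848022337,Hyderabad,89")
print(row)
```

Because id is declared as int, the value 001 loads as the number 1, while phone stays a string because it is declared as chararray.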

AVG

student_group_all = Group student_details All;

Dump student_group_all;

student_gpa_avg = foreach student_group_all Generate (student_details.firstname, student_details.gpa), AVG(student_details.gpa);

Dump student_gpa_avg;
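
As a sanity check on the AVG result, the same calculation can be sketched in plain Python. This is a model of the semantics, not Pig code: the group is represented as a tuple holding the key "all" and a list of tuples, and the helper avg is a hypothetical name. It skips null (None) values, which is how Pig's AVG treats nulls.

```python
# (firstname, gpa) pairs taken from student_gpa.txt
students = [("Rajiv", 89), ("siddarth", 78), ("Rajesh", 90), ("Preethi", 93),
            ("Trupthi", 75), ("Archana", 87), ("Komal", 83), ("Bharathi", 72)]

# GROUP ... ALL puts every tuple into a single group keyed by "all".
group_all = ("all", students)

def avg(bag):
    """Mean of the non-null values in a bag, or None for an empty bag."""
    values = [v for v in bag if v is not None]
    return sum(values) / len(values) if values else None

gpa_bag = [gpa for _, gpa in group_all[1]]
student_gpa_avg = avg(gpa_bag)
print(student_gpa_avg)  # 83.375
```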

MAX

student_gpa_max = foreach student_group_all Generate (student_details.firstname, student_details.gpa), MAX(student_details.gpa);

Dump student_gpa_max;

MIN

student_gpa_min = foreach student_group_all Generate (student_details.firstname, student_details.gpa), MIN(student_details.gpa);

Dump student_gpa_min;
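
MAX and MIN accept chararrays as well as numbers. The Python sketch below models both cases on the sample data. It assumes chararrays are ordered by character code, as Java strings are, so lowercase letters sort after uppercase ones; treat that ordering as an assumption of this sketch rather than a statement about every Pig version.

```python
# Columns taken from student_gpa.txt
gpas = [89, 78, 90, 93, 75, 87, 83, 72]
firstnames = ["Rajiv", "siddarth", "Rajesh", "Preethi",
              "Trupthi", "Archana", "Komal", "Bharathi"]

# Numeric column: plain numeric ordering.
gpa_max = max(gpas)
gpa_min = min(gpas)

# Chararray column: ordering by character code (case-sensitive).
name_max = max(firstnames)
name_min = min(firstnames)

print(gpa_max, gpa_min)    # 93 72
print(name_max, name_min)  # siddarth Archana
```

Note that "siddarth" is the maximum only because it starts with a lowercase letter; with consistent capitalisation the result would be "Trupthi".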

COUNT

student_count = foreach student_group_all Generate COUNT(student_details.gpa);

Dump student_count;
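
One detail worth knowing about COUNT is how it treats nulls: a tuple is not counted when its first field is null, while COUNT_STAR counts every tuple. The Python sketch below models that difference. The helper names count and count_star are made up, and the extra row with a missing GPA is hypothetical; it is not in student_gpa.txt.

```python
def count(bag):
    """Model of COUNT: skip tuples whose first field is null (None)."""
    return sum(1 for t in bag if t[0] is not None)

def count_star(bag):
    """Model of COUNT_STAR: count every tuple, nulls included."""
    return len(bag)

# Single-column bag of GPAs from student_gpa.txt
gpa_bag = [(89,), (78,), (90,), (93,), (75,), (87,), (83,), (72,)]

# Hypothetical extra row with a missing GPA, for illustration only
with_null = gpa_bag + [(None,)]

print(count(gpa_bag))         # 8
print(count(with_null))       # 8
print(count_star(with_null))  # 9
```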

Step 7 - Create an emp_sales.txt file.

emp_sales.txt

Step 8 - Add the following lines to the emp_sales.txt file. Save and close.

1,Robin,22,25000,sales
2,BOB,23,30000,sales
3,Maya,23,25000,sales
4,Sara,25,40000,sales
5,David,23,45000,sales
6,Maggy,22,35000,sales

Step 9 - Create an emp_bonus.txt file.

emp_bonus.txt

Step 10 - Add the following lines to the emp_bonus.txt file. Save and close.

1,Robin,22,25000,sales
2,Jaya,23,20000,admin
3,Maya,23,25000,sales
4,Alia,25,50000,admin
5,David,23,45000,sales
6,Omar,30,30000,admin

Step 11 - Copy emp_bonus.txt and emp_sales.txt from the local file system to HDFS. In my case, the emp_bonus.txt and emp_sales.txt files are stored in the /home/hduser/Desktop/PIG/ directory.

$ hdfs dfs -copyFromLocal /home/hduser/Desktop/PIG/emp_sales.txt /user/hduser/pig/

$ hdfs dfs -copyFromLocal /home/hduser/Desktop/PIG/emp_bonus.txt /user/hduser/pig/

Step 12 - Load employee sales data.

emp_sales = LOAD 'hdfs://localhost:9000/user/hduser/pig/emp_sales.txt' USING PigStorage(',') AS (sno:int, name:chararray, age:int, salary:int, dept:chararray);

Step 13 - Load employee bonus data.

emp_bonus = LOAD 'hdfs://localhost:9000/user/hduser/pig/emp_bonus.txt' USING PigStorage(',') AS (sno:int, name:chararray, age:int, salary:int, dept:chararray);

DIFF

cogroup_data = COGROUP emp_sales by sno, emp_bonus by sno;

Dump cogroup_data;

diff_data = FOREACH cogroup_data GENERATE DIFF(emp_sales,emp_bonus);

Dump diff_data;
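
To see what to expect from DIFF on this data, here is a plain Python model of the COGROUP by sno followed by DIFF. It is a sketch of the semantics only: the helper names are invented, and the order of tuples inside a bag returned by Pig is not guaranteed to match the order shown here.

```python
emp_sales = [(1, "Robin", 22, 25000, "sales"), (2, "BOB", 23, 30000, "sales"),
             (3, "Maya", 23, 25000, "sales"), (4, "Sara", 25, 40000, "sales"),
             (5, "David", 23, 45000, "sales"), (6, "Maggy", 22, 35000, "sales")]
emp_bonus = [(1, "Robin", 22, 25000, "sales"), (2, "Jaya", 23, 20000, "admin"),
             (3, "Maya", 23, 25000, "sales"), (4, "Alia", 25, 50000, "admin"),
             (5, "David", 23, 45000, "sales"), (6, "Omar", 30, 30000, "admin")]

def cogroup_by_sno(left, right):
    """For each sno, collect the matching tuples from both relations."""
    keys = sorted({t[0] for t in left} | {t[0] for t in right})
    return [(k, [t for t in left if t[0] == k], [t for t in right if t[0] == k])
            for k in keys]

def diff(bag1, bag2):
    """Tuples that appear in exactly one of the two bags."""
    return ([t for t in bag1 if t not in bag2] +
            [t for t in bag2 if t not in bag1])

diff_data = {sno: diff(sales, bonus)
             for sno, sales, bonus in cogroup_by_sno(emp_sales, emp_bonus)}
for sno, bag in diff_data.items():
    print(sno, bag)
```

Groups 1, 3, and 5 hold identical tuples in both files, so their result is an empty bag. Groups 2, 4, and 6 differ, so each result contains both tuples.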

SUBTRACT

cogroup_data = COGROUP emp_sales by sno, emp_bonus by sno;

Dump cogroup_data;

sub_data = FOREACH cogroup_data GENERATE SUBTRACT(emp_sales, emp_bonus);

Dump sub_data;

sub_data = FOREACH cogroup_data GENERATE SUBTRACT(emp_bonus, emp_sales);

Dump sub_data;
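
The two SUBTRACT statements differ only in argument order, and that order changes the result. The Python sketch below models this on two of the sno groups from the COGROUP above. The subtract helper is a hypothetical stand-in for the Pig function, not its implementation.

```python
def subtract(bag1, bag2):
    """Tuples of the first bag that are not in the second bag."""
    return [t for t in bag1 if t not in bag2]

# (emp_sales bag, emp_bonus bag) for sno 1 and sno 2
group_1 = ([(1, "Robin", 22, 25000, "sales")], [(1, "Robin", 22, 25000, "sales")])
group_2 = ([(2, "BOB", 23, 30000, "sales")], [(2, "Jaya", 23, 20000, "admin")])

# SUBTRACT(emp_sales, emp_bonus)
print(subtract(group_1[0], group_1[1]))  # []
print(subtract(group_2[0], group_2[1]))  # [(2, 'BOB', 23, 30000, 'sales')]

# SUBTRACT(emp_bonus, emp_sales): same groups, arguments swapped
print(subtract(group_2[1], group_2[0]))  # [(2, 'Jaya', 23, 20000, 'admin')]
```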

IsEmpty

cogroup_data = COGROUP emp_sales by sno, emp_bonus by age;

Dump cogroup_data;

isempty_data = filter cogroup_data by IsEmpty(emp_sales);

Dump isempty_data;
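
In this example emp_sales is grouped on sno and emp_bonus on age, so the two relations share no key values. A rough Python model of the COGROUP and the IsEmpty filter shows which groups survive. Only the sno, name, and age columns are kept to shorten the sketch, and is_empty is an illustrative helper, not the Pig function itself.

```python
from collections import defaultdict

# (sno, name, age) columns only; salary and dept are left out for brevity
emp_sales = [(1, "Robin", 22), (2, "BOB", 23), (3, "Maya", 23),
             (4, "Sara", 25), (5, "David", 23), (6, "Maggy", 22)]
emp_bonus = [(1, "Robin", 22), (2, "Jaya", 23), (3, "Maya", 23),
             (4, "Alia", 25), (5, "David", 23), (6, "Omar", 30)]

# Each group holds a pair of bags: (emp_sales bag, emp_bonus bag)
groups = defaultdict(lambda: ([], []))
for t in emp_sales:
    groups[t[0]][0].append(t)  # emp_sales is keyed on sno
for t in emp_bonus:
    groups[t[2]][1].append(t)  # emp_bonus is keyed on age

def is_empty(bag):
    """True when the bag holds no tuples."""
    return len(bag) == 0

# FILTER ... BY IsEmpty(emp_sales): keep groups with no emp_sales tuples
isempty_data = {k: v for k, v in groups.items() if is_empty(v[0])}
print(sorted(isempty_data))  # [22, 23, 25, 30]
```

The surviving groups are the four age keys, because no emp_sales row has an sno equal to any of those ages.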

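Pluck Tuple

join_data = join emp_sales by sno, emp_bonus by sno;

Describe join_data;

-- The prefix must match a relation alias used in the join (here emp_sales).
DEFINE pluck PluckTuple('emp_sales::');

data = foreach join_data generate FLATTEN(pluck(*));

Describe data;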
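
After a join, Pig qualifies each column name with the alias of the relation it came from, such as emp_sales::sno and emp_bonus::sno. The Python sketch below models how a prefix picks out one side of a joined row. The pluck function here is a hypothetical stand-in for the behaviour of PluckTuple, and the row shown is the joined record for sno 2.

```python
# Column names of the joined relation, qualified by the source alias
joined_fields = ["emp_sales::sno", "emp_sales::name", "emp_sales::age",
                 "emp_sales::salary", "emp_sales::dept",
                 "emp_bonus::sno", "emp_bonus::name", "emp_bonus::age",
                 "emp_bonus::salary", "emp_bonus::dept"]

# Joined row for sno 2 (emp_sales columns first, then emp_bonus columns)
joined_row = (2, "BOB", 23, 30000, "sales", 2, "Jaya", 23, 20000, "admin")

def pluck(row, fields, prefix):
    """Keep only the values whose column name starts with the prefix."""
    return tuple(v for f, v in zip(fields, row) if f.startswith(prefix))

sales_side = pluck(joined_row, joined_fields, "emp_sales::")
bonus_side = pluck(joined_row, joined_fields, "emp_bonus::")
print(sales_side)  # (2, 'BOB', 23, 30000, 'sales')
print(bonus_side)  # (2, 'Jaya', 23, 20000, 'admin')
```

A prefix that matches no column name selects nothing, so the prefix has to correspond to an alias that actually appears in the joined schema.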
