Apache Pig Join example

posted on Nov 20th, 2016

Apache Pig

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMSs. Pig Latin can be extended using User Defined Functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

Pre Requirements

1) A machine with Ubuntu 14.04 LTS operating system

2) Apache Hadoop 2.6.4 pre installed (How to install Hadoop on Ubuntu 14.04)

3) Apache Pig pre installed (How to install Pig on Ubuntu 14.04)

Pig Join Example

The join operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these keys match, the two particular tuples are matched, else the records are dropped. Joins can be of the following types:

1) Inner-join

2) Self-join

3) Outer-join : left join, right join, and full join

Step 1 - Change the directory to /usr/local/pig/bin

$ cd /usr/local/pig/bin

Step 2 - Enter into grunt shell in MapReduce mode.

$ ./pig -x mapreduce

Step 3 - Create a customers.txt file.

customers.txt

Step 4 - Add these following lines to customers.txt file.

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

Step 5 - Create a orders.txt file.

orders.txt

Step 6 - Add these following lines to orders.txt file.

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

Step 7 - Copy customers.txt and orders.txt from local file system to HDFS. In my case, the customers.txt and orders.txt file are stored in /home/hduser/Desktop/PIG/ directory.

$ hdfs dfs -copyFromLocal /home/hduser/Desktop/PIG/customers.txt /user/hduser/pig/
$ hdfs dfs -copyFromLocal /home/hduser/Desktop/PIG/orders.txt /user/hduser/pig/

Step 8 - Load customers data.

customers = LOAD 'hdfs://localhost:9000/user/hduser/pig/customers.txt' USING
PigStorage(',')as (id:int, name:chararray, age:int, address:chararray,
salary:int);

Step 9 - Load orders data.

orders = LOAD 'hdfs://localhost:9000/user/hduser/pig/orders.txt' USING
PigStorage(',')as (oid:int, date:chararray, customer_id:int, amount:int);

1) Inner Join

Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when there is a match in both tables.

Step 10 - Join customers and orders data by id.

customer_orders = JOIN customers BY id, orders BY customer_id;

Dump customer_orders;

2) Self - join

Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least one relation.

customers1 = LOAD 'hdfs://localhost:9000/user/hduser/pig/customers.txt' USING
PigStorage(',')as (id:int, name:chararray, age:int, address:chararray,
salary:int);

customers2 = LOAD 'hdfs://localhost:9000/user/hduser/pig/customers.txt' USING
PigStorage(',')as (id:int, name:chararray, age:int, address:chararray,
salary:int);

customers3 = JOIN customers1 BY id, customers2 BY id;

Dump customers3;

3) Outer Join

Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways -

a) Left outer join

b) Right outer join

c) Full outer join

a) Left outer join

The left outer Join operation returns all rows from the left table, even if there are no matches in the right relation.

outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

Dump outer_left;

b) Right outer join

The right outer join operation returns all rows from the right table, even if there are no matches in the left table.

outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Dump outer_right;

c) Full outer join

The full outer join operation returns rows when there is a match in one of the relations.

outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

Dump outer_full;

Please share this blog post and follow me for latest updates on

facebook             google+             twitter             feedburner

Previous Post                                                                                          Next Post

Labels : Pig Installation   Pig Execution Mechanism   Pig GRUNT Shell Usage   Pig Load and Store Operations   Pig Diagnostic Operators   Pig Group Example   Pig Cross Example   Pig Union Example   Pig Split Example   Pig Filter Example   Pig Distinct Example   Pig Foreach Example   Pig OrderBy Example   Limit Example   Pig Eval Functions Example   Pig BagToString Example   Pig Concat Example   Pig Tokenize Example   Pig UDF's Java Example   Pig SCRIPT