Apache Pig load and store operations

posted on Nov 20th, 2016

Apache Pig

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMSs. Pig Latin can be extended using User Defined Functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

Prerequisites

1) A machine with the Ubuntu 14.04 LTS operating system

2) Apache Hadoop 2.6.4 pre-installed (How to install Hadoop on Ubuntu 14.04)

3) Apache Pig pre-installed (How to install Pig on Ubuntu 14.04)

Pig Load and Store Operations

In general, Apache Pig works on top of Hadoop. It is an analytical tool for analyzing large datasets that reside in the Hadoop Distributed File System (HDFS). To analyze data with Apache Pig, we first load the data using the LOAD operator; the loaded data can then be written back to the file system using the STORE operator.
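Before walking through the steps, the general shape of the two operators looks like this (the relation name, paths, and schema here are placeholders, not part of this example):

```pig
-- LOAD reads data from the file system into a relation;
-- the USING clause and the AS schema are both optional
relation = LOAD 'hdfs_path' USING function AS (schema);

-- STORE writes a relation back out to the file system
STORE relation INTO 'hdfs_path' USING function;
```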

Step 1 - Make a pig directory in HDFS. Make sure the Hadoop daemons are running.

$ hdfs dfs -mkdir /user/hduser/pig

Step 2 - Create a student_data.txt file.


Step 3 - Add the following lines to student_data.txt file. Save and close.


Step 4 - Copy student_data.txt file from local file system to HDFS. In my case, the student_data.txt file is stored in /home/hduser/Desktop/PIG/ directory.

$ hdfs dfs -copyFromLocal /home/hduser/Desktop/PIG/student_data.txt /user/hduser/pig/

Step 5 - Verify the copy by using cat command.

$ hdfs dfs -cat hdfs://localhost:9000/user/hduser/pig/student_data.txt

Step 6 - Change the directory to /usr/local/pig/bin

$ cd /usr/local/pig/bin

Step 7 - Enter the Grunt shell in MapReduce mode.

$ ./pig -x mapreduce

Step 8 - Load data.

student = LOAD 'hdfs://localhost:9000/user/hduser/pig/student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Step 9 - Store data.

STORE student INTO 'hdfs://localhost:9000/pig_Output1/' USING PigStorage(',');

Step 10 - Verify.

$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output1/part-m-00000'

The load/store functions in Apache Pig determine how data comes into and goes out of Pig. These functions are used with the LOAD and STORE operators. Given below is the list of load and store functions available in Pig.

1) PigStorage - Loads and stores structured text files.

2) TextLoader - Loads unstructured data into Pig.

3) BinStorage - Loads and stores data in a machine-readable format.

4) Handling Compression - In Pig Latin, we can load and store compressed data.


1) PigStorage

The PigStorage function loads and stores data as structured text files. It takes as a parameter the delimiter that separates the fields of each tuple; by default, the delimiter is '\t' (tab).
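For instance, a tab-separated file can be loaded without passing any argument, since tab is the default delimiter (the file name and schema below are illustrative, not from this tutorial's dataset):

```pig
-- No delimiter argument: PigStorage() defaults to '\t'
scores = LOAD 'hdfs://localhost:9000/user/hduser/pig/scores.tsv' USING PigStorage() AS (name:chararray, score:int);
```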

student = LOAD 'hdfs://localhost:9000/user/hduser/pig/student_details.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

STORE student INTO 'hdfs://localhost:9000/Pig-output/' USING PigStorage(',');

$ hdfs dfs -cat 'hdfs://localhost:9000/Pig-output/part-m-00000'

2) TextLoader

The Pig Latin function TextLoader() is a load function used to load unstructured data in UTF-8 format. Each input line becomes a tuple with a single chararray field.

details = LOAD 'hdfs://localhost:9000/user/hduser/pig/student_details.txt' USING TextLoader();

DUMP details;

3) BinStorage

The BinStorage() function loads and stores data in a machine-readable binary format. BinStorage() in Pig is generally used to store temporary data generated between MapReduce jobs. It supports multiple locations as input.

student_details = LOAD 'hdfs://localhost:9000/user/hduser/pig/student_details.txt' USING PigStorage(',') AS (id:int, firstname:chararray, age:int, city:chararray);

STORE student_details INTO 'hdfs://localhost:9000/pig_Output/mydata' USING BinStorage();

result = LOAD 'hdfs://localhost:9000/pig_Output/mydata/part-m-00000' USING BinStorage();

DUMP result;

4) Handling Compression

Assume we have a compressed file named employee.txt.tar.gz in the HDFS directory /user/hduser/pig/. We can then load the compressed file into Pig as shown below.




$ hdfs dfs -copyFromLocal /home/hduser/Desktop/PIG/employee.txt.tar.gz /user/hduser/pig/

data = LOAD 'hdfs://localhost:9000/user/hduser/pig/employee.txt.tar.gz' USING PigStorage(',');

STORE data INTO 'hdfs://localhost:9000/pig_Output/employee.txt.tar.gz' USING PigStorage(',');

DUMP data;
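As a minimal sketch of Pig's built-in compression handling: PigStorage and TextLoader recognize the .gz and .bz2 file extensions and decompress transparently on load, and storing into an output path with one of those extensions compresses the output. The file name employee.txt.gz below is an assumption for illustration; note that Pig handles only the compression layer and does not unpack tar archives.

```pig
-- PigStorage sees the .gz extension and decompresses on the fly
zipped = LOAD 'hdfs://localhost:9000/user/hduser/pig/employee.txt.gz' USING PigStorage(',');

-- An output path ending in .bz2 makes Pig write bzip2-compressed part files
STORE zipped INTO 'hdfs://localhost:9000/pig_Output/employee.bz2' USING PigStorage(',');
```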
