Apache Pig load and store operations
1) A machine with the Ubuntu 14.04 LTS operating system
2) Apache Hadoop 2.6.4 pre-installed (How to install Hadoop on Ubuntu 14.04)
3) Apache Pig pre-installed (How to install Pig on Ubuntu 14.04)
Pig Load and Store Operations
In general, Apache Pig works on top of Hadoop. It is an analytical tool for analyzing large datasets stored in the Hadoop Distributed File System (HDFS). To analyze data with Apache Pig, we first load the data into Pig using the load operator, and we can write the loaded (or processed) data back to the file system using the store operator.
Step 1 - Make a pig directory in HDFS. Make sure hadoop daemons are running.
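This step can be done as follows; the HDFS path /user/hduser/pig is an assumed example location, chosen to match the paths used later in this post:

```shell
# Confirm the Hadoop daemons are up, then create a pig directory in HDFS.
jps                                   # should list NameNode, DataNode, ResourceManager, NodeManager
hdfs dfs -mkdir -p /user/hduser/pig
```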
Step 2 - Create a student_data.txt file.
Step 3 - Add the following lines to student_data.txt file. Save and close.
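A sketch of Steps 2 and 3; the rows below are illustrative placeholders in the usual tab-separated student format, not necessarily the post's original data:

```shell
# Create student_data.txt with a few sample tab-separated rows.
# These rows are illustrative only; substitute your own data.
printf '001\tRajiv\tReddy\t9848022337\tHyderabad\n'        >  student_data.txt
printf '002\tSiddarth\tBattacharya\t9848022338\tKolkata\n' >> student_data.txt
printf '003\tRajesh\tKhanna\t9848022339\tDelhi\n'          >> student_data.txt
wc -l student_data.txt
```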
Step 4 - Copy student_data.txt file from local file system to HDFS. In my case, the student_data.txt file is stored in /home/hduser/Desktop/PIG/ directory.
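With the local path above, the copy looks like this:

```shell
# Copy the file from the local file system into HDFS.
hdfs dfs -put /home/hduser/Desktop/PIG/student_data.txt /user/hduser/pig/
```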
Step 5 - Verify the copy by using cat command.
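For example:

```shell
# Print the file from HDFS to confirm the copy succeeded.
hdfs dfs -cat /user/hduser/pig/student_data.txt
```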
Step 6 - Change the directory to /usr/local/pig/bin
Step 7 - Enter into grunt shell in MapReduce mode.
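Steps 6 and 7 together, assuming Pig is installed under /usr/local/pig as stated above:

```shell
cd /usr/local/pig/bin
# Start the Grunt shell in MapReduce mode (requires running Hadoop daemons).
./pig -x mapreduce
```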
Step 8 - Load data.
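A sketch of the load, run inside the Grunt shell; the relation name and schema are assumptions for illustration:

```pig
-- Load the HDFS file with PigStorage, using tab as the field delimiter.
student = LOAD '/user/hduser/pig/student_data.txt'
          USING PigStorage('\t')
          AS (id:int, firstname:chararray, lastname:chararray,
              phone:chararray, city:chararray);
```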
Step 9 - Store data.
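A sketch of the store; the output directory name pig_output is an assumption, and the directory must not already exist or the job will fail:

```pig
-- Write the relation back to HDFS as tab-delimited text.
STORE student INTO '/user/hduser/pig/pig_output' USING PigStorage('\t');
```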
Step 10 - Verify.
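The stored output can be inspected from another terminal; Pig writes the results as part files (the exact part-file name varies by job):

```shell
hdfs dfs -ls /user/hduser/pig/pig_output
hdfs dfs -cat /user/hduser/pig/pig_output/part-m-00000
```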
The load/store functions in Apache Pig determine how data goes into and comes out of Pig. These functions are used with the load and store operators. Given below is the list of load and store functions available in Pig.
1) PigStorage - To load and store structured files.
2) TextLoader - To load unstructured data into Pig.
3) BinStorage - To load and store data into Pig using machine readable format.
4) Handling Compression - In Pig Latin, we can load and store compressed data.
The PigStorage function loads and stores data as structured text files. It takes as a parameter the delimiter that separates the fields of each tuple; by default, the delimiter is '\t'.
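For example, a comma delimiter can be passed for CSV-style data (the file name and schema here are assumptions):

```pig
-- PigStorage with an explicit delimiter; with no argument, '\t' is the default.
csv_data = LOAD '/user/hduser/pig/data.csv'
           USING PigStorage(',')
           AS (id:int, name:chararray);
```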
The Pig Latin function TextLoader() is a Load function which is used to load unstructured data in UTF-8 format.
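A minimal sketch, assuming a log file named log.txt in HDFS:

```pig
-- TextLoader reads each UTF-8 line as a tuple with a single chararray field.
logs = LOAD '/user/hduser/pig/log.txt' USING TextLoader() AS (line:chararray);
```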
The BinStorage() function is used to load and store data into Pig using a machine-readable format. BinStorage() in Pig is generally used to store temporary data generated between MapReduce jobs. It supports multiple locations as input.
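A round-trip sketch, assuming a relation named student and a temporary HDFS path:

```pig
-- Store intermediate data in Pig's machine-readable format,
-- then load it back in a later job.
STORE student INTO '/user/hduser/pig/temp' USING BinStorage();
temp = LOAD '/user/hduser/pig/temp' USING BinStorage();
```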
4) Handling Compression
Assume we have a file named employee.txt.tar.gz in the HDFS directory /user/hduser/pig/. Then, we can load the compressed file into pig as shown below.
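A sketch of the load; Pig's text load functions such as PigStorage decompress gzip and bzip2 input transparently based on the file extension, so no extra options are needed (the schema is omitted here):

```pig
emp = LOAD '/user/hduser/pig/employee.txt.tar.gz' USING PigStorage('\t');
```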