Apache Pig User Defined Functions (UDFs) Java Example

posted on Nov 20th, 2016

Apache Pig

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMSs. Pig Latin can be extended using User Defined Functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

Prerequisites

1) A machine with Ubuntu 14.04 LTS operating system

2) Apache Hadoop 2.6.4 preinstalled (How to install Hadoop on Ubuntu 14.04)

3) Apache Pig preinstalled (How to install Pig on Ubuntu 14.04)

Pig User Defined Functions (UDFs) Java Example

Apache Pig provides extensive support for User Defined Functions (UDFs). With UDFs, we can define our own functions and call them from Pig Latin. UDF support is provided in six programming languages: Java, Jython, Python, JavaScript, Ruby and Groovy.

Using Java, you can write UDFs for every part of the processing, including data load/store, column transformation, and aggregation. Since Apache Pig itself is written in Java, UDFs written in Java generally run more efficiently than UDFs written in the other languages.

When writing UDFs in Java, we can create and use the following three types of functions:

1) Filter Functions

The filter functions are used as conditions in filter statements. These functions accept a Pig value as input and return a Boolean value.

2) Eval Functions

The Eval functions are used in FOREACH-GENERATE statements. These functions accept a Pig value as input and return a Pig result.

3) Algebraic Functions

The Algebraic functions act on inner bags in a FOREACH-GENERATE statement. These functions are used to perform full MapReduce operations on an inner bag.
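To make the difference between the three kinds concrete, here is a minimal plain-Java sketch. It deliberately does not use Pig's own classes (in Pig, filter functions extend FilterFunc, eval functions extend EvalFunc, and algebraic functions implement the Algebraic interface), so that it can be run without the Pig jars. The class name UdfKindsSketch, the method names and the 200-page threshold are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-ins for the three UDF kinds; none of these are Pig classes.
final class UdfKindsSketch {

    // Filter-style function: takes a value and returns a Boolean used as a condition.
    // The threshold of 200 pages is a hypothetical example.
    static Boolean isHeavyDay(Integer pages) {
        return pages != null && pages >= 200;
    }

    // Eval-style function: takes a value and returns a transformed value.
    static String initialOf(String name) {
        return (name == null || name.isEmpty()) ? null : name.substring(0, 1);
    }

    // Algebraic-style function (COUNT-like), split into three phases so that
    // partial results can be merged before the final answer is produced.

    // Initial phase: one record becomes a partial count.
    static long initial(Object record) {
        return record == null ? 0L : 1L;
    }

    // Intermediate phase: merge several partial counts into one partial count.
    static long intermediate(List<Long> partials) {
        long sum = 0L;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final phase: merge the remaining partial counts into the result.
    static long finalCount(List<Long> partials) {
        return intermediate(partials);
    }

    public static void main(String[] args) {
        System.out.println(isHeavyDay(250));
        System.out.println(initialOf("John"));

        // Two groups of records are counted separately, then combined.
        long left = intermediate(Arrays.asList(initial("a"), initial("b")));
        long right = intermediate(Arrays.asList(initial("c")));
        System.out.println(finalCount(Arrays.asList(left, right)));
    }
}
```

The point of the three-phase split is that the intermediate step can run on partial data (as a combiner would), which is what lets an algebraic function scale over a large inner bag.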

Add these jars to your Java project

/usr/local/pig/pig-0.15.0-core-h1.jar
/usr/local/pig/pig-0.15.0-core-h2.jar

Sample_Eval.java

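package pig;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF that converts its first argument (a chararray) to upper case.
public class Sample_Eval extends EvalFunc<String> {
	@Override
	public String exec(Tuple input) throws IOException {
		// Return null for a missing tuple or a null field instead of failing the job.
		if (input == null || input.size() == 0 || input.get(0) == null)
			return null;
		String str = (String) input.get(0);
		return str.toUpperCase();
	}
}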

Step 1 - Change the directory to /usr/local/pig/bin

$ cd /usr/local/pig/bin

Step 2 - Start the Grunt shell in MapReduce mode.

$ ./pig -x mapreduce

Step 3 - Create a jar file of your Java project, named sample_udf.jar to match the later steps. How you build the jar is left to you.
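Most people build the jar with their IDE's export option or with the JDK's jar tool. If you would rather stay in Java, the following is a minimal sketch that packs compiled .class files into a jar using the standard java.util.jar classes. The class name UdfJarPackager and the idea of passing the classes directory and the jar path as arguments are assumptions for this example, not something Pig requires.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class UdfJarPackager {

    // Packs every .class file under classesDir into jarFile, keeping the package
    // path (for example pig/Sample_Eval.class) as the entry name.
    static void packageClasses(Path classesDir, Path jarFile) throws IOException {
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");

        // Collect the class files first, before the jar itself is written.
        List<Path> classFiles;
        try (Stream<Path> walk = Files.walk(classesDir)) {
            classFiles = walk.filter(Files::isRegularFile)
                             .filter(p -> p.toString().endsWith(".class"))
                             .sorted()
                             .collect(Collectors.toList());
        }

        try (OutputStream out = Files.newOutputStream(jarFile);
             JarOutputStream jar = new JarOutputStream(out, manifest)) {
            for (Path classFile : classFiles) {
                // Jar entry names use forward slashes on every platform.
                String entryName = classesDir.relativize(classFile).toString().replace('\\', '/');
                jar.putNextEntry(new JarEntry(entryName));
                Files.copy(classFile, jar);
                jar.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: UdfJarPackager <classes-dir> <output-jar>");
            return;
        }
        packageClasses(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

For the UDF in this post, the classes directory would be wherever your build writes pig/Sample_Eval.class, and the output would be sample_udf.jar.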

Step 4 - Copy the jar file into HDFS. This is a shell command, so run it from a regular terminal rather than at the Grunt prompt.

$ hdfs dfs -copyFromLocal /home/hduser/Desktop/PIG/sample_udf.jar /user/hduser/pig/

Step 5 - Register. The REGISTER operator registers a JAR file that contains the UDF. Registering the jar tells Pig where to find the UDF.

REGISTER 'hdfs://localhost:9000/user/hduser/pig/sample_udf.jar';

Step 6 - Create an employee_new.txt file.

employee_new.txt

Step 7 - Add the following lines to the employee_new.txt file. Save and close it, then copy it into HDFS (the next step expects it at /user/hduser/pig/employee_new.txt).

1,John,2007-01-24,250
2,Ram,2007-05-27,220
3,Jack,2007-05-06,170
3,Jack,2007-04-06,100
4,Jill,2007-04-06,220
5,Zara,2007-06-06,300
5,Zara,2007-02-06,35

Step 8 - Load employee data.

employee_data = LOAD 'hdfs://localhost:9000/user/hduser/pig/employee_new.txt' USING PigStorage(',') as (id:int, name:chararray,workdate:chararray,daily_typing_pages:int);
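To see what this LOAD statement does with each line, here is a rough plain-Java model, not Pig's actual implementation: the line is split on the comma and each field is converted to the type declared in the AS clause. The sketch assumes that an empty field, or a field that cannot be converted to int, ends up as null, which is how Pig typically behaves. The class EmployeeLineParser and its members are hypothetical names.

```java
// Plain-Java model of PigStorage(',') plus the AS (...) schema; not a Pig class.
final class EmployeeLineParser {

    // Stand-in for one row of employee_data; field names follow the AS clause.
    static final class Employee {
        final Integer id;
        final String name;
        final String workdate;
        final Integer dailyTypingPages;

        Employee(Integer id, String name, String workdate, Integer dailyTypingPages) {
            this.id = id;
            this.name = name;
            this.workdate = workdate;
            this.dailyTypingPages = dailyTypingPages;
        }
    }

    // Missing or empty columns come back as null instead of throwing.
    static String field(String[] fields, int index) {
        if (index >= fields.length || fields[index].isEmpty()) {
            return null;
        }
        return fields[index];
    }

    // Assumption: a value that is not a valid int becomes null rather than an error.
    static Integer toIntOrNull(String text) {
        if (text == null) {
            return null;
        }
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    static Employee parse(String line) {
        // The -1 limit keeps trailing empty fields, so "1,John,," still has four columns.
        String[] f = line.split(",", -1);
        return new Employee(
                toIntOrNull(field(f, 0)),
                field(f, 1),
                field(f, 2),
                toIntOrNull(field(f, 3)));
    }

    public static void main(String[] args) {
        Employee e = parse("1,John,2007-01-24,250");
        System.out.println(e.id + " " + e.name + " " + e.workdate + " " + e.dailyTypingPages);
    }
}
```

This is also why the UDF should be ready for null input: a row with a blank name column reaches the function as a null field.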

Step 9 - Convert the names of the employees to upper case using the UDF pig.Sample_Eval.

Upper_case = FOREACH employee_data GENERATE pig.Sample_Eval(name);

Dump Upper_case;
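Conceptually, FOREACH ... GENERATE calls the UDF once per row and emits one output value per input row. The plain-Java sketch below models that on the sample rows from employee_new.txt. It does not use Pig: ForeachGenerateSketch, upper and generate are hypothetical names, and upper uses Locale.ROOT (unlike Sample_Eval) so the result does not depend on the machine's locale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Plain-Java model of "FOREACH employee_data GENERATE udf(name)"; not a Pig class.
final class ForeachGenerateSketch {

    // Stand-in for the UDF: a null name stays null, anything else is upper-cased.
    static String upper(String name) {
        return name == null ? null : name.toUpperCase(Locale.ROOT);
    }

    // Applies udf to the name column (index 1) of every comma-separated row
    // and returns one result per input row, in the same order.
    static List<String> generate(List<String> rows, Function<String, String> udf) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            String[] fields = row.split(",", -1);
            out.add(udf.apply(fields.length > 1 ? fields[1] : null));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "1,John,2007-01-24,250",
                "2,Ram,2007-05-27,220",
                "3,Jack,2007-05-06,170",
                "3,Jack,2007-04-06,100",
                "4,Jill,2007-04-06,220",
                "5,Zara,2007-06-06,300",
                "5,Zara,2007-02-06,35");
        // Print each result in parentheses, the way Dump shows a one-field tuple.
        for (String name : generate(rows, ForeachGenerateSketch::upper)) {
            System.out.println("(" + name + ")");
        }
    }
}
```

Note that duplicate names are not merged: seven input rows give seven output values.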

Define

The Define operator is used to assign an alias to a UDF or streaming command.

DEFINE sample_eval pig.Sample_Eval;

employee_data = LOAD 'hdfs://localhost:9000/user/hduser/pig/employee_new.txt' USING PigStorage(',') as (id:int, name:chararray,workdate:chararray,daily_typing_pages:int);

Upper_case = FOREACH employee_data GENERATE sample_eval(name);

Dump Upper_case;
