How to use Apache Pig?
Pig Latin
Pig Commands
LOAD ---> Specifies the file(s) to be loaded into a Pig relation
PigStorage ---> Load/store function that specifies the field delimiter used in the file (tab by default)
dump cust; ---> Prints the contents of the relation cust to the console
filter ---> Like the WHERE clause in SQL: keeps only the rows that satisfy a given condition
Foreach ... generate ...; ---> Selects (projects) columns from each row
Stream ... Through `cut -f 1,2,4`; ---> Passes the data through an external Unix command
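As a minimal sketch of how LOAD, PigStorage and dump fit together in the grunt shell, assuming the comma-delimited customer file used in Example 1 at '/pig/custs'. The describe command is not covered in the list above; it is a standard Pig command that prints the schema of a relation.

```pig
-- Load a comma-delimited file; PigStorage(',') names the field delimiter
cust = load '/pig/custs' using PigStorage(',')
       AS (custid:chararray, firstname:chararray, lastname:chararray,
           age:long, profession:chararray);

-- Print the schema of the relation
describe cust;

-- Print the rows of cust to the console
dump cust;
```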
Example 1 :
Sample input
custs
4000017,Neal,Lawrence,72,Computer support specialist
4000018,Jean,Griffin,45,Childcare worker
4000019,Kristine,Dougherty,63,Financial analyst
4000020,Crystal,Powers,67,Engineering technician
4000021,Alex,May,39,Environmental scientist
4000022,Eric,Steele,66,Doctor
cust = load '/pig/custs' using PigStorage(',') AS (custid:chararray, firstname:chararray, lastname:chararray, age:long, profession:chararray);
filter_1 = filter cust by (age >= 40);
dump filter_1;
Sample output
(4009986,Jesse,Smith,57,Designer)
(4009991,Paul,Mullins,47,Reporter)
(4009993,Becky,Wolfe,67,Musician)
(4009994,Clyde,Welch,40,Photographer)
(4009996,Tonya,McIntosh,56,Engineering technician)
(4009998,Tracey,Bullock,60,Compute
NameAge = foreach cust generate firstname, age;
dump NameAge;
Sample output
(Paul,47)
(Erin,33)
(Becky,67)
(Clyde,40)
(Rebecca,37)
(Tonya,56)
(Ron,36)
(Tracey,60)
(Ray,64)
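The filter and foreach steps can also be chained. A small sketch, assuming the cust relation loaded earlier in this example; the relation names older and older_names are made up for illustration.

```pig
-- Keep customers aged 40 or more
older = filter cust by (age >= 40);

-- From those rows, keep only the first name and age columns
older_names = foreach older generate firstname, age;

dump older_names;
```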
Example 2 :
Analyse the given datasets and print the names of the students who have cleared the exam.
Sample input
-> results
1 fail
2 fail
3 pass
4 pass
5 fail
6 pass
7 fail
8 pass
-> student
vineet 1
hisham 2
raj 3
ajeet 4
sujit 5
ramesh 6
priya 7
PIG SCRIPT FILE ====> Student_results.pig
S = load '/pig/student' as (Name:chararray,Rollnumber:int);
R = load '/pig/results' as (Rollnumber:int,Result:chararray);
R1 = Join S by Rollnumber , R by Rollnumber;
R2 = Foreach R1 Generate Name,Result;
R3 = Filter R2 by (Result!='fail');
Store R3 into '/pig_outputs/Final_students_results' using PigStorage ('-');
Dump R3;
======> Run the above pig script from unix prompt as below
pig Student_results.pig
---------------------------or------------------------
PIG SCRIPT FILE ====> Student_results_1.pig
S = load '/pig/student' as (Name:chararray,Rollnumber:int);
R = load '/pig/results' as (Rollnumber:int,Result:chararray);
R1 = Join S by Rollnumber , R by Rollnumber;
R2 = Filter R1 by (Result!='fail');
R3 = Stream R2 Through `cut -f 1,2,4`;
Store R3 into '/pig_outputs/Final_students_results_2' using PigStorage ('-');
Dump R3;
======> Run the above pig script from unix prompt as below
pig Student_results_1.pig
Sample output
raj-3-pass
ajeet-4-pass
ramesh-6-pass
priyanka-8-pass
suresh-9-pass
ritesh-10-pass
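After the join, R1 carries the fields of both inputs in order: Name and Rollnumber from S, then Rollnumber and Result from R. That is why `cut -f 1,2,4` in Student_results_1.pig keeps the name, roll number and result. As a sketch of the same projection done in Pig itself, without an external command, assuming the same input files; the relation names passed and picked and the output path are hypothetical.

```pig
S = load '/pig/student' as (Name:chararray, Rollnumber:int);
R = load '/pig/results' as (Rollnumber:int, Result:chararray);
R1 = Join S by Rollnumber, R by Rollnumber;

-- Keep only the students who did not fail
passed = Filter R1 by (R::Result != 'fail');

-- Rollnumber exists in both inputs, so fields are qualified with S:: or R::
picked = Foreach passed Generate S::Name, S::Rollnumber, R::Result;

-- Hypothetical output path, chosen so it does not clash with the scripts above
Store picked into '/pig_outputs/Final_students_results_alt' using PigStorage('-');
```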
Official Reference :: Using CSVLoader to dump .csv data
Download piggybank.jar from the link below.
register /home/hadoop/pig-0.10.1/contrib/piggybank/java/piggybank.jar;
A = LOAD '/home/edureka/Desktop/f1.csv' USING org.apache.pig.piggybank.storage.CSVLoader();
Dump A;
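CSVLoader can be combined with an AS clause like any other load function. A sketch, assuming f1.csv holds the same five customer columns as Example 1; this column layout is an assumption, since the contents of f1.csv are not described here. The relation B is likewise only for illustration.

```pig
register /home/hadoop/pig-0.10.1/contrib/piggybank/java/piggybank.jar;

-- CSVLoader handles quoted fields that contain commas
A = LOAD '/home/edureka/Desktop/f1.csv'
    USING org.apache.pig.piggybank.storage.CSVLoader()
    AS (custid:chararray, firstname:chararray, lastname:chararray,
        age:long, profession:chararray);

-- Hypothetical follow-up: keep only two of the assumed columns
B = FOREACH A GENERATE firstname, profession;
DUMP B;
```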