Apache Pig scripts and examples:
I am listing a few Pig scripting examples, together with their results, to help build a better understanding of Apache Pig coding.
1. Total record count with an Apache Pig script:
In this post we read a plain text file and produce the total record count, both displaying it on the console and storing it on disk.
@ Source data file details:
Name: filesgrads-2.txt
Delimiter: tab
File Header: CDS_CODE ETHNIC GENDER GRADS UC_GRADS YEAR
Total records: 20,747
File data sample:

...

@ Running Pig grunt shell in local mode:
Execute the following at the Unix prompt:
$ pig -x local

@ Output criteria:
Finding the total record count in the file.
@ Pig script:
Please execute the following script at the grunt shell prompt and review the final output.
/* load the tab-delimited data file into a Pig relation */
data = LOAD '/Users/Neo/Downloads/filesgrads-2.txt' USING PigStorage('\t');
/* group all records of the file into a single group */
grp_stud_all = GROUP data ALL;
/* apply the COUNT function to the grouped records */
total_stud_count = FOREACH grp_stud_all GENERATE COUNT(data);
/* display the final output on the console */
DUMP total_stud_count;
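To make the semantics of GROUP ... ALL followed by COUNT concrete, here is a small Python sketch of the same logic. It is only an illustration and not how Pig executes the script; the helper names (load, group_all, count, count_star, total_record_count) and the sample rows are made up for this example. It also shows a detail worth knowing: Pig's COUNT skips tuples whose first field is null, whereas COUNT_STAR counts every tuple, and PigStorage does not skip a header line, so a header row that is physically present in the file is counted like any other record.

```python
def load(lines, delimiter="\t"):
    """Split each line on the delimiter; an empty field becomes None (Pig null)."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        records.append(tuple(f if f != "" else None for f in fields))
    return records


def group_all(records):
    """Like GROUP data ALL: one group, keyed 'all', holding every record."""
    return [("all", records)]


def count(bag):
    """Like Pig's COUNT: ignore tuples whose first field is null."""
    return sum(1 for t in bag if t and t[0] is not None)


def count_star(bag):
    """Like Pig's COUNT_STAR: count every tuple, nulls included."""
    return len(bag)


def total_record_count(lines):
    """Like FOREACH grp_stud_all GENERATE COUNT(data): one tuple per group."""
    return [(count(bag),) for _, bag in group_all(load(lines))]


# Hypothetical sample rows, not taken from filesgrads-2.txt.
sample = [
    "CDS_CODE\tETHNIC\tGENDER\n",  # header line, loaded as an ordinary record
    "100\t5\tM\n",
    "\t7\tF\n",                    # empty first field -> null -> skipped by count
]

print(total_record_count(sample))  # [(2,)]
print(count_star(load(sample)))    # 3
```

If the number reported by Pig differs from the expected record count by one, the header line or a record with an empty first column is a likely cause.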
@ Final output review:
Please note that with the dump statement, the output is displayed on the console of the grunt shell.

@ Storing the results:
To write the output to disk, we use the store command with an appropriate folder location. Please issue the following at the grunt prompt (note the terminating semicolon).
grunt> store total_stud_count into '/Users/xxx/tot_stud_recs.txt';
You can review the output below to check whether the command was successful.

@ Review results from the stored file:
Please note that the above command creates a folder named tot_stud_recs.txt; the actual output is stored inside it in a file called part-r-00000, not in a file named tot_stud_recs.txt.
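Because the result lives in part files inside the output folder, reading it back means reading every part file in that folder. The following Python sketch shows one way to do that; the function name read_pig_output is made up for this example, and it assumes the default tab-delimited output written by store.

```python
import glob
import os


def read_pig_output(output_dir, delimiter="\t"):
    """Collect the rows from every part-* file in a Pig output folder."""
    rows = []
    # part-r-* files come from reduce tasks, part-m-* from map-only jobs.
    # Hidden files such as .crc checksums do not match this pattern.
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                rows.append(tuple(line.rstrip("\n").split(delimiter)))
    return rows
```

For the folder used above, calling read_pig_output('/Users/xxx/tot_stud_recs.txt') should return a single one-field row holding the record count as a string.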

Thanks!
@ Reference/s:
- https://pig.apache.org
- http://pig.apache.org/docs/r0.14.0/