Apache Pig scripts and examples:
Here are a few Pig scripting examples, with their results, to help build a better understanding of Apache Pig coding.
1. Total record count with Apache Pig script:
In this post we read a plain text file, display the total record count on the console, and then store the output on disk.
@ Source data file details:
Name: filesgrads-2.txt
Delimiter: tab
File Header: CDS_CODE ETHNIC GENDER GRADS UC_GRADS YEAR
Total records: 20,747
File data sample:
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpg0y1Y_y5I646klkKdxwj1HL8QF_8XNKPKaIL9BT3Hkcu7pe9WQBeHbMV3nU3d46ePDgtNcI6-9mCHV9brreD6-e_QDjzE_vBxW4mAmq7pUd-QYOkNwqyupAxby-5v4o67sqKq9KivkwK/s1600/SS_filesgrads-2.txt.png)
...
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQcRXKYS4zJWixju-ECa67pqP0qxQoNa5mqedRvmVc8xO9zocvdL5bGkT_bITLblg-DYa3S2gKLfZZBotOC8BnG0V-W4cqrJDY36OkjUcKJ_A23O2eqPMPOm1cPTFQRg9LDUWv4uCw_6_G/s1600/SS_filesgrads-2.txt2.png)
@ Running Pig grunt shell in local mode:
Execute the following at the Unix prompt:
$ pig -x local
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJ_PmmjDT3sbWUU3vufJzps_6O1tU3APJvvzkjhBxKDqgzNveaIKPeEaZFKUpdIE9JkM2pnO3ByhNpgW9KFXJmjKauhXcj6AyfQM1OOFSY3D8Y4FnR4I2wt0fnowB5o_zGQ6rjDyErrajZ/s640/ss_pig_grunt_shell.png)
@ Output criteria:
Finding the total record count in the file.
@ Pig script:
Please execute the following script at the grunt shell prompt and review the final output.
/* load the tab-delimited data file into a Pig relation (alias) */
data = load '/Users/Neo/Downloads/filesgrads-2.txt' using PigStorage('\t');
/* group all records of the file into a single group */
grp_stud_all = GROUP data ALL;
/* apply the COUNT function to the grouped records */
total_stud_count = FOREACH grp_stud_all GENERATE COUNT(data);
/* display the final output on the console */
dump total_stud_count;
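The same statements can also be saved to a file and run without typing them at the grunt prompt. The following is a minimal sketch: the script file name count_records.pig and the temporary folder are arbitrary choices made for illustration, and the pig command is only invoked if it is found on the PATH.

```shell
# Write the Pig statements to a script file.
# (count_records.pig is a hypothetical name chosen for this sketch.)
script="$(mktemp -d)/count_records.pig"
cat > "$script" <<'EOF'
data = LOAD '/Users/Neo/Downloads/filesgrads-2.txt' USING PigStorage('\t');
grp_stud_all = GROUP data ALL;
total_stud_count = FOREACH grp_stud_all GENERATE COUNT(data);
DUMP total_stud_count;
EOF

# Run the script in local mode, but only if the pig launcher is installed.
if command -v pig >/dev/null 2>&1; then
  pig -x local "$script"
else
  echo "pig not found on PATH; script written to $script"
fi
```

The heredoc delimiter is quoted ('EOF') so that the shell leaves the '\t' inside the script untouched.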
@ Final output review:
Note that the dump statement displays the output on the console of the grunt shell.
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvWX3ZJFF_R2Tfzft1V8e9L1y8DhC9z3wOpnLeUM9T3ApT5bAylsD7E4gzdDnRs47wFc_DfoGe9YK5ELExc1ovxqfVgbjrmAZCfqnVjW0LF8qrtWNiYNTm-SS500WOuQ8gYOY71_Klp8zX/s640/ss_pig_op1.png)
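One way to sanity-check the number Pig reports is to count the lines of the file with standard Unix tools. The sketch below uses a tiny stand-in file whose data rows are made up for illustration; with the real file you would point the commands at filesgrads-2.txt instead. PigStorage does not treat the first line specially, so if the file contains its header as the first row, that row is presumably counted as a record too. For that reason the sketch prints both figures.

```shell
# Build a small tab-delimited stand-in file (the data rows are hypothetical).
sample="$(mktemp)"
printf 'CDS_CODE\tETHNIC\tGENDER\tGRADS\tUC_GRADS\tYEAR\n' > "$sample"
printf '1001\t5\tF\t12\t4\t0708\n' >> "$sample"
printf '1001\t5\tM\t9\t2\t0708\n' >> "$sample"

# Count every line, header included.
total_lines=$(wc -l < "$sample" | tr -d ' ')
# Count data rows only, skipping the first (header) line.
data_rows=$(tail -n +2 "$sample" | wc -l | tr -d ' ')

echo "total_lines=$total_lines data_rows=$data_rows"
```

For the stand-in file this prints total_lines=3 data_rows=2.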
@ Storing the results:
To write the output to a file on disk, we use the store command with an appropriate folder location. Issue the following at the grunt prompt.
grunt> store total_stud_count into '/Users/xxx/tot_stud_recs.txt';
The screenshot below shows whether the command was successful.
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEMUQ05R8JkyWwqR8vdU8RDCUrS4s8YrtqwYh0ykifu_UuL_DtlYSZdT2bDClzTAHGzWLLnj3O0mKwgRD7CtIZgAE6vJQyqB8kiSNQotbEHjJjrZhTAUzeFJMLbpN3wlyS_aHU3N9_EnF2/s640/ss_ex1_pig_op1a.png)
@ Review results from the stored file:
Note that the above command creates a folder named tot_stud_recs.txt. The actual output is stored inside that folder in a file called part-r-00000, not in a file named tot_stud_recs.txt.
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVZWNg-kWcus96wv40WwMDxBp8h5S8a7pP6K1GUC29UMwRHEdxHsnZeCRnwTesXTQ2DKfefXP9iigPZ1zZXLcO69Mwfc5DhsG7u99T8oRuFwhlwynXJCIAbZ7okyKuzIAnfc1KqrwMBpIh/s1600/ss_ex1_pig_op2.png)
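Because the result sits in a part file inside the folder, reading it back from the Unix prompt means reading the part file(s) rather than the folder name. The sketch below simulates the output folder in a temporary location with a made-up count, purely to show the pattern; with a real run you would point it at the folder given to the store command.

```shell
# Stand-in for the folder that the store command creates.
# (The location and the stored value 3 are hypothetical.)
out_dir="$(mktemp -d)/tot_stud_recs.txt"
mkdir -p "$out_dir"
printf '3\n' > "$out_dir/part-r-00000"

# Read every part file in the folder; the glob also matches
# part-m-* files, which map-only jobs produce.
result=$(cat "$out_dir"/part-*)
echo "stored count: $result"
```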
Thanks!
@ Reference/s:
- https://pig.apache.org
- http://pig.apache.org/docs/r0.14.0/