October 19, 2011

HBaseStorage and PIG

Cross-posted from my company blog.


We’ve been using PIG for some time now, both for analytics and for processing data for our site. PIG is a high-level language for building data analysis programs that run across a distributed Hadoop cluster. It has allowed us to scale up our data processing while decreasing the amount of time it takes to run jobs.

When it came time to update our runtime data storage for the site, it was natural for us to consider using HBase to achieve horizontal scalability. HBase is a distributed, versioned, column-oriented store based on Hadoop. One of the great advantages of using HBase is the ability to integrate it with our existing PIG data processing. In this post I will introduce you to the basics of working with HBase from your PIG scripts.

Getting Started


Before getting into the details of using HBaseStorage, there are a couple of environment variables you need to set so that it can work correctly.

export HBASE_HOME=/usr/lib/hbase
export PIG_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$PIG_CLASSPATH"

First, you need to let HBaseStorage know where to find the HBase configuration, hence the HBASE_HOME environment variable. Second, PIG_CLASSPATH needs to be extended to include the classpath for loading HBase. If you are using PIG 0.8.x, there is a slight variation:

export HADOOP_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$HADOOP_CLASSPATH"


Hello World


Let’s write a simple script to load some data from a file and write it out to an HBase table. To begin, use the HBase shell to create your table:

jhoover@jhoover2:~$ hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.90.3-cdh3u1, r, Mon Jul 18 08:23:50 PDT 2011

hbase(main):002:0> create 'sample_names', 'info'
0 row(s) in 0.5580 seconds

Next, we’ll put some simple data in a file ‘sample_data.csv’:

1, John, Smith
2, Jane, Doe
3, George, Washington
4, Ben, Franklin

Then we’ll write a simple script to extract this data and write it into fixed columns in HBase:

raw_data = LOAD 'sample_data.csv' USING PigStorage(',') AS (
    listing_id: chararray,
    fname: chararray,
    lname: chararray);

STORE raw_data INTO 'hbase://sample_names' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:fname info:lname');
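
One thing to watch for: PigStorage(',') does not strip the spaces that follow the commas in sample_data.csv, so the stored values will carry a leading space (you can see this in the scan output later in this post). If you want clean values, you could add a step like this before the STORE and store trimmed instead of raw_data; this is a sketch assuming the TRIM builtin, which ships with recent PIG versions:

trimmed = FOREACH raw_data GENERATE
    listing_id,
    TRIM(fname) AS fname,
    TRIM(lname) AS lname;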

Then run the pig script locally:

jhoover@jhoover2:~/hbase_sample$ pig -x local hbase_sample.pig

Success!

Job Stats (time in seconds):
JobId Alias Feature Outputs
job_local_0001 raw_data MAP_ONLY hbase://sample_names,

Input(s):
Successfully read records from: "file:///autohome/jhoover/hbase_sample/sample_data.csv"

Output(s):
Successfully stored records in: "hbase://sample_names"

Job DAG:
job_local_0001

You can then see the results of your script in the HBase shell:

hbase(main):001:0> scan 'sample_names'
ROW COLUMN+CELL
1 column=info:fname, timestamp=1356134399789, value= John
1 column=info:lname, timestamp=1356134399789, value= Smith
2 column=info:fname, timestamp=1356134399789, value= Jane
2 column=info:lname, timestamp=1356134399789, value= Doe
3 column=info:fname, timestamp=1356134399789, value= George
3 column=info:lname, timestamp=1356134399789, value= Washington
4 column=info:fname, timestamp=1356134399789, value= Ben
4 column=info:lname, timestamp=1356134399789, value= Franklin
4 row(s) in 0.4850 seconds
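
HBaseStorage also works as a load function, so you can pull these rows back into a PIG relation. Here is a minimal sketch; the '-loadKey' option, which prepends the row key to each tuple, is an assumption worth checking against your PIG version's HBaseStorage documentation:

-- load the rows back, including the row key as the first field
names = LOAD 'hbase://sample_names' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:fname info:lname', '-loadKey') AS (
    listing_id: chararray,
    fname: chararray,
    lname: chararray);

DUMP names;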

Sample Code


You can download the sample code from this blog post here.

Next: Column Families


In PIG 0.9.0 we get some new functionality for treating entire column families as maps. I’ll post some examples, along with some UDFs we wrote to support that, next.
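
As a quick preview, the idea is that a whole column family can be pulled into a single PIG map per row. A minimal sketch, assuming PIG 0.9's asterisk prefix syntax for HBaseStorage (again, verify against your version's documentation):

-- every column in the 'info' family becomes an entry in one map per row
names_info = LOAD 'hbase://sample_names' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:*', '-loadKey') AS (
    listing_id: chararray,
    info: map[]);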
