October 19, 2011

HBaseStorage and PIG

Cross-posted from my company blog.


We’ve been using PIG for some time now, both for analytics and for processing data for our site. PIG is a high-level language for building data analysis programs that run across a distributed Hadoop cluster. It has allowed us to scale up our data processing while decreasing the amount of time it takes to run jobs.

When it came time to update our runtime data storage for the site, it was natural for us to consider using HBase to achieve horizontal scalability. HBase is a distributed, versioned, column-oriented store based on Hadoop. One of the great advantages of using HBase is the ability to integrate it with our existing PIG data processing. In this post I will introduce you to the basics of working with HBase from your PIG scripts.

Getting Started


Before getting into the details of using HBaseStorage, there are a couple of environment variables you need to set so that it can work correctly.

export HBASE_HOME=/usr/lib/hbase
export PIG_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$PIG_CLASSPATH"

First, you need to let HBaseStorage know where to find the HBase configuration, hence the HBASE_HOME environment variable. Second, PIG_CLASSPATH needs to be extended to include the classpath for loading HBase. If you are using PIG 0.8.x, there is a slight variation:

export HADOOP_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$HADOOP_CLASSPATH"


Hello World


Let’s write a simple script to load some data from a file and write it out to an HBase table. To begin, use the HBase shell to create your table:

jhoover@jhoover2:~$ hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.90.3-cdh3u1, r, Mon Jul 18 08:23:50 PDT 2011

hbase(main):002:0> create 'sample_names', 'info'
0 row(s) in 0.5580 seconds

Next, we’ll put some simple data in a file ‘sample_data.csv’:

1, John, Smith
2, Jane, Doe
3, George, Washington
4, Ben, Franklin

Then we’ll write a simple script to extract this data and write it into fixed columns in HBase:

raw_data = LOAD 'sample_data.csv' USING PigStorage(',') AS (
    listing_id: chararray,
    fname: chararray,
    lname: chararray);

STORE raw_data INTO 'hbase://sample_names' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:fname info:lname');
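
One thing to watch for: PigStorage(',') does not strip the spaces that follow the commas in sample_data.csv, so the stored values will carry a leading space (you can see this in the scan output later in this post). If you want clean values, you could add a step like this before the STORE and store trimmed instead of raw_data; this is a sketch assuming the TRIM builtin, which ships with recent PIG versions:

trimmed = FOREACH raw_data GENERATE
    listing_id,
    TRIM(fname) AS fname,
    TRIM(lname) AS lname;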

Then run the pig script locally:

jhoover@jhoover2:~/hbase_sample$ pig -x local hbase_sample.pig

Success!

Job Stats (time in seconds):
JobId Alias Feature Outputs
job_local_0001 raw_data MAP_ONLY hbase://sample_names,

Input(s):
Successfully read records from: "file:///autohome/jhoover/hbase_sample/sample_data.csv"

Output(s):
Successfully stored records in: "hbase://sample_names"

Job DAG:
job_local_0001

You can then see the results of your script in the HBase shell:

hbase(main):001:0> scan 'sample_names'
ROW COLUMN+CELL
1 column=info:fname, timestamp=1356134399789, value= John
1 column=info:lname, timestamp=1356134399789, value= Smith
2 column=info:fname, timestamp=1356134399789, value= Jane
2 column=info:lname, timestamp=1356134399789, value= Doe
3 column=info:fname, timestamp=1356134399789, value= George
3 column=info:lname, timestamp=1356134399789, value= Washington
4 column=info:fname, timestamp=1356134399789, value= Ben
4 column=info:lname, timestamp=1356134399789, value= Franklin
4 row(s) in 0.4850 seconds
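
HBaseStorage also works as a load function, so you can pull these rows back into a PIG relation. Here is a minimal sketch; the '-loadKey' option, which prepends the row key to each tuple, is an assumption worth checking against your PIG version's HBaseStorage documentation:

-- load the rows back, including the row key as the first field
names = LOAD 'hbase://sample_names' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:fname info:lname', '-loadKey') AS (
    listing_id: chararray,
    fname: chararray,
    lname: chararray);

DUMP names;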

Sample Code


You can download the sample code from this blog post here.

Next: Column Families


In PIG 0.9.0 we get some new functionality for treating entire column families as maps. I’ll post some examples, along with some UDFs we wrote to support that, next.
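
As a quick preview, the idea is that a whole column family can be pulled into a single PIG map per row. A minimal sketch, assuming PIG 0.9's asterisk prefix syntax for HBaseStorage (again, verify against your version's documentation):

-- every column in the 'info' family becomes an entry in one map per row
names_info = LOAD 'hbase://sample_names' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:*', '-loadKey') AS (
    listing_id: chararray,
    info: map[]);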
