How do I load a fixed-width formatted file into Hive? (with a little help from Pig)
UPDATE: Feb 5, 2015
See the comments section, which identifies a better way to do this using FixedWidthLoader.
I was talking with a client earlier this week who is using the Hortonworks Sandbox to jumpstart his Hadoop learning (hey, that's exactly why we put it out there).
Thanks to all the Sandbox tutorials out there, he already knew how to take a delimited file, get it into HDFS, use HCatalog to build a table, and then run Hive queries against it. What he didn't know how to do was load up a good old-fashioned fixed-width formatted file into a table. My first hunch was to use a simple Pig script to convert it into a delimited file, and I told him I'd pull together a simple set of instructions to help him with his Sandbox self-training. Robert, here ya go!!
Create & Upload a Test File
To try this all out we first need a little test file. Create a file called emps-fixed.txt on your workstation with the following contents, keeping the column alignment exactly as shown since that's the whole point of fixed-width (yes... the first two lines are part of the exploration, but they don't break anything either).
12345678901234567890123456789012345678901234567890123456789012345678901234567890
EMP-ID    FIRST-NAME          LASTNAME            JOB-TITLE           MGR-EMP-ID
12301     Johnny              Begood              Programmer          12306
12302     Ainta               Listening           Programmer          12306
12303     Neva                Mind                Architect           12306
12304     Joseph              Blow                Tester              12308
12305     Sallie              Mae                 Programmer          12306
12306     Bilbo               Baggins             Development Manager 12307
12307     Nuther              One                 Director            11111
12308     Yeta                Notherone           Testing Manager     12307
12309     Evenmore            Dumbnames           Senior Architect    12307
12310     Last                Sillyname           Senior Tester       12308
Then upload the file into HDFS and just land it in your home directory on the Sandbox.
I'm taking some creative liberty and assuming that you've worked through (or at least understand the concepts of) some/most/all of the Sandbox tutorials. For example, check out Tutorial 1 to see how to upload this file. If at any point in this blog posting you are unfamiliar with how to do an activity, check back with the tutorials for help. If you can't find which one can help you, let me know in the comments section and I'll reply with more info.
At this point, you should have a /user/hue/emps-fixed.txt file sitting in HDFS.
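As an aside, if you prefer a terminal over the File Browser, the stock HDFS command line gets you to the same place. A quick sketch, assuming the file has already been copied onto the Sandbox and your user can write to /user/hue:

hadoop fs -put emps-fixed.txt /user/hue/emps-fixed.txt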
Build a Converter Script
Now we just need to build a simple converter script. There are plenty of good resources out there on Pig, including the tutorials; I'm even trying to build up a Pig Cheat Sheet. For the one you'll see below, I basically stripped down the info found at https://bluewatersql.wordpress.com/2013/04/17/shakin-bacon-using-pig-to-process-data/, which is worth checking out (remember, a lazy programmer is a good programmer & imitation is the sincerest form of flattery).
Using the Sandbox, let's create a Pig script called convert-emp, which will start out with some plumbing. Yes, that base "library" is called the piggy bank – I love the name, too!!
REGISTER piggybank.jar;
-- alias the Piggy Bank SUBSTRING UDF so it can be called by its short name
define SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();
-- read each line of the fixed-width file in as one long chararray
fixed_length_input = load '/user/hue/emps-fixed.txt' as (row:chararray);
Then build a structure using the SUBSTRING function to pluck the right values out of each line. Lastly, just write this new structure into a file on HDFS.
-- the (start, stop) offsets line up with the column ruler at the top of the test file
employees = foreach fixed_length_input generate
    (int)TRIM(SUBSTRING(row, 0, 9))   AS emp_id,
    TRIM(SUBSTRING(row, 10, 29))      AS first_name,
    TRIM(SUBSTRING(row, 30, 49))      AS last_name,
    TRIM(SUBSTRING(row, 50, 69))      AS job_title,
    (int)TRIM(SUBSTRING(row, 70, 79)) AS mgr_emp_id;
-- PigStorage writes tab-delimited output by default
store employees into '/user/hue/emps-delimited';
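By the way, if you'd rather not use the Sandbox's Pig interface, the same script can be run from a shell on the Sandbox. A sketch, assuming you saved it as convert-emp.pig and piggybank.jar sits in the directory you launch from:

pig -f convert-emp.pig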
Verify the Output File
Now you can navigate into /user/hue/emps-delimited on HDFS and you'll see the familiar (or not so familiar?) "part" file via the Sandbox's File Browser, which shows the following tab-delimited content.
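If you're reading along without a Sandbox in front of you, the ten employee rows in that part file should come out roughly like this (tabs shown here as whitespace), with the two header lines from the source file sneaking through as a couple of junk rows ahead of them:

12301   Johnny     Begood      Programmer           12306
12302   Ainta      Listening   Programmer           12306
12303   Neva       Mind        Architect            12306
12304   Joseph     Blow        Tester               12308
12305   Sallie     Mae         Programmer           12306
12306   Bilbo      Baggins     Development Manager  12307
12307   Nuther     One         Director             11111
12308   Yeta       Notherone   Testing Manager      12307
12309   Evenmore   Dumbnames   Senior Architect     12307
12310   Last       Sillyname   Senior Tester        12308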
You're welcome to rerun the exercise without the first two rows, but again, they won't cause any real harm due to the forgiving nature of Pig, which tries its best to do what you tell it to do. For the "real world", you'd probably want to write some validation logic so that a value like "EMP-ID" (from the header of the source file) fails miserably when it gets cast to an integer, rather than quietly turning into a null; a sketch of one approach follows.
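Here's a minimal sketch of that kind of defensive step (my own addition, and just one approach among many): rather than letting a bad value reach the cast, filter out any line whose employee ID column isn't purely numeric before the foreach runs.

-- keep only lines whose first column is all digits
clean_input = filter fixed_length_input by TRIM(SUBSTRING(row, 0, 9)) matches '\\d+';

Point the foreach at clean_input instead of fixed_length_input and the "EMP-ID" header row never makes it to the (int) cast. Note that the all-digits ruler line would still slip through this particular check, which is exactly the sort of edge case real validation logic has to think about.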
Load a Table and Run a Query
Now you need to create a table based on this delimited file (hint: see Tutorial 1 again for one way to do that).
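If you'd rather skip the HCatalog route and use straight HiveQL, a minimal sketch is an external table laid over the Pig output directory (the table and column names here are simply my choices to line up with the Pig script):

CREATE EXTERNAL TABLE employees (
  emp_id     INT,
  first_name STRING,
  last_name  STRING,
  job_title  STRING,
  mgr_emp_id INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/hue/emps-delimited';

Then run a simple query to verify all works. I ran the following HiveQL.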
select emp_id, last_name, job_title
from employees
where mgr_emp_id = 12306;
It returned the following in the Sandbox.
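Based on the sample data, that should be the four folks reporting to Bilbo Baggins:

12301   Begood      Programmer
12302   Listening   Programmer
12303   Mind        Architect
12305   Mae         Programmer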
There you have it – a super simple way to get a fixed-width file into a table that you can query from Hive.