how do i load a fixed-width formatted file into hive? (with a little help from pig)

UPDATE: Feb 5, 2015

See the comments section which identifies a better way to do this using FixedWidthLoader.
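For the impatient, a minimal sketch of that approach might look like the following. The column ranges, the 'SKIP_HEADER' flag, and the schema string are my assumptions based on the piggybank documentation, so double-check them against your Pig version.

REGISTER piggybank.jar;

-- FixedWidthLoader takes 1-based column ranges, an optional header-skip flag,
-- and an optional schema (all assumed here; verify against your piggybank version)
employees = LOAD '/user/hue/emps-fixed.txt'
    USING org.apache.pig.piggybank.storage.FixedWidthLoader(
        '1-10, 11-30, 31-50, 51-70, 71-80',
        'SKIP_HEADER',
        'emp_id: int, first_name: chararray, last_name: chararray, job_title: chararray, mgr_emp_id: int');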

I was talking with a client earlier this week who is using the Hortonworks Sandbox to jumpstart his Hadoop learning (hey, that's exactly why we put it out there).

Thanks to all the Sandbox tutorials out there he already knew how to take a delimited file, get it into HDFS, use HCatalog to build a table, and then run Hive queries against it.  What he didn't know how to do was to load up a good old-fashioned fixed-width formatted file into a table.  My first hunch was to use a simple Pig script to convert it into a delimited file, and I told him I'd pull together a simple set of instructions to help him with his Sandbox self-training.  Robert, here ya go!!

Create & Upload a Test File

To try this all out we first need a little test file.  Create a file called emps-fixed.txt on your workstation with the following contents (yes... the first two lines are part of the exploration, but they don't break anything either).

12345678901234567890123456789012345678901234567890123456789012345678901234567890
EMP-ID    FIRST-NAME          LASTNAME            JOB-TITLE           MGR-EMP-ID
12301     Johnny              Begood              Programmer          12306
12302     Ainta               Listening           Programmer          12306
12303     Neva                Mind                Architect           12306
12304     Joseph              Blow                Tester              12308
12305     Sallie              Mae                 Programmer          12306
12306     Bilbo               Baggins             Development Manager 12307
12307     Nuther              One                 Director            11111
12308     Yeta                Notherone           Testing Manager     12307
12309     Evenmore            Dumbnames           Senior Architect    12307
12310     Last                Sillyname           Senior Tester       12308

Then upload the file into HDFS and just land it in your home directory on the Sandbox.
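The File Browser in the Sandbox works fine for this, or, if you have a shell open, a single HDFS command does the same job (assuming the hue user's home directory, as in the tutorials):

hadoop fs -put emps-fixed.txt /user/hue/emps-fixed.txt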

I'm taking some creative liberty and assuming that you've worked through (or at least understand the concepts of) some/most/all of the Sandbox tutorials.  For example, check out Tutorial 1 to see how to upload this file.  If at any point in this blog posting you are unfamiliar with how to do an activity, check back with the tutorials for help.  If you can't find which one can help you, let me know in the comments section and I'll reply with more info.

At this point, you should have a /user/hue/emps-fixed.txt file sitting in HDFS.

Build a Converter Script

Now we just need to build a simple converter script.  There are plenty of good resources out there on Pig including the tutorials; I'm even trying to build up a Pig Cheat Sheet.  For the one you'll see below, I basically stripped down the info found at https://bluewatersql.wordpress.com/2013/04/17/shakin-bacon-using-pig-to-process-data/, which is worth checking out (remember, a lazy programmer is a good programmer & imitation is the sincerest form of flattery).

Using the Sandbox, let's create a Pig script called convert-emp, which will start out with some plumbing.  Yes, that base "library" is called the piggy bank – I love the name, too!!

REGISTER piggybank.jar;

define SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

fixed_length_input = load '/user/hue/emps-fixed.txt' as (row:chararray);

Then build a structure using the SUBSTRING function to pluck the right values out of each line.  Lastly, just write this new structure into a file on HDFS.

employees = foreach fixed_length_input generate
    (int)TRIM(SUBSTRING(row, 0, 9))   AS emp_id,
    TRIM(SUBSTRING(row, 10, 29))      AS first_name,
    TRIM(SUBSTRING(row, 30, 49))      AS last_name,
    TRIM(SUBSTRING(row, 50, 69))      AS job_title,
    (int)TRIM(SUBSTRING(row, 70, 79)) AS mgr_emp_id;

store employees into '/user/hue/emps-delimited';
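A quick note on the output format: store with no storage function named uses PigStorage, which writes tab-delimited text by default.  If you ever want a different delimiter, you can name one explicitly, for example:

store employees into '/user/hue/emps-delimited' using PigStorage(',');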

Verify the Output File

Now you can navigate into /user/hue/emps-delimited on HDFS and you'll see the familiar (or not so familiar?) "part" file via the Sandbox's File Browser which shows the following tab-delimited content.
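Based on the sample file, the ten employee rows in the part file should look like this, tab-separated (the two header lines will also show up as a couple of junk rows at the top unless you removed them):

12301	Johnny	Begood	Programmer	12306
12302	Ainta	Listening	Programmer	12306
12303	Neva	Mind	Architect	12306
12304	Joseph	Blow	Tester	12308
12305	Sallie	Mae	Programmer	12306
12306	Bilbo	Baggins	Development Manager	12307
12307	Nuther	One	Director	11111
12308	Yeta	Notherone	Testing Manager	12307
12309	Evenmore	Dumbnames	Senior Architect	12307
12310	Last	Sillyname	Senior Tester	12308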

You're welcome to rerun the exercise without the first two rows, but again, they won't cause any real harm thanks to the forgiving nature of Pig, which tries its best to do what you tell it to do.  For the "real world", you'd probably want to write some validation logic so that a value like "EMP-ID" (from the header row of the source file) fails loudly when it can't be cast to an integer, rather than quietly becoming a null.

Load a Table and Run a Query

Now you need to create a table based on this delimited file (hint, see Tutorial 1 again for one way to do that) and then run a simple query to verify everything works.
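If you'd rather build the table directly in HiveQL than through HCatalog, something along these lines should work; the table name, column names, and types here are my own choices based on the Pig schema above, so adjust to taste.

CREATE TABLE employees (
  emp_id      INT,
  first_name  STRING,
  last_name   STRING,
  job_title   STRING,
  mgr_emp_id  INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/hue/emps-delimited' INTO TABLE employees;

With the table in place, I ran the following HiveQL.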

select emp_id, last_name, job_title from employees where mgr_emp_id = 12306;

It returned the following in the Sandbox.
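With the sample data, that's everyone reporting to Bilbo Baggins (employee 12306):

12301	Begood	Programmer
12302	Listening	Programmer
12303	Mind	Architect
12305	Mae	Programmer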

There you have it – a super simple way to get a fixed-width file into a table that you can query from Hive.
