...

Info
titleUPDATE: Feb 5, 2015

See the comments section, which identifies a better way to do this using FixedWidthLoader.
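
For reference, a minimal sketch of that approach might look like the block below. The column ranges, the 'SKIP_HEADER' option, and the schema string are all assumptions for illustration; adapt them to your file's actual layout.

Code Block
languagetext
REGISTER piggybank.jar;

-- Ranges are 1-based and inclusive (cut-style). These offsets are made
-- up for illustration and must match the real fixed-width layout.
-- 'SKIP_HEADER' drops an assumed header row; the schema string names
-- and types the resulting fields.
emps = LOAD '/user/hue/emps-fixed.txt'
    USING org.apache.pig.piggybank.storage.FixedWidthLoader(
        '1-6, 7-26, 27-56, 57-62', 'SKIP_HEADER',
        'emp_id: int, last_name: chararray, job_title: chararray, mgr_emp_id: int');

This skips the manual SUBSTRING work described below entirely.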

I was talking with a client earlier this week who is using the Hortonworks Sandbox to jumpstart his Hadoop learning (hey, that's exactly why we put it out there).

Thanks to all the Sandbox tutorials out there, he already knew how to take a delimited file, get it into HDFS, use HCatalog to build a table, and then run Hive queries against it.  What he didn't know how to do was load a good old-fashioned fixed-width formatted file into a table.  My first hunch was to use a simple Pig script to convert it into a delimited file, and I told him I'd pull together a simple set of instructions to help him with his Sandbox self-training.  Robert, here ya go!!

...

Info

I'm taking some creative liberty and assuming that you've worked through (or at least understand the concepts of) some/most/all of the Sandbox tutorials.  For example, check out Tutorial 1 to see how to upload this file.  If at any point in this blog post you are unfamiliar with how to do an activity, check back with the tutorials for help.  If you can't find one that helps, let me know in the comments section and I'll reply with more info.

...

Now we just need to build a simple converter script.  There are plenty of good resources out there on Pig, including the tutorials; I'm even trying to build up a Pig Cheat Sheet.  For the script you'll see below, I basically stripped down the info found at https://bluewatersql.wordpress.com/2013/04/17/shakin-bacon-using-pig-to-process-data/, which is worth checking out (remember, a lazy programmer is a good programmer & imitation is the sincerest form of flattery).

...

Code Block
languagetext
REGISTER piggybank.jar;
DEFINE SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

fixed_length_input = LOAD '/user/hue/emps-fixed.txt' AS (row:chararray);

Then build a structure, using the SUBSTRING function to pluck the right values out of each line.  Lastly, just write this new structure into a file on HDFS.
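
Here is a minimal sketch of those two steps.  The start/end offsets are assumptions for illustration (the actual layout of emps-fixed.txt isn't shown here), as is the output path, so adjust both to fit your file.

Code Block
languagetext
-- Assumed layout (0-based start, end exclusive): emp_id 0-6,
-- last_name 6-26, job_title 26-56, mgr_emp_id 56-62.  TRIM strips the
-- padding spaces that fixed-width fields carry.
emps = FOREACH fixed_length_input GENERATE
    (int)TRIM(SUBSTRING(row,  0,  6)) AS emp_id,
         TRIM(SUBSTRING(row,  6, 26)) AS last_name,
         TRIM(SUBSTRING(row, 26, 56)) AS job_title,
    (int)TRIM(SUBSTRING(row, 56, 62)) AS mgr_emp_id;

-- PigStorage(',') writes comma-delimited text (the default is tab).
STORE emps INTO '/user/hue/emps-delimited' USING PigStorage(',');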

...

Now you need to create a table based on this delimited file (hint: see Tutorial 1 again for one way to do that) and then run a simple query to verify it all works.
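
If you'd rather not click through the HCatalog UI, a minimal HiveQL sketch along these lines can define the table directly over the delimited output; the column types and the LOCATION path are my assumptions (matching the Pig sketch above), so adjust as needed.

Code Block
languagetext
CREATE EXTERNAL TABLE employees (
  emp_id      INT,
  last_name   STRING,
  job_title   STRING,
  mgr_emp_id  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hue/emps-delimited';

With the table in place, I ran the following HiveQL.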

Code Block
languagetext
select emp_id, last_name, job_title
  from employees
 where mgr_emp_id = 12306;

...

There you have it – a super simple way to get a fixed-width file into a table that you can query from Hive.