viewing the content of ORC files (using the Java ORC tool jar)

The Apache ORC file format has been used heavily by Apache Hive for many years now, but because it is a binary format, basic tools can’t show us much of what is inside these files, as shown below.

$ cat orcfile
ORC P1 ???>P>??be!Q%~.ע?d!?????T ?; DoeSmith(P4??be!%..&wG!?? ? ??'LesterEricJohnSusie FdEBR F6PDoeMartinSmithGATXOKMA??? ??]?M?Ku??????9?sT?#?ްͲ㖆O:^xh?>??FWe?Pve??桿F?Ӳ?LuS????b?` `??`???/p?_?]C?8???kQf?kpiqf??PB?K (???쒟 X?X8X?9X?89.? Ź?????B"$?b4?`X?$???,??(???????#?????"Ŝ??"Ś????*Ś??KKR??8???? ????b??a%???????Z??,?\Z??*????J?q1???s3K2$4??rVB@q..&wG!?? ???????" (^0??ORC

Fortunately, the ORC project has a couple of options for CLI tools. For this posting, I settled on the Java Tools. Now, you could be a good citizen and build these yourself from source, but I (the lazy programmer that I am) decided to just download a compiled “uber jar” file.

First, I needed to figure out which version of ORC I was using. I am currently using HDP 3.1.0 and I took a peek into the Hive lib folder.

$ ls /usr/hdp/current/hive-client/lib/orc*
/usr/hdp/current/hive-client/lib/orc-core-1.5.1.3.1.0.0-78.jar
/usr/hdp/current/hive-client/lib/orc-shims-1.5.1.3.1.0.0-78.jar

The HDP jar file naming convention told me I was using ORC 1.5.1 (the leading 1.5.1 is the ORC version and the trailing 3.1.0.0-78 is the HDP build), so I surfed over to https://repo1.maven.org/maven2/org/apache/orc/orc-tools/1.5.1/ and pulled down the appropriate file.

wget https://repo1.maven.org/maven2/org/apache/orc/orc-tools/1.5.1/orc-tools-1.5.1-uber.jar
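If you would rather not read the version off the jar name by eye, the same step can be scripted. This is only a sketch: the helper name `orc_version_from_jar` is made up for illustration, and it assumes the HDP convention seen above, where the first three numeric components after the `orc-<module>-` prefix are the upstream ORC version.

```shell
# Hypothetical helper: pull the upstream ORC version out of an HDP-style jar name.
orc_version_from_jar() {
  # Drop the directory, then keep only the first three numeric components
  # that follow the "orc-<module>-" prefix (assumed HDP naming convention).
  basename "$1" | sed -E 's/^orc-[a-z]+-([0-9]+\.[0-9]+\.[0-9]+)\..*$/\1/'
}

jar=/usr/hdp/current/hive-client/lib/orc-core-1.5.1.3.1.0.0-78.jar
version=$(orc_version_from_jar "$jar")

# Build the Maven Central URL for the matching orc-tools uber jar.
url="https://repo1.maven.org/maven2/org/apache/orc/orc-tools/${version}/orc-tools-${version}-uber.jar"

echo "$version"
echo "$url"
```

The resulting `$url` can be handed straight to `wget`.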

Now, I’m ready to use the tools, but… I realized I didn’t have an ORC file to test it out with, so I decided I would use Apache Pig to build a small file. I first created a simple CSV file with vi and then pushed it to HDFS. The contents of the file are as follows.

I then wrote a little read & write conversion script and then executed it.
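For anyone following along, here is a minimal sketch of what such a conversion script could look like. Everything specific in it is an assumption made for illustration: the script name `csv2orc.pig`, the HDFS paths `people.csv` and `people_orc`, and the three column names. The only pieces taken from Pig itself are the built-in `PigStorage` loader and the built-in `OrcStorage` store function.

```shell
# Write a small Pig script. File name, paths, and column names are hypothetical.
cat > csv2orc.pig <<'EOF'
-- Read the comma-delimited file from HDFS using a guessed schema.
people = LOAD 'people.csv' USING PigStorage(',')
         AS (first_name:chararray, last_name:chararray, state:chararray);
-- Write the same rows back out with Pig's built-in ORC storage function.
STORE people INTO 'people_orc' USING OrcStorage();
EOF

# Run the script only on a machine where the pig client is installed.
if command -v pig >/dev/null 2>&1; then
  pig csv2orc.pig
fi
```

Adjust the schema in the `AS (...)` clause to match the columns in your own CSV file.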

As expected, it created a simple little ORC file which I pulled down to my Linux home directory.

NOW, we can finally try out the ORC Tools jar. First up, we can look at the metadata of this file.
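The invocation looks roughly like the sketch below. It assumes the uber jar downloaded earlier and the ORC file (called `orcfile`, as in the `cat` example at the top) both sit in the current directory; the variable names are just for illustration. `meta` is the orc-tools subcommand that prints file metadata.

```shell
# Assumption: the jar and the ORC file are both in the current directory.
ORC_TOOLS_JAR=orc-tools-1.5.1-uber.jar
ORC_FILE=orcfile

# "meta" prints file-level metadata such as the schema and column statistics.
META_CMD="java -jar $ORC_TOOLS_JAR meta $ORC_FILE"
echo "$META_CMD"

# Only run the command when the jar and the file are really present.
if [ -f "$ORC_TOOLS_JAR" ] && [ -f "$ORC_FILE" ]; then
  $META_CMD
fi
```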

That had some interesting info (and I trimmed a good bit of the output to keep it short), but what this post has really been about all along is showing the contents of the file, so we just switch the subcommand.
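A sketch of that switch, under the same assumptions as before (jar and a file named `orcfile` in the current directory, illustrative variable names): the only change is using the `data` subcommand, which prints the rows of the file.

```shell
# Assumption: the jar and the ORC file are both in the current directory.
ORC_TOOLS_JAR=orc-tools-1.5.1-uber.jar
ORC_FILE=orcfile

# "data" prints the rows stored in the ORC file instead of its metadata.
DATA_CMD="java -jar $ORC_TOOLS_JAR data $ORC_FILE"
echo "$DATA_CMD"

# Only run the command when the jar and the file are really present.
if [ -f "$ORC_TOOLS_JAR" ] && [ -f "$ORC_FILE" ]; then
  $DATA_CMD
fi
```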

Perfect, we can see the four rows represented as JSON documents, which are so much easier to read than the stuff we started out with.