Blog from December, 2019

This is a quick blog post to show how minor and major compaction occurs for Hive transactional tables. Let’s pick up from the situation that the hive acid transactions with partitions (a behind the scenes perspective) post leaves us in. Here it is!

$ hdfs dfs -ls -R /wa/t/m/h/try_it
drwxrwx---+  - hive hadoop          0 2019-12-12 09:57 /wa/t/m/h/try_it/prt=p1
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        733 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:57 /wa/t/m/h/try_it/prt=p1/delete_delta_0000005_0000005_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:57 /wa/t/m/h/try_it/prt=p1/delete_delta_0000005_0000005_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        733 2019-12-12 09:57 /wa/t/m/h/try_it/prt=p1/delete_delta_0000005_0000005_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        788 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        816 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:45 /wa/t/m/h/try_it/prt=p2
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        727 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:45 /wa/t/m/h/try_it/prt=p2/delete_delta_0000004_0000004_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delete_delta_0000004_0000004_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        733 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delete_delta_0000004_0000004_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        788 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        816 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:45 /wa/t/m/h/try_it/prt=p2/delta_0000004_0000004_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delta_0000004_0000004_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        815 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delta_0000004_0000004_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 10:06 /wa/t/m/h/try_it/prt=p3
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        727 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        796 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        807 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 10:06 /wa/t/m/h/try_it/prt=p3/delta_0000006_0000006_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 10:06 /wa/t/m/h/try_it/prt=p3/delta_0000006_0000006_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        813 2019-12-12 10:06 /wa/t/m/h/try_it/prt=p3/delta_0000006_0000006_0000/bucket_00000

As that blog post, and the directory listing above, show, a total of six ACID transactions occurred across three partitions to get to this point. The table’s contents look like the following.

select * from try_it;
+------------+---------------+---------------+-------------+
| try_it.id  | try_it.a_val  | try_it.b_val  | try_it.prt  |
+------------+---------------+---------------+-------------+
| 2          | noise         | bogus3        | p2          |
| 3          | noise         | bogus2        | p3          |
| 1          | noise         | bogus2        | p3          |
+------------+---------------+---------------+-------------+
3 rows selected (0.307 seconds)

As https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Compactor shows, we need to make sure hive.compactor.initiator.on is set to true for the compactor to be run.
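
If you want to double-check what your environment reports for the compactor settings, one quick (and admittedly rough) sanity check is to echo the properties from a Beeline session, as sketched below; keep in mind that the values that actually matter are the ones in effect on the Metastore/HiveServer2 side where the compactor threads run.

SET hive.compactor.initiator.on;
SET hive.compactor.worker.threads;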

Minor Compaction

This type of compaction is scheduled after the number of delta directories passes the value set in the hive.compactor.delta.num.threshold property, but you can also trigger it to run on-demand.
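
As an aside, the Hive transactions documentation also describes per-table overrides of these thresholds via table properties prefixed with compactorthreshold. A sketch of lowering the delta count threshold for just this table might look like the following (treat the exact property name as something to verify against your Hive version); the on-demand route is shown next.

ALTER TABLE try_it
  SET TBLPROPERTIES ('compactorthreshold.hive.compactor.delta.num.threshold'='4');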

ALTER TABLE try_it COMPACT 'minor';

ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
You must specify a partition to compact for partitioned tables

This error makes the point that, for a partitioned table, we must run compaction against a specific partition. Let’s try it again, and be sure to wait until it completes!

ALTER TABLE try_it partition (prt='p1') COMPACT 'minor';

show compactions;
+---------------+-----------+----------+------------+--------+------------+-----------+----------------+---------------+-------------------------+
| compactionid  |  dbname   | tabname  |  partname  |  type  |   state    | workerid  |   starttime    |   duration    |       hadoopjobid       |
+---------------+-----------+----------+------------+--------+------------+-----------+----------------+---------------+-------------------------+
| CompactionId  | Database  | Table    | Partition  | Type   | State      | Worker    | Start Time     | Duration(ms)  | HadoopJobId             |
| 1             | default   | try_it   | prt=p1     | MINOR  | succeeded  |  ---      | 1576145642000  | 179000        | job_1575915931720_0012  |
+---------------+-----------+----------+------------+--------+------------+-----------+----------------+---------------+-------------------------+
2 rows selected (0.031 seconds)

Let’s look at the file system again for the p1 partition.

$ hdfs dfs -ls -R /wa/t/m/h/try_it/prt=p1
drwxrwx---+  - hive hadoop          0 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005
-rw-rw----+  3 hive hadoop          1 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005/_orc_acid_version
-rw-rw----+  3 hive hadoop        654 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000005
-rw-rw----+  3 hive hadoop          1 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000005/_orc_acid_version
-rw-rw----+  3 hive hadoop        670 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000005/bucket_00000

Since this partition had changes from transactions 1, 3, and 5, we now see the delta directories rolled together into directories spanning that range of transaction #s. Let’s verify that the contents have been rolled up into a single file each for the delta and delete_delta sides.

$ hdfs dfs -get /wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005/bucket_00000 p1Minor-delete_delta
$ java -jar orc-tools-1.5.1-uber.jar data p1Minor-delete_delta 
Processing data file p1Minor-delete_delta [length: 654]
{"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":3,
"row":null}
{"operation":2,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":5,
"row":null}

$ hdfs dfs -get /wa/t/m/h/try_it/prt=p1/delta_0000001_0000005/bucket_00000 p1Minor-delta
$ java -jar orc-tools-1.5.1-uber.jar data p1Minor-delta 
Processing data file p1Minor-delta [length: 670]
{"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":1,
"row":{"id":1,"a_val":"noise","b_val":"bogus"}}
{"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":3,
"row":{"id":1,"a_val":"noise","b_val":"bogus2"}}

Even though the minor-compacted delete_delta file only shows transaction IDs 3 and 5 (see the currentTransaction attributes) and the delta only includes 1 and 3, both of these compacted directories carry the comprehensive 0000001_0000005 range in their names.

Now, initiate minor compaction for the other two partitions.

ALTER TABLE try_it partition (prt='p2') COMPACT 'minor';
ALTER TABLE try_it partition (prt='p3') COMPACT 'minor';

show compactions
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20191212201440_3a58f8c2-ebc2-4ebe-96e3-a774bf8bec65); Time taken: 0.002 seconds
INFO  : OK
+---------------+-----------+----------+------------+--------+------------+-------------------------------------------------+----------------+---------------+-------------------------+
| compactionid  |  dbname   | tabname  |  partname  |  type  |   state    |                    workerid                     |   starttime    |   duration    |       hadoopjobid       |
+---------------+-----------+----------+------------+--------+------------+-------------------------------------------------+----------------+---------------+-------------------------+
| CompactionId  | Database  | Table    | Partition  | Type   | State      | Worker                                          | Start Time     | Duration(ms)  | HadoopJobId             |
| 2             | default   | try_it   | prt=p2     | MINOR  | working    | ip-172-30-10-206.us-west-2.compute.internal-37  | 1576181672000  |  ---          | job_1575915931720_0013  |
| 3             | default   | try_it   | prt=p3     | MINOR  | working    | ip-172-30-10-206.us-west-2.compute.internal-36  | 1576181677000  |  ---          | job_1575915931720_0014  |
| 1             | default   | try_it   | prt=p1     | MINOR  | succeeded  |  ---                                            | 1576145642000  | 179000        | job_1575915931720_0012  |
+---------------+-----------+----------+------------+--------+------------+-------------------------------------------------+----------------+---------------+-------------------------+
4 rows selected (0.03 seconds)

Once those two compaction processes finish, the table’s underlying filesystem footprint should look like the following.

$ hdfs dfs -ls -R /wa/t/m/h/try_it
drwxrwx---+  - hive hadoop          0 2019-12-12 10:17 /wa/t/m/h/try_it/prt=p1
drwxrwx---+  - hive hadoop          0 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005
-rw-rw----+  3 hive hadoop          1 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005/_orc_acid_version
-rw-rw----+  3 hive hadoop        654 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000005
-rw-rw----+  3 hive hadoop          1 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000005/_orc_acid_version
-rw-rw----+  3 hive hadoop        670 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000005/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 20:14 /wa/t/m/h/try_it/prt=p2
drwxrwx---+  - hive hadoop          0 2019-12-12 20:14 /wa/t/m/h/try_it/prt=p2/delete_delta_0000002_0000004
-rw-rw----+  3 hive hadoop          1 2019-12-12 20:14 /wa/t/m/h/try_it/prt=p2/delete_delta_0000002_0000004/_orc_acid_version
-rw-rw----+  3 hive hadoop        654 2019-12-12 20:14 /wa/t/m/h/try_it/prt=p2/delete_delta_0000002_0000004/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 20:14 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000004
-rw-rw----+  3 hive hadoop          1 2019-12-12 20:14 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000004/_orc_acid_version
-rw-rw----+  3 hive hadoop        670 2019-12-12 20:14 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000004/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 20:15 /wa/t/m/h/try_it/prt=p3
drwxrwx---+  - hive hadoop          0 2019-12-12 20:15 /wa/t/m/h/try_it/prt=p3/delete_delta_0000002_0000006
-rw-rw----+  3 hive hadoop          1 2019-12-12 20:15 /wa/t/m/h/try_it/prt=p3/delete_delta_0000002_0000006/_orc_acid_version
-rw-rw----+  3 hive hadoop        650 2019-12-12 20:15 /wa/t/m/h/try_it/prt=p3/delete_delta_0000002_0000006/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 20:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000006
-rw-rw----+  3 hive hadoop          1 2019-12-12 20:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000006/_orc_acid_version
-rw-rw----+  3 hive hadoop        678 2019-12-12 20:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000006/bucket_00000

Major Compaction

This type of compaction is scheduled based on the value set in the hive.compactor.delta.pct.threshold property, whose formal definition expresses it best:

Percentage (fractional) size of the delta files relative to the base that will trigger a major compaction. (1.0 = 100%, so the default 0.1 = 10%.)
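
In other words, with the default of 0.1, a partition whose base file is, say, 1 GB would be scheduled for major compaction once its accumulated delta files grow past roughly 100 MB.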

And, of course, you can also trigger it to run on-demand as shown below; as before, let these processes finish before checking the results.

ALTER TABLE try_it partition (prt='p1') COMPACT 'major';
ALTER TABLE try_it partition (prt='p2') COMPACT 'major';
ALTER TABLE try_it partition (prt='p3') COMPACT 'major';

+---------------+-----------+----------+------------+--------+------------+-----------+----------------+---------------+-------------------------+
| compactionid  |  dbname   | tabname  |  partname  |  type  |   state    | workerid  |   starttime    |   duration    |       hadoopjobid       |
+---------------+-----------+----------+------------+--------+------------+-----------+----------------+---------------+-------------------------+
| CompactionId  | Database  | Table    | Partition  | Type   | State      | Worker    | Start Time     | Duration(ms)  | HadoopJobId             |
| 1             | default   | try_it   | prt=p1     | MINOR  | succeeded  |  ---      | 1576145642000  | 179000        | job_1575915931720_0012  |
| 2             | default   | try_it   | prt=p2     | MINOR  | succeeded  |  ---      | 1576181672000  | 22000         | job_1575915931720_0013  |
| 3             | default   | try_it   | prt=p3     | MINOR  | succeeded  |  ---      | 1576181677000  | 37000         | job_1575915931720_0014  |
| 4             | default   | try_it   | prt=p1     | MAJOR  | succeeded  |  ---      | 1576183603000  | 31000         | job_1575915931720_0015  |
| 5             | default   | try_it   | prt=p2     | MAJOR  | succeeded  |  ---      | 1576187575000  | 50000         | job_1575915931720_0017  |
| 6             | default   | try_it   | prt=p3     | MAJOR  | succeeded  |  ---      | 1576187583000  | 57000         | job_1575915931720_0018  |
+---------------+-----------+----------+------------+--------+------------+-----------+----------------+---------------+-------------------------+
7 rows selected (0.029 seconds)

This will build a single “base” file for each partition.

$ hdfs dfs -ls -R /wa/t/m/h/try_it
drwxrwx---+  - hive hadoop          0 2019-12-12 20:47 /wa/t/m/h/try_it/prt=p1
drwxrwx---+  - hive hadoop          0 2019-12-12 20:46 /wa/t/m/h/try_it/prt=p1/base_0000005
-rw-rw----+  3 hive hadoop         48 2019-12-12 20:46 /wa/t/m/h/try_it/prt=p1/base_0000005/_metadata_acid
-rw-rw----+  3 hive hadoop          1 2019-12-12 20:46 /wa/t/m/h/try_it/prt=p1/base_0000005/_orc_acid_version
-rw-rw----+  3 hive hadoop        228 2019-12-12 20:46 /wa/t/m/h/try_it/prt=p1/base_0000005/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p2
drwxrwx---+  - hive hadoop          0 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p2/base_0000004
-rw-rw----+  3 hive hadoop         48 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p2/base_0000004/_metadata_acid
-rw-rw----+  3 hive hadoop          1 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p2/base_0000004/_orc_acid_version
-rw-rw----+  3 hive hadoop        815 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p2/base_0000004/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 21:54 /wa/t/m/h/try_it/prt=p3
drwxrwx---+  - hive hadoop          0 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p3/base_0000006
-rw-rw----+  3 hive hadoop         48 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p3/base_0000006/_metadata_acid
-rw-rw----+  3 hive hadoop          1 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p3/base_0000006/_orc_acid_version
-rw-rw----+  3 hive hadoop        835 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p3/base_0000006/bucket_00000

Looking at p1’s base file, we can confirm there is no data left in this partition.

$ hdfs dfs -get /wa/t/m/h/try_it/prt=p1/base_0000005/bucket_00000 p1Base
$ java -jar orc-tools-1.5.1-uber.jar data p1Base 
Processing data file p1Base [length: 228]

We can see that p2 has a single row.

$ hdfs dfs -get /wa/t/m/h/try_it/prt=p2/base_0000004/bucket_00000 p2Base
$ java -jar orc-tools-1.5.1-uber.jar data p2Base 
Processing data file p2Base [length: 815]
{"operation":0,"originalTransaction":4,"bucket":536870912,"rowId":0,"currentTransaction":4,
"row":{"id":2,"a_val":"noise","b_val":"bogus3"}}

And lastly, we verify that p3’s data contains two rows.

$ hdfs dfs -get /wa/t/m/h/try_it/prt=p3/base_0000006/bucket_00000 p3Base
$ java -jar orc-tools-1.5.1-uber.jar data p3Base 
Processing data file p3Base [length: 835]
{"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":3,
"row":{"id":3,"a_val":"noise","b_val":"bogus2"}}
{"operation":0,"originalTransaction":6,"bucket":536870912,"rowId":0,"currentTransaction":6,
"row":{"id":1,"a_val":"noise","b_val":"bogus2"}}

Again, now that the table is fully compacted into base files, its contents look like the following.

select * from try_it;
+------------+---------------+---------------+-------------+
| try_it.id  | try_it.a_val  | try_it.b_val  | try_it.prt  |
+------------+---------------+---------------+-------------+
| 2          | noise         | bogus3        | p2          |
| 3          | noise         | bogus2        | p3          |
| 1          | noise         | bogus2        | p3          |
+------------+---------------+---------------+-------------+
3 rows selected (0.307 seconds)

Ever since Hive transactions surfaced, and especially since Apache Hive 3 was released, I’ve been meaning to capture a behind-the-scenes look at the underlying delta ORC files that are created and, yes, compacted. If you are new to Hive’s ACID transactions, then the first link in this post as well as the Understanding Hive ACID Transaction Table blog posting are great places to start.

Bonus points to those who remember what ACID stands for – add a comment at the bottom of this posting if you know! If you don’t, it might be time to review RDBMS fundamentals. (wink)

Transactional Table DDL

Let’s create a transactional table with some Data Definition Language to test our use cases out on.

CREATE TABLE try_it (id int, a_val string, b_val string)
 PARTITIONED BY (prt string)  STORED AS ORC;
 
desc try_it;
+--------------------------+------------+----------+
|         col_name         | data_type  | comment  |
+--------------------------+------------+----------+
| id                       | int        |          |
| a_val                    | string     |          |
| b_val                    | string     |          |
| prt                      | string     |          |
|                          | NULL       | NULL     |
| # Partition Information  | NULL       | NULL     |
| # col_name               | data_type  | comment  |
| prt                      | string     |          |
+--------------------------+------------+----------+
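
On HDP 3.x, managed ORC tables like this one are created as full ACID (transactional) tables by default. If you are following along somewhere that default is not in place, a variant of the DDL with the transactional property spelled out explicitly might look like the sketch below (the table name is just illustrative).

CREATE TABLE try_it_explicit (id int, a_val string, b_val string)
 PARTITIONED BY (prt string)
 STORED AS ORC
 TBLPROPERTIES ('transactional'='true');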

Check to make sure the HDFS file structure was created.

hdfs dfs -ls /warehouse/tablespace/managed/hive/
drwxrwx---+  - hive hadoop          0 2019-12-12 07:38 /wa/t/m/h/try_it

The /warehouse/tablespace/managed/hive/ path is abbreviated as /wa/t/m/h/ in the above snippet and in the remainder of this blog posting.

DML Use Cases

Let’s explore some CRUD (Create, Retrieve, Update, Delete) use cases as expressed in Data Manipulation Language.

Txn 1: INSERT Single Row

INSERT INTO try_it VALUES (1, 'noise', 'bogus', 'p1');

select * from try_it;
+------------+---------------+---------------+-------------+
| try_it.id  | try_it.a_val  | try_it.b_val  | try_it.prt  |
+------------+---------------+---------------+-------------+
| 1          | noise         | bogus         | p1          |
+------------+---------------+---------------+-------------+
1 row selected (0.515 seconds)

Verify that the p1 partition now has a delta file and that it only includes changes belonging to transaction #1 (see the delta_0000001_0000001 indicator).

hdfs dfs -ls -R /wa/t/m/h/try_it
drwxrwx---+  - hive hadoop          0 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1
drwxrwx---+  - hive hadoop          0 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        788 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/bucket_00000

Pull down this delta file and use the knowledge from viewing the content of ORC files (using the Java ORC tool jar) to inspect it.

hdfs dfs -get /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/bucket_00000 add1
ls -l add*
-rw-r--r--. 1 hive hadoop 788 Dec 12 08:09 add1

java -jar orc-tools-1.5.1-uber.jar data add1
Processing data file add1 [length: 788]
{"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":1,
"row":{"id":1,"a_val":"noise","b_val":"bogus"}}

You can see the single row we added into the p1 partition is present. You’ll also notice it is annotated as coming from transaction #1 as indicated by "currentTransaction":1.

Txn 2: INSERT Multiple Rows Across Multiple Partitions

Insert statements allow multiple rows to be added at once, and they all belong to a single ACID transaction. This use case exercises that, and to make it a bit more fun we can span more than one partition.

INSERT INTO try_it VALUES 
(2, 'noise', 'bogus', 'p2'),
(3, 'noise', 'bogus', 'p3');

select * from try_it;
+------------+---------------+---------------+-------------+
| try_it.id  | try_it.a_val  | try_it.b_val  | try_it.prt  |
+------------+---------------+---------------+-------------+
| 1          | noise         | bogus         | p1          |
| 2          | noise         | bogus         | p2          |
| 3          | noise         | bogus         | p3          |
+------------+---------------+---------------+-------------+
3 rows selected (0.193 seconds)

Verify that both the p2 and p3 partitions now have delta directories/files which each contain changes belonging to transaction #2.

$ hdfs dfs -ls -R /wa/t/m/h/try_it
drwxrwx---+  - hive hadoop          0 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1
drwxrwx---+  - hive hadoop          0 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        788 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2
drwxrwx---+  - hive hadoop          0 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        788 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3
drwxrwx---+  - hive hadoop          0 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        796 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000/bucket_00000

Pull down the delta files and inspect them.

[hive@ip-172-30-8-35 ~]$ hdfs dfs -get /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000/bucket_00000 add2-p2
[hive@ip-172-30-8-35 ~]$ hdfs dfs -get /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000/bucket_00000 add2-p3
[hive@ip-172-30-8-35 ~]$ ls -l add2-*
-rw-r--r--. 1 hive hadoop 788 Dec 12 09:22 add2-p2
-rw-r--r--. 1 hive hadoop 796 Dec 12 09:22 add2-p3

$ java -jar orc-tools-1.5.1-uber.jar data add2-p2
Processing data file add2-p2 [length: 788]
{"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":2,
"row":{"id":2,"a_val":"noise","b_val":"bogus"}}

$ java -jar orc-tools-1.5.1-uber.jar data add2-p3
Processing data file add2-p3 [length: 796]
{"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":2,
"row":{"id":3,"a_val":"noise","b_val":"bogus"}}

You can now see that both the p2 and p3 record additions are linked to "currentTransaction":2. To reiterate, transaction #2 spanned multiple partitions and each partition’s delta folder/file was properly aligned with the same transaction.

Txn 3: UPDATE Multiple Rows Across Multiple Partitions

Updates are tricky with Hive transactions as there is no real in-place update occurring. Basically, Hive deletes the record to be updated and then does a net-new insert reflecting what the updated record should look like at the end of the SQL statement. This section will show what this looks like behind the scenes.
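
As a rough mental model only (the real UPDATE below is a single atomic statement, so you would not actually run this), the effect for one updated row is equivalent to a delete of the old version plus an insert of the new version:

-- illustrative only: what Hive effectively records for one updated row
DELETE FROM try_it WHERE id = 1 AND prt = 'p1';           -- delete event lands in a delete_delta directory
INSERT INTO try_it VALUES (1, 'noise', 'bogus2', 'p1');   -- new row version lands in a delta directory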

To make the use case more interesting, we’ll make the update span records in multiple partitions so that we can see a similar behavior to the prior use case of a particular transaction number spanning these affected partitions.

Let’s start off with the SQL.

UPDATE try_it SET b_val = 'bogus2' WHERE a_val = 'noise';

select * from try_it;
+------------+---------------+---------------+-------------+
| try_it.id  | try_it.a_val  | try_it.b_val  | try_it.prt  |
+------------+---------------+---------------+-------------+
| 1          | noise         | bogus2        | p1          |
| 2          | noise         | bogus2        | p2          |
| 3          | noise         | bogus2        | p3          |
+------------+---------------+---------------+-------------+
3 rows selected (0.192 seconds)

Verify that all three partitions are modified by each having delete_delta_ and delta_ directories.

$ hdfs dfs -ls -R /wa/t/m/h/try_it
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        733 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000/bucket_00000
    ... delta_0000001_0000001_ DELETED FOR BREVITY ...
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        816 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        727 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000/bucket_00000
    ... delta_0000002_0000002_ DELETED FOR BREVITY ...
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        816 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        727 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000/bucket_00000
    ... delta_0000002_0000002_ DELETED FOR BREVITY ...
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        807 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000/bucket_00000

Let’s just focus on p2’s files as the other two partitions are basically the same thing.

First, look at the delete file, which shows a "currentTransaction":3 event that deletes the row originally written as "originalTransaction":2 back in transaction #2.

$ hdfs dfs -get /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000/bucket_00000 updateAllPartitionsExample-delete_delta
$ java -jar orc-tools-1.5.1-uber.jar data updateAllPartitionsExample-delete_delta 
Processing data file updateAllPartitionsExample-delete_delta [length: 727]
{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":3,
"row":null}

The delta file then shows a new "currentTransaction":3 record, which is the record that was just deleted as it looks after the update statement’s modifications.

$ hdfs dfs -get  /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000/bucket_00000 updateAllPartitionsExample-delta
$ java -jar orc-tools-1.5.1-uber.jar data updateAllPartitionsExample-delta 
Processing data file updateAllPartitionsExample-delta [length: 816]
{"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":3,
"row":{"id":2,"a_val":"noise","b_val":"bogus2"}}

Txn 4: UPDATE Single Row (Leveraging Partitioning)

This use case simply calls out that we should use the partition column in the update statement as much as possible to make Hive’s job much easier, since it then only looks in the folders that can possibly be affected instead of walking the full table’s contents.

UPDATE try_it SET b_val = 'bogus3' WHERE b_val = 'bogus2' AND prt = 'p2';

+------------+---------------+---------------+-------------+
| try_it.id  | try_it.a_val  | try_it.b_val  | try_it.prt  |
+------------+---------------+---------------+-------------+
| 1          | noise         | bogus2        | p1          |
| 2          | noise         | bogus3        | p2          |
| 3          | noise         | bogus2        | p3          |
+------------+---------------+---------------+-------------+
3 rows selected (0.201 seconds)

In this example, without the partition condition we would have updated all three partitions again. Make sure only the p2 partition shows delete_delta_0000004_0000004_ and delta_0000004_0000004_ folders.

$ hdfs dfs -ls -R /wa/t/m/h/try_it
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        733 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        788 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        816 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:45 /wa/t/m/h/try_it/prt=p2
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        727 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:45 /wa/t/m/h/try_it/prt=p2/delete_delta_0000004_0000004_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delete_delta_0000004_0000004_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        733 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delete_delta_0000004_0000004_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        788 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        816 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:45 /wa/t/m/h/try_it/prt=p2/delta_0000004_0000004_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delta_0000004_0000004_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        815 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delta_0000004_0000004_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        727 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        796 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        807 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000/bucket_00000

There is no need to look at the ORC files for this change as it is similar to what we saw in transaction #3.

Txn 5 & 6: UPDATE Single Row to Change Partition

A student in a recent class asked if it would be possible to run an update statement that changed the partition virtual column value. I wasn’t sure and could argue it both ways. So, like most things in Hadoop the best way to answer a question like this is to just TRY IT!

UPDATE try_it SET prt = 'p3' WHERE a_val = 'noise' AND prt = 'p1';

Error: Error while compiling statement: FAILED: SemanticException [Error 10292]: 
  Updating values of partition columns is not supported (state=42000,code=10292)

Well, as you can see, NO JOY!! We could just run two separate transactions, one to delete the row and one to add it, but then we no longer have a single atomic transaction, and the responsibility for data integrity falls to the application, or person, to make sure both are committed.

DELETE FROM try_it WHERE a_val = 'noise' AND prt = 'p1';

INSERT INTO try_it VALUES (1,'noise','bogus2','p3');

select * from try_it;
+------------+---------------+---------------+-------------+
| try_it.id  | try_it.a_val  | try_it.b_val  | try_it.prt  |
+------------+---------------+---------------+-------------+
| 2          | noise         | bogus3        | p2          |
| 3          | noise         | bogus2        | p3          |
| 1          | noise         | bogus2        | p3          |
+------------+---------------+---------------+-------------+
3 rows selected (0.227 seconds)

Here are the applicable HDFS details for these two transactions.

drwxrwx---+  - hive hadoop          0 2019-12-12 09:57 /wa/t/m/h/try_it/prt=p1/delete_delta_0000005_0000005_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 09:57 /wa/t/m/h/try_it/prt=p1/delete_delta_0000005_0000005_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        733 2019-12-12 09:57 /wa/t/m/h/try_it/prt=p1/delete_delta_0000005_0000005_0000/bucket_00000
drwxrwx---+  - hive hadoop          0 2019-12-12 10:06 /wa/t/m/h/try_it/prt=p3/delta_0000006_0000006_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 10:06 /wa/t/m/h/try_it/prt=p3/delta_0000006_0000006_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        813 2019-12-12 10:06 /wa/t/m/h/try_it/prt=p3/delta_0000006_0000006_0000/bucket_00000

Parting Thoughts

These ACID transaction capabilities were mentioned way back in my presenting at hadoop summit (archiving evolving databases in hive) talk a few years ago, and they provide a much cleaner development-time way to address the need for mutable data in Hive’s (previously) immutable world.

A concern that surfaces is that, as described in small files and hadoop's hdfs (bonus: an inode formula), small files cause not just “Namenode pressure”, but also force processing to read potentially lots and lots of small(ish) delta files instead of a few big(ger) “base” files. To make it worse, this work cannot be done in parallel for a given partition because these edit files must be walked in order for Hive to present an accurate representation of what the end results should be.

Some of these concerns are addressed by the hive delta file compaction (minor and major) processing that exists, and another big helper is to leverage the MERGE command to lump many changes into a single transaction. Both of these topics will be presented in future blog posts.
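
To give a feel for the MERGE route, here is a minimal sketch that folds a batch of changes from a hypothetical staging table (try_it_changes, with the same columns as try_it) into the target in a single transaction; treat it as illustrative rather than a drop-in statement.

MERGE INTO try_it AS t
USING try_it_changes AS c
ON t.id = c.id AND t.prt = c.prt
WHEN MATCHED THEN UPDATE SET b_val = c.b_val
WHEN NOT MATCHED THEN INSERT VALUES (c.id, c.a_val, c.b_val, c.prt);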

The Apache ORC file format has been used heavily by Apache Hive for many years now, but since it is very much a binary file format, there just isn’t much we can do with basic tools to see the contents of these files, as shown below.

$ cat orcfile 
ORC
P1

       ???>P>??be!Q%~.ע?d!?????T	?;


DoeSmith(P4??be!%..&wG!??       ?
                                 ??'LesterEricJohnSusie	FdEBR	F6PDoeMartinSmithGATXOKMA???
??]?M?Ku??????9?sT?#?ްͲ㖆O:^xh?>??FWe?Pve??桿F?Ӳ?LuS????b?`	`??`???/p?_?]C?8???kQf?kpiqf??PB?K
                  (???쒟
X?X8X?9X?89.?   Ź?????B"$?b4?`X?$???,??(???????#?????"Ŝ??"Ś????*Ś??KKR??8????
             ????b??a%???????Z??,?\Z??*????J?q1???s3K2$4??rVB@q..&wG!?? ???????"
                                                                               (^0??ORC

Fortunately, the ORC project has a couple of options for CLI tools. For this posting, I settled on the Java Tools. Now, you could be a good citizen and build these yourself from source, but I (the lazy programmer that I am) decided to just download a compiled “uber jar” file.

First, I needed to figure out which version of ORC I was using. I am currently using HDP 3.1.0 and I took a peek into the Hive lib folder.

$ ls /usr/hdp/current/hive-client/lib/orc*
/usr/hdp/current/hive-client/lib/orc-core-1.5.1.3.1.0.0-78.jar
/usr/hdp/current/hive-client/lib/orc-shims-1.5.1.3.1.0.0-78.jar

The HDP jar file naming convention let me know I was using ORC 1.5.1, so I surfed over to http://repo1.maven.org/maven2/org/apache/orc/orc-tools/1.5.1/ and then pulled down the appropriate file.

wget https://repo1.maven.org/maven2/org/apache/orc/orc-tools/1.5.1/orc-tools-1.5.1-uber.jar

Now, I’m ready to use the tools, but… I realized I didn’t have an ORC file to test it out with, so I decided I would use Apache Pig to build a small file. I first created a simple CSV file with vi and then pushed it to HDFS. The contents of the file are as follows.

$ hdfs dfs -cat pig/customers.csv
1001,Lester,Martin,GA
1002,Eric,Martin,TX
1003,John,Doe,OK
1004,Susie,Smith,MA

I then wrote a little read-and-write conversion script and executed it.

$ cat createORC.pig 
custs = LOAD 'pig/customers.csv' USING PigStorage(',')
          AS (cid:int, fname:chararray, lname:chararray, state:chararray);
STORE custs INTO 'orcfile' USING OrcStorage;
$ pig createORC.pig

As expected, it created a simple little ORC file which I pulled down to my linux home directory.

$ hdfs dfs -ls orcfile
Found 2 items
-rw-r--r--   3 zeppelin hdfs          0 2019-12-12 08:35 orcfile/_SUCCESS
-rw-r--r--   3 zeppelin hdfs        569 2019-12-12 08:35 orcfile/part-v000-o000-r-00000
$ hdfs dfs -get orcfile/part-v000-o000-r-00000 orcfile
$ ls -l orcf*
-rw-r--r--. 1 zeppelin hadoop 569 Dec 12 08:36 orcfile

NOW, we can finally try out the ORC Tools jar. First up, we can look at the metadata of this file.

$ java -jar orc-tools-1.5.1-uber.jar meta orcfile
Processing data file orcfile [length: 569]
Structure for orcfile
File Version: 0.12 with ORC_135
Rows: 4
Compression: ZLIB
Compression size: 262144
Type: struct<cid:int,fname:string,lname:string,state:string>

**** REMOVED CONTENT FROM THESE SECTIONS FOR BREVITY ****
Stripe Statistics:
File Statistics:
Stripes:

File length: 569 bytes
Padding length: 0 bytes
Padding ratio: 0%

That had some interesting info (and I definitely deleted a bunch of it to not be so verbose), but what this post is really about is showing the contents of the file, so we just switch the subcommand.

$ java -jar orc-tools-1.5.1-uber.jar data orcfile
Processing data file orcfile [length: 569]
{"cid":1001,"fname":"Lester","lname":"Martin","state":"GA"}
{"cid":1002,"fname":"Eric","lname":"Martin","state":"TX"}
{"cid":1003,"fname":"John","lname":"Doe","state":"OK"}
{"cid":1004,"fname":"Susie","lname":"Smith","state":"MA"}

Perfect, we can see the four rows represented as JSON documents, which is so much easier to read than the raw bytes we started out with originally.