Ever since Hive transactions first surfaced, and especially since the release of Apache Hive 3, I've been meaning to capture a behind-the-scenes look at the underlying ORC files that get created (and, yes, compacted). If you are new to Hive's ACID transactions, the first link in this post and the Understanding Hive ACID Transaction Table blog post are both great places to start.
Transactional Table DDL
Let’s create a transactional table to test our use cases out on.
CREATE TABLE try_it (id int, a_val string, b_val string)
  PARTITIONED BY (prt string)
  STORED AS ORC;

desc try_it;
+--------------------------+------------+----------+
|         col_name         | data_type  | comment  |
+--------------------------+------------+----------+
| id                       | int        |          |
| a_val                    | string     |          |
| b_val                    | string     |          |
| prt                      | string     |          |
|                          | NULL       | NULL     |
| # Partition Information  | NULL       | NULL     |
| # col_name               | data_type  | comment  |
| prt                      | string     |          |
+--------------------------+------------+----------+
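Note that the DDL above does not declare any transactional properties. In Hive 3, managed ORC tables are created as full ACID transactional tables by default, which is why no explicit TBLPROPERTIES clause is needed. You can confirm this by inspecting the table's parameters (the exact layout of the output varies by Hive version; the parameter values shown in the comments below are what you should expect to find, not captured output):

desc formatted try_it;
-- In the Table Parameters section, look for:
--   transactional              true
--   transactional_properties   default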
Check to make sure the HDFS file structure was created.
hdfs dfs -ls /warehouse/tablespace/managed/hive/
drwxrwx---+  - hive hadoop  0 2019-12-12 07:38 /w/t/m/h/try_it
The /warehouse/tablespace/managed/hive/ path is abbreviated as /w/t/m/h/ in the snippet above and throughout the remainder of this blog post.
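With the empty table in place, a single insert is enough to make the ACID file layout appear. Transactional writes land in a delta_<writeId>_<writeId> subdirectory under the partition rather than as bare ORC files; the write ID in the directory name depends on the table's transaction history, so the name shown in the comment below is illustrative, not captured output:

INSERT INTO try_it PARTITION (prt='p1') VALUES (1, 'a', 'b');

hdfs dfs -ls -R /warehouse/tablespace/managed/hive/try_it
-- Expect a structure along these lines (write ID illustrative):
--   /w/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/bucket_00000

Each bucket_NNNNN file inside the delta directory is an ORC file carrying the ACID metadata columns, which is exactly what the rest of this post digs into.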