Ever since Hive Transactions have surfaced, and especially since Apache Hive 3 was released, I’ve been meaning to capture a behind-the-scenes look at the underlying delta ORC files that are created; and yes, compacted. If you are new to Hive’s ACID transactions, then the first link in this post as well as the Understanding Hive ACID Transaction Table blog posting are great places to start.

Info
Bonus points to those who remember what ACID stands for – add a comment at the bottom of this posting if you know! If you don’t it might be time to review RDBMS fundamentals.

Table of Contents

Transactional Table DDL

Let’s create a transactional table with some Data Definition Language to test our use cases out on.

Code Block

breakoutMode	wide
language	sql

CREATE TABLE try_it (id int, a_val string, b_val string)
 PARTITIONED BY (prt string)  STORED AS ORC;
 
desc try_it;

+--------------------------+------------+----------+
|         col_name         | data_type  | comment  |
+--------------------------+------------+----------+
| id                       | int        |          |
| a_val                    | string     |          |
| b_val                    | string     |          |
| prt                      | string     |          |
|                          | NULL       | NULL     |
| # Partition Information  | NULL       | NULL     |
| # col_name               | data_type  | comment  |
| prt                      | string     |          |
+--------------------------+------------+----------+

Check to make sure the HDFS file structure was created.

Code Block

breakoutMode	wide
language	bash

hdfs dfs -ls /warehouse/tablespace/managed/hive/
drwxrwx---+  - hive hadoop          0 2019-12-12 07:38 /w/t/m/h/try_it

Info
The `/warehouse/tablespace/managed/hive/` path is abbrevited as `/w/t/m/h/` in the above snippet and in the remainder of this blog posting.

DML Use Cases

Let’s explore some CRUD (Create, Retrieve, Update, Delete) uses cases as expressed in Data Manipulation Language.

Txn 1: INSERT Single Row

Code Block

breakoutMode	wide
language	sql

INSERT INTO try_it VALUES (1, 'noise', 'bogus', 'p1');

select * from try_it;
+------------+---------------+---------------+-------------+
| try_it.id  | try_it.a_val  | try_it.b_val  | try_it.prt  |
+------------+---------------+---------------+-------------+
| 1          | noise         | bogus         | p1          |
+------------+---------------+---------------+-------------+
1 row selected (0.515 seconds)

Verify that the p1 partition now has a delta file and that it only includes changes belonging to transaction #1 (see the delta_0000001_0000001 indicator).

Code Block

breakoutMode	wide
language	bash

hdfs dfs -ls -R /wa/t/m/h/try_it
drwxrwx---+  - hive hadoop          0 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1
drwxrwx---+  - hive hadoop          0 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000
-rw-rw----+  3 hive hadoop          1 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        788 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/bucket_00000

Pull down this delta file and use the knowledge from viewing the content of ORC files (using the Java ORC tool jar) to inspect it.

Code Block

breakoutMode	wide
language	bash

hdfs dfs -get /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/bucket_00000 add1
ls -l add*
-rw-r--r--. 1 hive hadoop 788 Dec 12 08:09 add1

java -jar orc-tools-1.5.1-uber.jar data add1
Processing data file add1 [length: 788]
{"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":1,
"row":{"id":1,"a_val":"noise","b_val":"bogus"}}

You can see the single row we added into the p1 partition is present. You’ll also notice it is annotated as coming from transaction #1 as indicated by "originalTransaction":1.

Txn 2: INSERT Multiple Rows Across Multiple Partitions

Insert statements allow multiple rows to be added at once and they all belong to a single ACID transaction. This use case is to exercise that, but to make it a bit more fun we can span more than one partition.

Code Block

breakoutMode	wide
language	sql

INSERT INTO try_it VALUES 
(2, 'noise', 'bogus', 'p2'),
(3, 'noise', 'bogus', 'p3');

select * from try_it;
+------------+---------------+---------------+-------------+
| try_it.id  | try_it.a_val  | try_it.b_val  | try_it.prt  |
+------------+---------------+---------------+-------------+
| 1          | noise         | bogus         | p1          |
| 2          | noise         | bogus         | p2          |
| 3          | noise         | bogus         | p3          |
+------------+---------------+---------------+-------------+
3 rows selected (0.193 seconds)

Verify that

Versions Compared

Old Version 2

New Version 3

Key

Transactional Table DDL

DML Use Cases

Txn 1: INSERT Single Row

Txn 2: INSERT Multiple Rows Across Multiple Partitions

Page Comparison

Versions Compared

Old Version 2

New Version 3

Key

Transactional Table DDL

DML Use Cases

Txn 1: INSERT Single Row

Txn 2: INSERT Multiple Rows Across Multiple Partitions