hive delta file compaction (minor and major)

This is a quick blog post to show how minor and major compaction for Hive transactional tables occurs. Let’s use the situation that the https://martin.atlassian.net/wiki/spaces/lestermartin/blog/2019/12/22/1405059073 post leaves us in. Here it is!

$ hdfs dfs -ls -R /wa/t/m/h/try_it
drwxrwx---+  - hive hadoop    0 2019-12-12 09:57 /wa/t/m/h/try_it/prt=p1
drwxrwx---+  - hive hadoop    0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop    1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop  733 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop    0 2019-12-12 09:57 /wa/t/m/h/try_it/prt=p1/delete_delta_0000005_0000005_0000
-rw-rw----+  3 hive hadoop    1 2019-12-12 09:57 /wa/t/m/h/try_it/prt=p1/delete_delta_0000005_0000005_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop  733 2019-12-12 09:57 /wa/t/m/h/try_it/prt=p1/delete_delta_0000005_0000005_0000/bucket_00000
drwxrwx---+  - hive hadoop    0 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000
-rw-rw----+  3 hive hadoop    1 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop  788 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/bucket_00000
drwxrwx---+  - hive hadoop    0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop    1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop  816 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop    0 2019-12-12 09:45 /wa/t/m/h/try_it/prt=p2
drwxrwx---+  - hive hadoop    0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop    1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop  727 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop    0 2019-12-12 09:45 /wa/t/m/h/try_it/prt=p2/delete_delta_0000004_0000004_0000
-rw-rw----+  3 hive hadoop    1 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delete_delta_0000004_0000004_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop  733 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delete_delta_0000004_0000004_0000/bucket_00000
drwxrwx---+  - hive hadoop    0 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000
-rw-rw----+  3 hive hadoop    1 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop  788 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000/bucket_00000
drwxrwx---+  - hive hadoop    0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop    1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop  816 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop    0 2019-12-12 09:45 /wa/t/m/h/try_it/prt=p2/delta_0000004_0000004_0000
-rw-rw----+  3 hive hadoop    1 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delta_0000004_0000004_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop  815 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delta_0000004_0000004_0000/bucket_00000
drwxrwx---+  - hive hadoop    0 2019-12-12 10:06 /wa/t/m/h/try_it/prt=p3
drwxrwx---+  - hive hadoop    0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop    1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop  727 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop    0 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000
-rw-rw----+  3 hive hadoop    1 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop  796 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000/bucket_00000
drwxrwx---+  - hive hadoop    0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000
-rw-rw----+  3 hive hadoop    1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop  807 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000/bucket_00000
drwxrwx---+  - hive hadoop    0 2019-12-12 10:06 /wa/t/m/h/try_it/prt=p3/delta_0000006_0000006_0000
-rw-rw----+  3 hive hadoop    1 2019-12-12 10:06 /wa/t/m/h/try_it/prt=p3/delta_0000006_0000006_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop  813 2019-12-12 10:06 /wa/t/m/h/try_it/prt=p3/delta_0000006_0000006_0000/bucket_00000

As that blog post, and the directory listing above, show, there were a total of six ACID transactions across three partitions to get to this point. The table looks like the following.

select * from try_it;
+------------+---------------+---------------+-------------+
| try_it.id  | try_it.a_val  | try_it.b_val  | try_it.prt  |
+------------+---------------+---------------+-------------+
| 2          | noise         | bogus3        | p2          |
| 3          | noise         | bogus2        | p3          |
| 1          | noise         | bogus2        | p3          |
+------------+---------------+---------------+-------------+
3 rows selected (0.307 seconds)

As https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Compactor shows, we need to make sure hive.compactor.initiator.on is set to true for the compactor to run.
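For reference, this property belongs in the metastore's hive-site.xml; a minimal sketch (hive.compactor.worker.threads also has to be greater than zero somewhere in the cluster, or queued compactions are never picked up):

<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>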

Minor Compaction

This type of compaction is scheduled automatically once the number of delta directories exceeds the value set in the hive.compactor.delta.num.threshold property (default 10), but you can also trigger it to run on-demand.

ALTER TABLE try_it COMPACT 'minor';

ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. You must specify a partition to compact for partitioned tables

This error makes a helpful point: for a partitioned table, compaction must be requested on a specific partition. Let's try it again at the partition level, as sketched below, and be sure to wait until the compaction completes!
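Presumably something like the following, using SHOW COMPACTIONS to watch the request progress from initiated through working to ready for cleaning:

ALTER TABLE try_it PARTITION (prt='p1') COMPACT 'minor';
SHOW COMPACTIONS;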

Let’s look at the file system again for the p1 partition.
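The listing itself is omitted here, but given compaction's delta_<minTxn>_<maxTxn> directory-naming convention, once the cleaner has removed the obsolete single-transaction directories it should look roughly like this (permissions, sizes, and timestamps elided):

$ hdfs dfs -ls /wa/t/m/h/try_it/prt=p1
/wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005
/wa/t/m/h/try_it/prt=p1/delta_0000001_0000005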

Since this partition was changed by transaction #s 1, 3, and 5, we now see the delta directories rolled together into single directories spanning those transaction #s. Let's verify that the compacted delta and delete_delta directories each hold the rolled-up details in a single file.
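One way to inspect those contents is the ORC file dump utility, which can print each row, including ACID metadata columns such as currentTransaction, as JSON; a sketch, assuming the compacted directory names shown above:

$ hive --orcfiledump -d /wa/t/m/h/try_it/prt=p1/delta_0000001_0000005/bucket_00000
$ hive --orcfiledump -d /wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005/bucket_00000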

Even though the minor-compacted delete_delta file only shows transaction IDs 3 and 5 (see the currentTransaction attributes) and the delta file only includes 1 and 3, both compacted directories carry the comprehensive width of 0000001_0000005 in their names.

Now, initiate minor compaction for the other two partitions, as shown below.
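Presumably the same partition-level statement, aimed at p2 and p3:

ALTER TABLE try_it PARTITION (prt='p2') COMPACT 'minor';
ALTER TABLE try_it PARTITION (prt='p3') COMPACT 'minor';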

Once those two compaction processes finish, the table's underlying filesystem footprint should look like the following.
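Again the exact listing is elided, but from the naming convention and the transactions that touched each partition (1, 3, and 5 for p1; 2, 3, and 4 for p2; 2, 3, and 6 for p3), the post-cleaner layout should be roughly:

$ hdfs dfs -ls -R /wa/t/m/h/try_it
/wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005
/wa/t/m/h/try_it/prt=p1/delta_0000001_0000005
/wa/t/m/h/try_it/prt=p2/delete_delta_0000002_0000004
/wa/t/m/h/try_it/prt=p2/delta_0000002_0000004
/wa/t/m/h/try_it/prt=p3/delete_delta_0000002_0000006
/wa/t/m/h/try_it/prt=p3/delta_0000002_0000006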

Major Compaction

This type of compaction is scheduled based on the value set in the hive.compactor.delta.pct.threshold property, whose formal definition expresses it best:

Percentage (fractional) size of the delta files relative to the base that will trigger a major compaction. (1.0 = 100%, so the default 0.1 = 10%.)

And, of course, you can also trigger it to run on-demand as shown below; again, let these processes finish before checking the filesystem.
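Presumably the major-compaction equivalents of the earlier statements:

ALTER TABLE try_it PARTITION (prt='p1') COMPACT 'major';
ALTER TABLE try_it PARTITION (prt='p2') COMPACT 'major';
ALTER TABLE try_it PARTITION (prt='p3') COMPACT 'major';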

This will build a single “base” directory, holding one consolidated bucket file, for each partition.
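By the base_<highestTxn> naming convention, and with the highest transaction per partition being 5, 4, and 6 respectively, the layout should look roughly like this (the small _orc_acid_version marker files elided):

$ hdfs dfs -ls -R /wa/t/m/h/try_it
/wa/t/m/h/try_it/prt=p1/base_0000005
/wa/t/m/h/try_it/prt=p1/base_0000005/bucket_00000
/wa/t/m/h/try_it/prt=p2/base_0000004
/wa/t/m/h/try_it/prt=p2/base_0000004/bucket_00000
/wa/t/m/h/try_it/prt=p3/base_0000006
/wa/t/m/h/try_it/prt=p3/base_0000006/bucket_00000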

Looking at p1’s data, we make sure there are no rows left in this partition.

We can see that p2 has a single row.

And lastly, we verify that p3’s data contains two rows.
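One way to run all three checks straight from SQL, given the table contents shown earlier (the comments state the expectations, not captured output):

select count(*) from try_it where prt = 'p1';  -- expect 0: both of p1's rows were deleted
select * from try_it where prt = 'p2';         -- expect the single row with id 2
select * from try_it where prt = 'p3';         -- expect the two rows with ids 3 and 1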

Again, with the table now fully compacted into base files, its contents look like the following.
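The rows should be identical to the select * output shown at the top of this post:

select * from try_it;
+------------+---------------+---------------+-------------+
| try_it.id  | try_it.a_val  | try_it.b_val  | try_it.prt  |
+------------+---------------+---------------+-------------+
| 2          | noise         | bogus3        | p2          |
| 3          | noise         | bogus2        | p3          |
| 1          | noise         | bogus2        | p3          |
+------------+---------------+---------------+-------------+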