...
As that blog post, and the directory listing above, show, a total of six ACID transactions have occurred across three partitions to get to this point.
As https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Compactor shows, we need to make sure hive.compactor.initiator.on is set to true for the compactor to run.
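If you manage that property in hive-site.xml directly, the entry might look like the following sketch; the companion hive.compactor.worker.threads setting is shown alongside it, and the values here are illustrative rather than recommendations.

```xml
<!-- Illustrative fragment, not a complete hive-site.xml -->
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
```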
Minor Compaction
This type of compaction is scheduled after the number of delta directories passes the value set in the hive.compactor.delta.num.threshold property, but you can also trigger it to run on-demand.
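As a rough mental model of that scheduling decision (a sketch, not Hive's actual Initiator code), the check amounts to counting a partition's delta/delete_delta directories and comparing against the threshold; the directory names and the default of 10 below are assumptions for illustration.

```python
import re

# delta_<min>_<max> and delete_delta_<min>_<max> directory names, as seen
# in the HDFS listings elsewhere in this walkthrough.
DELTA_DIR = re.compile(r"(delete_)?delta_\d{7}_\d{7}")

def needs_minor_compaction(dir_names, delta_num_threshold=10):
    """True once the partition holds more delta dirs than the threshold."""
    deltas = [d for d in dir_names if DELTA_DIR.fullmatch(d)]
    return len(deltas) > delta_num_threshold

# Twelve single-transaction deltas plus a base: over the default threshold.
listing = [f"delta_{i:07d}_{i:07d}" for i in range(1, 13)] + ["base_0000000"]
print(needs_minor_compaction(listing))  # True
```

The real initiator also weighs other conditions (open transactions, prior failed compactions), so treat this purely as intuition for the threshold property.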
```sql
ALTER TABLE try_it COMPACT 'minor';
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
You must specify a partition to compact for partitioned tables
```
This error makes the point that we must run compaction on a specific partition unless the table is unpartitioned. Let’s try it again, and be sure to wait until it completes!
...
...
The table looks like the following.
```sql
select * from try_it;
+------------+---------------+---------------+-------------+
| try_it.id | try_it.a_val | try_it.b_val | try_it.prt |
+------------+---------------+---------------+-------------+
| 2 | noise | bogus3 | p2 |
| 3 | noise | bogus2 | p3 |
| 1 | noise | bogus2 | p3 |
+------------+---------------+---------------+-------------+
3 rows selected (0.307 seconds)
```
```sql
ALTER TABLE try_it partition (prt='p1') COMPACT 'minor';
show compactions;
+---------------+-----------+----------+------------+--------+------------+-----------+----------------+---------------+-------------------------+
| compactionid | dbname | tabname | partname | type | state | workerid | starttime | duration | hadoopjobid |
+---------------+-----------+----------+------------+--------+------------+-----------+----------------+---------------+-------------------------+
| CompactionId | Database | Table | Partition | Type | State | Worker | Start Time | Duration(ms) | HadoopJobId |
| 1 | default | try_it | prt=p1 | MINOR | succeeded | --- | 1576145642000 | 179000 | job_1575915931720_0012 |
+---------------+-----------+----------+------------+--------+------------+-----------+----------------+---------------+-------------------------+
2 rows selected (0.031 seconds)
```
Let’s look at the file system again for the p1 partition.
```bash
$ hdfs dfs -ls -R /wa/t/m/h/try_it/prt=p1
drwxrwx---+ - hive hadoop 0 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005
-rw-rw----+ 3 hive hadoop 1 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005/_orc_acid_version
-rw-rw----+ 3 hive hadoop 654 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000005
-rw-rw----+ 3 hive hadoop 1 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000005/_orc_acid_version
-rw-rw----+ 3 hive hadoop 670 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000005/bucket_00000
```
Since there were changes in this partition from transaction #s 1, 3, and 5, we now see rolled-together versions of the delta directories spanning these transaction #s. Let’s verify that the contents of the files have the rolled-up details in a single file each for the delta and delete_delta transactions.
```bash
$ hdfs dfs -get /wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005/bucket_00000 p1Minor-delete_delta
$ java -jar orc-tools-1.5.1-uber.jar data p1Minor-delete_delta
Processing data file p1Minor-delete_delta [length: 654]
{"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":null}
{"operation":2,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":5,"row":null}
$ hdfs dfs -get /wa/t/m/h/try_it/prt=p1/delta_0000001_0000005/bucket_00000 p1Minor-delta
$ java -jar orc-tools-1.5.1-uber.jar data p1Minor-delta
Processing data file p1Minor-delta [length: 670]
{"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":1,"row":{"id":1,"a_val":"noise","b_val":"bogus"}}
{"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":{"id":1,"a_val":"noise","b_val":"bogus2"}}
```
Even though the minor-compacted delete_delta file only shows transaction IDs 3 and 5 (see the currentTransaction attributes) and the delta only includes 1 and 3, both of these compacted files show the comprehensive width of 0000001_0000005.
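The naming convention is mechanical enough to sketch: delta_<min>_<max> (and delete_delta_<min>_<max>) encode the inclusive transaction range a directory covers, and a minor compaction of several deltas produces one directory spanning the lowest min to the highest max. A rough illustration, not Hive's actual AcidUtils code:

```python
import re

DELTA_RE = re.compile(r"(delete_delta|delta)_(\d{7})_(\d{7})$")

def merged_name(delta_dirs):
    """Name of the directory a minor compaction of these deltas would produce."""
    kinds, lows, highs = set(), [], []
    for d in delta_dirs:
        m = DELTA_RE.match(d)
        kinds.add(m.group(1))
        lows.append(int(m.group(2)))
        highs.append(int(m.group(3)))
    # Hive compacts delta and delete_delta families into separate directories.
    assert len(kinds) == 1, "merge deltas and delete_deltas separately"
    return f"{kinds.pop()}_{min(lows):07d}_{max(highs):07d}"

# p1's delete_deltas from transactions 1, 3, and 5 roll up into one span:
print(merged_name(["delete_delta_0000001_0000001",
                   "delete_delta_0000003_0000003",
                   "delete_delta_0000005_0000005"]))
# delete_delta_0000001_0000005
```

This is why the compacted directory advertises the full 0000001_0000005 range even though no single file inside records every transaction ID in that range.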
Now, initiate minor compaction for the other two partitions.
```sql
ALTER TABLE try_it partition (prt='p2') COMPACT 'minor';
ALTER TABLE try_it partition (prt='p3') COMPACT 'minor';
show compactions;
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20191212201440_3a58f8c2-ebc2-4ebe-96e3-a774bf8bec65); Time taken: 0.002 seconds
INFO : OK
+---------------+-----------+----------+------------+--------+------------+-------------------------------------------------+----------------+---------------+-------------------------+
| compactionid | dbname | tabname | partname | type | state | workerid | starttime | duration | hadoopjobid |
+---------------+-----------+----------+------------+--------+------------+-------------------------------------------------+----------------+---------------+-------------------------+
| CompactionId | Database | Table | Partition | Type | State | Worker | Start Time | Duration(ms) | HadoopJobId |
| 2 | default | try_it | prt=p2 | MINOR | working | ip-172-30-10-206.us-west-2.compute.internal-37 | 1576181672000 | --- | job_1575915931720_0013 |
| 3 | default | try_it | prt=p3 | MINOR | working | ip-172-30-10-206.us-west-2.compute.internal-36 | 1576181677000 | --- | job_1575915931720_0014 |
| 1 | default | try_it | prt=p1 | MINOR | succeeded | --- | 1576145642000 | 179000 | job_1575915931720_0012 |
+---------------+-----------+----------+------------+--------+------------+-------------------------------------------------+----------------+---------------+-------------------------+
4 rows selected (0.03 seconds)
```
Once those two compaction processes finish, the table’s underlying filesystem footprint should look like the following.
```bash
$ hdfs dfs -ls -R /wa/t/m/h/try_it
drwxrwx---+ - hive hadoop 0 2019-12-12 10:17 /wa/t/m/h/try_it/prt=p1
drwxrwx---+ - hive hadoop 0 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005
-rw-rw----+ 3 hive hadoop 1 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005/_orc_acid_version
-rw-rw----+ 3 hive hadoop 654 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delete_delta_0000001_0000005/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000005
-rw-rw----+ 3 hive hadoop 1 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000005/_orc_acid_version
-rw-rw----+ 3 hive hadoop 670 2019-12-12 10:16 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000005/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 20:14 /wa/t/m/h/try_it/prt=p2
drwxrwx---+ - hive hadoop 0 2019-12-12 20:14 /wa/t/m/h/try_it/prt=p2/delete_delta_0000002_0000004
-rw-rw----+ 3 hive hadoop 1 2019-12-12 20:14 /wa/t/m/h/try_it/prt=p2/delete_delta_0000002_0000004/_orc_acid_version
-rw-rw----+ 3 hive hadoop 654 2019-12-12 20:14 /wa/t/m/h/try_it/prt=p2/delete_delta_0000002_0000004/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 20:14 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000004
-rw-rw----+ 3 hive hadoop 1 2019-12-12 20:14 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000004/_orc_acid_version
-rw-rw----+ 3 hive hadoop 670 2019-12-12 20:14 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000004/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 20:15 /wa/t/m/h/try_it/prt=p3
drwxrwx---+ - hive hadoop 0 2019-12-12 20:15 /wa/t/m/h/try_it/prt=p3/delete_delta_0000002_0000006
-rw-rw----+ 3 hive hadoop 1 2019-12-12 20:15 /wa/t/m/h/try_it/prt=p3/delete_delta_0000002_0000006/_orc_acid_version
-rw-rw----+ 3 hive hadoop 650 2019-12-12 20:15 /wa/t/m/h/try_it/prt=p3/delete_delta_0000002_0000006/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 20:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000006
-rw-rw----+ 3 hive hadoop 1 2019-12-12 20:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000006/_orc_acid_version
-rw-rw----+ 3 hive hadoop 678 2019-12-12 20:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000006/bucket_00000
```
Major Compaction
This type of compaction is scheduled based on the value set in the hive.compactor.delta.pct.threshold property, whose formal definition expresses it best: “Percentage (fractional) size of the delta files relative to the base that will trigger a major compaction. (1.0 = 100%, so the default 0.1 = 10%.)” And, of course, you can also trigger it to run on-demand as shown below; again, let these processes finish before checking again.
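That definition reduces to simple arithmetic, sketched below as intuition rather than the actual Initiator logic: sum the delta file sizes, divide by the base size, and compare against the 10% default. The byte counts in the example are taken from the listings in this walkthrough but paired with a hypothetical base size.

```python
def needs_major_compaction(base_bytes, delta_bytes_list, pct_threshold=0.1):
    """True when the deltas outweigh pct_threshold of the base (default 10%)."""
    if base_bytes == 0:
        # No base yet: any delta data makes the ratio effectively infinite.
        return bool(delta_bytes_list)
    return sum(delta_bytes_list) / base_bytes > pct_threshold

# 670 + 654 bytes of deltas against a hypothetical 1 KB base is ~129%,
# far past the 10% default, so a major compaction would be scheduled.
print(needs_major_compaction(1024, [670, 654]))  # True
```

The takeaway is that small, frequent writes against a large base can sit below the threshold for a long time, which is why on-demand major compaction is sometimes useful.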
```sql
ALTER TABLE try_it partition (prt='p1') COMPACT 'major';
ALTER TABLE try_it partition (prt='p2') COMPACT 'major';
ALTER TABLE try_it partition (prt='p3') COMPACT 'major';
+---------------+-----------+----------+------------+--------+------------+-----------+----------------+---------------+-------------------------+
| compactionid | dbname | tabname | partname | type | state | workerid | starttime | duration | hadoopjobid |
+---------------+-----------+----------+------------+--------+------------+-----------+----------------+---------------+-------------------------+
| CompactionId | Database | Table | Partition | Type | State | Worker | Start Time | Duration(ms) | HadoopJobId |
| 1 | default | try_it | prt=p1 | MINOR | succeeded | --- | 1576145642000 | 179000 | job_1575915931720_0012 |
| 2 | default | try_it | prt=p2 | MINOR | succeeded | --- | 1576181672000 | 22000 | job_1575915931720_0013 |
| 3 | default | try_it | prt=p3 | MINOR | succeeded | --- | 1576181677000 | 37000 | job_1575915931720_0014 |
| 4 | default | try_it | prt=p1 | MAJOR | succeeded | --- | 1576183603000 | 31000 | job_1575915931720_0015 |
| 5 | default | try_it | prt=p2 | MAJOR | succeeded | --- | 1576187575000 | 50000 | job_1575915931720_0017 |
| 6 | default | try_it | prt=p3 | MAJOR | succeeded | --- | 1576187583000 | 57000 | job_1575915931720_0018 |
+---------------+-----------+----------+------------+--------+------------+-----------+----------------+---------------+-------------------------+
7 rows selected (0.029 seconds)
```
This will build a single “base” file for each partition.
```bash
$ hdfs dfs -ls -R /wa/t/m/h/try_it
drwxrwx---+ - hive hadoop 0 2019-12-12 20:47 /wa/t/m/h/try_it/prt=p1
drwxrwx---+ - hive hadoop 0 2019-12-12 20:46 /wa/t/m/h/try_it/prt=p1/base_0000005
-rw-rw----+ 3 hive hadoop 48 2019-12-12 20:46 /wa/t/m/h/try_it/prt=p1/base_0000005/_metadata_acid
-rw-rw----+ 3 hive hadoop 1 2019-12-12 20:46 /wa/t/m/h/try_it/prt=p1/base_0000005/_orc_acid_version
-rw-rw----+ 3 hive hadoop 228 2019-12-12 20:46 /wa/t/m/h/try_it/prt=p1/base_0000005/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p2
drwxrwx---+ - hive hadoop 0 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p2/base_0000004
-rw-rw----+ 3 hive hadoop 48 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p2/base_0000004/_metadata_acid
-rw-rw----+ 3 hive hadoop 1 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p2/base_0000004/_orc_acid_version
-rw-rw----+ 3 hive hadoop 815 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p2/base_0000004/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 21:54 /wa/t/m/h/try_it/prt=p3
drwxrwx---+ - hive hadoop 0 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p3/base_0000006
-rw-rw----+ 3 hive hadoop 48 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p3/base_0000006/_metadata_acid
-rw-rw----+ 3 hive hadoop 1 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p3/base_0000006/_orc_acid_version
-rw-rw----+ 3 hive hadoop 835 2019-12-12 21:53 /wa/t/m/h/try_it/prt=p3/base_0000006/bucket_00000
```
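A base directory is valuable because a reader can ignore everything it superseded: it takes the base with the highest transaction ID and only deltas whose range lies entirely above it. The sketch below illustrates that selection rule under simplifying assumptions; it is not Hive's actual AcidUtils.getAcidState implementation.

```python
import re

BASE_RE = re.compile(r"base_(\d{7})$")
DELTA_RE = re.compile(r"(?:delete_)?delta_(\d{7})_(\d{7})$")

def dirs_to_read(dir_names):
    """Pick the newest base plus any deltas entirely past it."""
    best_base, best_id = None, -1
    for d in dir_names:
        m = BASE_RE.match(d)
        if m and int(m.group(1)) > best_id:
            best_base, best_id = d, int(m.group(1))
    deltas = [d for d in dir_names
              if (m := DELTA_RE.match(d)) and int(m.group(1)) > best_id]
    return ([best_base] if best_base else []) + sorted(deltas)

# After major compaction of p1, base_0000005 makes the old delta pair
# obsolete for readers; a hypothetical later delta would still be read.
print(dirs_to_read(["base_0000005",
                    "delta_0000001_0000005",
                    "delete_delta_0000001_0000005",
                    "delta_0000006_0000006"]))
# ['base_0000005', 'delta_0000006_0000006']
```

The obsolete delta directories are removed later by the cleaner thread, which is why a listing taken immediately after compaction can briefly show both.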
Looking at p1’s data, we confirm there is no data left for this partition.
```bash
$ hdfs dfs -get /wa/t/m/h/try_it/prt=p1/base_0000005/bucket_00000 p1Base
$ java -jar orc-tools-1.5.1-uber.jar data p1Base
Processing data file p1Base [length: 228]
```
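p1 ends up empty because each delete_delta event names its victim by the (originalTransaction, bucket, rowId) triple, and a delta row survives only if no delete event targets that key; both inserted versions of id=1 were deleted. A simplified illustration of that reconciliation, using the p1 records dumped earlier (this is not Hive's actual ORC merge logic):

```python
def visible_rows(delta_events, delete_events):
    """Rows whose (originalTransaction, bucket, rowId) key was never deleted."""
    deleted = {(e["originalTransaction"], e["bucket"], e["rowId"])
               for e in delete_events}
    return [e["row"] for e in delta_events
            if (e["originalTransaction"], e["bucket"], e["rowId"]) not in deleted]

# p1's minor-compacted delta and delete_delta events from the dumps above:
deltas = [
    {"originalTransaction": 1, "bucket": 536870912, "rowId": 0,
     "row": {"id": 1, "a_val": "noise", "b_val": "bogus"}},
    {"originalTransaction": 3, "bucket": 536870912, "rowId": 0,
     "row": {"id": 1, "a_val": "noise", "b_val": "bogus2"}},
]
deletes = [
    {"originalTransaction": 1, "bucket": 536870912, "rowId": 0},
    {"originalTransaction": 3, "bucket": 536870912, "rowId": 0},
]
print(visible_rows(deltas, deletes))  # []
```

Every delta row has a matching delete event, so major compaction writes a base file with no rows for p1, which matches the 228-byte (header-only) bucket file above.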
We can see that p2 has a single row.
```bash
$ hdfs dfs -get /wa/t/m/h/try_it/prt=p2/base_0000004/bucket_00000 p2Base
$ java -jar orc-tools-1.5.1-uber.jar data p2Base
Processing data file p2Base [length: 815]
{"operation":0,"originalTransaction":4,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":{"id":2,"a_val":"noise","b_val":"bogus3"}}
```
And lastly, we verify that p3’s data contains two rows.
```bash
$ hdfs dfs -get /wa/t/m/h/try_it/prt=p3/base_0000006/bucket_00000 p3Base
$ java -jar orc-tools-1.5.1-uber.jar data p3Base
Processing data file p3Base [length: 835]
{"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":{"id":3,"a_val":"noise","b_val":"bogus2"}}
{"operation":0,"originalTransaction":6,"bucket":536870912,"rowId":0,"currentTransaction":6,"row":{"id":1,"a_val":"noise","b_val":"bogus2"}}
```
Again, the contents of the table, now fully compacted into base files, look like the following.
```sql
select * from try_it;
+------------+---------------+---------------+-------------+
| try_it.id | try_it.a_val | try_it.b_val | try_it.prt |
+------------+---------------+---------------+-------------+
| 2 | noise | bogus3 | p2 |
| 3 | noise | bogus2 | p3 |
| 1 | noise | bogus2 | p3 |
+------------+---------------+---------------+-------------+
3 rows selected (0.307 seconds)
```