...
You can see the single row we added to the p1 partition is present. You’ll also notice it is annotated as coming from transaction #1, as indicated by "currentTransaction":1.
Txn 2: INSERT Multiple Rows Across Multiple Partitions
...
You can now see that both the p2 and p3 record additions are linked to "currentTransaction":2. To reiterate, transaction #2 spanned multiple partitions, and each partition’s delta folder/file was properly aligned with the same transaction.
...
Let’s start off with the SQL.
UPDATE try_it SET b_val = 'bogus2' WHERE a_val = 'noise';
select * from try_it;
+------------+---------------+---------------+-------------+
| try_it.id | try_it.a_val | try_it.b_val | try_it.prt |
+------------+---------------+---------------+-------------+
| 1 | noise | bogus2 | p1 |
| 2 | noise | bogus2 | p2 |
| 3 | noise | bogus2 | p3 |
+------------+---------------+---------------+-------------+
3 rows selected (0.192 seconds)
Verify that all three partitions were modified, each now containing both delete_delta_ and delta_ directories.
$ hdfs dfs -ls -R /wa/t/m/h/try_it
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 733 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000/bucket_00000
... delta_0000001_0000001_ ELIDED FOR BREVITY ...
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 816 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 727 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000/bucket_00000
... delta_0000002_0000002_ ELIDED FOR BREVITY ...
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 816 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 727 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000/bucket_00000
... delta_0000002_0000002_ ELIDED FOR BREVITY ...
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 807 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000/bucket_00000
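Each delta directory name in the listing above encodes the transaction range and statement that produced it: delta_<minTxn>_<maxTxn>_<statementId>, with a delete_delta_ prefix for delete events. The following is an illustrative Python helper for pulling those fields apart; it is a sketch for reading the listings, not part of Hive:

```python
import re

# Matches names like delta_0000003_0000003_0000 or
# delete_delta_0000003_0000003_0000 (statement id is optional
# in older layouts, so it is captured as an optional group).
_DELTA_RE = re.compile(r'(delete_delta|delta)_(\d+)_(\d+)(?:_(\d+))?$')

def parse_delta_dir(name):
    """Split an ACID delta directory name into its components."""
    m = _DELTA_RE.match(name)
    if not m:
        return None
    kind, min_txn, max_txn, stmt = m.groups()
    return {
        "kind": kind,
        "min_txn": int(min_txn),
        "max_txn": int(max_txn),
        "statement_id": int(stmt) if stmt is not None else None,
    }

print(parse_delta_dir("delete_delta_0000003_0000003_0000"))
# {'kind': 'delete_delta', 'min_txn': 3, 'max_txn': 3, 'statement_id': 0}
```

Applied to the listing above, every directory written by the UPDATE parses to min_txn == max_txn == 3, matching transaction #3.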
Let’s focus on p2’s files, as the other two partitions look essentially the same.
First, look at the delete file, which shows a "currentTransaction":3 event deleting the row with "originalTransaction":2 that was created earlier in transaction #2.
$ hdfs dfs -get /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000/bucket_00000 updateAllPartitionsExample-delete_delta
$ java -jar orc-tools-1.5.1-uber.jar data updateAllPartitionsExample-delete_delta
Processing data file updateAllPartitionsExample-delete_delta [length: 727]
{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":null}
The delta file then shows a new "currentTransaction":3 record containing the updated values of the row that was just deleted, i.e. the result of applying the UPDATE statement’s SET clause.
$ hdfs dfs -get /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000/bucket_00000 updateAllPartitionsExample-delta
$ java -jar orc-tools-1.5.1-uber.jar data updateAllPartitionsExample-delta
Processing data file updateAllPartitionsExample-delta [length: 816]
{"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":{"id":2,"a_val":"noise","b_val":"bogus2"}}
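Putting the two files together: a reader reconstructs the current rows by suppressing any insert event whose (originalTransaction, bucket, rowId) key appears in a delete event. A minimal Python sketch of that merge, using the p2 events shown above (an illustration of the read-side logic, not Hive's actual implementation):

```python
# Insert events from p2's delta files: the original txn #2 row
# and the updated row written by txn #3.
inserts = [
    {"originalTransaction": 2, "bucket": 536870912, "rowId": 0,
     "row": {"id": 2, "a_val": "noise", "b_val": "bogus"}},
    {"originalTransaction": 3, "bucket": 536870912, "rowId": 0,
     "row": {"id": 2, "a_val": "noise", "b_val": "bogus2"}},
]
# Delete event from p2's delete_delta file (txn #3 deleting txn #2's row).
deletes = [
    {"originalTransaction": 2, "bucket": 536870912, "rowId": 0},
]

# A row is visible only if its identity key has no matching delete event.
deleted = {(d["originalTransaction"], d["bucket"], d["rowId"]) for d in deletes}
visible = [e["row"] for e in inserts
           if (e["originalTransaction"], e["bucket"], e["rowId"]) not in deleted]

print(visible)  # [{'id': 2, 'a_val': 'noise', 'b_val': 'bogus2'}]
```

The surviving row is exactly what the SELECT returned for p2 after the UPDATE.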
Txn 4: UPDATE Single Row (Leveraging Partitioning)
This use case calls out that we should include the partition column in the UPDATE statement’s WHERE clause whenever possible. This makes Hive’s job much easier: it only looks in the folders that can possibly be affected instead of walking the full table’s contents.
UPDATE try_it SET b_val = 'bogus3' WHERE b_val = 'bogus2' AND prt = 'p2';
+------------+---------------+---------------+-------------+
| try_it.id | try_it.a_val | try_it.b_val | try_it.prt |
+------------+---------------+---------------+-------------+
| 1 | noise | bogus2 | p1 |
| 2 | noise | bogus3 | p2 |
| 3 | noise | bogus2 | p3 |
+------------+---------------+---------------+-------------+
3 rows selected (0.201 seconds)
In this example, without the partition condition we would have updated all three partitions again. Verify that only the p2 partition shows new delete_delta_0000004_0000004_ and delta_0000004_0000004_ folders.
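The pruning behavior can be sketched as a simple filter over the table’s partition directories; only the surviving directories are read and receive new delta folders. This is a hypothetical illustration, not a Hive API:

```python
# Partition directories of the try_it table, as seen in the HDFS listings.
partitions = ["prt=p1", "prt=p2", "prt=p3"]

def partitions_to_scan(predicate_prt=None):
    """Return the partition directories a statement must touch.

    With no partition predicate, every directory must be read; with
    one (e.g. prt = 'p2'), only the matching directory is touched."""
    if predicate_prt is None:
        return list(partitions)
    return [p for p in partitions if p == "prt=" + predicate_prt]

print(partitions_to_scan("p2"))  # ['prt=p2']
print(partitions_to_scan())      # ['prt=p1', 'prt=p2', 'prt=p3']
```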
$ hdfs dfs -ls -R /wa/t/m/h/try_it
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 733 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delete_delta_0000003_0000003_0000/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 788 2019-12-12 07:43 /wa/t/m/h/try_it/prt=p1/delta_0000001_0000001_0000/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 816 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p1/delta_0000003_0000003_0000/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 09:45 /wa/t/m/h/try_it/prt=p2
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 727 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delete_delta_0000003_0000003_0000/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 09:45 /wa/t/m/h/try_it/prt=p2/delete_delta_0000004_0000004_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delete_delta_0000004_0000004_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 733 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delete_delta_0000004_0000004_0000/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 788 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p2/delta_0000002_0000002_0000/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 816 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p2/delta_0000003_0000003_0000/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 09:45 /wa/t/m/h/try_it/prt=p2/delta_0000004_0000004_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delta_0000004_0000004_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 815 2019-12-12 09:44 /wa/t/m/h/try_it/prt=p2/delta_0000004_0000004_0000/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 727 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delete_delta_0000003_0000003_0000/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 796 2019-12-12 08:15 /wa/t/m/h/try_it/prt=p3/delta_0000002_0000002_0000/bucket_00000
drwxrwx---+ - hive hadoop 0 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000
-rw-rw----+ 3 hive hadoop 1 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000/_orc_acid_version
-rw-rw----+ 3 hive hadoop 807 2019-12-12 09:34 /wa/t/m/h/try_it/prt=p3/delta_0000003_0000003_0000/bucket_00000
There is no need to look at the ORC files for this change as it is similar to what we saw in transaction #3.