...
```
hdfs dfs -ls /warehouse/tablespace/managed/hive/
drwxrwx---+  - hive hadoop  0 2019-12-12 07:38 /wwa/t/m/h/try_it
```
...
There is no need to look at the ORC files for this change as it is similar to what we saw in transaction #3.
Txn 5 & 6: UPDATE Single Row to Change Partition
A student in a recent class asked if it would be possible to run an UPDATE statement that changed the partition virtual column value. I wasn't sure and could argue it both ways. So, like most things in Hadoop, the best way to answer a question like this is to just TRY IT!
```
UPDATE try_it SET prt = 'p3' WHERE a_val = 'noise' AND prt = 'p1';
Error: Error while compiling statement: FAILED: SemanticException [Error 10292]:
Updating values of partition columns is not supported (state=42000,code=10292)
```
Well, as you can see, NO JOY!! We could, however, just run two separate transactions; one to delete the row and one to add it back in the new partition. But then we no longer have a single atomic transaction, and the responsibility for data integrity falls to the application, or person, to make sure both are committed.
```
DELETE FROM try_it WHERE a_val = 'noise' AND prt = 'p1';
INSERT INTO try_it VALUES (1,'noise','bogus2','p3');
select * from try_it;
+------------+---------------+---------------+-------------+
| try_it.id  | try_it.a_val  | try_it.b_val  | try_it.prt  |
+------------+---------------+---------------+-------------+
| 2          | noise         | bogus3        | p2          |
| 3          | noise         | bogus2        | p3          |
| 1          | noise         | bogus2        | p3          |
+------------+---------------+---------------+-------------+
3 rows selected (0.227 seconds)
```
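The integrity risk of splitting the change across two transactions can be sketched outside of Hive. The toy `Table` class below is purely illustrative (not a Hive API, and the row values are hypothetical); the point is that each statement commits on its own, so a failure landing between them loses the row:

```python
# Toy illustration (NOT a Hive API): each method commits independently,
# just like running DELETE and INSERT as two separate Hive transactions.
class Table:
    def __init__(self, rows):
        self.rows = list(rows)

    def delete(self, pred):           # "transaction" 1: commits on its own
        self.rows = [r for r in self.rows if not pred(r)]

    def insert(self, row):            # "transaction" 2: commits on its own
        self.rows.append(row)

t = Table([(1, "noise", "bogus", "p1"), (2, "noise", "bogus3", "p2")])
t.delete(lambda r: r[1] == "noise" and r[3] == "p1")
# If the process dies right here, the delete is already committed but the
# insert never happens -- the "moved" row is simply gone.
t.insert((1, "noise", "bogus2", "p3"))
print(t.rows)  # [(2, 'noise', 'bogus3', 'p2'), (1, 'noise', 'bogus2', 'p3')]
```

A MERGE-style single transaction would remove that window entirely, which is why the atomicity caveat above matters.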
Here are the applicable HDFS details for these two transactions.
```
drwxrwx---+  - hive hadoop    0 2019-12-12 09:57 /wa/t/m/h/try_it/prt=p1/delete_delta_0000005_0000005_0000
-rw-rw----+  3 hive hadoop    1 2019-12-12 09:57 /wa/t/m/h/try_it/prt=p1/delete_delta_0000005_0000005_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop  733 2019-12-12 09:57 /wa/t/m/h/try_it/prt=p1/delete_delta_0000005_0000005_0000/bucket_00000
drwxrwx---+  - hive hadoop    0 2019-12-12 10:06 /wa/t/m/h/try_it/prt=p3/delta_0000006_0000006_0000
-rw-rw----+  3 hive hadoop    1 2019-12-12 10:06 /wa/t/m/h/try_it/prt=p3/delta_0000006_0000006_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop  813 2019-12-12 10:06 /wa/t/m/h/try_it/prt=p3/delta_0000006_0000006_0000/bucket_00000
```
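Notice how the directory names above encode the transaction range that produced them: the delete from transaction 5 lands in `delete_delta_0000005_0000005_0000` under `prt=p1`, while the insert from transaction 6 lands in `delta_0000006_0000006_0000` under `prt=p3`. A small parsing sketch (a hypothetical helper, not part of Hive) makes the `delta_<min>_<max>_<stmt>` naming pattern explicit:

```python
import re

# Hypothetical helper (not part of Hive): pull apart the ACID directory
# naming convention delta_<min>_<max>_<stmt> / delete_delta_<min>_<max>_<stmt>.
DELTA_RE = re.compile(r"(delete_delta|delta)_(\d+)_(\d+)_(\d+)$")

def parse_delta(path):
    m = DELTA_RE.search(path)
    if not m:
        return None
    kind, lo, hi, stmt = m.groups()
    return {"kind": kind, "min_txn": int(lo), "max_txn": int(hi), "stmt": int(stmt)}

print(parse_delta("/wa/t/m/h/try_it/prt=p1/delete_delta_0000005_0000005_0000"))
# {'kind': 'delete_delta', 'min_txn': 5, 'max_txn': 5, 'stmt': 0}
print(parse_delta("/wa/t/m/h/try_it/prt=p3/delta_0000006_0000006_0000"))
# {'kind': 'delta', 'min_txn': 6, 'max_txn': 6, 'stmt': 0}
```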
Parting Thoughts
These ACID transaction capabilities were mentioned way back in my Hadoop Summit talk (archiving evolving databases in hive) a few years ago, and they provide a much cleaner develop-time way to address the need for mutable data in Hive's (previously) immutable world.
A concern that surfaces is small files (see small files and hadoop's hdfs (bonus: an inode formula)): they cause not just "Namenode pressure", but also force processing to read potentially lots and lots of small(ish) delta files instead of a few big(ger) "base" files. To make it worse, this work cannot be done in parallel for a given partition, because these edit files must be walked in order for Hive to present an accurate representation of the end result.
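That ordering requirement can be sketched conceptually. The event replay below is an illustration of the idea only, not Hive's actual ORC ACID reader: because a later delete can cancel an earlier insert for the same row, the deltas for a partition must be applied in transaction order, which is what defeats parallelism within the partition.

```python
# Conceptual sketch of why deltas must be replayed in transaction order
# (illustration only, NOT Hive's actual ORC ACID reader).
# Each event is (txn_id, op, row_id, row); op is "insert" or "delete".
def apply_deltas(base, events):
    rows = dict(base)                                # row_id -> row
    for txn_id, op, row_id, row in sorted(events):   # walk in txn-id order
        if op == "insert":
            rows[row_id] = row
        elif op == "delete":
            rows.pop(row_id, None)
    return rows

events = [
    (6, "insert", "r1", (1, "noise", "bogus2", "p3")),  # txn 6: re-insert in p3
    (1, "insert", "r1", (1, "noise", "bogus", "p1")),   # txn 1: original row
    (5, "delete", "r1", None),                          # txn 5: delete from p1
]
# Replayed in txn order (1, 5, 6) the row survives in p3; if the delete
# were wrongly applied last, the row would disappear entirely.
print(apply_deltas({}, events))  # {'r1': (1, 'noise', 'bogus2', 'p3')}
```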
Some of these concerns are addressed by Hive's delta file compaction (minor and major) processing, and another big helper is to leverage the MERGE command to lump many changes into a single transaction. Both of these topics will be presented in future blog posts.