Today is one of those days when I thought I knew something, stood firm in my assumption, and then found out that I wasn't as right as I thought. Yes, a more humble person might even say they were wrong. But… while I wasn't totally wrong, I surely wasn't totally right! (wink)

Much like my earlier discovery that HDFS is not as immutable as I thought, this lesson in learning something new every day was also about Hadoop's HDFS. I stood firm in my understanding that clients were prevented from "seeing" an in-flight, partially written file because the HDFS NameNode (NN) would not make them aware of the incomplete file. Much of that assumption came from repeatedly teaching the write pipeline visualized below.

I was convinced that the "lease" the NN grants (step 1) also told the NN not to reveal any details about the file until the client signals that it is complete (step 12). Some quick Googling showed this is a contentious question, with multiple folks weighing in on places such as Stack Overflow question #26634057 (visualized below).

As with most things in technology, if you aren’t sure of the answer then go find out with a simple test. I prepared a 7.5 GB file to be loaded into HDFS.

training@cmhost:~$ ls -l /ngrams/unzipped/aTHRUe.txt 
-rw-rw-r-- 1 training training 7498258729 Mar 20 09:16 /ngrams/unzipped/aTHRUe.txt

I then created a little script that would push it to Hadoop and print out the before and after timestamps so we could see how long it took to get loaded.

training@cmhost:~$ cat putFile.sh 
date   # timestamp immediately before the upload starts
hdfs dfs -put /ngrams/unzipped/aTHRUe.txt /tmp/ngramsAthruE/
date   # timestamp immediately after the upload completes

Lastly, I kicked it off in the background and started polling the directory on HDFS.

training@cmhost:~$ ./putFile.sh & 
[1] 10138
Wed Mar 20 09:32:51 PDT 2019

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup 2818572288 2019-03-20 09:32 /tmp/ngramsAthruE/aTHRUe.txt._COPYING_

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup 4831838208 2019-03-20 09:32 /tmp/ngramsAthruE/aTHRUe.txt._COPYING_

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup 7113539584 2019-03-20 09:32 /tmp/ngramsAthruE/aTHRUe.txt._COPYING_
training@cmhost:~$ Wed Mar 20 09:33:59 PDT 2019

[1]+  Done                    ./putFile.sh

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup 7498258729 2019-03-20 09:33 /tmp/ngramsAthruE/aTHRUe.txt
training@cmhost:~$ 
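For what it's worth, the manual up-arrow polling above could easily be scripted. Here is a quick hypothetical loop (the five-second interval is arbitrary):

# Hypothetical polling loop: list the target directory every five seconds
# so the growing ._COPYING_ file can be watched without retyping the
# command (Ctrl-C to stop once the put finishes).
while true; do
  hdfs dfs -ls /tmp/ngramsAthruE
  sleep 5
done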

If you take a good look, you'll see I was right, but again far less correct than I was hoping to be.

Yes, the file in question, aTHRUe.txt, is NOT accessible, but a file with the same name plus a ._COPYING_ suffix is visible during the write operation, and it contains all of the blocks completed up to that point.

If a client were looking for a specific file, this would be perfectly fine, but more likely clients will be reading the contents of all files in a directory at once, and this half-baked file will surely cause issues.
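To make that hazard concrete, here is a minimal sketch of the failure mode; the line-count consumer is purely hypothetical, but the glob handling is standard HDFS shell behavior:

# Hypothetical directory-wide consumer. While the -put is still in flight,
# the glob also matches aTHRUe.txt._COPYING_, so the count below would
# include rows from a partially written file.
hdfs dfs -cat '/tmp/ngramsAthruE/*' | wc -l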

This is worth considering when you are building your data engineering pipelines, and it should be addressed in a way that keeps consumers from ever seeing partial data.

For a simple batch ingestion workflow, this could be as easy as writing to a working directory until all of the data is finalized, then executing an hdfs dfs -mv so that any client triggered on data availability only ever sees completed files.
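Here is a minimal sketch of that write-then-rename pattern, using my test file; the /tmp/_working staging directory name is my own invention:

# Stage the upload in a working directory that downstream consumers ignore.
hdfs dfs -mkdir -p /tmp/_working
hdfs dfs -put /ngrams/unzipped/aTHRUe.txt /tmp/_working/

# Once the put returns, publish the finished file. The rename is a
# NameNode metadata operation within the same filesystem, so no blocks
# move and consumers only ever see a complete file.
hdfs dfs -mv /tmp/_working/aTHRUe.txt /tmp/ngramsAthruE/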

As is always the case, never be so sure of yourself that you won't listen to others or give yourself a few minutes to validate your understanding with a simple test. And yes, enjoy the big piece of humble pie when it is served to you. (smile)
