are partially-written hdfs files accessible? (not exactly, but much more yes than I previously thought)

Today is one of those days when I thought I knew something, stood firm with my assumption, and then found out that I wasn’t as right as I thought. Yes, a more humble person might even say they were wrong. But… I’m not totally wrong, but surely not totally right!

Much like my discoveries in learning something new every day (seems hdfs is not as immutable as i thought), this one was also about Hadoop’s HDFS. I stood firm on my understanding that clients were prevented from “seeing” an in-flight / partially-written file because the HDFS NameNode (NN) would not make them aware of the not-yet-completed file. Much of that assumption came from repeatedly teaching the bit visualized below.

It seems I was convinced that the “lease” the NN grants (step 1) told the NN not to provide any details on the file until the client indicates that it is complete (step 12). From some quick Googling, it seems this is a contentious question, with multiple folks weighing in at places such as Stack Overflow Question # 26634057 (visualized below).

As with most things in technology, if you aren’t sure of the answer then go find out with a simple test. I prepared a 7.5 GB file to be loaded into HDFS.

training@cmhost:~$ ls -l /ngrams/unzipped/aTHRUe.txt
-rw-rw-r-- 1 training training 7498258729 Mar 20 09:16 /ngrams/unzipped/aTHRUe.txt

I then created a little script that would push it to Hadoop and print out the before and after timestamps so we could see how long it took to get loaded.

training@cmhost:~$ cat putFile.sh
date
hdfs dfs -put /ngrams/unzipped/aTHRUe.txt /tmp/ngramsAthruE/
date

Lastly, I kicked it off in the background and started polling the directory on HDFS.

training@cmhost:~$ ./putFile.sh &
[1] 10138
Wed Mar 20 09:32:51 PDT 2019
training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r-- 3 training supergroup 2818572288 2019-03-20 09:32 /tmp/ngramsAthruE/aTHRUe.txt._COPYING_
training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r-- 3 training supergroup 4831838208 2019-03-20 09:32 /tmp/ngramsAthruE/aTHRUe.txt._COPYING_
training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r-- 3 training supergroup 7113539584 2019-03-20 09:32 /tmp/ngramsAthruE/aTHRUe.txt._COPYING_
training@cmhost:~$ Wed Mar 20 09:33:59 PDT 2019
[1]+ Done ./putFile.sh
training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r-- 3 training supergroup 7498258729 2019-03-20 09:33 /tmp/ngramsAthruE/aTHRUe.txt
training@cmhost:~$

If you take a good look you’ll see I was right, but again, far less correct than I was hoping to be.

Yes, the file in question, aTHRUe.txt, is NOT accessible under that name, but a file with the same name plus a ._COPYING_ suffix is visible during the write operation, and its reported size reflects all of the blocks completed up to that point.
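One quick way to sanity-check the "completed blocks" reading is to divide the sizes from the listing above by the block size. The sketch below assumes the cluster uses the default HDFS block size of 128 MiB, which I did not confirm in the test itself; the function names are just for illustration.

```shell
# Assumption: the cluster uses the default HDFS block size of 128 MiB.
BLOCK_SIZE=$((128 * 1024 * 1024))

# Number of whole blocks that fit in a byte count.
full_blocks() { echo $(( $1 / BLOCK_SIZE )); }

# Bytes left over after the last whole block.
leftover_bytes() { echo $(( $1 % BLOCK_SIZE )); }

# The three in-flight sizes reported by `hdfs dfs -ls`, then the final size.
for size in 2818572288 4831838208 7113539584 7498258729; do
  echo "$size bytes = $(full_blocks "$size") full blocks + $(leftover_bytes "$size") bytes"
done
```

Under that assumption, the three in-flight sizes come out to exactly 21, 36, and 53 full blocks with nothing left over, while the finished 7498258729-byte file ends in a partial block.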

If a client were looking for a specific file then this would be perfectly fine, but more than likely clients would be reading the contents of all files in a directory at once, and this half-baked file will surely cause issues.
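If a reader has to work against a directory that is still being loaded, one defensive option is to skip anything that still carries the ._COPYING_ suffix. This is only a sketch: completed_paths is a made-up helper name, it assumes paths contain no spaces, and it only guards against files uploaded with hdfs dfs -put, since that suffix is what showed up in my test.

```shell
# Reads `hdfs dfs -ls` output on stdin and prints only the paths that
# do not end in ._COPYING_ (assumes paths contain no whitespace).
completed_paths() {
  awk 'NF > 0 && $1 != "Found" && $NF !~ /\._COPYING_$/ { print $NF }'
}

# Usage against a live cluster (directory taken from the test above):
#   hdfs dfs -ls /tmp/ngramsAthruE | completed_paths
```

Filtering on the reader side is fragile because every client has to remember to do it, which is why I would treat it as a fallback rather than the fix.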

This is worth considering when you are building your data engineering pipelines: the design should keep in-flight files out of any directory that downstream clients read.

For a simple batch ingestion workflow, this could be as simple as writing to a working directory until all data is finalized, then running a simple hdfs dfs -mv command to move it into the directory that clients watch for newly available data.
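Here is a minimal sketch of that working-directory approach. The function name publish_via_staging and the /tmp/staging path in the usage comment are hypothetical; only the hdfs dfs -mkdir, -put, and -mv commands themselves are real, and a production pipeline would want more error handling than this.

```shell
# Upload into a staging directory, then rename into the directory clients watch.
# Arguments: local file, HDFS staging directory, HDFS final directory.
publish_via_staging() {
  local_file=$1
  staging_dir=$2
  final_dir=$3
  name=$(basename "$local_file")

  hdfs dfs -mkdir -p "$staging_dir" "$final_dir" || return 1
  # The in-flight ._COPYING_ file only ever appears under the staging directory.
  hdfs dfs -put "$local_file" "$staging_dir/" || return 1
  # The move runs only after the put has succeeded, so readers of
  # the final directory never see a partially written file.
  hdfs dfs -mv "$staging_dir/$name" "$final_dir/$name"
}

# Usage (staging path is a placeholder):
#   publish_via_staging /ngrams/unzipped/aTHRUe.txt /tmp/staging /tmp/ngramsAthruE
```

The move is a rename handled by the NameNode rather than a second copy of the data, so the file shows up in the final directory at its full size in one step.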

As always, never be so sure of yourself that you stop listening to others or skip the few minutes it takes to validate your understanding with a simple test. And yes, enjoy the big piece of humble pie when it is served to you.

These findings only raised more questions as I thought about it. What happens to cat, cp, mv, and rm commands? I tested them out and found less than desirable answers, but ones that did fit in line with the findings above.

I was going to publish a second blog post as a follow-up as I’m a big believer that blog postings (not wiki pages!) are immutable, but since I was wrong about learning something new every day (seems hdfs is not as immutable as i thought) and these additional research findings are additive and do not alter the content above, I decided to just include them below.

If you don’t know why blog posts shouldn’t be edited but wiki pages should, then I would suggest you check out my enterprise 2.0 book review (using web 2.0 technologies within organizations). Yes, I am a “wiki gnome” at heart!!

Can the COPYING File be Read? YES.

The file cannot be read under the filename it is being created as, but the ._COPYING_ file can.

Can the COPYING File be Copied? YES.

While I was able to create an exception on one test, the results below do validate that the in-flight COPYING file can be copied based on its size at the time of the operation.

Can the COPYING File be Moved/Renamed? YES.

Much to my surprise, this actually caused no problems at all, and the completed, full-sized file kept the name it had been renamed to.

Can the COPYING File be Deleted? YES.

Sadly, it can. Additionally, it causes havoc for the client writing the file.

How Do I Feel About All of This?

I guess it doesn’t matter that I don’t like it… It is what it is!! Glad I know NOW!

Good luck and happy Hadooping.