Today is one of those days when I thought I knew something, stood firm with my assumption, and then found out that I wasn’t as right as I thought. Yes, a more humble person might even say they were wrong. Well… I’m not totally wrong, but I’m surely not totally right! (wink)

Much like my discoveries in learning something new every day (seems hdfs is not as immutable as i thought), this time was also about Hadoop’s HDFS. I stood firm on my understanding that clients were prevented from “seeing” an in-flight, partially-written file because the HDFS NameNode (NN) would not let them be aware of the not-yet-completed file. Much of that assumption came from repeatedly teaching the bit visualized below.

It seems I was convinced that the “lease” the NN grants to the writing client (step 1) meant the NN would not expose any details about the file until the client indicated it was complete (step 12). From some quick Googling, it seems this is a contentious question, with multiple folks weighing in on places such as Stack Overflow Question #26634057 (visualized below).
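
Incidentally, a quick way to see what the NN does track for an in-flight write is fsck’s -openforwrite option, which reports files currently open under a lease. A minimal sketch against my test directory (no output shown since yours will differ, and depending on your cluster’s permissions you may need to run it as the HDFS superuser):

# report any files under the directory that are still open for write,
# i.e. held under an active lease by a writing client
hdfs fsck /tmp/ngramsAthruE -files -openforwrite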

As with most things in technology, if you aren’t sure of the answer, then go find out with a simple test. I prepared a 7.5 GB file to be loaded into HDFS.

training@cmhost:~$ ls -l /ngrams/unzipped/aTHRUe.txt 
-rw-rw-r-- 1 training training 7498258729 Mar 20 09:16 /ngrams/unzipped/aTHRUe.txt

I then created a little script that would push it to Hadoop and print out the before and after timestamps so we could see how long it took to get loaded.

training@cmhost:~$ cat putFile.sh 
date
hdfs dfs -put /ngrams/unzipped/aTHRUe.txt /tmp/ngramsAthruE/
date

Lastly, I kicked it off in the background and started polling the directory on HDFS.
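
(Side note: rather than re-running hdfs dfs -ls by hand as I did below, the polling could be wrapped in a quick throwaway loop like this sketch; the 10-second interval is arbitrary.)

# poll the target directory until you kill the loop with Ctrl-C
while true; do
  hdfs dfs -ls /tmp/ngramsAthruE
  sleep 10
done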

training@cmhost:~$ ./putFile.sh & 
[1] 10138
Wed Mar 20 09:32:51 PDT 2019

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup 2818572288 2019-03-20 09:32 /tmp/ngramsAthruE/aTHRUe.txt._COPYING_

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup 4831838208 2019-03-20 09:32 /tmp/ngramsAthruE/aTHRUe.txt._COPYING_

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup 7113539584 2019-03-20 09:32 /tmp/ngramsAthruE/aTHRUe.txt._COPYING_
training@cmhost:~$ Wed Mar 20 09:33:59 PDT 2019

[1]+  Done                    ./putFile.sh

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup 7498258729 2019-03-20 09:33 /tmp/ngramsAthruE/aTHRUe.txt
training@cmhost:~$ 

If you take a good look you’ll see I was right, but again, far less correct than I was hoping to be.

Yes, the file in question, aTHRUe.txt, is NOT accessible, but a file with the same name plus a ._COPYING_ suffix is visible during the write operation, and its reported size grows to reflect the blocks completed up to that point.

If a client were looking for a specific file, this would be perfectly fine, but more than likely clients will be reading the contents of all files in the directory at once, and this half-baked file will surely cause issues.
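
If a consumer absolutely has to list a directory while loads may be in flight, one defensive option is to simply skip anything still carrying the ._COPYING_ suffix. A rough sketch:

# list only files that are not still being written by hdfs dfs -put
# (the ._COPYING_ suffix is the put/copyFromLocal temporary-file convention)
hdfs dfs -ls /tmp/ngramsAthruE | grep -v '\._COPYING_$'

Of course, that only guards against files landed via hdfs dfs -put or -copyFromLocal; a client writing directly through the HDFS API leaves no ._COPYING_ marker, which is why the working-directory approach described next is the safer bet.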

This is worth considering when you are building your data engineering pipelines, and it should be addressed in a way that does not cause problems downstream.

For a simple batch ingestion workflow, this could be as easy as writing into a working directory until all data is finalized, and then executing a simple hdfs dfs -mv command so that any client triggered on data availability only ever sees complete files.
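
As a minimal sketch of that pattern (the directory names here are purely illustrative):

# 1) land the data in a working directory that consumers never read from
hdfs dfs -mkdir -p /data/ngrams/_incoming /data/ngrams/ready
hdfs dfs -put /ngrams/unzipped/aTHRUe.txt /data/ngrams/_incoming/

# 2) only after the put returns successfully, move the completed file into
#    the directory downstream clients watch; the rename is a NameNode
#    metadata operation, so it is quick regardless of file size
hdfs dfs -mv /data/ngrams/_incoming/aTHRUe.txt /data/ngrams/ready/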

As is always the case, never be so sure of yourself that you won’t listen to others or give yourself a few minutes to validate your understanding with a simple test. And yes, enjoy the big piece of humble pie when it is served to you. (smile)

These findings only raised more questions as I thought about them. What happens with the cat, cp, mv, and rm commands? I tested them out and found less than desirable answers, but ones that did fit in line with the findings above.

I was going to publish a second blog post as a follow-up, since I’m a big believer that blog postings (not wiki pages!) are immutable. But I was already wrong once in learning something new every day (seems hdfs is not as immutable as i thought), and since these additional research findings are additive and do not alter the content above, I decided to just include them below.

If you don’t know why blog posts shouldn’t be edited but wiki pages should, then I would suggest checking out my enterprise 2.0 book review (using web 2.0 technologies within organizations). Yes, I am a “wiki gnome” at heart!!

Can the COPYING File be Read? YES.

The file cannot be read by the final filename it is being written to, but the ._COPYING_ file can be.

training@cmhost:~$ ./putFile.sh & 
[1] 10138
Fri Mar 22 11:22:21 PDT 2019

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup 1818572288 2019-03-22 11:22 /tmp/ngramsAthruE/aTHRUe.txt._COPYING_

training@cmhost:~$ hdfs dfs -cat /tmp/ngramsAthruE/aTHRUe.txt
cat: `/tmp/ngramsAthruE/aTHRUe.txt': No such file or directory

training@cmhost:~$ hdfs dfs -cat /tmp/ngramsAthruE/aTHRUe.txt._COPYING_

aflually_ADV	2004	1	1
aflually_ADV	2006	2	2
aflually_ADV	2008	1	1
afluente_.	1923	2	2
afluente_.	1924	5	1
afluente_.	1926	1	1
aflcat: Filesystem closed

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup 2415919104 2019-03-22 11:22 /tmp/ngramsAthruE/aTHRUe.txt._COPYING_
training@cmhost:~$ Fri Mar 22 11:23:34 PDT 2019

[1]+  Done                    ./putFile.sh

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup 7498258729 2019-03-22 11:23 /tmp/ngramsAthruE/aTHRUe.txt
training@cmhost:~$ 

Can the COPYING File be Copied? YES.

While I did manage to trigger an exception on one test, the results below validate that the in-flight COPYING file can be copied, with the copy reflecting its size at the time of the operation.

training@cmhost:~$ ./putFile.sh &
[1] 18298
Fri Mar 22 11:53:02 PDT 2019

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup  402653184 2019-03-22 11:53 /tmp/ngramsAthruE/aTHRUe.txt._COPYING_

training@cmhost:~$ hdfs dfs -cp /tmp/ngramsAthruE/aTHRUe.txt._COPYING_ /tmp/ngramsAthruE/inflight-copy.txt

training@cmhost:~$ Fri Mar 22 11:54:37 PDT 2019

[1]+  Done                    ./putFile.sh

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 2 items
-rw-r--r--   3 training supergroup 7498258729 2019-03-22 11:54 /tmp/ngramsAthruE/aTHRUe.txt
-rw-r--r--   3 training supergroup 1225386496 2019-03-22 11:53 /tmp/ngramsAthruE/inflight-copy.txt
training@cmhost:~$

Can the COPYING File be Moved/Renamed? YES.

Much to my surprise, this caused no problems at all, and the completed, full-sized file retained the name it was renamed to.

training@cmhost:~$ ./putFile.sh &
[1] 24698
Fri Mar 22 12:02:37 PDT 2019

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup  536870912 2019-03-22 12:02 /tmp/ngramsAthruE/aTHRUe.txt._COPYING_

training@cmhost:~$ hdfs dfs -mv /tmp/ngramsAthruE/aTHRUe.txt._COPYING_ /tmp/ngramsAthruE/inflight-move.txt
training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup 2013265920 2019-03-22 12:02 /tmp/ngramsAthruE/inflight-move.txt

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup 4026531840 2019-03-22 12:02 /tmp/ngramsAthruE/inflight-move.txt

Fri Mar 22 12:03:54 PDT 2019

[1]+  Done                    ./putFile.sh

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup 7498258729 2019-03-22 12:03 /tmp/ngramsAthruE/inflight-move.txt
training@cmhost:~$

Can the COPYING File be Deleted? YES.

Sadly, it can. Additionally, it causes havoc for the client writing the file.

training@cmhost:~$ ./putFile.sh &
[1] 11965
Fri Mar 22 12:15:20 PDT 2019

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
Found 1 items
-rw-r--r--   3 training supergroup  536870912 2019-03-22 12:15 /tmp/ngramsAthruE/aTHRUe.txt._COPYING_

training@cmhost:~$ hdfs dfs -rm -skipTrash /tmp/ngramsAthruE/aTHRUe.txt._COPYING_
Deleted /tmp/ngramsAthruE/aTHRUe.txt._COPYING_

training@cmhost:~$ 19/03/22 12:15:35 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/ngramsAthruE/aTHRUe.txt._COPYING_ (inode 56597): File does not exist. Holder DFSClient_NONMAPREDUCE_-1649048187_1 does not have any open files.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3820)
  ...  ...  ...  STACK TRACE LINES RM'D  ...  ...  ... 
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:790)
put: No lease on /tmp/ngramsAthruE/aTHRUe.txt._COPYING_ (inode 56597): File does not exist. Holder DFSClient_NONMAPREDUCE_-1649048187_1 does not have any open files.

Fri Mar 22 12:15:36 PDT 2019

[1]+  Done                    ./putFile.sh

training@cmhost:~$ hdfs dfs -ls /tmp/ngramsAthruE
training@cmhost:~$ 
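
One small mitigation worth calling out here: if the working directory from the earlier sketch is writable only by the ingesting user, then no other client can delete (or rename) the in-flight file out from under the writer, HDFS superusers aside. For example (path again illustrative):

# lock the working directory down so only the ingest user can modify its contents
hdfs dfs -chmod 700 /data/ngrams/_incoming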

How Do I Feel About All of This?

I guess it doesn’t matter that I don’t like it… It is what it is!! Glad I know NOW!

Good luck and happy Hadooping.