small files and hadoop's hdfs (bonus: an inode formula)
Many people have heard of the "small files" concern with HDFS. Most think it is related to the Namenode (NN) and its memory utilization, but the NN doesn't care much whether the files it manages are big or small -- what it really cares about is how many of them there are.
This topic is fairly detailed and better described in sources such as this HDFS Scalability whitepaper, but it basically comes down to the NN needing to keep track of "objects" in memory. These objects are directories, files and blocks. For example, if you had a tree with only 3 top-level directories *and* each of these had 3 files *and* each of those files took up 3 blocks, then your NN would need to keep track of 39 objects, as detailed below.
- 3 for the directories
- 9 for the files
- 27 for the blocks
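If it helps to see that arithmetic spelled out, here is a trivial sketch of the same counting (plain Python, just to make the bookkeeping explicit):

```python
# The example tree above: 3 top-level directories, each holding 3 files,
# each file occupying 3 blocks.
dirs = 3
files = dirs * 3          # 9 files
blocks = files * 3        # 27 blocks
print(dirs + files + blocks)  # 39 objects for the NN to track
```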
Now... wouldn't it be nice if Hadoop could tell you how many objects you have in place? Of course it would... and of course it does... You can run the count FS Shell operation, but that will only tell you how many directories and files are present.
[hdfs@sandbox ~]$ hdfs dfs -count /
         248          510          560952712 /
One way to get the full accounting is to pull up the NN web UI.
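If you would rather script it, the NN also publishes these counts over its JMX servlet. The sketch below is just one way to pull them; it assumes the NN web UI is reachable on the older default port of 50070 (newer releases use 9870) and that the standard FSNamesystem metrics are exposed.

```python
# Sketch: ask the NN's JMX servlet for its object counts.
# Assumes http://<nn-host>:50070/jmx is reachable; adjust host/port for your cluster.
import json
from urllib.request import urlopen

NN_JMX = "http://sandbox:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"

with urlopen(NN_JMX) as resp:
    bean = json.load(resp)["beans"][0]

files_and_dirs = bean["FilesTotal"]   # files + directories
blocks = bean["BlocksTotal"]
print("NN objects:", files_and_dirs + blocks)
```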
So how much memory is required to keep track of all of these objects? It is in the neighborhood of 150-200 bytes per object, but most folks think in terms of the total number of files on a filesystem like HDFS. So if you apply some heuristics (such as the average number of blocks per file and the average number of files per directory), you end up with the following de facto sizing guesstimate for NN memory heap size:
1 GB of heap size for every 1 million files
Again, that's not a one-size-fits-all formula, but it is a one-size-fits-MOST one that has been used rather universally with great success. Obviously, there are many things to consider when setting this up, and just as many strategies for reducing the number of files in the first place, such as concatenating files together, compressing many files into a single binary (most useful when the files are purely for long-term storage), leveraging Hadoop Archives, HDFS Federation, and so on. All that said, it is not uncommon to see a NN with a 96 GB heap managing 100+ million files.
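To put rough numbers on those heuristics, here is a back-of-the-envelope sketch; the bytes-per-object and blocks-per-file values are assumptions you would tune for your own cluster.

```python
# Two ways to ballpark NN heap: the raw per-object footprint (150-200 bytes each)
# and the de facto rule of thumb (1 GB of heap per 1 million files).
def nn_objects(num_files, avg_blocks_per_file=1.5):
    # ignore directories for simplicity; they are usually a small fraction
    return num_files * (1 + avg_blocks_per_file)

def heap_gb_from_objects(num_files, bytes_per_object=200):
    return nn_objects(num_files) * bytes_per_object / 1024**3

def heap_gb_rule_of_thumb(num_files):
    return num_files / 1_000_000

print(heap_gb_from_objects(100_000_000))   # ~47 GB of raw object footprint
print(heap_gb_rule_of_thumb(100_000_000))  # 100 GB -- the heuristic leaves headroom
```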
Not to prescribe a setting for your cluster, but at a minimum, even for early POC efforts, increase your NN heap size to at least 4096 MB. If using Ambari, the following screenshot shows you where to find this setting.
So, if we actively monitor our NN heap size and keep it in line with the number of objects the NN is managing, we can more accurately fine-tune our expectations for each cluster.
On the flip side, it seems easy enough to manage the amount of disk space we have on HDFS thanks to the inherent reporting abilities of HDFS (and tools like Ambari), not to mention some very simple math. I recently got asked how inodes themselves can prevent the cluster from accepting new files. As a refresher, inodes keep track of all the files on a given (physical, not HDFS) file system. Here is an example of a partition that shows 99% of its space free, but has all of its inodes used up.
Filesystem      1K-blocks     Used  Available Use% Mounted on
/dev/sdf1      3906486416 10586920 3700550300   1% /data04

Filesystem      Inodes  IUsed IFree IUse% Mounted on
/dev/sdf1       953856 953856     0  100% /data04
I'll leave a richer description of inodes, the pros/cons of increasing that value, and the actual procedure(s) to a stronger Linux administrator than I, but again, the question raised to me was: "will the inode size matter on the (underlying) hdfs file system mounts". The short answer is that, in a cluster with more than a trivial number of Datanodes (DNs), inodes should not cause an issue. Basically, this is because *all* of the files will be spread across all the DNs *and* each of these will have multiple physical spindles to further spread the love.
Of course, things are often more complicated when we dive deeper. First, each file will be broken up into appropriately sized blocks, and then each of those blocks will have three copies. So, our example file above with 3 blocks will need 9 separate physical files stored by the DNs. As you peel back the onion, you'll see there really are two files stored for each block: the block itself and a second file that contains metadata & checksum values. In fact, it gets a tiny bit more complicated than that because the DNs need a directory structure so they don't overrun a flat directory. Chapter 10 of Hadoop: The Definitive Guide (3rd Edition) has a good write-up on this, as you can see here, which is further visualized by the abbreviated directory listing from a DN's block data below.
[hdfs@sandbox finalized]$ pwd
/hadoop/hdfs/data/current/BP-1200952396-10.0.2.15-1398089695400/current/finalized
[hdfs@sandbox finalized]$ ls -al
total 49940
drwxr-xr-x 66 hdfs hadoop   12288 Apr 21 07:18 .
drwxr-xr-x  4 hdfs hadoop    4096 Jul  3 12:33 ..
-rw-r--r--  1 hdfs hadoop       7 Apr 21 07:16 blk_1073741825
-rw-r--r--  1 hdfs hadoop      11 Apr 21 07:16 blk_1073741825_1001.meta
-rw-r--r--  1 hdfs hadoop      42 Apr 21 07:16 blk_1073741826
-rw-r--r--  1 hdfs hadoop      11 Apr 21 07:16 blk_1073741826_1002.meta
-rw-r--r--  1 hdfs hadoop  392124 Apr 21 07:18 blk_1073741887
-rw-r--r--  1 hdfs hadoop    3071 Apr 21 07:18 blk_1073741887_1063.meta
-rw-r--r--  1 hdfs hadoop 1363159 Apr 21 07:18 blk_1073741888
-rw-r--r--  1 hdfs hadoop   10659 Apr 21 07:18 blk_1073741888_1064.meta
drwxr-xr-x  2 hdfs hadoop    4096 Apr 21 07:22 subdir0
drwxr-xr-x  2 hdfs hadoop    4096 Jun  3 08:45 subdir1
drwxr-xr-x  2 hdfs hadoop    4096 Apr 21 07:21 subdir63
drwxr-xr-x  2 hdfs hadoop    4096 Apr 21 07:18 subdir9
To generalize (and let's be clear... that's what I'm doing here -- just creating a rough formula to make sure we are in the right ballpark), we can say that we'll need about 2.1 inodes for each copy of a block (the block file, its .meta file, and a bit of directory overhead); with 3 replicas, that's 6.3 inodes for every block we'd expect to load into HDFS.
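If you want to sanity-check that 2.1 factor on a real DN, a quick walk of a data directory will show the ratio. The path below matches the sandbox layout above; adjust it for your own dfs.datanode.data.dir values.

```python
# Sketch: count the inodes a DN is actually burning for block storage.
import os

DATA_DIR = "/hadoop/hdfs/data/current"   # assumption: your data dir may differ

blk_files = meta_files = subdirs = 0
for root, dirnames, filenames in os.walk(DATA_DIR):
    subdirs += len(dirnames)
    for name in filenames:
        if name.endswith(".meta"):
            meta_files += 1
        elif name.startswith("blk_"):
            blk_files += 1

total_inodes = blk_files + meta_files + subdirs
print(f"{blk_files} block files, {meta_files} meta files, {subdirs} subdirectories")
print(f"~{total_inodes / max(blk_files, 1):.2f} inodes per block replica")
```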
To start the math, let's calculate the NbrOfAvailBlocksPerCluster, which is simply the inode limit per disk TIMES the number of disks in a DN TIMES the number of DNs in the cluster, all DIVIDED by the 6.3 value described above. For example, the following values surface for a cluster whose DNs have 10 disks each and whose JBOD file system partitions can each support 1 million inodes (a quick sketch of this arithmetic follows the list).
- 3 Datanodes = 4.7 MM blocks
- 30 Datanodes = 47.6 MM blocks
- 300 Datanodes = 476.2 MM blocks
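Here is that same arithmetic as a sketch, using the assumptions above (1 MM inodes per disk, 10 disks per DN, 6.3 inodes consumed per block):

```python
# NbrOfAvailBlocksPerCluster for the three example cluster sizes.
INODES_PER_DISK = 1_000_000
DISKS_PER_DN = 10
INODES_PER_BLOCK = 3 * 2.1   # replication factor * ~2.1 inodes per replica

for dns in (3, 30, 300):
    blocks = INODES_PER_DISK * DISKS_PER_DN * dns / INODES_PER_BLOCK
    print(f"{dns:>3} DNs -> {blocks:,.0f} blocks")
```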
To then know roughly how many files our cluster can store before running out of inodes, we simply divide the NbrOfAvailBlocksPerCluster by the AvgNbrBlocksPerFile we expect for a given cluster. The following numbers are based on the 30 node cluster identified above.
- 1 block/file = 47.6 MM files
- 2.5 blocks/file = 19.0 MM files
- 10 blocks/file = 4.7 MM files
Now, if you are immediately worried that these calculations suggest a cluster with 30 DNs can only hold roughly 19 MM files (using 2.5 blocks per file on average), then simply raise the inode setting to something larger. If the inode setting were raised to 100 MM, that same 30 DN cluster would allow for 1.9 BILLION files. If that cluster's block size was 128 MB, that would be twice as much data as the cluster's disks could actually hold -- at that point disk space, not inodes, becomes the limiting factor.
So, let's see if we can write that down in an algebraic formula. Remember, we're just thinking about inodes here -- the disk sizing calculation is MUCH easier.
NbrOfFilesInodeSettingCanSupport = ( ( InodeSettingPerDisk * NbrDisksPerDN * NbrDNs ) / ( ReplFactor * 2.1 ) ) / AvgNbrBlocksPerFile
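And as a final sketch, the same formula written out as a function; all of the inputs are assumptions you would replace with your own cluster's values.

```python
# NbrOfFilesInodeSettingCanSupport, straight from the formula above.
def files_inode_setting_can_support(inode_setting_per_disk, disks_per_dn, nbr_dns,
                                    avg_blocks_per_file, repl_factor=3):
    avail_blocks = (inode_setting_per_disk * disks_per_dn * nbr_dns) / (repl_factor * 2.1)
    return avail_blocks / avg_blocks_per_file

# the 30 DN cluster from above at 2.5 blocks per file
print(files_inode_setting_can_support(1_000_000, 10, 30, 2.5))     # ~19 MM files
# the same cluster with the inode setting raised to 100 MM per disk
print(files_inode_setting_can_support(100_000_000, 10, 30, 2.5))   # ~1.9 billion files
```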
As I'm doing this from the beach while enjoying my last full day of vacation, I challenge everyone to double-check my numbers AND MY LOGIC as the sun and the beverages are surely getting to me.