hadoop mini smoke test (VERY mini)

Obviously, a Hadoop cluster is a complicated beast.  Hortonworks' HDP 2.2 is a great example of where advanced Hadoop distributions are going and of the multitude of components coming together to enable the Modern Data Architecture.  Executing an exhaustive test of all the components, or at least the ones you are using, is critical, but sometimes you just need to do a simple smoke test.  This post presents even less than that: a mini smoke test that just might fit the bill in some situations, such as validating your building a virtualized 5-node HDP 2.0 cluster (all within a mac) exercise.

This simple tutorial presents some very quick MapReduce, Hive, and Pig tests you can run to gain at least a small amount of confidence in a cluster.  For my purposes, I've created a user called lester using the simple hadoop cluster user provisioning process (simple = w/o pam or kerberos) instructions.
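
If you don't want to walk through that full write-up, the heart of it boils down to something like the following sketch (run as root on the node you log into; the hdfs account is the usual HDFS superuser on HDP).

# create the OS-level account on the node(s) you will log into
useradd lester
passwd lester

# as the HDFS superuser, give the new account a home directory in HDFS
su - hdfs -c "hdfs dfs -mkdir /user/lester"
su - hdfs -c "hdfs dfs -chown lester:lester /user/lester"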

MapReduce Testing

The quintessential testing approach for MapReduce is TeraSort.  My examples below are based on the HDP 2.2 instructions (there is a typo in there – bonus points for catching it).  I ran the following command to generate 50MB of test data; teragen writes 100-byte rows, so 500,000 rows works out to 50MB.

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar teragen 500000 /user/lester/teragen-50MB

To validate the data was created, just run the following.

[lester@n1 ~]$ hdfs dfs -du /user/lester/teragen-50MB
0         /user/lester/teragen-50MB/_SUCCESS
25000000  /user/lester/teragen-50MB/part-m-00000
25000000  /user/lester/teragen-50MB/part-m-00001
[lester@n1 ~]$ 

Or get a human-readable summarized format.

[lester@n1 ~]$ hdfs dfs -du -s -h /user/lester/teragen-50MB
47.7 M  /user/lester/teragen-50MB
[lester@n1 ~]$ 

Then you can sort it.

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar terasort /user/lester/teragen-50MB /user/lester/terasort-50MB

When finished, double-check that the output is the same size as the input.

[lester@n1 ~]$ hdfs dfs -du /user/lester/terasort-50MB
0         /user/lester/terasort-50MB/_SUCCESS
0         /user/lester/terasort-50MB/_partition.lst
50000000  /user/lester/terasort-50MB/part-r-00000
[lester@n1 ~]$ hdfs dfs -du -s -h /user/lester/terasort-50MB
47.7 M  /user/lester/terasort-50MB
[lester@n1 ~]$ 
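
If matching byte counts isn't assurance enough, the same examples jar includes a teravalidate job that verifies the sorted output really is in order (the report directory name below is just my own choice).

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar teravalidate /user/lester/terasort-50MB /user/lester/teravalidate-50MB

Any ordering problems get written as error records into that report directory.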

Of course, the real purpose of TeraSort is benchmarking, so feel free to mess with data much bigger than 50MB.
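
For example, since teragen writes 100-byte rows, bumping the row count to 10,000,000 gets you roughly 1GB to sort (the output path below is just my naming convention).

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar teragen 10000000 /user/lester/teragen-1GB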

Hive Testing

Let's do some quick testing with Hive, too.  Create a simple CSV file and load it into HDFS.

[lester@n1 ~]$ cat data.txt
a,1
b,2
c,3
[lester@n1 ~]$ hdfs dfs -put data.txt
[lester@n1 ~]$ hdfs dfs -cat /user/lester/data.txt
a,1
b,2
c,3
[lester@n1 ~]$ 

The following script creates a table to house this data.  NOTE: OpenCSVSerde only surfaced in Hive 0.14 as described here.

createTable.hql
CREATE TABLE alpha_num ( alpha string, num int )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;

Create the table by running hive -f createTable.hql, then use Hive's CLI to interactively describe, load, and query it.  Note that DESCRIBE reports both columns as string; OpenCSVSerde surfaces every column as a string regardless of the type declared in the DDL.

[lester@n1 ~]$ hive 
hive> DESCRIBE alpha_num;                                           
OK
alpha                   string                  from deserializer   
num                     string                  from deserializer   
Time taken: 0.462 seconds, Fetched: 2 row(s)
hive> LOAD DATA INPATH '/user/lester/data.txt' INTO TABLE alpha_num;
Loading data to table default.alpha_num
Table default.alpha_num stats: [numFiles=1, totalSize=12]
OK
Time taken: 0.498 seconds
hive> SELECT * FROM alpha_num;                                      
OK
a    1
b    2
c    3
Time taken: 0.125 seconds, Fetched: 3 row(s)
hive> exit;
[lester@n1 ~]$ 
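
If you'd rather not drop into the interactive CLI, the same check can be run as a one-liner; the query below is just an arbitrary example and should come back with a count of 3.

hive -e "SELECT COUNT(*) FROM alpha_num;"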

Obviously, thanks to HCatalog, this data is available from other languages/frameworks/components beyond Hive.

Pig Testing

Pig is one of those languages that could use HCatalog by leveraging HCatalog LoadStore, but we'll keep it simple and just load a file off HDFS.  Just copy the CSV file back into HDFS (the Hive LOAD command above moved it into the Hive warehouse directory, so it is no longer in your home directory).  To not let this be too simple, I'm leveraging the CSVExcelStorage class as referenced in use pig to calculate salary statistics for georgia educators (second of a three-part series).

[lester@n1 ~]$ hdfs dfs -put data.txt 
[lester@n1 ~]$ cat dumpData.pig 
REGISTER /usr/hdp/current/pig-client/piggybank.jar;
az99 = LOAD '/user/lester/data.txt'
  using org.apache.pig.piggybank.storage.CSVExcelStorage()
  as (alpha:chararray, num:int);
dump az99;
[lester@n1 ~]$ pig dumpData.pig 
(a,1)
(b,2)
(c,3)
[lester@n1 ~]$ 
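
And if you do want to exercise the HCatalog route mentioned earlier, a sketch like the following reads straight from the metastore instead of a raw file (the hcatData.pig file name is just my own, and this assumes the alpha_num table from the Hive test is still populated).

hcatData.pig
-- load the Hive table through HCatalog rather than reading the file directly
az99 = LOAD 'default.alpha_num' USING org.apache.hive.hcatalog.pig.HCatLoader();
dump az99;

Run it with pig -useHCatalog hcatData.pig and you should get back the same three tuples, assuming HCatLoader plays nicely with the OpenCSVSerde-backed table.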

These three areas surely don't cover all the components, nor do they test any of them exhaustively, but they can give you some quick confidence that the basics are working.  Sometimes... that's enough!