hadoop mini smoke test (VERY mini)
Obviously, a Hadoop cluster is a complicated beast. Hortonworks' HDP 2.2 is a great example of where advanced Hadoop distributions are going and of the multitude of components coming together to enable the Modern Data Architecture. Executing an exhaustive test of all the components, or at least the ones you are using, is critical, but sometimes you just need a simple smoke test. This post presents even less than that: a mini smoke test which just might fit the bill in some situations, such as validating the cluster you stood up in the building a virtualized 5-node HDP 2.0 cluster (all within a mac) exercise.
This simple tutorial presents some very quick MapReduce, Hive, and Pig tests you can run to gain at least a small amount of confidence in a cluster. For my purposes, I've created a user called lester using the simple hadoop cluster user provisioning process (simple = w/o pam or kerberos) instructions.
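If that post isn't handy, the gist of the provisioning is just an OS account plus an HDFS home directory. Here's a minimal sketch, assuming a non-Kerberized cluster and that the HDFS commands run as the hdfs superuser:

# create the OS account on the node you log into
useradd lester
passwd lester
# create the HDFS home directory and hand over ownership
su - hdfs -c "hdfs dfs -mkdir /user/lester"
su - hdfs -c "hdfs dfs -chown lester:lester /user/lester"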
MapReduce Testing
The quintessential testing approach for MapReduce is TeraSort. My examples below are based on the HDP 2.2 instructions (there is a typo in there – bonus points for catching it). I ran the following command to generate 50MB of test data.
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar teragen 500000 /user/lester/teragen-50MB
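If you're wondering where 50MB comes from, teragen's first argument is a row count, not a byte count: each TeraSort row is 100 bytes, so 500,000 rows x 100 bytes = 50,000,000 bytes, which HDFS reports as roughly 47.7 MB in binary units.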
To validate the data was created, just run the following.
[lester@n1 ~]$ hdfs dfs -du /user/lester/teragen-50MB
0         /user/lester/teragen-50MB/_SUCCESS
25000000  /user/lester/teragen-50MB/part-m-00000
25000000  /user/lester/teragen-50MB/part-m-00001
[lester@n1 ~]$
Or get a human-readable summarized format.
[lester@n1 ~]$ hdfs dfs -du -s -h /user/lester/teragen-50MB
47.7 M  /user/lester/teragen-50MB
[lester@n1 ~]$
Then you can sort it.
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar terasort /user/lester/teragen-50MB /user/lester/terasort-50MB
When it finishes, double-check that the output is the same size as the input.
[lester@n1 ~]$ hdfs dfs -du /user/lester/terasort-50MB
0         /user/lester/terasort-50MB/_SUCCESS
0         /user/lester/terasort-50MB/_partition.lst
50000000  /user/lester/terasort-50MB/part-r-00000
[lester@n1 ~]$ hdfs dfs -du -s -h /user/lester/terasort-50MB
47.7 M  /user/lester/terasort-50MB
[lester@n1 ~]$
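For one more layer of assurance, the same examples jar ships a teravalidate job that verifies the output is actually globally sorted. A quick sketch (the report directory name is just my choice):

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar teravalidate /user/lester/terasort-50MB /user/lester/teravalidate-50MB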
Of course, the real purpose of TeraSort is benchmarking, so feel free to mess with data bigger than 50MB.
Hive Testing
Let's do some quick testing with Hive, too. Create a simple CSV file and load it into HDFS.
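If you want to follow along with the exact same data, one quick way to create the file (just a sketch; any small CSV will do) is:

printf 'a,1\nb,2\nc,3\n' > data.txt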
[lester@n1 ~]$ cat data.txt
a,1
b,2
c,3
[lester@n1 ~]$ hdfs dfs -put data.txt
[lester@n1 ~]$ hdfs dfs -cat /user/lester/data.txt
a,1
b,2
c,3
[lester@n1 ~]$
The following script creates a table to house this data. NOTE: OpenCSVSerde only surfaced in Hive 0.14, as described here.
CREATE TABLE alpha_num (
  alpha string,
  num int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
Save that script as createTable.hql and create the table by running hive -f createTable.hql, then use Hive's CLI to interactively describe, load, and query the new table.
[lester@n1 ~]$ hive
hive> DESCRIBE alpha_num;
OK
alpha       string      from deserializer
num         string      from deserializer
Time taken: 0.462 seconds, Fetched: 2 row(s)
hive> LOAD DATA INPATH '/user/lester/data.txt' INTO TABLE alpha_num;
Loading data to table default.alpha_num
Table default.alpha_num stats: [numFiles=1, totalSize=12]
OK
Time taken: 0.498 seconds
hive> SELECT * FROM alpha_num;
OK
a       1
b       2
c       3
Time taken: 0.125 seconds, Fetched: 3 row(s)
hive> exit;
[lester@n1 ~]$
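Notice that DESCRIBE reports num as a string even though the DDL declared it an int; OpenCSVSerde surfaces every column as a string. If the numeric type matters for your test, cast it on the way out with something like the following (just an illustration):

SELECT alpha, CAST(num AS INT) AS num FROM alpha_num;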
Obviously, thanks to HCatalog, this data is available from other languages/frameworks/components beyond Hive.
Pig Testing
Pig is one of those languages that could use HCatalog by leveraging the HCatalog LoadStore interfaces, but we'll keep it simple and just load a file off HDFS (see the HCatLoader sketch after the session below if you want the HCatalog route). Just copy the CSV file back into HDFS (the Hive LOAD command above moved it into the Hive warehouse, so it is no longer under /user/lester). To keep this from being too simple, I'm leveraging the CSVExcelStorage class as referenced in use pig to calculate salary statistics for georgia educators (second of a three-part series).
[lester@n1 ~]$ hdfs dfs -put data.txt
[lester@n1 ~]$ cat dumpData.pig
REGISTER /usr/hdp/current/pig-client/piggybank.jar;
az99 = LOAD '/user/lester/data.txt'
       using org.apache.pig.piggybank.storage.CSVExcelStorage()
       as (alpha:chararray, num:int);
dump az99;
[lester@n1 ~]$ pig dumpData.pig
(a,1)
(b,2)
(c,3)
[lester@n1 ~]$
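For completeness, the HCatalog route mentioned earlier would look roughly like the sketch below; the HCatLoader package name is what I'd expect with the Hive 0.14 that ships in HDP 2.2, and hcatData.pig is just a made-up file name.

-- hcatData.pig: read the Hive table created above via HCatalog
bz99 = LOAD 'default.alpha_num' USING org.apache.hive.hcatalog.pig.HCatLoader();
dump bz99;

Run it with pig -useHCatalog hcatData.pig so the HCatalog jars get picked up.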
These three areas surely don't cover all the components, nor test any of them exhaustively, but they can give you some quick confidence that the basics are working. Sometimes... that's enough!