Eventually, you will be ready to go well beyond something as simple as the hadoop mini smoke test (VERY mini) to build more confidence in your Hadoop cluster. This posting introduces Hortonworks' Hive TestBench (HTB), whose focus is on enabling queries from the Transaction Processing Performance Council (TPC)'s TPC-H and TPC-DS decision support benchmarking standards.
...
So, it sounds like an easy enough problem to solve: tar up the hive-testbench folder and drop it elsewhere so that the "setup" scripts can be run to generate data and create/load the needed Hive databases and tables. That said, the size of this folder is quite big.
...
It is easy enough to trim down this overall folder size by making a copy of it (just in case we delete too much!!) and then seeing what can be removed. If curious, MEP stands for Minimally Executable Product.
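As a rough sketch of that copy-then-inspect step, the two helper functions below make the safety copy and list its top-level entries by size. The function names are hypothetical (they are not part of the testbench), and the "-MEP" suffix simply follows the naming used in this posting.

```shell
# Sketch only: helper names are made up for illustration.

make_trim_copy() {
  # Copy <folder> to <folder>-MEP so the original stays untouched
  src="${1%/}"
  dest="${src}-MEP"
  cp -R "$src" "$dest"
  echo "$dest"
}

largest_entries() {
  # Size in KB of each top-level entry of the folder, largest first
  du -sk "$1"/* | sort -rn
}
```

Something like `largest_entries "$(make_trim_copy hive-testbench)"` would then show where the bulk of the space is going before anything gets deleted.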
...
As shown above, we can start off by deleting the maven elements, and then we realize we need to focus on the "-gen" folders. Let's home in on the 61 MB tpcds-gen folder, where we can start off by deleting the zip files.
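The zip cleanup and the final bundling could be sketched along these lines. The function names are hypothetical, the `.tar.gz` format is an assumption (any archive format would do), and the sketch assumes the zip files are no longer needed once the data generators have been built, so check that on your own copy first.

```shell
# Sketch only: helper names and the .tar.gz format are assumptions.

prune_gen_zips() {
  # Remove *.zip files under every "*-gen" folder of the trimmed copy
  for gen in "$1"/*-gen; do
    [ -d "$gen" ] || continue
    find "$gen" -type f -name '*.zip' -exec rm -f {} +
  done
}

bundle_copy() {
  # Create <folder>.tar.gz next to the folder, storing relative paths
  parent=$(dirname "$1")
  name=$(basename "$1")
  tar -czf "$parent/$name.tar.gz" -C "$parent" "$name"
  echo "$parent/$name.tar.gz"
}
```

Running `prune_gen_zips hive-testbench-MEP` followed by `bundle_copy hive-testbench-MEP` leaves a single archive that can be copied to the other host.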
...
At this point we have an artifact we can unwind on another host to generate & load some Hive tables to run the bundled queries against. Of course, you would want to do this on a like-for-like system (primarily OS) which should not be a problem in a typical organization's multi-cluster environment. For this blog posting, I'm just going to unzip everything in a separate folder on the same HDP Sandbox instance.
...
    [lester@sandbox hive-testbench-MEP]$ hive
    hive> show databases;
    OK
    default
    tpcds_bin_partitioned_orc_2
    tpcds_text_2
    tpch_flat_orc_2
    tpch_text_2
    xademo
    Time taken: 3.691 seconds, Fetched: 6 row(s)
    hive> use tpch_flat_orc_2;
    OK
    Time taken: 0.31 seconds
    hive> show tables;
    OK
    customer
    lineitem
    nation
    orders
    part
    partsupp
    region
    supplier
    Time taken: 0.439 seconds, Fetched: 8 row(s)
    hive> desc nation;
    OK
    n_nationkey    int
    n_name         string
    n_regionkey    int
    n_comment      string
    Time taken: 0.681 seconds, Fetched: 4 row(s)
    hive> select * from nation limit 1;
    OK
    0    ALGERIA    0    haggle. carefully final deposits detect slyly agai
    Time taken: 1.09 seconds, Fetched: 1 row(s)

More importantly, we can now run the actual TPC queries, such as those identified in the GitHub project's README file, although they are sure going to be slow on my little sandbox.
...
Now that all is operational, what do you do next? Basically, you run these standardized queries to generate a baseline set of metrics, then rerun them whenever you make configuration changes and/or increase the cluster size (i.e. Lather, Rinse, and Repeat).
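One lightweight way to capture such a baseline is to wrap each query run in a small timing helper that appends to a CSV file. This is only a sketch: `record_run` and the CSV layout (label, elapsed seconds, exit status) are my own invention, not something the testbench provides.

```shell
# Sketch only: record_run and its CSV layout are hypothetical.

record_run() {
  # usage: record_run <label> <results.csv> <command> [args...]
  label="$1"
  out="$2"
  shift 2
  start=$(date +%s)
  status=0
  "$@" > /dev/null 2>&1 || status=$?
  end=$(date +%s)
  # One line per run: label, elapsed seconds, exit status
  echo "$label,$((end - start)),$status" >> "$out"
}
```

A run might look like `record_run query55 baseline.csv hive --database tpcds_bin_partitioned_orc_2 -f query55.sql`, where the query file name is assumed from the project's sample queries. Keep one CSV per cluster configuration and comparing before and after becomes a simple diff.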
Happy Hadooping and happy benchmarking!!