Eventually, you will be ready to go well beyond something as simple as the Hadoop mini smoke test (VERY mini) to build more confidence in your Hadoop cluster.  This post introduces Hortonworks' Hive TestBench (HTB), which focuses on enabling queries from the Transaction Processing Performance Council (TPC)'s TPC-H and TPC-DS decision support benchmarking standards. 

...

So, it sounds like an easy enough problem to solve: tar up the hive-testbench folder and drop it elsewhere so that the setup scripts can be run to generate data and create/load the needed Hive databases and tables.  That said, this folder is quite big.
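The packaging step above can be sketched as follows. This is a minimal demo against a scratch stand-in directory; the real path to your hive-testbench checkout will differ.

```shell
# Scratch stand-in for the real hive-testbench checkout (path is hypothetical)
mkdir -p /tmp/htb-demo/hive-testbench/tpcds-gen
echo data > /tmp/htb-demo/hive-testbench/tpcds-gen/README.md

# Package the folder so it can be copied to another host
cd /tmp/htb-demo
tar -czf hive-testbench.tar.gz hive-testbench

# Check the archive size before shipping it anywhere
ls -lh hive-testbench.tar.gz
```

On a real checkout the archive size is exactly the problem the next section tackles, which is why trimming the folder first is worth the effort.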

...

It is easy enough to trim down the overall folder size by making a copy of it (just in case we delete too much!!) and then seeing what can be removed.  If curious, MEP stands for Minimally Executable Product (wink).
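The copy-before-pruning approach looks like this in practice. Again a scratch tree stands in for the real folder, and the "-MEP" suffix follows the naming used in this walkthrough.

```shell
# Scratch stand-in tree with a bulky build artifact (names are hypothetical)
mkdir -p /tmp/htb/hive-testbench/tpcds-gen/target
dd if=/dev/zero of=/tmp/htb/hive-testbench/tpcds-gen/target/big.jar bs=1024 count=64 2>/dev/null

# Work on a copy so nothing is lost if we delete too much
cp -r /tmp/htb/hive-testbench /tmp/htb/hive-testbench-MEP

# Confirm the copy exists and see its current size before pruning
du -sh /tmp/htb/hive-testbench-MEP
```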

...

As shown above, we can start off by deleting the Maven elements, and then we realize we need to focus on the "-gen" folders.  Let's hone in on the 61 MB tpcds-gen folder, where we can start by deleting the zip files there.
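A sketch of that pruning pass: first see what is taking the space, then drop the zip files, which can be regenerated or re-downloaded on the target side. The file names here are illustrative placeholders, not the exact contents of tpcds-gen.

```shell
# Stand-in tree mimicking the kinds of files pruned in the walkthrough
mkdir -p /tmp/prune/tpcds-gen
echo kit  > /tmp/prune/tpcds-gen/tools.zip     # hypothetical bulky zip
echo q    > /tmp/prune/tpcds-gen/keep.sql      # query asset we must keep

# See what is taking the space
du -sh /tmp/prune/*

# Drop only the zip files; everything else stays
find /tmp/prune -name '*.zip' -delete
```

The same `du`-then-`find` loop works for the Maven `target` folders and any other regenerable build output.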

...

At this point we have an artifact we can unwind on another host to generate and load some Hive tables to run the bundled queries against.  Of course, you would want to do this on a like-for-like system (primarily the OS), which should not be a problem in a typical organization's multi-cluster environment.  For this blog post, I'm just going to unzip everything into a separate folder on the same HDP Sandbox instance.
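Unwinding the artifact into a separate folder is just a tar round-trip; a minimal sketch, with scratch paths standing in for the sandbox's real directories:

```shell
# Build a small archive as a stand-in for the trimmed hive-testbench-MEP tarball
mkdir -p /tmp/unwind/src/hive-testbench-MEP
echo q > /tmp/unwind/src/hive-testbench-MEP/query.sql
tar -C /tmp/unwind/src -czf /tmp/unwind/htb-MEP.tar.gz hive-testbench-MEP

# On the target host (or a separate folder on the same sandbox), unpack it
mkdir -p /tmp/unwind/dest
tar -C /tmp/unwind/dest -xzf /tmp/unwind/htb-MEP.tar.gz
ls /tmp/unwind/dest/hive-testbench-MEP
```

From there the setup scripts can be run in the unpacked folder to generate the data and load the Hive tables.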

...

Code Block (bash)
[lester@sandbox hive-testbench-MEP]$ hive
hive> show databases;
OK
default
tpcds_bin_partitioned_orc_2
tpcds_text_2
tpch_flat_orc_2
tpch_text_2
xademo
Time taken: 3.691 seconds, Fetched: 6 row(s)
hive> use tpch_flat_orc_2;
OK
Time taken: 0.31 seconds
hive> show tables;
OK
customer
lineitem
nation
orders
part
partsupp
region
supplier
Time taken: 0.439 seconds, Fetched: 8 row(s)
hive> desc nation;
OK
n_nationkey             int                                         
n_name                  string                                      
n_regionkey             int                                         
n_comment               string                                      
Time taken: 0.681 seconds, Fetched: 4 row(s)
hive> select * from nation limit 1;
OK
0    ALGERIA    0     haggle. carefully final deposits detect slyly agai
Time taken: 1.09 seconds, Fetched: 1 row(s)

More importantly, we can now run the actual TPC queries, such as those identified in the GitHub project's README file, although they are surely going to be slow on my little sandbox.
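To give a flavor of what those queries look like, here is a trimmed sketch of TPC-H Query 1 (the pricing summary report) against the lineitem table shown above. The column list is abbreviated and hive-testbench ships its own tuned query files, so the bundled version will differ; the path is a placeholder.

```shell
# Write a trimmed TPC-H Q1 to a file, the way the bundled queries are stored
mkdir -p /tmp/queries
cat > /tmp/queries/tpch_q1.sql <<'SQL'
select l_returnflag, l_linestatus,
       sum(l_quantity)                          as sum_qty,
       sum(l_extendedprice * (1 - l_discount))  as sum_disc_price,
       avg(l_discount)                          as avg_disc,
       count(*)                                 as count_order
from lineitem
where l_shipdate <= '1998-09-02'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
SQL

# On the sandbox this would then be run as:  hive -f /tmp/queries/tpch_q1.sql
cat /tmp/queries/tpch_q1.sql
```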

...

Now that everything is operational, what do you do next?  Basically, you want to run these standardized queries to generate a baseline set of metrics that can be compared against reruns after configuration changes and/or cluster size increases (i.e. Rinse, Lather, and Repeat).
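One simple way to capture that baseline is to record wall-clock time per query file into a CSV you can diff after each tuning pass. A minimal sketch, with the `hive` invocation commented out and replaced by a stand-in since the query files and paths here are placeholders:

```shell
# Scratch query folder standing in for the bundled sample-query files
mkdir -p /tmp/bench
printf 'select 1;\n' > /tmp/bench/query1.sql

for q in /tmp/bench/*.sql; do
  start=$(date +%s)
  # hive -f "$q"      # on the real cluster, run the query here
  sleep 1             # stand-in for query execution time
  end=$(date +%s)
  echo "$q,$((end - start))" >> /tmp/bench/baseline.csv
done

# The baseline you compare against after every config or sizing change
cat /tmp/bench/baseline.csv
```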

Happy Hadooping and happy benchmarking!!