trying out hive testbench on hdp sandbox (and packaging it up for deployment elsewhere)

Eventually, you will be ready to go well beyond something as simple as the hadoop mini smoke test (VERY mini) to build more confidence in your Hadoop cluster.  This posting introduces Hortonworks' Hive TestBench, whose focus is on enabling queries from the Transaction Processing Performance Council (TPC)'s TPC-H and TPC-DS decision support benchmarking standards.

Interestingly enough... I'm going to set this up on an instance of the HDP Sandbox; specifically, I'm using 2.2.4 for VirtualBox.  Obviously, that's not the place you are going to want to try to get your best performance, but it can be a great place to set up the test framework and then deploy it on another cluster.  I'm getting ahead of myself; more on that later...

To get started, I created a user called lester following my simple hadoop cluster user provisioning process (simple = w/o pam or kerberos).  Then I created a benchmarking folder in my home directory where I ran git clone https://github.com/hortonworks/hive-testbench.git to pull down the project.

The project's README gives some straightforward instructions on running the two build scripts, and while the tpch-build.sh script worked without problem, tpcds-build.sh surfaced some errors for me.  I was able to get past them with the fix described in the pull request I submitted at https://github.com/hortonworks/hive-testbench/pull/4, which you should apply before trying to run the build script.

At this point, I was able to follow the instructions and run ./tpcds-setup.sh 2 and ./tpch-setup.sh 2 to generate some tiny "big data", but I want to go down another path first.  If you have been running these same steps yourself, you'll notice that we needed a variety of development tools such as gcc and Maven to get to the point of creating some data to run the TPC queries against.  The HDP Sandbox did great with this, but it is easy to imagine that some clusters (especially production ones) might not have these tools available, nor will all sysadmin teams' standards allow for them.  Additionally, to get this far the host needs Internet access, which probably is not a normal setup for many clusters out there.

So, it sounds like an easy enough problem to solve: tar up the hive-testbench folder and drop it elsewhere so that the "setup" scripts can be run to generate data and create/load the needed Hive databases and tables.  That said, this folder is quite big.

[lester@sandbox benchmarking]$ pwd
/home/lester/benchmarking
[lester@sandbox benchmarking]$ ls
hive-testbench  
[lester@sandbox benchmarking]$ du -sh *
272M    hive-testbench

It is easy enough to trim down this overall folder size by making a copy of it (just in case we delete too much!!) and then seeing what can be removed.  If curious, MEP stands for Minimally Executable Product (wink).

[lester@sandbox benchmarking]$ cp -R hive-testbench/ hive-testbench-MEP/
[lester@sandbox benchmarking]$ cd hive-testbench-MEP/
[lester@sandbox hive-testbench-MEP]$ du -sh *
5.8M    apache-maven-3.0.5
5.0M    apache-maven-3.0.5-bin.tar.gz
236K    ddl-tpcds
44K     ddl-tpch
8.0K    README.md
4.0K    runSuite.pl
288K    sample-queries-tpcds
104K    sample-queries-tpch
16K     settings
4.0K    tpcds-build.sh
61M     tpcds-gen
4.0K    tpcds-setup.sh
4.0K    tpch-build.sh
199M    tpch-gen
4.0K    tpch-setup.sh
[lester@sandbox hive-testbench-MEP]$ rm -r apache-maven-*

As shown above, we can start by deleting the Maven elements, and then we realize we need to focus on the "-gen" folders.  Let's home in on the 61 MB tpcds-gen folder, where the first things to go are the zip files.

[lester@sandbox hive-testbench-MEP]$ cd tpcds-gen
[lester@sandbox tpcds-gen]$ du -sh *
5.1M    DSTools-VERIFY.zip
4.0K    Makefile
28K     patches
4.0K    pom.xml
4.0K    README.md
32K     src
51M     target
5.1M    tpcds_kit.zip
[lester@sandbox tpcds-gen]$ rm *.zip

Next up, drill into the 51 MB target folder for some further purging identified below.

[lester@sandbox tpcds-gen]$ cd target
[lester@sandbox target]$ du -sh *
40K     classes
8.0K    generated-sources
31M     lib
8.0K    maven-archiver
4.0K    surefire
5.6M    tools
12K     tpcds-gen-1.0-SNAPSHOT.jar
5.1M    tpcds_kit.zip
9.8M    TPCDSVersion1.3.1
[lester@sandbox target]$ rm -r TPCDSVersion1.3.1/
[lester@sandbox target]$ rm tpcds_kit.zip 
[lester@sandbox target]$ rm -r tools

Some trial & error testing helped me figure out that there is just one file in the 31 MB lib folder that is needed after the project has been compiled/built, which brings the folder size way down.

[lester@sandbox target]$ mv ./lib/dsdgen.jar .
[lester@sandbox target]$ rm -r lib
[lester@sandbox target]$ mkdir lib
[lester@sandbox target]$ mv dsdgen.jar ./lib/dsdgen.jar
[lester@sandbox target]$ du -sh *
40K     classes
8.0K    generated-sources
1.5M    lib
8.0K    maven-archiver
4.0K    surefire
12K     tpcds-gen-1.0-SNAPSHOT.jar

Now we can do basically the same thing for the TPC-H files & folders to shrink it from 199 MB down to 544 KB.

[lester@sandbox target]$ cd ../..
[lester@sandbox hive-testbench-MEP]$ du -sh *
236K    ddl-tpcds
44K     ddl-tpch
8.0K    README.md
4.0K    runSuite.pl
288K    sample-queries-tpcds
104K    sample-queries-tpch
16K     settings
4.0K    tpcds-build.sh
1.7M    tpcds-gen
4.0K    tpcds-setup.sh
4.0K    tpch-build.sh
199M    tpch-gen
4.0K    tpch-setup.sh
[lester@sandbox hive-testbench-MEP]$ cd tpch-gen
[lester@sandbox tpch-gen]$ du -sh *
12K     ddl
4.0K    Makefile
12K     patches
4.0K    pom.xml
4.0K    README.md
36K     src
173M    target
26M     tpch_kit.zip
[lester@sandbox tpch-gen]$ rm tpch_kit.zip 
[lester@sandbox tpch-gen]$ cd target
[lester@sandbox target]$ du -sh *
44K     classes
8.0K    generated-sources
29M     lib
35M     __MACOSX
8.0K    maven-archiver
4.0K    surefire
4.0K    tools
84M     tpch_2_16_0
12K     tpch-gen-1.0-SNAPSHOT.jar
26M     tpch_kit.zip
[lester@sandbox target]$ rm tpch_kit.zip 
[lester@sandbox target]$ rm -r tpch_2_16_0/
[lester@sandbox target]$ rm -r __MACOSX/
[lester@sandbox target]$ mv ./lib/dbgen.jar .
[lester@sandbox target]$ rm -r lib
[lester@sandbox target]$ mkdir lib
[lester@sandbox target]$ mv dbgen.jar ./lib/dbgen.jar
[lester@sandbox target]$ du -sh *
44K     classes
8.0K    generated-sources
384K    lib
8.0K    maven-archiver
4.0K    surefire
4.0K    tools
12K     tpch-gen-1.0-SNAPSHOT.jar

Now things are looking much better.

[lester@sandbox target]$ cd ../..
[lester@sandbox hive-testbench-MEP]$ du -sh *
236K    ddl-tpcds
44K     ddl-tpch
8.0K    README.md
4.0K    runSuite.pl
288K    sample-queries-tpcds
104K    sample-queries-tpch
16K     settings
4.0K    tpcds-build.sh
1.7M    tpcds-gen
4.0K    tpcds-setup.sh
4.0K    tpch-build.sh
544K    tpch-gen
4.0K    tpch-setup.sh
[lester@sandbox hive-testbench-MEP]$ cd ..
[lester@sandbox benchmarking]$ du -sh *
272M    hive-testbench
3.6M    hive-testbench-MEP

Definitely, 3.6 MB is MUCH tighter than 272 MB, and we can squeeze it just a bit more when we tar and compress it all.

[lester@sandbox benchmarking]$ tar -czf hive-testbench-MEP.tar.gz ./hive-testbench-MEP/
[lester@sandbox benchmarking]$ du -sh *
272M    hive-testbench
3.6M    hive-testbench-MEP
2.2M    hive-testbench-MEP.tar.gz
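
The manual trimming above can be collapsed into one function for reuse.  This is only a sketch that assumes the directory layout shown in this posting and that the build scripts have already produced dsdgen.jar and dbgen.jar; adjust the paths if the project structure changes.

```shell
# Trim a built hive-testbench tree down to the pieces the setup
# scripts need at run time, then package the result for transport.
trim_testbench() {
  src=$1
  dst=$2
  cp -R "$src" "$dst"
  # Maven was only needed at build time
  rm -rf "$dst"/apache-maven-*
  # TPC-DS side: drop the kit zips, extracted kit, and tools
  rm -rf "$dst"/tpcds-gen/*.zip "$dst"/tpcds-gen/target/*.zip \
         "$dst"/tpcds-gen/target/tools "$dst"/tpcds-gen/target/TPCDSVersion*
  # TPC-H side: drop the kit zips, extracted kit, and Mac archive debris
  rm -rf "$dst"/tpch-gen/*.zip "$dst"/tpch-gen/target/*.zip \
         "$dst"/tpch-gen/target/tpch_2_* "$dst"/tpch-gen/target/__MACOSX
  # keep only the data-generator jar inside each lib folder
  find "$dst"/tpcds-gen/target/lib -type f ! -name dsdgen.jar -delete
  find "$dst"/tpch-gen/target/lib -type f ! -name dbgen.jar -delete
  # and package it up
  tar -czf "$dst".tar.gz "$dst"
}
```

Running trim_testbench hive-testbench hive-testbench-MEP should reproduce the slimmed-down tarball built by hand above.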

At this point we have an artifact we can unpack on another host to generate & load some Hive tables to run the bundled queries against.  Of course, you would want to do this on a like-for-like system (primarily the OS), which should not be a problem in a typical organization's multi-cluster environment.  For this blog posting, I'm just going to untar everything in a separate folder on the same HDP Sandbox instance.

[lester@sandbox benchmarking]$ cd ..
[lester@sandbox ~]$ mkdir benchmarkELSEWHERE
[lester@sandbox ~]$ cp benchmarking/hive-testbench-MEP.tar.gz ./benchmarkELSEWHERE/
[lester@sandbox ~]$ cd benchmarkELSEWHERE/
[lester@sandbox benchmarkELSEWHERE]$ ls
hive-testbench-MEP.tar.gz
[lester@sandbox benchmarkELSEWHERE]$ tar -xf hive-testbench-MEP.tar.gz 
[lester@sandbox benchmarkELSEWHERE]$ ls
hive-testbench-MEP  hive-testbench-MEP.tar.gz
[lester@sandbox benchmarkELSEWHERE]$ cd hive-testbench-MEP
[lester@sandbox hive-testbench-MEP]$ ls
ddl-tpcds  runSuite.pl           settings        tpcds-setup.sh  tpch-setup.sh
ddl-tpch   sample-queries-tpcds  tpcds-build.sh  tpch-build.sh
README.md  sample-queries-tpch   tpcds-gen       tpch-gen

From here we can verify that we can generate some data.

[lester@sandbox hive-testbench-MEP]$ ./tpch-setup.sh 2
TPC-H text data generation complete.
Loading text data into external tables.
Optimizing table part (1/8).
Optimizing table partsupp (2/8).
Optimizing table supplier (3/8).
Optimizing table customer (4/8).
Optimizing table orders (5/8).
Optimizing table lineitem (6/8).
Optimizing table nation (7/8).
Optimizing table region (8/8).
Data loaded into database tpch_flat_orc_2.
[lester@sandbox hive-testbench-MEP]$ ./tpcds-setup.sh 2
TPC-DS text data generation complete.
Loading text data into external tables.
Optimizing table store_sales (1/24).
Optimizing table store_returns (2/24).
Optimizing table web_sales (3/24).
Optimizing table web_returns (4/24).
Optimizing table catalog_sales (5/24).
Optimizing table catalog_returns (6/24).
Optimizing table inventory (7/24).
Optimizing table date_dim (8/24).
Optimizing table time_dim (9/24).
Optimizing table item (10/24).
Optimizing table customer (11/24).
Optimizing table customer_demographics (12/24).
Optimizing table household_demographics (13/24).
Optimizing table customer_address (14/24).
Optimizing table store (15/24).
Optimizing table promotion (16/24).
Optimizing table warehouse (17/24).
Optimizing table ship_mode (18/24).
Optimizing table reason (19/24).
Optimizing table income_band (20/24).
Optimizing table call_center (21/24).
Optimizing table web_page (22/24).
Optimizing table catalog_page (23/24).
Optimizing table web_site (24/24).
Data loaded into database tpcds_bin_partitioned_orc_2.

Then we can verify the tables are present and have data loaded.

[lester@sandbox hive-testbench-MEP]$ hive
hive> show databases;
OK
default
tpcds_bin_partitioned_orc_2
tpcds_text_2
tpch_flat_orc_2
tpch_text_2
xademo
Time taken: 3.691 seconds, Fetched: 6 row(s)
hive> use tpch_flat_orc_2;
OK
Time taken: 0.31 seconds
hive> show tables;
OK
customer
lineitem
nation
orders
part
partsupp
region
supplier
Time taken: 0.439 seconds, Fetched: 8 row(s)
hive> desc nation;
OK
n_nationkey             int                                         
n_name                  string                                      
n_regionkey             int                                         
n_comment               string                                      
Time taken: 0.681 seconds, Fetched: 4 row(s)
hive> select * from nation limit 1;
OK
0    ALGERIA    0     haggle. carefully final deposits detect slyly agai
Time taken: 1.09 seconds, Fetched: 1 row(s)

More importantly, we can now run the actual TPC queries identified in the GitHub project's README file, although they are surely going to be slow on my little sandbox.

hive> source tpch_query1.sql;
Query ID = lester_20150709034949_285c833f-fea1-48ad-8e01-8c651b73150a
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1436365286495_0090)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      2          2        0        0       0       0
Reducer 2 ......   SUCCEEDED     48         48        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 46.97 s    
--------------------------------------------------------------------------------
OK
A    F    7.5478173E7    1.1319733134601569E11    1.0753640820730757E11    1.118388987696159E11    25.505698877458507    38251.814164120944    0.05000395030241819    2959267
N    F    1966480.0    2.94611482674001E9    2.798796636156392E9    2.9110301630686007E9    25.530080751953886    38248.31650014294    0.04999558590603488    77026
N    O    1.49872794E8    2.2474848743041776E11    2.1351545694796832E11    2.2205919740072766E11    25.495187016222236    38232.454108058955    0.04997914935058362    5878474
R    F    7.5577628E7    1.1335191421816394E11    1.0768808181149153E11    1.1199430786622203E11    25.512150382609875    38263.32154391632    0.04997978002407437    2962417
Time taken: 53.591 seconds, Fetched: 4 row(s)

Now that all is operational, what do you do next?  Basically, you are looking to run these standardized queries to generate a baseline set of metrics that you can rerun when you make configuration changes and/or increase the cluster size (i.e. rinse, lather, and repeat).
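
Capturing that baseline is mostly a matter of running each query and recording how long it took.  The project ships runSuite.pl for this, but the idea can be sketched as a plain shell loop; the database and log file names below are just illustrative placeholders for your own run.

```shell
# Run every sample query against a target database and record the
# wall-clock seconds each one took.  Sketch only: assumes the hive CLI
# is on the PATH and the query directory matches your layout.
run_suite() {
  querydir=$1
  db=$2
  log=$3
  for q in "$querydir"/*.sql; do
    start=$(date +%s)
    hive --database "$db" -f "$q" > /dev/null 2>&1
    end=$(date +%s)
    echo "$(basename "$q") $((end - start))s" >> "$log"
  done
}
```

For example, run_suite sample-queries-tpch tpch_flat_orc_2 tpch-baseline.log would step through the TPC-H samples; rerun it after each tuning change and diff the logs.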

Happy Hadooping and happy benchmarking!!