trying out hive testbench on hdp sandbox (and packaging it up for deployment elsewhere)
Eventually, you will be ready to go well beyond something as simple as the Hadoop mini smoke test (VERY mini) to build more confidence in your Hadoop cluster. This posting introduces Hortonworks' Hive TestBench, which focuses on enabling queries from the Transaction Processing Performance Council (TPC)'s TPC-H and TPC-DS decision support benchmarking standards.
Interestingly enough... I'm going to set this up on an instance of the HDP Sandbox; specifically, I'm using 2.2.4 for VirtualBox. Obviously, that's not the place where you are going to get your best performance, but it can be a great place to set up the test framework before deploying it to another cluster. I'm getting ahead of myself; more on that later...
To get started, I created a user called lester following my simple Hadoop cluster user provisioning process (simple = w/o PAM or Kerberos). Then I created a benchmarking folder in my home directory where I ran git clone https://github.com/hortonworks/hive-testbench.git to pull down the hive-testbench project.
The project's README gives some straightforward instructions on running the two build scripts, and while the tpch-build.sh script worked without problem, tpcds-build.sh surfaced some errors for me. I was able to get past them with the fix described in the pull request I submitted at https://github.com/hortonworks/hive-testbench/pull/4, which you should take care of before trying to run the build script.
At this point, I was able to follow the instructions and run ./tpcds-setup.sh 2 and ./tpch-setup.sh 2 to generate some tiny "big data", but I want to go down another path first. If you have been running these same steps yourself, you'll have noticed that we needed a variety of development tools, such as gcc and Maven, to get to the point of creating some data to run the TPC queries against. The HDP Sandbox did great with this, but it is easy to imagine that some clusters (especially production ones) might not have these tools available, nor will all sysadmin teams' standards allow for them. Additionally, to get this far the host needs Internet access, which is probably not a normal setup for many clusters out there.
So, this sounds like an easy enough problem to solve: tar up the hive-testbench folder and drop it elsewhere so that the "setup" scripts can be run to generate data and create/load the needed Hive databases and tables. That said, this folder is quite large.
[lester@sandbox benchmarking]$ pwd
/home/lester/benchmarking
[lester@sandbox benchmarking]$ ls
hive-testbench
[lester@sandbox benchmarking]$ du -sh *
272M  hive-testbench
It is easy enough to trim down the overall folder size by making a copy of it (just in case we delete too much!!) and then seeing what can be removed. If curious, MEP stands for Minimally Executable Product.
[lester@sandbox benchmarking]$ cp -R hive-testbench/ hive-testbench-MEP/
[lester@sandbox benchmarking]$ cd hive-testbench-MEP/
[lester@sandbox hive-testbench-MEP]$ du -sh *
5.8M  apache-maven-3.0.5
5.0M  apache-maven-3.0.5-bin.tar.gz
236K  ddl-tpcds
44K   ddl-tpch
8.0K  README.md
4.0K  runSuite.pl
288K  sample-queries-tpcds
104K  sample-queries-tpch
16K   settings
4.0K  tpcds-build.sh
61M   tpcds-gen
4.0K  tpcds-setup.sh
4.0K  tpch-build.sh
199M  tpch-gen
4.0K  tpch-setup.sh
[lester@sandbox hive-testbench-MEP]$ rm -r apache-maven-*
As shown above, we can start off by deleting the Maven elements, and then we need to focus on the "-gen" folders. Let's home in on the 61 MB tpcds-gen folder, where we can start off by deleting the zip files there.
[lester@sandbox hive-testbench-MEP]$ cd tpcds-gen
[lester@sandbox tpcds-gen]$ du -sh *
5.1M  DSTools-VERIFY.zip
4.0K  Makefile
28K   patches
4.0K  pom.xml
4.0K  README.md
32K   src
51M   target
5.1M  tpcds_kit.zip
[lester@sandbox tpcds-gen]$ rm *.zip
Next up, drill into the 51 MB target folder for some further purging, identified below.
[lester@sandbox tpcds-gen]$ cd target
[lester@sandbox target]$ du -sh *
40K   classes
8.0K  generated-sources
31M   lib
8.0K  maven-archiver
4.0K  surefire
5.6M  tools
12K   tpcds-gen-1.0-SNAPSHOT.jar
5.1M  tpcds_kit.zip
9.8M  TPCDSVersion1.3.1
[lester@sandbox target]$ rm -r TPCDSVersion1.3.1/
[lester@sandbox target]$ rm tpcds_kit.zip
[lester@sandbox target]$ rm -r tools
Some trial & error testing helped me figure out that there is just one file in the 31 MB lib folder that is needed after the project has been compiled/built, which brings the folder size way down.
[lester@sandbox target]$ mv ./lib/dsdgen.jar .
[lester@sandbox target]$ rm -r lib
[lester@sandbox target]$ mkdir lib
[lester@sandbox target]$ mv dsdgen.jar ./lib/dsdgen.jar
[lester@sandbox target]$ du -sh *
40K   classes
8.0K  generated-sources
1.5M  lib
8.0K  maven-archiver
4.0K  surefire
12K   tpcds-gen-1.0-SNAPSHOT.jar
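The mv/rm/mkdir/mv dance above can be wrapped in a small helper so we can repeat it for the TPC-H side without retyping. This is just a convenience sketch; `keep_only` is my own name for it, and the demo directory below is a throwaway stand-in for the real lib folder.

```shell
# keep_only DIR FILE -- delete everything inside DIR except FILE
keep_only() {
  dir="$1"; file="$2"
  mv "$dir/$file" .        # park the one file we need
  rm -r "$dir"             # remove the directory and everything else in it
  mkdir "$dir"             # recreate it empty
  mv "$file" "$dir/"       # move the kept file back into place
}

# Demo with a throwaway directory (the real run would be
# `keep_only lib dsdgen.jar` from inside tpcds-gen/target):
mkdir -p demo-lib
touch demo-lib/dsdgen.jar demo-lib/unneeded-1.jar demo-lib/unneeded-2.jar
keep_only demo-lib dsdgen.jar
ls demo-lib                # -> dsdgen.jar
```

The same call with dbgen.jar covers the tpch-gen/target trimming shown later.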
Now we can do basically the same thing for the TPC-H files & folders to shrink them from 199 MB down to 544 KB.
[lester@sandbox target]$ cd ../..
[lester@sandbox hive-testbench-MEP]$ du -sh *
236K  ddl-tpcds
44K   ddl-tpch
8.0K  README.md
4.0K  runSuite.pl
288K  sample-queries-tpcds
104K  sample-queries-tpch
16K   settings
4.0K  tpcds-build.sh
1.7M  tpcds-gen
4.0K  tpcds-setup.sh
4.0K  tpch-build.sh
199M  tpch-gen
4.0K  tpch-setup.sh
[lester@sandbox hive-testbench-MEP]$ cd tpch-gen
[lester@sandbox tpch-gen]$ du -sh *
12K   ddl
4.0K  Makefile
12K   patches
4.0K  pom.xml
4.0K  README.md
36K   src
173M  target
26M   tpch_kit.zip
[lester@sandbox tpch-gen]$ rm tpch_kit.zip
[lester@sandbox tpch-gen]$ cd target
[lester@sandbox target]$ du -sh *
44K   classes
8.0K  generated-sources
29M   lib
35M   __MACOSX
8.0K  maven-archiver
4.0K  surefire
4.0K  tools
84M   tpch_2_16_0
12K   tpch-gen-1.0-SNAPSHOT.jar
26M   tpch_kit.zip
[lester@sandbox target]$ rm tpch_kit.zip
[lester@sandbox target]$ rm -r tpch_2_16_0/
[lester@sandbox target]$ rm -r __MACOSX/
[lester@sandbox target]$ mv ./lib/dbgen.jar .
[lester@sandbox target]$ rm -r lib
[lester@sandbox target]$ mkdir lib
[lester@sandbox target]$ mv dbgen.jar ./lib/dbgen.jar
[lester@sandbox target]$ du -sh *
44K   classes
8.0K  generated-sources
384K  lib
8.0K  maven-archiver
4.0K  surefire
4.0K  tools
12K   tpch-gen-1.0-SNAPSHOT.jar
Now things are looking much better.
[lester@sandbox target]$ cd ../..
[lester@sandbox hive-testbench-MEP]$ du -sh *
236K  ddl-tpcds
44K   ddl-tpch
8.0K  README.md
4.0K  runSuite.pl
288K  sample-queries-tpcds
104K  sample-queries-tpch
16K   settings
4.0K  tpcds-build.sh
1.7M  tpcds-gen
4.0K  tpcds-setup.sh
4.0K  tpch-build.sh
544K  tpch-gen
4.0K  tpch-setup.sh
[lester@sandbox hive-testbench-MEP]$ cd ..
[lester@sandbox benchmarking]$ du -sh *
272M  hive-testbench
3.6M  hive-testbench-MEP
Definitely, 3.6 MB is MUCH tighter than 272 MB, and we can squeeze it just a bit more when we tar it all up.
[lester@sandbox benchmarking]$ tar -czf hive-testbench-MEP.tar.gz ./hive-testbench-MEP/
[lester@sandbox benchmarking]$ du -sh *
272M  hive-testbench
3.6M  hive-testbench-MEP
2.2M  hive-testbench-MEP.tar.gz
At this point we have an artifact we can unpack on another host to generate & load some Hive tables to run the bundled queries against. Of course, you would want to do this on a like-for-like system (primarily OS), which should not be a problem in a typical organization's multi-cluster environment. For this blog posting, I'm just going to unpack everything into a separate folder on the same HDP Sandbox instance.
[lester@sandbox benchmarking]$ cd ..
[lester@sandbox ~]$ mkdir benchmarkELSEWHERE
[lester@sandbox ~]$ cp benchmarking/hive-testbench-MEP.tar.gz ./benchmarkELSEWHERE/
[lester@sandbox ~]$ cd benchmarkELSEWHERE/
[lester@sandbox benchmarkELSEWHERE]$ ls
hive-testbench-MEP.tar.gz
[lester@sandbox benchmarkELSEWHERE]$ tar -xf hive-testbench-MEP.tar.gz
[lester@sandbox benchmarkELSEWHERE]$ ls
hive-testbench-MEP  hive-testbench-MEP.tar.gz
[lester@sandbox benchmarkELSEWHERE]$ cd hive-testbench-MEP
[lester@sandbox hive-testbench-MEP]$ ls
ddl-tpcds  runSuite.pl           settings        tpcds-setup.sh  tpch-setup.sh
ddl-tpch   sample-queries-tpcds  tpcds-build.sh  tpch-build.sh
README.md  sample-queries-tpch   tpcds-gen       tpch-gen
From here we can verify that we can generate some data.
[lester@sandbox hive-testbench-MEP]$ ./tpch-setup.sh 2
TPC-H text data generation complete.
Loading text data into external tables.
Optimizing table part (1/8).
Optimizing table partsupp (2/8).
Optimizing table supplier (3/8).
Optimizing table customer (4/8).
Optimizing table orders (5/8).
Optimizing table lineitem (6/8).
Optimizing table nation (7/8).
Optimizing table region (8/8).
Data loaded into database tpch_flat_orc_2.
[lester@sandbox hive-testbench-MEP]$ ./tpcds-setup.sh 2
TPC-DS text data generation complete.
Loading text data into external tables.
Optimizing table store_sales (1/24).
Optimizing table store_returns (2/24).
Optimizing table web_sales (3/24).
Optimizing table web_returns (4/24).
Optimizing table catalog_sales (5/24).
Optimizing table catalog_returns (6/24).
Optimizing table inventory (7/24).
Optimizing table date_dim (8/24).
Optimizing table time_dim (9/24).
Optimizing table item (10/24).
Optimizing table customer (11/24).
Optimizing table customer_demographics (12/24).
Optimizing table household_demographics (13/24).
Optimizing table customer_address (14/24).
Optimizing table store (15/24).
Optimizing table promotion (16/24).
Optimizing table warehouse (17/24).
Optimizing table ship_mode (18/24).
Optimizing table reason (19/24).
Optimizing table income_band (20/24).
Optimizing table call_center (21/24).
Optimizing table web_page (22/24).
Optimizing table catalog_page (23/24).
Optimizing table web_site (24/24).
Data loaded into database tpcds_bin_partitioned_orc_2.
Then we can verify the tables are present and have data loaded.
[lester@sandbox hive-testbench-MEP]$ hive
hive> show databases;
OK
default
tpcds_bin_partitioned_orc_2
tpcds_text_2
tpch_flat_orc_2
tpch_text_2
xademo
Time taken: 3.691 seconds, Fetched: 6 row(s)
hive> use tpch_flat_orc_2;
OK
Time taken: 0.31 seconds
hive> show tables;
OK
customer
lineitem
nation
orders
part
partsupp
region
supplier
Time taken: 0.439 seconds, Fetched: 8 row(s)
hive> desc nation;
OK
n_nationkey   int
n_name        string
n_regionkey   int
n_comment     string
Time taken: 0.681 seconds, Fetched: 4 row(s)
hive> select * from nation limit 1;
OK
0  ALGERIA  0  haggle. carefully final deposits detect slyly agai
Time taken: 1.09 seconds, Fetched: 1 row(s)
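If you want something a bit more thorough than eyeballing one row, a quick sanity check is to count rows in every TPC-H table. This is a sketch, not part of the testbench: `count_tables` is my own helper name, the database name assumes the scale-2 run above, and the script skips gracefully on a box where the hive CLI isn't on the PATH.

```shell
# count_tables DB TABLE... -- print a row count for each table in DB
count_tables() {
  db="$1"; shift
  for t in "$@"; do
    if command -v hive >/dev/null 2>&1; then
      # -e runs a single query and prints the result to stdout
      hive --database "$db" -e "select count(*) from $t;"
    else
      echo "hive not on PATH; would count $db.$t"
    fi
  done
}

# All eight TPC-H tables from the scale-2 database generated above:
count_tables tpch_flat_orc_2 customer lineitem nation orders part partsupp region supplier
```

Any table reporting zero rows means the corresponding setup step did not complete.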
More importantly, we can now run the actual TPC queries, such as those identified in the GitHub project's README file, although they are surely going to be slow on my little sandbox.
hive> source tpch_query1.sql;
Query ID = lester_20150709034949_285c833f-fea1-48ad-8e01-8c651b73150a
Total jobs = 1
Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1436365286495_0090)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      2          2        0        0       0       0
Reducer 2 ......   SUCCEEDED     48         48        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 46.97 s
--------------------------------------------------------------------------------
OK
A  F  7.5478173E7  1.1319733134601569E11  1.0753640820730757E11  1.118388987696159E11  25.505698877458507  38251.814164120944  0.05000395030241819  2959267
N  F  1966480.0  2.94611482674001E9  2.798796636156392E9  2.9110301630686007E9  25.530080751953886  38248.31650014294  0.04999558590603488  77026
N  O  1.49872794E8  2.2474848743041776E11  2.1351545694796832E11  2.2205919740072766E11  25.495187016222236  38232.454108058955  0.04997914935058362  5878474
R  F  7.5577628E7  1.1335191421816394E11  1.0768808181149153E11  1.1199430786622203E11  25.512150382609875  38263.32154391632  0.04997978002407437  2962417
Time taken: 53.591 seconds, Fetched: 4 row(s)
Now that everything is operational, what do you do next? Basically, you are looking to run these standardized queries to generate a baseline set of metrics that can be rerun when you make configuration changes and/or cluster size increases (i.e., rinse, lather, and repeat).
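One hedged way to build that baseline is a small loop that runs every bundled query file and appends wall-clock timings to a log you can diff after each tuning change. `run_baseline` is my own name for this sketch; the directory and database names assume the scale-2 TPC-H run from this post, and the hive CLI check lets it no-op on a box where hive isn't installed.

```shell
# run_baseline DB QUERYDIR LOG -- time every *.sql file in QUERYDIR
# against database DB, appending results to LOG
run_baseline() {
  db="$1"; querydir="$2"; log="$3"
  for q in "$querydir"/*.sql; do
    [ -e "$q" ] || continue          # nothing to do if the glob misses
    echo "== $q ==" >> "$log"
    if command -v hive >/dev/null 2>&1; then
      # timing goes to stderr (captured in the log); query output discarded
      { time hive --database "$db" -f "$q" > /dev/null; } 2>> "$log"
    else
      echo "hive not on PATH; skipped" >> "$log"
    fi
  done
}

# Typical invocation against the scale-2 TPC-H database from this post:
run_baseline tpch_flat_orc_2 sample-queries-tpch baseline.log
```

Re-run it with a dated log name after each change and diff the elapsed times to see whether your tuning actually helped.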
Happy Hadooping and happy benchmarking!!