README.md (+25 / -20 lines)
@@ -18,7 +18,7 @@

This benchmark suite contains 10 typical micro workloads. It also gives users options to enable input/output compression for most workloads, with zlib as the default compression codec. For some initial work based on this benchmark suite, please refer to the included ICDE workshop paper (WISS10_conf_full_011.pdf).

Note:

1. Since HiBench-2.2, the input data of the benchmarks are all automatically generated by their corresponding prepare scripts.
2. Since HiBench-3.0, YARN is supported.
3. Since HiBench-4.0, more workloads are implemented on both Hadoop MR and Spark. For Spark, three different APIs are supported: Scala, Java, and Python.
@@ -33,7 +33,7 @@ Note:

2. WordCount (wordcount)

    This workload counts the occurrence of each word in the input data, which is generated using RandomTextWriter. It is representative of another typical class of real-world MapReduce jobs: extracting a small amount of interesting data from a large data set.

3. TeraSort (terasort)

    TeraSort is a standard benchmark created by Jim Gray. Its input data is generated by the Hadoop TeraGen example program.
@@ -53,7 +53,7 @@ Note:

6. PageRank (pagerank)

    This workload benchmarks the PageRank algorithm implemented in the Spark-MLlib/Hadoop examples (a search-engine ranking benchmark included in Pegasus 2.0). The data source is generated from Web data whose hyperlinks follow the Zipfian distribution.

7. Nutch indexing (nutchindexing)

    Large-scale search indexing is one of the most significant uses of MapReduce. This workload tests the indexing sub-system of Nutch, a popular open-source (Apache project) search engine. The workload uses automatically generated Web data whose hyperlinks and words both follow Zipfian distributions with corresponding parameters. The dictionary used to generate the Web page texts is the default Linux dictionary file /usr/share/dict/linux.words.
@@ -75,13 +75,16 @@ Note:

10. enhanced DFSIO (dfsioe)

    Enhanced DFSIO tests the HDFS throughput of the Hadoop cluster by generating a large number of tasks that perform writes and reads simultaneously. It measures the average I/O rate of each map task, the average throughput of each map task, and the aggregated throughput of the HDFS cluster. Note: this benchmark does not have a corresponding Spark implementation.

**Supported hadoop/spark releases:**

- Apache release of Hadoop 1.x and Hadoop 2.x
- CDH4/CDH5 release of MR1 and MR2.
- HDP2.3
- Spark1.2
- Spark1.3

Note: No version of CDH supports SparkSQL. If you need SparkSQL, please download Spark from the official Apache Spark release page.

---
### Getting Started ###
@@ -93,39 +96,41 @@ Note:

Download/checkout the HiBench benchmark suite.

Run `<HiBench_Root>/bin/build-all.sh` to build HiBench.

Note: Beginning with HiBench 4.0, HiBench requires Python 2.x (>= 2.6).
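For a first-time setup, the two steps above usually amount to something like the following sketch; the repository URL is illustrative, so use whichever location you obtained HiBench from:

    # fetch HiBench and build every workload for all supported frameworks
    git clone https://github.com/intel-hadoop/HiBench.git   # illustrative URL
    cd HiBench
    bin/build-all.sh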
2. HiBench Configurations.

For the minimum requirements, create & edit `conf/99-user_defined_properties.conf`:

    hibench.hadoop.home      The Hadoop installation location
    hibench.spark.home       The Spark installation location
    hibench.hdfs.master      HDFS master
    hibench.spark.master     Spark master

Note: For YARN mode, set `hibench.spark.master` to `yarn-client`. (`yarn-cluster` is not supported yet.)

To run HiBench on HDP, set `hibench.hadoop.mapreduce.home` to the MapReduce home, which is normally "/usr/hdp/current/hadoop-mapreduce-client", and set `hibench.hadoop.release` to "hdp".
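To make the minimum configuration concrete, a filled-in `conf/99-user_defined_properties.conf` might look like the sketch below. The paths and host names are placeholders for illustration, not defaults shipped with HiBench:

    # illustrative values only – point these at your own installations
    hibench.hadoop.home      /opt/hadoop-2.6.0
    hibench.spark.home       /opt/spark-1.3.1
    hibench.hdfs.master      hdfs://namenode:8020
    hibench.spark.master     yarn-client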
3. Run

Execute `<HiBench_Root>/bin/run-all.sh` to run all workloads with all language APIs at the `large` data scale.
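Besides the all-in-one script, individual workloads can be prepared and run on their own. The sketch below assumes the HiBench 4.x per-workload layout (`workloads/<workload>/prepare/prepare.sh` plus `workloads/<workload>/<framework>/bin/run.sh`); the workload and framework names are only examples, so check your checkout if the paths differ:

    # run every workload listed in conf/benchmarks.lst at the configured scale
    bin/run-all.sh

    # or prepare and run a single workload (illustrative paths)
    workloads/wordcount/prepare/prepare.sh
    workloads/wordcount/mapreduce/bin/run.sh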
4. View the report:

Go to `<HiBench_Root>/report` to check the final report:

- `report/hibench.report`: Overall report about all workloads.
- `report/<workload>/<language APIs>/bench.log`: Raw logs on the client side.
- `report/<workload>/<language APIs>/monitor.html`: System utilization monitor results.
- `report/<workload>/<language APIs>/conf/<workload>.conf`: Generated environment-variable configuration for this workload.
- `report/<workload>/<language APIs>/conf/sparkbench/<workload>/sparkbench.conf`: Generated configuration for this workload, used for mapping to environment variables.
- `report/<workload>/<language APIs>/conf/sparkbench/<workload>/spark.conf`: Generated configuration for Spark.

[Optional] Execute `<HiBench_Root>/bin/report_gen_plot.py report/hibench.report` to generate report figures.

Note: `report_gen_plot.py` requires `python2.x` and `python-matplotlib`.
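Assuming at least one workload has finished, checking the results from the HiBench root looks roughly like this (the exact report contents depend on which workloads ran):

    cat report/hibench.report                       # plain-text summary of all finished runs
    bin/report_gen_plot.py report/hibench.report    # optional: render the summary as figures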
@@ -135,12 +140,12 @@ Note:

1. Parallelism, memory, executor number tuning:

    hibench.default.map.parallelism        Mapper numbers in MR, partition numbers in Spark
    hibench.default.shuffle.parallelism    Reducer numbers in MR, shuffle partition numbers in Spark
    hibench.yarn.executors.num             Number of executors in YARN mode
    hibench.yarn.executors.cores           Number of executor cores in YARN mode
    spark.executors.memory                 Executor memory, standalone or YARN mode
    spark.driver.memory                    Driver memory, standalone or YARN mode
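For example, on a small YARN cluster one might start from values like the following in `conf/99-user_defined_properties.conf`; the numbers are illustrative starting points to size to your own cluster, not recommendations from the HiBench authors:

    # illustrative tuning values – adjust to your cluster
    hibench.default.map.parallelism        96
    hibench.default.shuffle.parallelism    48
    hibench.yarn.executors.num             8
    hibench.yarn.executors.cores           4
    spark.executors.memory                 4G
    spark.driver.memory                    2G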
@@ -150,11 +155,11 @@ Note:

    hibench.compress.profile          Compression option, `enable` or `disable`
    hibench.compress.codec.profile    Compression codec, `snappy`, `lzo` or `default`

3. Data scale profile selection:

    hibench.scale.profile             Data scale profile: `tiny`, `small`, `large`, `huge`, `gigantic`, `bigdata`

You can add more data scale profiles in `conf/10-data-scale-profile.conf`. Please don't change `conf/00-default-properties.conf` unless you are sure about what you are doing.
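Putting the two knobs above together, enabling compression and selecting a larger data scale is just a matter of overriding the corresponding properties, for example in `conf/99-user_defined_properties.conf` (the values shown are one possible choice, not defaults):

    hibench.scale.profile             huge
    hibench.compress.profile          enable
    hibench.compress.codec.profile    snappy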
4. Configure for each workload or each language API:

@@ -166,7 +171,7 @@ Note:

    workloads/<workload>/<language APIs>/.../*.conf    Configuration for the various language APIs

2. For configuration files in the same folder, the loading sequence is sorted by configuration file name.

3. Values in later configuration files override earlier ones. (This is why the user-defined file is named `99-user_defined_properties.conf`: it sorts after `00-default-properties.conf` and `10-data-scale-profile.conf`, so its values win.)
@@ -189,7 +194,7 @@ Note:

    hibench.spark.version    spark1.3

6. Configuration for running workloads and language APIs:

The `conf/benchmarks.lst` file under the package folder defines the workloads to run when you execute the `bin/run-all.sh` script under the package folder. Each line in the list file specifies one ...
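As an illustration, assuming each line of `conf/benchmarks.lst` names one workload to run (which is what the sentence above appears to describe), a trimmed-down list might look like the following; the selection of workloads is arbitrary:

    wordcount
    terasort
    pagerank
    dfsioe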
@@ -227,7 +232,7 @@ Note:

You'll need to install numpy (version > 1.4) on the master and all slave nodes.

For CentOS (6.2+):

`yum install numpy`

For Ubuntu/Debian:
@@ -239,7 +244,7 @@ Note:

You'll need to install python-matplotlib (version > 0.9).
bin/functions/load-config.py (+22 / -16 lines)
@@ -233,7 +233,7 @@ def generate_optional_value(): # get some critical values from environment or m

     "UNKNOWN"
     HibenchConfRef["hibench.hadoop.release"] ="Inferred by: hadoop version, which is:\"%s\""%hadoop_version

-    assert HibenchConf["hibench.hadoop.release"] in ["cdh4", "cdh5", "apache"], "Unknown hadoop release. Auto probe failed, please override `hibench.hadoop.release` to explicitly define this property"
+    assert HibenchConf["hibench.hadoop.release"] in ["cdh4", "cdh5", "apache", "hdp"], "Unknown hadoop release. Auto probe failed, please override `hibench.hadoop.release` to explicitly define this property"

     # probe spark version
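If the auto-probe still cannot classify the Hadoop distribution, the assertion message above points at the escape hatch: override the property yourself. In the property style used by the README above, that would be a single line such as the following, where the value must be one of the strings accepted by the assert (e.g. "hdp" for HDP):

    hibench.hadoop.release    hdp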
@@ -260,18 +260,21 @@ def generate_optional_value(): # get some critical values from environment or m