README.md (+25 / -20 lines)
@@ -18,7 +18,7 @@

This benchmark suite contains 10 typical micro workloads. It also gives users options to enable input/output compression for most workloads, with zlib as the default compression codec. For some initial work based on this benchmark suite, please refer to the included ICDE workshop paper (WISS10_conf_full_011.pdf).

Note:

1. Since HiBench-2.2, the input data of the benchmarks are all automatically generated by their corresponding prepare scripts.
2. Since HiBench-3.0, YARN is supported.
3. Since HiBench-4.0, more workloads are implemented on both Hadoop MR and Spark. For Spark, three different APIs are supported: Scala, Java, and Python.
@@ -33,7 +33,7 @@ Note:

2. WordCount (wordcount)

    This workload counts the occurrence of each word in the input data, which is generated using RandomTextWriter. It is representative of another typical class of real-world MapReduce jobs: extracting a small amount of interesting data from a large data set.

3. TeraSort (terasort)

    TeraSort is a standard benchmark created by Jim Gray. Its input data is generated by the Hadoop TeraGen example program.
@@ -53,7 +53,7 @@ Note:

6. PageRank (pagerank)

    This workload benchmarks the PageRank algorithm implemented in the Spark-MLlib/Hadoop examples (a search-engine ranking benchmark included in Pegasus 2.0). The data source is generated from Web data whose hyperlinks follow the Zipfian distribution.

7. Nutch indexing (nutchindexing)

    Large-scale search indexing is one of the most significant uses of MapReduce. This workload tests the indexing sub-system of Nutch, a popular open-source (Apache project) search engine. The workload uses automatically generated Web data whose hyperlinks and words both follow Zipfian distributions with corresponding parameters. The dictionary used to generate the Web page texts is the default Linux dictionary file /usr/share/dict/linux.words.
@@ -75,13 +75,16 @@ Note:

10. enhanced DFSIO (dfsioe)

    Enhanced DFSIO tests the HDFS throughput of the Hadoop cluster by generating a large number of tasks that perform writes and reads simultaneously. It measures the average I/O rate of each map task, the average throughput of each map task, and the aggregated throughput of the HDFS cluster. Note: this benchmark does not have a corresponding Spark implementation.

**Supported hadoop/spark releases:**

- Apache release of Hadoop 1.x and Hadoop 2.x
- CDH4/CDH5 release of MR1 and MR2.
- HDP2.3
- Spark1.2
- Spark1.3

Note: No version of CDH supports SparkSQL. If you need SparkSQL, please download Spark from the official Apache Spark release page.

---
### Getting Started ###
@@ -93,39 +96,41 @@ Note:

Download/checkout the HiBench benchmark suite.

Run `<HiBench_Root>/bin/build-all.sh` to build HiBench.

Note: Beginning with HiBench 4.0, HiBench requires Python 2.x (>= 2.6).
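For a first-time setup, the two steps above usually amount to something like the following sketch; the repository URL is illustrative, so use whichever location you obtained HiBench from:

    # fetch HiBench and build every workload for all supported frameworks
    git clone https://github.com/intel-hadoop/HiBench.git   # illustrative URL
    cd HiBench
    bin/build-all.sh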
2. HiBench Configurations.

For the minimum requirements, create & edit `conf/99-user_defined_properties.conf`:

    hibench.hadoop.home      The Hadoop installation location
    hibench.spark.home       The Spark installation location
    hibench.hdfs.master      HDFS master
    hibench.spark.master     Spark master

Note: For YARN mode, set `hibench.spark.master` to `yarn-client`. (`yarn-cluster` is not supported yet.)

To run HiBench on HDP, set `hibench.hadoop.mapreduce.home` to the MapReduce home, which is normally "/usr/hdp/current/hadoop-mapreduce-client", and set `hibench.hadoop.release` to "hdp".
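To make the minimum configuration concrete, a filled-in `conf/99-user_defined_properties.conf` might look like the sketch below. The paths and host names are placeholders for illustration, not defaults shipped with HiBench:

    # illustrative values only – point these at your own installations
    hibench.hadoop.home      /opt/hadoop-2.6.0
    hibench.spark.home       /opt/spark-1.3.1
    hibench.hdfs.master      hdfs://namenode:8020
    hibench.spark.master     yarn-client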
3. Run

Execute `<HiBench_Root>/bin/run-all.sh` to run all workloads with all language APIs at the `large` data scale.
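Besides the all-in-one script, individual workloads can be prepared and run on their own. The sketch below assumes the HiBench 4.x per-workload layout (`workloads/<workload>/prepare/prepare.sh` plus `workloads/<workload>/<framework>/bin/run.sh`); the workload and framework names are only examples, so check your checkout if the paths differ:

    # run every workload listed in conf/benchmarks.lst at the configured scale
    bin/run-all.sh

    # or prepare and run a single workload (illustrative paths)
    workloads/wordcount/prepare/prepare.sh
    workloads/wordcount/mapreduce/bin/run.sh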
4. View the report:

Go to `<HiBench_Root>/report` to check the final report:

- `report/hibench.report`: Overall report about all workloads.
- `report/<workload>/<language APIs>/bench.log`: Raw logs on the client side.
- `report/<workload>/<language APIs>/monitor.html`: System utilization monitor results.
- `report/<workload>/<language APIs>/conf/<workload>.conf`: Generated environment-variable configuration for this workload.
- `report/<workload>/<language APIs>/conf/sparkbench/<workload>/sparkbench.conf`: Generated configuration for this workload, used for mapping to environment variables.
- `report/<workload>/<language APIs>/conf/sparkbench/<workload>/spark.conf`: Generated configuration for Spark.

[Optional] Execute `<HiBench_Root>/bin/report_gen_plot.py report/hibench.report` to generate report figures.

Note: `report_gen_plot.py` requires `python2.x` and `python-matplotlib`.
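Assuming at least one workload has finished, checking the results from the HiBench root looks roughly like this (the exact report contents depend on which workloads ran):

    cat report/hibench.report                       # plain-text summary of all finished runs
    bin/report_gen_plot.py report/hibench.report    # optional: render the summary as figures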
@@ -135,12 +140,12 @@ Note:

1. Parallelism, memory, executor number tuning:

    hibench.default.map.parallelism        Mapper numbers in MR, partition numbers in Spark
    hibench.default.shuffle.parallelism    Reducer numbers in MR, shuffle partition numbers in Spark
    hibench.yarn.executors.num             Number of executors in YARN mode
    hibench.yarn.executors.cores           Number of executor cores in YARN mode
    spark.executors.memory                 Executor memory, standalone or YARN mode
    spark.driver.memory                    Driver memory, standalone or YARN mode
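For example, on a small YARN cluster one might start from values like the following in `conf/99-user_defined_properties.conf`; the numbers are illustrative starting points to size to your own cluster, not recommendations from the HiBench authors:

    # illustrative tuning values – adjust to your cluster
    hibench.default.map.parallelism        96
    hibench.default.shuffle.parallelism    48
    hibench.yarn.executors.num             8
    hibench.yarn.executors.cores           4
    spark.executors.memory                 4G
    spark.driver.memory                    2G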
@@ -150,11 +155,11 @@ Note:

    hibench.compress.profile          Compression option, `enable` or `disable`
    hibench.compress.codec.profile    Compression codec, `snappy`, `lzo` or `default`

3. Data scale profile selection:

    hibench.scale.profile             Data scale profile: `tiny`, `small`, `large`, `huge`, `gigantic`, `bigdata`

You can add more data scale profiles in `conf/10-data-scale-profile.conf`. Please don't change `conf/00-default-properties.conf` unless you are sure about what you are doing.
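Putting the two knobs above together, enabling compression and selecting a larger data scale is just a matter of overriding the corresponding properties, for example in `conf/99-user_defined_properties.conf` (the values shown are one possible choice, not defaults):

    hibench.scale.profile             huge
    hibench.compress.profile          enable
    hibench.compress.codec.profile    snappy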
4. Configure for each workload or each language API:

@@ -166,7 +171,7 @@ Note:

    workloads/<workload>/<language APIs>/.../*.conf    Configuration for the various language APIs

2. For configuration files in the same folder, the loading sequence is sorted by configuration file name.

3. Values in later configuration files override earlier ones. (This is why the user-defined file is named `99-user_defined_properties.conf`: it sorts after `00-default-properties.conf` and `10-data-scale-profile.conf`, so its values win.)
@@ -189,7 +194,7 @@ Note:

    hibench.spark.version    spark1.3

6. Configuration for running workloads and language APIs:

The `conf/benchmarks.lst` file under the package folder defines the workloads to run when you execute the `bin/run-all.sh` script under the package folder. Each line in the list file specifies one ...
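As an illustration, assuming each line of `conf/benchmarks.lst` names one workload to run (which is what the sentence above appears to describe), a trimmed-down list might look like the following; the selection of workloads is arbitrary:

    wordcount
    terasort
    pagerank
    dfsioe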
@@ -227,7 +232,7 @@ Note:

You'll need to install numpy (version > 1.4) on the master and all slave nodes.

For CentOS (6.2+):

`yum install numpy`

For Ubuntu/Debian:
@@ -239,7 +244,7 @@ Note:

You'll need to install python-matplotlib (version > 0.9).
bin/functions/load-config.py (+22 / -16 lines)
@@ -233,7 +233,7 @@ def generate_optional_value(): # get some critical values from environment or m

     "UNKNOWN"
     HibenchConfRef["hibench.hadoop.release"] ="Inferred by: hadoop version, which is:\"%s\""%hadoop_version

-    assert HibenchConf["hibench.hadoop.release"] in ["cdh4", "cdh5", "apache"], "Unknown hadoop release. Auto probe failed, please override `hibench.hadoop.release` to explicitly define this property"
+    assert HibenchConf["hibench.hadoop.release"] in ["cdh4", "cdh5", "apache", "hdp"], "Unknown hadoop release. Auto probe failed, please override `hibench.hadoop.release` to explicitly define this property"

     # probe spark version
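If the auto-probe still cannot classify the Hadoop distribution, the assertion message above points at the escape hatch: override the property yourself. In the property style used by the README above, that would be a single line such as the following, where the value must be one of the strings accepted by the assert (e.g. "hdp" for HDP):

    hibench.hadoop.release    hdp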
@@ -260,18 +260,21 @@ def generate_optional_value(): # get some critical values from environment or m