
Commit 4ea725a: Update examples and docs for Spark 4 and Scala 2.13
1 parent 3192491

16 files changed: +72 -73 lines

README.md

Lines changed: 20 additions & 23 deletions

@@ -32,12 +32,10 @@ and spark-shell/pyspark environments.
 - [Demo](#demo)
 - [Examples of sparkMeasure on notebooks](#examples-of-sparkmeasure-on-notebooks)
 - [Examples of sparkMeasure on the CLI](#examples-of-sparkmeasure-on-the-cli)
-- [Command line example for Task Metrics](#command-line-example-for-task-metrics)
-- [Setting Up SparkMeasure with Spark](#setting-up-sparkmeasure-with-spark)
-- [Version Compatibility for SparkMeasure](#version-compatibility-for-sparkmeasure)
-- [Downloading SparkMeasure](#downloading-sparkmeasure)
-- [Including sparkMeasure in Your Spark Environment](#including-sparkmeasure-in-your-spark-environment)
-- [Setup Examples](#setup-examples)
+- [Setting up SparkMeasure with Spark](#setting-up-sparkmeasure-with-spark)
+- [Version compatibility for SparkMeasure](#version-compatibility-for-sparkmeasure)
+- [Downloading sparkMeasure](#downloading-sparkmeasure)
+- [Setup examples](#setup-examples)
 - [Notes on Metrics](#notes-on-metrics)
 - [Documentation and API reference](#documentation-api-and-examples)
 - [Architecture diagram](#architecture-diagram)

@@ -241,23 +239,6 @@ To get SparkMeasure, choose one of the following options:
 
 * Clone the repository and use sbt to build: `sbt +package`.
 
-### Including sparkMeasure in your Spark environment
-
-Choose your preferred method:
-
-* Use the `--packages` option:
-
-```bash
---packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
-```
-* Directly reference the JAR file:
-
-```bash
---jars /path/to/spark-measure_2.12-0.25.jar
---jars https://github.com/LucaCanali/sparkMeasure/releases/download/v0.25/spark-measure_2.12-0.25.jar
---conf spark.driver.extraClassPath=/path/to/spark-measure_2.12-0.25.jar
-```
-
 ### Setup Examples
 
 #### Spark 4 with Scala 2.13

@@ -279,6 +260,22 @@ Choose your preferred method:
 pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
 pip install sparkmeasure
 ```
+
+### Including sparkMeasure in your Spark environment
+
+Choose your preferred method:
+
+* Use the `--packages` option:
+
+```bash
+--packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
+```
+* Directly reference the JAR file:
+
+```bash
+--jars /path/to/spark-measure_2.13-0.25.jar
+--jars https://github.com/LucaCanali/sparkMeasure/releases/download/v0.25/spark-measure_2.13-0.25.jar
+--conf spark.driver.extraClassPath=/path/to/spark-measure_2.13-0.25.jar
+```
 
 ---
 ## Notes on Spark Metrics
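The substance of this commit is that the sparkMeasure Maven coordinate embeds the Scala binary version, and Spark 4 ships with Scala 2.13 only, hence every `_2.12` becomes `_2.13`. A small illustrative Python helper (not part of sparkMeasure) that assembles the coordinate used with `--packages`:

```python
def sparkmeasure_coordinate(scala_binary: str, version: str) -> str:
    """Build the Maven coordinate for the sparkMeasure artifact.

    The artifact name embeds the Scala binary version, which is why this
    commit rewrites _2.12 to _2.13: Spark 4 is built with Scala 2.13.
    """
    return f"ch.cern.sparkmeasure:spark-measure_{scala_binary}:{version}"


# Spark 4 (Scala 2.13) -> coordinate used throughout this commit
print(sparkmeasure_coordinate("2.13", "0.25"))
# Spark 3.x built with Scala 2.12 would instead need the _2.12 artifact
print(sparkmeasure_coordinate("2.12", "0.25"))
```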

docs/Flight_recorder_mode_FileSink.md

Lines changed: 10 additions & 10 deletions

@@ -12,7 +12,7 @@ Metrics can also be printed to stdout.
 ## Recording metrics using the Flight Recorder mode with Stage-level granularity
 To record metrics at the stage execution level granularity add these configurations to spark-submit:
 ```
---packages ch.cern.sparkmeasure:spark-measure_2.12:0.23
+--packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderStageMetrics
 ```
 

@@ -25,7 +25,7 @@ The usage is almost the same as for the stage metrics mode described above, just
 The configuration parameters applicable to Flight recorder mode for Task granularity are:
 
 ```
---packages ch.cern.sparkmeasure:spark-measure_2.12:0.23
+--packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderTaskMetrics
 ```
 

@@ -51,7 +51,7 @@ A Python example
 - This runs the pi.py example script
 - collects and saves the metrics to `/tmp/stageMetrics_flightRecorder` in json format:
 ```
-bin/spark-submit --master local[*] --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23 \
+bin/spark-submit --master local[*] --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25 \
 --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderStageMetrics \
 examples/src/main/python/pi.py
 ```

@@ -63,7 +63,7 @@ A Scala example
 - same example as above, in addition use a custom output filename
 - print metrics also to stdout
 ```
-bin/spark-submit --master local[*] --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23 \
+bin/spark-submit --master local[*] --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25 \
 --class org.apache.spark.examples.SparkPi \
 --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderStageMetrics \
 --conf spark.sparkmeasure.printToStdout=true \

@@ -80,7 +80,7 @@ This example collected metrics with Task granularity.
 (note: source the Hadoop environment before running this)
 ```
 bin/spark-submit --master yarn --deploy-mode cluster \
---packages ch.cern.sparkmeasure:spark-measure_2.12:0.25 \
+--packages ch.cern.sparkmeasure:spark-measure_2.13:0.25 \
 --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderTaskMetrics \
 --conf spark.sparkmeasure.outputFormat=json_to_hadoop \
 --conf spark.sparkmeasure.outputFilename="hdfs://myclustername/user/luca/test/myoutput_$(date +%s).json" \

@@ -90,13 +90,13 @@ examples/src/main/python/pi.py
 hdfs dfs -ls <path>/myoutput_*.json
 ```
 
-Example, use spark-3.3.0, Kubernetes, Scala 2.12 and write output to S3:
+Example, use Spark 4, Kubernetes, Scala 2.13 and write output to S3:
 (note: export KUBECONFIG=... + setup Hadoop environment + configure s3a keys in the script)
 ```
 bin/spark-submit --master k8s://https://XXX.XXX.XXX.XXX --deploy-mode client --conf spark.executor.instances=3 \
 --conf spark.executor.cores=2 --executor-memory 6g --driver-memory 8g \
---conf spark.kubernetes.container.image=<registry-URL>/spark:v3.0.0_20190529_hadoop32 \
---packages org.apache.hadoop:hadoop-aws:3.3.2,ch.cern.sparkmeasure:spark-measure_2.12:0.25 \
+--conf spark.kubernetes.container.image=apache/spark \
+--packages org.apache.hadoop:hadoop-aws:3.4.1,ch.cern.sparkmeasure:spark-measure_2.13:0.25 \
 --conf spark.hadoop.fs.s3a.secret.key="YYY..." \
 --conf spark.hadoop.fs.s3a.access.key="ZZZ..." \
 --conf spark.hadoop.fs.s3a.endpoint="https://s3.cern.ch" \

@@ -105,7 +105,7 @@ bin/spark-submit --master k8s://https://XXX.XXX.XXX.XXX --deploy-mode client --c
 --conf spark.sparkmeasure.outputFormat=json_to_hadoop \
 --conf spark.sparkmeasure.outputFilename="s3a://test/myoutput_$(date +%s).json" \
 --class org.apache.spark.examples.SparkPi \
-examples/jars/spark-examples_2.12-3.3.1.jar 10
+examples/jars/spark-examples_2.13-4.4.0.jar 10
 ```
 
 

@@ -115,7 +115,7 @@ To post-process the saved metrics you will need to deserialize objects saved by
 This is an example of how to do that using the supplied helper object sparkmeasure.Utils
 
 ```
-bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 
 val myMetrics = ch.cern.sparkmeasure.IOUtils.readSerializedStageMetricsJSON("/tmp/stageMetrics_flightRecorder")
 // use ch.cern.sparkmeasure.IOUtils.readSerializedStageMetrics("/tmp/stageMetrics.serialized") for java serialization
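Besides the `IOUtils` helpers shown above, the JSON file written by the flight recorder can be inspected with plain Python. A sketch under the assumption that the output is a JSON array of per-stage records with an `elapsedTime` field in milliseconds; the field names and sample values here are illustrative, not taken from a real run:

```python
import json

# Hypothetical sample mimicking the flight-recorder JSON output:
# a list of per-stage metric records (field names are assumptions).
sample = '[{"stageId": 1, "elapsedTime": 120}, {"stageId": 2, "elapsedTime": 340}]'

records = json.loads(sample)
total_ms = sum(r["elapsedTime"] for r in records)
print(f"stages: {len(records)}, total elapsedTime: {total_ms} ms")
```

In practice `sample` would be read from `/tmp/stageMetrics_flightRecorder` (or the HDFS/S3 path configured with `spark.sparkmeasure.outputFilename`).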

docs/Flight_recorder_mode_InfluxDBSink.md

Lines changed: 3 additions & 3 deletions

@@ -71,12 +71,12 @@ in spark-submit/spark-shell as in:
 - This example uses InfluxDB version 1.8 (using InfluxDB version 2 requires some changes in the example)
 ```
 # Alternative 1.
-# Use this if you plan to use the Spark dashboard as in
+# Use this if you plan to use the Spark dashboard v1 as in
 # https://github.com/cerndb/spark-dashboard
 docker run --name influx --network=host -d lucacanali/spark-dashboard:v01
 
 # Alternative 2.
-# Start InfluxDB, for example using a docker image
+# Start InfluxDB stand-alone, for example using a docker image
 docker run --name influx --network=host -d influxdb:1.8.10
 ```
 

@@ -87,7 +87,7 @@ bin/spark-shell \
 --conf spark.sparkmeasure.influxdbURL="http://localhost:8086" \
 --conf spark.extraListeners=ch.cern.sparkmeasure.InfluxDBSink,ch.cern.sparkmeasure.InfluxDBSinkExtended \
 --conf spark.sparkmeasure.influxdbStagemetrics=true
---packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+--packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 
 // run a Spark job, this will produce metrics
 spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show
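Once the sink has written metrics, InfluxDB 1.8 can be queried back over its HTTP `/query` endpoint. A sketch of building such a request URL with the Python standard library; the database and measurement names below are assumptions for illustration, not values read from the sink's configuration:

```python
from urllib.parse import urlencode

influxdb_url = "http://localhost:8086"  # matches spark.sparkmeasure.influxdbURL above
params = urlencode({
    "db": "sparkmeasure",                       # assumption: database name
    "q": "SELECT * FROM stages_ended LIMIT 5",  # assumption: measurement name
})
query_url = f"{influxdb_url}/query?{params}"
print(query_url)
```

The resulting URL can be fetched with `curl` or `urllib.request` to eyeball what the listener wrote.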

docs/Flight_recorder_mode_KafkaSink.md

Lines changed: 1 addition & 1 deletion

@@ -67,7 +67,7 @@ bin/spark-shell \
 --conf spark.extraListeners=ch.cern.sparkmeasure.KafkaSink \
 --conf spark.sparkmeasure.kafkaBroker=localhost:9092 \
 --conf spark.sparkmeasure.kafkaTopic=metrics
---packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+--packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 ```
 
 - Look at the metrics being written into Kafka:

docs/Flight_recorder_mode_PrometheusPushgatewaySink.md

Lines changed: 1 addition & 1 deletion

@@ -60,7 +60,7 @@ Examples:
 bin/spark-shell \
 --conf spark.extraListeners=ch.cern.sparkmeasure.PushGatewaySink \
 --conf spark.sparkmeasure.pushgateway=localhost:9091 \
---packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+--packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 ```
 
 - Look at the metrics being written to the Pushgateway
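For reference, the Pushgateway ingests metrics via HTTP at `/metrics/job/<job>` using the Prometheus text exposition format. A minimal sketch of the URL and payload shape; the job name and metric names below are hypothetical, chosen only to illustrate the format:

```python
# Pushgateway accepts PUT/POST at /metrics/job/<job> (grouping labels optional).
host = "localhost:9091"  # matches spark.sparkmeasure.pushgateway above
job = "sparkmeasure"     # assumption: job label used for the example
url = f"http://{host}/metrics/job/{job}"

# Text exposition format: one "name value" line per metric, newline-terminated.
body = "stageDuration 1234\nexecutorRunTime 4321\n"
print(url)
print(body, end="")
```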

docs/Instrument_Python_code.md

Lines changed: 7 additions & 7 deletions

@@ -11,7 +11,7 @@ You can find an example of how to instrument a Scala application running Apache
 
 How to run the example:
 ```
-bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23 <path_to_examples>/test_sparkmeasure_python.py
+bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25 <path_to_examples>/test_sparkmeasure_python.py
 ```
 
 Some relevant snippet of code are:

@@ -36,7 +36,7 @@ bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23 <path_t
 metrics = stagemetrics.aggregate_stagemetrics()
 print(f"metrics elapsedTime = {metrics.get('elapsedTime')}")
 
-# Introduced in sparkMeasure v0.21, memory metrics report:
+# memory metrics report:
 stageMetrics.print_memory_report()
 
 # save session metrics data in json format (default)

@@ -54,10 +54,10 @@ The details are discussed in the [examples for Python shell and notebook](https:
 
 - This is how to run sparkMeasure using a packaged version in Maven Central
 ```
-bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25 your_python_code.py
+bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25 your_python_code.py
 
 // alternative: just download and use the jar (it is only needed in the driver) as in:
-bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.12-0.24.jar ...
+bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.13-0.25.jar ...
 ```
 
 ### Download and build sparkMeasure (optional)

@@ -67,14 +67,14 @@ The details are discussed in the [examples for Python shell and notebook](https:
 git clone https://github.com/lucacanali/sparkmeasure
 cd sparkmeasure
 sbt +package
-ls -l target/scala-2.12/spark-measure*.jar # location of the compiled jar
+ls -l target/scala-2.13/spark-measure*.jar # location of the compiled jar
 
 cd python
 pip install .
 
 # Run as in one of these examples:
-bin/spark-submit --jars path>/spark-measure_2.12-0.25-SNAPSHOT.jar ...
+bin/spark-submit --jars <path>/spark-measure_2.13-0.26-SNAPSHOT.jar ...
 
 # alternative, set classpath for the driver (sparkmeasure code runs only in the driver)
-bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.12-0.25-SNAPSHOT.jar ...
+bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.13-0.26-SNAPSHOT.jar ...
 ```
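The begin/end/aggregate pattern shown in the snippets above can be exercised without a Spark cluster by standing in for `StageMetrics`; in real use the object comes from `from sparkmeasure import StageMetrics` and wraps a SparkSession as `StageMetrics(spark)`. Everything below the stand-in class is the same call sequence this doc describes:

```python
import time


class FakeStageMetrics:
    """Stand-in for sparkmeasure.StageMetrics so the pattern runs offline."""

    def begin(self):
        self._t0 = time.time()

    def end(self):
        self._t1 = time.time()

    def aggregate_stagemetrics(self):
        # the real sparkMeasure returns a dict of metrics such as 'elapsedTime' (ms)
        return {"elapsedTime": int((self._t1 - self._t0) * 1000)}


stagemetrics = FakeStageMetrics()
stagemetrics.begin()
sum(range(1_000_000))  # stand-in for a Spark action
stagemetrics.end()
metrics = stagemetrics.aggregate_stagemetrics()
print(f"metrics elapsedTime = {metrics.get('elapsedTime')}")
```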

docs/Instrument_Scala_code.md

Lines changed: 5 additions & 5 deletions

@@ -13,7 +13,7 @@ How to run the example:
 # build the example jar
 sbt package
 
-bin/spark-submit --master local[*] --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23 --class ch.cern.testSparkMeasure.testSparkMeasure <path_to_the_example_jar>/testsparkmeasurescala_2.12-0.1.jar
+bin/spark-submit --master local[*] --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25 --class ch.cern.testSparkMeasure.testSparkMeasure <path_to_the_example_jar>/testsparkmeasurescala_2.13-0.1.jar
 ```
 
 ### Collect and save Stage Metrics

@@ -71,10 +71,10 @@ See details at: [Prometheus Pushgateway](Prometheus.md)
 
 - This is how to run sparkMeasure using a packaged version in Maven Central
 ```
-bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 
 // or just download and use the jar (it is only needed in the driver) as in:
-bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.12-0.24.jar ...
+bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.13-0.25.jar ...
 ```
 - The alternative, see paragraph above, is to build a jar from master (See below).
 

@@ -88,8 +88,8 @@ See details at: [Prometheus Pushgateway](Prometheus.md)
 ls -l target/scala-2.12/spark-measure*.jar # location of the compiled jar
 
 # Run as in one of these examples:
-bin/spark-submit --jars path>/spark-measure_2.12-0.25-SNAPSHOT.jar
+bin/spark-submit --jars <path>/spark-measure_2.13-0.26-SNAPSHOT.jar
 
 # alternative, set classpath for the driver (it is only needed in the driver)
-bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.12-0.25-SNAPSHOT.jar ...
+bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.13-0.26-SNAPSHOT.jar ...
 ```

docs/Prometheus.md

Lines changed: 2 additions & 2 deletions

@@ -35,7 +35,7 @@ https://prometheus.io/docs/instrumenting/exposition_formats/
 
 1. Measure metrics at the Stage level (example in Scala):
 ```
-bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 
 val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
 stageMetrics.begin()

@@ -84,4 +84,4 @@ Added method:
 * def sendReportPrometheus(serverIPnPort: String, metricsJob: String,
 labelName: String = sparkSession.sparkContext.appName,
 labelValue: String = sparkSession.sparkContext.applicationId): Unit -> send metrics to prometheus pushgateway
-
+
docs/Python_shell_and_Jupyter.md

Lines changed: 6 additions & 5 deletions

@@ -8,25 +8,26 @@ See also [README](../README.md) for an introduction to sparkMeasure and its arch
 
 - Use PyPi to install the Python wrapper and take the jar from Maven central:
 ```
+pip install pyspark # Spark 4
 pip install sparkmeasure
-bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 ```
 - If you prefer to build from the latest development version:
 ```
 git clone https://github.com/lucacanali/sparkmeasure
 cd sparkmeasure
 sbt +package
-ls -l target/scala-2.12/spark-measure*.jar # note location of the compiled and packaged jar
+ls -l target/scala-2.13/spark-measure*.jar # note location of the compiled and packaged jar
 
 # Install the Python wrapper package
 cd python
 pip install .
 
 # Run as in one of these examples:
-bin/pyspark --jars path>/spark-measure_2.12-0.24-SNAPSHOT.jar
+bin/pyspark --jars <path>/spark-measure_2.13-0.26-SNAPSHOT.jar
 
 #Alternative:
-bin/pyspark --conf spark.driver.extraClassPath=<path>/spark-measure_2.12-0.24-SNAPSHOT.jar
+bin/pyspark --conf spark.driver.extraClassPath=<path>/spark-measure_2.13-0.26-SNAPSHOT.jar
 ```
 

@@ -155,4 +156,4 @@ Stage 3 JVMHeapMemory maxVal bytes => 279558120 (266.6 MB)
 Stage 3 OnHeapExecutionMemory maxVal bytes => 0 (0 Bytes)
 ```
-
+

docs/Scala_shell_and_notebooks.md

Lines changed: 5 additions & 5 deletions

@@ -8,10 +8,10 @@ See also [README](../README.md) for an introduction to sparkMeasure and its arch
 
 - The alternative, see paragraph above, is to build a jar from master.
 ```
-bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 
 // or just download and use the jar (it is only needed in the driver) as in:
-bin/spark-shell --conf spark.driver.extraClassPath=<path>/spark-measure_2.12-0.24.jar
+bin/spark-shell --conf spark.driver.extraClassPath=<path>/spark-measure_2.13-0.25.jar
 ```
 
 ### Download and build sparkMeasure (optional)

@@ -21,10 +21,10 @@ See also [README](../README.md) for an introduction to sparkMeasure and its arch
 git clone https://github.com/lucacanali/sparkmeasure
 cd sparkmeasure
 sbt +package
-ls -l target/scala-2.12/spark-measure*.jar # location of the compiled jar
+ls -l target/scala-2.13/spark-measure*.jar # location of the compiled jar
 
 # Run as in one of these examples:
-bin/spark-shell --jars <path>/spark-measure_2.12-0.24-SNAPSHOT.jar
+bin/spark-shell --jars <path>/spark-measure_2.13-0.26-SNAPSHOT.jar
 
 # Alternative, set classpath for the driver (the JAR is only needed in the driver)
 bin/spark-shell --conf spark.driver.extraClassPath=<path>/spark-measure_2.11-0.24-SNAPSHOT.jar

@@ -34,7 +34,7 @@ See also [README](../README.md) for an introduction to sparkMeasure and its arch
 
 1. Measure metrics at the Stage level, a basic example:
 ```
-bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 
 val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
 stageMetrics.runAndMeasure(spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show)
