
Commit 4ea725a: Update examples and docs for Spark 4 and Scala 2.13
1 parent 3192491

16 files changed: +72 -73 lines

README.md

Lines changed: 20 additions & 23 deletions

@@ -32,12 +32,10 @@ and spark-shell/pyspark environments.
 - [Demo](#demo)
 - [Examples of sparkMeasure on notebooks](#examples-of-sparkmeasure-on-notebooks)
 - [Examples of sparkMeasure on the CLI](#examples-of-sparkmeasure-on-the-cli)
-- [Command line example for Task Metrics](#command-line-example-for-task-metrics)
-- [Setting Up SparkMeasure with Spark](#setting-up-sparkmeasure-with-spark)
-- [Version Compatibility for SparkMeasure](#version-compatibility-for-sparkmeasure)
-- [Downloading SparkMeasure](#downloading-sparkmeasure)
-- [Including sparkMeasure in Your Spark Environment](#including-sparkmeasure-in-your-spark-environment)
-- [Setup Examples](#setup-examples)
+- [Setting up SparkMeasure with Spark](#setting-up-sparkmeasure-with-spark)
+- [Version compatibility for SparkMeasure](#version-compatibility-for-sparkmeasure)
+- [Downloading sparkMeasure](#downloading-sparkmeasure)
+- [Setup examples](#setup-examples)
 - [Notes on Metrics](#notes-on-metrics)
 - [Documentation and API reference](#documentation-api-and-examples)
 - [Architecture diagram](#architecture-diagram)

@@ -241,23 +239,6 @@ To get SparkMeasure, choose one of the following options:
 
 * Clone the repository and use sbt to build: `sbt +package`.
 
-### Including sparkMeasure in your Spark environment
-
-Choose your preferred method:
-
-* Use the `--packages` option:
-
-```bash
---packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
-```
-* Directly reference the JAR file:
-
-```bash
---jars /path/to/spark-measure_2.12-0.25.jar
---jars https://github.com/LucaCanali/sparkMeasure/releases/download/v0.25/spark-measure_2.12-0.25.jar
---conf spark.driver.extraClassPath=/path/to/spark-measure_2.12-0.25.jar
-```
-
 ### Setup Examples
 
 #### Spark 4 with Scala 2.13

@@ -279,6 +260,22 @@ Choose your preferred method:
 pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
 pip install sparkmeasure
 ```
+
+### Including sparkMeasure in your Spark environment
+
+Choose your preferred method:
+
+* Use the `--packages` option:
+
+```bash
+--packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
+```
+* Directly reference the JAR file:
+
+```bash
+--jars /path/to/spark-measure_2.13-0.25.jar
+--jars https://github.com/LucaCanali/sparkMeasure/releases/download/v0.25/spark-measure_2.13-0.25.jar
+--conf spark.driver.extraClassPath=/path/to/spark-measure_2.13-0.25.jar
+```
 
 ---
 ## Notes on Spark Metrics
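The substance of this commit is that the sparkMeasure Maven coordinate embeds the Scala binary version, and Spark 4 ships with Scala 2.13 only, hence every `_2.12` becomes `_2.13`. A small illustrative Python helper (not part of sparkMeasure) that assembles the coordinate used with `--packages`:

```python
def sparkmeasure_coordinate(scala_binary: str, version: str) -> str:
    """Build the Maven coordinate for the sparkMeasure artifact.

    The artifact name embeds the Scala binary version, which is why this
    commit rewrites _2.12 to _2.13: Spark 4 is built with Scala 2.13.
    """
    return f"ch.cern.sparkmeasure:spark-measure_{scala_binary}:{version}"


# Spark 4 (Scala 2.13) -> coordinate used throughout this commit
print(sparkmeasure_coordinate("2.13", "0.25"))
# Spark 3.x built with Scala 2.12 would instead need the _2.12 artifact
print(sparkmeasure_coordinate("2.12", "0.25"))
```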

docs/Flight_recorder_mode_FileSink.md

Lines changed: 10 additions & 10 deletions

@@ -12,7 +12,7 @@ Metrics can also be printed to stdout.
 ## Recording metrics using the Flight Recorder mode with Stage-level granularity
 To record metrics at the stage execution level granularity add these configurations to spark-submit:
 ```
---packages ch.cern.sparkmeasure:spark-measure_2.12:0.23
+--packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderStageMetrics
 ```
 

@@ -25,7 +25,7 @@ The usage is almost the same as for the stage metrics mode described above, just
 The configuration parameters applicable to Flight recorder mode for Task granularity are:
 
 ```
---packages ch.cern.sparkmeasure:spark-measure_2.12:0.23
+--packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderTaskMetrics
 ```
 

@@ -51,7 +51,7 @@ A Python example
 - This runs the pi.py example script
 - collects and saves the metrics to `/tmp/stageMetrics_flightRecorder` in json format:
 ```
-bin/spark-submit --master local[*] --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23 \
+bin/spark-submit --master local[*] --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25 \
 --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderStageMetrics \
 examples/src/main/python/pi.py
 ```

@@ -63,7 +63,7 @@ A Scala example
 - same example as above, in addition use a custom output filename
 - print metrics also to stdout
 ```
-bin/spark-submit --master local[*] --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23 \
+bin/spark-submit --master local[*] --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25 \
 --class org.apache.spark.examples.SparkPi \
 --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderStageMetrics \
 --conf spark.sparkmeasure.printToStdout=true \

@@ -80,7 +80,7 @@ This example collected metrics with Task granularity.
 (note: source the Hadoop environment before running this)
 ```
 bin/spark-submit --master yarn --deploy-mode cluster \
---packages ch.cern.sparkmeasure:spark-measure_2.12:0.25 \
+--packages ch.cern.sparkmeasure:spark-measure_2.13:0.25 \
 --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderTaskMetrics \
 --conf spark.sparkmeasure.outputFormat=json_to_hadoop \
 --conf spark.sparkmeasure.outputFilename="hdfs://myclustername/user/luca/test/myoutput_$(date +%s).json" \

@@ -90,13 +90,13 @@ examples/src/main/python/pi.py
 hdfs dfs -ls <path>/myoutput_*.json
 ```
 
-Example, use spark-3.3.0, Kubernetes, Scala 2.12 and write output to S3:
+Example, use Spark 4, Kubernetes, Scala 2.13 and write output to S3:
 (note: export KUBECONFIG=... + setup Hadoop environment + configure s3a keys in the script)
 ```
 bin/spark-submit --master k8s://https://XXX.XXX.XXX.XXX --deploy-mode client --conf spark.executor.instances=3 \
 --conf spark.executor.cores=2 --executor-memory 6g --driver-memory 8g \
---conf spark.kubernetes.container.image=<registry-URL>/spark:v3.0.0_20190529_hadoop32 \
---packages org.apache.hadoop:hadoop-aws:3.3.2,ch.cern.sparkmeasure:spark-measure_2.12:0.25 \
+--conf spark.kubernetes.container.image=apache/spark \
+--packages org.apache.hadoop:hadoop-aws:3.4.1,ch.cern.sparkmeasure:spark-measure_2.13:0.25 \
 --conf spark.hadoop.fs.s3a.secret.key="YYY..." \
 --conf spark.hadoop.fs.s3a.access.key="ZZZ..." \
 --conf spark.hadoop.fs.s3a.endpoint="https://s3.cern.ch" \

@@ -105,7 +105,7 @@ bin/spark-submit --master k8s://https://XXX.XXX.XXX.XXX --deploy-mode client --c
 --conf spark.sparkmeasure.outputFormat=json_to_hadoop \
 --conf spark.sparkmeasure.outputFilename="s3a://test/myoutput_$(date +%s).json" \
 --class org.apache.spark.examples.SparkPi \
-examples/jars/spark-examples_2.12-3.3.1.jar 10
+examples/jars/spark-examples_2.13-4.4.0.jar 10
 ```
 
 

@@ -115,7 +115,7 @@ To post-process the saved metrics you will need to deserialize objects saved by
 This is an example of how to do that using the supplied helper object sparkmeasure.Utils
 
 ```
-bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 
 val myMetrics = ch.cern.sparkmeasure.IOUtils.readSerializedStageMetricsJSON("/tmp/stageMetrics_flightRecorder")
 // use ch.cern.sparkmeasure.IOUtils.readSerializedStageMetrics("/tmp/stageMetrics.serialized") for java serialization
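Besides the `IOUtils` helpers shown above, the JSON file written by the flight recorder can be inspected with plain Python. A sketch under the assumption that the output is a JSON array of per-stage records with an `elapsedTime` field in milliseconds; the field names and sample values here are illustrative, not taken from a real run:

```python
import json

# Hypothetical sample mimicking the flight-recorder JSON output:
# a list of per-stage metric records (field names are assumptions).
sample = '[{"stageId": 1, "elapsedTime": 120}, {"stageId": 2, "elapsedTime": 340}]'

records = json.loads(sample)
total_ms = sum(r["elapsedTime"] for r in records)
print(f"stages: {len(records)}, total elapsedTime: {total_ms} ms")
```

In practice `sample` would be read from `/tmp/stageMetrics_flightRecorder` (or the HDFS/S3 path configured with `spark.sparkmeasure.outputFilename`).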

docs/Flight_recorder_mode_InfluxDBSink.md

Lines changed: 3 additions & 3 deletions

@@ -71,12 +71,12 @@ in spark-submit/spark-shell as in:
 - This example uses InfluxDB version 1.8 (using InfluxDB version 2 requires some changes in the example)
 ```
 # Alternative 1.
-# Use this if you plan to use the Spark dashboard as in
+# Use this if you plan to use the Spark dashboard v1 as in
 # https://github.com/cerndb/spark-dashboard
 docker run --name influx --network=host -d lucacanali/spark-dashboard:v01
 
 # Alternative 2.
-# Start InfluxDB, for example using a docker image
+# Start InfluxDB stand-alone, for example using a docker image
 docker run --name influx --network=host -d influxdb:1.8.10
 ```
 

@@ -87,7 +87,7 @@ bin/spark-shell \
 --conf spark.sparkmeasure.influxdbURL="http://localhost:8086" \
 --conf spark.extraListeners=ch.cern.sparkmeasure.InfluxDBSink,ch.cern.sparkmeasure.InfluxDBSinkExtended \
 --conf spark.sparkmeasure.influxdbStagemetrics=true
---packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+--packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 
 // run a Spark job, this will produce metrics
 spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show
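Once the sink has written metrics, InfluxDB 1.8 can be queried back over its HTTP `/query` endpoint. A sketch of building such a request URL with the Python standard library; the database and measurement names below are assumptions for illustration, not values read from the sink's configuration:

```python
from urllib.parse import urlencode

influxdb_url = "http://localhost:8086"  # matches spark.sparkmeasure.influxdbURL above
params = urlencode({
    "db": "sparkmeasure",                       # assumption: database name
    "q": "SELECT * FROM stages_ended LIMIT 5",  # assumption: measurement name
})
query_url = f"{influxdb_url}/query?{params}"
print(query_url)
```

The resulting URL can be fetched with `curl` or `urllib.request` to eyeball what the listener wrote.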

docs/Flight_recorder_mode_KafkaSink.md

Lines changed: 1 addition & 1 deletion

@@ -67,7 +67,7 @@ bin/spark-shell \
 --conf spark.extraListeners=ch.cern.sparkmeasure.KafkaSink \
 --conf spark.sparkmeasure.kafkaBroker=localhost:9092 \
 --conf spark.sparkmeasure.kafkaTopic=metrics
---packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+--packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 ```
 
 - Look at the metrics being written into Kafka:

docs/Flight_recorder_mode_PrometheusPushgatewaySink.md

Lines changed: 1 addition & 1 deletion

@@ -60,7 +60,7 @@ Examples:
 bin/spark-shell \
 --conf spark.extraListeners=ch.cern.sparkmeasure.PushGatewaySink \
 --conf spark.sparkmeasure.pushgateway=localhost:9091 \
---packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+--packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 ```
 
 - Look at the metrics being written to the Pushgateway
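For reference, the Pushgateway ingests metrics via HTTP at `/metrics/job/<job>` using the Prometheus text exposition format. A minimal sketch of the URL and payload shape; the job name and metric names below are hypothetical, chosen only to illustrate the format:

```python
# Pushgateway accepts PUT/POST at /metrics/job/<job> (grouping labels optional).
host = "localhost:9091"  # matches spark.sparkmeasure.pushgateway above
job = "sparkmeasure"     # assumption: job label used for the example
url = f"http://{host}/metrics/job/{job}"

# Text exposition format: one "name value" line per metric, newline-terminated.
body = "stageDuration 1234\nexecutorRunTime 4321\n"
print(url)
print(body, end="")
```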

docs/Instrument_Python_code.md

Lines changed: 7 additions & 7 deletions

@@ -11,7 +11,7 @@ You can find an example of how to instrument a Scala application running Apache
 
 How to run the example:
 ```
-bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23 <path_to_examples>/test_sparkmeasure_python.py
+bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25 <path_to_examples>/test_sparkmeasure_python.py
 ```
 
 Some relevant snippet of code are:

@@ -36,7 +36,7 @@ bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23 <path_t
 metrics = stagemetrics.aggregate_stagemetrics()
 print(f"metrics elapsedTime = {metrics.get('elapsedTime')}")
 
-# Introduced in sparkMeasure v0.21, memory metrics report:
+# memory metrics report:
 stageMetrics.print_memory_report()
 
 # save session metrics data in json format (default)

@@ -54,10 +54,10 @@ The details are discussed in the [examples for Python shell and notebook](https:
 
 - This is how to run sparkMeasure using a packaged version in Maven Central
 ```
-bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25 your_python_code.py
+bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25 your_python_code.py
 
 // alternative: just download and use the jar (it is only needed in the driver) as in:
-bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.12-0.24.jar ...
+bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.13-0.25.jar ...
 ```
 
 ### Download and build sparkMeasure (optional)

@@ -67,14 +67,14 @@ The details are discussed in the [examples for Python shell and notebook](https:
 git clone https://github.com/lucacanali/sparkmeasure
 cd sparkmeasure
 sbt +package
-ls -l target/scala-2.12/spark-measure*.jar # location of the compiled jar
+ls -l target/scala-2.13/spark-measure*.jar # location of the compiled jar
 
 cd python
 pip install .
 
 # Run as in one of these examples:
-bin/spark-submit --jars path>/spark-measure_2.12-0.25-SNAPSHOT.jar ...
+bin/spark-submit --jars <path>/spark-measure_2.13-0.26-SNAPSHOT.jar ...
 
 # alternative, set classpath for the driver (sparkmeasure code runs only in the driver)
-bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.12-0.25-SNAPSHOT.jar ...
+bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.13-0.26-SNAPSHOT.jar ...
 ```
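The begin/end/aggregate pattern shown in the snippets above can be exercised without a Spark cluster by standing in for `StageMetrics`; in real use the object comes from `from sparkmeasure import StageMetrics` and wraps a SparkSession as `StageMetrics(spark)`. Everything below the stand-in class is the same call sequence this doc describes:

```python
import time


class FakeStageMetrics:
    """Stand-in for sparkmeasure.StageMetrics so the pattern runs offline."""

    def begin(self):
        self._t0 = time.time()

    def end(self):
        self._t1 = time.time()

    def aggregate_stagemetrics(self):
        # the real sparkMeasure returns a dict of metrics such as 'elapsedTime' (ms)
        return {"elapsedTime": int((self._t1 - self._t0) * 1000)}


stagemetrics = FakeStageMetrics()
stagemetrics.begin()
sum(range(1_000_000))  # stand-in for a Spark action
stagemetrics.end()
metrics = stagemetrics.aggregate_stagemetrics()
print(f"metrics elapsedTime = {metrics.get('elapsedTime')}")
```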

docs/Instrument_Scala_code.md

Lines changed: 5 additions & 5 deletions

@@ -13,7 +13,7 @@ How to run the example:
 # build the example jar
 sbt package
 
-bin/spark-submit --master local[*] --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23 --class ch.cern.testSparkMeasure.testSparkMeasure <path_to_the_example_jar>/testsparkmeasurescala_2.12-0.1.jar
+bin/spark-submit --master local[*] --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25 --class ch.cern.testSparkMeasure.testSparkMeasure <path_to_the_example_jar>/testsparkmeasurescala_2.13-0.1.jar
 ```
 
 ### Collect and save Stage Metrics

@@ -71,10 +71,10 @@ See details at: [Prometheus Pushgateway](Prometheus.md)
 
 - This is how to run sparkMeasure using a packaged version in Maven Central
 ```
-bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 
 // or just download and use the jar (it is only needed in the driver) as in:
-bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.12-0.24.jar ...
+bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.13-0.25.jar ...
 ```
 - The alternative, see paragraph above, is to build a jar from master (See below).
 

@@ -88,8 +88,8 @@ See details at: [Prometheus Pushgateway](Prometheus.md)
 ls -l target/scala-2.12/spark-measure*.jar # location of the compiled jar
 
 # Run as in one of these examples:
-bin/spark-submit --jars path>/spark-measure_2.12-0.25-SNAPSHOT.jar
+bin/spark-submit --jars <path>/spark-measure_2.13-0.26-SNAPSHOT.jar
 
 # alternative, set classpath for the driver (it is only needed in the driver)
-bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.12-0.25-SNAPSHOT.jar ...
+bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.13-0.26-SNAPSHOT.jar ...
 ```

docs/Prometheus.md

Lines changed: 2 additions & 2 deletions

@@ -35,7 +35,7 @@ https://prometheus.io/docs/instrumenting/exposition_formats/
 
 1. Measure metrics at the Stage level (example in Scala):
 ```
-bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 
 val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
 stageMetrics.begin()

@@ -84,4 +84,4 @@ Added method:
 * def sendReportPrometheus(serverIPnPort: String, metricsJob: String,
 labelName: String = sparkSession.sparkContext.appName,
 labelValue: String = sparkSession.sparkContext.applicationId): Unit -> send metrics to prometheus pushgateway
-
+
docs/Python_shell_and_Jupyter.md

Lines changed: 6 additions & 5 deletions

@@ -8,25 +8,26 @@ See also [README](../README.md) for an introduction to sparkMeasure and its arch
 
 - Use PyPi to install the Python wrapper and take the jar from Maven central:
 ```
+pip install pyspark # Spark 4
 pip install sparkmeasure
-bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 ```
 - If you prefer to build from the latest development version:
 ```
 git clone https://github.com/lucacanali/sparkmeasure
 cd sparkmeasure
 sbt +package
-ls -l target/scala-2.12/spark-measure*.jar # note location of the compiled and packaged jar
+ls -l target/scala-2.13/spark-measure*.jar # note location of the compiled and packaged jar
 
 # Install the Python wrapper package
 cd python
 pip install .
 
 # Run as in one of these examples:
-bin/pyspark --jars path>/spark-measure_2.12-0.24-SNAPSHOT.jar
+bin/pyspark --jars <path>/spark-measure_2.13-0.26-SNAPSHOT.jar
 
 #Alternative:
-bin/pyspark --conf spark.driver.extraClassPath=<path>/spark-measure_2.12-0.24-SNAPSHOT.jar
+bin/pyspark --conf spark.driver.extraClassPath=<path>/spark-measure_2.13-0.26-SNAPSHOT.jar
 ```
 

@@ -155,4 +156,4 @@ Stage 3 JVMHeapMemory maxVal bytes => 279558120 (266.6 MB)
 Stage 3 OnHeapExecutionMemory maxVal bytes => 0 (0 Bytes)
 ```
-
+

docs/Scala_shell_and_notebooks.md

Lines changed: 5 additions & 5 deletions

@@ -8,10 +8,10 @@ See also [README](../README.md) for an introduction to sparkMeasure and its arch
 
 - The alternative, see paragraph above, is to build a jar from master.
 ```
-bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 
 // or just download and use the jar (it is only needed in the driver) as in:
-bin/spark-shell --conf spark.driver.extraClassPath=<path>/spark-measure_2.12-0.24.jar
+bin/spark-shell --conf spark.driver.extraClassPath=<path>/spark-measure_2.13-0.25.jar
 ```
 
 ### Download and build sparkMeasure (optional)

@@ -21,10 +21,10 @@ See also [README](../README.md) for an introduction to sparkMeasure and its arch
 git clone https://github.com/lucacanali/sparkmeasure
 cd sparkmeasure
 sbt +package
-ls -l target/scala-2.12/spark-measure*.jar # location of the compiled jar
+ls -l target/scala-2.13/spark-measure*.jar # location of the compiled jar
 
 # Run as in one of these examples:
-bin/spark-shell --jars <path>/spark-measure_2.12-0.24-SNAPSHOT.jar
+bin/spark-shell --jars <path>/spark-measure_2.13-0.26-SNAPSHOT.jar
 
 # Alternative, set classpath for the driver (the JAR is only needed in the driver)
 bin/spark-shell --conf spark.driver.extraClassPath=<path>/spark-measure_2.11-0.24-SNAPSHOT.jar

@@ -34,7 +34,7 @@ See also [README](../README.md) for an introduction to sparkMeasure and its arch
 
 1. Measure metrics at the Stage level, a basic example:
 ```
-bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25
+bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25
 
 val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
 stageMetrics.runAndMeasure(spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show)
