Replies: 5 comments 3 replies
-
If you think about the case where the same plan is used multiple times with the same data sources, it should already create only one ExecutionPlan on the server. The plan id is a hash of the plan itself, so two identical plans will have the same id. The problem is that small changes in the plan may cause the ids to differ. See: AbsaOSS/spline#893
If you are talking about identifying the same operation done on different data sources, that is more difficult. It would mean looking for graph isomorphisms: comparing the graphs and checking for the same operations while ignoring the unimportant metadata. Alternatively, users could just tag their plans with some kind of id/name to identify the same jobs.
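To illustrate the hash-based identity, here is a toy sketch in plain Python (not Spline's actual hashing code; the plan dictionaries are made up): two structurally identical plans get the same id, and any small change yields a different one.

import hashlib, json

def plan_id(plan: dict) -> str:
    # Hash a canonical JSON rendering of the plan; any change in the
    # structure changes the digest, hence a new "plan id".
    canonical = json.dumps(plan, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

plan_a = {"read": "circuits-2.csv", "ops": ["withColumn"], "write": "ind-2.csv"}
plan_b = {"read": "circuits-2.csv", "ops": ["withColumn"], "write": "ind-2.csv"}
plan_c = {"read": "circuits-2.csv", "ops": ["withColumn", "filter"], "write": "ind-2.csv"}

assert plan_id(plan_a) == plan_id(plan_b)   # identical plans -> same id
assert plan_id(plan_a) != plan_id(plan_c)   # a small change -> different id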
-
from pyspark.sql.functions import col

# Enable Spline lineage tracking on the current Spark session
sc._jvm.za.co.absa.spline.harvester.SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)

# File location and type
file_location = "/FileStore/tables/circuits-2.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# Read the CSV and write it straight back out to a second location
spark.read \
    .option("header", first_row_is_header) \
    .option("inferSchema", infer_schema) \
    .option("sep", delimiter) \
    .csv(file_location) \
    .write \
    .mode("overwrite") \
    .csv("/FileStore/tables/ind-2.csv")

# Read again, add a derived column, and write to a different target
df = spark.read \
    .option("header", first_row_is_header) \
    .option("inferSchema", infer_schema) \
    .option("sep", delimiter) \
    .csv(file_location)
# df.printSchema()
df = df.withColumn("CXX", col("lat"))
df.createOrReplaceTempView("Data")
df1 = spark.sql("select * from Data")
df1.write.mode("overwrite").csv("mycsv.csv")
-
The code from your last post writes into two different CSV files (/FileStore/tables/ind-2.csv and mycsv.csv), so it produces two separate execution plans rather than one.
You can annotate your exec plans and events with labels and then call the Spline Consumer REST API to filter execution plans and events by their labels. Example for Spline 1.0.0 (soon to be released):

spline.postProcessingFilter=composite
spline.postProcessingFilter.composite.filters=userExtraMeta,default
spline.postProcessingFilter.userExtraMeta.rules= \
  { \
    "executionPlan": { \
      "labels": { \
        "lbl_1": "foo", \
        "lbl_2": "bar" \
      } \
    } \
  }
Then you can call the Spline Consumer REST API to filter execution plans and events by those labels. That part is currently under development; wait for the 1.0.0 release.
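If it helps, the same filter properties can also be supplied as Spark configuration. This is a sketch, assuming the spline-spark-agent jar is on the driver classpath and that agent options may be passed as Spark conf under the "spark.spline." prefix, as described in the agent README; the label values simply mirror the example above.

from pyspark.sql import SparkSession

# Pass the Spline agent's filter configuration via Spark conf
# (agent option names prefixed with "spark.spline.").
spark = (SparkSession.builder
         .config("spark.spline.postProcessingFilter", "composite")
         .config("spark.spline.postProcessingFilter.composite.filters",
                 "userExtraMeta,default")
         .config("spark.spline.postProcessingFilter.userExtraMeta.rules",
                 '{"executionPlan":{"labels":{"lbl_1":"foo","lbl_2":"bar"}}}')
         .getOrCreate())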
-
Question: with version 1.0.0, can I get the name of the notebook if I install the 1.0.0 agent on the Databricks cluster?
-
No, the Spline agent doesn't capture vendor-specific (in this case, Databricks notebook) properties out of the box. You need to either utilize the post-processing filter mechanism (e.g. the userExtraMeta filter shown above) to attach that information yourself, or implement a custom filter; see the sketch below.
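A speculative sketch of the first option, for Databricks only: read the notebook path via dbutils and pass it to Spline as a label before enabling lineage tracking. "dbutils" exists only on Databricks, and whether the agent picks up conf set at this point depends on when it is initialized, so treat this as an idea rather than a recipe.

# Grab the current notebook path from the Databricks context
notebook_path = (dbutils.notebook.entry_point.getDbutils()
                 .notebook().getContext().notebookPath().get())

# Attach it as a label via the userExtraMeta filter, then enable tracking
spark.conf.set("spark.spline.postProcessingFilter", "userExtraMeta")
spark.conf.set("spark.spline.postProcessingFilter.userExtraMeta.rules",
               '{"executionPlan":{"labels":{"notebook":"%s"}}}' % notebook_path)
sc._jvm.za.co.absa.spline.harvester.SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)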
-
Hi,
After investigating and running Spline, I found that each command that reads from a persistent source and writes to a persistent target is kept as an execution plan. For example, each time the Spark code runs again it generates a new execution plan (since Spline keeps the history).
My question is: how can I identify that a series of execution plans describes the same action?