Replies: 5 comments 3 replies
-
If you think about the case where the same plan is used multiple times with the same data sources, it should already create only one ExecutionPlan on the server. The plan id is a hash of the plan itself, so two identical plans will have the same id. The problem is that small changes in the plan may cause the ids to differ. See: AbsaOSS/spline#893
If you are talking about identifying the same operation done on different data sources, that is more difficult. It would mean looking for graph isomorphisms: comparing the graphs and checking for the same operations while ignoring the unimportant metadata. Alternatively, users could just tag their plans with some kind of id/name to identify the same jobs.
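To illustrate the hash-based identity, here is a toy sketch in plain Python (not Spline's actual hashing code; the plan dictionaries are made up): two structurally identical plans get the same id, and any small change yields a different one.

import hashlib, json

def plan_id(plan: dict) -> str:
    # Hash a canonical JSON rendering of the plan; any change in the
    # structure changes the digest, hence a new "plan id".
    canonical = json.dumps(plan, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

plan_a = {"read": "circuits-2.csv", "ops": ["withColumn"], "write": "ind-2.csv"}
plan_b = {"read": "circuits-2.csv", "ops": ["withColumn"], "write": "ind-2.csv"}
plan_c = {"read": "circuits-2.csv", "ops": ["withColumn", "filter"], "write": "ind-2.csv"}

assert plan_id(plan_a) == plan_id(plan_b)   # identical plans -> same id
assert plan_id(plan_a) != plan_id(plan_c)   # a small change -> different id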
-
from pyspark.sql.functions import col

# Enable Spline lineage tracking on the current Spark session
sc._jvm.za.co.absa.spline.harvester.SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)

# File location and type
file_location = "/FileStore/tables/circuits-2.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# Read the CSV and write it straight back out to a second location
spark.read \
    .option("header", first_row_is_header) \
    .option("inferSchema", infer_schema) \
    .option("sep", delimiter) \
    .csv(file_location) \
    .write \
    .mode("overwrite") \
    .csv("/FileStore/tables/ind-2.csv")

# Read again, add a derived column, and write to a different target
df = spark.read \
    .option("header", first_row_is_header) \
    .option("inferSchema", infer_schema) \
    .option("sep", delimiter) \
    .csv(file_location)
# df.printSchema()
df = df.withColumn("CXX", col("lat"))
df.createOrReplaceTempView("Data")
df1 = spark.sql("select * from Data")
df1.write.mode("overwrite").csv("mycsv.csv")
-
The code from your last post writes into two different CSV files (/FileStore/tables/ind-2.csv and mycsv.csv), so it produces two separate execution plans rather than one.
You can annotate your exec plans and events with labels and then call the Spline Consumer REST API to filter execution plans and events by their labels. Example for Spline 1.0.0 (soon to be released):

spline.postProcessingFilter=composite
spline.postProcessingFilter.composite.filters=userExtraMeta,default
spline.postProcessingFilter.userExtraMeta.rules= \
  { \
    "executionPlan": { \
      "labels": { \
        "lbl_1": "foo", \
        "lbl_2": "bar" \
      } \
    } \
  }
Then you can call the Spline Consumer REST API to filter execution plans and events by those labels. That part is currently under development; wait for the 1.0.0 release.
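If it helps, the same filter properties can also be supplied as Spark configuration. This is a sketch, assuming the spline-spark-agent jar is on the driver classpath and that agent options may be passed as Spark conf under the "spark.spline." prefix, as described in the agent README; the label values simply mirror the example above.

from pyspark.sql import SparkSession

# Pass the Spline agent's filter configuration via Spark conf
# (agent option names prefixed with "spark.spline.").
spark = (SparkSession.builder
         .config("spark.spline.postProcessingFilter", "composite")
         .config("spark.spline.postProcessingFilter.composite.filters",
                 "userExtraMeta,default")
         .config("spark.spline.postProcessingFilter.userExtraMeta.rules",
                 '{"executionPlan":{"labels":{"lbl_1":"foo","lbl_2":"bar"}}}')
         .getOrCreate())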
-
Question: with version 1.0.0, can I get the name of the notebook if I install the 1.0.0 agent on the Databricks cluster?
-
No, the Spline agent doesn't capture vendor-specific (in this case, Databricks notebook) properties out of the box. You need to either utilize the post-processing filter mechanism (e.g. the userExtraMeta filter shown above) to attach that information yourself, or implement a custom filter; see the sketch below.
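A speculative sketch of the first option, for Databricks only: read the notebook path via dbutils and pass it to Spline as a label before enabling lineage tracking. "dbutils" exists only on Databricks, and whether the agent picks up conf set at this point depends on when it is initialized, so treat this as an idea rather than a recipe.

# Grab the current notebook path from the Databricks context
notebook_path = (dbutils.notebook.entry_point.getDbutils()
                 .notebook().getContext().notebookPath().get())

# Attach it as a label via the userExtraMeta filter, then enable tracking
spark.conf.set("spark.spline.postProcessingFilter", "userExtraMeta")
spark.conf.set("spark.spline.postProcessingFilter.userExtraMeta.rules",
               '{"executionPlan":{"labels":{"notebook":"%s"}}}' % notebook_path)
sc._jvm.za.co.absa.spline.harvester.SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)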
-
Hi,
After investigating and running Spline, I found that each command that reads from a persistent source and writes to a persistent target is kept as an execution plan. For example, each time the Spark code runs again it generates a new execution plan (since Spline keeps the history).
My question is: how can I identify that a series of execution plans describes the same action?