More Dataset Support #341
-
Hello, I was wondering if there is a plan to add more support for Datasets. Essentially, I am trying to answer questions like: which field in the input data ended up in which field in the output data? I wrote the small program below and dug through the ArangoDB database to try to figure out how I could trace that field-to-field mapping.

```scala
package za.co.absa.spline.example.batch

import org.apache.spark.sql._
import za.co.absa.spline.SparkApp
import za.co.absa.spline.harvester.SparkLineageInitializer._

object ErinExample1Job extends SparkApp("Erin Example 1") {

  case class Person(
    first_name: String,
    last_name: String
  )

  case class NewPerson(
    first_name_new: String,
    last_name_new: String
  )

  case class CSVModel(
    d_code: String,
    d_name: String,
    people: Seq[Person]
  )

  case class NewCSVModel(
    d_code_new: String,
    d_name_new: String,
    people_new: Seq[NewPerson]
  )

  // Maps an input Person to an output NewPerson, field by field.
  def PersonMaker(p: Person): NewPerson =
    NewPerson(
      first_name_new = p.first_name.concat(" TEST"),
      last_name_new = p.last_name.concat(" TEST")
    )

  spark.enableLineageTracking()

  val encoder = org.apache.spark.sql.catalyst.encoders.ExpressionEncoder[CSVModel]

  val ds: Dataset[CSVModel] = spark.read
    .option("inferSchema", "true")
    .json("data/input/batch/test.json")
    .as(encoder)

  ds.map { row =>
      NewCSVModel(
        d_code_new = row.d_code.concat("123"),
        d_name_new = row.d_name.concat("123"),
        people_new = row.people.map(PersonMaker)
      )
    }
    .write
    .mode(SaveMode.Overwrite)
    .json("data/output/batch/erin_test_results")
}
```
-
Hello,
even though Spline recognizes the lineage at the operation level, the extraction of attribute-level lineage for the commands used here is not yet implemented.
I created a ticket for that: #342
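
For comparison, here is a minimal, hypothetical sketch of the same job written against the untyped column-expression API instead of a typed `Dataset.map`. The object name `ErinExample1ColumnsJob`, the output path, and the use of `functions.transform` (available since Spark 3.0) are illustrative choices, not part of the original example. The point is only that when every output column is built from column expressions, the mapping from input fields to output fields is visible in the logical plan rather than hidden inside an opaque lambda.

```scala
package za.co.absa.spline.example.batch

import org.apache.spark.sql.{Column, SaveMode}
import org.apache.spark.sql.functions._
import za.co.absa.spline.SparkApp
import za.co.absa.spline.harvester.SparkLineageInitializer._

// Hypothetical column-expression variant of the job above (functions.transform needs Spark 3.x).
// Every output column is derived through Catalyst expressions, so the field-to-field mapping is
// part of the logical plan instead of being buried inside a typed lambda.
object ErinExample1ColumnsJob extends SparkApp("Erin Example 1 (columns)") {

  spark.enableLineageTracking()

  spark.read
    .json("data/input/batch/test.json")
    .select(
      concat(col("d_code"), lit("123")).as("d_code_new"),
      concat(col("d_name"), lit("123")).as("d_name_new"),
      // rebuild each element of the nested `people` array with column expressions
      transform(col("people"), (p: Column) => struct(
        concat(p.getField("first_name"), lit(" TEST")).as("first_name_new"),
        concat(p.getField("last_name"), lit(" TEST")).as("last_name_new")
      )).as("people_new")
    )
    .write
    .mode(SaveMode.Overwrite)
    .json("data/output/batch/erin_test_results_columns")
}
```

Whether attribute-level lineage is actually extracted for this form depends on the Spline agent version, so treat it purely as a sketch of the difference between expression-based and lambda-based transformations.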