Skip to content

Conversation

@dylandoamaral
Copy link
Contributor

@dylandoamaral dylandoamaral commented Jan 4, 2023

  • get at least a couple of non-matching rows in test output
    summarize,
  • diff in matching : explain why a particular part of the rule is not matching
  • integrated schema matching as a header (to allow for projection for the following rules, if position based)
  • introduce a blocking key to avoid rule collision (a specific set of field are in priority for matching, which reduce the candidate rules)
  • add expectAtLeast

validity:

  • a row is valid, if it matches only one rule, or if the rules can be distributed between more rows.
case class Person(name, age)

val ds:Dataset = // Person("toto", 13), Person("tata", 13)

ds.expectAll(row(__, 13), row("toto", __)) // should not fail
ds.expectAll(row(__, 13)) // should fail

ds.expectAll(row(__, 13).allowMultipleMatches) // should not fail

summarize :
Due to the potential large volume of data, and low volume of rows, the idea is to keep certain rows to explain the failing test :

  • a set row that are valid are keeped, scored with closeness to rules that were not matching
  • a set row that didn't match any rules, scored with closeness to rules that were not matching
    The purge works per rules, keeping at least N values in the two sets, depending on closeness

Copy link
Member

@ahoy-jon ahoy-jon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some review

test("Dataframe should validate expect all with __ in it") {
for {
people <- Dataset(Person("Louis", 50), Person("Lara", 50))
result <- people.toDF.expectAll(row(__, 50))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to write?

for {
          people <- Dataset(Person("Louis", 50), Person("Lara", 50))
          result <- people.expectAll(row(__, 50))
          

object row {
def apply[T](predicate: T => Boolean): ConditionalMatcher[T] = ConditionalMatcher(predicate)
def apply[T](value: T): DataMatcher[T] = DataMatcher(value)
def apply(value: Any, values: Any*): RowMatcher = RowMatcher(Row(value +: values: _*))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def apply(valueMatcher: ValueMatcher, valueMatcher*: ValueMatcher*):RowMatcher

with different ways to build a valueMatcher.

implicit def conversion[T:ToValueMatcher](t:T):ValueMatcher
implicit def keyConversion[T:ToValueMatcher](t2:(String,T)):ValueMatcher

with possible future syntaxes being

row(__, "name" -> ((_:String).isEmpty || isNull))
row(__, "adr[0]" -> row[Adr](_.countryCode == "FR)
row(__,  "name" -> (v:String) => v == null || v.isEmtpy) // could reuse at lowcost zio-test macros on assertTrue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about datasets ? You only tackle dataframes with this syntax no ?

Copy link
Contributor Author

@dylandoamaral dylandoamaral Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we take :
row(__, "name" -> ((_:String).isEmpty || isNull), 20)
There are both positional and key based predicates, what it means ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should work on dataset too.

we could have

expectAll(
  row(_.isAdult).allowMultipleMatches, 
  row("toto", 13))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as in Scala, it doesn't work. We could make it work, but that is too ambiguous.

def default[T]: T => Boolean = _ => true

def checkMatch( age: Int => Boolean = default,
                surname: String=> Boolean = default,
                name: String => Boolean =default):Any = ???

def __ = ???

checkMatch(__, name = _.contains("toto"))
checkMatch(__, name = _.contains("toto"), default)

image

Dataset(1, 2, 3).flatMap(_.expectAll(row(1), row(2)))
} @@ failing,
test("Dataset should validate expect all with conditional match") {
Dataset(1, 2, 3).flatMap(_.expectAll(row(_ > 0)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only if a matcher is set to match more rows (allowMultipleMatches)

@dylandoamaral dylandoamaral marked this pull request as draft January 5, 2023 08:50
@dylandoamaral dylandoamaral marked this pull request as ready for review January 13, 2023 08:24
@dylandoamaral
Copy link
Contributor Author

@ahoy-jon I would like a review on the current state of this PR. I think we can improve error handling, but I don't know exactly how. I would also need more insight about blocking keys based on the current state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants