RFC: Expectations proposal #112

Closed
@thorehusfeldt

Background: I presented an expectations proposal in Lund in July and received a lot of useful feedback. Moreover, the definitions of test groups, grading, and scoring have evolved a lot, so that much of that framework is now obsolete. In particular,

  • the verdict of a test group is no longer a meaningful concept.

I’ll try to summarise a new proposal here, hopefully in line with the working group’s preferences.

To solidify terminology, testgroup here means a directory in data (including data itself). data is a testgroup, so is data/sample, the directory data/secret/huge_instances/overflow/ in a pass-fail task, and the subtask data/secret/group1 in an IOI-style problem.

Expectations are for testcases, not testgroups

The conceptually largest change is that expectations are specified for (sets of) testcases, not for testgroups. In particular, an expectation is given for

  • in graph-theoretic terms: the testcases that are descendants of a testgroup path.
  • in filesystem-terms: the testcases of the form path/**.

For instance, in a pass-fail problem, you can write

accepted/th.py: [AC, WA] # final verdict of th.py is either WA or AC
wrong_answer/js.py:
  sample:
    allowed: AC # js.py must accept on sample
  secret:
    required: WA # js.py must get WA on at least one secret test case
mixed/alice.py: # funky example:
  subgroups:
    secret/huge_instances/overflow:
      required: [AC, TLE] # deeply nested testcase uses same semantics

Required and allowed

As far as I can tell, we need to be able to specify both required and allowed testcase verdicts. The above syntax seems less verbose than the alternative:

wrong_answer/js.py:
  data:
    sample: 
      allowed: AC # js.py must accept on sample
    secret:
      required: WA # js.py must get WA on at least one secret test case

The difference becomes particularly striking in IOI-style problems with subtasks. (Try it.)

I have no strong feelings about the names of keys, but required and allowed seem clear to me. The semantics is that if $V$ is the set of verdicts for the testcases below the specified testgroup, $R$ is the required set, and $A$ is the allowed set, then $R\cap V\neq\emptyset$ and $V\subseteq A$. I played around with none, any, all, but it didn’t become clearer or shorter. Suggestions are welcome (but try them out first by actually writing the resulting YAML expressions).
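
The two conditions can be sketched as a small check. This is only an illustration of the stated semantics; the function name `meets_expectation` is hypothetical, and treating a missing key as "no constraint" is an assumption:

```python
def meets_expectation(verdicts, required=None, allowed=None):
    """Check that some required verdict occurs and that all verdicts are allowed.

    A key that is omitted (None) is assumed to impose no constraint.
    """
    v = set(verdicts)
    if required is not None and not (set(required) & v):
        return False  # no required verdict appeared on any testcase
    if allowed is not None and not v <= set(allowed):
        return False  # some testcase produced a verdict outside the allowed set
    return True


# A WA submission on secret data: WA must appear, and only AC and WA may appear.
ok = meets_expectation(["AC", "WA", "AC"], required=["WA"], allowed=["AC", "WA"])
```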

[Update: better syntax, see two posts down]

Useful shorthands

submission_name: string

is a shorthand for

submission_name:
  allowed: string

which is the most important use case. Also, string is a shorthand for the singleton [string].
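
A possible normalisation step for these shorthands, written over already-parsed YAML values. The function name `normalise` is hypothetical, and reading a bare list as an allowed list is an assumption based on the `[AC, WA]` example above; nested testgroup keys and score are left untouched:

```python
def normalise(expectation):
    """Expand the shorthands into an explicit map with list-valued keys."""

    def as_list(value):
        # A bare string abbreviates the singleton list containing it.
        return [value] if isinstance(value, str) else list(value)

    if isinstance(expectation, (str, list)):
        # A bare value abbreviates an 'allowed' entry.
        return {"allowed": as_list(expectation)}
    return {
        key: as_list(value) if key in ("allowed", "required") else value
        for key, value in expectation.items()
    }


expanded = normalise("AC")
```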

Full schema

The schema is something like this, if this makes sense to you

[string] :  // mixed/th.py
  string |   // AC 
  [...string] |  // [AC, WA]
  number |           // 23
  [number, number] |  // [23, 37]
  { // full map
    allowed?: [string]: #verdict 
    required?: [string]: #verdict 
    score?: [string]: number | [number, number] | "full"
}

With subtasks

The most important use case for me is to specify expected behaviour on subtasks. This becomes less natural than in my original proposal (where the concept “testgroup verdict” existed.)

Now we’re at:

mixed/greedy.py:
  allowed:
    sample: AC # must pass samples (we’re sneaky and haven’t included sample that needs DP)
    secret/group1: AC 
    secret/group2: AC
    secret/group3: [WA, AC] # should not crash or TLE
  required:
    secret/group3: WA # at least one testcase must fail

This is quite verbose, but I can’t find a way to make it shorter. Feel free to try.
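
To make the intended reading of the greedy.py example concrete, here is a rough sketch that evaluates per-group allowed and required sets against a map from testcase path to verdict. The function name `check_groups` and the testcase names are made up for illustration:

```python
def check_groups(results, allowed, required):
    """results: testcase path -> verdict; allowed/required: group path -> verdict set."""

    def verdicts_below(group):
        return {v for path, v in results.items() if path.startswith(group + "/")}

    for group, permitted in allowed.items():
        if not verdicts_below(group) <= permitted:
            return False  # a testcase below this group got a disallowed verdict
    for group, needed in required.items():
        if not (verdicts_below(group) & needed):
            return False  # none of the required verdicts appeared below this group
    return True


allowed = {
    "sample": {"AC"},
    "secret/group1": {"AC"},
    "secret/group2": {"AC"},
    "secret/group3": {"WA", "AC"},
}
required = {"secret/group3": {"WA"}}

results = {
    "sample/1": "AC",
    "secret/group1/a": "AC",
    "secret/group2/a": "AC",
    "secret/group3/a": "AC",
    "secret/group3/b": "WA",
}
greedy_ok = check_groups(results, allowed, required)
```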

Scoring

Currently I’m at

mixed/baz.py:
  score: full
mixed/bar.py:
  score: 54
mixed/baf.py:
  score: [12, 20]

full is important because I don’t want to depend on the values in testdata.yaml when the score for subtask 1 changes; the value full communicates more to the reader than 23.
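
A sketch of how the three forms of score could be checked, assuming that a two-element list is an inclusive range and that the full score is looked up elsewhere (for instance from testdata.yaml) and passed in. The name `score_matches` is hypothetical:

```python
def score_matches(actual, expected, full_score):
    """expected is "full", a single number, or a [low, high] pair (assumed inclusive)."""
    if expected == "full":
        return actual == full_score
    if isinstance(expected, (int, float)):
        return actual == expected
    low, high = expected
    return low <= actual <= high


# The expectation stays valid if the full score changes, as long as it is reached.
baz_ok = score_matches(23, "full", full_score=23)
```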

Q1: Should we instead have a fraction here, such as score: 1.0 meaning full and score: [.2, .45] meaning “this gets between 20% and 45% of the full value for this subtask”? This sounds more useful to me.

Judgemessages

I want to allow judgemessages as well, which doesn’t change the schema (just add | string to #verdict):

wrong_answer/th.py:
  required:
    secret: "too many rounds" # this submission must fail with  "too many rounds" on some instance

I think this will make it much easier to construct custom validators (because you can check for full code coverage in your validator.)

Toplevel group name

Consider

mixed/th.py:
  "": [AC, WA]
  sample: [AC]
  secret:
    allowed: [AC, WA]
    required: WA

This is (as far as I can tell) the best way of specifying “this is a WA submission that passes on sample”. But the role of this example is to highlight the fact that the toplevel directory doesn’t have a good name.

Q2: what should be done about this?

  1. Nothing. "" or maybe "." are perfectly fine names for data when you actually need them. (Which is seldom, mostly it follows from the descendant verdicts anyway so you’re just being sloppy.)
  2. Add data/ to all testgroup names, so it’s data/sample etc. from now on
  3. Identify testgroup names by their last part. data means data and sample means data/sample. If authors have both data/secret/foo and data/secret/baz/foo then they have themselves to blame
  4. something else
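
For option 3, the ambiguity can at least be detected mechanically. A minimal sketch, assuming testgroups are given as full slash-separated paths starting at data (the function name `resolve_by_last_part` is hypothetical):

```python
def resolve_by_last_part(name, groups):
    """Map a short name to the unique testgroup whose last path component equals it."""
    matches = [g for g in groups if g.split("/")[-1] == name]
    if len(matches) != 1:
        raise ValueError(f"{name!r} matches {len(matches)} testgroups: {matches}")
    return matches[0]


groups = [
    "data",
    "data/sample",
    "data/secret",
    "data/secret/foo",
    "data/secret/baz",
    "data/secret/baz/foo",
]
sample_group = resolve_by_last_part("sample", groups)
```

With these groups, asking for foo raises a ValueError because both data/secret/foo and data/secret/baz/foo match.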

Bespoke Verdict Syntax Would Get Rid of Lists and Required / Allowed

An alternative would be to drop the required and allowed keys and instead bake the expected behaviour into the terminology. After all, only a constant number of pairs $R$ and $A$ with $R\subseteq A$ can ever appear, since $|A|\leq 4$. For instance, accepted means “must get exactly AC on all test cases”, but timeout means “AC and TLE are allowed, and TLE is required”, and not_wrong means that WA is disallowed (everything else is OK). I guess there are at most 10 different actually-existing cases that ever need to be defined.
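
As a sketch of what such a table could look like, here are the three names mentioned above expressed as (required, allowed) pairs. The verdict set AC, WA, TLE, RTE is an assumption (the text only says there are at most four), and the names `BESPOKE` and `check` are hypothetical:

```python
ALL_VERDICTS = {"AC", "WA", "TLE", "RTE"}  # assumed; the proposal only bounds the size by 4

BESPOKE = {
    "accepted": {"required": {"AC"}, "allowed": {"AC"}},
    "timeout": {"required": {"TLE"}, "allowed": {"AC", "TLE"}},
    "not_wrong": {"required": set(), "allowed": ALL_VERDICTS - {"WA"}},
}


def check(name, verdicts):
    """Evaluate a named expectation; an empty required set imposes no constraint."""
    spec = BESPOKE[name]
    v = set(verdicts)
    required_ok = not spec["required"] or bool(spec["required"] & v)
    return required_ok and v <= spec["allowed"]


timeout_ok = check("timeout", ["AC", "TLE", "AC"])
```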

This would allow some very useful shorthands.

Q3: Is this sufficiently tempting to try to come up with a list of those cases, and think about good names?

Please comment.
