RFC: Expectations proposal #112

Background: I presented an expectations proposal in Lund in July and received a lot of useful feedback. Moreover, the definitions of test groups, grading, and scoring have evolved a lot, so much of that framework is now obsolete. In particular,

* the _verdict of a test group_ is no longer a meaningful concept.

I’ll try to summarise a new proposal here, hopefully in line with the working group’s preferences.

To solidify terminology, _testgroup_ here means a directory in `data` (including `data` itself). `data` is a testgroup, so is `data/sample`, the directory `data/secret/huge_instances/overflow/` in a pass-fail task, and the subtask `data/secret/group1` in an IOI-style problem.

# Expectations are for testcases, not testgroups

The conceptually largest change is that expectations are specified for (sets of) testcases, not for testgroups. In particular, an expectation is given for
* in graph-theoretic terms: _the testcases that are descendants of a testgroup `path`_;
* in filesystem terms: _the testcases of the form `path/**`_.

For instance, in a pass-fail problem, you can write

```yaml
accepted/th.py: [AC, WA] # final verdict of th.py is either WA or AC
wrong_answer/js.py:
  sample:
    allowed: AC # js.py must accept on sample
  secret:
    required: WA # js.py must get WA on at least one secret test case
mixed/alice.py: # funky example:
  subgroups:
    secret/huge_instances/overflow:
      required: [AC, TLE] # deeply nested testcase uses same semantics
```

# Required and allowed

<s>As far as I can tell, we need to be able to specify both required and allowed testcase verdicts. The above syntax seems less verbose than the alternative:
```yaml
wrong_answer/js.py:
  data:
    sample: 
      allowed: AC # js.py must accept on sample
    secret:
      required: WA # js.py must get WA on at least one secret test case
```
The difference becomes particularly striking in IOI-style problems with subtasks. (Try it.)</s>

I have no strong feelings about the names of keys, but `required` and `allowed` seem clear to me. The semantics is as follows: if $V$ is the set of verdicts for the testcases below the specified testgroup, $R$ the set of required verdicts, and $A$ the set of allowed verdicts, then <s>$R\subseteq V\subseteq A$</s> $R\cap V\neq\emptyset$ and $V\subseteq A$. I played around with `none`, `any`, `all`, but it didn’t become clearer or shorter. Suggestions are welcome (but try them out first by actually writing the resulting YAML expressions).
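The condition $R\cap V\neq\emptyset$ and $V\subseteq A$ can be sketched as a small Python check. The function name `satisfies` is hypothetical, and treating an omitted key as "no constraint" is an assumption of this sketch.

```python
from __future__ import annotations


def satisfies(
    verdicts: set[str],
    required: set[str] | None = None,
    allowed: set[str] | None = None,
) -> bool:
    """Check the set V of testcase verdicts against R (required) and A (allowed).

    An omitted key (None) imposes no constraint; this is an assumption.
    """
    if allowed is not None and not verdicts <= allowed:
        return False  # some verdict lies outside A
    if required is not None and not (verdicts & required):
        return False  # no verdict from R occurred
    return True


# A WA submission on `secret`: WA must occur, nothing but AC and WA may occur.
print(satisfies({"AC", "WA"}, required={"WA"}, allowed={"AC", "WA"}))
```

Note that `required: [AC, TLE]` under this reading means "at least one of AC or TLE occurs", not "both occur".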

[Update: better syntax, see two posts down]

<s># Useful shorthands:

```yaml
submission_name: string
```
is a shorthand for

```yaml
submission_name:
  allowed: string
```
which is the most important use case. Also, `string` is a shorthand for the singleton `[string]`.
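A tool reading these expectations would probably expand the shorthands before checking anything. The following is a rough sketch of such a step; the function name `normalise` is hypothetical, and only verdict strings and lists are handled (numeric score shorthands are left out).

```python
def normalise(expectation):
    """Expand the verdict shorthands into the full map form.

    "AC"          -> {"allowed": ["AC"]}
    ["AC", "WA"]  -> {"allowed": ["AC", "WA"]}
    Inside a map, a bare string becomes a singleton list.
    """
    if isinstance(expectation, str):
        expectation = [expectation]
    if isinstance(expectation, list):
        return {"allowed": list(expectation)}
    expanded = dict(expectation)
    for key in ("allowed", "required"):
        if isinstance(expanded.get(key), str):
            expanded[key] = [expanded[key]]
    return expanded


print(normalise("AC"))
```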


# Full schema

The schema is something like this, if this makes sense to you:

```cue
[string] :  // mixed/th.py
  string |   // AC 
  [...string] |  // [AC, WA]
  number |           // 23
  [number, number] |  // [23, 37]
  { // full map
    allowed?: [string]: #verdict 
    required?: [string]: #verdict 
    score?: [string]: number | [number, number] | "full"
}
```

# With subtasks

The most important use case for me is to specify expected behaviour on subtasks. This becomes less natural than in my original proposal (where the concept “testgroup verdict” existed.)

Now we’re at:
```yaml
mixed/greedy.py:
  allowed:
    sample: AC # must pass samples (we’re sneaky and haven’t included a sample that needs DP)
    secret/group1: AC 
    secret/group2: AC
    secret/group3: [WA, AC] # should not crash or TLE
  required:
    secret/group3: WA # at least one testcase must fail
```
This is quite verbose, but I can’t find a way to make it shorter. Feel free to try.

# Scoring

Currently I’m at 
```yaml
mixed/baz.py:
  score: full
mixed/bar.py:
  score: 54
mixed/baf.py:
  score: [12, 20]
```

`full` is important because I don’t want to have to remember the values in `testdata.yaml`, or to update the expectations when the score for subtask 1 changes; the value `full` also communicates more to the reader than `23`.

Q1: Should we instead have a fraction here, such as `score: 1.0` meaning `full` and `score: [.2, .45]` meaning “this gets between 20% and 45% of the full value for this subtask”? This sounds more useful to me.
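To compare the two variants, here is a sketch of both checks in Python. The function names and the `full` parameter (the maximum score, which would come from `testdata.yaml`) are assumptions; the small tolerance `eps` is only there to absorb floating-point rounding in the fractional variant.

```python
from __future__ import annotations


def score_matches(expected, score: float, full: float) -> bool:
    """Absolute variant: `expected` is "full", a number, or a [low, high] range."""
    if expected == "full":
        return score == full
    if isinstance(expected, (int, float)):
        return score == expected
    low, high = expected
    return low <= score <= high


def fraction_matches(expected, score: float, full: float, eps: float = 1e-9) -> bool:
    """Fractional variant (Q1): `expected` is a fraction or a [low, high] pair of fractions."""
    low, high = expected if isinstance(expected, (list, tuple)) else (expected, expected)
    return low * full - eps <= score <= high * full + eps


# With a full score of 40, [.2, .45] corresponds to the range from 8 to 18 points.
print(fraction_matches([0.2, 0.45], 12, 40))
```

The fractional variant never needs to know the absolute values when writing the expectation, which is the same benefit that `full` gives in the absolute variant.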

# Judgemessages

I want to allow judgemessages as well, which doesn’t change the schema (just add ` | string` to `#verdict`):

```yaml
wrong_answer/th.py:
  required:
    secret: "too many rounds" # this submission must fail with "too many rounds" on some instance
```

I think this will make it much easier to construct custom validators (because you can check for full code coverage in your validator.)

# Toplevel group name

Consider

```yaml
mixed/th.py:
  "": [AC, WA]
  sample: [AC]
  secret:
    allowed: [AC, WA]
    required: WA
```
This is (as far as I can tell) the best way of specifying “this is a WA submission that passes on sample”. But the role of this example is to highlight the fact that the toplevel directory doesn’t have a good name. 

Q2: what should be done about this?

1. Nothing. `""` or maybe `"."` are perfectly fine names for `data` when you actually need them. (Which is seldom; mostly it follows from the descendant verdicts anyway, so you’re just being sloppy.)
2. Add `data/` to all testgroup names, so it’s `data/sample` etc. from now on.
3. Identify testgroup names by their last part. `data` means `data` and `sample` means `data/sample`. If authors have both `data/secret/foo` and `data/secret/baz/foo`, then they have themselves to blame.
4. Something else.
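To see what option 3 would mean for a tool, here is a sketch of resolving a testgroup by its last path component. The function name `resolve` and the choice to raise an error on ambiguous names are assumptions; the proposal only says that authors with clashing names "have themselves to blame".

```python
from __future__ import annotations

from pathlib import PurePosixPath


def resolve(name: str, testgroups: list[str]) -> str:
    """Resolve a short testgroup name to its full path by matching the last component."""
    matches = [g for g in testgroups if PurePosixPath(g).name == name]
    if len(matches) != 1:
        raise ValueError(f"{name!r} matches {len(matches)} testgroups: {matches}")
    return matches[0]


groups = [
    "data",
    "data/sample",
    "data/secret",
    "data/secret/foo",
    "data/secret/baz",
    "data/secret/baz/foo",
]
print(resolve("sample", groups))  # resolves to data/sample
# resolve("foo", groups) would raise ValueError, since two testgroups end in "foo".
```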
  

# Bespoke Verdict Syntax Would Get Rid of Lists and Required / Allowed

An alternative would be to _not_ have the `required` and `allowed` keys and instead bake the expected behaviour into the terminology. After all, only a constant number of pairs $R$ and $A$ with $R\subseteq A$ can ever appear, since $|A|\leq 4$. For instance, `accepted` means “must get exactly AC on all test cases”, but `timeout` means “AC and TLE are allowed, and TLE is required”, and `not_wrong` means that `WA` is disallowed (everything else is OK). I guess there are at most 10 different actually-existing cases that ever need to be defined.

This would allow some very useful shorthands.
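As a sketch of how such names could be expanded, the three examples above can be written as a lookup table from name to a pair $(R, A)$. The verdict set `AC`, `WA`, `TLE`, `RTE` is an assumption (the text only says $|A|\leq 4$), and the table and function names are hypothetical.

```python
# Assumed full verdict set; only AC, WA and TLE are named in the text.
VERDICTS = frozenset({"AC", "WA", "TLE", "RTE"})

# name -> (required R, allowed A); an empty R means "nothing required".
BESPOKE = {
    "accepted": (frozenset(), frozenset({"AC"})),
    "timeout": (frozenset({"TLE"}), frozenset({"AC", "TLE"})),
    "not_wrong": (frozenset(), VERDICTS - {"WA"}),
}


def expand(name: str) -> dict:
    """Expand a bespoke verdict name into explicit required/allowed lists."""
    required, allowed = BESPOKE[name]
    expanded = {"allowed": sorted(allowed)}
    if required:
        expanded["required"] = sorted(required)
    return expanded


print(expand("timeout"))
```

A complete table would need one entry per case that actually occurs in practice, which is the list that Q3 asks about.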

Q3: Is this sufficiently tempting to try to come up with a list of those cases, and think about good names?
</s>
Please comment.
