Background: I presented an expectations proposal in Lund in July and received a lot of useful feedback. Moreover, the definitions of test groups, grading, and scoring have evolved so much that large parts of that framework are now obsolete. In particular,
- the verdict of a test group is no longer a meaningful concept.
I’ll try to summarise a new proposal here, hopefully in line with the working group’s preferences.
To solidify terminology, testgroup here means a directory in `data` (including `data` itself). `data` is a testgroup, and so are `data/sample`, the directory `data/secret/huge_instances/overflow/` in a pass-fail task, and the subtask `data/secret/group1` in an IOI-style problem.
Expectations are for testcases, not testgroups
The conceptually largest change is that expectations are specified for (sets of) testcases, not for testgroups. In particular, an expectation attached to a testgroup `path` applies to
- in graph-theoretic terms: the testcases that are descendants of `path`;
- in filesystem terms: the testcases of the form `path/**`.
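A minimal sketch of this descendant rule, assuming testcases are named by their path relative to `data` (e.g. `secret/group1/01`, extensions dropped) and that `""` denotes `data` itself. The function name `covered_testcases` is hypothetical.

```python
def covered_testcases(group: str, testcases: list[str]) -> list[str]:
    """Return the testcases that are descendants of testgroup `group`.

    Hypothetical helper: names are paths relative to `data`, so ""
    denotes `data` itself and "secret/group1" denotes data/secret/group1.
    """
    group = group.strip("/")  # tolerate "secret/huge_instances/overflow/"
    if group == "":
        return list(testcases)  # every testcase is a descendant of data
    # Compare against the prefix *with* a trailing slash, so that
    # "secret/group1" does not accidentally match "secret/group10/01".
    prefix = group + "/"
    return [tc for tc in testcases if tc.startswith(prefix)]


CASES = ["sample/1", "secret/group1/01", "secret/group10/01"]
print(covered_testcases("secret/group1", CASES))  # ['secret/group1/01']
```

Matching on whole path components rather than raw string prefixes is the one subtle point; everything else is just `path/**`.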
For instance, in a pass-fail problem, you can write:

```yaml
accepted/th.py: [AC, WA]   # final verdict of th.py is either WA or AC
wrong_answer/js.py:
  sample:
    allowed: AC            # js.py must accept on sample
  secret:
    required: WA           # js.py must get WA on at least one secret test case
mixed/alice.py:            # funky example:
  subgroups:
    secret/huge_instances/overflow:
      required: [AC, TLE]  # deeply nested testcase uses same semantics
```
Required and allowed
As far as I can tell, we need to be able to specify both required and allowed testcase verdicts. The above syntax seems less verbose than the alternative:
```yaml
wrong_answer/js.py:
  data:
    sample:
      allowed: AC    # js.py must accept on sample
    secret:
      required: WA   # js.py must get WA on at least one secret test case
```
The difference becomes particularly striking in IOI-style problems with subtasks. (Try it.)
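I have no strong feelings about the names of keys, but `required` and `allowed` seem clear to me. The semantics is that if $R$ is the set of required verdicts, $A$ the set of allowed verdicts, and $V$ the set of verdicts actually obtained on the testcases, then the expectation is met if and only if $R\subseteq V\subseteq A$. I have experimented with other keys, such as `none`, `any`, and `all`, but it didn’t become clearer or shorter. Suggestions are welcome (but try them out first by actually writing the resulting YAML expressions).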
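The condition $R\subseteq V\subseteq A$ is a two-line set check. A sketch, assuming (this is not fixed by the proposal) that a missing `required` means $R=\emptyset$ and a missing `allowed` means every verdict is allowed; `expectation_met` is a hypothetical name.

```python
from typing import Iterable, Optional


def expectation_met(
    verdicts: Iterable[str],
    required: Optional[set[str]] = None,
    allowed: Optional[set[str]] = None,
) -> bool:
    """Check R ⊆ V ⊆ A, where V is the set of verdicts actually obtained.

    Assumption: required=None means R is empty, allowed=None means
    that any verdict is allowed.
    """
    v = set(verdicts)
    r = required or set()
    if not r <= v:  # some required verdict never occurred
        return False
    if allowed is not None and not v <= allowed:  # a disallowed verdict occurred
        return False
    return True


# js.py on secret: WA occurs at least once, so `required: WA` holds.
print(expectation_met(["AC", "WA", "AC"], required={"WA"}))  # True
# js.py on sample: a WA there violates `allowed: AC`.
print(expectation_met(["AC", "WA"], allowed={"AC"}))  # False
```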
[Update: better syntax, see two posts down]
Useful shorthands
`submission_name: string` is a shorthand for

```yaml
submission_name:
  allowed: string
```

which is the most important use case. Also, `string` is a shorthand for the singleton `[string]`.
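The two shorthands can be expanded mechanically before any checking happens. A sketch, assuming the YAML has already been parsed into plain Python values; `as_list` and `expand_submission` are hypothetical names, and the sketch deliberately leaves maps and numeric score forms untouched.

```python
from typing import Any


def as_list(value: Any) -> list:
    """`string` is shorthand for the singleton `[string]`."""
    return [value] if isinstance(value, str) else list(value)


def expand_submission(spec: Any) -> Any:
    """Expand `submission_name: string` (or a list of strings) to {allowed: [...]}.

    Anything else (maps, and numeric score forms) is returned unchanged;
    expanding shorthands nested inside maps is omitted from this sketch.
    """
    if isinstance(spec, str) or (
        isinstance(spec, list) and all(isinstance(x, str) for x in spec)
    ):
        return {"allowed": as_list(spec)}
    return spec


print(expand_submission("AC"))          # {'allowed': ['AC']}
print(expand_submission(["AC", "WA"]))  # {'allowed': ['AC', 'WA']}
```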
Full schema
The schema is something like this (if this notation makes sense to you):

```
[string]:               // mixed/th.py
    string |            // AC
    [...string] |       // [AC, WA]
    number |            // 23
    [number, number] |  // [23, 37]
    {                   // full map
        allowed?:  [string]: #verdict
        required?: [string]: #verdict
        score?:    [string]: number | [number, number] | "full"
    }
```
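To make the union of alternatives concrete, here is a sketch that reports which alternative a parsed value matches. The labels (`"verdict"`, `"score"`, ...) are invented for illustration, and reading bare numbers as scores is an assumption based on the Scoring section below; `classify` is a hypothetical name.

```python
def classify(value) -> str:
    """Name the schema alternative that `value` matches (sketch)."""
    if isinstance(value, bool):  # bool is a subclass of int; reject it first
        raise ValueError(f"does not match the schema: {value!r}")
    if isinstance(value, str):
        return "verdict"                    # AC
    if isinstance(value, (int, float)):
        return "score"                      # 23
    if isinstance(value, list):
        if all(isinstance(x, str) for x in value):
            return "verdict list"           # [AC, WA]
        if len(value) == 2 and all(
            isinstance(x, (int, float)) and not isinstance(x, bool) for x in value
        ):
            return "score range"            # [23, 37]
    if isinstance(value, dict) and set(value) <= {"allowed", "required", "score"}:
        return "full map"                   # {allowed: ..., required: ..., score: ...}
    raise ValueError(f"does not match the schema: {value!r}")


print(classify([23, 37]))  # score range
```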
With subtasks
The most important use case for me is to specify expected behaviour on subtasks. This becomes less natural than in my original proposal (where the concept “testgroup verdict” existed).
Now we’re at:
```yaml
mixed/greedy.py:
  allowed:
    sample: AC               # must pass samples (we’re sneaky and haven’t included a sample that needs DP)
    secret/group1: AC
    secret/group2: AC
    secret/group3: [WA, AC]  # should not crash or TLE
  required:
    secret/group3: WA        # at least one testcase must fail
```
This is quite verbose, but I can’t find a way to make it shorter. Feel free to try.
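Verbose or not, checking such an expectation is straightforward: for each group, collect the verdicts of its descendant testcases and test $R\subseteq V\subseteq A$. A sketch, with the `mixed/greedy.py` expectation transcribed by hand into a Python dict and entirely hypothetical per-testcase results; `verdicts_under` and `violations` are invented names.

```python
# The mixed/greedy.py expectation, transcribed by hand
# (string shorthands already expanded to lists).
GREEDY = {
    "allowed": {
        "sample": ["AC"],
        "secret/group1": ["AC"],
        "secret/group2": ["AC"],
        "secret/group3": ["WA", "AC"],
    },
    "required": {"secret/group3": ["WA"]},
}


def verdicts_under(group: str, results: dict[str, str]) -> set[str]:
    """Verdicts of all testcases that are descendants of `group`."""
    group = group.strip("/")
    if not group:  # "" stands for data itself
        return set(results.values())
    return {v for tc, v in results.items() if tc.startswith(group + "/")}


def violations(expectation: dict, results: dict[str, str]) -> list[str]:
    """Check R ⊆ V ⊆ A per group; an empty list means the expectation holds."""
    problems = []
    for group, allowed in expectation.get("allowed", {}).items():
        extra = verdicts_under(group, results) - set(allowed)
        if extra:
            problems.append(f"{group}: disallowed {sorted(extra)}")
    for group, required in expectation.get("required", {}).items():
        missing = set(required) - verdicts_under(group, results)
        if missing:
            problems.append(f"{group}: missing required {sorted(missing)}")
    return problems


# Hypothetical per-testcase verdicts for greedy.py (names relative to data/).
RESULTS = {
    "sample/1": "AC",
    "secret/group1/01": "AC",
    "secret/group2/01": "AC",
    "secret/group3/01": "AC",
    "secret/group3/02": "WA",
}

print(violations(GREEDY, RESULTS))  # []
```

If `greedy.py` instead timed out on `secret/group3/02`, both rules for `secret/group3` would fire: TLE is not allowed there, and the required WA never occurs.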
Scoring
Currently I’m at:

```yaml
mixed/baz.py:
  score: full
mixed/bar.py:
  score: 54
mixed/baf.py:
  score: [12, 20]
```
`full` is important because I don’t want to have to remember to update these values whenever the score for subtask 1 in `testdata.yaml` changes; moreover, the value `full` communicates more to the reader than `23`.
Q1: Should we instead have a fraction here, such as `score: 1.0` meaning `full`, and `score: [.2, .45]` meaning “this gets between 20% and 45% of the full value for this subtask”? This sounds more useful to me.
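The two options differ only in whether the bounds are scaled by the subtask's maximum. A sketch of both, where `max_score` stands for the value that would be read from `testdata.yaml` (how it is read is not shown); `score_ok` and `fraction_ok` are hypothetical names, and exact float comparison is kept for brevity.

```python
from typing import Union


def score_ok(actual: float, spec: Union[str, float, list], max_score: float) -> bool:
    """Current proposal: "full", an exact value, or an inclusive range [lo, hi]."""
    if spec == "full":
        return actual == max_score  # keeps working if testdata.yaml changes
    if isinstance(spec, list):
        lo, hi = spec
        return lo <= actual <= hi
    return actual == spec


def fraction_ok(actual: float, spec: Union[float, list], max_score: float) -> bool:
    """Q1 variant: 1.0 means full, [.2, .45] means 20%..45% of max_score."""
    lo, hi = spec if isinstance(spec, list) else (spec, spec)
    return lo * max_score <= actual <= hi * max_score


print(score_ok(15, [12, 20], max_score=23))    # True
print(fraction_ok(5, [.2, .45], max_score=20))  # True
```

With fractions, the expectation for a subtask never mentions an absolute score at all, which is exactly the robustness that `full` buys for the single case of a perfect score.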
Judgemessages
I want to allow judgemessages as well, which doesn’t change the schema (just add `| string` to `#verdict`):
```yaml
wrong_answer/th.py:
  required:
    secret: "too many rounds"  # this submission must fail with "too many rounds" on some instance
```
I think this will make it much easier to construct custom validators (because you can check for full code coverage in your validator).
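One way to read `#verdict | string` is: if the entry is a known verdict code, compare verdicts; otherwise treat it as an expected judge message. A sketch under two loudly labelled assumptions: the verdict codes are exactly `AC`, `WA`, `TLE`, `RTE`, and a message entry matches a rejected testcase whose (trimmed) judge message equals it exactly. `outcome_matches` and `required_met` are hypothetical names.

```python
# ASSUMPTION: this is the full set of verdict codes; any other string in a
# verdict position is interpreted as an expected judge message.
VERDICTS = {"AC", "WA", "TLE", "RTE"}


def outcome_matches(expected: str, verdict: str, judgemessage: str) -> bool:
    """Does one testcase's outcome match an entry of type #verdict | string?"""
    if expected in VERDICTS:
        return verdict == expected
    # ASSUMPTION: a judge message implies a rejection, and must equal the
    # validator's message exactly after trimming surrounding whitespace.
    return verdict != "AC" and judgemessage.strip() == expected


def required_met(expected: str, outcomes: list[tuple[str, str]]) -> bool:
    """`required` semantics: at least one testcase in the group matches."""
    return any(outcome_matches(expected, v, msg) for v, msg in outcomes)


# Hypothetical (verdict, judgemessage) pairs for th.py on secret.
OUTCOMES = [("AC", ""), ("WA", "too many rounds\n")]
print(required_met("too many rounds", OUTCOMES))  # True
```

Whether messages should match exactly, by prefix, or by pattern is an open design choice; exact matching is simply the easiest to state.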
Toplevel group name
Consider:

```yaml
mixed/th.py:
  "": [AC, WA]
  sample: [AC]
  secret:
    allowed: [AC, WA]
    required: WA
```
This is (as far as I can tell) the best way of specifying “this is a WA submission that passes on sample”. But the role of this example is to highlight the fact that the toplevel directory doesn’t have a good name.
Q2: What should be done about this?
- Nothing. `""` or maybe `"."` are perfectly fine names for `data` when you actually need them. (Which is seldom; mostly it follows from the descendant verdicts anyway, so you’re just being sloppy.)
- Add `data/` to all testgroup names, so it’s `data/sample` etc. from now on.
- Identify testgroup names by their last part: `data` means `data`, and `sample` means `data/sample`. If authors have both `data/secret/foo` and `data/secret/baz/foo`, then they have themselves to blame.
- Something else.
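The third option needs a lookup that refuses ambiguous names rather than guessing. A sketch, assuming the list of testgroup directories is already known; `build_index` and `resolve` are hypothetical names.

```python
from collections import defaultdict


def build_index(groups: list[str]) -> dict[str, list[str]]:
    """Map each last path component to every testgroup path ending in it."""
    index = defaultdict(list)
    for path in groups:
        index[path.rstrip("/").split("/")[-1]].append(path)
    return dict(index)


def resolve(name: str, index: dict[str, list[str]]) -> str:
    """Resolve a short testgroup name; ambiguous or unknown names are errors."""
    matches = index.get(name, [])
    if len(matches) != 1:
        raise KeyError(f"{name!r} matches {len(matches)} testgroups: {matches}")
    return matches[0]


GROUPS = ["data", "data/sample", "data/secret", "data/secret/foo", "data/secret/baz/foo"]
INDEX = build_index(GROUPS)
print(resolve("sample", INDEX))  # data/sample
```

Here `resolve("foo", INDEX)` raises, which is the "they have themselves to blame" case made explicit.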
Bespoke Verdict Syntax Would Get Rid of Lists and Required / Allowed
An alternative would be to drop the `required` and `allowed` keys and instead bake the expected behaviour into the terminology. After all, there is only a constant number of meaningful cases: `accepted` means “must get exactly AC on all test cases”, `timeout` means “AC and TLE are allowed, and TLE is required”, and `not_wrong` means that `WA` is disallowed (everything else is OK). I guess there are at most 10 different actually-existing cases that ever need to be defined.
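Each bespoke name is just an abbreviation for a fixed pair $(R, A)$, checked with the same $R\subseteq V\subseteq A$ rule. A sketch encoding only the three names mentioned above, assuming the verdict universe is `AC`, `WA`, `TLE`, `RTE`; the table `BESPOKE` and the function `meets` are hypothetical.

```python
# ASSUMPTION: the universe of testcase verdicts for this sketch.
ALL = frozenset({"AC", "WA", "TLE", "RTE"})

# Each bespoke name abbreviates a (required, allowed) pair.
BESPOKE = {
    "accepted": (frozenset({"AC"}), frozenset({"AC"})),         # exactly AC everywhere
    "timeout": (frozenset({"TLE"}), frozenset({"AC", "TLE"})),  # TLE required, AC allowed
    "not_wrong": (frozenset(), ALL - {"WA"}),                   # anything but WA
}


def meets(name: str, verdicts: list[str]) -> bool:
    """Check R ⊆ V ⊆ A for the (R, A) pair behind a bespoke name."""
    required, allowed = BESPOKE[name]
    return required <= set(verdicts) <= allowed


print(meets("timeout", ["AC", "TLE", "AC"]))  # True
```

Coming up with the full list then amounts to filling in this table, one row per actually-existing case.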
This would allow some very useful shorthands.
Q3: Is this sufficiently tempting to try to come up with a list of those cases, and think about good names?
Please comment.