spraakbanken
diff --git a/tests/README.md b/tests/README.md
Lines changed: 140 additions & 33 deletions
@@ -86,10 +86,14 @@ This directory `tests/` contains:
   directly under the `korp` package
 - [`functional/`](functional): functional tests, typically testing the endpoints
   (`korp.views.*`)
+- [`testing/`](testing): unit tests for functionality in test utility
+  modules (`tests.*utils`)
 - `data/`: test data
   - [`data/corpora/src`](data/corpora/src): corpus source data
   - [`data/corpora/config`](data/corpora/config): corpus configuration
     data
+  - `data/corpora/cwb-cache`: cached CWB corpus data encoded from
+    corpus source data (created by the tests)
   - [`data/db`](data/db): Korp MySQL database data
   - [`data/db/tableinfo`](data/db/tableinfo): YAML files with
     information for creating Korp MySQL database tables
@@ -182,38 +186,139 @@ def test_lemgram_count_single_corpus(self, client, database):
 ### Corpus data
 
 Each CWB corpus _corpus_ whose data is used in the tests should have a
-source VRT file _corpus_`.vrt` in `data/corpora/src`. The corpus
-source files use a slightly extended VRT (VeRticalized Text) format
-(the input format for CWB), where structures are marked with XML-style
-tags (with attributes) and each token is on its own line, token
-attributes separated by tags.
-
-The extension is that the positional and structural attributes need to
-be declared at the top of the file as XML comments as follows:
-```
-<!-- #vrt positional-attributes: attr1 attr2 ... -->
-<!-- #vrt structural-attributes: text:0+a1+a2 sentence:0+a3+a4 ... -->
-```
-For example:
-```
-<!-- #vrt positional-attributes: word lemma -->
-<!-- #vrt structural-attributes: text:0+id paragraph:0+id sentence:0+id -->
-<text id="t1">
-<paragraph id="p1">
-<sentence id="s1">
-</sentence>
-This	this
-is	be
-a	a
-test	test
-.	.
-<sentence id="s2">
-Great	great
-!	!
-</sentence>
-</paragraph>
-</text>
-```
+source file in `data/corpora/src`. Two different corpus source formats
+are supported:
+
+1. A slightly extended VRT (VeRticalized Text) format (the input
+   format for CWB), in which structural attributes are marked with
+   XML-style tags (with annotations as element attributes) and each
+   token is on its own line, with positional (token) attributes
+   separated by tabs. In VRT, the XML-style tags may _not_ be
+   indented.
+
+   Since the standard VRT content does not specify the names of
+   positional attributes, the format has been extended so that their
+   names can be specified in a special XML comment at the top of the
+   file. A similar comment can also be used to specify the structural
+   attributes, even though structural attributes can also be inferred
+   from the file content. See below for more details and alternatives.
+   For example:
+
+   ```xml
+   <!-- #vrt positional-attributes: word lemma -->
+   <!-- #vrt structural-attributes: text:0+id paragraph:0+id sentence:0+id+a -->
+   <text id="t1">
+   <paragraph id="p1">
+   <sentence id="s1" a="x">
+   This	this
+   is	be
+   a	a
+   test	test
+   .	.
+   </sentence>
+   <sentence id="s2">
+   Great	great
+   !	!
+   </sentence>
+   </paragraph>
+   </text>
+   ```
+
+2. An XML format similar to the XML export formats produced by
+   [Sparv](https://spraakbanken.gu.se/sparv/): structural attributes
+   are represented by XML elements as in VRT (but tags can be
+   indented), and tokens by leaf-level `token` elements with the word
+   form as the text content and the other token attributes as element
+   attributes. For example, the following XML corresponds to the above
+   VRT:
+
+   ```xml
+   <?xml version='1.0' encoding='UTF-8'?>
+   <text id="t1">
+     <paragraph id="p1">
+       <sentence id="s1" a="x">
+         <token lemma="this">This</token>
+         <token lemma="be">is</token>
+         <token lemma="a">a</token>
+         <token lemma="test">test</token>
+         <token lemma=".">.</token>
+       </sentence>
+       <sentence id="s2">
+         <token lemma="great">Great</token>
+         <token lemma="!">!</token>
+       </sentence>
+     </paragraph>
+   </text>
+   ```
+   Any elements enclosing `text` are also included in the data.
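
The correspondence between the two formats can be sketched in Python. This is only a minimal illustration, not the conversion used by the test utilities: the function name `xml_to_vrt_lines` and its `pos_attrs` parameter are hypothetical, and attribute values are written out without XML escaping.

```python
import xml.etree.ElementTree as ET


def xml_to_vrt_lines(xml_text, pos_attrs):
    """Flatten Sparv-style XML into unindented VRT lines (illustrative only).

    pos_attrs lists the positional attribute names; the first one is
    assumed to be the word form, taken from the text content of `token`.
    """
    lines = []

    def visit(elem):
        if elem.tag == "token":
            # Word form first, then the remaining positional attributes
            # read from the element attributes of the token.
            values = [elem.text or ""]
            values += [elem.get(name, "") for name in pos_attrs[1:]]
            lines.append("\t".join(values))
            return
        # Any other element is treated as a structural attribute.
        attrs = "".join(f' {key}="{value}"' for key, value in elem.attrib.items())
        lines.append(f"<{elem.tag}{attrs}>")
        for child in elem:
            visit(child)
        lines.append(f"</{elem.tag}>")

    visit(ET.fromstring(xml_text))
    return lines


sample = """<text id="t1">
  <sentence id="s1">
    <token lemma="this">This</token>
    <token lemma="be">is</token>
  </sentence>
</text>"""

print("\n".join(xml_to_vrt_lines(sample, ["word", "lemma"])))
```

The sketch emits each structural tag on its own unindented line and each token as a tab-separated line, which is the layout that VRT requires.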
+
+For VRT source files, the positional and structural attributes can be
+specified in the following three ways. For XML files, both positional
+and structural attribute names can be inferred from the data, since
+positional attributes are named attributes of the `token` elements.
+However, approach 1 can also be used for XML files to override the
+inferred attributes.
+
+1. In a YAML file _corpus_`.attrs.yaml` with content like the
+   following (for the above examples):
+   ```yaml
+   pos_attributes:
+   - word
+   - lemma
+   struct_attributes:
+   - text:
+     - id
+   - paragraph:
+     - id
+   - sentence:
+     - id
+     - a
+   ```
+   In addition, if a structural attribute can be recursively nested,
+   its name should be followed by the recursive nesting depth,
+   separated by a space or colon:
+   ```yaml
+   struct_attributes:
+   - div 2:
+     - a5
+     # …
+   - np:3: []
+   ```
+   If a structural attribute has no annotations, the annotations
+   should be specified as an empty list.
+
+   If _corpus_`.attrs.yaml` lacks `pos_attributes` or
+   `struct_attributes` information, the missing information is
+   obtained with approach 2 if applicable, otherwise with approach 3.
+
+2. If _corpus_`.attrs.yaml` does not exist, the attributes can be
+   specified at the top of the VRT file as XML comments (an extension
+   to the VRT format):
+   ```xml
+   <!-- #vrt positional-attributes: attr1 attr2 ... -->
+   <!-- #vrt structural-attributes: text:0+a1+a2 sentence:0+a3+a4 ... -->
+   ```
+   Structural attributes are specified in the same way as for the
+   `cwb-encode` tool. See the VRT file example above for a concrete
+   example.
+
+3. If _corpus_`.attrs.yaml` does not exist and the VRT file does not
+   have a `positional-attributes` comment, positional attribute names
+   are taken in order from the list `word lemma pos msd deprel
+   dephead ref lex/`, one name for each tab-separated attribute on
+   the first token line. If the token line has more attributes than
+   the list has names, the rest are named `attr`_n_, where _n_ is the
+   number of the attribute.
+
+   If the VRT file has no `structural-attributes` comment, the
+   structural attributes and their annotations are inferred based on
+   the content of the VRT file.
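
Approaches 2 and 3 for positional attributes might look roughly as follows in Python. This is a hedged sketch rather than the actual test utility code: the names `declared_attrs` and `pos_attr_names` are hypothetical, _n_ is assumed to be the 1-based position of the attribute, and a token line is assumed to be any non-empty line that does not start with `<`.

```python
import re

# Default positional attribute names listed in approach 3.
DEFAULT_POS_ATTRS = ["word", "lemma", "pos", "msd", "deprel", "dephead", "ref", "lex/"]

# Matches the "#vrt" declaration comments described in approach 2.
VRT_COMMENT = re.compile(
    r"<!--\s*#vrt\s+(positional|structural)-attributes:\s*(.*?)\s*-->"
)


def declared_attrs(vrt_lines):
    """Return attribute declarations found in "#vrt" comments, keyed by kind."""
    declared = {}
    for line in vrt_lines:
        match = VRT_COMMENT.match(line.strip())
        if match:
            declared[match.group(1)] = match.group(2).split()
    return declared


def pos_attr_names(vrt_lines):
    """Use the declared positional attributes, else fall back to defaults."""
    declared = declared_attrs(vrt_lines)
    if "positional" in declared:
        return declared["positional"]
    # Assumption: the first non-empty line not starting with "<" is a token line.
    first_token = next(
        (line for line in vrt_lines if line.strip() and not line.startswith("<")),
        None,
    )
    if first_token is None:
        return []
    count = len(first_token.rstrip("\n").split("\t"))
    names = DEFAULT_POS_ATTRS[:count]
    # Attributes beyond the default list get generic names (assumed 1-based).
    names += [f"attr{n}" for n in range(len(names) + 1, count + 1)]
    return names


print(pos_attr_names(["<text>", "This\tthis\tPRON", "</text>"]))
```

Structural attribute inference from the file content is left out of the sketch, since the text does not describe how it is done.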
+
+In approaches 1 and 2, a trailing slash in the name of a positional
+attribute or structural attribute annotation is passed to `cwb-encode`
+to indicate that its values are to be validated and normalized as
+feature sets (multi-valued). Approach 3 infers that a positional
+attribute or structural attribute annotation is feature-set-valued if
+all its values begin and end with a vertical bar `|`. The same
+inference is applied to XML data.
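
The feature-set inference rule can be illustrated with a short Python sketch. The helper names `looks_like_feature_set` and `mark_feature_sets` are hypothetical, and treating an attribute with no values at all as not feature-set-valued is an assumption, as the text does not cover that case.

```python
def looks_like_feature_set(values):
    """True if every value begins and ends with a vertical bar."""
    values = list(values)
    # Assumption: an attribute without any values is not a feature set.
    return bool(values) and all(
        value.startswith("|") and value.endswith("|") for value in values
    )


def mark_feature_sets(names, token_rows):
    """Append a trailing slash to the names of feature-set-valued columns."""
    columns = list(zip(*token_rows))  # transpose rows into per-attribute columns
    return [
        name + "/" if looks_like_feature_set(column) and not name.endswith("/") else name
        for name, column in zip(names, columns)
    ]


rows = [("dog", "|dog..nn.1|"), ("runs", "|")]
print(mark_feature_sets(["word", "lex"], rows))
```

A lone `|` counts as a match here because it both begins and ends with a vertical bar; in CWB it denotes the empty feature set.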
 
 In addition to the VRT file _corpus_`.vrt`, a corpus should have a
 corresponding info file _corpus_`.info` containing at least the number
@@ -225,7 +330,9 @@ Updated: 2023-01-20
 
 Note that the encoded test corpus data is placed under a temporary
 directory for the duration of a test session, so test corpora are
-isolated from any other CWB corpora in the system.
+isolated from any other CWB corpora in the system. Encoded test corpus
+data is cached under `tests/data/corpora/cwb-cache` between test
+sessions, to avoid re-encoding it in each session.
 
 
 ### Corpus configuration data