Commit 2ec3d5c

Merge pull request #45 from CSCfi/t-tests-improve-corpusutils
Tests: Improvements to test corpus utilities
2 parents 83b76e3 + 25d9082 commit 2ec3d5c

10 files changed, 928 additions and 78 deletions

tests/README.md

Lines changed: 140 additions & 33 deletions
@@ -86,10 +86,14 @@ This directory `tests/` contains:
8686
directly under the `korp` package
8787
- [`functional/`](functional): functional tests, typically testing the endpoints
8888
(`korp.views.*`)
89+
- [`testing/`](testing): unit tests for functionality in test utility
90+
modules (`tests.*utils`)
8991
- `data/`: test data
9092
- [`data/corpora/src`](data/corpora/src): corpus source data
9193
- [`data/corpora/config`](data/corpora/config): corpus configuration
9294
data
95+
- `data/corpora/cwb-cache`: cached CWB corpus data encoded from
96+
corpus source data (created by the tests)
9397
- [`data/db`](data/db): Korp MySQL database data
9498
- [`data/db/tableinfo`](data/db/tableinfo): YAML files with
9599
information for creating Korp MySQL database tables
@@ -182,38 +186,139 @@ def test_lemgram_count_single_corpus(self, client, database):
182186
### Corpus data
183187

184188
Each CWB corpus _corpus_ whose data is used in the tests should have a
185-
source VRT file _corpus_`.vrt` in `data/corpora/src`. The corpus
186-
source files use a slightly extended VRT (VeRticalized Text) format
187-
(the input format for CWB), where structures are marked with XML-style
188-
tags (with attributes) and each token is on its own line, token
189-
attributes separated by tags.
190-
191-
The extension is that the positional and structural attributes need to
192-
be declared at the top of the file as XML comments as follows:
193-
```
194-
<!-- #vrt positional-attributes: attr1 attr2 ... -->
195-
<!-- #vrt structural-attributes: text:0+a1+a2 sentence:0+a3+a4 ... -->
196-
```
197-
For example:
198-
```
199-
<!-- #vrt positional-attributes: word lemma -->
200-
<!-- #vrt structural-attributes: text:0+id paragraph:0+id sentence:0+id -->
201-
<text id="t1">
202-
<paragraph id="p1">
203-
<sentence id="s1">
204-
</sentence>
205-
This this
206-
is be
207-
a a
208-
test test
209-
. .
210-
<sentence id="s2">
211-
Great great
212-
! !
213-
</sentence>
214-
</paragraph>
215-
</text>
216-
```
189+
source file in `data/corpora/src`. Two different corpus source formats
190+
are supported:
191+
192+
1. A slightly extended VRT (VeRticalized Text) format (the input
193+
format for CWB), in which structural attributes are marked with
194+
XML-style tags (with annotations as element attributes) and each
195+
token is on its own line, with positional (token) attributes
196+
separated by tabs. In VRT, the XML-style tags may _not_ be
197+
indented.
198+
199+
Since the standard VRT content does not specify the names of
200+
positional attributes, the format has been extended so that their
201+
names can be specified in a special XML comment at the top of the
202+
file. A similar comment can also be used to specify the structural
203+
attributes, even though structural attributes can also be inferred
204+
from the file content. See below for more details and alternatives.
205+
For example:
206+
207+
```xml
208+
<!-- #vrt positional-attributes: word lemma -->
209+
<!-- #vrt structural-attributes: text:0+id paragraph:0+id sentence:0+id -->
210+
<text id="t1">
211+
<paragraph id="p1">
212+
<sentence id="s1" a="x">
213+
This this
214+
is be
215+
a a
216+
test test
217+
. .
218+
</sentence>
219+
<sentence id="s2">
220+
Great great
221+
! !
222+
</sentence>
223+
</paragraph>
224+
</text>
225+
```
226+
227+
2. An XML format similar to the XML export formats produced by
228+
[Sparv](https://spraakbanken.gu.se/sparv/): structural attributes
229+
are represented by XML elements like in VRT (but tags can be
230+
indented) and tokens by leaf-level `token` elements with the word
231+
form as the text content and token attributes as element
232+
attributes. For example, the following XML corresponds to the above
233+
VRT:
234+
235+
```xml
236+
<?xml version='1.0' encoding='UTF-8'?>
237+
<text id="t1">
238+
<paragraph id="p1">
239+
<sentence id="s1" a="x">
240+
<token lemma="this">This</token>
241+
<token lemma="be">is</token>
242+
<token lemma="a">a</token>
243+
<token lemma="test">test</token>
244+
<token lemma=".">.</token>
245+
</sentence>
246+
<sentence id="s2">
247+
<token lemma="great">Great</token>
248+
<token lemma="!">!</token>
249+
</sentence>
250+
</paragraph>
251+
</text>
252+
```
253+
Possible elements above `text` are also included in the data.
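
The correspondence between the two source formats can be illustrated with a short Python sketch. This is not the code used by the test utilities: `xml_to_vrt` and its `pos_attrs` parameter are hypothetical names, and the sketch assumes that the positional attributes other than the word form are listed explicitly rather than inferred.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def xml_to_vrt(xml_text, pos_attrs=("lemma",)):
    """Return VRT lines for Sparv-style XML (hypothetical helper)."""
    lines = []

    def visit(elem):
        if elem.tag == "token":
            # The word form is the text content; the other positional
            # attributes are element attributes, output in the given order
            values = [(elem.text or "").strip()]
            values += [elem.get(name, "") for name in pos_attrs]
            lines.append("\t".join(values))
            return
        # Any other element is a structural attribute: write unindented
        # start and end tags around its content
        attrs = "".join(
            f" {name}={quoteattr(value)}" for name, value in elem.attrib.items()
        )
        lines.append(f"<{elem.tag}{attrs}>")
        for child in elem:
            visit(child)
        lines.append(f"</{elem.tag}>")

    visit(ET.fromstring(xml_text))
    return lines


sample = (
    '<text id="t1"><sentence id="s1">'
    '<token lemma="be">is</token>'
    "</sentence></text>"
)
print("\n".join(xml_to_vrt(sample)))
```

The sketch writes tags without indentation, as VRT requires, and separates the positional attributes of a token with tabs.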
254+
255+
For VRT source files, the positional and structural attributes can be
256+
specified in the following three ways. For XML files, both the
257+
positional and structural attribute names can be inferred from the
258+
data as the positional attributes are named attributes of `token`
259+
elements. However, the first approach can also be used for XML files
260+
to override the inferred attributes.
261+
262+
1. In a YAML file _corpus_`.attrs.yaml` with content like the
263+
following (for the above examples):
264+
```yaml
265+
pos_attributes:
266+
- word
267+
- lemma
268+
struct_attributes:
269+
- text:
270+
- id
271+
- sentence:
272+
- id
273+
- a
274+
```
275+
In addition, if a structural attribute can be recursively nested,
276+
its name should be followed by the recursive nesting depth,
277+
separated by a space or colon:
278+
```yaml
279+
struct_attributes:
280+
- div 2:
281+
- a5
282+
#
283+
- np:3: []
284+
```
285+
If a structural attribute has no annotations, the annotations
286+
should be specified as an empty list.
287+
288+
If _corpus_`.attrs.yaml` lacks `pos_attributes` or
289+
`struct_attributes` information, the missing information is
290+
obtained with approach 2 if applicable, otherwise with approach 3.
291+
292+
2. If _corpus_`.attrs.yaml` does not exist, the attributes can be
293+
specified at the top of the VRT file as XML comments (an extension
294+
to the VRT format):
295+
```xml
296+
<!-- #vrt positional-attributes: attr1 attr2 ... -->
297+
<!-- #vrt structural-attributes: text:0+a1+a2 sentence:0+a3+a4 ... -->
298+
```
299+
Structural attributes are specified in the same way as for the
300+
`cwb-encode` tool. See the VRT file example above for a concrete
301+
example.
302+
303+
3. If _corpus_`.attrs.yaml` does not exist and the VRT file does not
304+
have a `positional-attributes` comment, positional attribute names
305+
are first taken from the following list: `word lemma pos msd deprel
306+
dephead ref lex/`, as many names as the first token line has
307+
tab-separated attributes. If the token line has more attributes,
308+
the rest are named as `attr`_n_, where _n_ is the number of the
309+
attribute.
310+
311+
If the VRT file has no `structural-attributes` comment, the
312+
structural attributes and their annotations are inferred based on
313+
the content of the VRT file.
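
The default naming of positional attributes in approach 3 could be sketched roughly as follows. The function name `default_pos_attr_names` is hypothetical, and the sketch assumes that _n_ in `attr`_n_ is the 1-based position of the attribute on the token line, which the description above does not state explicitly.

```python
# Default names listed in approach 3; the trailing slash marks a
# feature-set-valued attribute
DEFAULT_POS_ATTRS = ["word", "lemma", "pos", "msd", "deprel", "dephead", "ref", "lex/"]


def default_pos_attr_names(first_token_line):
    """Return attribute names for the first token line (hypothetical helper)."""
    count = len(first_token_line.rstrip("\n").split("\t"))
    # Take as many default names as the line has tab-separated values
    names = DEFAULT_POS_ATTRS[:count]
    # Assumption: extra attributes are numbered by their 1-based position
    names += [f"attr{n}" for n in range(len(DEFAULT_POS_ATTRS) + 1, count + 1)]
    return names


print(default_pos_attr_names("This\tthis"))
```

For the two-column token lines of the VRT example above, this yields the names `word` and `lemma`.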
314+
315+
In approaches 1 and 2, a trailing slash in the name of a positional
316+
attribute or structural attribute annotation is passed to `cwb-encode`
317+
to indicate that its values are to be validated and normalized as
318+
feature sets (multi-valued). Approach 3 infers that a positional
319+
attribute or structural attribute annotation is feature-set-valued if
320+
all its values begin and end with a vertical bar `|`. It is also
321+
inferred similarly from XML data.
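
The feature-set inference described above amounts to a check like the following sketch. The name `is_feature_set` is hypothetical, and treating an attribute with no values at all as not feature-set-valued is an assumption, not something the description specifies.

```python
def is_feature_set(values):
    """Return True if all values look like feature sets (hypothetical helper)."""
    values = list(values)
    # Assumption: with no values to inspect, do not infer a feature set
    if not values:
        return False
    # Every value must both begin and end with a vertical bar
    return all(v.startswith("|") and v.endswith("|") for v in values)


print(is_feature_set(["|a|b|", "|"]))
print(is_feature_set(["|a|", "b"]))
```

A name inferred to be feature-set-valued would then get the trailing slash when passed to `cwb-encode`, in the same way as explicitly declared names in approaches 1 and 2.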
217322

218323
In addition to the VRT file _corpus_`.vrt`, a corpus should have a
219324
corresponding info file _corpus_`.info` containing at least the number
@@ -225,7 +330,9 @@ Updated: 2023-01-20
225330

226331
Note that the encoded test corpus data is placed under a temporary
227332
directory for the duration of a test session, so test corpora are
228-
isolated from any other CWB corpora in the system.
333+
isolated from any other CWB corpora in the system. Encoded test corpus
334+
data is cached under `tests/data/corpora/cwb-cache` between test
335+
sessions, to avoid re-encoding it in each session.
229336

230337

231338
### Corpus configuration data
