@@ -86,10 +86,14 @@ This directory `tests/` contains:
8686 directly under the `korp` package
8787- [`functional/`](functional): functional tests, typically testing the endpoints
8888 (`korp.views.*`)
89+ - [`testing/`](testing): unit tests for functionality in test utility
90+ modules (`tests.*utils`)
8991- `data/`: test data
9092 - [`data/corpora/src`](data/corpora/src): corpus source data
9193 - [`data/corpora/config`](data/corpora/config): corpus configuration
9294 data
95+ - `data/corpora/cwb-cache`: cached CWB corpus data encoded from
96+ corpus source data (created by the tests)
9397 - [`data/db`](data/db): Korp MySQL database data
9498 - [`data/db/tableinfo`](data/db/tableinfo): YAML files with
9599 information for creating Korp MySQL database tables
@@ -182,38 +186,139 @@ def test_lemgram_count_single_corpus(self, client, database):
182186### Corpus data
183187
184188Each CWB corpus _corpus_ whose data is used in the tests should have a
185- source VRT file _corpus_`.vrt` in `data/corpora/src`. The corpus
186- source files use a slightly extended VRT (VeRticalized Text) format
187- (the input format for CWB), where structures are marked with XML-style
188- tags (with attributes) and each token is on its own line, token
189- attributes separated by tags.
190-
191- The extension is that the positional and structural attributes need to
192- be declared at the top of the file as XML comments as follows:
193- ```
194- <!-- #vrt positional-attributes: attr1 attr2 ... -->
195- <!-- #vrt structural-attributes: text:0+a1+a2 sentence:0+a3+a4 ... -->
196- ```
197- For example:
198- ```
199- <!-- #vrt positional-attributes: word lemma -->
200- <!-- #vrt structural-attributes: text:0+id paragraph:0+id sentence:0+id -->
201- <text id="t1">
202- <paragraph id="p1">
203- <sentence id="s1">
204- </sentence>
205- This this
206- is be
207- a a
208- test test
209- . .
210- <sentence id="s2">
211- Great great
212- ! !
213- </sentence>
214- </paragraph>
215- </text>
216- ```
189+ source file in `data/corpora/src`. Two different corpus source formats
190+ are supported:
191+
192+ 1. A slightly extended VRT (VeRticalized Text) format (the input
193+ format for CWB), in which structural attributes are marked with
194+ XML-style tags (with annotations as element attributes) and each
195+ token is on its own line, with positional (token) attributes
196+ separated by tabs. In VRT, the XML-style tags may _not_ be
197+ indented.
198+
199+ Since the standard VRT content does not specify the names of
200+ positional attributes, the format has been extended so that their
201+ names can be specified in a special XML comment at the top of the
202+ file. A similar comment can also be used to specify the structural
203+ attributes, even though structural attributes can also be inferred
204+ from the file content. See below for more details and alternatives.
205+ For example:
206+
207+ ```xml
208+ <!-- #vrt positional-attributes: word lemma -->
209+ <!-- #vrt structural-attributes: text:0+id paragraph:0+id sentence:0+id+a -->
210+ <text id="t1">
211+ <paragraph id="p1">
212+ <sentence id="s1" a="x">
213+ This this
214+ is be
215+ a a
216+ test test
217+ . .
218+ </sentence>
219+ <sentence id="s2">
220+ Great great
221+ ! !
222+ </sentence>
223+ </paragraph>
224+ </text>
225+ ```
226+
227+ 2. An XML format similar to the XML export formats produced by
228+ [Sparv](https://spraakbanken.gu.se/sparv/): structural attributes
229+ are represented by XML elements as in VRT (but tags can be
230+ indented) and tokens by leaf-level `token` elements with the word
231+ form as the text content and token attributes as element
232+ attributes. For example, the following XML corresponds to the above
233+ VRT:
234+
235+ ```xml
236+ <?xml version='1.0' encoding='UTF-8'?>
237+ <text id="t1">
238+ <paragraph id="p1">
239+ <sentence id="s1" a="x">
240+ <token lemma="this">This</token>
241+ <token lemma="be">is</token>
242+ <token lemma="a">a</token>
243+ <token lemma="test">test</token>
244+ <token lemma=".">.</token>
245+ </sentence>
246+ <sentence id="s2">
247+ <token lemma="great">Great</token>
248+ <token lemma="!">!</token>
249+ </sentence>
250+ </paragraph>
251+ </text>
252+ ```
253+ Possible elements above ` text ` are also included in the data.
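
The correspondence between the two formats can be illustrated with a minimal Python sketch. The function `xml_to_vrt` and its `pos_attrs` parameter are hypothetical names chosen for this illustration, not part of the Korp test utilities; the sketch only assumes what is described above (tokens are `token` elements, the word form is the text content, other positional attributes are element attributes).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def xml_to_vrt(xml_text, pos_attrs=("lemma",)):
    """Convert Sparv-style XML to a list of VRT lines.

    Hypothetical helper for illustration only; not the Korp implementation.
    """
    lines = []

    def visit(elem):
        if elem.tag == "token":
            # The word form is the text content; the remaining positional
            # attributes are read from the element attributes in order.
            values = [elem.text or ""] + [elem.get(a, "") for a in pos_attrs]
            lines.append("\t".join(escape(v) for v in values))
            return
        # Any other element is a structural attribute: emit an unindented
        # start tag, the converted children and an end tag.
        attrs = "".join(f" {k}={quoteattr(v)}" for k, v in elem.attrib.items())
        lines.append(f"<{elem.tag}{attrs}>")
        for child in elem:
            visit(child)
        lines.append(f"</{elem.tag}>")

    visit(ET.fromstring(xml_text))
    return lines
```

Calling `"\n".join(xml_to_vrt(xml_text))` on an XML document like the one above would produce VRT token lines and unindented tags, but without the `#vrt` attribute comments.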
254+
255+ For VRT source files, the positional and structural attributes can be
256+ specified in the following three ways. For XML files, both the
257+ positional and structural attribute names can be inferred from the
258+ data as the positional attributes are named attributes of `token`
259+ elements. However, the first approach can also be used for XML files
260+ to override the inferred attributes.
261+
262+ 1. In a YAML file _corpus_`.attrs.yaml` with content like the
263+ following (for the above examples):
264+ ```yaml
265+ pos_attributes:
266+ - word
267+ - lemma
268+ struct_attributes:
269+ - text:
270+   - id
271+ - sentence:
272+   - id
273+   - a
274+ ```
275+ In addition, if a structural attribute can be recursively nested,
276+ its name should be followed by the recursive nesting depth,
277+ separated by a space or colon:
278+ ```yaml
279+ struct_attributes:
280+ - div 2:
281+   - a5
282+ # …
283+ - np:3: []
284+ ```
285+ If a structural attribute has no annotations, the annotations
286+ should be specified as an empty list.
287+
288+ If _corpus_`.attrs.yaml` lacks `pos_attributes` or
289+ `struct_attributes` information, the missing information is
290+ obtained with approach 2 if applicable, otherwise with approach 3.
291+
292+ 2. If _corpus_`.attrs.yaml` does not exist, the attributes can be
293+ specified at the top of the VRT file as XML comments (an extension
294+ to the VRT format):
295+ ```xml
296+ <!-- #vrt positional-attributes: attr1 attr2 ... -->
297+ <!-- #vrt structural-attributes: text:0+a1+a2 sentence:0+a3+a4 ... -->
298+ ```
299+ Structural attributes are specified in the same way as for the
300+ `cwb-encode` tool. See the VRT file example above for a concrete
301+ example.
302+
303+ 3. If _corpus_`.attrs.yaml` does not exist and the VRT file does not
304+ have a `positional-attributes` comment, positional attribute names
305+ are taken in order from the list `word lemma pos msd deprel
306+ dephead ref lex/`, one name for each tab-separated attribute on the
307+ first token line. If the token line has more attributes than the
308+ list has names, the rest are named `attr`_n_, where _n_ is the
309+ number of the attribute.
310+
311+ If the VRT file has no `structural-attributes` comment, the
312+ structural attributes and their annotations are inferred based on
313+ the content of the VRT file.
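
As a rough illustration of how the `struct_attributes` entries of approach 1 relate to the `cwb-encode`-style specifications of approach 2, the following Python sketch converts the parsed YAML structure (a list of single-key mappings) to strings such as `text:0+id`. The function names are hypothetical, and the sketch assumes that a name without an explicit nesting depth corresponds to depth 0; the actual test utilities may work differently.

```python
import re


def parse_struct_name(spec):
    """Split "div 2" or "np:3" into (name, depth).

    Assumption: the depth defaults to 0 when it is not given.
    """
    match = re.fullmatch(r"(\w+)(?:[ :](\d+))?", spec.strip())
    if match is None:
        raise ValueError(f"invalid structural attribute name: {spec!r}")
    return match.group(1), int(match.group(2) or 0)


def struct_specs(struct_attributes):
    """Convert parsed struct_attributes YAML to cwb-encode-style specs."""
    specs = []
    for item in struct_attributes:
        for key, annotations in item.items():
            name, depth = parse_struct_name(key)
            # An empty annotation list yields just "name:depth"
            specs.append("+".join([f"{name}:{depth}", *(annotations or [])]))
    return specs
```

For instance, `struct_specs([{"div 2": ["a5"]}, {"np:3": []}])` would give the specifications for the recursive attributes of the second YAML example in approach 1.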
314+
315+ In approaches 1 and 2, a trailing slash in the name of a positional
316+ attribute or structural attribute annotation is passed to `cwb-encode`
317+ to indicate that its values are to be validated and normalized as
318+ feature sets (multi-valued). Approach 3 infers that a positional
319+ attribute or structural attribute annotation is feature-set-valued if
320+ all its values begin and end with a vertical bar `|`. It is also
321+ inferred similarly from XML data.
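
The inference rules of approach 3 can be sketched in Python as follows. This is only an illustration of the rules as described above, not the code used by the tests: the function names are hypothetical, and the sketch assumes that _n_ in `attr`_n_ is the 1-based position of the attribute on the token line and that an attribute with no values is not treated as feature-set-valued.

```python
# Default positional attribute names listed in approach 3
DEFAULT_POS_ATTRS = "word lemma pos msd deprel dephead ref lex/".split()


def default_pos_attr_names(first_token_line):
    """Name the positional attributes based on the first token line."""
    count = len(first_token_line.rstrip("\n").split("\t"))
    names = DEFAULT_POS_ATTRS[:count]
    # Assumption: extra attributes are numbered by their 1-based position
    names += [f"attr{n}" for n in range(len(names) + 1, count + 1)]
    return names


def is_feature_set(values):
    """Return True if all values begin and end with a vertical bar."""
    values = list(values)
    # Assumption: no values at all means "not a feature set"
    return bool(values) and all(
        v.startswith("|") and v.endswith("|") for v in values)
```

With these rules, a token line with two tab-separated values gets the attribute names `word` and `lemma`, and an attribute whose values are all of the form `|a|b|` is inferred to be feature-set-valued.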
217322
218323In addition to the VRT file _corpus_`.vrt`, a corpus should have a
219324corresponding info file _corpus_`.info` containing at least the number
@@ -225,7 +330,9 @@ Updated: 2023-01-20
225330
226331Note that the encoded test corpus data is placed under a temporary
227332directory for the duration of a test session, so test corpora are
228- isolated from any other CWB corpora in the system.
333+ isolated from any other CWB corpora in the system. Encoded test corpus
334+ data is cached under `tests/data/corpora/cwb-cache` between test
335+ sessions, to avoid re-encoding it in each session.
229336
230337
231338### Corpus configuration data