|
19 | 19 | </titleStmt> |
20 | 20 | <publicationStmt> |
21 | 21 | <publisher>CLARIN</publisher> |
22 | | - <date>2026-04-22</date> |
| 22 | + <date>2026-04-23</date> |
23 | 23 | <availability status="free"> |
24 | 24 | <p>This file is freely available and you are hereby authorised to copy, modify, and |
25 | 25 | redistribute it in any way without further reference or permissions.</p> |
|
73 | 73 | </langUsage> |
74 | 74 | </profileDesc> |
75 | 75 | <revisionDesc> |
| 76 | + <change when="2026-04-23">Tomaž Erjavec: insert discussion of EOL split words.</change> |
76 | 77 | <change when="2026-04-22">Tomaž Erjavec: introduce idno/@type="local".</change> |
77 | 78 | <change when="2026-03-18">Tomaž Erjavec: expand and re-arrange discussion about pb/cb/lb.</change> |
78 | 79 | <change when="2025-11-13">Tomaž Erjavec: add notes about auto-inserted metadata.</change> |
|
86 | 87 | <titlePart type="main">The structure and encoding of |
87 | 88 | <ref target="https://github.com/clarin-eric/PressMint">PressMint corpora</ref></titlePart> |
88 | 89 | </docTitle> |
89 | | - <docDate>2026-03-18</docDate> |
| 90 | + <docDate>2026-04-23</docDate> |
90 | 91 | </titlePage> |
91 | 92 | <p></p> |
92 | 93 | <divGen type="toc"/> |
|
1510 | 1511 |
|
1511 | 1512 | <p>PressMint gives priority to the so-called critical (or structural) transcription of the |
1512 | 1513 | text, i.e. using, as explained above, elements marking divisions, such as individual |
1513 | | - newspaper articles, and paragraphs inside them. The alternative view is the diplomatic one, |
| 1514 | + newspaper articles and paragraphs inside them, as well as, in the ideal case, |
| 1515 | + the correct reading order of the text. |
| 1516 | + The alternative view is the diplomatic one, |
1514 | 1517 | which does not encode structure but rather the visual appearance of the text, i.e. what is a |
1515 | 1518 | page, column or line.</p> |
1516 | 1519 |
|
|
1533 | 1536 | for <gi>lb</gi> to the next <gi>lb</gi> element) or, for the last such empty element, to the |
1534 | 1537 | end of the containing element. |
1535 | 1538 |
|
1536 | | - |
1537 | 1539 | As also discussed there, each of these three empty elements can be connected to the surface |
1538 | 1540 | or zone of the facsimile using the <att>facs</att> attribute, which points to the ID of the |
1539 | | - appropriate <gi>surface</gi> or <gi>zone</gi>. |
| 1541 | + appropriate <gi>surface</gi> or <gi>zone</gi>.</p> |
1540 | 1542 |
|
1541 | | - To illustrate the points above, we give below an example that uses page and line beginnings; for simplicity, |
| 1543 | + <p>To illustrate the points above, we give below an example that uses page and line beginnings; for simplicity, |
1542 | 1544 | we do not give the <att>facs</att> attribute which would connect them to the (areas of the) image stored in the |
1543 | | - <gi>facsimile</gi> element. |
| 1545 | + <gi>facsimile</gi> element: |
1544 | 1546 |
|
1545 | 1547 | <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-lb"> |
1546 | 1548 | <body> |
|
1560 | 1562 | including in the middle of a (end-of-line hyphenated) word, which makes the linguistic |
1561 | 1563 | annotation of such text more complicated, as textual data is mixed with markup, typically |
1562 | 1564 | not otherwise the case.</p> |
| 1565 | + |
| 1566 | + <p>A problem to note is how to treat end-of-line hyphenated words, especially as (depending on |
| 1567 | + the language and the word in question) some such words need to be, at least for |
| 1568 | + linguistic analysis, joined together omitting the hyphen (as would be the case with |
| 1569 | + <code>recom-<lb/>mended</code>) while others need the hyphen retained |
| 1570 | + (e.g. <code>fire-<lb/>proof</code>).</p> |
| 1571 | + |
| 1572 | + <p>The end-of-line hyphens can (but need not) be explicitly marked with the <gi>pc</gi> |
| 1573 | + element, which has the advantage that <gi>lb</gi> elements need not be introduced or |
| 1574 | + preserved (although they still can be). Namely, TEI allows the attribute <att>force</att> |
| 1575 | + on <gi>pc</gi>, which indicates whether the punctuation mark is a word separator or |
| 1576 | + not. While the default for the hyphen would be <code>pc/@force="inter"</code>, i.e. that |
| 1577 | + the hyphen may or may not be a word separator, it can also be set to |
| 1578 | + <code>pc/@force="strong"</code> (hyphen is a word separator) or |
| 1579 | + <code>pc/@force="weak"</code> (hyphen is not a word separator). |
| 1580 | + Furthermore, the <gi>lb</gi> element (as well as <gi>cb</gi> and <gi>pb</gi>) |
| 1581 | + can have the <att>break</att> attribute, which signals whether the |
| 1582 | + line beginning is considered to mark the end of a word in the same way as whitespace, |
| 1583 | + and has the values <code>yes</code>, <code>no</code>, and <code>maybe</code>. |
| 1584 | + So, if the end-of-line hyphens are encoded with <gi>pc</gi> and line beginnings are preserved, |
| 1585 | + then the default annotation of our example would be |
| 1586 | + <code>pre<pc force="inter">-</pc><lb break="maybe"/>grešil</code>. |
| 1587 | + Of course, the example could also be annotated with <code>pc/@force="weak"</code> and |
| 1588 | + <code>lb/@break="no"</code> (or, conversely, with "strong" and "no"), but this necessitates |
| 1589 | + deciding whether the hyphen signals only a word break or is also a part of the word. |
| 1590 | + </p> |
1563 | 1591 | </div> |
1564 | 1592 |
|
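The three values of `pc/@force` described in the hunk above can be read as a small decision table. The following is a minimal Python sketch of that reading, not part of the PressMint tooling: the function name `candidate_readings` is hypothetical, and the mapping (`strong` drops the hyphen, `weak` keeps it) is an assumption taken from how the document's own `exa-ana-token-hyph` example uses the two values.

```python
def candidate_readings(left, right, force="inter"):
    """List the possible word forms for LEFT-<lb/>RIGHT given pc/@force.

    Assumed convention: "strong" = hyphen only marks the line break,
    "weak" = hyphen is part of the word, "inter" = undecided.
    """
    joined = left + right            # hyphen dropped
    hyphenated = left + "-" + right  # hyphen retained
    if force == "strong":
        return [joined]
    if force == "weak":
        return [hyphenated]
    if force == "inter":
        return [joined, hyphenated]  # undecided: keep both readings
    raise ValueError(f"unexpected pc/@force value: {force!r}")


print(candidate_readings("recom", "mended", "strong"))
print(candidate_readings("fire", "proof", "weak"))
print(candidate_readings("pre", "grešil"))
```

With the default `inter`, both readings are returned, which reflects that the encoding alone does not settle whether the hyphen belongs to the word.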
1565 | 1593 | <div xml:id="sec-gaps"> |
|
1652 | 1680 |
|
1653 | 1681 | Below, we explain the encoding of each of these levels.</p> |
1654 | 1682 |
|
| 1683 | + <div xml:id="sec-ana-tokens"> |
| 1684 | + <head>Tokenisation and sentence segmentation</head> |
| 1685 | + |
| 1686 | + <p>The basic linguistic annotation comprises sentence segmentation and tokenisation |
| 1687 | + (which are often performed by the same tool), as illustrated in the example below: |
| 1688 | + |
| 1689 | + <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ana-token"> |
| 1690 | + <s> |
| 1691 | + <w>Tega</w> |
| 1692 | + <w>se</w> |
| 1693 | + <w>sploh</w> |
| 1694 | + <w>nisem</w> |
| 1695 | + <w join="right">zavedel</w> |
| 1696 | + <pc>.</pc> |
| 1697 | + </s> |
| 1698 | + </egXML> |
| 1699 | + |
| 1700 | + Sentences are marked up using the <gi>s</gi> element, words with the |
| 1701 | + <gi>w</gi> element and punctuation symbols with the <gi>pc</gi> element. To |
| 1702 | + retain the linguistically significant whitespace, the <att>join</att> attribute |
| 1703 | + with the fixed value <val>right</val> is used, meaning there should be no whitespace |
| 1704 | + to the right of the token.</p> |
| 1705 | + |
| 1706 | + <p>A typical problem with tokenisation of printed sources, as already discussed at |
| 1707 | + the end of the Section on <ref target="#sec-dipl">Diplomatic view of the text</ref>, |
| 1708 | + is the treatment of end-of-line hyphenated words. Namely, to correctly tokenise such words, their |
| 1709 | + two parts need to be joined together, and, furthermore, it needs to be decided |
| 1710 | + whether the end of line hyphen should be preserved or not. There are two reasons why |
| 1711 | + it could be preserved: |
| 1712 | + <list> |
| 1713 | + <item>the hyphen is, in fact, a part of the word, as in <code>fire-proof</code>;</item> |
| 1714 | + |
| 1715 | + <item>the compilers of the corpus included line beginnings (so, the <gi>lb</gi> |
| 1716 | + element) in the "plain text" version of the corpus, and also wish to preserve them |
| 1717 | + in the linguistically annotated version.</item> |
| 1718 | + |
| 1719 | + </list> |
| 1720 | + |
| 1721 | + For the first case above a (typically automatic) method needs to be in place that |
| 1722 | + will decide which hyphens can be deleted and which cannot; how this is implemented will |
| 1723 | + very much depend on the language. |
| 1724 | + |
| 1725 | + The second case needs this information as well but, furthermore, needs to retain |
| 1726 | + word-internal markup. The example below gives two words, where the markup of the first |
| 1727 | + indicates that the hyphen is not a part of the word, while in the second one it is: |
| 1728 | + |
| 1729 | + <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ana-token-hyph"> |
| 1730 | + <w>recom<pc force="strong">-</pc><lb break="no"/>mended</w> |
| 1731 | + ... |
| 1732 | + <w>fire<pc force="weak">-</pc><lb break="no"/>proof</w> |
| 1733 | + </egXML> |
| 1734 | + |
| 1735 | + It should be noted that when such an encoding is used, the conversion to |
| 1736 | + downstream formats (such as vertical files for concordancers), where the words |
| 1737 | + should be output in their canonical form, should not output the content of |
| 1738 | + <code>w/pc[@force="strong"]</code>. |
| 1739 | + </p> |
| 1740 | + |
| 1741 | + <p>There can also be a further added complication with tokenisation if the words are |
| 1742 | + normalised, which is taken up in the Section on <ref target="#sec-ana-norm">Text |
| 1743 | + modernisation</ref>.</p> |
| 1744 | + </div> |
| 1745 | + |
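The note above about down-stream conversion (not outputting `w/pc[@force="strong"]`) can be illustrated with a short Python sketch using the standard `xml.etree.ElementTree` module. This is an assumption-laden illustration, not the PressMint conversion code: the helper name `canonical_form` is hypothetical, and the fragment is given without the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET


def canonical_form(w):
    """Return the text of a <w> element without line-break-only hyphens."""
    parts = [w.text or ""]
    for child in w:
        tag = child.tag.rsplit("}", 1)[-1]  # strip a namespace, if present
        # Keep punctuation unless it is marked as a pure line-break hyphen.
        if tag == "pc" and child.get("force") != "strong":
            parts.append(child.text or "")
        # <lb/> is empty, so only the text following each child is added.
        parts.append(child.tail or "")
    return "".join(parts).strip()


# Hypothetical fragment modelled on the exa-ana-token-hyph example.
fragment = """<s>
<w>recom<pc force="strong">-</pc><lb break="no"/>mended</w>
<w>fire<pc force="weak">-</pc><lb break="no"/>proof</w>
</s>"""

forms = [canonical_form(w) for w in ET.fromstring(fragment).iter("w")]
print(forms)
```

A real converter would also need to handle the TEI namespace when searching for `w` elements and any further word-internal markup; the sketch only covers `pc` and `lb`.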
1655 | 1746 | <div xml:id="sec-ana-words"> |
1656 | 1747 | <head>Word-level annotation</head> |
1657 | 1748 |
|
1658 | | - <p>Basic linguistic annotation comprises tokenisation, sentence segmentation, |
1659 | | - part-of-speech tagging and lemmatisation, and this mark-up is illustrated in the |
1660 | | - example below: |
| 1749 | + <p>Basic linguistic annotation comprises, apart from tokenisation and sentence |
| 1750 | + segmentation, also part-of-speech tagging and lemmatisation, and this mark-up is |
| 1751 | + illustrated in the example below: |
1661 | 1752 |
|
1662 | 1753 | <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ana-word"> |
1663 | 1754 | <s> |
|
1669 | 1760 | <pc msd="UPosTag=PUNCT">.</pc> |
1670 | 1761 | </s> |
1671 | 1762 | </egXML> |
1672 | | - |
1673 | | - Sentences are marked up using the <gi>s</gi> element, words with the |
1674 | | - <gi>w</gi> element and punctuation symbols with the <gi>pc</gi> element. To |
1675 | | - retain the linguistically significant whitespace, the <att>join</att> element |
1676 | | - with the fixed value <val>right</val> is used, meaning there should be no whitespace |
1677 | | - to the right of the token. There can be an added complication with tokenisation, which is |
1678 | | - further taken up in the next Section on <ref target="#sec-ana-norm">Text modernisation</ref>. |
1679 | 1763 | </p> |
1680 | 1764 |
|
1681 | 1765 | <p>The base form or lemma of a word is given as the value of the
|