Commit 95b9502

Add discussion of EOL hyphenated words (#41)
1 parent af18901 commit 95b9502

4 files changed

Lines changed: 447 additions & 351 deletions

TEI/PressMint.odd.rnc

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -11,7 +11,7 @@ namespace xi = "http://www.w3.org/2001/XInclude"
1111
namespace xlink = "http://www.w3.org/1999/xlink"
1212
namespace xsl = "http://www.w3.org/1999/XSL/Transform"
1313

14-
# Schema generated from ODD source 2026-04-22T17:48:16Z. 2026-03-18.
14+
# Schema generated from ODD source 2026-04-23T15:30:44Z. 2026-04-23.
1515
# TEI Edition: P5 Version 4.11.0a. Last updated on 6th October 2025, revision 2d8eae701
1616
# TEI Edition Location: https://www.tei-c.org/Vault/P5/4.11.0a/
1717
#

TEI/PressMint.odd.rng

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
55
xmlns:xlink="http://www.w3.org/1999/xlink"
66
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
77
ns="http://www.tei-c.org/ns/1.0"><!--
8-
Schema generated from ODD source 2026-04-22T17:48:11Z. 2026-03-18.
8+
Schema generated from ODD source 2026-04-23T15:30:40Z. 2026-04-23.
99
TEI Edition: P5 Version 4.11.0a. Last updated on 6th October 2025, revision 2d8eae701
1010
TEI Edition Location: https://www.tei-c.org/Vault/P5/4.11.0a/
1111

TEI/PressMint.odd.xml

Lines changed: 101 additions & 17 deletions
@@ -19,7 +19,7 @@
1919
</titleStmt>
2020
<publicationStmt>
2121
<publisher>CLARIN</publisher>
22-
<date>2026-04-22</date>
22+
<date>2026-04-23</date>
2323
<availability status="free">
2424
<p>This file is freely available and you are hereby authorised to copy, modify, and
2525
redistribute it in any way without further reference or permissions.</p>
@@ -73,6 +73,7 @@
7373
</langUsage>
7474
</profileDesc>
7575
<revisionDesc>
76+
<change when="2026-04-23">Tomaž Erjavec: insert discussion of EOL split words.</change>
7677
<change when="2026-04-22">Tomaž Erjavec: introduce idno/@type="local".</change>
7778
<change when="2026-03-18">Tomaž Erjavec: expand and re-arrange discussion about pb/cb/lb.</change>
7879
<change when="2025-11-13">Tomaž Erjavec: add notes about auto-inserted metadata.</change>
@@ -86,7 +87,7 @@
8687
<titlePart type="main">The structure and encoding of
8788
<ref target="https://github.com/clarin-eric/PressMint">PressMint corpora</ref></titlePart>
8889
</docTitle>
89-
<docDate>2026-03-18</docDate>
90+
<docDate>2026-04-23</docDate>
9091
</titlePage>
9192
<p></p>
9293
<divGen type="toc"/>
@@ -1510,7 +1511,9 @@
15101511

15111512
<p>PressMint gives priority to the so-called critical (or structural) transcription of the
15121513
text, i.e. using, as explained above, elements marking divisions, such as individual
1513-
newspaper articles, and paragraphs inside them. The alternative view is the diplomatic one,
1514+
newspaper articles and paragraphs inside them, as well as, in the ideal case,
1515+
the correct reading order of the text.
1516+
The alternative view is the diplomatic one,
15141517
which does not encode structure but rather the visual appearance of the text, i.e. what is a
15151518
page, column or line.</p>
15161519

@@ -1533,14 +1536,13 @@
15331536
for <gi>lb</gi> to the next <gi>lb</gi> element) or, for the last such empty element, to the
15341537
end of the containing element.
15351538

1536-
15371539
As also discussed there, each of these three empty elements can be connected to the surface
15381540
or zone of the facsimile using the <att>facs</att> attribute, which points to the ID of the
1539-
appropriate <gi>surface</gi> or <gi>zone</gi>.
1541+
appropriate <gi>surface</gi> or <gi>zone</gi>.</p>
15401542

1541-
To illustrate the points above, we give below an example that uses page and line beginnings; for simplicity,
1543+
<p>To illustrate the points above, we give below an example that uses page and line beginnings; for simplicity,
15421544
we do not give the <att>facs</att> attribute which would connect them to the (areas of the) image stored in the
1543-
<gi>facsimile</gi> element.
1545+
<gi>facsimile</gi> element:
15441546

15451547
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-lb">
15461548
<body>
@@ -1560,6 +1562,32 @@
15601562
including in the middle of a (end-of-line hyphenated) word, which makes the linguistic
15611563
annotation of such text more complicated, as textual data is mixed with markup, typically
15621564
not otherwise the case.</p>
1565+
1566+
<p>A problem to note is how to treat end-of-line hyphenated words, especially as (depending on
1567+
the language and the word in question) some such words need to be, at least for
1568+
linguistic analysis, joined together omitting the hyphen (as would be the case with
1569+
<code>recom-&lt;lb/&gt;mended</code>) while others need the hyphen retained
1570+
(e.g. <code>fire-&lt;lb/&gt;proof</code>).</p>
1571+
1572+
<p>The end-of-line hyphens can (but need not) be explicitly marked with the <gi>pc</gi>
1573+
element, which has the advantage that <gi>lb</gi> elements need not be introduced or
1574+
preseved (although they still can be). Namely, TEI allows the attribute <att>force</att>
1575+
on <gi>pc</gi>, which indicates whether the punctuation mark is a word separator or
1576+
not. While the default for the hyphen would be <code>pc/@force="inter"</code>, i.e. that
1577+
the hyphen may or may not be a word separator, it can also be set to
1578+
<code>pc/@force="strong"</code> (hyphen is a word separator) or
1579+
<code>pc/@force="weak"</code> (hyphen is not a word separator).
1580+
Furthermore, the <gi>lb</gi> element (as well as <gi>cb</gi> and <gi>pb</gi>)
1581+
can have the <att>break</att> attribute, which signals that the
1582+
line beginning is considered to mark the end of a word in the same way as whitespace,
1583+
and has the values of <code>yes</code>, <code>no</code>, and <code>maybe</code>.
1584+
So, if the end-of-line hyphens are encoded with <gi>pc</gi> and line beginnings are preserved,
1585+
then the default annotation of our example would thus be
1586+
<code>pre&lt;pc force="inter"&gt;-&lt;/pc&gt;&lt;lb break="maybe"/&gt;grešil</code>.
1587+
Of course, the example could also be annotated with <code>pc/@force="weak"</code> and
1588+
<code>lb/@break="no"</code> (or, conversely, with "strong" and "no") but this necessitates
1589+
deciding whether the hyphen signals only a word break or is also a part of the word.
1590+
</p>
15631591
</div>
15641592

15651593
<div xml:id="sec-gaps">
@@ -1652,12 +1680,75 @@
16521680

16531681
Below, we explain the encoding of each of these levels.</p>
16541682

1683+
<div xml:id="sec-ana-tokens">
1684+
<head>Tokenisation and sentence segmentation</head>
1685+
1686+
<p>The basic linguistic annotation comprises sentence segmentation and tokenisation
1687+
(which are often performed by the same tool), as illustrated in the example below:
1688+
1689+
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ana-token">
1690+
<s>
1691+
<w>Tega</w>
1692+
<w>se</w>
1693+
<w>sploh</w>
1694+
<w>nisem</w>
1695+
<w join="right">zavedel</w>
1696+
<pc>.</pc>
1697+
</s>
1698+
</egXML>
1699+
1700+
Sentences are marked up using the <gi>s</gi> element, words with the
1701+
<gi>w</gi> element and punctuation symbols with the <gi>pc</gi> element. To
1702+
retain the linguistically significant whitespace, the <att>join</att> attribute
1703+
with the fixed value <val>right</val> is used, meaning there should be no whitespace
1704+
to the right of the token.</p>
1705+
1706+
<p>A typical problem with tokenisation of printed sources, as already discussed at
1707+
the end of the Section on <ref target="#sec-dipl">Diplomatic view of the text</ref>,
1708+
is that of end-of-line hyphenated words. Namely, to correctly tokenise such words, their
1709+
two parts need to be joined together, and, furthermore, it needs to be decided
1710+
whether the end of line hyphen should be preserved or not. There are two reasons why
1711+
it could be preserved:
1712+
<list>
1713+
<item>the hyphen is, in fact, a part of the word, as in <code>fire-proof</code>;</item>
1714+
1715+
<item>the compilers of the corpus included line beginnings (so, the <gi>lb</gi>
1716+
element) in the "plain text" version of the corpus, and also wish to preserve them
1717+
in the linguistically annotated version.</item>
1718+
1719+
</list>
1720+
1721+
For the first case above a (typically automatic) method needs to be in place that
1722+
will decide which hyphen can be deleted and which not; how this is implemented will
1723+
very much depend on the language.
1724+
1725+
The second case needs this information as well but, furthermore, needs to retain
1726+
word-internal markup. The example below gives two words, where the markup of the first
1727+
indicates that the hyphen is not a part of the word, while in the second one it is:
1728+
1729+
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ana-token-hyph">
1730+
<w>recom<pc force="strong">-</pc><lb break="no"/>mended</w>
1731+
...
1732+
<w>fire<pc force="weak">-</pc><lb break="no"/>proof</w>
1733+
</egXML>
1734+
1735+
It should be noted that when such an encoding is used, then the conversion for
1736+
down-stream formats (such as vertical files for concordancers) where the words
1737+
should be output in their canonical form, should not output the value of
1738+
<code>w/pc[@force="strong"]</code>.
1739+
</p>
1740+
1741+
<p>There can also be a further added complication with tokenisation if the words are
1742+
normalised, which is taken up in the Section on <ref target="#sec-ana-norm">Text
1743+
modernisation</ref>.</p>
1744+
</div>
1745+
16551746
<div xml:id="sec-ana-words">
16561747
<head>Word-level annotation</head>
16571748

1658-
<p>Basic linguistic annotation comprises tokenisation, sentence segmentation,
1659-
part-of-speech tagging and lemmatisation, and this mark-up is illustrated in the
1660-
example below:
1749+
<p>Basic linguistic annotation comprises, apart from tokenisation and sentence
1750+
segmentation, also part-of-speech tagging and lemmatisation, and this mark-up is
1751+
illustrated in the example below:
16611752

16621753
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ana-word">
16631754
<s>
@@ -1669,13 +1760,6 @@
16691760
<pc msd="UPosTag=PUNCT">.</pc>
16701761
</s>
16711762
</egXML>
1672-
1673-
Sentences are marked up using the <gi>s</gi> element, words with the
1674-
<gi>w</gi> element and punctuation symbols with the <gi>pc</gi> element. To
1675-
retain the linguistically significant whitespace, the <att>join</att> element
1676-
with the fixed value <val>right</val> is used, meaning there should be no whitespace
1677-
to the right of the token. There can be an added complication with tokenisation, which is
1678-
further taken up in the next Section on <ref target="#sec-ana-norm">Text modernisation</ref>.
16791763
</p>
16801764

16811765
<p>The base form or lemma of a word is given as the value of the

docs/index.html

Lines changed: 344 additions & 332 deletions
Large diffs are not rendered by default.
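
Note on the down-stream conversion described in the TEI/PressMint.odd.xml changes above: the rule that canonical word forms should omit w/pc[@force="strong"] while keeping other word-internal hyphens, and the use of join="right" to suppress whitespace, could be sketched roughly as below. This is only an illustrative Python sketch and not part of the commit. The function names canonical_form and sentence_text are hypothetical, and the sketch assumes the force and join conventions exactly as the added documentation states them.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    # Drop a "{namespace}" prefix, so the sketch works with or without
    # the TEI namespace declared on the input.
    return element.tag.rsplit("}", 1)[-1]


def canonical_form(token):
    """Return the text of a <w> or <pc> token for down-stream output.

    Assumption (taken from the documentation added in this commit):
    a word-internal pc with force="strong" is an end-of-line hyphen
    that is not part of the word, so its content is skipped; any other
    pc is kept. Empty milestones such as lb contribute no text.
    """
    parts = [token.text or ""]
    for child in token:
        is_eol_hyphen = (
            local_name(child) == "pc" and child.get("force") == "strong"
        )
        if not is_eol_hyphen:
            parts.append("".join(child.itertext()))
        # Text following the child still belongs to the token.
        parts.append(child.tail or "")
    return "".join(parts)


def sentence_text(sentence):
    """Rebuild the plain text of an <s>, honouring join="right"."""
    out = []
    for token in sentence:
        out.append(canonical_form(token))
        # join="right" means no whitespace to the right of the token.
        if token.get("join") != "right":
            out.append(" ")
    return "".join(out).strip()


if __name__ == "__main__":
    word = ET.fromstring(
        '<w>recom<pc force="strong">-</pc><lb break="no"/>mended</w>'
    )
    print(canonical_form(word))
```

A converter to vertical files would presumably call something like canonical_form for each token, while the TEI source itself keeps the hyphen and line beginning untouched for the diplomatic view.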
