<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt'?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc comments="yes"?>
<?rfc editing="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<rfc number="TODO" category="info">
<front>
    <title abbrev="Hyphenation Definitions Standard">A standard for prioritised and dynamic hyphenation definitions</title>
    <author initials="S." surname="van Geloven" fullname="Sander van Geloven">
        <organization abbrev="OpenTaal">Stichting OpenTaal</organization>
        <address>
            <postal>
                <street></street>
                <city></city>
                <country>Netherlands</country>
            </postal>
            <email>sander.vangeloven@opentaal.org</email>
            <uri>http://www.opentaal.org</uri>
        </address>
    </author>
    <date month="January" year="2014" />
    <area>General</area>
    <keyword>lexicology</keyword>
    <keyword>orthography</keyword>
    <keyword>hyphenation</keyword>
    <keyword>standard</keyword>
    <abstract>
        <t>This document describes a standard for hyphenation definitions that enables the generation of prioritised and dynamic hyphenation patterns. In the early 1980s, automatic hyphenation of lexical items was made possible by a hyphenator using language-specific hyphenation patterns. These patterns are generated by the hyphenation software community from hyphenated word lists. The initial design was based on English orthography and a limited character encoding. Support for extended encodings was added in the 1990s, mostly for Western languages. However, the hyphenated word list format remained largely unchanged. This complicated the support of specific morphological or phonological structures that require hyphenation priority in compounds or dynamic hyphenation resulting in altered spelling. Although over 70 languages are supported now, hyphenation is suboptimal, and it is impossible for languages relying on a universal character encoding. This limited method of hyphenation has served digital typesetting for over three decades. Unfortunately, recently implemented hyphenation in layout engines for web page rendering is built upon the same outdated technology. An improved hyphenator and extended hyphenation patterns are necessary to overcome current limitations and support a wider range of languages. To achieve this, the software community needs a standard format for hyphenation definitions in universal human-readable hyphenated word lists. A context-free grammar was developed that offers unambiguous and fine-grained control, allowing enhanced hyphenation. All language-specific cases are illustrated with examples and lexicological theory. Our standard for hyphenation definitions enables improved automatic hyphenation for printed media and web documents.</t>
    </abstract>
</front>
<middle>
    <section anchor="introduction" title="Introduction">
        <t>Automated hyphenation of text emerged in recent decades and has gone through several growth spurts since. Unfortunately, the hyphenation patterns currently used by the hyphenation algorithm cannot offer prioritised or dynamic hyphenation. To make the next developmental leap and overcome this limitation, these patterns need to be generated from prioritised and dynamic hyphenation definitions. A detailed and illustrated standard for these definitions is described in this document.</t>
        <section title="Requirements language">
            <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in <xref target="RFC2119">RFC 2119</xref> only when they appear in all upper case.  They may also appear in lower or mixed case as English words, without special meaning.</t>
        </section>
        <section title="Language tags">
            <t>References to specific orthographies are made according to <xref target="BCP47">BCP 47</xref>. For example, "de-CH-1996" represents German as used in Switzerland and written according to the spelling reform that began in 1996, and "de-1901" represents the German orthography reform of 1901.</t>
        </section>
        <section title="Character encoding">
            <t>References to specific characters in this document are always made via <xref target="UNICODE">Unicode</xref> characters and code points. A Unicode code point can be recognised by a capital U, followed by a plus sign and four to six hexadecimal digits. Usually, four or five digits are used. A Unicode character is shown between single quotation marks and the Unicode name of the character is written in all capitals. An example is the code point U+003D, which indicates the character '=', known as the EQUALS SIGN.</t>
        </section>
        <section title="Format description">
            <t>The format is formally described by a grammar in <xref target="ISO14977">Extended Backus-Naur Form (EBNF)</xref>. This notation allows hyphenation definitions to be written, validated and parsed by means of a context-free grammar. Rules and comments for this grammar are recognisable in this document by ::= and /* respectively. The syntax of all accompanying examples, recognisable by a #, always conforms to this grammar.</t>
        </section>
        <section title="Design decisions">
            <t>Compiling an international standard involves making many decisions. It is by no means a trivial task. For example, selecting a reserved character involves checking whether that character is not used in words. Words are normally considered to be a concatenation of characters separated by spaces or punctuation, but this differs substantially amongst written languages. What might be a practical choice for one language could be incompatible with another. Likewise, this standard does not concern itself with the validity of the resulting hyphenations. This is left up to the users, as languages, and even dialects, have different rules and exceptions based on etymological, morphological or phonetic principles. It is key that the designed format offers the end user a maximum degree of freedom and flexibility.</t>
        </section>
    </section>
    <section title="Hyphenation">
        <t>TODO general introduction and example</t>
            <figure>
<artwork><![CDATA[# Examples of hyphenated text in English and Dutch.
#
#    An extre-         Een boom met prui-
#    mely long         men die men als ei-
#    English           eren beschrijft be-
#    word over-        treft hun omvang.
#    looking a         Iemand wilde pluk-
#    nice sen-         ken zonder toestem-
#    tence as          ming te hebben. Er
#    a beauti-         werd ook nog gespro-
#    ful exam-         ken dat hij een har-
#    ple here.         tendiefje was.
]]></artwork>
            </figure>

        <section anchor="general_hyphenation" title="Hyphenation in general">
            <t>TODO general concept</t>
            <!--t> ... Hyphenation is possible on one or several so called hyphenations points of a word. These are usually the in between of consecutive syllables. For aesthetic reasons, a hyphenation point is never after the first or before the last syllable when that syllable consists only of one character.</t-->
        </section>
        <section anchor="history" title="History">
            <t>TODO implementations and patgen and refs<!--(what were de facto standard or common use) patgen patgen2-->
            <xref target="Lia83">todo</xref>
            <xref target="Nem06">todo</xref>
            <xref target="TM14">asdf</xref>
            <xref target="SS95">asdf</xref>
            <xref target="Soj95">asdf</xref>
            <xref target="Har09">asdf</xref>
            <xref target="MR08">asdf</xref>
            <xref target="Lem03">asdf</xref>
            <xref target="Lem05">asdf</xref>
            <xref target="Hen08">asdf</xref>
            <xref target="BS92">asdf</xref>
            <xref target="W3C11">asdf</xref>
            <xref target="W3C13b">asdf</xref>
            <xref target="W3C13a">asdf</xref>
            <xref target="W3C99">asdf</xref>

            </t>
        </section>
        <section anchor="automated_hyphenation" title="Automated hyphenation">
            <t>TODO the challenge and paper/webpage <!-- (only concept) usage of hyphenation patterns which are generated from hyphenation definitions by software such as patgen. TODO talk more about the challenge TODO some history-->
     create word list for language or dialect
                     generate suggested hyphenation definitions
                     manually review hyphenation definitions
            </t>
            <figure>
<artwork><![CDATA[Process of delivering automated hyphenation

    +---------------------+
    |   word list for a   |
    | language or dialect |
    +---------------------+
               || automated syllabification
               \/
  +-------------------------+
  |     working set of      |
  | hyphenation definitions |
  +-------------------------+
               || manual review and
               \/ automated validation
      +-----------------+
      | +-------------+ |
      | | HYPHENATION | |
      | | DEFINITIONS | |
      | +-------------+ |
      +-----------------+
               || preprocessing by
               \/ hyphenation algorithm
    +----------------------+
    | hyphenation patterns |
    | to ship in software  |
    +----------------------+
               || real-time use of 
               \/ hyphenation algorithm
      +------------------+
      |  automatically   |
      | hyphenated text  |
      +------------------+
]]></artwork>
            </figure>
            <t>This standard caters to the following two functional requirements.
            <list style="symbols">
                <t>As an editor (i.e. person) I want to document hyphenation points in a word list for a certain language or dialect by means of hyphenation definitions.</t>
                <t>As a hyphenation algorithm preprocessor (i.e. software application) I want to retrieve hyphenation points from hyphenation definitions in order to generate hyphenation patterns for a certain language or dialect.</t>
            </list>
            Both cases are a part of the process to provide automated hyphenation of text in software applications.</t>
        </section>
        <section anchor="applications" title="Applications that hyphenate">
            <t>Improving automated hyphenation affects all software applications depending on it. To indicate the impact of a change it is important to list affected products and organisations. The following applications currently use hyphenation patterns which originate from patgen:
            <list style="symbols">
                <t>document preparation systems based on TeX
                <list style="symbols">
                    <t>Babel - TeX's and LaTeX's multilingual typesetting</t>
                    <t>polyglossia - XeLaTeX's and LuaLaTeX's multilingual typesetting</t>
                </list>
                </t>
                <t>hyphenation and justification with libhyphen
                <list style="symbols">
                    <t>LibreOffice - The Document Foundation's office suite</t>
                    <t>Apache OpenOffice - Apache Software Foundation's office suite</t>
                    <t>Inkscape* - a vector graphics editor</t>
                    <t>GIMP - a raster graphics editor</t>
                    <t>Scribus - desktop publishing software</t>
                    <t>InDesign - Adobe's desktop publishing software</t>
                    <t>Illustrator - Adobe's vector graphics editor</t>
                </list>
                </t>
                <t>client-side hyphenation in JavaScript with hyphenator.js</t>
                <t>layout engines for rendering web pages
                <list style="symbols">
                    <t>Gecko by Mozilla
                    <list style="symbols">
                        <t>Firefox - Mozilla's web browser</t>
                        <t>Thunderbird - Mozilla's e-mail and news client</t>
                        <t>Firefox for mobile - Mozilla's web browser for Android</t>
                    </list>
                    </t>
                    <t>WebKit by Apple and Adobe
                    <list style="symbols">
                        <t>Safari - Apple's web browser</t>
                        <t>Konqueror - KDE's web browser and file manager</t>
                    </list>
                    </t>
                    <t>Blink by Google
                    <list style="symbols">
                        <t>Chromium and Chrome - Google's web browsers</t>
                        <t>Opera - Opera's web browser</t>
                        <t>Web Browser - Google's default web browser for Android</t>
                    </list>
                    </t>
                </list>
                </t>
            </list>
             * Implementation of automated hyphenation for Inkscape is planned for the near future.</t>
             <t>This overview does not endorse or favour the use of any of these applications and respects registered trademarks where applicable. It is merely included to illustrate the wide spectrum of applications employing hyphenation patterns.</t>
        </section>
    </section>
    <section anchor="basic" title="Basic format">
        <t>This section describes the basic format for hyphenation definitions. These are usually stored in computer files, but they can also reside in databases or memory. The structure is described step by step, extending the grammar for this format and illustrating its usage with examples. The syntax of all examples complies with the grammar of this format.</t>
        <section anchor="main_structure" title="Main structure">
            <t>In order to support as many languages as possible, this format for hyphenation definitions MUST use Unicode characters in UTF-8 encoding. A set of hyphenation definitions MAY have one or more lines. Each line MAY have, in the following order:
            <list style="numbers">
                <t>a hyphenation definition,</t>
                <t>white space,</t>
                <t>and/or comments.</t>
            </list>
            This is the top-level or main structure of the entire format. The syntax for hyphenation definitions in Extended Backus-Naur Form (EBNF) is therefore:</t>
            <figure>
<artwork><![CDATA[HyphenationDefinitions
         ::= ( EOL* HyphenationDefinition? WhiteSpace? Comment? )*
]]></artwork>
            </figure>
            <t>Here EOL stands for an end of line. An end of line MUST have a LINE FEED (LF) or U+000A and MAY have a CARRIAGE RETURN (CR) or U+000D. This is written in EBNF as:</t>
            <figure>
<artwork><![CDATA[EOL
         ::= ( '\r' | #x000D )? ( '\n' | #x000A )
]]></artwork>
            </figure>

            <t>White space can be inserted to improve human readability of hyphenation definitions but is OPTIONAL. When used, it SHALL contain only SPACE U+0020 or CHARACTER TABULATION U+0009 characters. White space in EBNF is:</t>
            <figure>
<artwork><![CDATA[WhiteSpace
         ::= ( ( ' ' | #x0020 )
             | ( '\t' | #x0009 ) )+
]]></artwork>
            </figure>
            <t>A comment MUST start with a NUMBER SIGN U+0023 or '#' and MAY contain any combination of printable characters thereafter. Comments MUST NOT contain control characters that can result in an end of line; however, the CHARACTER TABULATION U+0009 MAY be used in comments. In EBNF a comment is:</t>
            <figure>
<artwork><![CDATA[Comment
         ::= '#' ( [#x0009]
                 | [#x0020-#xD7FF]
                 | [#xE000-#xFFFD]
                 | [#x10000-#x10FFFF] )*
]]></artwork>
            </figure>
            <t>Note that the allowed range of characters needs to be fine-tuned later on. It needs to exclude more non-characters according to section 16.7, Noncharacters, of <xref target="UNICODE">Unicode</xref>. The range U+0080 to U+009F, at least, is a candidate for exclusion, both here and in the character range defined for <xref target="general_hyphenation_definition">hyphenation definitions in general</xref>.</t>
<!-- OLD section title="Comments and whitespace">It is possible to have comments, prefixed with percent (#), after each definition. It is also possible that an entire line is regarded as comments. Note that comments can be preceded by whitespace in terms of spaces or tabs. This example shows the use of comments and empty lines:</section-->
            <t>With the definition of the main structure, it is possible to store data in this format even without any actual hyphenation definition. An example with ends of lines, white space and comments is:</t>
            <figure>
<artwork><![CDATA[# This is the first line with only a comment

# This is the third line after an empty second line.
        ## After some whitespace, this is the fourth line.   # #
# Comments can use most reserved characters, e.g. {}[]/|~=.; #
# and Unicode orthographies, e.g.
# ру́сский
# язы́к,
# język polski and
# ελληνική
# γλώσσα
]]></artwork>
            </figure>
            <t>This completes the description of the main structure, which is processed in a line-by-line fashion.</t>
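            <t>The line-by-line processing of the main structure can be illustrated with a minimal Python sketch. It is not part of this standard. The names split_line and read_lines are hypothetical, and the sketch assumes an end of line consisting of an optional CARRIAGE RETURN followed by a LINE FEED, as described in the text above. It separates the candidate definition from the comment without validating the definition itself.</t>
            <figure>
<artwork><![CDATA[
```python
import re

# End of line as described in the prose: an optional CARRIAGE RETURN
# followed by a mandatory LINE FEED.
EOL = re.compile(r"\r?\n")


def split_line(line):
    """Split one line into (definition_text, comment).

    The comment starts at the first NUMBER SIGN, which is reserved and
    therefore cannot occur inside a word or a definition. SPACE and
    CHARACTER TABULATION around the definition are dropped. A part that
    is absent is returned as None.
    """
    head, sep, tail = line.partition("#")
    definition = head.strip(" \t") or None
    comment = "#" + tail if sep else None
    return definition, comment


def read_lines(text):
    """Yield (definition_text, comment) for every line in text."""
    for line in EOL.split(text):
        yield split_line(line)
```
]]></artwork>
            </figure>
            <t>An empty line and a line holding only white space both yield a pair without definition and without comment, so a caller can simply skip them.</t>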
        </section>
        <section anchor="general_hyphenation_definition" title="Hyphenation definition in general">
            <t>A hyphenation definition is the essential part of this format and MUST have, in this order:
            <list style="numbers">
                <t>a word,</t>
                <t>a delimiter,</t>
                <t>and a definition.</t>
            </list>
            This is where the actual hyphenation definition is provided for a word. A word is REQUIRED to be unique amongst all definitions in a single file because it is the unique key for looking up a hyphenation definition. A hyphenation definition in EBNF is written as:</t>
            <figure>
<artwork><![CDATA[HyphenationDefinition
         ::= Word Delimiter Definition
]]></artwork>
            </figure>
            <t>The delimiter MUST be a SEMICOLON ';' or U+003B. In EBNF this is:</t>
            <figure>
<artwork><![CDATA[Delimiter
         ::= ';' | #x003B
]]></artwork>
            </figure>
            <t>A word MUST be a concatenation of at least two characters:</t>
            <figure>
<artwork><![CDATA[Word
         ::= Character Character+
]]></artwork>
            </figure>
            <t>Most Western languages require a word to have a minimum of four characters before it is considered a candidate for hyphenation. When hyphenating, these languages require a minimum of two characters before and after the hyphenation point. The hyphenation character inserted is usually a HYPHEN-MINUS U+002D or '-'. However, some languages have a lexicography with a different set of rules for hyphenation.</t>
            <t>Modern Greek, for example, allows hyphenation directly after a single-character prefix. Another counterexample is the Ge'ez language. It uses an ETHIOPIC WORDSPACE or U+1361 to separate words. This language has no need for a hyphen character at the end of a line, because no ambiguity can arise as to whether a word ends at the end of a line or not. This allows for hyphenation of a single character at the end of a word.</t>
            <t>For these reasons, this format allows hyphenation definitions for words with a minimum of two characters. It is up to the user to enforce stricter rules for a greater minimum word length if needed. Such rules are parameters of the hyphenation algorithm preprocessor, which can ignore words that are too short.</t>
            <t>A character in a word MUST be a printable character. It MUST NOT be a control character such as LINE FEED or CHARACTER TABULATION, and it MUST NOT be a reserved character such as SPACE U+0020 ' ' or NUMBER SIGN U+0023 '#', as discussed above. Without going into the details of the other reserved characters, the definition of a character in EBNF is:</t>
            <figure>
<artwork><![CDATA[Character
         ::= [#x0021-#x0022]
           | [#x0024-#x002D]
           | [#x0030-#x003A]
           | [#x003C]
           | [#x003E-#x005A]
           | [#x005C]
           | [#x005E]
           | [#x0060-#x007A]
           | [#x007F-#x00A5]
           | [#x00A7-#xD7FF]
           | [#xE000-#xFFFD]
           | [#x10000-#x10FFFF]
]]></artwork>
            </figure>
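            <t>As an illustration only, the Character rule can be transcribed into a small Python check. The names CHARACTER_RANGES, is_character and is_word are hypothetical. The ranges are copied from the rule exactly as written, so the sketch inherits its current limitations, such as accepting the range U+0080 to U+009F.</t>
            <figure>
<artwork><![CDATA[
```python
# Inclusive code point ranges copied from the Character rule.
CHARACTER_RANGES = (
    (0x0021, 0x0022), (0x0024, 0x002D), (0x0030, 0x003A),
    (0x003C, 0x003C), (0x003E, 0x005A), (0x005C, 0x005C),
    (0x005E, 0x005E), (0x0060, 0x007A), (0x007F, 0x00A5),
    (0x00A7, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
)


def is_character(ch):
    """True when the single character ch matches the Character rule."""
    code = ord(ch)
    return any(low <= code <= high for low, high in CHARACTER_RANGES)


def is_word(text):
    """True when text matches the Word rule: two or more characters."""
    return len(text) >= 2 and all(is_character(ch) for ch in text)
```
]]></artwork>
            </figure>
            <t>With this check, is_word("V-snaar") holds because HYPHEN-MINUS is an ordinary character, whereas any text containing a reserved character such as ';' or '~' is rejected.</t>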
            <t>Instead of providing a hyphenation definition, it is possible to repeat the word after the delimiter without providing any hyphenation information. The grammar rule for definition allows this. A hyphenation definition repeating the word means that this word SHALL NOT be hyphenated at all. A hyphenation definition MAY be given, but when none is provided for a certain word, hyphenation for that word is undefined. Some very short examples in the format as described so far are:</t>
            <figure>
<artwork><![CDATA[# too short English words not allowed to be hyphenated
#a;a
#at;at
#are;are # too short for hyphenation according to the language

# English words not to be hyphenated
door;door
eight;eight

# German words not to be hyphenated
amorph;amorph
schnarchst;schnarchst

# Dutch words not to be hyphenated
schrijft;schrijft
V-snaar;V-snaar # note that '-' is considered a normal character

# acronyms not to be hyphenated
UNESCO;UNESCO
unicef;unicef

# hyphenation is undefined when no hyphenation definition is given
#impeachment;impeachment
]]></artwork>
            </figure>
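            <t>The role of the word as a unique key can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names parse_definitions and is_unhyphenatable are hypothetical, the input is assumed to be already split into lines, and neither the characters of the word nor the content of the definition are validated.</t>
            <figure>
<artwork><![CDATA[
```python
def parse_definitions(lines):
    """Return a dict that maps each word to its definition text."""
    definitions = {}
    for number, line in enumerate(lines, start=1):
        # Drop the comment, then surrounding SPACE and TAB characters.
        entry = line.partition("#")[0].strip(" \t")
        if not entry:
            continue  # empty line or line with only a comment
        word, delimiter, definition = entry.partition(";")
        if not delimiter or len(word) < 2 or not definition:
            raise ValueError(f"line {number}: malformed entry {entry!r}")
        if word in definitions:
            raise ValueError(f"line {number}: duplicate word {word!r}")
        definitions[word] = definition
    return definitions


def is_unhyphenatable(word, definitions):
    """True when the definition merely repeats the word."""
    return definitions.get(word) == word
```
]]></artwork>
            </figure>
            <t>A word that is absent from the resulting dictionary, for example because its line is commented out, has undefined hyphenation, which is different from a word whose definition repeats the word.</t>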
        </section>
        <section anchor="word" title="Hyphenation definition for a word">
            <t>A hyphenation definition in its simplest form MUST contain two or more clusters of characters that are separated by a hyphenation point. Combined with the previous description of preventing hyphenation by repeating the word, the EBNF grammar rule for definition is:</t>
            <figure>
<artwork><![CDATA[Definition
         ::= Cluster ( Hyphen Cluster )*
]]></artwork>
            </figure>
             <t>A character cluster here MUST consist of at least one character. This basic form is already supported by the current hyphenation algorithm and is key to the concept of hyphenation. More intricate schemes of clusters and hyphenations will be discussed later on, but are already referred to in the following EBNF bridging from cluster to character clusters:</t>
            <figure>
<artwork><![CDATA[Cluster
         ::= ( CharacterCluster
             | SubstitutionCluster
             | HomographCluster )+
CharacterCluster
         ::= Character+
]]></artwork>
            </figure>
            <t>The concatenation of different clusters only applies in combination with a substitution cluster or a homograph cluster, as will be demonstrated later on. This is because consecutive character clusters have the same syntax as a single character cluster. They are merely more characters added in the same way and therefore MUST NOT be regarded as separate character clusters.</t>

            <t>The final construct required to allow for simple hyphenation definitions is a reserved character to separate the clusters of characters, which are also known as morphemes. Here one or more TILDE characters '~' or U+007E MUST be used as a morpheme hyphen. The following rules already allow for more intricate hyphenation, which is discussed later on; the morpheme hyphen itself is:</t>
            <figure>
<artwork><![CDATA[Hyphen
         ::= MorphemeHyphen
           | SuffixHyphen
           | PrefixHyphen
           | CompoundHyphen
           | CompoundSuffixHyphen
           | CompoundPrefixHyphen
           | UnfavourableHyphen
MorphemeHyphen
         ::= ( '~' | #x007E )+
]]></artwork>
            </figure>
            <t>Some simple examples of hyphenation definitions for words are:</t>
            <figure>
<artwork><![CDATA[# English word with hyphenation definition
revolve;re~volve # "volve" may not be hyphenated
editor;ed~i~tor # character cluster of single character

# German words with hyphenation definition
Aale;Aa~le # possible hyphenation is "Aa-" "le"
kühle;küh~le # possible hyphenation is "küh-" "le"

# Dutch words with hyphenation definition
alle;al~le # possible hyphenation is "al-" "le"
gezellig;ge~zel~lig # "ge-" "zellig" or "gezel-" "lig"

# Polish word with uncommon hyphenation definition
kung-fu;kung~-fu # possible is "kung-" "-fu"

# Modern Greek
# note hyphenation directly after one character
#άτακτος;
#ά~τα~κτος
]]></artwork>
            </figure>
            <t>Up to this point, the functionality is similar to that of the previous format for hyphenation patterns as used by patgen2. Everything described from this point onward is newly proposed functionality.</t>
            <t>A hyphenation point SHALL be defined by one or more tildes. A hyphenation point of higher priority MUST have at least one additional tilde compared to lower priority hyphenation points. Some examples to illustrate prioritised hyphenation definitions in words are:</t>
            <figure>
<artwork><![CDATA[# English words with prioritised hyphenation
ergonomic;er~go~~no~mic # because of (er + go) + (no + mic)
thesauruses;the~sau~~rus~es

# French words with prioritised hyphenation
portemonnaie;por~te~~mon~naie # because of (por + te) + (mon + naie)
atmosphère;at~mo~~sphè~re # because of (at + mo) + (sphè + re)
]]></artwork>
            </figure>
            <t>The structure of the words is broken down in the comments with the use of brackets '(' and ')' and plus sign '+'. This is a form of syllabification that reflects semantic information. It is not a part of the format but is only used to explain the examples of the format.</t>
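            <t>How prioritised hyphenation points could be read from a definition is sketched below in Python. The function name morpheme_points is hypothetical, and the sketch only covers definitions that use tildes. It assumes that the priority of a hyphenation point is simply the number of consecutive tildes.</t>
            <figure>
<artwork><![CDATA[
```python
import re

TILDES = re.compile(r"~+")


def morpheme_points(definition):
    """Return (word, points) for a definition that only uses tildes.

    points is a list of (offset, priority) pairs. The offset is the
    index in the reconstructed word before which a break may occur and
    the priority is the number of consecutive tildes at that point.
    """
    word = ""
    points = []
    position = 0
    for match in TILDES.finditer(definition):
        word += definition[position:match.start()]
        points.append((len(word), len(match.group())))
        position = match.end()
    word += definition[position:]
    return word, points
```
]]></artwork>
            </figure>
            <t>For the definition er~go~~no~mic this yields the word ergonomic with points at offsets 2, 4 and 6, of which the point at offset 4 has priority 2 and the others have priority 1.</t>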
        </section>
        <section anchor="word_prefix" title="Hyphenation definition for a word prefix">
            <t>Many languages allow usage of a prefix to alter the meaning of a word. Here a VERTICAL LINE U+007C or '|' MAY be used to indicate a hyphenation point for a prefix. This enables reuse of the hyphenation definition of the word. Hyphenation directly after a prefix has a slightly higher priority than a normal hyphenation point. Prefixes are semantically built from right to left for a left-to-right script. Therefore, priority amongst prefixes runs from left to right for a left-to-right script. Syntax for defining hyphenation of a prefix should comply with the following EBNF:</t>
            <figure>
<artwork><![CDATA[PrefixHyphen
         ::= '|' | #x007C
]]></artwork>
            </figure>
            <t>Some examples of hyphenation definitions including a prefix are:</t>
            <figure>
<artwork><![CDATA[# English words with prefix
# dis < ap + pear
disappear;dis|ap~pear
# su + pra < or + bit + al
supraorbital;su~pra|or~bit~al

# German words with prefix
# ent < deckt [discovered]
entdeckt;ent|deckt
# Re < kon < struk + ti + on [reconstruction]
Rekonstruktion;Re|kon|struk~ti~on

# Dutch words with prefix
# ge < wil + lig [willing]
gewillig;ge|wil~lig
# her < be < re + ke + nen [to recalculate]
herberekenen;her|be|re~ke~nen
]]></artwork>
            </figure>
            <t>In the comments, the prefixes are indicated with a less-than sign, which is evaluated before the plus sign. Sometimes the comments on examples provide the meaning of the word between square brackets '[' and ']'. These translations help in understanding examples from languages other than English, but are not part of this standard.</t>
        </section>
        <section anchor="word_suffix" title="Hyphenation definition for a word suffix">
        <t>A suffix can be identified in a similar way as is done for <xref target="word_prefix">prefixes</xref>. Instead of a vertical line a BROKEN BAR U+00A6 or '¦' MAY be used for suffixes. In EBNF this is:</t>
            <figure>
<artwork><![CDATA[SuffixHyphen
         ::= '¦' | #x00A6
]]></artwork>
            </figure>
        <t>Some examples are:</t>
            <figure>
<artwork><![CDATA[# English words with suffix
# broth + er > hood
brotherhood;broth~er¦hood
# re + morse > less > ness
remorselessness;re~morse¦less¦ness

# German word with suffix
# wahr + schein > lich [probably]
wahrscheinlich;wahr=schein¦lich
# Un < si + cher > heit [uncertainty]
Unsicherheit;Un|si~cher¦heit

# Dutch words with suffix
# een > zaam > heid [loneliness]
eenzaamheid;een¦zaam¦heid
# beest > ach + tig [beastly]
beestachtig;beest¦ach~tig
]]></artwork>
            </figure>
            <t>The comments use a greater-than sign to explain the structure where suffixes build from left to right, gaining priority in this way for a left-to-right script. A hyphenation point for a suffix has priority over hyphenation on a prefix.</t>
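            <t>A simple consistency check for the basic format is sketched below in Python: after removing all hyphens, a definition has to spell its word again. The names BASIC_HYPHEN, KINDS, hyphen_kinds and spells_word are hypothetical, and the sketch only knows the morpheme, prefix and suffix hyphens of the basic format.</t>
            <figure>
<artwork><![CDATA[
```python
import re

# Hyphens of the basic format: one or more tildes, a VERTICAL LINE
# or a BROKEN BAR (U+00A6).
BASIC_HYPHEN = re.compile(r"~+|\||\u00A6")
KINDS = {"~": "morpheme", "|": "prefix", "\u00A6": "suffix"}


def hyphen_kinds(definition):
    """List the kind of every hyphen in a definition, left to right."""
    return [KINDS[marker[0]] for marker in BASIC_HYPHEN.findall(definition)]


def spells_word(entry):
    """True when the definition, stripped of hyphens, equals the word."""
    word, _, definition = entry.partition(";")
    return BASIC_HYPHEN.sub("", definition) == word
```
]]></artwork>
            </figure>
            <t>A check like this can be part of the automated validation of a working set of hyphenation definitions, as it catches entries in which the word and its definition have drifted apart.</t>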
        </section>
    </section>
    <section anchor="extended" title="Extended format">
        <section anchor="compound" title="Hyphenation definition for a compound">
            <t>Many languages can concatenate words to form long compounds. Some real-life examples from Western languages are:</t>
            <figure>
<artwork><![CDATA[# long compound without spaces in German
#Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz

# long compound without spaces in Dutch
#aansprakelijkheidswaardevaststellingsveranderingen

# long compound without spaces in Hungarian
#megszentségteleníthetetlenségeskedéseitekért

# long compound without spaces in English
#pneumonoultramicroscopicsilicovolcanoconiosis
]]></artwork>
            </figure>
            <t>These are extreme cases, but it is also possible in English, for example, to concatenate words into long compounds. This is less common, as English compounds are usually written with spaces, so hyphenation is less problematic in those cases.</t>
            <t>Hyphenation definitions of compounds should be made with a different reserved character. The EQUALS SIGN U+003D or '=' MUST be used to indicate hyphenation at compound level. This prevents long series of tildes in complex compounds and allows automated generation, suggestion or validation of hyphenation patterns for compounds. In EBNF, this is:</t>
            <figure>
<artwork><![CDATA[CompoundHyphen
         ::= ( '=' | #x003D )+
]]></artwork>
            </figure>
            <t>Examples of hyphenation definitions for compounds are:</t> 
            <figure>
<artwork><![CDATA[# English compounds
# small + talk
smalltalk;small=talk
# (bit + ter) + sweet
bittersweet;bit~ter=sweet

# German compounds
# Grenz + schutz + amt [border patrol office]
Grenzschutzamt;Grenz=schutz=amt
# Herz + still + stand [cardiac arrest]
Herzstillstand;Herz=still=stand

# Dutch compounds
# boek + (om + slag) [book cover]
boekomslag;boek=om~slag 
# trein + (wa + gon) [train carriage]
treinwagon;trein=wa~gon
]]></artwork>
            </figure>
            <t>A hyphenation point for a compound SHALL be defined by one or more equals signs. A hyphenation point of higher priority MUST have at least one additional equals sign compared to lower priority hyphenation points for compounds. This is similar to hyphenation point priorities in definitions for <xref target="word">words</xref>. Some examples to illustrate prioritised hyphenation definitions in compounds are:</t>
            <figure>
<artwork><![CDATA[# German
# Erb + (lehn + gut) [lit: inherited loan property]
Erblehngut;Erb==lehn=gut
# Fach + (werk + statt) [crafts workshop]
Fachwerkstatt;Fach==werk=statt
# Berg + ((fünf + (fin + ger)) + kraut)
# [lit: mountain five-finger herb]
Bergfünffingerkraut;Berg===fünf=fin~ger==kraut
# (See + (schiff + fahrt)) + (stra + ße)
# [sea traffic shipping lane]
Seeschifffahrtstraße;See==schiff=fahrt===stra~ße

# Dutch
# ((goe + de + ren) + trein) + (wa + gon)
# [cargo train carriage]
goederentreinwagon;goe~de~ren=trein==wa~gon
]]></artwork>
            </figure>
            <t>A hyphenation point for a compound MUST be treated with higher priority than that of a suffix.</t>
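            <t>The bracketed structure shown in the comments of the compound examples can be recovered from the priorities. The Python sketch below is an illustration only. The function name split_compound is hypothetical, and it assumes that a definition is split first at its longest run of equals signs and then recursively at shorter runs. Tildes and any other hyphens are left untouched inside the parts.</t>
            <figure>
<artwork><![CDATA[
```python
import re


def split_compound(definition):
    """Recursively split a definition at its highest-priority '=' run.

    Returns the definition itself when it holds no equals sign, and
    otherwise a list with the recursively split parts.
    """
    runs = re.findall(r"=+", definition)
    if not runs:
        return definition
    separator = max(runs, key=len)
    # Split only at runs of exactly this length, not inside longer runs.
    pattern = r"(?<!=)" + re.escape(separator) + r"(?!=)"
    return [split_compound(part) for part in re.split(pattern, definition)]
```
]]></artwork>
            </figure>
            <t>For the definition Berg===fünf=fin~ger==kraut this returns Berg next to a nested list that first groups fünf with fin~ger and then adds kraut, which mirrors the brackets in the comment of that example.</t>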
        </section>
        <section anchor="compound_prefix" title="Hyphenation definition for a compound prefix">
            <t>Compounds can also have a prefix. These are defined in a similar way as a <xref target="word_prefix">prefix of a word</xref>. A combination of a VERTICAL LINE U+007C or '|' followed directly by one or more EQUALS SIGN U+003D or '=' MAY be used to indicate a prefix of a compound. In EBNF, this is:</t>
            <figure>
<artwork><![CDATA[CompoundPrefixHyphen
         ::= ( '|' | #x007C ) ( '=' | #x003D )+
]]></artwork>
            </figure>
            <t>Examples are:</t>
            <figure>
<artwork><![CDATA[# German compounds with prefix
# un < wahr + (schein + lich) [unlikely]
unwahrscheinlich;un|=wahr=schein~lich
# Ur < groß + (el + tern) [great-grandparents]
Urgroßeltern;Ur|=groß=el~tern

# Dutch compound with prefix
# on < waar + (schijn + lijk) [unlikely]
onwaarschijnlijk;on|=waar=schijn~lijk
]]></artwork>
            </figure>
            <t>Here the number of equals signs matches the number of equals signs of the compound hyphenation point that this prefix is related to. Compound prefixes are extended from right to left and prioritised from left to right for a left-to-right script.</t>
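            <t>When reading such definitions, the vertical line and the equals signs that follow it need to be kept together as one hyphenation point. A minimal tokeniser sketch in Python is shown below. The regular expression and the function name are assumptions for illustration and are not part of this standard. Only the tilde, vertical line and equals sign are handled.</t>
            <figure>
<artwork><![CDATA[import re

# Longest alternative first, so that '|=' is not read as
# a word prefix '|' followed by a compound hyphen '='.
TOKEN = re.compile(r"\|=+|=+|\||~+|[^|=~]+")

def tokens(definition):
    """Split a definition into character clusters and marks."""
    return TOKEN.findall(definition)

print(tokens("un|=wahr=schein~lich"))
# ['un', '|=', 'wahr', '=', 'schein', '~', 'lich']
]]></artwork>
            </figure>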
        </section>
        <section anchor="compound_suffix" title="Hyphenation definition for a compound suffix">
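            <t>Compounds can also have a suffix. These are defined in a similar way as a <xref target="word_suffix">suffix of a word</xref>. A combination of one or more EQUALS SIGN U+003D or '=' followed directly by a BROKEN BAR U+00A6 or '¦' MAY be used to indicate a suffix of a compound. In EBNF, this is:</t>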
            <figure>
<artwork><![CDATA[CompoundSuffixHyphen
         ::= ( '=' | #x003D )+ ( '¦' | #x00A6 )
]]></artwork>
            </figure>
            <t>Examples are rare, but some are given below:</t>
            <figure>
<artwork><![CDATA[# German compounds with suffix
# (an + dert) + halb > fach
anderthalbfach;an~dert=halb=¦fach
# zier + rat > lo + se
zierratlose;zier=rat=¦lo~se
# (zu < sam + men) + hang > los
zusammenhanglos;zu|sam~men=hang=¦los

# Dutch compounds with suffix
# (li + te + ra + tuur) + (we + ten > schap) > je
# [lit: diminutive of literature science]
# it is lexicologically the diminutive of science
# but semantically diminutive of the entire compound
literatuurwetenschapje;li~te~ra~tuur=we~ten¦schap=¦je
# (on < (sa + men) + (han + gend)) > heid
# [incoherentness]
onsamenhangendheid;on|sa~men=han~gend=¦heid
]]></artwork>
            </figure>
            <t>The number of equals signs is the same as the number of equals signs of the compound hyphenation point this suffix is related to. This is similar to <xref target="word_suffix">word suffix</xref> and <xref target="compound_prefix">compound prefix</xref>. Compound suffixes are extended from left to right and are prioritised from right to left for a left-to-right script, although nested compound suffixes will be extremely rare.</t>
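            <t>The following Python sketch separates a compound suffix from the compound it belongs to, which could be useful when validating the base compound on its own. The helper name is hypothetical, and the sketch assumes that the last compound suffix point in the definition is the outermost one.</t>
            <figure>
<artwork><![CDATA[import re

# One or more '=' directly followed by BROKEN BAR (U+00A6).
SUFFIX_POINT = re.compile("=+\u00a6")

def split_compound_suffix(definition):
    """Return (base, suffix); suffix is None if there is none."""
    points = list(SUFFIX_POINT.finditer(definition))
    if not points:
        return definition, None
    last = points[-1]
    return definition[:last.start()], definition[last.end():]

print(split_compound_suffix("an~dert=halb=¦fach"))
# ('an~dert=halb', 'fach')
]]></artwork>
            </figure>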
        </section>
        <section anchor="compound_interfix" title="Hyphenation definition for a compound interfix">
            <t>With the format for hyphenation definitions described up to this point, it is possible to write hyphenation definitions for compounds, even if they have an interfix. Interfixes are common in some languages as a linking element in compounds. They usually do not have a semantic function but rather one of aiding pronunciation. Hyphenation has no special requirements to indicate interfixes. However, it is useful to annotate interfixes, enabling identification of the separate words from which the compound has been formed. In this way the hyphenation definition of the compound can be automatically generated, suggested or validated. In addition, this information could be used for decomposition to validate and extend spell checking.</t>
            <t>There are no grammar rules for this at the moment, because this part of the format is still under discussion. The characters used in the following example are the LESS-THAN SIGN U+003C and GREATER-THAN SIGN U+003E, which could become reserved characters in the future. Interfix annotations can simply be filtered out before hyphenation patterns are used as input to the hyphenation algorithm.</t>
            <figure>
<artwork><![CDATA[# German interfix
# (Arbeit + s) + zimmer [working room]
Arbeitszimmer;Ar~beits=zim~mer # could be Ar~beit<s>=zim~mer

# Dutch interfix
# (kip + (p + en)) + soep [chicken soup]
kippensoep;kip~.pen=soep
# could be;kip<~.pen>=soep
# ((be + roep) + s) + ethiek [professional ethics]
beroepsethiek;be~roeps=ethiek
# could be   ;be~roep<s>=ethiek
# (Koningin + (n + e)) + dag [Queen's Day]
Koninginnedag;Ko~nin~gin~ne=dag
# could be   ;Ko~nin~gin<~ne>=dag

# Croatian interfix
# (brod + o) + gradilište [shipyard]
brodogradilište;brodo=gradilište
# could be     ;brod<o>=gradilište
]]></artwork>
            </figure>
            <t>Note that this should not be used if the word preceding the interfix has changed spelling because of its usage in the compound with an interfix.</t>
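            <t>Filtering out the tentative interfix annotations can be as simple as removing the two annotation characters. The Python sketch below assumes that the LESS-THAN SIGN and GREATER-THAN SIGN are used only for interfix annotation, as in the examples above. The function name is an assumption for illustration.</t>
            <figure>
<artwork><![CDATA[def strip_interfix_marks(definition):
    """Remove the tentative '<' and '>' interfix annotations."""
    return definition.replace("<", "").replace(">", "")

print(strip_interfix_marks("Ar~beit<s>=zim~mer"))
# Ar~beits=zim~mer
print(strip_interfix_marks("kip<~.pen>=soep"))
# kip~.pen=soep
]]></artwork>
            </figure>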
        </section>
        <section anchor="unfavourable" title="Unfavourable hyphenation">
            <t>Sometimes hyphenations can be misleading or distorting and are therefore unfavourable. This MUST be indicated by a FULL STOP U+002E or '.'. More than one full stop MAY be used to indicate hyphenation points which are extremely unfavourable. An unfavourable hyphenation point MAY be preceded by a hyphenation character to indicate the type of hyphenation point. In EBNF, this can be written as:</t>
            <figure>
<artwork><![CDATA[UnfavourableHyphen
         ::= ( ( '~' | #x007E )
             | ( '|' | #x007C )
             | ( '¦' | #x00A6 )
             | ( '=' | #x003D ) )?
             ( '.' | #x002E )+
]]></artwork>
            </figure>         
            <t>Some examples of unfavourable hyphenation are:</t>
            <figure>
<artwork><![CDATA[# unfavourable hyphenation in German
# dem + (ent + (spre + chend)) [accordingly]
dementsprechend;dem=ent|.spre~chend
# re + (in + (stal + liert)) [reinstalled]
reinstalliert;re|in|.stal~liert
# Sprech + (er + (zie + hung)) [elocution]
Sprecherziehung;Sprech=er|.zie~hung
# (Wind + (en + er + gie) + (an + (la + ge)))
# [wind-energy plant]
Windenergieanlage;Wind=en.er~gie==an|la~ge
# Ost + (en + de)
# [toponym of a place in Belgium]
Ostende;Ost=en~.de

# unfavourable hyphenation in Dutch
# (deur + waar + ders) + (ex + ploit) [lit: bailiff abuse]
deurwaardersexploit;deur~waar~ders=ex~..ploit
# (Koningin + (n + e)) + dag [Queen's Day]
Koninginnedag;Ko~nin~gin~.ne=dag
# could be   ;Ko~nin~gin<~.ne>=dag
]]></artwork>
            </figure>
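            <t>To illustrate the grammar rule above, the following Python sketch lists the unfavourable hyphenation points in a definition together with the number of full stops. How a hyphenator weighs that number is outside this example. The names used are assumptions for illustration.</t>
            <figure>
<artwork><![CDATA[import re

# Optional type mark ('~', '|', U+00A6 or '='), then full stops.
POINT = re.compile(r"([~|\u00a6=]?)(\.+)")

def unfavourable_points(definition):
    """Return (type mark, number of full stops) per point."""
    return [(m.group(1), len(m.group(2)))
            for m in POINT.finditer(definition)]

print(unfavourable_points("deur~waar~ders=ex~..ploit"))
# [('~', 2)]
print(unfavourable_points("Ko~nin~gin~.ne=dag"))
# [('~', 1)]
]]></artwork>
            </figure>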
        </section>
    </section>
    <section anchor="dynamic-hyphenation" title="Dynamic hyphenation">
        <section anchor="altered_spelling" title="Hyphenation with altered spelling">
            <t>Hyphenation can result in a changed spelling of the word. How this affects a word depends on the language, as will be seen later on. A hyphenation definition of this type MUST contain both an unhyphenated and a hyphenated spelling for such a word. This is called a substitution cluster. It MUST contain only the particular hyphenation point and the adjacent character clusters that are altered.</t>
            <t>A substitution cluster MUST be provided between curly brackets LEFT CURLY BRACKET U+007B or '{' and RIGHT CURLY BRACKET U+007D or '}' with SOLIDUS U+002F or '/' as a separator. Left of the separator MUST be the unhyphenated spelling and on the right MUST be the hyphenated spelling. Examples later on will clarify this in detail. The exact rule in EBNF for this is:</t>
            <figure>
<artwork><![CDATA[SubstitutionCluster
         ::= '{' CharacterCluster '/'
               ( CharacterCluster ( Hyphen CharacterCluster? )?
               | Hyphen CharacterCluster? )
             '}'
]]></artwork>
            </figure>
            <t>Some languages have transforming digraphs when hyphenating. In German the 'c' and 'k' are orthographic allographs for /k/. The digraph 'ck' can result in 'k-k' when hyphenation is in the middle of that digraph. Examples of transforming digraphs with orthographic allographs are:</t>
            <figure>
<artwork><![CDATA[# German with altered spelling digraph
# "Zucker" or
# "Zuk-" "ker" [sugar]
Zucker;Zu{ck/k~k}er
]]></artwork>
            </figure>
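            <t>The two spellings held by a substitution cluster can be recovered mechanically. The Python sketch below picks the left or the right side of each cluster. It is a simplified illustration, not a reference implementation: it assumes clusters are not nested, it resolves all clusters in a definition the same way, and the function name is hypothetical.</t>
            <figure>
<artwork><![CDATA[import re

# '{' unhyphenated '/' hyphenated '}', no nesting assumed.
CLUSTER = re.compile(r"\{([^/{}]*)/([^/{}]*)\}")

def resolve(definition, broken):
    """Pick one side of every substitution cluster.

    broken=False keeps the unhyphenated spelling (left side),
    broken=True keeps the hyphenated spelling (right side).
    """
    side = r"\2" if broken else r"\1"
    return CLUSTER.sub(side, definition)

print(resolve("Zu{ck/k~k}er", broken=False))  # Zucker
print(resolve("Zu{ck/k~k}er", broken=True))   # Zuk~ker
]]></artwork>
            </figure>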
            <t>In German it is also possible to have doubling of consonants in digraphs when hyphenating. The digraph 'll' can initially be a shorter spelling of the trigraph 'lll', which itself is a concatenation of the digraph 'll' and a glyph 'l'. When hyphenation is in the first mentioned digraph, the previously eliminated 'l' should be restored. Examples of restoring eliminated consonants from trigraphs are:</t>
            <figure>
<artwork><![CDATA[# German with doubled consonant spelling
# "Ab-" "fallager" or
# "Abfall-" "lager" or
# "Abfalla-" "ger" [waste storage]
Abfallager;Ab~fa{ll/ll~l}a~ger
# "Stoffül-" "le" or
# "Stoff-" "fülle" [wealth of material]
Stoffülle;Sto{ff/ff=f}ül~le
# "Vollast" or
# "Voll-" "last" [maximum load, lit: full load]
Vollast;Vo{ll/ll=l}ast

# Norwegian with doubled consonant spelling
# "trykknapp" or "trykk-" "knapp" [snap fastener]
trykknapp;try{kk/kk=k}napp
# equivalent notation, less verbose but more searchable
#trykknapp;tryk{k/k=k}napp
]]></artwork>
            </figure>
            <t>Some languages have vowel doubling. This occurs when stress is on an open syllable and a suffix is added after that syllable. This happens, for example, in Dutch for some diminutive forms. When these diminutives are hyphenated after that syllable, the doubled vowel at the end of the open syllable is written as a single vowel again, since the syllable break already ensures proper pronunciation. Examples of stressed open syllables with doubled vowels are:</t>
            <figure>
<artwork><![CDATA[# Dutch vowel doubling in diminutive
# "omaatje" or
# "oma-" "tje" [granny] [diminutive of grandmother]
omaatje;oma{a/-}tje
# equivalent notation, more verbose but less searchable
#omaatje;om{aa/a-}tje
]]></artwork>
            </figure>
            <t>In Dutch, a diaeresis can be used on vowels to prevent the so-called vowel collision. However, when hyphenating before the vowel that received a diaeresis, that diaeresis will be eliminated in the hyphenated spelling. Examples of hyphenation definitions for eliminated diaeresis are:</t>
            <figure>
<artwork><![CDATA[# Dutch eliminated diaeresis
# "geëerd" or
# "ge-" "eerd" [honoured] [past participle]
geëerd;ge{ë/-e}erd
]]></artwork>
            </figure>
            <t>As stated before, a hyphen can be a valid character in a normal word. Hence, the hyphen character is not a reserved character in this context. When hyphenating at a hyphen that is already part of a word, a new hyphen MUST NOT be inserted in the hyphenated text. A rare counterexample was given in hyphenation of a <xref target="word">word</xref>. Below are more common examples in which a hyphen is not allowed to be duplicated:</t>
            <figure>
<artwork><![CDATA[# Dutch compounds with hyphen as character
# ex- < vriend [former boyfriend]
ex-vriend;ex{-/|}vriend
# (Dow- + Jones) + index [Dow Jones Index]
Dow-Jonesindex;Dow{-/~}Jones=index
# ((dé + jà)- + vu) + gevoel [déjà vu feeling]
déjà-vugevoel;dé~jà{-/~~}vu=ge~voel
# (gilles- + de- + la- + (tou + rette)) + (syn + droom)
# [Tourette syndrome]
#gilles-de-la-tourettesyndroom;
#gilles{-/~~}de{-/~}la{-/~~}tou~rette=syn~droom
# (ad + junct)- + ((al + ge + meen) + (di + rec + teur))
# [vice managing director]
adjunct-algemeendirecteur;ad~junct{-/==}al~ge~meen=di~rec~teur

# English compound with hyphen as character
# (ac + tor)- + (di + rec + tor)
actor-director;ac~tor{-/=}di~rec~tor
]]></artwork>
            </figure>
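            <t>The effect of these definitions on the rendered text can be sketched in Python as follows. The fragment breaks a word at the hyphenation point inside its substitution cluster and lets the hyphenator add exactly one hyphen, so that a hyphen belonging to the word is never duplicated. It assumes a single, non-nested cluster whose hyphenated side contains a reserved hyphenation character. All names are assumptions for illustration.</t>
            <figure>
<artwork><![CDATA[import re

MARK = "~|=\u00a6"  # tilde, vertical line, equals, broken bar
CLUSTER = re.compile(r"\{([^/{}]*)/([^/{}]*)\}")
POINT = re.compile("([^{0}]*)[{0}]+([^{0}]*)".format(MARK))

def plain(text):
    """Drop hyphenation marks, keeping only word characters."""
    return re.sub("[{0}.]".format(MARK), "", text)

def break_at_cluster(definition):
    """Return both line fragments when breaking in the cluster."""
    cluster = CLUSTER.search(definition)
    left, right = POINT.fullmatch(cluster.group(2)).groups()
    head = plain(definition[:cluster.start()]) + left
    tail = right + plain(definition[cluster.end():])
    return head + "-", tail  # the only hyphen that is added

print(break_at_cluster("ex{-/|}vriend"))
# ('ex-', 'vriend')
print(break_at_cluster("ac~tor{-/=}di~rec~tor"))
# ('actor-', 'director')
]]></artwork>
            </figure>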
        </section>
        <section anchor="homograph" title="Hyphenation of homographs">
            <t>A word with multiple meanings but with the same spelling is called a homograph. Some homographs differ in syllabification and pronunciation even though they are spelled with exactly the same characters. Examples in English are desert (to leave, or a barren area of land) and dove (a pigeon, or the past tense of to dive). A difference in pronunciation can result in different hyphenation points for each meaning of the homograph, which is more probable in German or Dutch than in, for example, English.</t>
            <t>When this is the case, the following homograph cluster MUST be used for the hyphenation definition. Here a LEFT SQUARE BRACKET U+005B or '[' and a RIGHT SQUARE BRACKET U+005D or ']' MUST be used to group alternatives inside a hyphenation definition. These MUST be separated by a SOLIDUS U+002F or '/'. In the following EBNF rules, only two alternatives are allowed. The order of the alternatives is not important. However, the grammar introduces a small difference between the left and right side of the separator. One side, and only one side, of the separator may be empty to accommodate certain definitions. Therefore, at least one side of the separator MUST hold a definition. In EBNF, this is:</t>
            <figure>
<artwork><![CDATA[Series
         ::= ( CharacterCluster (Hyphen CharacterCluster)* Hyphen? )
           | ( Hyphen (CharacterCluster Hyphen)* CharacterCluster? )
HomographCluster
         ::= '[' ( Series | (SubstitutionCluster Series? ) ) '/'
                 SubstitutionCluster? Series? ']'
]]></artwork>
            </figure>
            <t>The use of a nested substitution cluster will be described <xref target="nested">later on</xref>. Rare but valid examples with alternative hyphenation behaviour for homographs are:</t>
            <figure>
<artwork><![CDATA[# English homographs
# rec + ord [vinyl medium]
# re + cord [first-person present of verb to record]
record;re[~c/c~]ord
# wa + les [plural of whale] or
# Wales [toponym of part of UK]
wales;wa[~/]les

# German homographs
# Mas + ke or Maske
Maske;Mas[~/]ke
# Wach + (stu + be) [guardroom] or
# Wachs + (tu + be) [wax tube]
Wachstube;Wach[=s/s=]tu~be
# (Bahn + hof) + (strasse) [lit: station street] or
# (Bahn + hof) + s + (trasse) [lit: station's route]
Bahnhofstrasse;Bahn=hof[==stra~ss/s==tras~s]e

# Dutch homographs
# bal + le + tje [diminutive of ball] or
# bal + let + je [diminutive of ballet]
balletje;bal~le[~t/t~]je
# valk + uil [ninox, lit: falcon owl] or
# val + kuil [trapping pit, lit: trap pit]
valkuil;val[k=/=k]uil
]]></artwork>
            </figure>
            <t>Note that there is no preferred order for mirrored homograph clusters, but a fixed order could prove practical for automated processing such as validation.</t>
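            <t>A homograph cluster can be expanded into its two readings before further processing. The Python sketch below does this for a single cluster without a nested substitution cluster. The regular expression and the function name are assumptions for illustration and are not part of this standard.</t>
            <figure>
<artwork><![CDATA[import re

# '[' first reading '/' second reading ']', nothing nested.
HOMOGRAPH = re.compile(r"\[([^/\[\]]*)/([^/\[\]]*)\]")

def alternatives(definition):
    """Expand the first homograph cluster into both readings."""
    first = HOMOGRAPH.sub(r"\1", definition, count=1)
    second = HOMOGRAPH.sub(r"\2", definition, count=1)
    return first, second

print(alternatives("re[~c/c~]ord"))   # ('re~cord', 'rec~ord')
print(alternatives("wa[~/]les"))      # ('wa~les', 'wales')
print(alternatives("val[k=/=k]uil"))  # ('valk=uil', 'val=kuil')
]]></artwork>
            </figure>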
            <t>Automated hyphenation of homographs poses an interesting challenge. How can the hyphenator recognise which hyphenation pattern to use? This is out of scope for this standard but important to discuss. All other forms of hyphenation can be handled directly by a hyphenation algorithm, but here extra information is needed. This could be extracted from the context, but that can prove difficult if no context is available or the context is ambiguous. On the other hand, the author of a text could provide the needed information. This could be stored in soft hyphens, for example. The hyphenator could assist the author here by playing an interactive role. Similarly to spell checking, the author could be asked which meaning of a homograph is intended by having the author choose between expanded hyphenation patterns.</t>
            <t>Something that has not been discussed up to this point, but is illustrated in the previous example with wales and Wales, is the case sensitivity of hyphenation patterns. Hyphenation definitions MUST be specified as case-sensitively as possible. <!--TODO homograph!!-->In case capitalised, upper case and/or lower case spellings are merged, a lower case notation is RECOMMENDED, followed by capitalised and finally upper case. The reason is that casting to upper case or capitalised spelling can reduce information, and casting back to lower case cannot restore the eliminated information. Examples:</t>
            <figure>
<artwork><![CDATA[# German irreversible up and down casting
# Maße upcast -> MASSE
# MASSE ambiguous downcast -> Maße or Masse
# LATIN CAPITAL LETTER SHARP S U+1E9E is rarely used

# Dutch irreversible up and down casting
# officiëren upcast -> OFFICIEREN
# OFFICIEREN ambiguous downcast -> officiëren or officieren
# gêne upcast -> GENE
# GENE ambiguous downcast -> gêne or gene
# Dutch does not use diacritical marks in all upper case words
]]></artwork>
            </figure>   
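            <t>The German case can be reproduced with the Unicode case mappings that Python's built-in string methods apply. The sketch below is an illustration only and the function name is hypothetical. The Dutch examples are not reproduced, since dropping diacritical marks in upper case is an orthographic convention and not a Unicode case mapping.</t>
            <figure>
<artwork><![CDATA[def survives_round_trip(word):
    """True if upper-casing and lower-casing loses nothing."""
    return word.upper().lower() == word.lower()

print("Maße".upper())                # MASSE
print(survives_round_trip("Maße"))   # False
print(survives_round_trip("Masse"))  # True
]]></artwork>
            </figure>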
        </section>
        <section anchor="nested" title="Nested hyphenation">
            <t>Nesting of a substitution cluster inside a homograph cluster MAY be done. This is already defined in the grammar for <xref target="homograph">homograph hyphenation</xref>. Here the priority is on the enclosing homograph cluster. Deeper or other ways of nesting clusters are not allowed. This is very rare, but some examples for German are:</t>
            <figure>
<artwork><![CDATA[# German de-1901 nested hyphenation definitions
Bettücher;Be[t=tü~/{tt/tt=t}ü.]cher
Druckerzeugnis;Dru[{ck/k~k}er~/ck=er.]zeug~nis
Fussballehren;Fuss=ba[ll=/{ll/ll=l}]eh~ren
griffest;gri[f~f/{ff/ff=f}]est
Irreligion;I[{rr/rr=r}/r|r]e.li~gi~on
Staubecken;Stau[~b/b~]e{ck/k~k}en
]]></artwork>
            </figure>
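            <t>A useful consistency check for nested definitions is that both readings of the homograph cluster reduce to the same unhyphenated spelling. The Python sketch below expands the homograph cluster first and then takes the unhyphenated side of any substitution cluster. It assumes one homograph cluster per definition, and all names are assumptions for illustration.</t>
            <figure>
<artwork><![CDATA[import re

SUBST = re.compile(r"\{([^/{}]*)/([^/{}]*)\}")
HOMOGRAPH = re.compile(r"\[([^\[\]]*)\]")

def split_top(body):
    """Split a homograph body at the '/' outside curly brackets."""
    depth = 0
    for i, ch in enumerate(body):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif ch == "/" and depth == 0:
            return body[:i], body[i + 1:]
    raise ValueError("no separator in homograph cluster")

def readings(definition):
    """Both readings, substitution clusters left intact."""
    m = HOMOGRAPH.search(definition)
    head, tail = definition[:m.start()], definition[m.end():]
    return [head + alt + tail for alt in split_top(m.group(1))]

def spelling(definition):
    """Unhyphenated spelling with all marks removed."""
    text = SUBST.sub(r"\1", definition)
    return re.sub(r"[~|=\u00a6.]", "", text)

for reading in readings("gri[f~f/{ff/ff=f}]est"):
    print(reading, spelling(reading))
# grif~fest griffest
# gri{ff/ff=f}est griffest
]]></artwork>
            </figure>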
        </section>
    </section>
    <section anchor="priority" title="Hyphenation priority">
<figure>
<preamble>The following hyphenation priority is defined:</preamble>
<artwork><![CDATA[01 [] hyphenation of homograph,
       definition depends on semantics
02 {} dynamic hyphenation,
       change of spelling
03 =¦ hyphenation of compound's suffix,
       multiple = have higher priority
04 |= hyphenation of compound's prefix,
       multiple = have higher priority
05 =  hyphenation of compound,
       multiple = have higher priority
06 ¦  hyphenation of word's suffix,
       priority order is from right to left
07 |  hyphenation of word's prefix,
       priority order is from left to right
08 ~  hyphenation of word,
       multiple ~ have higher priority
09 =. unfavourable hyphenation of compound,
       multiple . have lower priority
10 ¦. unfavourable hyphenation of word's suffix,
       multiple . have lower priority
11 |. unfavourable hyphenation of word's prefix,
       multiple . have lower priority
12 ~. unfavourable hyphenation of word,
       multiple . have lower priority
13 .  unfavourable hyphenation in general,
       multiple . have lower priority
]]></artwork>
</figure>
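        <t>For automated processing, rows 03 to 13 of this table could be represented as a lookup from the kind of hyphenation point to its row number. The Python sketch below is one possible representation and not part of this standard. It collapses repeated characters before the lookup and therefore does not model the ordering within one row. The names are assumptions for illustration.</t>
        <figure>
<artwork><![CDATA[import re

# Row numbers from the table above; lower is higher priority.
RANK = {
    "=\u00a6": 3, "|=": 4, "=": 5, "\u00a6": 6, "|": 7, "~": 8,
    "=.": 9, "\u00a6.": 10, "|.": 11, "~.": 12, ".": 13,
}

def rank(mark):
    """Row number for a hyphenation point such as '==' or '~..'."""
    kind = re.sub(r"(.)\1+", r"\1", mark)  # collapse repeats
    return RANK[kind]

print(rank("=="), rank("|="), rank("~.."))  # 5 4 12
print(sorted(["~", "=", "|", "~."], key=rank))
# ['=', '|', '~', '~.']
]]></artwork>
        </figure>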

</section>
    <section anchor="reserved" title="Reserved characters">
        <t>Reserved characters for this format are:</t>
        <figure>
<artwork><![CDATA[/* Hyphenation Definitions 0.8
 * https://raw.github.com/OpenTaal/hyphenation-definitions/master/
 * grammar/grammar.ebnf
 *
 * Reserved characters
 * tab                         U+0009  CHARACTER TABULATION  '\t'
 * line feed                   U+000A  LINE FEED (LF)        '\n'
 * carriage return             U+000D  CARRIAGE RETURN (CR)  '\r'
 * space                       U+0020  SPACE                 ' '
 * begin comment               U+0023  NUMBER SIGN           '#'
 * unfavourable hyphen         U+002E  FULL STOP             '.'
 * cluster separator           U+002F  SOLIDUS               '/'
 * delimiter                   U+003B  SEMICOLON             ';'
 * compound hyphen             U+003D  EQUALS SIGN           '='
 * begin homograph cluster     U+005B  LEFT SQUARE BRACKET   '['
 * end homograph cluster       U+005D  RIGHT SQUARE BRACKET  ']'
 * begin substitution cluster  U+007B  LEFT CURLY BRACKET    '{'
 * prefix hyphen               U+007C  VERTICAL LINE         '|'
 * end substitution cluster    U+007D  RIGHT CURLY BRACKET   '}'
 * morpheme hyphen             U+007E  TILDE                 '~'
 * suffix hyphen               U+00A6  BROKEN BAR            '¦'
 */
]]></artwork>
        </figure>
        <t>Additionally, other characters may be used as placeholders inside of definitions where a hyphenation needs (re)work or reviewing. The following are recommended because these are rarely found in words and are visually quickly identified. The usage of these falls outside the definition of this format and should be filtered out before providing hyphenation patterns that comply with this standard. Examples are:</t>
        <figure>
<artwork><![CDATA[# Examples of placeholders for reviewing purposes
#räche;rä·che # U+00B7 MIDDLE DOT '·'
#radio;ra*dio # U+002A ASTERISK '*'
#tafel;ta_fel # U+005F LOW LINE '_'
]]></artwork>
        </figure>
        <t>Note that the middle dot '·' can be part of an orthography such as Catalan or Franco-Provençal. Use it with care. See also the section on <xref target="compound_interfix">compound interfix</xref> for characters used to make interfix annotations.</t>
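        <t>A validator could use the list of reserved characters to check that the unhyphenated word of a definition is free of them, and to flag definitions that still contain review placeholders. The Python sketch below is a minimal illustration under these assumptions. The names are hypothetical, and the placeholder set only covers the three characters recommended above.</t>
        <figure>
<artwork><![CDATA[# The sixteen reserved characters listed above.
RESERVED = set("\t\n\r #./;=[]{|}~\u00a6")

# Recommended placeholders: middle dot, asterisk, low line.
PLACEHOLDERS = set("\u00b7*_")

def reserved_in(word):
    """Reserved characters that occur in an unhyphenated word."""
    return sorted(set(word) & RESERVED)

def needs_review(definition):
    """True if a review placeholder is still present."""
    return bool(set(definition) & PLACEHOLDERS)

print(reserved_in("ex-vriend"))   # []
print(reserved_in("a/b"))         # ['/']
print(needs_review("ra*dio"))     # True
]]></artwork>
        </figure>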
    </section>
</middle>
<back>
    <references>
        <reference anchor="ISO14977" target="http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=26153">
            <front>
                <title>Information technology - Syntactic metalanguage - Extended BNF</title>
                <author>
                    <organization abbrev="ISO/IEC">International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), JTC 1</organization>
                    <address>
                        <postal>
                            <!--street>ISO/IEC Copyright Office</street-->
                            <street>Case Postale 56</street>
                            <city>Geneve 20</city> <code>CH-1211</code>
                            <country>Switzerland</country>
                        </postal>
                        <uri>http://iso.org</uri>
                    </address>
                </author>
                <date month="December" year="1996" />
            </front>
            <seriesInfo name="ISO/IEC" value="14977:1996" />
        </reference>
        <reference anchor="Lia83" target="http://www.tug.org/docs/liang/">
            <front>
                <title>Word Hy-phen-a-tion by Com-put-er</title>
                <author initials="F.M." surname="Liang" fullname="Franklin Mark Liang">
                    <organization>Stanford University, Department of Computer Science</organization>
                    <address>
                        <postal>
                            <street></street>
                            <city>Stanford</city> <region>CA</region> <code>94305</code>
                            <country>United States</country>
                        </postal>
                        <uri>http://www.stanford.edu</uri>
                    </address>
                </author>
                <date month="August" year="1983" />
            </front>
        </reference>
        <reference anchor="Gel14" target="http://github.com/OpenTaal/hyphenation-definitions/">
            <front>
                <title>A standard for prioritised and dynamic hyphenation definitions</title>
                <author initials="S." surname="van Geloven" fullname="Sander van Geloven">
                    <organization abbrev="OpenTaal">Stichting OpenTaal</organization>
                    <address>
                        <postal>
                            <street></street>
                            <city></city>
                            <country>Netherlands</country>
                        </postal>
                        <uri>http://www.opentaal.org</uri>
                    </address>
                </author>
                <date month="January" year="2014" />
            </front>
        </reference>
        <reference anchor="W3C11" target="http://www.w3.org/TR/CSS2/">
            <front>
                <title>Cascading Style Sheets Level 2 Revision 1 (CSS 2.1) Specification</title>
                <author>
                    <organization abbrev="W3C">World Wide Web Consortium</organization>
                    <address>
                        <postal>
                            <street>32 Vassar Street, Building 32-G514</street>
                            <city>Cambridge</city> <region>MA</region> <code>02139</code>
                            <country>United States</country>
                        </postal>
                        <uri>http://www.w3c.org</uri>
                    </address>
                </author>
                <date month="June" year="2011" />
            </front>
        </reference>
        <reference anchor="W3C13a" target="http://www.w3.org/TR/css3-text/">
            <front>
                <title>CSS Text Module Level 3</title>
                <author>
                    <organization abbrev="W3C">World Wide Web Consortium</organization>
                    <address>
                        <postal>
                            <street>32 Vassar Street, Building 32-G514</street>
                            <city>Cambridge</city> <region>MA</region> <code>02139</code>
                            <country>United States</country>
                        </postal>
                        <uri>http://www.w3c.org</uri>
                    </address>
                </author>
                <date month="October" year="2013" />
            </front>
        </reference>
        <reference anchor="W3C13b" target="http://www.w3.org/TR/html51/">
            <front>
                <title>HTML 5.1, A vocabulary and associated APIs for HTML and XHTML</title>
                <author>
                    <organization abbrev="W3C">World Wide Web Consortium</organization>
                    <address>
                        <postal>
                            <street>32 Vassar Street, Building 32-G514</street>
                            <city>Cambridge</city> <region>MA</region> <code>02139</code>
                            <country>United States</country>
                        </postal>
                        <uri>http://www.w3c.org</uri>
                    </address>
                </author>
                <date month="October" year="2013" />
            </front>
        </reference>
        <reference anchor="W3C99" target="http://www.w3.org/TR/html401/">
            <front>
                <title>HTML 4.01 Specification</title>
                <author>
                    <organization abbrev="W3C">World Wide Web Consortium</organization>
                    <address>
                        <postal>
                            <street>32 Vassar Street, Building 32-G514</street>
                            <city>Cambridge</city> <region>MA</region> <code>02139</code>
                            <country>United States</country>
                        </postal>
                        <uri>http://www.w3c.org</uri>
                    </address>
                </author>
                <date month="December" year="1999" />
            </front>
        </reference>
        <reference anchor="UNICODE" target="http://www.unicode.org/versions/Unicode6.3.0/">
            <front>
                <title>The Unicode Standard, Version 6.3.0</title>
                <author>
                    <organization>The Unicode Consortium</organization>
                    <address>
                        <postal>
                            <street></street>
                            <city>Mountain View</city> <region>CA</region>
                            <country>United States</country>
                        </postal>
                        <uri>http://www.unicode.org</uri>
                    </address>
                </author>
                <date month="September" year="2013" />
            </front>
        </reference>
        <reference anchor="Har09" target="http://www.ctan.org/tex-archive/info/patgen2/">
            <front>
                <title>A small tutorial on the multilingual features of PatGen2</title>
                <author initials="Y." surname="Haralambous" fullname="Yannis Haralambous">
                </author>
                <date month="December" year="2009" />
            </front>
        </reference>
        <reference anchor="SS95" target="https://www.tug.org/TUGboat/tb16-3/">
            <front>
                <title>Hyphenation in TEX - Quo Vadis?</title>
                <author initials="P." surname="Sojka" fullname="Petr Sojka">
                    <organization>Faculty of Informatics, Masaryk University</organization>
                    <address>
                        <postal>
                            <street>Burešova 20</street>
                            <city>Brno</city> <code>602 00</code>
                            <country>Czech Republic</country>
                        </postal>
                        <email>sojka@muni.cz</email>
                    </address>
                </author>
                <author initials="P." surname="Ševeček" fullname="Pavel Ševeček">
                    <organization>Faculty of Informatics, Masaryk University</organization>
                    <address>
                        <postal>
                            <street>Burešova 20</street>
                            <city>Brno</city> <code>602 00</code>
                            <country>Czech Republic</country>
                        </postal>
                        <email>pavel@muni.cz</email>
                    </address>
                </author>
                <date month="September" year="1995" />
            </front>
            <seriesInfo name="TB" value="16-3" />
        </reference>
        <reference anchor="Soj95" target="https://www.tug.org/TUGboat/tb16-3/">
            <front>
                <title>Notes on Compound Word Hyphenation in TEX</title>
                <author initials="P." surname="Sojka" fullname="Petr Sojka">
                    <organization>Faculty of Informatics, Masaryk University</organization>
                    <address>
                        <postal>
                            <street>Burešova 20</street>
                            <city>Brno</city> <code>602 00</code>
                            <country>Czech Republic</country>
                        </postal>
                        <email>sojka@muni.cz</email>
                    </address>
                </author>
                <date month="September" year="1995" />
            </front>
            <seriesInfo name="TB" value="16-3" />
        </reference>
        <reference anchor="MR08" target="https://www.tug.org/TUGboat/tb29-3/">
            <front>
                <title>Putting the Cork back in the bottle - Improving Unicode support in TEX</title>
                <author initials="M." surname="Miklavec" fullname="Mojca Miklavec">
                    <organization>Faculty of Mathematics and Physics, University of Ljubljana</organization>
                </author>
                <author initials="A." surname="Reutenauer" fullname="Arthur Reutenauer">
                    <organization>GUTenberg, France</organization>
                    <address>
                        <uri>http://tug.org/tex-hyphen</uri>
                    </address>
                </author>
                <date month="October" year="2008" />
            </front>
            <seriesInfo name="TB" value="29-3" />
        </reference>
        <reference anchor="Lem03" target="http://www.dante.de/DTK/Ausgaben_en.html">
            <front>
                <title>Hyphenation Exception Log für deutsche Trennmuster</title>
                <author initials="W." surname="Lemberg" fullname="Werner Lemberg">
                    <organization>DANTE, Deutschsprachige Anwendervereinigung TEX e.V.</organization>
                    <address>
                        <postal>
                            <street>Postfach 10 18 40</street>
                            <city>Heidelberg</city> <code>69008</code>
                            <country>Germany</country>
                        </postal>
                        <email>dante@dante.de</email>
                        <uri>http://www.dante.de</uri>
                    </address>
                </author>
                <date month="May" year="2003" />
            </front>
            <seriesInfo name="DTK" value="15-2" />
        </reference>
        <reference anchor="BS92" target="http://www.dante.de/DTK/Ausgaben_en.html">
            <front>
                <title>Deutsche Silbentrennung für TeX 3.1</title>
                <author initials="W." surname="Barth" fullname="Wilhelm Barth">
                    <organization>DANTE, Deutschsprachige Anwendervereinigung TeX e.V.</organization>
                    <address>
                        <postal>
                            <street>Postfach 10 18 40</street>
                            <city>Heidelberg</city> <code>69008</code>
                            <country>Germany</country>
                        </postal>
                        <email>dante@dante.de</email>
                        <uri>http://www.dante.de</uri>
                    </address>
                </author>
                <author initials="H." surname="Steiner" fullname="Helmut Steiner">
                    <organization>DANTE, Deutschsprachige Anwendervereinigung TeX e.V.</organization>
                    <address>
                        <postal>
                            <street>Postfach 10 18 40</street>
                            <city>Heidelberg</city> <code>69008</code>
                            <country>Germany</country>
                        </postal>
                        <email>dante@dante.de</email>
                        <uri>http://www.dante.de</uri>
                    </address>
                </author>
                <date month="May" year="2005" />
            </front>
            <seriesInfo name="DTK" value="17-2" />
        </reference>
        <reference anchor="Lem05" target="http://www.dante.de/DTK/Ausgaben_en.html">
            <front>
                <title>Hyphenation Exception Log für deutsche Trennmuster, Version 1</title>
                <author initials="W." surname="Lemberg" fullname="Werner Lemberg">
                    <organization>DANTE, Deutschsprachige Anwendervereinigung TeX e.V.</organization>
                    <address>
                        <postal>
                            <street>Postfach 10 18 40</street>
                            <city>Heidelberg</city> <code>69008</code>
                            <country>Germany</country>
                        </postal>
                        <email>dante@dante.de</email>
                        <uri>http://www.dante.de</uri>
                    </address>
                </author>
                <date month="May" year="2005" />
            </front>
            <seriesInfo name="DTK" value="17-2" />
        </reference>
        <reference anchor="Hen08" target="http://www.dante.de/DTK/Ausgaben_en.html">
            <front>
                <title>Einige Fragen zum Beitrag »Hyphenation Exception Log für deutsche Trennmuster, Version 1«</title>
                <author initials="S." surname="Hennig" fullname="Stephan Hennig">
                    <organization>DANTE, Deutschsprachige Anwendervereinigung TeX e.V.</organization>
                    <address>
                        <postal>
                            <street>Postfach 10 18 40</street>
                            <city>Heidelberg</city> <code>69008</code>
                            <country>Germany</country>
                        </postal>
                        <email>dante@dante.de</email>
                        <uri>http://www.dante.de</uri>
                    </address>
                </author>
                <date month="January" year="2008" />
            </front>
            <seriesInfo name="DTK" value="20-1" />
        </reference>
        <reference anchor="Nem06" target="https://www.tug.org/TUGboat/tb27-1/">
            <front>
                <title>Automatic non-standard hyphenation in OpenOffice.org</title>
                <author initials="L." surname="Németh" fullname="László Németh">
                    <organization>TeX Users Group</organization>
                    <address>
                        <postal>
                            <street></street>
                            <city>Portland</city> <region></region>
                            <country>United States</country>
                        </postal>
                        <uri>https://www.tug.org</uri>
                    </address>
                </author>
                <date month="October" year="2006" />
            </front>
            <seriesInfo name="TB" value="27-1" />
        </reference>
        <reference anchor="RFC2119" target="https://www.rfc-editor.org/info/rfc2119">
            <front>
                <title>Key words for use in RFCs to Indicate Requirement Levels</title>
                <author initials="S." surname="Bradner" fullname="Scott Bradner">
                    <organization>Harvard University</organization>
                    <address>
                        <postal>
                            <street></street>
                            <city>Cambridge</city> <region>MA</region> <code>02138</code>
                            <country>United States</country>
                        </postal>
                        <uri>http://www.harvard.edu</uri>
                    </address>
                </author>
                <date month="March" year="1997" />
            </front>
            <seriesInfo name="RFC" value="2119" />
        </reference>
        <reference anchor="BCP47" target="https://www.rfc-editor.org/info/bcp47">
            <front>
                <title>Tags for Identifying Languages</title>
                <author initials="A." surname="Phillips" fullname="Addison Phillips">
                    <organization>Yahoo! Inc.</organization>
                    <address>
                        <postal>
                            <street></street>
                            <city></city> <region></region> <code></code>
                            <country>United States</country>
                        </postal>
                        <email>addison@inter-locale.com</email>
                        <uri>http://www.yahoo.com</uri>
                    </address>
                </author>
                <author initials="M." surname="Davis" fullname="Mark Davis">
                    <organization>Google</organization>
                    <address>
                        <postal>
                            <street></street>
                            <city></city> <region></region> <code></code>
                            <country>United States</country>
                        </postal>
                        <email>mark.davis@macchiato.com or mark.davis@google.com</email>
                        <uri>http://www.google.com</uri>
                    </address>
                </author>
                <date month="September" year="2006" />
            </front>
            <seriesInfo name="BCP" value="47" />
        </reference>
        <reference anchor="TM14" target="http://projekte.dante.de/Trennmuster/">
            <front>
                <title>Trennmuster</title>
                <author>
                    <organization abbrev="DANTE">DANTE, Deutschsprachige Anwendervereinigung TeX e.V.</organization>
                    <address>
                        <postal>
                            <street>Postfach 10 18 40</street>
                            <city>Heidelberg</city> <code>D-69008</code>
                            <country>Germany</country>
                        </postal>
                        <phone>+49 6221 2 97 66</phone>
                        <facsimile>+49 6221 16 79 06</facsimile>
                        <email>dante@dante.de</email>
                        <uri>http://www.dante.de</uri>
                    </address>
                </author>
                <date month="January" year="2014" />
            </front>
        </reference>
    </references>
    <section anchor="grammar" title="Grammar">
        <t>The complete grammar for this format of hyphenation definitions is in <xref target="ISO14977">Extended Backus-Naur Form (EBNF)</xref>:</t>
        <figure>
<artwork><![CDATA[/* Hyphenation Definitions 0.8
 * https://raw.github.com/OpenTaal/hyphenation-definitions/master/
 * grammar/grammar.ebnf
 *
 * Reserved characters
 * tab                         U+0009  CHARACTER TABULATION  '\t'
 * line feed                   U+000A  LINE FEED (LF)        '\n'
 * carriage return             U+000D  CARRIAGE RETURN (CR)  '\r'
 * space                       U+0020  SPACE                 ' '
 * begin comment               U+0023  NUMBER SIGN           '#'
 * unfavourable hyphen         U+002E  FULL STOP             '.'
 * cluster separator           U+002F  SOLIDUS               '/'
 * delimiter                   U+003B  SEMICOLON             ';'
 * compound hyphen             U+003D  EQUALS SIGN           '='
 * begin homograph cluster     U+005B  LEFT SQUARE BRACKET   '['
 * end homograph cluster       U+005D  RIGHT SQUARE BRACKET  ']'
 * begin substitution cluster  U+007B  LEFT CURLY BRACKET    '{'
 * prefix hyphen               U+007C  VERTICAL LINE         '|'
 * end substitution cluster    U+007D  RIGHT CURLY BRACKET   '}'
 * morpheme hyphen             U+007E  TILDE                 '~'
 * suffix hyphen               U+00A6  BROKEN BAR            '¦'
 */
HyphenationDefinitions
         ::= ( EOL* HyphenationDefinition? WhiteSpace? Comment? )*
EOL
         ::= ( '\r' | #x000D ) ( '\n' | #x000A )?
           | ( '\n' | #x000A )
WhiteSpace
         ::= ( ( ' ' | #x0020 )
             | ( '\t' | #x0009 ) )+
Comment
         ::= '#' ( [#x0009]
                 | [#x0020-#xD7FF]
                 | [#xE000-#xFFFD]
                 | [#x10000-#x10FFFF] )*
HyphenationDefinition
         ::= Word Delimiter Definition
Delimiter
         ::= ';' | #x003B
Word
         ::= Character Character+
Character
         ::= [#x0021-#x0022]
           | [#x0024-#x002D]
           | [#x0030-#x003A]
           | [#x003C]
           | [#x003E-#x005A]
           | [#x005C]
           | [#x005E]
           | [#x0060-#x007A]
           | [#x007F-#x00A5]
           | [#x00A7-#xD7FF]
           | [#xE000-#xFFFD]
           | [#x10000-#x10FFFF]
Definition
         ::= Cluster ( Hyphen Cluster )*
Hyphen
         ::= MorphemeHyphen
           | SuffixHyphen
           | PrefixHyphen
           | CompoundHyphen
           | CompoundSuffixHyphen
           | CompoundPrefixHyphen
           | UnfavourableHyphen
MorphemeHyphen
         ::= ( '~' | #x007E )+
SuffixHyphen
         ::= '¦' | #x00A6
PrefixHyphen
         ::= '|' | #x007C
CompoundHyphen
         ::= ( '=' | #x003D )+

CompoundSuffixHyphen
         ::= ( '=' | #x003D )+ ( '¦' | #x00A6 )

CompoundPrefixHyphen
         ::= ( '|' | #x007C ) ( '=' | #x003D )+

UnfavourableHyphen
         ::= ( ( '~' | #x007E )
             | ( '|' | #x007C )
             | ( '¦' | #x00A6 )
             | ( '=' | #x003D ) )?
             ( '.' | #x002E )+
Cluster
         ::= ( CharacterCluster
             | SubstitutionCluster
             | HomographCluster )+
CharacterCluster
         ::= Character+
SubstitutionCluster
         ::= '{' CharacterCluster '/'
               ( CharacterCluster ( Hyphen CharacterCluster? )?
               | Hyphen CharacterCluster? )
             '}'
Series
         ::= ( CharacterCluster (Hyphen CharacterCluster)* Hyphen? )
           | ( Hyphen (CharacterCluster Hyphen)* CharacterCluster? )
HomographCluster
         ::= '[' ( Series | (SubstitutionCluster Series? ) ) '/'
                 SubstitutionCluster? Series? ']'
]]></artwork>
        </figure>
        <t>This grammar can be visualised as a railroad diagram, for example with the generator at http://bottlecaps.de/rr/ui.</t>
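        <t>To make the grammar more tangible, the following Python sketch reads a single line of hyphenation definitions and splits the definition into clusters and hyphens. It is an illustration under stated assumptions, not part of the standard: the function name parse_line and the token labels are hypothetical, the check on the word only rejects the reserved characters listed in the grammar header rather than enforcing the full Character ranges, and substitution clusters and homograph clusters are rejected as unsupported.</t>
        <figure>
<artwork><![CDATA[import re

# Reserved characters, taken from the header comment of the grammar.
RESERVED = set("\t\n\r #./;=[]{|}~\u00a6")

# One alternative per hyphen kind in the grammar's Hyphen rule.
# Alternatives that share a prefix are ordered longest first.
HYPHEN = re.compile(
    r"(?P<unfavourable>[~|\u00a6=]?\.+)"
    r"|(?P<compound_prefix>\|=+)"
    r"|(?P<compound_suffix>=+\u00a6)"
    r"|(?P<compound>=+)"
    r"|(?P<morpheme>~+)"
    r"|(?P<suffix>\u00a6)"
    r"|(?P<prefix>\|)"
)


def parse_line(line):
    """Return (word, tokens), or None if the line has no definition."""
    # Drop the line ending, an optional comment and trailing white space.
    body = line.rstrip("\r\n").split("#", 1)[0].rstrip(" \t")
    if not body:
        return None
    if body.count(";") != 1:
        raise ValueError("expected exactly one ';' delimiter")
    word, definition = body.split(";")
    # Simplified word check: at least two characters, none reserved.
    if len(word) < 2 or RESERVED & set(word):
        raise ValueError("invalid word: %r" % word)

    tokens = []
    pos = 0
    while pos < len(definition):
        match = HYPHEN.match(definition, pos)
        if match:
            # A hyphen is only allowed directly after a cluster.
            if not tokens or tokens[-1][0] != "cluster":
                raise ValueError("hyphen must follow a cluster")
            tokens.append((match.lastgroup, match.group()))
            pos = match.end()
        elif definition[pos] in RESERVED:
            # Brackets, braces, solidus and white space are not handled.
            raise ValueError("unsupported character: %r" % definition[pos])
        else:
            # Collect a run of non-reserved characters as one cluster.
            end = pos
            while end < len(definition) and definition[end] not in RESERVED:
                end += 1
            tokens.append(("cluster", definition[pos:end]))
            pos = end
    if not tokens or tokens[-1][0] != "cluster":
        raise ValueError("definition must end with a cluster")
    return word, tokens
]]></artwork>
        </figure>
        <t>For the illustrative input line "voetbal;voet=bal", this sketch returns the word "voetbal" and three tokens: the cluster "voet", a compound hyphen "=" and the cluster "bal". A line that contains only a comment yields no definition.</t>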
    </section>
    <section anchor="acknowledgements" title="Acknowledgements">
        <t>The author gratefully acknowledges, in alphabetical order, the contributions of Ruud Baars, Simon Brouwer, Arnoud van den Eerenbeemt, Stephan Hennig, Bart Knubben, Werner Lemberg, Bob van de Loo, Mojca Miklavec, Günther Milde, László Németh, Georg Pfeiffer, Kurt Roeckx, Reinout van Schrouwen, Bert Veenhoff, Herbert Voss and Tobias Wendorff. Most of them contribute to Stichting OpenTaal, the Nederlandstalige TeX Gebruikersgroep (NTG) or DANTE's Trennmuster project.</t>
        <t>This standard is based on a <xref target="Gel14">poster presentation</xref> at the 24th Meeting of Computational Linguistics in The Netherlands (CLIN24), Leiden, Netherlands, January 17th, 2014. Thanks go to the Institute for Dutch Lexicology (INL) and the Dutch-Flemish HLT Agency (TST-Centrale) for the organisation.</t>
    </section>
</back>
</rfc>
<!--TODO Store syllabification also for words that are too short to hyphenate. This information could be used for other goals but also when a word is used in a compound.-->

<!--TODO Hyphenation patterns can also be used for automated selection of typographic ligatures. See German and Dutch ligatures of ffi, fff and ll, lli which should not be used in certain compounds or only in a special way.

Stylistic ligatures

These arose because with the usual type sort for lowercase f, the end of its hood is on a kern, which would be damaged by collision with raised parts of the next letter.

Sometimes, a ligature crossing the morpheme boundary of a composite word (e.g., ff in shelf‌ful) is considered undesirable, and for example official German orthography as outlined in the Duden prohibits ligatures across composition boundaries. Some computer programs (such as TeX) provide a means of suppressing ligatures.

Some fonts include an fff ligature (the Requiem font by Jonathan Hoefler even contains an fffl ligature), intended for German compound words like Sauerstoffflasche ("oxygen tank") and Schifffahrt ("boat trip"). However, since the sequence fff in German occurs only across composition boundaries (Schiff-fahrt, Sauerstoff-flasche) and ligatures are officially prohibited across boundaries, these ligatures cannot be correctly employed for German.

Usage in German

In German, ligatures are set only when the letters to be joined lie within the same morpheme, for example in the word stem. Ligatures are generally not set when they would span a grammatical joint (e.g. a compound boundary). „Kaufläche“ (Kau-fläche) is therefore written with the ﬂ ligature, whereas „Kaufleute“ uses no ligature, because the letters f and l belong to different word parts (Kauf-leute). Suffixes beginning with i (-ig, -in, -ich, -isch) are an exception: here ligatures are set even across the grammatical joint. For example, „häufig“ is written with the ﬁ ligature despite the joint (häuf-ig).

The use of ligatures is not bindingly regulated; the general principle is that no ligature is used when the individual letters are pronounced separately.-->


<!--TODO https://en.wikipedia.org/wiki/IJ_%28digraph%29 This Dutch shopkeeper wrote 'byoux' instead of 'bijoux', perhaps because he could not imagine 'ij' to exist in a French word

antijudaïsme
Baijum
bijou
bijous
bijouterie
bijouterieën
bijoutje
Dijon
dijonmosterd
Herbaijum
Hijum
Tijum


dijood-gefunctionaliseerde
dijood
Beijing
Fiji
Fiji-eilanden
Meiji-restauratie
gaijin
Beijing
Fijiër

topologie

Heijoshin 平常心

[Heijoshin] In Budo, maybe more specifically in sword/katana related arts, there is an important term/concept which is Heijoshin. Heijoshin is a three kanji word, the first one “Hei” means calm, peaceful, steady. The second one “jo” means always, constant. The third one “Shin” means mind or heart, the whole inner essence of the individual.

So a final interpretation could be “Keeping your state of mind all the time/in all situations”.

Hiji - elbow
kimi osae
kimi nage-->

<!--TODO Add translations for "German de-1901 nested hyphenation definitions" -->


            <!--section title="Alternative definitions">
                <figure>
                    <preamble>Some hyphenations change when a new official spelling or grammar is introduced. For these cases it is possible to store both the old and the new hyphenation definition, which keeps old documents backwards compatible. The normal first and second columns, separated by a semicolon, remain usable. Special cases are defined by naming, in this case, the second column between two hyphens and providing the different hyphenation definitions in the third and fourth columns. The example shows this flexible, extensible format:</preamble>
<artwork><![CDATA[# German words with hyphenation definition
Aase;Aa-se

# German hyphenations before and after 2006 in third and fourth column
Acrivastin;-2-;Acri-va-stin;Acri-vas-tin
Aldosteron;-2-;Al-do-ste-ron;Al-dos-te-ron

# More alternatives can be stored in other columns
Geschosse;-2-;-3-;-4-;Ge-scho-sse;Ge-schos-se;Ge-schos-se
Geschosses;-2-;-3-;-4-;Ge-scho-sses;Ge-schos-ses;Ge-schos-ses
]]></artwork>
                    <postamble>It is important to clearly document the meaning in comments when more than two columns are used. In this way, one file with hyphenation definitions and alternative hyphenation definitions can be used to generate separate files with hyphenation patterns, each for a different spelling or a special form of a language. This is used, for example, in German to easily add special cases for older spelling, Austrian German or Swiss German.</postamble>
                </figure>
            </section-->
<!--
à-la-carterestaurant
B-52-bommenwerper
F-16-gevechtsvliegtuig
a-capellakoor
# ((ad + -) + hoc) + (be + leid) [ad hoc policy]
ad-hocbeleid;ad{-/~}hoc=be~leid

# (in- + (vi + tro)) + (pro + ces) [in vitro process]
in-vitroproces;in{-/~~}vi~tro=pro~ces
-->

<!--TODO no duplicate rules within one box-->