<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt'?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc comments="yes"?>
<?rfc editing="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<rfc number="TODO" category="info">
<front>
<title abbrev="Hyphenation Definitions Standard">A standard for prioritised and dynamic hyphenation definitions</title>
<author initials="S." surname="van Geloven" fullname="Sander van Geloven">
<organization abbrev="OpenTaal">Stichting OpenTaal</organization>
<address>
<postal>
<street></street>
<city></city>
<country>Netherlands</country>
</postal>
<uri>http://www.opentaal.org</uri>
</address>
</author>
<date month="January" year="2014" />
<area>General</area>
<keyword>lexicology</keyword>
<keyword>orthography</keyword>
<keyword>hyphenation</keyword>
<keyword>standard</keyword>
<abstract>
<t>This document describes a standard for hyphenation definitions that enables the generation of prioritised and dynamic hyphenation patterns. In the early nineteen-eighties, automatic hyphenation of lexical items was made possible by a hyphenator using language-specific hyphenation patterns. These patterns are generated by the hyphenation software community from hyphenated word lists. The initial design was based on English orthography and a limited character encoding. Support for extended encodings was added in the 1990s, mostly for Western languages. However, the hyphenated word list format remained largely unchanged. This complicated support for specific morphological or phonological structures that require hyphenation priority in compounds or dynamic hyphenation resulting in altered spelling. Although over 70 languages are now supported, hyphenation is suboptimal, and it is impossible for languages relying on a universal character encoding. This limited method of hyphenation has catered to digital typesetting for over three decades. Unfortunately, recently implemented hyphenation in layout engines for web page rendering is built upon the same outdated technology. An improved hyphenator and extended hyphenation patterns are necessary to overcome current limitations and support a wider range of languages. To achieve this, the software community needs a standard format for hyphenation definitions in universal human-readable hyphenated word lists. A context-free grammar was developed with unambiguous and fine-grained control allowing enhanced hyphenation. All language-specific cases are illustrated with examples and lexicological theory. Our standard for hyphenation definitions enables improved automatic hyphenation for printed media and web documents.</t>
</abstract>
</front>
<middle>
<section anchor="introduction" title="Introduction">
<t>Over recent decades, automated hyphenation of text was born and went through several growth spurts. Unfortunately, the hyphenation patterns currently used by the hyphenation algorithm cannot offer prioritised or dynamic hyphenation. To enable the next developmental leap and overcome this limitation, these patterns need to be generated from prioritised and dynamic hyphenation definitions. This document describes a detailed and illustrated standard for these definitions.</t>
<section title="Requirements language">
<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in <xref target="RFC2119">RFC 2119</xref> only when they appear in all upper case. They may also appear in lower or mixed case as English words, without special meaning.</t>
</section>
<section title="Language tags">
<t>References to specific orthographies are made according to <xref target="BCP47">BCP 47</xref>. For example, "de-CH-1996" represents German as used in Switzerland and written according to the spelling reform that began in 1996, and "de-1901" represents the German orthography reform of 1901.</t>
</section>
<section title="Character encoding">
<t>References to specific characters in this document are always made via <xref target="UNICODE">Unicode</xref> characters and code points. A Unicode code point can be recognised by a capital U, followed by a plus sign and four to six hexadecimal digits. Usually, four or five digits are used. A Unicode character is shown between single quotation marks and the Unicode name of the character is written in all capitals. An example is the code point U+003D, which indicates the character '=' known as the EQUALS SIGN.</t>
</section>
<section title="Format description">
<t>The format is formally described by a grammar in <xref target="ISO14977">Extended Backus-Naur Form (EBNF)</xref>. This notation allows hyphenation definitions to be written, validated and parsed by means of a context-free grammar. Rules and comments for this grammar are recognised in this document by ::= and /* respectively. The syntax of all accompanying examples, recognisable by a #, always conforms to this grammar.</t>
</section>
<section title="Design decisions">
<t>Compiling an international standard involves making many decisions. It is far from a trivial task. For example, selecting a reserved character involves checking whether that character is not used in words. Words are normally considered a concatenation of characters separated by spaces or punctuation, but this differs substantially amongst written languages. What might be a practical choice for one language could be incompatible with another. Likewise, this standard does not concern itself with the validity of the resulting hyphenations. This is left up to the users, as languages, and even dialects, have different rules and exceptions based on etymological, morphological or phonetic principles. The key is that the designed format offers the end user a maximum degree of freedom and flexibility.</t>
</section>
</section>
<section title="Hyphenation">
<t>TODO general introduction and example</t>
<figure>
<artwork><![CDATA[# Examples of hyphenated text in English and Dutch.
#
# An extre- Een boom met prui-
# mely long men die men als ei-
# English eren beschrijft be-
# word over- treft hun omvang.
# looking a Iemand wilde pluk-
# nice sen- ken zonder toestem-
# tence as ming te hebben. Er
# a beauti- werd ook nog gespro-
# ful exam- ken dat hij een har-
# ple here. tendiefje was.
]]></artwork>
</figure>
<section anchor="general_hyphenation" title="Hyphenation in general">
<t>TODO general concept</t>
<!--t> ... Hyphenation is possible on one or several so called hyphenations points of a word. These are usually the in between of consecutive syllables. For aesthetic reasons, a hyphenation point is never after the first or before the last syllable when that syllable consists only of one character.</t-->
</section>
<section anchor="history" title="History">
<t>TODO implementations and patgen and refs<!--(what were de facto standard or common use) patgen patgen2-->
<xref target="Lia83">todo</xref>
<xref target="Nem06">todo</xref>
<xref target="TM14">asdf</xref>
<xref target="SS95">asdf</xref>
<xref target="Soj95">asdf</xref>
<xref target="Har09">asdf</xref>
<xref target="MR08">asdf</xref>
<xref target="Lem03">asdf</xref>
<xref target="Lem05">asdf</xref>
<xref target="Hen08">asdf</xref>
<xref target="BS92">asdf</xref>
<xref target="W3C11">asdf</xref>
<xref target="W3C13b">asdf</xref>
<xref target="W3C13a">asdf</xref>
<xref target="W3C99">asdf</xref>
</t>
</section>
<section anchor="automated_hyphenation" title="Automated hyphenation">
<t>TODO the challenge and paper/webpage <!-- (only concept) usage of hyphenation patterns which are generated from hyphenation definitions by software such as patgen. TODO talk more about the challenge TODO some history-->
create word list for language or dialect
generate suggested hyphenation definitions
manually review hyphenation definitions
</t>
<figure>
<artwork><![CDATA[Process of delivering automated hyphenation
+---------------------+
| word list for a |
| language or dialect |
+---------------------+
|| automated syllabification
\/
+-------------------------+
| working set of |
| hyphenation definitions |
+-------------------------+
|| manual review and
\/ automated validation
+-----------------+
| +-------------+ |
| | HYPHENATION | |
| | DEFINITIONS | |
| +-------------+ |
+-----------------+
|| preprocessing by
\/ hyphenation algorithm
+----------------------+
| hyphenation patterns |
| to ship in software |
+----------------------+
|| real-time use of
\/ hyphenation algorithm
+------------------+
| automatically |
| hyphenated text |
+------------------+
]]></artwork>
</figure>
<t>This standard caters to the following two functional requirements.
<list style="symbols">
<t>As an editor (i.e. person) I want to document hyphenation points in a word list for a certain language or dialect by means of hyphenation definitions.</t>
<t>As a hyphenation algorithm preprocessor (i.e. software application) I want to retrieve hyphenation points from hyphenation definitions in order to generate hyphenation patterns for a certain language or dialect.</t>
</list>
Both cases are a part of the process to provide automated hyphenation of text in software applications.</t>
</section>
<section anchor="applications" title="Applications that hyphenate">
<t>Improving automated hyphenation affects all software applications depending on it. To indicate the impact of a change it is important to list affected products and organisations. The following applications currently use hyphenation patterns which originate from patgen:
<list style="symbols">
<t>document preparation systems based on TeX
<list style="symbols">
<t>Babel - TeX's and LaTeX's multilingual typesetting</t>
<t>polyglossia - XeLaTeX's and LuaLaTeX's multilingual typesetting</t>
</list>
</t>
<t>hyphenation and justification with libhyphen
<list style="symbols">
<t>LibreOffice - The Document Foundation's office suite</t>
<t>Apache OpenOffice - Apache Software Foundation's office suite</t>
<t>Inkscape* - a vector graphics editor</t>
<t>GIMP - a raster graphics editor</t>
<t>Scribus - desktop publishing software</t>
<t>InDesign - Adobe's desktop publishing software</t>
<t>Illustrator - Adobe's vector graphics editor</t>
</list>
</t>
<t>client-side hyphenation in JavaScript with hyphenator.js</t>
<t>layout engines for rendering web pages
<list style="symbols">
<t>Gecko by Mozilla
<list style="symbols">
<t>Firefox - Mozilla's web browser</t>
<t>Thunderbird - Mozilla's e-mail and news client</t>
<t>Firefox for mobile - Mozilla's web browser for Android</t>
</list>
</t>
<t>WebKit by Apple and Adobe
<list style="symbols">
<t>Safari - Apple's web browser</t>
<t>Konqueror - KDE's web browser and file manager</t>
</list>
</t>
<t>Blink by Google
<list style="symbols">
<t>Chromium and Chrome - Google's web browsers</t>
<t>Opera - Opera's web browser</t>
<t>Web Browser - Google's default web browser for Android</t>
</list>
</t>
</list>
</t>
</list>
* Implementation of automated hyphenation for Inkscape is planned for the near future.</t>
<t>This overview does not endorse or favour the use of any of these applications and respects registered trademarks where applicable. It is merely included to illustrate the wide spectrum of applications employing hyphenation patterns.</t>
</section>
</section>
<section anchor="basic" title="Basic format">
<t>This section describes the basic format for hyphenation definitions. These are usually stored in computer files, but they can also reside in databases or memory. The structure will be described step by step, extending the grammar for this format and illustrating usage with examples. The syntax of all examples complies with the grammar of this format.</t>
<section anchor="main_structure" title="Main structure">
<t>In order to support as many languages as possible, this format for hyphenation definitions MUST use Unicode characters in UTF-8 encoding. A set of hyphenation definitions MAY have one or more lines. Each line MAY have, in the following order:
<list style="numbers">
<t>a hyphenation definition,</t>
<t>white space,</t>
<t>and/or comments.</t>
</list>
This is the top-level or main structure of the entire format. The syntax for hyphenation definitions in Extended Backus-Naur Form (EBNF) will therefore be:</t>
<figure>
<artwork><![CDATA[HyphenationDefinitions
::= ( EOL* HyphenationDefinition? WhiteSpace? Comment? )*
]]></artwork>
</figure>
<t>Here EOL stands for an end of line. An end of line MUST have a LINE FEED (LF) or U+000A and MAY have a CARRIAGE RETURN (CR) or U+000D. This is written in EBNF as:</t>
<figure>
<artwork><![CDATA[EOL
::= ( '\r' | #x000D )? ( '\n' | #x000A )
]]></artwork>
</figure>
<t>White space can be inserted to improve human readability of hyphenation definitions but is OPTIONAL. When used, it SHALL contain only SPACE U+0020 or CHARACTER TABULATION U+0009 characters. White space in EBNF is:</t>
<figure>
<artwork><![CDATA[WhiteSpace
::= ( ( ' ' | #x0020 )
| ( '\t' | #x0009 ) )+
]]></artwork>
</figure>
<t>A comment MUST start with a NUMBER SIGN U+0023 or '#' and MAY contain any combination of printable characters thereafter. Comments MUST NOT contain control characters that can result in an end of line, however the CHARACTER TABULATION U+0009 MAY be used in comments. In EBNF a comment is:</t>
<figure>
<artwork><![CDATA[Comment
::= '#' ( [#x0009]
| [#x0020-#xD7FF]
| [#xE000-#xFFFD]
| [#x10000-#x10FFFF] )*
]]></artwork>
</figure>
<t>Note that the allowed range of characters needs to be fine-tuned later on. It needs to exclude more noncharacters according to section 16.7, Noncharacters, of <xref target="UNICODE">Unicode</xref>. At least the range U+0080 to U+009F is a candidate for exclusion here, and also in the character range defined in <xref target="general_hyphenation_definition">hyphenation definitions in general</xref>.</t>
<!-- OLD section title="Comments and whitespace">It is possible to have comments, prefixed with percent (#), after each definition. It is also possible that an entire line is regarded as comments. Note that comments can be preceded by whitespace in terms of spaces or tabs. This example shows the use of comments and empty lines:</section-->
<t>With the definition of the main structure, without any actual hyphenation definition, it is possible to store data in this format. An example with ends of lines, white space and comments is:</t>
<figure>
<artwork><![CDATA[# This is the first line with only a comment
# This is the third line after an empty second line.
## After some whitespace, this is the fourth line. # #
# Comments can use most reserved characters, e.g. {}[]/|~=.; #
# and Unicode orthographies, e.g.
# ру́сский
# язы́к,
# język polski and
# ελληνική
# γλώσσα
]]></artwork>
</figure>
<t>This completes the description of the main structure, which is processed in a line-by-line fashion.</t>
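<t>As an illustration only, the line-by-line processing described above could be sketched in Python as follows. The function name split_line and the regular expression are assumptions of this sketch and are not part of the standard. The sketch expects a definition, when present, to start at the beginning of the line, and it does not validate the character ranges of the definition or the comment.</t>
<figure>
<artwork><![CDATA[
```python
import re

# Hypothetical helper, not part of the standard.
# One line is: optional definition, optional white space, optional comment.
LINE = re.compile(r'^(?P<definition>[^ \t#]+)?[ \t]*(?P<comment>#.*)?$')


def split_line(line):
    """Split one line into (definition, comment); absent parts are None."""
    # Strip the end of line (LF, optionally preceded by CR) before matching.
    match = LINE.match(line.rstrip('\r\n'))
    if match is None:
        raise ValueError('not a valid line: %r' % line)
    return match.group('definition'), match.group('comment')
```
]]></artwork>
</figure>
<t>With this sketch, a line holding "door;door # English" would be split into the definition "door;door" and the comment "# English", while an empty line yields neither.</t>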
</section>
<section anchor="general_hyphenation_definition" title="Hyphenation definition in general">
<t>A hyphenation definition is the essential part of this format and MUST have, in this order:
<list style="numbers">
<t>a word,</t>
<t>a delimiter,</t>
<t>and a definition.</t>
</list>
This is where the actual hyphenation definition is provided for a word. A word is REQUIRED to be unique amongst all definitions in a single file because it is the unique key for looking up a hyphenation definition. A hyphenation definition in EBNF is written as:</t>
<figure>
<artwork><![CDATA[HyphenationDefinition
::= Word Delimiter Definition
]]></artwork>
</figure>
<t>The delimiter MUST be a SEMICOLON ';' or U+003B. In EBNF this is:</t>
<figure>
<artwork><![CDATA[Delimiter
::= ';' | #x003B
]]></artwork>
</figure>
<t>A word MUST be a concatenation of at least two characters:</t>
<figure>
<artwork><![CDATA[Word
::= Character Character+
]]></artwork>
</figure>
<t>Most Western languages require a word to have a minimum of four characters before it is considered a candidate for hyphenation. When hyphenating, these languages require a minimum of two characters before and after the hyphenation point. The hyphenation character inserted is usually a HYPHEN-MINUS U+002D or '-'. However, some languages have a lexicography with a different set of rules for hyphenation.</t>
<t>Modern Greek, however, allows hyphenation directly after a single-character prefix. Another counterexample is the Ge'ez language. It uses an ETHIOPIC WORDSPACE or U+1361 to separate words. This language has no need for a hyphen character at the end of a line because no ambiguity can arise as to whether a word ends at the end of a line or not. This allows for hyphenation of a single character at the end of a word.</t>
<t>For these reasons, this format allows hyphenation definitions for words with a minimum of two characters. It is up to the user to enforce stricter rules for a greater minimum word length if needed. These are parameters of the hyphenation algorithm preprocessor, which can ignore words that are too short.</t>
<t>A character in a word MUST be a printable character. It MUST NOT be a control character such as LINE FEED or CHARACTER TABULATION, and it MUST NOT be a reserved character such as SPACE U+0020 ' ' or NUMBER SIGN U+0023 '#', as discussed. Without going into detail on the other reserved characters, the definition of a character in EBNF is:</t>
<figure>
<artwork><![CDATA[Character
::= [#x0021-#x0022]
| [#x0024-#x002D]
| [#x0030-#x003A]
| [#x003C]
| [#x003E-#x005A]
| [#x005C]
| [#x005E]
| [#x0060-#x007A]
| [#x007F-#x00A5]
| [#x00A7-#xD7FF]
| [#xE000-#xFFFD]
| [#x10000-#x10FFFF]
]]></artwork>
</figure>
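<t>The Character rule above can be checked mechanically. The following Python sketch copies the code point ranges from that rule; the names CHARACTER_RANGES, is_word_character and is_word are assumptions made for illustration and do not appear in the standard.</t>
<figure>
<artwork><![CDATA[
```python
# Code point ranges copied from the Character rule above.
CHARACTER_RANGES = (
    (0x0021, 0x0022), (0x0024, 0x002D), (0x0030, 0x003A),
    (0x003C, 0x003C), (0x003E, 0x005A), (0x005C, 0x005C),
    (0x005E, 0x005E), (0x0060, 0x007A), (0x007F, 0x00A5),
    (0x00A7, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
)


def is_word_character(character):
    """Return True when the character falls in one of the ranges."""
    code_point = ord(character)
    return any(low <= code_point <= high
               for low, high in CHARACTER_RANGES)


def is_word(text):
    """A word is a concatenation of at least two such characters."""
    return len(text) >= 2 and all(is_word_character(c) for c in text)
```
]]></artwork>
</figure>
<t>Under these ranges, a HYPHEN-MINUS is accepted as a normal character, whereas the SEMICOLON, the TILDE and the NUMBER SIGN are rejected.</t>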
<t>Instead of providing a hyphenation definition it is possible to repeat the word after the delimiter without providing any hyphenation information. The grammar rule for definition will allow this. A hyphenation definition repeating the word means that this word SHALL NOT be hyphenated at all. A hyphenation definition MAY be given, but when none is provided for a certain word, then hyphenation for that word is undefined. Some very short examples in the format as it is so far described are:</t>
<figure>
<artwork><![CDATA[# too short English words not allowed to be hyphenated
#a;a
#at;at
#are;are # too short for hyphenation according to the language
# English words not to be hyphenated
door;door
eight;eight
# German words not to be hyphenated
amorph;amorph
schnarchst;schnarchst
# Dutch words not to be hyphenated
schrijft;schrijft
V-snaar;V-snaar # note that '-' is considered a normal character
# acronyms not to be hyphenated
UNESCO;UNESCO
unicef;unicef
# hyphenation is undefined when no hyphenation definition is given
#impeachment;impeachment
]]></artwork>
</figure>
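<t>A minimal Python sketch of reading such definitions is given below. It is not a reference implementation: the function name parse_definitions is hypothetical, comments are simply cut off at the first NUMBER SIGN, and only the rules stated so far are checked, namely the delimiter, the minimum word length and the uniqueness of each word.</t>
<figure>
<artwork><![CDATA[
```python
def parse_definitions(lines):
    """Return a dict that maps each word to its definition."""
    definitions = {}
    for number, line in enumerate(lines, start=1):
        # Drop the comment, which starts at the first NUMBER SIGN,
        # then drop surrounding white space and the end of line.
        content = line.split('#', 1)[0].strip(' \t\r\n')
        if not content:
            continue  # empty line or comment-only line
        word, delimiter, definition = content.partition(';')
        if not delimiter or not definition:
            raise ValueError(
                'line %d: expected word;definition' % number)
        if len(word) < 2:
            raise ValueError(
                'line %d: word shorter than two characters' % number)
        if word in definitions:
            raise ValueError(
                'line %d: duplicate word %r' % (number, word))
        definitions[word] = definition
    return definitions
```
]]></artwork>
</figure>
<t>Using the word as the dictionary key mirrors its role as the unique key for looking up a hyphenation definition.</t>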
</section>
<section anchor="word" title="Hyphenation definition for a word">
<t>A hyphenation definition in its simplest form MUST contain two or more clusters of characters that are separated by a hyphenation point. Combined with the previous description of preventing hyphenation by repeating the word, the EBNF grammar rule for definition is:</t>
<figure>
<artwork><![CDATA[Definition
::= Cluster ( Hyphen Cluster )*
]]></artwork>
</figure>
<t>A character cluster here MUST consist of at least one character. This basic form is already supported by the current hyphenation algorithm and is key to the concept of hyphenation. More intricate schemes of clusters and hyphenations will be discussed later on, but are already referred to in the following EBNF bridging from cluster to character clusters:</t>
<figure>
<artwork><![CDATA[Cluster
::= ( CharacterCluster
| SubstitutionCluster
| HomographCluster )+
CharacterCluster
::= Character+
]]></artwork>
</figure>
<t>The concatenation of different clusters only applies in combination with a substitution cluster or a homograph cluster, as will be demonstrated later on. This is because consecutive character clusters have the same syntax as a single character cluster. These are merely more characters added in the same way and therefore MUST NOT be regarded as separate character clusters.</t>
<t>The final construct required to allow for simple hyphenation definitions is a reserved character to separate the clusters of characters, which are also known as morphemes. Here one or more TILDE characters '~' or U+007E MUST be used as a morpheme hyphen. The rules below also allow for more intricate hyphenation, which is discussed later; the morpheme hyphen is:</t>
<figure>
<artwork><![CDATA[Hyphen
::= MorphemeHyphen
| SuffixHyphen
| PrefixHyphen
| CompoundHyphen
| CompoundSuffixHyphen
| CompoundPrefixHyphen
| UnfavourableHyphen
MorphemeHyphen
::= ( '~' | #x007E )+
]]></artwork>
</figure>
<t>Some simple examples of hyphenation definitions for words are:</t>
<figure>
<artwork><![CDATA[# English word with hyphenation definition
revolve;re~volve # "volve" may not be hyphenated
editor;ed~i~tor # character cluster of single character
# German words with hyphenation definition
Aale;Aa~le # possible hyphenation is "Aa-" "le"
kühle;küh~le # possible hyphenation is "küh-" "le"
# Dutch words with hyphenation definition
alle;al~le # possible hyphenation is "al-" "le"
gezellig;ge~zel~lig # "ge-" "zellig" or "gezel-" "lig"
# Polish word with uncommon hyphenation definition
kung-fu;kung~-fu # possible is "kung-" "-fu"
# Modern Greek
# note hyphenation directly after one character
#άτακτος;
#ά~τα~κτος
]]></artwork>
</figure>
<t>Up to this point the functionality of the previous format for hyphenation patterns as used by patgen2 is similar. Everything described in this format from this point onward is newly proposed functionality.</t>
<t>A hyphenation point SHALL be defined by one or more tildes. A hyphenation point of higher priority MUST have at least one additional tilde compared to lower priority hyphenation points. Some examples to illustrate prioritised hyphenation definitions in words are:</t>
<figure>
<artwork><![CDATA[# English words with prioritised hyphenation
ergonomic;er~go~~no~mic # because of (er + go) + (no + mic)
thesauruses;the~sau~~rus~es
# French words with prioritised hyphenation
portemonnaie;por~te~~mon~naie # because of (por + te) + (mon naie)
atmosphère;at~mo~~sphè~re # because of (at + mo) + (sphè + re)
]]></artwork>
</figure>
<t>The structure of the words is broken down in the comments with the use of brackets '(' and ')' and plus sign '+'. This is a form of syllabification that reflects semantic information. It is not a part of the format but is only used to explain the examples of the format.</t>
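<t>To make the notion of priority concrete, the following Python sketch extracts the hyphenation points and their priorities from a definition that uses only tildes. The function name tilde_points and the representation of a point as an (offset, priority) pair are assumptions of this sketch; the standard itself does not prescribe any data structure.</t>
<figure>
<artwork><![CDATA[
```python
import re


def tilde_points(definition):
    """Return (word, points) for a definition using only tildes.

    Each point is (offset, priority): the offset counts the characters
    of the word before the hyphenation point, and the priority is the
    number of consecutive tildes found there.
    """
    word = ''
    points = []
    # The capturing group keeps the runs of tildes in the result.
    for token in re.split(r'(~+)', definition):
        if token.startswith('~'):
            points.append((len(word), len(token)))
        else:
            word += token
    return word, points
```
]]></artwork>
</figure>
<t>For the definition "er~go~~no~mic" this sketch yields the word "ergonomic" with points at offsets 2, 4 and 6, of which the point at offset 4 has priority 2 and the others priority 1.</t>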
</section>
<section anchor="word_prefix" title="Hyphenation definition for a word prefix">
<t>Many languages allow usage of a prefix to alter the meaning of a word. Here a VERTICAL LINE U+007C or '|' MAY be used to indicate a hyphenation point for a prefix. This enables reuse of the hyphenation definition of the word. Hyphenation directly after a prefix has a slightly higher priority than a normal hyphenation point. Prefixes are semantically built from right to left for a left-to-right script. Therefore, priority amongst prefixes runs from left to right for a left-to-right script. Syntax for defining hyphenation of a prefix should comply with the following EBNF:</t>
<figure>
<artwork><![CDATA[PrefixHyphen
::= '|' | #x007C
]]></artwork>
</figure>
<t>Some examples of hyphenation definitions including a prefix are:</t>
<figure>
<artwork><![CDATA[# English words with prefix
# dis < ap + pear
disappear;dis|ap~pear
# su + pra < or + bit + al
supraorbital;su~pra|or~bit~al
# German words with prefix
# ent < deckt [discovered]
entdeckt;ent|deckt
# Re < kon < struk + ti + on [reconstruction]
Rekonstruktion;Re|kon|struk~ti~on
# Dutch words with prefix
# ge < wil + lig [willing]
gewillig;ge|wil~lig
# her < be < re + ke + nen [to recalculate]
herbereken;her|be|re~ke~nen
]]></artwork>
</figure>
<t>In the comments, the prefixes are indicated with a less-than sign, which precedes evaluation of the plus sign. Sometimes the comments on examples provide the meaning of the word in between square brackets, '[' and ']'. These help in understanding the examples from languages other than English but are not part of this standard.</t>
</section>
<section anchor="word_suffix" title="Hyphenation definition for a word suffix">
<t>A suffix can be identified in a similar way as is done for <xref target="word_prefix">prefixes</xref>. Instead of a vertical line a BROKEN BAR U+00A6 or '¦' MAY be used for suffixes. In EBNF this is:</t>
<figure>
<artwork><![CDATA[SuffixHyphen
::= '¦' | #x00A6
]]></artwork>
</figure>
<t>Some examples are:</t>
<figure>
<artwork><![CDATA[# English words with suffix
# broth + er > hood
brotherhood;broth~er¦hood
# re + morse > less > ness
remorselessness;re~morse¦less¦ness
# German word with suffix
# wahr + schein > lich [probably]
wahrscheinlich;wahr=schein¦lich
# Un < sich + er > heit [uncertainty]
Unsicherheit;Un|si~cher¦heit
# Dutch words with suffix
# een > zaam > heid [loneliness]
eenzaamheid;een¦zaam¦heid
# beest > ach~tig [beastly]
beestachtig;beest¦ach~tig
]]></artwork>
</figure>
<t>The comments use a greater-than sign to explain the structure where suffixes build from left to right, gaining priority in this way for a left-to-right script. A hyphenation point for a suffix has priority over hyphenation on a prefix.</t>
</section>
</section>
<section anchor="extended" title="Extended format">
<section anchor="compound" title="Hyphenation definition for a compound">
<t>Many languages can concatenate words to form long compounds. Some real-life examples from Western languages are:</t>
<figure>
<artwork><![CDATA[# long compound without spaces in German
#Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
# long compound without spaces in Dutch
#aansprakelijkheidswaardevaststellingsveranderingen
# long compound without spaces in Hungarian
#megszentségteleníthetetlenségeskedéseitekért
# long compound without spaces in English
#pneumonoultramicroscopicsilicovolcanoconiosis
]]></artwork>
</figure>
<t>These are extreme, but it is also possible in, for example, English to concatenate words, forming long compounds. This is less common, as spaces are usually found in English compounds, hence for those cases hyphenation is less problematic.</t>
<t>Hyphenation definitions of compounds should be made with a different reserved character. The EQUALS SIGN U+003D or '=' MUST be used to indicate hyphenation at compound level. This prevents long series of tildes in complex compounds, allowing automated generation, suggestion or validation of hyphenation patterns for compounds. In EBNF, this is:</t>
<figure>
<artwork><![CDATA[CompoundHyphen
::= ( '=' | #x003D )+
]]></artwork>
</figure>
<t>Examples of hyphenation definitions for compounds are:</t>
<figure>
<artwork><![CDATA[# English compounds
# small + talk
smalltalk;small=talk
# (bit + ter) + sweet
bittersweet;bit~ter=sweet
# German compounds
# Grenz + schutz + amt [border patrol office]
Grenzschutzamt;Grenz=schutz=amt
# Herz + still + stand [cardiac arrest]
Herzstillstand;Herz=still=stand
# Dutch compounds
# boek + (om + slag) [book cover]
boekomslag;boek=om~slag
# trein + (wa + gon) [train carriage]
treinwagon;trein=wa~gon
]]></artwork>
</figure>
<t>A hyphenation point for a compound SHALL be defined by one or more equals signs. A hyphenation point of higher priority MUST have at least one additional equals sign compared to lower priority hyphenation points for compounds. This is similar to hyphenation point priorities in definitions for <xref target="word">words</xref>. Some examples to illustrate prioritised hyphenation definitions in compounds are:</t>
<figure>
<artwork><![CDATA[# German
# Erb + (lehn + gut) [lit: inherited loan property]
Erblehngut;Erb==lehn=gut
# Fach + (werk + statt) [crafts workshop]
Fachwerkstatt;Fach==werk=statt
# Berg + ((fünf + (fin + ger)) + kraut)
# [lit: mountain five-finger herb]
Bergfünffingerkraut;Berg===fünf=fin~ger==kraut
# (See + (schiff + fahrt)) + (stra + ße)
# [sea traffic shipping lane]
Seeschifffahrtstraße;See==schiff=fahrt===stra-ße
# Dutch
# ((goe + de + ren) + trein) + (wa + gon)
# [cargo train carriage]
goederentreinwagon;goe~de~ren=trein==wa~gon
]]></artwork>
</figure>
<t>A hyphenation point for a compound MUST be treated with higher priority than that of a suffix.</t>
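<t>The priority rule for compounds can be illustrated with a short Python sketch that lists every compound hyphenation point with its number of equals signs and picks the highest one. The names MARKERS, compound_breaks and best_break are hypothetical. The sketch assumes a pattern without clusters and treats every character other than '~', '=', '|', '¦' and '.' as part of the word.</t>
<figure>
<artwork><![CDATA[
```python
import re

# Hyphenation characters that do not count as letters of the word.
MARKERS = "~=|¦."


def compound_breaks(pattern):
    """Return (offset in plain word, number of '=') per compound point."""
    breaks = []
    offset = 0
    for token in re.split(r"(=+)", pattern):
        if token.startswith("="):
            breaks.append((offset, len(token)))
        else:
            offset += sum(1 for ch in token if ch not in MARKERS)
    return breaks


def best_break(pattern):
    """Return the compound point with the most '=', or None."""
    return max(compound_breaks(pattern), key=lambda b: b[1], default=None)
```
]]></artwork>
</figure>
<t>For the pattern goe~de~ren=trein==wa~gon, best_break returns offset 13 with two equals signs, which corresponds to the break between goederentrein and wagon.</t>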
</section>
<section anchor="compound_prefix" title="Hyphenation definition for a compound prefix">
<t>Compounds can also have a prefix. These are defined in a similar way to a <xref target="word_prefix">prefix of a word</xref>. A combination of a VERTICAL LINE U+007C or '|' followed directly by an EQUALS SIGN U+003D or '=' MAY be used to indicate a prefix of a compound. In EBNF this is:</t>
<figure>
<artwork><![CDATA[CompoundPrefixHyphen
::= ( '|' | #x007C ) ( '=' | #x003D )+
]]></artwork>
</figure>
<t>Examples are:</t>
<figure>
<artwork><![CDATA[# German compounds with prefix
# un < wahr + (schein + lich) [unlikely]
unwahrscheinlich;un|=wahr=schein~lich
# Ur < groß + (el + tern) [great-grandparents]
Urgroßeltern;Ur|=groß=el~tern
# Dutch compound with prefix
# on < waar + (schijn + lijk) [unlikely]
onwaarschijnlijk;on|=waar=schijn~lijk
]]></artwork>
</figure>
<t>Here the number of equals signs matches the number of equals signs of the compound hyphenation that this prefix is related to. Compound prefixes are extended from right to left and prioritised from left to right for a left-to-right script.</t>
</section>
<section anchor="compound_suffix" title="Hyphenation definition for a compound suffix">
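<t>Compounds can also have a suffix. These are defined in a similar way to a <xref target="word_suffix">suffix of a word</xref>: one or more EQUALS SIGN U+003D or '=' followed directly by a BROKEN BAR U+00A6 or '¦' indicate a suffix of a compound. In EBNF this is:</t>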
<figure>
<artwork><![CDATA[CompoundSuffixHyphen
::= ( '=' | #x003D )+ ( '¦' | #x00A6 )
]]></artwork>
</figure>
<t>Examples are rare, but some are given below:</t>
<figure>
<artwork><![CDATA[# German compounds with suffix
# (an + dert) + halb > fach
anderthalbfach;an~dert=halb=¦fach
# zier + rat > lo + se
zierratlose;zier=rat=¦lo~se
# (zu < sam + men) + hang > los
zusammenhanglos;zu|sam~men=hang=¦los
# Dutch compounds with suffix
# (li + te + ra + tuur) + (we + ten > schap) > je
# [lit: diminutive of literature science]
# it is lexicologically the diminutive of science
# but semantically diminutive of the entire compound
literatuurwetenschapje;li~te~ra~tuur=we~ten¦schap=¦je
# (on < (sa + men) + (han + gend)) > heid
# [incoherentness]
onsamenhangendheid;on|sa~men=han~gend=¦heid
]]></artwork>
</figure>
<t>The number of equals signs is the same as the number of equals signs of the compound hyphenation this suffix is related to. This is similar to <xref target="word_suffix">word suffix</xref> and <xref target="compound_prefix">compound prefix</xref>. Compound suffixes are extended from left to right and are prioritised from right to left for a left-to-right script, although nested compound suffixes will be extremely rare.</t>
</section>
<section anchor="compound_interfix" title="Hyphenation definition for a compound interfix">
<t>With the format for hyphenation definitions described up to this point, it is possible to define hyphenation definitions for compounds, even if they have an interfix. Interfixes are common in some languages as a linking element in compounds. They usually do not have a semantic function but rather one of aiding pronunciation. Hyphenation has no special requirements to indicate interfixes. However, it is useful to annotate interfixes, enabling identification of the separate words from which the compound has been formed. In this way the hyphenation definition of the compound can be automatically generated, suggested or validated. In addition, this information could be used for decomposition to validate and extend spell checking.</t>
<t>There are no grammar rules for this at the moment, because this part of the format is still under discussion. The characters used in the following example are the LESS-THAN SIGN U+003C and GREATER-THAN SIGN U+003E, which could become reserved characters in the future. Interfix annotations can simply be filtered out before hyphenation patterns are used as input to the hyphenation algorithm.</t>
<figure>
<artwork><![CDATA[# German interfix
# (Arbeit + s) + zimmer [working room]
Arbeitszimmer;Ar~beits=zim~mer # could be Ar~beit<s>=zim~mer
# Dutch interfix
# (kip + (p + en)) + soep [chicken soup]
kippensoep;kip~.pen=soep
# could be ;kip<~.pen>=soep
# ((be + roep) + s) + ethiek [professional ethics]
beroepsethiek;be~roeps=ethiek
# could be ;be~roep<s>=ethiek
# (Koningin + (n + e)) + dag [Queen's Day]
Koninginnedag;Ko~nin~gin~ne=dag
# could be ;Ko~nin~gin<~ne>=dag
# Croatian interfix
# (brod + o) + gradilište [shipyard]
brodogradilište;brodo=gradilište
# could be ;brod<o>=gradilište
]]></artwork>
</figure>
<t>Note that this should not be used if the word preceding the interfix has changed spelling because of its usage in the compound with an interfix.</t>
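<t>Filtering out interfix annotations could look like the following Python sketch. It assumes the tentative LESS-THAN SIGN and GREATER-THAN SIGN notation used in the examples of this section, which is not yet part of the grammar. The function names strip_interfix_annotations and interfixes are hypothetical.</t>
<figure>
<artwork><![CDATA[
```python
import re


def strip_interfix_annotations(pattern):
    """Remove the tentative '<' and '>' interfix annotation characters."""
    return pattern.replace("<", "").replace(">", "")


def interfixes(pattern):
    """Return annotated interfixes without their hyphenation characters."""
    found = re.findall(r"<([^<>]*)>", pattern)
    return [re.sub(r"[~=|¦.]", "", item) for item in found]
```
]]></artwork>
</figure>
<t>Applied to Ar~beit&lt;s&gt;=zim~mer, the first function returns Ar~beits=zim~mer and the second returns the single interfix s.</t>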
</section>
<section anchor="unfavourable" title="Unfavourable hyphenation">
<t>Sometimes hyphenations can be misleading or distorting and are therefore unfavourable. This MUST be indicated by a FULL STOP U+002E or '.'. More than one full stop MAY be used to indicate hyphenation points which are extremely unfavourable. An unfavourable hyphenation point MAY be preceded by a hyphenation character to indicate the type of hyphenation point. In EBNF this can be written as:</t>
<figure>
<artwork><![CDATA[UnfavourableHyphen
::= ( ( '~' | #x007E )
| ( '|' | #x007C )
| ( '¦' | #x00A6 )
| ( '=' | #x003D ) )?
( '.' | #x002E )+
]]></artwork>
</figure>
<t>Some examples of unfavourable hyphenation are:</t>
<figure>
<artwork><![CDATA[# unfavourable hyphenation in German
# dem + (ent + (spre + chend)) [accordingly]
dementsprechend;dem=ent|.spre-chend
# re + (in + (stal + liert)) [reinstalled]
reinstalliert;re|in|.stal-liert
# Sprech + (er + (zie + hung)) [elocution]
Sprecherziehung;Sprech=er|.zie-hung
# (Wind + (en + er + gie) + (an + (la + ge)))
# [wind-energy plant]
Windenergieanlage;Wind=en.er-gie==an|la-ge
# Ost + (en + de)
# [toponym of place in Belgium]
Ostende;Ost=en-.de
# unfavourable hyphenation in Dutch
# (deur + waar + ders) + (ex + ploit) [lit: bailiff abuse]
deurwaardersexploit;deur~waar~ders=ex~..ploit
# (Koningin + (n + e)) + dag [Queen's Day]
Koninginnedag;Ko~nin~gin~.ne=dag
# could be ;Ko~nin~gin<~.ne>=dag
]]></artwork>
</figure>
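<t>A minimal Python sketch of recognising unfavourable hyphenation points is given below. It follows the UnfavourableHyphen rule: an optional hyphenation character followed by one or more full stops. The names UNFAVOURABLE, unfavourable_points and drop_unfavourable are hypothetical, and the sketch assumes that a full stop never occurs as an ordinary character in the pattern.</t>
<figure>
<artwork><![CDATA[
```python
import re

# Optional hyphenation character followed by one or more full stops.
UNFAVOURABLE = re.compile(r"([~|¦=]?)(\.+)")


def unfavourable_points(pattern):
    """Return (hyphenation character, number of full stops) pairs."""
    return [(kind, len(dots)) for kind, dots in UNFAVOURABLE.findall(pattern)]


def drop_unfavourable(pattern):
    """Remove every unfavourable hyphenation point from a pattern."""
    return UNFAVOURABLE.sub("", pattern)
```
]]></artwork>
</figure>
<t>For deur~waar~ders=ex~..ploit this reports one point of type '~' with two full stops. Dropping it leaves deur~waar~ders=exploit, which may suit a consumer that never wants to hyphenate at unfavourable points.</t>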
</section>
</section>
<section anchor="dynamic-hyphenation" title="Dynamic hyphenation">
<section anchor="altered_spelling" title="Hyphenation with altered spelling">
<t>Hyphenation can result in a changed spelling of the word. How this affects a word depends on the language, as will be seen later on. A hyphenation definition of this type MUST contain both an unhyphenated and a hyphenated spelling for such a word. This is called a substitution cluster. It MUST contain only the particular hyphenation point and the adjacent character clusters that are altered.</t>
<t>A substitution cluster MUST be provided between curly brackets LEFT CURLY BRACKET U+007B or '{' and RIGHT CURLY BRACKET U+007D or '}' with SOLIDUS U+002F or '/' as a separator. Left of the separator MUST be the unhyphenated spelling and on the right MUST be the hyphenated spelling. Examples later on will clarify this in detail. The exact rule in EBNF for this is:</t>
<figure>
<artwork><![CDATA[SubstitutionCluster
::= '{' CharacterCluster '/'
( CharacterCluster ( Hyphen CharacterCluster? )?
| Hyphen CharacterCluster? )
'}'
]]></artwork>
</figure>
<t>Some languages have transforming digraphs when hyphenating. In German the 'c' and 'k' are orthographic allographs for /k/. The digraph 'ck' can result in 'k-k' when hyphenation is in the middle of that digraph. Examples of transforming digraphs with orthographic allographs are:</t>
<figure>
<artwork><![CDATA[# German with altered spelling digraph
# "Zucker" or
# "Zuk-" "ker" [sugar]
Zucker;Zu{ck/k~k}er
]]></artwork>
</figure>
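<t>How a substitution cluster resolves to its two spellings can be sketched in Python as follows. The names CLUSTER, unbroken and broken are hypothetical. The sketch assumes well-formed clusters that are not nested inside a homograph cluster, and it resolves all clusters in a pattern at once, whereas a real hyphenator would break a word at only one point per line.</t>
<figure>
<artwork><![CDATA[
```python
import re

# '{' unhyphenated spelling '/' hyphenated spelling '}'
CLUSTER = re.compile(r"\{([^{}/]*)/([^{}]*)\}")


def unbroken(pattern):
    """Keep the unhyphenated spelling of every substitution cluster."""
    return CLUSTER.sub(lambda match: match.group(1), pattern)


def broken(pattern):
    """Keep the hyphenated spelling of every substitution cluster."""
    return CLUSTER.sub(lambda match: match.group(2), pattern)
```
]]></artwork>
</figure>
<t>For Zu{ck/k~k}er, unbroken returns Zucker and broken returns Zuk~ker, in which the remaining tilde marks where the word is split into "Zuk-" and "ker".</t>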
<t>In German it is also possible to have doubling of consonants in digraphs when hyphenating. The digraph 'll' can initially be a shorter spelling of the trigraph 'lll', which itself is a concatenation of the digraph 'll' and a glyph 'l'. When hyphenation is in the first mentioned digraph, the previously eliminated 'l' should be restored. Examples of restoring eliminated consonants from trigraphs are:</t>
<figure>
<artwork><![CDATA[# German with doubled consonant spelling
# "Ab-" "fallager" or
# "Abfall-" "lager" or
# "Abfalla-" "ger" [waste storage]
Abfallager;Ab~fa{ll/ll~l}a~ger
# "Stoffül-" "le" or
# "Stoff-" "fülle" [wealth of material]
Stoffülle;Sto{ff/ff=f}ül~le
# "Vollast" or
# "Voll-" "last" [maximum load, lit: full load]
Vollast;Vo{ll/ll=l}ast
# Norwegian with doubled consonant spelling
# "trykknapp" or "trykk-" "knapp" [snap fastener]
trykknapp;try{kk/kk=k}napp
# equivalent notation, less verbose but more searchable
#trykknapp;tryk{k/k=k}napp
]]></artwork>
</figure>
<t>Some languages have vowel doubling. This occurs when stress is on an open syllable and a suffix is added after that syllable. This happens for example in Dutch for some diminutive forms. When these diminutives are hyphenated on that syllable, the vowel at the end of the open syllable no longer needs to be duplicated, since the stress will ensure proper pronunciation. Examples of stressed open syllables with doubled vowels are:</t>
<figure>
<artwork><![CDATA[# Dutch vowel doubling in diminutive
# "omaatje" or
# "oma-" "tje" [granny] [diminutive of grandmother]
omaatje;oma{a/-}tje
# equivalent notation, more verbose but less searchable
#omaatje;om{aa/a-}tje
]]></artwork>
</figure>
<t>In Dutch, a diaeresis can be used on vowels to prevent the so-called vowel collision. However, when hyphenating before the vowel that received a diaeresis, that diaeresis will be eliminated in the hyphenated spelling. Examples of hyphenation definitions for eliminated diaeresis are:</t>
<figure>
<artwork><![CDATA[# Dutch eliminated diaeresis
# "geëerd" or
# "ge-" "eerd" [honoured] [past participle]
geëerd;ge{ë/-e}erd
]]></artwork>
</figure>
<t>As stated before, a hyphen can be a valid character in a normal word. Hence, the hyphen character is not a reserved character in this context. When hyphenating on a hyphen that is already part of a word, a new hyphen MUST NOT be inserted in the hyphenated text. A rare counterexample was given in hyphenation of a <xref target="word">word</xref>. Below are more common examples in which a hyphen is not allowed to be duplicated:</t>
<figure>
<artwork><![CDATA[# Dutch compounds with hyphen as character
# ex- < vriend [former boyfriend]
ex-vriend;ex{-/|}vriend
# (Dow- + Jones) + index [Dow Jones Index]
Dow-Jonesindex;Dow{-/~}Jones=index
# ((dé + jà)- + vu) + gevoel [déjà vu feeling]
déjà-vugevoel;dé~jà{-/~~}vu=ge~voel
# (gilles- + de- + la- + (tou + rette)) + (syn + droom)
# [Tourette syndrome]
#gilles-de-la-tourettesyndroom;
#gilles{-/~~}de{-/~}la{-/~~}tou~rette=syn~droom
# (ad + junct)- + ((al + ge + meen) + (di + rec + teur))
# [vice managing director]
adjunct-algemeendirecteur;ad~junct{-/==}al~ge~meen=di~rec~teur
# English compound with hyphen as character
# (ac + tor)- + (di + rec + tor)
actor-director;ac~tor{-/=}di~rec~tor
]]></artwork>
</figure>
</section>
<section anchor="homograph" title="Hyphenation of homographs">
<t>A word with multiple meanings but with the same spelling is called a homograph. Some homographs can differ in syllabification and pronunciation even though they are spelled with exactly the same characters. Examples in English are desert (to abandon, or a barren area of land) and dove (a pigeon, or the past tense of to dive). A difference in pronunciation can result in different hyphenation points for each meaning of the homograph, which is more probable in German or Dutch than in, for example, English.</t>
<t>When this is the case, the following homograph cluster MUST be used for the hyphenation definition. Here a LEFT SQUARE BRACKET U+005B or '[' and a RIGHT SQUARE BRACKET U+005D or ']' MUST be used to group alternatives inside a hyphenation definition. These MUST be separated by a SOLIDUS U+002F or '/'. In the following rules in EBNF only two alternatives are allowed. The order of the alternatives is not important. However, the grammar introduces a small difference for the left and right side of the separator. One side, and only one side, of the separator may be empty to accommodate certain definitions. Therefore, at least one side of the separator MUST hold a definition. This is in EBNF:</t>
<figure>
<artwork><![CDATA[Series
::= ( CharacterCluster (Hyphen CharacterCluster)* Hyphen? )
| ( Hyphen (CharacterCluster Hyphen)* CharacterCluster? )
HomographCluster
::= '[' ( Series | (SubstitutionCluster Series? ) ) '/'
SubstitutionCluster? Series? ']'
]]></artwork>
</figure>
<t>The use of a nested substitution cluster will be described <xref target="nested">later on</xref>. Rare but valid examples with alternative hyphenation behaviour for homographs are:</t>
<figure>
<artwork><![CDATA[# English homographs
# rec + ord [vinyl medium]
# re + cord [first-person present of verb to record]
record;re[~c/c~]ord
# wa + les [plural of whale] or
# Wales [toponym of part of UK]
wales;wa[~/]les
# German homographs
# Mas + ke or Maske
Maske;Mas[~/]ke
# Wach + (stu + be) [guardroom] or
# Wachs + (tu + be) [wax tube]
Wachstube;Wach[=s/s=]tu-be
# (Bahn + hof) + (strasse) [lit: station street] or
# (Bahn + hof) + s + (trasse) [lit: station's route]
Bahnhofstrasse;Bahn=hof[==stra-ss/s==tras-s]e
# Dutch homographs
# bal + le + tje [diminutive of ball] or
# bal + let + je [diminutive of ballet]
balletje;bal~le[~t/t~]je
# valk + uil [ninox, lit: falcon owl] or
# val + kuil [trapping pit, lit: trap pit]
valkuil;val[k=/=k]uil
]]></artwork>
</figure>
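<t>Expanding a homograph cluster into its two alternative patterns can be sketched in Python as shown below. The function name homograph_alternatives is hypothetical. The sketch assumes exactly one homograph cluster per pattern and no nested substitution cluster, so the first solidus inside the square brackets is taken as the separator.</t>
<figure>
<artwork><![CDATA[
```python
def homograph_alternatives(pattern):
    """Return the two patterns encoded by a single homograph cluster."""
    start = pattern.index("[")
    end = pattern.index("]", start)
    left, _, right = pattern[start + 1:end].partition("/")
    head, tail = pattern[:start], pattern[end + 1:]
    return head + left + tail, head + right + tail
```
]]></artwork>
</figure>
<t>For val[k=/=k]uil this yields valk=uil and val=kuil. Choosing between the two still requires knowledge of the intended meaning.</t>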
<t>Note that there is no preferred order of mirrored homograph clusters, but a fixed order could prove practical for automated processing such as validation.</t>
<t>Automated hyphenation of homographs poses an interesting challenge. How can the hyphenator recognise which hyphenation pattern to use? This is out of scope for this standard but important to discuss. All other forms of hyphenation can be handled directly by a hyphenation algorithm, but here extra information is needed. This could be extracted from the context, but can prove difficult if no context is available or the context is ambiguous. On the other hand, the author of a text could provide the needed information. This could be stored in soft hyphens, for example. The hyphenator could assist the author here by playing an interactive role. Similarly to spell checking, the author could be asked which meaning of a homograph is intended by having the author choose between expanded hyphenation patterns.</t>
<t>Something that has not been discussed up to this point, but is illustrated in the previous example with wales and Wales, is case sensitivity of hyphenation patterns. Hyphenation definitions MUST be specified as case-sensitively as possible. <!--TODO homograph!!-->In case capitalised, upper case and/or lower case spellings are merged, a lower case notation is RECOMMENDED to be used, followed by capitalised and finally upper case. The reason for this is that casting to upper case or capitalised spelling can result in information reduction, and casting back to lower case cannot restore the eliminated information. Examples:</t>
<figure>
<artwork><![CDATA[# German irreversible up and down casting
# Maße upcast -> MASSE
# MASSE ambiguous downcast -> Maße or Masse
# LATIN CAPITAL LETTER SHARP S U+1E9E is rarely used
# Dutch irreversible up and down casting
# officiëren upcast -> OFFICIEREN
# OFFICIEREN ambiguous downcast -> officiëren or officieren
# gêne upcast -> GENE
# GENE ambiguous downcast -> gêne or gene
# Dutch does not use diacritical marks in all upper case words
]]></artwork>
</figure>
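<t>The German case can be reproduced with a small Python sketch, given here only as an illustration. The function name survives_case_round_trip is hypothetical. Note that Python's built-in case conversion keeps diacritical marks when casting to upper case, so the Dutch examples above are not reproduced by this sketch.</t>
<figure>
<artwork><![CDATA[
```python
def survives_case_round_trip(word):
    """True if casting to upper case and back loses no information."""
    return word.upper().lower() == word.lower()
```
]]></artwork>
</figure>
<t>Here survives_case_round_trip("Maße") is false, because the upper case form MASSE is cast back to masse, whereas survives_case_round_trip("Masse") is true.</t>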
</section>
<section anchor="nested" title="Nested hyphenation">
<t>Nesting of a substitution cluster inside a homograph cluster MAY be done. This is already defined in the grammar for <xref target="homograph">homograph hyphenation</xref>. Here the priority is on the enclosing homograph cluster. Deeper or other ways of nesting clusters are not allowed. This is very rare, but some examples for German are:</t>
<figure>
<artwork><![CDATA[# German de-1901 nested hyphenation definitions
Bettücher;Be[t=tü~/{tt/tt=t}ü.]cher
Druckerzeugnis;Dru[{ck/k~k}er~/ck=er.]zeug~nis
Fussballehren;Fuss=ba[ll=/{ll/ll=l}]eh~ren
griffest;gri[f~f/{ff/ff=f}]est
Irreligion;I[{rr/rr=r}/r|r]e.li~gi-on
Staubecken;Stau[~b/b~]e{ck/k~k}en
]]></artwork>
</figure>
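<t>Because a nested substitution cluster contains its own solidus, the separator of the enclosing homograph cluster has to be located outside any curly brackets. A minimal Python sketch of this, with the hypothetical function name split_nested and assuming exactly one well-formed homograph cluster per pattern, is:</t>
<figure>
<artwork><![CDATA[
```python
def split_nested(pattern):
    """Split a homograph cluster at its top-level '/' separator."""
    start = pattern.index("[")
    end = pattern.index("]", start)
    body = pattern[start + 1:end]
    head, tail = pattern[:start], pattern[end + 1:]
    depth = 0
    for i, ch in enumerate(body):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif ch == "/" and depth == 0:
            # Substitution clusters are kept intact in both results.
            return head + body[:i] + tail, head + body[i + 1:] + tail
    raise ValueError("no top-level separator in homograph cluster")
```
]]></artwork>
</figure>
<t>For gri[f~f/{ff/ff=f}]est this returns grif~fest and gri{ff/ff=f}est. Any substitution cluster left in a result can then be resolved separately.</t>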
</section>
</section>
<section anchor="priority" title="Hyphenation priority">
<figure>
<preamble>The following hyphenation priority is defined:</preamble>
<artwork><![CDATA[01 [] hyphenation of homograph,
definition depends on semantics
02 {} dynamic hyphenation,
change of spelling
03 =¦ hyphenation of compound's suffix,
multiple = have higher priority
04 |= hyphenation of compound's prefix,
multiple = have higher priority
05 = hyphenation of compound,
multiple = have higher priority
06 ¦ hyphenation of word's suffix,
priority order is from right to left
07 | hyphenation of word's prefix,
priority order is from left to right
08 ~ hyphenation of word,
multiple ~ have higher priority
09 =. unfavourable hyphenation of compound,
multiple . have lower priority
10 ¦. unfavourable hyphenation of word's suffix,
multiple . have lower priority
11 |. unfavourable hyphenation of word's prefix,
multiple . have lower priority
12 ~. unfavourable hyphenation of word,
multiple . have lower priority
13 . unfavourable hyphenation in general,
multiple . have lower priority
]]></artwork>
</figure>
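<t>As a rough illustration, rows 03 to 13 of this table can be expressed as a lookup in Python. The names PRIORITY and priority are hypothetical. The sketch only classifies a run of hyphenation characters by its table row. It covers neither homograph and substitution clusters (rows 01 and 02) nor the finer ordering by number of repeated characters within a row.</t>
<figure>
<artwork><![CDATA[
```python
import re

# Priority numbers 03 to 13 from the table; lower numbers rank higher.
PRIORITY = {
    "=¦": 3, "|=": 4, "=": 5, "¦": 6, "|": 7, "~": 8,
    "=.": 9, "¦.": 10, "|.": 11, "~.": 12, ".": 13,
}


def priority(marker):
    """Look up the table row for a run of hyphenation characters."""
    # Collapse repeats such as '==' or '..' to a single character.
    collapsed = re.sub(r"(.)\1+", r"\1", marker)
    return PRIORITY[collapsed]
```
]]></artwork>
</figure>
<t>For example, priority("==") is 5 and priority("~..") is 12, so sorting markers with this function as the key puts compound points before morpheme points and unfavourable points last.</t>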
</section>
<section anchor="reserved" title="Reserved characters">
<t>Reserved characters for this format are:</t>
<figure>
<artwork><![CDATA[/* Hyphenation Definitions 0.8
* https://raw.github.com/OpenTaal/hyphenation-definitions/master/
* grammar/grammar.ebnf
*
* Reserved characters
* tab U+0009 CHARACTER TABULATION '\t'
* line feed U+000A LINE FEED (LF) '\n'
* carriage return U+000D CARRIAGE RETURN (CR) '\r'
* space U+0020 SPACE ' '
* begin comment U+0023 NUMBER SIGN '#'
* unfavourable hyphen U+002E FULL STOP '.'
* cluster separator U+002F SOLIDUS '/'
* delimiter U+003B SEMICOLON ';'
* compound hyphen U+003D EQUALS SIGN '='
* begin homograph cluster U+005B LEFT SQUARE BRACKET '['
* end homograph cluster U+005D RIGHT SQUARE BRACKET ']'
* begin substitution cluster U+007B LEFT CURLY BRACKET '{'
* prefix hyphen U+007C VERTICAL LINE '|'
* end substitution cluster U+007D RIGHT CURLY BRACKET '}'
* morpheme hyphen U+007E TILDE '~'
* suffix hyphen U+00A6 BROKEN BAR '¦'
*/
]]></artwork>
</figure>
<t>Additionally, other characters may be used as placeholders inside of definitions where a hyphenation needs (re)work or reviewing. The following are recommended because these are rarely found in words and are visually quickly identified. The usage of these falls outside the definition of this format and should be filtered out before providing hyphenation patterns that comply with this standard. Examples are:</t>
<figure>
<artwork><![CDATA[# Examples of placeholders for reviewing purposes
#räche;rä·che # U+00B7 MIDDLE DOT '·'
#radio;ra*dio # U+002A ASTERISK '*'
#tafel;ta_fel # U+005F LOW LINE '_'
]]></artwork>
</figure>
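<t>Filtering out definitions that still contain placeholders could be sketched in Python as follows. The names PLACEHOLDERS, is_publishable and publishable are hypothetical, and the placeholder set is limited to the three example characters above. The sketch assumes that a NUMBER SIGN starts a comment that runs to the end of the line.</t>
<figure>
<artwork><![CDATA[
```python
PLACEHOLDERS = "·*_"  # only the three example characters shown above


def is_publishable(line):
    """True for a definition line without comment or placeholder."""
    definition = line.split("#", 1)[0].strip()
    if not definition:
        return False
    return not any(ch in PLACEHOLDERS for ch in definition)


def publishable(lines):
    """Keep only lines that are ready to be provided as patterns."""
    return [line for line in lines if is_publishable(line)]
```
]]></artwork>
</figure>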
<t>Note that the middle dot '·' can be part of an orthography such as Catalan or Franco-Provençal. Use it with care. See also the section on <xref target="compound_interfix">compound interfix</xref> for characters used to make interfix annotations.</t>
</section>
</middle>
<back>
<references>
<reference anchor="ISO14977" target="http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=26153">
<front>
<title>Information technology - Syntactic metalanguage - Extended BNF</title>
<author>
<organization abbrev="ISO/IEC">International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), JTC 1</organization>
<address>
<postal>
<!--street>ISO/IEC Copyright Office</street-->
<street>Case Postale 56</street>
<city>Geneve 20</city> <code>CH-1211</code>
<country>Switzerland</country>
</postal>
<uri>http://iso.org</uri>
</address>
</author>
<date month="December" year="1996" />
</front>
<seriesInfo name="ISO/IEC" value="14977:1996" />
</reference>
<reference anchor="Lia83" target="http://www.tug.org/docs/liang/">
<front>
<title>Word Hy-phen-a-tion by Com-put-er</title>
<author initials="F.M." surname="Liang" fullname="Franklin Mark Liang">
<organization>Stanford University, Department of Computer Science</organization>
<address>
<postal>
<street></street>
<city>Stanford</city> <region>CA</region> <code>94305</code>
<country>United States</country>
</postal>
<uri>http://www.stanford.edu</uri>
</address>
</author>
<date month="August" year="1983" />
</front>
</reference>
<reference anchor="Gel14" target="http://github.com/OpenTaal/hyphenation-definitions/">
<front>
<title>A standard for prioritised and dynamic hyphenation definitions</title>
<author initials="S." surname="van Geloven" fullname="Sander van Geloven">
<organization abbrev="OpenTaal">Stichting OpenTaal</organization>
<address>
<postal>
<street></street>
<city></city>
<country>Netherlands</country>
</postal>
<uri>http://www.opentaal.org</uri>
</address>
</author>
<date month="January" year="2014" />
</front>
</reference>
<reference anchor="W3C11" target="http://www.w3.org/TR/CSS2/">
<front>
<title>Cascading Style Sheets Level 2 Revision 1 (CSS 2.1) Specification</title>
<author>
<organization abbrev="W3C">World Wide Web Consortium</organization>
<address>
<postal>
<street>32 Vassar Street, Building 32-G514</street>
<city>Cambridge</city> <region>MA</region> <code>02139</code>
<country>United States</country>
</postal>
<uri>http://www.w3c.org</uri>
</address>
</author>
<date month="June" year="2011" />
</front>
</reference>
<reference anchor="W3C13a" target="http://www.w3.org/TR/css3-text/">
<front>
<title>CSS Text Module Level 3</title>
<author>
<organization abbrev="W3C">World Wide Web Consortium</organization>
<address>
<postal>
<street>32 Vassar Street, Building 32-G514</street>
<city>Cambridge</city> <region>MA</region> <code>02139</code>
<country>United States</country>
</postal>
<uri>http://www.w3c.org</uri>
</address>
</author>
<date month="October" year="2013" />
</front>
</reference>
<reference anchor="W3C13b" target="http://www.w3.org/TR/html51/">
<front>
<title>HTML 5.1, A vocabulary and associated APIs for HTML and XHTML</title>
<author>
<organization abbrev="W3C">World Wide Web Consortium</organization>
<address>
<postal>
<street>32 Vassar Street, Building 32-G514</street>
<city>Cambridge</city> <region>MA</region> <code>02139</code>
<country>United States</country>
</postal>
<uri>http://www.w3c.org</uri>
</address>
</author>
<date month="October" year="2013" />
</front>
</reference>
<reference anchor="W3C99" target="http://www.w3.org/TR/html401/">
<front>
<title>HTML 4.01 Specification</title>
<author>
<organization abbrev="W3C">World Wide Web Consortium</organization>
<address>
<postal>
<street>32 Vassar Street, Building 32-G514</street>
<city>Cambridge</city> <region>MA</region> <code>02139</code>
<country>United States</country>
</postal>
<uri>http://www.w3c.org</uri>
</address>
</author>
<date month="December" year="1999" />
</front>
</reference>
<reference anchor="UNICODE" target="http://www.unicode.org/versions/Unicode6.3.0/">
<front>
<title>The Unicode Standard, Version 6.3.0</title>
<author>
<organization>The Unicode Consortium</organization>
<address>
<postal>
<street></street>
<city>Mountain View</city> <region>CA</region>
<country>United States</country>
</postal>
<uri>http://www.unicode.org</uri>
</address>
</author>
<date month="September" year="2013" />