Commit fcf79a2

Remove redundant characters from Lst()'s
1 parent 58b6e30 commit fcf79a2

1 file changed

Lines changed: 6 additions & 10 deletions

tools/tokenisers/tokeniser-disamb-gt-desc.pmscript

@@ -44,13 +44,13 @@ Define incondform Punct|
!! * Medium Mathematical Space U+205F
!! * Word joiner U+2060
Define blank Whitespace | incondform
-
| Lst({           ​‌‍  ⁠})
+
| Lst({         ​‌‍  ⁠})
;

Define nonprintable [ 0:? | inputmark | flags ];
Define any [ ? | nonprintable ];

-
Define incondword morphology & [ any* incondform:[?*] nonprintable* ] ;
+
Define incondword morphology & [ any* incondform:[?*] nonprintable* ] ;
! Ends in punctuation – no context condition

Define morphoword morphology LC([blank | #]) RC([blank | # ]);
@@ -64,22 +64,18 @@ Define urlword url LC([blank | #]) RC([blank | # ]);
!! Unknowns are made of:
Define alphabet "a-z" !! * lower-case ASCII
|"A-Z" !! * upper-case ASCII
-
|"Ѐ-ӿ" !! * some cyrillics
-
|Lst({àáâãāăȧäảåǎȁȃąạḁæǽǣèéêẽēĕėëẻěȅȇẹȩęḙḛìíîĩīīĭi̇ïỉǐịįȉȋḭɨòóôõōŏȯöỏőǒȍȏơǫọɵøờớỡởợǭộǿœùúûũūŭüủůűǔȕȗưụṳųṷṵừứữửựʉỳýŷỹȳẏÿỷƴỵɏÀÁÂÃĀĂȦÄẢÅǍȀȂĄẠḀÆǼǢÈÉÊẼĒĔĖËẺĚȄȆẸȨĘḘḚÌÍÎĨĪĪĬİÏỈǏỊĮȈȊḬƗÒÓÔÕŌŎȮÖỎŐǑȌȎƠǪỌƟØỜỚỠỞỢǬỘǾŒÙÚÛŨŪŬÜỦŮŰǓȔȖƯỤṲŲṶṴỪỨỮỬỰɄỲÝŶỸȲẎŸỶƳỴɎšžčđðíŋňŧñńŠŽČĐÐÍŊŇŦÑ})
-
!! * select extended latin symbols
-
! Then followed by all Cyrillic letters:
-
| Lst({{ЁАА́БВГДЕЕ́ЁЖЗИИ́ЙЙКЛМНОО́ПРСТУУ́ФХЦЧШЩЪЫЬЭЭ́ЮЮ́ЯЯ́аа̀а́бвгдеѐе́ёжзиѝи́ййклмноо̀о́прстуу̀у́фхцчшщъыы̀ы́ьээ̀э́юю̀ю́яя̀я́ёё̀})
-
!! * extended cyrillics
+
|"Ѐ-ӿ" !! * Cyrillic block (U+0400 - U+04FF)
+
|Lst({àâãāăȧäảåǎȁȃąạḁæǽǣèéêẽēĕėëẻěȅȇẹȩęḙḛìíîĩīīĭ̇ïỉǐịįȉȋḭɨòóôõōŏȯöỏőǒȍȏơǫọɵøờớỡởợǭộǿœùúûũūŭüủůűǔȕȗưụṳųṷṵừứữửựʉỳýŷỹȳẏÿỷƴỵɏÀÂÃĀĂȦÄẢÅǍȀȂĄẠḀÆǼǢÈÉÊẼĒĔĖËẺĚȄȆẸȨĘḘḚÌÍÎĨĪĪĬİÏỈǏỊĮȈȊḬƗÒÓÔÕŌŎȮÖỎŐǑȌȎƠǪỌƟØỜỚỠỞỢǬỘǾŒÙÚÛŨŪŬÜỦŮŰǓȔȖƯỤṲŲṶṴỪỨỮỬỰɄỲÝŶỸȲẎŸỶƳỴɎðíňñńÐÍŇÑ})
| "0-9" !! ASCII digits
-
| Lst({_§°†}) !! * select symbols
+
| Lst({_†}) !! * select symbols
!! * Combining diacritics as individual symbols,
! to be able to analyse unknown words with
! decomposed diacritics. All combining diacritics
! U+0300—U+036F, U+20D0—U+20F0. Grouped according
! to position relative to base char, then more or
! less according to Unicode number.
! NB: The following list will look odd in many editors!
-
| Lst({̶̴̵̶̷̸⃥⃦⃪⃫⃒⃓⃘⃙⃚̡̢̧̨᷐᷎̛᷹̖̗̘̟̠̣̤̥̦̩̪̫̬̭̮̯̰̱̲̳̹̺̻̼͇͈͉͍͎͙͚͓͔͕͖⃨⃬⃭⃮⃯᷂᷏᷹᷽᷿᷊᷷᷸̀́̂̌̃̄̅̆̇̈⃰̉̊̋̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌͐͑͒͗͛⃐⃑⃔⃕⃖⃗⃛⃜⃡⃧⃩᷀᷁᷃᷄᷅᷆᷇᷈᷉᷋᷌᷵᷻᷾ͣͤͥͦͧͨͩͪͫͬͭͮͯ᷑᷒ᷓᷔᷕᷖᷗᷘᷙᷚᷛᷜᷝᷞᷟᷠᷡᷢᷣᷤᷥᷦᷧᷨᷩᷪᷫᷬᷭᷮᷯᷰᷱᷲᷳᷴ̕̚͘᷶͜͟͢᷼͝͞͠͡᷍ͅ⃝⃞⃟⃠⃢⃣⃤})
+
| Lst({̶̴̵̶̷̸⃥⃦⃪⃫⃒⃓⃘⃙⃚̡̢̧̨᷐᷎̛᷹̖̗̘̟̠̣̤̥̦̩̪̫̬̭̮̯̰̱̲̳̹̺̻̼͇͈͉͍͎͙͚͓͔͕͖⃨⃬⃭⃮⃯᷂᷏᷹᷽᷿᷊᷷᷸̂̌̃̄̅̇⃰̉̊̋̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌͐͑͒͗͛⃐⃑⃔⃕⃖⃗⃛⃜⃡⃧⃩᷀᷁᷃᷄᷅᷆᷇᷈᷉᷋᷌᷵᷻᷾ͣͤͥͦͧͨͩͪͫͬͭͮͯ᷑᷒ᷓᷔᷕᷖᷗᷘᷙᷚᷛᷜᷝᷞᷟᷠᷡᷢᷣᷤᷥᷦᷧᷨᷩᷪᷫᷬᷭᷮᷯᷰᷱᷲᷳᷴ̕̚͘᷶͜͟͢᷼͝͞͠͡᷍ͅ⃝⃞⃟⃠⃢⃣⃤})
!! * various symbols from Private area (probably Microsoft),
!! so far:
!! * U+F0B7 for "x in box"
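
Reviewer note: the commit removes characters from the Lst() lists that were listed more than once or that the "Ѐ-ӿ" range (U+0400 - U+04FF) already covers. The check below is a minimal sketch of how such redundancy could be found; it is not part of the repository. The helper name `redundant_chars` and the sample string are hypothetical. It compares code points, so a base letter followed by a combining mark counts as two separate characters.

```python
from collections import Counter


def redundant_chars(listed, covered_ranges=()):
    """Return (duplicates, already_covered) for the code points in `listed`.

    duplicates:      code points that occur more than once in the list
    already_covered: code points that fall inside one of `covered_ranges`,
                     given as inclusive (low, high) code point pairs
    """
    counts = Counter(listed)
    duplicates = sorted(ch for ch, n in counts.items() if n > 1)
    already_covered = sorted(
        ch for ch in counts
        if any(lo <= ord(ch) <= hi for lo, hi in covered_ranges)
    )
    return duplicates, already_covered


# Hypothetical sample: U+012B appears twice, and U+0431 lies in the
# Cyrillic block that the range "Ѐ-ӿ" already matches.
sample = "\u00e0\u00e1\u012b\u012b\u0431"
dups, covered = redundant_chars(sample, covered_ranges=[(0x0400, 0x04FF)])
print([f"U+{ord(c):04X}" for c in dups])
print([f"U+{ord(c):04X}" for c in covered])
```

Running it on the body of each Lst({...}) from the .pmscript file, with the file's own ranges as `covered_ranges`, would list candidates for removal. Whether each one is safe to drop still needs a manual look.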
