Description
If I run this command:
uni list b | find " 2 "
I get:
3400 4DBF 2 CJK Unified Ideographs Extension A
4E00 9FFF 2 CJK Unified Ideographs
AC00 D7AF 2 Hangul Syllables
D800 DB7F 2 High Surrogates
DB80 DBFF 2 High Private Use Surrogates
DC00 DFFF 2 Low Surrogates
E000 F8FF 2 Private Use Area
17000 187FF 2 Tangut
18D00 18D7F 2 Tangut Supplement
20000 2A6DF 2 CJK Unified Ideographs Extension B
2A700 2B73F 2 CJK Unified Ideographs Extension C
2B740 2B81F 2 CJK Unified Ideographs Extension D
2B820 2CEAF 2 CJK Unified Ideographs Extension E
2CEB0 2EBEF 2 CJK Unified Ideographs Extension F
2EBF0 2EE5F 2 CJK Unified Ideographs Extension I
30000 3134F 2 CJK Unified Ideographs Extension G
31350 323AF 2 CJK Unified Ideographs Extension H
323B0 3347F 2 CJK Unified Ideographs Extension J
F0000 FFFFF 2 Supplementary Private Use Area-A
100000 10FFFF 2 Supplementary Private Use Area-B
The count of 2 reported for each of these blocks is wrong; the correct counts should be:
CJK Ideograph Extension A 6,592 characters
CJK Ideograph 20,992 characters
High Surrogates 896 characters
High Private Use Surrogates 128 characters
Low Surrogates 1,024 characters
Private Use Area 6,400 characters
Hangul Syllable 11,172 characters
Tangut Ideograph 6,144 characters
Tangut Ideograph Supplement 31 characters
CJK Ideograph Extension B 42,720 characters
CJK Ideograph Extension C 4,160 characters
CJK Ideograph Extension D 222 characters
CJK Ideograph Extension E 5,774 characters
CJK Ideograph Extension F 7,473 characters
CJK Ideograph Extension I 622 characters
CJK Ideograph Extension G 4,939 characters
CJK Ideograph Extension H 4,192 characters
CJK Ideograph Extension J 4,298 characters
Supplementary Private Use Area-A 65,536 characters
Supplementary Private Use Area-B 65,536 characters
I suppose that, by taking into account the first and last code point of each of these ranges, it should be easy to report the true number of characters for all of them!
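As a rough illustration of that idea, the size of an inclusive range is last - first + 1. The sketch below makes no assumption about how uni computes its counts internally; the helper name `range_size` is hypothetical. The end points are taken from the listing above, except for Hangul, where the last assigned syllable U+D7A3 is used instead of the block end U+D7AF.

```python
def range_size(first_hex: str, last_hex: str) -> int:
    """Number of code points in the inclusive range first..last (hex strings)."""
    return int(last_hex, 16) - int(first_hex, 16) + 1

# End points from the listing above. Hangul uses the last assigned
# syllable (D7A3), not the end of the block (D7AF).
RANGES = {
    "CJK Ideograph Extension A": ("3400", "4DBF"),
    "CJK Ideograph": ("4E00", "9FFF"),
    "Hangul Syllable": ("AC00", "D7A3"),
    "CJK Ideograph Extension B": ("20000", "2A6DF"),
}

for name, (first, last) in RANGES.items():
    print(f"{name:<28}{range_size(first, last):>8,}")
```

For these four ranges the sketch gives 6,592, 20,992, 11,172 and 42,720, which match the expected counts listed above. Block end points only give the right count when the whole block is assigned, so the last assigned code point of each range is what matters.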
Second issue:
If I create this simple batch file:
@echo off
uni p s:Adlam | find /C ";"
uni p s:Ahom | find /C ";"
[..]
uni p s:Yi | find /C ";"
uni p s:Zanabazar_Square | find /C ";"
It gives the right number of characters for each Unicode script, except for the three scripts Han, Hangul and Tangut, because of the first issue described above:
Adlam 88
Ahom 65
[..]
Han 1389 ( WRONG )
Hangul 569 ( WRONG )
Tangut 888 ( WRONG )
[..]
However, if I list all scripts with the simple command:
uni list s
then all the results displayed in that list seem erroneous:
Name Assigned
Adlam 83
Ahom 52
Anatolian Hieroglyphs 582
Arabic 1318
[..]
Warang Citi 80
Yezidi 43
Yi 1216
Zanabazar Square 63
Why?
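One way to cross-check the per-script totals independently of uni is to sum the ranges in the Unicode Character Database file Scripts.txt, whose data lines have the form `first..last ; Script # comment`. This is a minimal sketch under that assumption: the function name `count_by_script` is hypothetical, and the sample lines are an abbreviated Adlam excerpt with the comments trimmed, which should be checked against the real file.

```python
from collections import Counter

def count_by_script(lines):
    """Sum code points per script from Scripts.txt-style lines."""
    totals = Counter()
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank or comment-only line
        code_range, script = (part.strip() for part in data.split(";"))
        first, _, last = code_range.partition("..")
        last = last or first  # a single code point has no ".."
        totals[script] += int(last, 16) - int(first, 16) + 1
    return totals

# Abbreviated excerpt in the Scripts.txt layout (comments trimmed).
SAMPLE = """\
# Adlam
1E900..1E943  ; Adlam # L&
1E944..1E94A  ; Adlam # Mn
1E94B         ; Adlam # Lm
1E950..1E959  ; Adlam # Nd
1E95E..1E95F  ; Adlam # Po
""".splitlines()

print(count_by_script(SAMPLE)["Adlam"])
```

For this excerpt the total is 88, the same figure the batch file reports for Adlam, rather than the 83 shown by `uni list s`.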
Third issue:
If I list all the General Category items with the command:
uni list c
I notice that two values are erroneous:
L Letter 26369 Ll | Lm | Lo | Lt | Lu
Lo Other_Letter 21759
This happens because of the first issue above. Indeed, if I list all the CJK, Hangul and Tangut blocks that are reported incorrectly, I get this list:
True False
CJK Ideograph Extension A 6,592 ( 2 )
CJK Ideograph 20,992 ( 2 )
Hangul Syllable 11,172 ( 2 )
Tangut Ideograph 6,144 ( 2 )
Tangut Ideograph Supplement 31 ( 2 )
CJK Ideograph Extension B 42,720 ( 2 )
CJK Ideograph Extension C 4,160 ( 2 )
CJK Ideograph Extension D 222 ( 2 )
CJK Ideograph Extension E 5,774 ( 2 )
CJK Ideograph Extension F 7,473 ( 2 )
CJK Ideograph Extension I 622 ( 2 )
CJK Ideograph Extension G 4,939 ( 2 )
CJK Ideograph Extension H 4,192 ( 2 )
CJK Ideograph Extension J 4,298 ( 2 )
Total 119,331 ( 28 )
If we take the values currently reported for the Lo and L* categories, add the true number of these characters (119,331) and subtract the 28 entries that were wrongly counted, we get the right number of characters for these two general categories: 141,062 for Lo and 145,672 for L*!
=> 21,759 + 119,331 - 28 = 141,062 ( Lo category )
=> 26,369 + 119,331 - 28 = 145,672 ( L* category )
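The arithmetic above can be re-checked with a few lines of Python. This is only a verification sketch: the figures are copied from the tables in this report, and the variable names are made up for illustration.

```python
# Expected (true) sizes of the affected ranges, copied from the table above.
TRUE_COUNTS = [6592, 20992, 11172, 6144, 31, 42720, 4160,
               222, 5774, 7473, 622, 4939, 4192, 4298]

REPORTED_PER_RANGE = 2  # each range is currently counted as 2

# Characters missing from the totals: true sizes minus what was counted.
missing = sum(TRUE_COUNTS) - REPORTED_PER_RANGE * len(TRUE_COUNTS)

reported_lo = 21759  # "Lo" as printed by `uni list c`
reported_l = 26369   # "L" as printed by `uni list c`

print(sum(TRUE_COUNTS))       # 119331
print(missing)                # 119303
print(reported_lo + missing)  # 141062
print(reported_l + missing)   # 145672
```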
Fourth issue:
If I list all the Unicode properties with the command:
uni list pr
I get this list:
Name Assigned
ASCII Hex Digit 22
Bidi Control 12
Dash 31
Deprecated 15
Diacritic 1247
Extender 62
Hex Digit 44
Hyphen 11
ID Compat Math Continue 43
ID Compat Math Start 13
IDS Binary Operator 13
IDS Trinary Operator 2
IDS Unary Operator 2
Ideographic 2810
Join Control 2
Logical Order Exception 19
Modifier Combining Mark 14
Noncharacter Code Point 0
Other Alphabetic 1510
Other Default Ignorable Code Point 7
Other Grapheme Extend 160
Other ID Continue 16
Other ID Start 6
Other Lowercase 312
Other Math 1362
Other Uppercase 120
Pattern Syntax 2681
Pattern White Space 11
Prepended Concatenation Mark 13
Quotation Mark 30
Radical 329
Regional Indicator 26
Sentence Terminal 170
Soft Dotted 50
Terminal Punctuation 291
Unified Ideograph 34
Variation Selector 260
White Space 25
Among all these results, a few seem erroneous:
True False
Ideographic 110,943 ( 2,810 )
Noncharacter_Code_Point 66 ( 0 )
Other_Default_Ignorable_Code_Point 3,776 ( 7 )
Pattern_Syntax 2,760 ( 2,681 )
Unified_Ideograph 101,996 ( 34 )
I suppose that these wrong results, except for Noncharacter_Code_Point, are also a consequence of the first issue!
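The expected value of 66 for Noncharacter_Code_Point follows directly from the Unicode definition: the 32 code points U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes. A short sketch that enumerates them (the function name `noncharacters` is hypothetical):

```python
def noncharacters():
    """Return the code points that Unicode defines as noncharacters."""
    points = list(range(0xFDD0, 0xFDF0))  # U+FDD0..U+FDEF: 32 code points
    for plane in range(17):               # planes 0 through 16
        base = plane << 16
        points += [base | 0xFFFE, base | 0xFFFF]  # last two code points of the plane
    return points

print(len(noncharacters()))  # 66
```

Noncharacters are never assigned to characters, so if the "Assigned" column only counts assigned characters (an assumption, not something confirmed here), a value of 0 could be a separate effect from the range problem in the first issue.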