The character counts of some blocks and scripts are erroneous #58

@guy038

If I run this command:

uni list b | find " 2 "

I get:

   3400     4DBF  2         CJK Unified Ideographs Extension A
   4E00     9FFF  2         CJK Unified Ideographs
   AC00     D7AF  2         Hangul Syllables
   D800     DB7F  2         High Surrogates
   DB80     DBFF  2         High Private Use Surrogates
   DC00     DFFF  2         Low Surrogates
   E000     F8FF  2         Private Use Area
  17000    187FF  2         Tangut
  18D00    18D7F  2         Tangut Supplement
  20000    2A6DF  2         CJK Unified Ideographs Extension B
  2A700    2B73F  2         CJK Unified Ideographs Extension C
  2B740    2B81F  2         CJK Unified Ideographs Extension D
  2B820    2CEAF  2         CJK Unified Ideographs Extension E
  2CEB0    2EBEF  2         CJK Unified Ideographs Extension F
  2EBF0    2EE5F  2         CJK Unified Ideographs Extension I	
  30000    3134F  2         CJK Unified Ideographs Extension G
  31350    323AF  2         CJK Unified Ideographs Extension H
  323B0    3347F  2         CJK Unified Ideographs Extension J
  F0000    FFFFF  2         Supplementary Private Use Area-A
 100000   10FFFF  2         Supplementary Private Use Area-B

The count of 2 reported for each of these blocks is wrong; the correct counts should be:

CJK Ideograph Extension A              6,592  characters
CJK Ideograph                         20,992  characters
High Surrogates                          896  characters
High Private Use Surrogates              128  characters
Low Surrogates                         1,024  characters
Private Use Area                       6,400  characters
Hangul Syllable                       11,172  characters
Tangut Ideograph                       6,144  characters
Tangut Ideograph Supplement               31  characters
CJK Ideograph Extension B             42,720  characters
CJK Ideograph Extension C              4,160  characters
CJK Ideograph Extension D                222  characters
CJK Ideograph Extension E              5,774  characters
CJK Ideograph Extension F              7,473  characters
CJK Ideograph Extension I                622  characters
CJK Ideograph Extension G              4,939  characters
CJK Ideograph Extension H              4,192  characters
CJK Ideograph Extension J              4,298  characters
Supplementary Private Use Area-A      65,536  characters
Supplementary Private Use Area-B      65,536  characters

I suppose that, by taking into account the first and last code point of each of these ranges, it should be easy to report the true number of characters for all these blocks.
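The idea above can be sketched in Python. This is only an illustration, not how uni works internally: it assumes the count of 2 comes from the `<Name, First>` / `<Name, Last>` line pairs that UnicodeData.txt uses for large ranges, and the helper name `count_code_points` is hypothetical.

```python
def count_code_points(lines):
    """Count code points per name, expanding <Name, First>/<Name, Last> pairs.

    Each line is expected in UnicodeData.txt format: fields separated by ';',
    with the code point (hex) in field 0 and the name in field 1.
    """
    counts = {}
    range_start = None
    for line in lines:
        fields = line.split(";")
        code_point = int(fields[0], 16)
        name = fields[1]
        if name.endswith(", First>"):
            # Remember where the range begins; the matching Last line follows.
            range_start = code_point
        elif name.endswith(", Last>"):
            # Strip the leading '<' and the trailing ', Last>' to get the label.
            label = name[1:-len(", Last>")]
            counts[label] = code_point - range_start + 1
            range_start = None
        else:
            # An ordinary line describes exactly one code point.
            counts[name] = 1
    return counts


sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;",
    "4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;",
    "AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;",
    "D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;",
    "D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;",
    "DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;",
]

for label, count in count_code_points(sample).items():
    print(f"{label:<35}{count:>8,}")
```

With this sample, the three ranges come out as 6,592, 11,172 and 896 code points, which matches the expected values listed above, instead of 2 each.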

Second issue:

If I create this simple batch file:

@echo off

uni p s:Adlam                    | find /C ";"
uni p s:Ahom                     | find /C ";"
[..]
uni p s:Yi                       | find /C ";"
uni p s:Zanabazar_Square         | find /C ";"

It gives the right number of characters for each Unicode script, except for the three scripts Han, Hangul and Tangut, because of the first issue described above:

Adlam                    88
Ahom                     65
[..]
Han                      1389    ( WRONG )
Hangul                   569     ( WRONG )
Tangut                   888     ( WRONG )
[..]

However, if I list all scripts with the simple command:

uni list s

Then all the counts displayed in that list seem erroneous:

Name                    Assigned
Adlam                         83
Ahom                          52
Anatolian Hieroglyphs        582
Arabic                      1318
................................
................................
................................
Warang Citi                   80
Yezidi                        43
Yi                          1216
Zanabazar Square              63

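Why do these counts differ from the batch file results above (for example, 83 instead of 88 for Adlam, and 52 instead of 65 for Ahom)?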

Third issue:

If I list all the General Category items with the command:

uni list c

I notice that two values are erroneous:

L      Letter                    26369  Ll | Lm | Lo | Lt | Lu

Lo     Other_Letter              21759

This happens because of the first issue above. Indeed, if I list all the Asian ranges whose counts are erroneous, I get this list:

                                   Expected                    Reported
CJK Ideograph Extension A              6,592                    (   2  )
CJK Ideograph                         20,992                    (   2  )
Hangul Syllable                       11,172                    (   2  )
Tangut Ideograph                       6,144                    (   2  )
Tangut Ideograph Supplement               31                    (   2  )
CJK Ideograph Extension B             42,720                    (   2  )
CJK Ideograph Extension C              4,160                    (   2  )
CJK Ideograph Extension D                222                    (   2  )
CJK Ideograph Extension E              5,774                    (   2  )
CJK Ideograph Extension F              7,473                    (   2  )
CJK Ideograph Extension I                622                    (   2  )
CJK Ideograph Extension G              4,939                    (   2  )
CJK Ideograph Extension H              4,192                    (   2  )
CJK Ideograph Extension J              4,298                    (   2  )
                                     119,331                    (  28  )

If we take the current values for the Lo and L categories, add the true number of Asian characters (119,331) and subtract the 28 that are already counted, we get the right totals for these two General Categories: 141,062 for Lo and 145,672 for L.

=>  21,759  +  119,331  -  28  =  141,062   ( Lo category )
=>  26,369  +  119,331  -  28  =  145,672   ( L* category )
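The arithmetic above can be checked with a short Python sketch. All figures are copied from this report; the variable names are only illustrative.

```python
# Expected character counts for the 14 affected ranges, as listed above.
expected = {
    "CJK Ideograph Extension A": 6_592,
    "CJK Ideograph": 20_992,
    "Hangul Syllable": 11_172,
    "Tangut Ideograph": 6_144,
    "Tangut Ideograph Supplement": 31,
    "CJK Ideograph Extension B": 42_720,
    "CJK Ideograph Extension C": 4_160,
    "CJK Ideograph Extension D": 222,
    "CJK Ideograph Extension E": 5_774,
    "CJK Ideograph Extension F": 7_473,
    "CJK Ideograph Extension I": 622,
    "CJK Ideograph Extension G": 4_939,
    "CJK Ideograph Extension H": 4_192,
    "CJK Ideograph Extension J": 4_298,
}

true_total = sum(expected.values())   # 119,331
already_counted = 2 * len(expected)   # 2 per range, 28 in total

reported_lo = 21_759                  # Lo as shown by "uni list c"
reported_l = 26_369                   # L  as shown by "uni list c"

corrected_lo = reported_lo + true_total - already_counted
corrected_l = reported_l + true_total - already_counted

print(f"Lo: {corrected_lo:,}")
print(f"L:  {corrected_l:,}")
```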

Fourth issue:

If I list all the Unicode properties with the command:

uni list pr

I get this list:

Name                                Assigned
ASCII Hex Digit                           22
Bidi Control                              12
Dash                                      31
Deprecated                                15
Diacritic                               1247
Extender                                  62
Hex Digit                                 44
Hyphen                                    11
ID Compat Math Continue                   43
ID Compat Math Start                      13
IDS Binary Operator                       13
IDS Trinary Operator                       2
IDS Unary Operator                         2
Ideographic                             2810
Join Control                               2
Logical Order Exception                   19
Modifier Combining Mark                   14
Noncharacter Code Point                    0
Other Alphabetic                        1510
Other Default Ignorable Code Point         7
Other Grapheme Extend                    160
Other ID Continue                         16
Other ID Start                             6
Other Lowercase                          312
Other Math                              1362
Other Uppercase                          120
Pattern Syntax                          2681
Pattern White Space                       11
Prepended Concatenation Mark              13
Quotation Mark                            30
Radical                                  329
Regional Indicator                        26
Sentence Terminal                        170
Soft Dotted                               50
Terminal Punctuation                     291
Unified Ideograph                         34
Variation Selector                       260
White Space                               25

Among all these results, a few seem erroneous:

                                     Expected                 Reported
Ideographic                           110,943               (   2,810  )
Noncharacter_Code_Point                    66               (       0  )
Other_Default_Ignorable_Code_Point      3,776               (       7  )
Pattern_Syntax                          2,760               (   2,681  )
Unified_Ideograph                     101,996               (      34  )

I suppose that these wrong results, except for Noncharacter_Code_Point, are also a consequence of the first issue.
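For reference, the expected value of 66 for Noncharacter_Code_Point can be derived from the Unicode definition of noncharacters: the 32 code points U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes. A minimal Python sketch (the function name `noncharacters` is only illustrative):

```python
def noncharacters():
    """Return all Unicode noncharacter code points."""
    # The contiguous run U+FDD0..U+FDEF in the Basic Multilingual Plane.
    points = list(range(0xFDD0, 0xFDF0))
    # U+nFFFE and U+nFFFF at the end of each of the 17 planes (0 through 16).
    for plane in range(17):
        base = plane << 16
        points += [base | 0xFFFE, base | 0xFFFF]
    return points


print(len(noncharacters()))  # 32 + 17 * 2 = 66
```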
