The character counts of some blocks and scripts are wrong

If I run this command:

	uni list b | find " 2 "

I get:

       3400     4DBF  2         CJK Unified Ideographs Extension A
       4E00     9FFF  2         CJK Unified Ideographs
       AC00     D7AF  2         Hangul Syllables
       D800     DB7F  2         High Surrogates
       DB80     DBFF  2         High Private Use Surrogates
       DC00     DFFF  2         Low Surrogates
       E000     F8FF  2         Private Use Area
      17000    187FF  2         Tangut
      18D00    18D7F  2         Tangut Supplement
      20000    2A6DF  2         CJK Unified Ideographs Extension B
      2A700    2B73F  2         CJK Unified Ideographs Extension C
      2B740    2B81F  2         CJK Unified Ideographs Extension D
      2B820    2CEAF  2         CJK Unified Ideographs Extension E
      2CEB0    2EBEF  2         CJK Unified Ideographs Extension F
      2EBF0    2EE5F  2         CJK Unified Ideographs Extension I	
      30000    3134F  2         CJK Unified Ideographs Extension G
      31350    323AF  2         CJK Unified Ideographs Extension H
      323B0    3347F  2         CJK Unified Ideographs Extension J
      F0000    FFFFF  2         Supplementary Private Use Area-A
     100000   10FFFF  2         Supplementary Private Use Area-B

The result for all these blocks is wrong; the correct result should be:

    CJK Ideograph Extension A              6,592  characters
    CJK Ideograph                         20,992  characters
    High Surrogates                          896  characters
    High Private Use Surrogates              128  characters
    Low Surrogates                         1,024  characters
    Private Use Area                       6,400  characters
    Hangul Syllable                       11,172  characters
    Tangut Ideograph                       6,144  characters
    Tangut Ideograph Supplement               31  characters
    CJK Ideograph Extension B             42,720  characters
    CJK Ideograph Extension C              4,160  characters
    CJK Ideograph Extension D                222  characters
    CJK Ideograph Extension E              5,774  characters
    CJK Ideograph Extension F              7,473  characters
    CJK Ideograph Extension I                622  characters
    CJK Ideograph Extension G              4,939  characters
    CJK Ideograph Extension H              4,192  characters
    CJK Ideograph Extension J              4,298  characters
    Supplementary Private Use Area-A      65,536  characters
    Supplementary Private Use Area-B      65,536  characters

I suppose that, by taking into account the first and last code point of each of these ranges, it should be easy to report the true number of characters for all these blocks.
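
As a rough illustration of that idea, here is a minimal Python sketch. It is not `uni`'s actual code, and the function name `range_size` is hypothetical. The end points are the ones listed above, except for Hangul, where I assume `D7A3` (the last assigned syllable) instead of the end of the block:

```python
def range_size(first_hex: str, last_hex: str) -> int:
    """Number of code points in the inclusive range first..last (hex strings)."""
    return int(last_hex, 16) - int(first_hex, 16) + 1


# (name, first, last) taken from the block list above.
ranges = [
    ("CJK Unified Ideographs Extension A", "3400", "4DBF"),
    ("CJK Unified Ideographs", "4E00", "9FFF"),
    ("Private Use Area", "E000", "F8FF"),
    ("Tangut", "17000", "187FF"),
    ("CJK Unified Ideographs Extension B", "20000", "2A6DF"),
    # Assumption: D7A3 is the last assigned syllable; the block itself ends at D7AF.
    ("Hangul Syllables", "AC00", "D7A3"),
]

for name, first, last in ranges:
    print(f"{name:<40}{range_size(first, last):>8,}")
```

For a block that is only partly assigned, such as Hangul Syllables or Tangut Supplement, the end points of the assigned range are needed, not those of the block.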

## Second issue

If I create this simple **batch** file:

    @echo off

    uni p s:Adlam                    | find /C ";"
    uni p s:Ahom                     | find /C ";"
    [..]
    uni p s:Yi                       | find /C ";"
    uni p s:Zanabazar_Square         | find /C ";"

It gives the right number of characters for each Unicode script except `Han`, `Hangul` and `Tangut`, which are wrong because of the first issue described above:

    Adlam                    88
    Ahom                     65
    [..]
    Han                      1389    ( WRONG )
    Hangul                   569     ( WRONG )
    Tangut                   888     ( WRONG )
    [..]

However, if I list all scripts with the command:

    uni list s

then all the counts displayed in that list seem wrong:

~~~diff
Name                    Assigned
Adlam                         83
Ahom                          52
Anatolian Hieroglyphs        582
Arabic                      1318
................................
................................
................................
Warang Citi                   80
Yezidi                        43
Yi                          1216
Zanabazar Square              63
~~~

Why?

## Third issue

If I list all the `General category` items with the command:

    uni list c

I notice that two values are wrong:

    L      Letter                    26369  Ll | Lm | Lo | Lt | Lu

    Lo     Other_Letter              21759

This happens because of the first issue above. Indeed, if I list all the Asian blocks whose counts are *wrong*, I get this list:

                                        Expected                    Reported
    CJK Ideograph Extension A              6,592                    (   2  )
    CJK Ideograph                         20,992                    (   2  )
    Hangul Syllable                       11,172                    (   2  )
    Tangut Ideograph                       6,144                    (   2  )
    Tangut Ideograph Supplement               31                    (   2  )
    CJK Ideograph Extension B             42,720                    (   2  )
    CJK Ideograph Extension C              4,160                    (   2  )
    CJK Ideograph Extension D                222                    (   2  )
    CJK Ideograph Extension E              5,774                    (   2  )
    CJK Ideograph Extension F              7,473                    (   2  )
    CJK Ideograph Extension I                622                    (   2  )
    CJK Ideograph Extension G              4,939                    (   2  )
    CJK Ideograph Extension H              4,192                    (   2  )
    CJK Ideograph Extension J              4,298                    (   2  )
                                         119,331                    (  28  )

If we take the values currently reported for the `Lo` and `L*` categories and add the true number of these Asian characters (`119,331`) minus the `28` already counted, we get the right totals for these two general categories: `141,062` for `Lo` and `145,672` for `L*`.

    =>  21,759  +  119,331  -  28  =  141,062   ( Lo category )
    =>  26,369  +  119,331  -  28  =  145,672   ( L* category )
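
The arithmetic above can be checked with a short Python sketch. All the numbers are copied from this report, and the variable names are only illustrative:

```python
# True counts for the 14 affected blocks, copied from the table above.
true_counts = [
    6592, 20992, 11172, 6144, 31, 42720, 4160,
    222, 5774, 7473, 622, 4939, 4192, 4298,
]

total_true = sum(true_counts)           # characters that should be counted
total_reported = 2 * len(true_counts)   # each block is currently reported as 2

# Values currently shown by `uni list c`.
lo_reported = 21759
l_reported = 26369

lo_fixed = lo_reported + total_true - total_reported
l_fixed = l_reported + total_true - total_reported

print(f"sum of true counts: {total_true:,}")
print(f"Lo: {lo_fixed:,}")
print(f"L*: {l_fixed:,}")
```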

## Fourth issue

If I list all the Unicode `properties` with the command:

    uni list pr

I get this list:

    Name                                Assigned
    ASCII Hex Digit                           22
    Bidi Control                              12
    Dash                                      31
    Deprecated                                15
    Diacritic                               1247
    Extender                                  62
    Hex Digit                                 44
    Hyphen                                    11
    ID Compat Math Continue                   43
    ID Compat Math Start                      13
    IDS Binary Operator                       13
    IDS Trinary Operator                       2
    IDS Unary Operator                         2
    Ideographic                             2810
    Join Control                               2
    Logical Order Exception                   19
    Modifier Combining Mark                   14
    Noncharacter Code Point                    0
    Other Alphabetic                        1510
    Other Default Ignorable Code Point         7
    Other Grapheme Extend                    160
    Other ID Continue                         16
    Other ID Start                             6
    Other Lowercase                          312
    Other Math                              1362
    Other Uppercase                          120
    Pattern Syntax                          2681
    Pattern White Space                       11
    Prepended Concatenation Mark              13
    Quotation Mark                            30
    Radical                                  329
    Regional Indicator                        26
    Sentence Terminal                        170
    Soft Dotted                               50
    Terminal Punctuation                     291
    Unified Ideograph                         34
    Variation Selector                       260
    White Space                               25

Among these results, a few seem wrong:

                                         Expected                   Reported
    Ideographic                           110,943               (   2,810  )
    Noncharacter_Code_Point                    66               (       0  )
    Other_Default_Ignorable_Code_Point      3,776               (       7  )
    Pattern_Syntax                          2,760               (   2,681  )
    Unified_Ideograph                     101,996               (      34  )

I suppose that these wrong results, except for `Noncharacter_Code_Point`, are also a consequence of the first issue.

