
Conversion performance #8

Open
@ScottPJones

Description


I've written some simple functions that build conversion tables using iconv.jl and then perform the conversions in pure Julia code instead of calling iconv. I compared the performance of:

  1. converting from an 8-bit character set to UTF-8 via iconv.jl
  2. converting from an 8-bit character set to UTF-16 via iconv.jl
  3. converting from an 8-bit character set to UTF-16 via https://github.com/nolta/ICU.jl
  4. converting from an 8-bit character set to UTF-8 via my convert code
  5. converting from an 8-bit character set to UTF-16 via my convert code

I've made a Gist with the benchmark results (measured with https://github.com/johnmyleswhite/Benchmarks.jl), along with the conversion code and the benchmarking code, at:
https://gist.github.com/ScottPJones/fcd12f675edb3d79b5ce.
The tables created are also very small: at most a couple hundred bytes per character set (the maximum is 256 bytes if the character set is ASCII compatible, 192 bytes if it is an ANSI character set, and only 64 bytes for CP1252, which would probably be the most-used conversion).
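To make the size argument concrete, the table approach can be sketched as follows. This is only an illustrative sketch, not the code from the Gist: `cp1252_to_utf8` is a hypothetical name, and CP1252 needs a table only for 0x80–0x9F (32 `UInt16` entries, i.e. the 64 bytes mentioned above), since every other byte matches Latin-1.

```julia
# Illustrative sketch of a table-driven CP1252 -> UTF-8 conversion.
# CP1252 agrees with Latin-1 except in the 0x80-0x9F range, so only a
# 32-entry table of UInt16 code points (64 bytes) is required.
const CP1252_80_9F = UInt16[
    0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
    0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
    0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
    0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
]

function cp1252_to_utf8(bytes::AbstractVector{UInt8})
    io = IOBuffer()
    for b in bytes
        if 0x80 <= b <= 0x9f
            print(io, Char(CP1252_80_9F[b - 0x7f]))  # table lookup
        else
            print(io, Char(b))  # byte value == code point, as in Latin-1
        end
    end
    String(take!(io))
end
```

For example, `cp1252_to_utf8(UInt8[0x48, 0x80])` yields `"H€"`, since CP1252 maps 0x80 to U+20AC.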

Should we move towards using this approach, at least for the 8-bit character set conversions?
It would also make it easy to add all of the options that Python 3 has for handling invalid characters:
error, remove, replace with a fixed replacement character (default 0xfffd) or string, insert a quoted XML escape sequence, or insert quoted as \uxxxx or \u{xxxx}.
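With a table-driven decoder, those invalid-character policies reduce to a branch on a sentinel table entry. The following is a hypothetical sketch (the function name `decode8`, the `invalid` keyword, and the 0xffff sentinel are all illustrative, not an existing API):

```julia
# Hypothetical sketch of Python-3-style invalid-character handling for a
# table-driven 8-bit decode. Unmapped bytes are marked 0xffff in the table.
function decode8(bytes::AbstractVector{UInt8}, table::AbstractVector{UInt16};
                 invalid::Symbol = :error, replacement::Char = '\ufffd')
    io = IOBuffer()
    for b in bytes
        cp = table[Int(b) + 1]
        if cp == 0xffff                  # sentinel: byte has no mapping
            invalid === :error   && error("invalid byte 0x", string(b, base = 16))
            invalid === :remove  && continue
            invalid === :replace && print(io, replacement)
        else
            print(io, Char(cp))
        end
    end
    String(take!(io))
end
```

The other options mentioned above (replacement strings, XML escapes, \uxxxx quoting) would just be additional branches on the same sentinel check.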
