XmlIO.Read does not handle XML encoding per spec #20818

Open
@damccorm

Description

Not sure what the implementation problem is, but based on the API doc, there's a real flaw in XmlIO.Read:

By default, UTF-8 charset is used. To specify a different charset, use XmlIO.Read.withCharset(java.nio.charset.Charset) (https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/io/xml/XmlIO.Read.html#withCharset-java.nio.charset.Charset-).

Currently, only XML files that use single-byte characters are supported. Using a file that contains multi-byte characters may result in data loss or duplication.

Properly handled, there is never any need to specify the character encoding when reading an XML document. XML documents fully identify their character encoding. The developer at this level doesn't need to know and shouldn't think about the character encoding. Perhaps somewhere in the source code someone is using a Reader where they should be using an InputStream instead? That might lead to this problem.
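
To illustrate the Reader versus InputStream point, here is a minimal sketch using the JDK's built-in DOM parser. It is not taken from the Beam source; the class name XmlEncodingDemo and both helper methods are hypothetical and exist only for this example. The same ISO-8859-1 document is parsed twice: once as raw bytes, so the parser can apply the declared encoding, and once through a Reader that has already decoded the bytes with a caller-chosen charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Beam.
public class XmlEncodingDemo {

    // Hands raw bytes to the parser, which reads the encoding declaration itself.
    static String textFromStream(byte[] xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml))
                .getDocumentElement().getTextContent();
    }

    // Decodes the bytes up front with a caller-chosen charset. The parser then
    // receives characters only and cannot apply the declared encoding.
    static String textFromReader(byte[] xml, Charset charset) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), charset);
        return builder.parse(new InputSource(reader))
                .getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // A document that declares ISO-8859-1 and is encoded that way.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>"
                .getBytes(StandardCharsets.ISO_8859_1);

        // Does each approach recover the original text?
        System.out.println(textFromStream(xml).equals("caf\u00e9"));
        System.out.println(textFromReader(xml, StandardCharsets.UTF_8).equals("caf\u00e9"));
    }
}
```

The stream variant should recover the text without any charset argument, while the Reader variant with a mismatched charset should not. Whether XmlIO.Read actually has this kind of bug is an open question in this report, not something the sketch demonstrates.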

Also, the text contradicts itself. UTF-8 is a multi-byte encoding: every non-ASCII character takes more than one byte. I hope that doesn't lead to data loss or duplication by default.

Imported from Jira BEAM-11875. Original Jira may contain additional context.
Reported by: elharo.
