Make illegal character sanitization more robust #206

@mmcdole

Following issues #180, #25, and several others, I'd like to make illegal character sanitization more robust.

I previously tried something like the following:

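import (
	"bytes"
	"fmt"
)

// sanitizeXML rewrites xmlData rune by rune, keeping legal XML characters
// and replacing illegal ones with a numeric character reference.
func sanitizeXML(xmlData string) string {
	var buffer bytes.Buffer

	// range decodes xmlData as UTF-8, so any byte that is not valid UTF-8
	// arrives here as U+FFFD, which isLegalXMLChar accepts.
	for _, r := range xmlData {
		if isLegalXMLChar(r) {
			buffer.WriteRune(r)
		} else {
			// Replace the illegal character with its XML character reference.
			// To drop illegal characters instead, comment out the next line.
			// Note: XML 1.0 also requires character references to name legal
			// characters, so a strict parser may still reject the reference.
			fmt.Fprintf(&buffer, "&#x%X;", r)
		}
	}

	return buffer.String()
}

// isLegalXMLChar reports whether r matches the Char production of XML 1.0.
func isLegalXMLChar(r rune) bool {
	return r == 0x9 || r == 0xA || r == 0xD ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}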

However, an old issue, #21, indicated that sanitizing these characters broke the parsing of non-UTF-8 feeds.
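
One plausible explanation, which is an assumption and not something confirmed in issue #21, is that ranging over a Go string decodes it as UTF-8. Every byte that is not valid UTF-8 then comes back as U+FFFD, so a feed in another encoding is rewritten before any charset conversion can run. The sketch below uses a hypothetical helper, rewrite, that copies a string rune by rune the same way the sanitizer does:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// rewrite is a hypothetical helper that copies s rune by rune,
// mirroring the loop structure of the sanitizer.
func rewrite(s string) string {
	var b strings.Builder
	for _, r := range s {
		// For a byte that is not valid UTF-8, r is utf8.RuneError (U+FFFD).
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	latin1 := "Caf\xe9" // "Café" encoded as ISO-8859-1, which is not valid UTF-8
	out := rewrite(latin1)

	fmt.Println(utf8.ValidString(latin1)) // false
	fmt.Println(out == "Caf\uFFFD")       // true: the 0xE9 byte was replaced
	fmt.Println(len(latin1), len(out))    // 4 6
}
```

If this is what happened, the original byte is gone by the time a charset decoder sees the data, and it cannot be recovered.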

If anyone has any suggestions for how to accommodate both requirements:

  • Stripping illegal characters from feeds to prevent the XML parser from throwing an error
  • Allowing the parsing of non-UTF-8 feeds

It would be much appreciated!
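
One possible direction, offered as a rough sketch and not as a tested fix for this library, is to change the order of operations: convert the feed to UTF-8 first, strip illegal characters second, and parse last. In the example below, latin1ToUTF8, stripIllegal, and parseTitle are hypothetical names, and the ISO-8859-1 conversion stands in for a real charset decoder (for instance, one from golang.org/x/net/html/charset). The charset is passed in as a flag here; a real implementation would have to detect it from the HTTP headers or the XML declaration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// isLegalXMLChar reports whether r matches the Char production of XML 1.0.
func isLegalXMLChar(r rune) bool {
	return r == 0x9 || r == 0xA || r == 0xD ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

// latin1ToUTF8 is a hypothetical stand-in for a real charset decoder.
// ISO-8859-1 bytes map one-to-one onto the first 256 Unicode code points.
func latin1ToUTF8(data []byte) string {
	var b strings.Builder
	for _, c := range data {
		b.WriteRune(rune(c))
	}
	return b.String()
}

// stripIllegal drops illegal XML characters. It expects UTF-8 input.
func stripIllegal(s string) string {
	return strings.Map(func(r rune) rune {
		if isLegalXMLChar(r) {
			return r
		}
		return -1 // a negative result tells strings.Map to drop the rune
	}, s)
}

// parseTitle decodes first, sanitizes second, and parses last.
func parseTitle(raw []byte, isLatin1 bool) (string, error) {
	text := string(raw)
	if isLatin1 {
		text = latin1ToUTF8(raw)
	}
	dec := xml.NewDecoder(strings.NewReader(stripIllegal(text)))
	// The text is already UTF-8, so the declared encoding is passed through.
	dec.CharsetReader = func(label string, in io.Reader) (io.Reader, error) {
		return in, nil
	}
	var doc struct {
		Title string `xml:"title"`
	}
	err := dec.Decode(&doc)
	return doc.Title, err
}

func main() {
	// "Caf\xe9" is ISO-8859-1; \x01 is an illegal control character.
	raw := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" +
		"<feed><title>Caf\xe9\x01</title></feed>")
	title, err := parseTitle(raw, true)
	fmt.Println(title, err)
}
```

Because the sanitizer only ever sees UTF-8 in this arrangement, it no longer has to guess whether a high byte is part of a legitimate character in another encoding. The sketch drops illegal characters instead of emitting character references, since a reference to an illegal character is itself not well-formed XML 1.0.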
