Following issues #180, #25, and several others, I'd like to make character sanitization more robust.
I've previously tried to have the code do something like the following:

    import (
        "bytes"
        "fmt"
    )

    // sanitizeXML walks the input rune by rune and rewrites anything that is
    // not a legal XML 1.0 character.
    func sanitizeXML(xmlData string) string {
        var buffer bytes.Buffer
        for _, r := range xmlData {
            if isLegalXMLChar(r) {
                buffer.WriteRune(r)
            } else {
                // Replace illegal characters with their XML character reference.
                // You can also skip writing illegal characters by commenting the next line.
                // Note that XML 1.0 also forbids character references to illegal
                // characters, so skipping them is the safer choice.
                fmt.Fprintf(&buffer, "&#x%X;", r)
            }
        }
        return buffer.String()
    }

    // isLegalXMLChar reports whether r is in the XML 1.0 Char range.
    func isLegalXMLChar(r rune) bool {
        return r == 0x9 || r == 0xA || r == 0xD ||
            (r >= 0x20 && r <= 0xD7FF) ||
            (r >= 0xE000 && r <= 0xFFFD) ||
            (r >= 0x10000 && r <= 0x10FFFF)
    }

However, old issue #21 indicated that sanitizing these characters this way broke the parsing of non-UTF-8 feeds.
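One plausible cause of the problem in #21 (an assumption on my part, not something confirmed in that issue): ranging over a Go string decodes it as UTF-8, and every byte that is not valid UTF-8 comes out as U+FFFD. That rune passes `isLegalXMLChar`, so the sanitizer silently rewrites the bytes of, say, an ISO-8859-1 feed before any charset conversion can happen. A minimal demonstration, where `rebuild` is a hypothetical stand-in for the loop in `sanitizeXML`:

```go
package main

import (
	"fmt"
	"strings"
)

// rebuild copies s rune by rune, the same way sanitizeXML walks its input.
func rebuild(s string) string {
	var b strings.Builder
	for _, r := range s {
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	latin1 := "caf\xe9" // "café" encoded as ISO-8859-1, which is not valid UTF-8

	fmt.Printf("% x\n", latin1)          // 63 61 66 e9
	fmt.Printf("% x\n", rebuild(latin1)) // 63 61 66 ef bf bd
}
```

The original `0xE9` byte is gone after the round trip, so a charset reader that runs later has nothing left to convert.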
Does anyone have suggestions for how to accommodate both of these requirements?
- Stripping illegal characters from feeds to prevent the XML parser from throwing an error
- Allowing the parsing of non-UTF-8 feeds
Any help would be much appreciated!
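One possible direction, sketched under stated assumptions rather than proposed as the fix: filter at the byte level before any decoding, and only drop the C0 control bytes that XML 1.0 forbids. This assumes the feed uses an ASCII-compatible encoding (UTF-8, ISO-8859-x, Windows-125x and similar), where those byte values always mean the control character itself. It would not be safe for UTF-16, or for ISO-2022-JP, which uses ESC (`0x1B`) in its escape sequences. It also does not catch U+FFFE, U+FFFF, or illegal numeric character references; those would need a rune-level pass after the feed has been converted to UTF-8. The names `isIllegalControlByte` and `stripIllegalControlBytes` are hypothetical.

```go
package main

import "fmt"

// isIllegalControlByte reports whether b is a C0 control byte that XML 1.0
// forbids (everything below 0x20 except tab, line feed and carriage return).
func isIllegalControlByte(b byte) bool {
	return b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D
}

// stripIllegalControlBytes (hypothetical helper) drops those bytes and copies
// everything else through unchanged, so bytes >= 0x80 are never reinterpreted.
func stripIllegalControlBytes(data []byte) []byte {
	out := make([]byte, 0, len(data))
	for _, b := range data {
		if !isIllegalControlByte(b) {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	// An ISO-8859-1 fragment containing a stray 0x0B (vertical tab).
	raw := []byte("<title>caf\xe9\x0b</title>")
	fmt.Printf("%q\n", stripIllegalControlBytes(raw)) // "<title>caf\xe9</title>"
}
```

Because the non-ASCII byte `0xE9` passes through untouched, whatever charset conversion runs afterwards still sees the original encoding.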