Make illegal character sanitization more robust

Following issues #180 and #25, among others, I'd like to make character sanitization more robust.

I've previously tried to have the code do something like the following:

```golang
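import (
	"bytes"
	"fmt"
)

// sanitizeXML rewrites every rune that falls outside the XML 1.0 Char
// production and copies all other runes through unchanged.
func sanitizeXML(xmlData string) string {
	var buffer bytes.Buffer

	for _, r := range xmlData {
		if isLegalXMLChar(r) {
			buffer.WriteRune(r)
		} else {
			// Replace illegal characters with their XML character reference.
			// Caveat: XML 1.0 also requires referenced characters to match
			// Char, so a strict parser may still reject the reference.
			// You can also skip writing illegal characters by commenting the next line.
			fmt.Fprintf(&buffer, "&#x%X;", r)
		}
	}

	return buffer.String()
}

// isLegalXMLChar reports whether r matches the XML 1.0 Char production.
func isLegalXMLChar(r rune) bool {
	return r == 0x9 || r == 0xA || r == 0xD ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}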
```

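However, the old issue #21 indicated that sanitizing these characters broke the parsing of non-UTF-8 feeds. A likely cause is that ranging over a Go string decodes it as UTF-8, so any byte sequence that is not valid UTF-8 comes out as U+FFFD before a charset conversion has had a chance to run.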

Any suggestions for how to accommodate both of these requirements would be much appreciated:

- Stripping illegal characters from feeds to prevent the XML parser from throwing an error
- Allowing the parsing of non-UTF-8 feeds
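
One possible direction, offered only as a rough sketch: sanitize at the byte level instead of the rune level, so that bytes at or above 0x80 are never decoded or rewritten. The helper names below (`isIllegalControlByte`, `stripIllegalControlBytes`) are hypothetical and not part of the existing code. The sketch assumes the feed uses an ASCII-compatible encoding (UTF-8, ISO-8859-x, windows-125x and similar); it would corrupt UTF-16 or UTF-32 input, and it does not catch illegal code points above the ASCII range such as U+FFFE or U+FFFF.

```go
package main

// isIllegalControlByte reports whether b is a C0 control byte that XML 1.0
// forbids: everything below 0x20 except tab, line feed, and carriage return.
func isIllegalControlByte(b byte) bool {
	return b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D
}

// stripIllegalControlBytes is a hypothetical helper. It drops the forbidden
// control bytes and copies every other byte through untouched, including
// bytes >= 0x80, so no UTF-8 decoding happens at this stage.
// Assumption: the input encoding is ASCII-compatible.
func stripIllegalControlBytes(data []byte) []byte {
	out := make([]byte, 0, len(data))
	for _, b := range data {
		if !isIllegalControlByte(b) {
			out = append(out, b)
		}
	}
	return out
}
```

For example, the Latin-1 input `"caf\xE9\x00 ok"` would come out as `"caf\xE9 ok"`: the NUL byte is removed while the non-UTF-8 byte `0xE9` is left for the later charset conversion to handle.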

Make illegal character sanitization more robust #206

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Make illegal character sanitization more robust #206

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions