@james-willis
Contributor

I'm not sure of the unintended consequences this might have but it seems to mitigate the hanging without breaking other test cases.

Fixes #448

@albarrentine
Contributor

Not loving that as a solution. utf8proc_iterate returns an error code if the text isn't UTF-8, and adding a utf8proc_iterate_non_negative effectively means swallowing that error, returning a length of 1 as though the UTF-8 were valid, and moving along, which may be dangerous.
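To make the concern concrete, here is a minimal pure-Python sketch. The `iterate` function is a hypothetical stand-in that mimics the "negative length on invalid input" convention described above; it is not the utf8proc API, and the two counting helpers are invented for illustration only.

```python
from __future__ import annotations


def iterate(data: bytes, pos: int) -> tuple[int, int]:
    """Hypothetical stand-in: return (code_point, length), or (-1, -1) on invalid UTF-8."""
    lead = data[pos]
    if lead < 0x80:
        n = 1
    elif 0xC2 <= lead <= 0xDF:
        n = 2
    elif 0xE0 <= lead <= 0xEF:
        n = 3
    elif 0xF0 <= lead <= 0xF4:
        n = 4
    else:
        return -1, -1  # not a valid lead byte
    try:
        # Strict decoding rejects truncated, overlong, and surrogate sequences.
        ch = data[pos:pos + n].decode("utf-8")
    except UnicodeDecodeError:
        return -1, -1
    return ord(ch), n


def count_strict(data: bytes) -> int:
    """Propagate the error to the caller instead of hiding it."""
    pos = count = 0
    while pos < len(data):
        _, n = iterate(data, pos)
        if n < 0:
            raise ValueError(f"invalid UTF-8 at byte offset {pos}")
        pos += n
        count += 1
    return count


def count_swallowing(data: bytes) -> int:
    """The pattern being objected to: treat every error as a 1-byte character."""
    pos = count = 0
    while pos < len(data):
        _, n = iterate(data, pos)
        if n < 0:
            n = 1  # the error is lost here; the caller never learns the input was bad
        pos += n
        count += 1
    return count
```

With the Latin-1 bytes `b"caf\xe9"`, `count_swallowing` returns a plausible-looking count while `count_strict` raises, which is the difference between hiding bad input and surfacing it.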

I'll post some suggestions in #448 for how to handle it, but if you're using the Python binding and have really gnarly encodings from Internet text, Mojibake, etc., an immediate fix is to run the input through the ftfy library before passing it to libpostal. That will handle both of the cases posted and more, and should really be the preprocessing step in Python for most Internet text. I would like to port it to C at some point so it can be used in more places to deal with Internet text garbage, including here, although I'd prefer to keep a fast path for users who promise they'll take care of conforming the input.

The first case listed there is one we do need to address because it's a potential security issue: if certain sequences are encoded as HTML entities in ASCII, then technically the client "passed valid UTF-8" to libpostal, but we may still have to deal with invalid UTF-8 or Mojibake internally after replacing the HTML entities. Though this is really something the user could do on their own, I did want to be flexible about the most basic mistakes, like passing an &amp;amp; in the input, at my own peril, because that lets all sorts of other input in. So I think we probably do need an additional round of UTF-8 validation, at least after that one rare case. The transliterator only returns a new string if it finds matches in its trie (an HTML entity in this case), so the fast path would not incur any extra overhead.
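The idea of validating again only when entity replacement actually changed something can be sketched as follows. This is an illustration in Python, not libpostal's transliterator: the regex, the function name, and the choice to handle only hexadecimal numeric entities are all assumptions made for the example.

```python
import re

# Hypothetical pattern: only hexadecimal numeric entities such as &#xE9;
_HEX_ENTITY = re.compile(r"&#x([0-9A-Fa-f]{1,6});")


def replace_entities_checked(text: str) -> str:
    """Replace hex entities, then re-validate only if a replacement happened."""
    # chr() itself raises ValueError for values above 0x10FFFF.
    replaced, count = _HEX_ENTITY.subn(
        lambda m: chr(int(m.group(1), 16)), text
    )
    if count == 0:
        # Fast path: nothing matched, so the input is returned untouched
        # and no extra validation cost is paid.
        return text
    try:
        # Pure-ASCII input can still expand to a lone surrogate, which
        # cannot be encoded as standard UTF-8.
        replaced.encode("utf-8")
    except UnicodeEncodeError as exc:
        raise ValueError(
            f"entity replacement produced invalid text at index {exc.start}"
        ) from exc
    return replaced
```

An input like `"x&#xD800;y"` is valid ASCII on the way in, yet fails the second check because the entity expands to an unpaired surrogate.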

The other case listed there is a surrogate problem, which I guess we can fix for the user internally if encountered, similar to how ftfy does it. But it only comes up because the user is not passing valid UTF-8 (you can't even print that as a string in Python), which is part of the contract with the C library. It seems to come from Java's use of "modified UTF-8", which represents every 4-byte UTF-8 character as a UTF-16 surrogate pair, with each surrogate encoded separately as three bytes (idk why...), so I think the fix there is just converting Java's strings to standard UTF-8 before passing them to libpostal.
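For reference, the conversion from modified UTF-8 to standard UTF-8 can be sketched with Python's standard codecs. This is only an illustration of the byte-level transformation, not the jpostal or libpostal fix; the function name is made up for the example.

```python
def modified_utf8_to_utf8(data: bytes) -> bytes:
    """Convert Java-style modified UTF-8 bytes to standard UTF-8.

    Raises UnicodeDecodeError (a ValueError subclass) if a surrogate is unpaired.
    """
    # Modified UTF-8 writes NUL as the two bytes C0 80; C0 never occurs in
    # standard UTF-8, so this replacement cannot hit legitimate data.
    data = data.replace(b"\xc0\x80", b"\x00")
    # "surrogatepass" lets each 3-byte surrogate encoding decode to a
    # surrogate code point instead of raising.
    text = data.decode("utf-8", "surrogatepass")
    # Round-tripping through UTF-16 joins each high/low pair into one
    # supplementary code point; a lone surrogate fails the strict decode.
    text = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return text.encode("utf-8")
```

For example, U+1F600 arrives from modified UTF-8 as the six bytes `ED A0 BD ED B8 80` and leaves as the four bytes `F0 9F 98 80`.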

@vcschapp

vcschapp commented Jul 2, 2025

@albarrentine @james-willis Is there an interim solution that fails fast with an error instead of swallowing it, and is very surgical, even if it isn't perfect, i.e. isn't as thorough as the more complete solutions Al is discussing?

IMHO failing fast with some "false positive errors" would be an improvement over today if hanging the library completely is the alternative outcome. We could then try to drive the "false positive error rate" down to zero with further enhancements.
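One shape such an interim guard could take is a check at the binding boundary that rejects anything that is not strict UTF-8 before the library is called. The sketch below is hypothetical: `checked_call` and its `fn` parameter are invented names, and `fn` stands in for whatever binding function would receive the bytes.

```python
def checked_call(address, fn):
    """Validate at the boundary, then hand the UTF-8 bytes to fn.

    fn is a hypothetical stand-in for a binding call that expects valid UTF-8.
    """
    if isinstance(address, str):
        try:
            # A str holding lone surrogates cannot be encoded strictly.
            data = address.encode("utf-8")
        except UnicodeEncodeError as exc:
            raise ValueError(
                f"input is not encodable as UTF-8 at index {exc.start}"
            ) from exc
    else:
        data = bytes(address)
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as exc:
            raise ValueError(
                f"input is not valid UTF-8 at byte offset {exc.start}"
            ) from exc
    return fn(data)
```

A strict check like this rejects inputs that a more thorough repair step could have salvaged, which is the "false positive" trade-off described above.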

@albarrentine
Contributor

implemented and merged in #703

@albarrentine
Contributor

there is still a to-do for handling the Java encoding correctly (it won’t hang when this happens, but Java can still use its bespoke encoding in the 4-byte UTF-8 range and encode surrogate pairs).

There are two PRs out there, one of which should be fine to rebase and merge now that the JDK requirements are updated, if anyone wants to give it a go:

This was the original: openvenues/jpostal#38

And this is the one in the Overture Maps fork based on it: OvertureMaps/jpostal#1

Development

Successfully merging this pull request may close these issues.

expand_address hangs with certain strings, with invalid UTF-8 warnings

3 participants