Hi
Firstly, thanks for all the work you have done. In order to avoid fluff, I'll be posting the context serially.
- United States
- I'm using Libpostal to dedupe addresses within the Hadoop ecosystem (Hive on Tez).
- I have a fairly large set of over 200 million addresses, a sizeable chunk of which are human-entered values. Given the nature of my data, I have encountered a few cases that cause the expand_address function to hang and stall my job.
a) The most baffling case.
>>> expand_address(u'5-19�� Nakamachi')
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
This is an entirely ASCII string, which halts the program. Using parse_address also throws warnings, but continues gracefully.
>>> parse_address(u'5-19�� Nakamachi')
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
[(u'5-19� � Nakamachi', u'house')]
b) Address: "No. \uD835\uDFE3\uD835\uDFE3"
This renders as "No. 11". It works fine with pypostal; however, it similarly halts the program when using jpostal. My guess is that this has something to do with the JNI function GetStringUTFChars, used in the C interface, not working well with 4-byte UTF-8 characters, since Java converts its internal UTF-16 String to a Modified UTF-8 format.
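To make the guess above concrete, here is a minimal Python 3 sketch of how Modified UTF-8 differs from standard UTF-8 for a character outside the Basic Multilingual Plane. The helper name to_modified_utf8 is hypothetical and only imitates the encoding rules (each UTF-16 surrogate written as its own 3-byte sequence, and U+0000 written as two bytes). It is not jpostal's or the JVM's code, and it does not prove that this is what causes the hang.

```python
def to_modified_utf8(text):
    """Imitate Java's Modified UTF-8 encoding (illustrative helper, not a real API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # Modified UTF-8 encodes U+0000 as two bytes instead of one.
            out += b"\xc0\x80"
        elif cp >= 0x10000:
            # Supplementary characters are split into a UTF-16 surrogate pair,
            # and each surrogate is encoded separately as 3 bytes.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


def is_strict_utf8(data):
    """Return True if the bytes decode as standard (strict) UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


address = "No. \U0001D7E3\U0001D7E3"  # same text as "No. \uD835\uDFE3\uD835\uDFE3" in Java

standard = address.encode("utf-8")    # 4 bytes per digit: f0 9d 9f a3
modified = to_modified_utf8(address)  # 6 bytes per digit: ed a0 b5 ed bf a3

print(standard.hex(), is_strict_utf8(standard))
print(modified.hex(), is_strict_utf8(modified))
```

If the native side receives the second byte sequence and treats it as standard UTF-8, a strict decoder would see it as malformed, which would be consistent with the "invalid UTF-8" warnings in case a).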
These cases are rare, but they can block entire processes, which makes them problematic. Is there some way to have this function exit gracefully when it hits UTF-8 parsing errors?
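In the meantime, one possible stopgap on the caller's side is to screen each value before it reaches expand_address. The sketch below is Python 3 and makes some assumptions: the helper name safe_utf8_text is hypothetical, the sample rows are made up for illustration, and I have not confirmed that this check catches the exact input from case a), since I don't know which raw bytes trigger the warning.

```python
def safe_utf8_text(value):
    """Return value as str if it is well-formed UTF-8 text, otherwise None.

    Hypothetical pre-filter, not part of libpostal or pypostal.
    """
    try:
        if isinstance(value, bytes):
            # Strict decoding rejects malformed byte sequences,
            # including encoded surrogates such as ed a0 b5.
            return value.decode("utf-8")
        # Strict encoding rejects lone surrogates inside a str.
        value.encode("utf-8")
        return value
    except UnicodeError:
        return None


# Made-up sample rows: one clean, one with malformed bytes, one with a lone surrogate.
rows = [b"5-19 Nakamachi", b"5-19\xed\xa0\xb5 Nakamachi", "No. \ud835"]

clean = [text for text in map(safe_utf8_text, rows) if text is not None]
rejected = len(rows) - len(clean)
print(clean, rejected)
```

Only the values in clean would then be passed on to expand_address, and the rejected rows could be logged for manual review instead of blocking the job.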
Thanks,
Nitin