libpostal is a C library for parsing/normalizing street addresses around the world.

**Original post**: [Statistical NLP on OpenStreetMap](https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-b9d573e6cc86)

**Follow-up for 1.0 release**: [Statistical NLP on OpenStreetMap: Part 2](https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-part-2-80405b988718)
Addresses and the locations they represent are essential for any application dealing with maps (place search, transportation, on-demand/delivery services, check-ins, reviews). Yet even the simplest addresses are packed with local conventions, abbreviations and context, making them difficult to index/query effectively with traditional full-text search engines. This library helps convert the free-form addresses that humans use into clean normalized forms suitable for machine comparison and full-text indexing. Though libpostal is not itself a full geocoder, it can be used as a preprocessing step to make any geocoding application smarter, simpler, and more consistent internationally.
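As a minimal sketch of what "machine comparison" means here, using `pypostal`, the official Python binding (the exact expansions returned can vary with the libpostal version and data files), two differently written forms of the same address can be matched by intersecting their sets of normalized expansions:

```python
from postal.expand import expand_address

# Two human-written variants of the same address
a = set(expand_address('30 W 26th St'))
b = set(expand_address('30 West 26th Street'))

# A shared normalized form means the two strings likely
# refer to the same address
print(a & b)  # e.g. {'30 west 26th street'}
```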
Examples of parsing
-------------------
libpostal's international address parser uses machine learning (Conditional Random Fields) and is trained on over 1 billion addresses in every inhabited country on Earth. We use [OpenStreetMap](https://openstreetmap.org) and [OpenAddresses](https://openaddresses.io) as sources of structured addresses, and the OpenCage address format templates at https://github.com/OpenCageData/address-formatting to construct the training data, supplementing with containing polygons, and generating sub-building components like apartment/floor numbers and PO boxes. We also add abbreviations, drop out components at random, etc. to make the parser as robust as possible to messy real-world input.
These example parse results are taken from the interactive address_parser program
that builds with libpostal when you run ```make```. Note that the parser can handle
commas vs. no commas as well as various casings and permutations of components (if the input
is e.g. just city or just city/postcode).
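A minimal sketch of the same model through `pypostal`'s `parse_address`, which returns `(token, label)` pairs (the labels shown in comments are illustrative, since output depends on the trained model):

```python
from postal.parser import parse_address

# Handles commas/no commas, mixed casing, and permuted components
print(parse_address('781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA'))
# Illustrative output:
# [('781', 'house_number'), ('franklin ave', 'road'),
#  ('crown heights', 'suburb'), ('brooklyn', 'city_district'),
#  ('nyc', 'city'), ('ny', 'state'), ('11216', 'postcode'), ('usa', 'country')]
```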
Examples of normalization
-------------------------
The expand_address API converts messy real-world addresses into normalized
equivalents suitable for search indexing, hashing, etc.
Here's an interactive example using the Python binding:
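A minimal sketch with `pypostal`, the official Python binding (the exact expansion strings vary with the installed model and data files):

```python
from postal.expand import expand_address

# expand_address returns a list of normalized candidate strings
for expansion in expand_address('Quatre-vingt-douze Ave des Champs-Élysées'):
    print(expansion)

# Illustrative output: lowercased, accent-stripped variants with
# abbreviations expanded, e.g. 'quatre-vingt-douze avenue des champs elysees'
```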
libpostal needs to download some data files from S3. The basic files are on-disk
representations of the data structures necessary to perform expansion. For address
parsing, since model training takes a few days, we publish the fully trained model
to S3 and will update it automatically as new addresses get added to OSM, OpenAddresses, etc. Same goes for the language classifier model.

Data files are automatically downloaded when you run make. To check for and download
updated data files, run make again.
…optionally be separated so Rosenstraße and Rosen Straße are equivalent.
- **International address parsing**: [Conditional Random Field](https://web.archive.org/web/20240104172655/http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/) which parses
"123 Main Street New York New York" into {"house_number": 123, "road":
"Main Street", "city": "New York", "state": "New York"}. The parser works
for a wide variety of countries and languages, not just US/English.
The model is trained on over 1 billion addresses and address-like strings, using the
templates in the [OpenCage address formatting repo](https://github.com/OpenCageData/address-formatting) to construct formatted,
tagged training examples for every inhabited country in the world. Many types of [normalizations](https://github.com/openvenues/libpostal/blob/master/scripts/geodata/addresses/components.py)
are performed to make the training data resemble messy real-world input as closely as possible.
…trained (using the [FTRL-Proximal](https://research.google.com/pubs/archive/41159.pdf) method) on formatted
addresses. Labels are derived using point-in-polygon tests for both OSM countries
and official/regional languages for countries and admin 1 boundaries
respectively. So, for example, Spanish is the default language in Spain but
in different regions, e.g. Catalunya, Galicia, the Basque region, the respective
regional languages are the default. Dictionary-based disambiguation is employed in
cases where the regional language is non-default, e.g. Welsh, Breton, Occitan.
The dictionaries are also used to abbreviate canonical phrases like "Calle" => "C/"
(performed on both the language classifier and the address parser training sets).
"quatre-vingt-douze" => 92, again using data provided in CLDR), supports > 30
533
534
languages. Handles languages with concatenated expressions e.g.
534
535
milleottocento => 1800. Optionally normalizes Roman numerals regardless of the
…strips accent marks e.g. à => a and/or applies Latin-ASCII transliteration.
- **Transliteration**: e.g. улица => ulica or ulitsa. Uses all
[CLDR transforms](http://www.unicode.org/repos/cldr/trunk/common/transforms/), the exact same source data as used by [ICU](http://site.icu-project.org/),
though libpostal doesn't require pulling in all of ICU (might conflict
with your system's version). Note: some languages, particularly Hebrew, Arabic
and Thai may not include vowels and thus will not often match a transliteration
done by a human. It may be possible to implement statistical transliterators
for some of these languages.
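A minimal sketch of transliteration and accent handling through `pypostal`'s `expand_address` (the outputs in comments are illustrative; actual expansions depend on version and data files):

```python
from postal.expand import expand_address

# Cyrillic input: expansions include Latin transliterations
print(expand_address('ул Ленина'))       # e.g. includes 'ulitsa lenina'

# Accented input: expansions include Latin-ASCII forms
print(expand_address('Champs-Élysées'))  # e.g. includes 'champs elysees'
```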
…places derived from terabytes of web pages from the [Common Crawl](http://commoncrawl.org/).
The Common Crawl is published monthly, and so even merging the results of
two crawls produces significant duplicates.

Deduping is a relatively well-studied field, and for text documents
like web pages, academic papers, etc. there exist pretty decent approximate
similarity methods such as [MinHash](https://en.wikipedia.org/wiki/MinHash).
So it's not a geocoder?
-----------------------
If the above sounds a lot like geocoding, that's because it is in a way,
only in the OpenVenues case, we have to geocode without a UI or a user
to select the correct address in an autocomplete dropdown. Given a database
of source addresses such as OpenAddresses or OpenStreetMap (or all of the above),
libpostal can be used to implement things like address deduping and server-side
batch geocoding in settings like MapReduce or stream processing.
…document search engines like Elasticsearch using giant synonyms files, scripting,
custom analyzers, tokenizers, and the like, geocoding can look like this:
1. Run the addresses in your database through libpostal's expand_address
2. Store the normalized string(s) in your favorite search engine, DB,
hashtable, etc.
3. Run your user queries or fresh imports through libpostal and search
your existing indices, as in the sketch below.
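A minimal end-to-end sketch of these three steps in Python, with `pypostal` for expansion and a plain dict standing in for the search engine (the `index`, `add_record`, and `lookup` names are illustrative, not part of libpostal):

```python
from postal.expand import expand_address

# normalized form -> set of record ids (stand-in for a search index)
index = {}

def add_record(record_id, address):
    # Steps 1-2: expand each source address and index every normalized form
    for form in expand_address(address):
        index.setdefault(form, set()).add(record_id)

def lookup(query):
    # Step 3: expand the query identically and probe the index
    matches = set()
    for form in expand_address(query):
        matches |= index.get(form, set())
    return matches

add_record('osm:1234', '30 W 26th St')
print(lookup('30 West 26th Street'))  # should print {'osm:1234'}
```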
0 commit comments