Commit 2b0dc98

Merge pull request #699 from le0pard/patch-1
Update README.md with new server
2 parents 0d05426 + 47d8a30 commit 2b0dc98

1 file changed

+16
-15
lines changed

README.md

Lines changed: 16 additions & 15 deletions
@@ -11,7 +11,7 @@ libpostal is a C library for parsing/normalizing street addresses around the wor
 - **Original post**: [Statistical NLP on OpenStreetMap](https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-b9d573e6cc86)
 - **Follow-up for 1.0 release**: [Statistical NLP on OpenStreetMap: Part 2](https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-part-2-80405b988718)

-<span>&#x1f1e7;&#x1f1f7;</span> <span>&#x1f1eb;&#x1f1ee;</span> <span>&#x1f1f3;&#x1f1ec;</span> :jp: <span>&#x1f1fd;&#x1f1f0; </span> <span>&#x1f1e7;&#x1f1e9; </span> <span>&#x1f1f5;&#x1f1f1; </span> <span>&#x1f1fb;&#x1f1f3; </span> <span>&#x1f1e7;&#x1f1ea; </span> <span>&#x1f1f2;&#x1f1e6; </span> <span>&#x1f1fa;&#x1f1e6; </span> <span>&#x1f1ef;&#x1f1f2; </span> :ru: <span>&#x1f1ee;&#x1f1f3; </span> <span>&#x1f1f1;&#x1f1fb; </span> <span>&#x1f1e7;&#x1f1f4; </span> :de: <span>&#x1f1f8;&#x1f1f3; </span> <span>&#x1f1e6;&#x1f1f2; </span> :kr: <span>&#x1f1f3;&#x1f1f4; </span> <span>&#x1f1f2;&#x1f1fd; </span> <span>&#x1f1e8;&#x1f1ff; </span> <span>&#x1f1f9;&#x1f1f7; </span> :es: <span>&#x1f1f8;&#x1f1f8; </span> <span>&#x1f1ea;&#x1f1ea; </span> <span>&#x1f1e7;&#x1f1ed; </span> <span>&#x1f1f3;&#x1f1f1; </span> :cn: <span>&#x1f1f5;&#x1f1f9; </span> <span>&#x1f1f5;&#x1f1f7; </span> :gb: <span>&#x1f1f5;&#x1f1f8; </span>
+<span>&#x1f1e7;&#x1f1f7;</span> <span>&#x1f1eb;&#x1f1ee;</span> <span>&#x1f1f3;&#x1f1ec;</span> :jp: <span>&#x1f1fd;&#x1f1f0; </span> <span>&#x1f1e7;&#x1f1e9; </span> <span>&#x1f1f5;&#x1f1f1; </span> <span>&#x1f1fb;&#x1f1f3; </span> <span>&#x1f1e7;&#x1f1ea; </span> <span>&#x1f1f2;&#x1f1e6; </span> <span>&#x1f1fa;&#x1f1e6; </span> <span>&#x1f1ef;&#x1f1f2; </span> :ru: <span>&#x1f1ee;&#x1f1f3; </span> <span>&#x1f1f1;&#x1f1fb; </span> <span>&#x1f1e7;&#x1f1f4; </span> :de: <span>&#x1f1f8;&#x1f1f3; </span> <span>&#x1f1e6;&#x1f1f2; </span> :kr: <span>&#x1f1f3;&#x1f1f4; </span> <span>&#x1f1f2;&#x1f1fd; </span> <span>&#x1f1e8;&#x1f1ff; </span> <span>&#x1f1f9;&#x1f1f7; </span> :es: <span>&#x1f1f8;&#x1f1f8; </span> <span>&#x1f1ea;&#x1f1ea; </span> <span>&#x1f1e7;&#x1f1ed; </span> <span>&#x1f1f3;&#x1f1f1; </span> :cn: <span>&#x1f1f5;&#x1f1f9; </span> <span>&#x1f1f5;&#x1f1f7; </span> :gb: <span>&#x1f1f5;&#x1f1f8; </span>

 Addresses and the locations they represent are essential for any application dealing with maps (place search, transportation, on-demand/delivery services, check-ins, reviews). Yet even the simplest addresses are packed with local conventions, abbreviations and context, making them difficult to index/query effectively with traditional full-text search engines. This library helps convert the free-form addresses that humans use into clean normalized forms suitable for machine comparison and full-text indexing. Though libpostal is not itself a full geocoder, it can be used as a preprocessing step to make any geocoding application smarter, simpler, and more consistent internationally.

@@ -225,7 +225,7 @@ Examples of parsing

 libpostal's international address parser uses machine learning (Conditional Random Fields) and is trained on over 1 billion addresses in every inhabited country on Earth. We use [OpenStreetMap](https://openstreetmap.org) and [OpenAddresses](https://openaddresses.io) as sources of structured addresses, and the OpenCage address format templates at: https://github.com/OpenCageData/address-formatting to construct the training data, supplementing with containing polygons, and generating sub-building components like apartment/floor numbers and PO boxes. We also add abbreviations, drop out components at random, etc. to make the parser as robust as possible to messy real-world input.

-These example parse results are taken from the interactive address_parser program
+These example parse results are taken from the interactive address_parser program
 that builds with libpostal when you run ```make```. Note that the parser can handle
 commas vs. no commas as well as various casings and permutations of components (if the input
 is e.g. just city or just city/postcode).
@@ -306,14 +306,14 @@ Examples of normalization
 -------------------------

 The expand_address API converts messy real-world addresses into normalized
-equivalents suitable for search indexing, hashing, etc.
+equivalents suitable for search indexing, hashing, etc.

 Here's an interactive example using the Python binding:

 ![expand](https://cloud.githubusercontent.com/assets/238455/14115012/52990d14-f5a7-11e5-9797-159dacdf8c5f.gif)

 libpostal contains an OSM-trained language classifier to detect which language(s) are used in a given
-address so it can apply the appropriate normalizations. The only input needed is the raw address string.
+address so it can apply the appropriate normalizations. The only input needed is the raw address string.
 Here's a short list of some less straightforward normalizations in various languages.

 | Input | Output (may be multiple in libpostal) |
@@ -437,6 +437,7 @@ Libpostal is designed to be used by higher-level languages. If you don't see yo

 **Unofficial servers**

+- Libpostal REST GO Server (need ~4Gb memory) with basic security: [postal_server](https://github.com/le0pard/postal_server)
 - Libpostal REST Go Docker: [libpostal-rest-docker](https://github.com/johnlonganecker/libpostal-rest-docker)
 - Libpostal REST FastAPI Docker: [libpostal-fastapi](https://github.com/alpha-affinity/libpostal-fastapi)
 - Libpostal ZeroMQ Docker: [libpostal-zeromq](https://github.com/pasupulaphani/libpostal-docker)
@@ -460,7 +461,7 @@ Data files

 libpostal needs to download some data files from S3. The basic files are on-disk
 representations of the data structures necessary to perform expansion. For address
-parsing, since model training takes a few days, we publish the fully trained model
+parsing, since model training takes a few days, we publish the fully trained model
 to S3 and will update it automatically as new addresses get added to OSM, OpenAddresses, etc. Same goes for the language classifier model.

 Data files are automatically downloaded when you run make. To check for and download
@@ -511,7 +512,7 @@ optionally be separated so Rosenstraße and Rosen Straße are equivalent.
 - **International address parsing**: [Conditional Random Field](https://web.archive.org/web/20240104172655/http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/) which parses
 "123 Main Street New York New York" into {"house_number": 123, "road":
 "Main Street", "city": "New York", "state": "New York"}. The parser works
-for a wide variety of countries and languages, not just US/English.
+for a wide variety of countries and languages, not just US/English.
 The model is trained on over 1 billion addresses and address-like strings, using the
 templates in the [OpenCage address formatting repo](https://github.com/OpenCageData/address-formatting) to construct formatted,
 tagged training examples for every inhabited country in the world. Many types of [normalizations](https://github.com/openvenues/libpostal/blob/master/scripts/geodata/addresses/components.py)
@@ -522,13 +523,13 @@ trained (using the [FTRL-Proximal](https://research.google.com/pubs/archive/4115
 addresses. Labels are derived using point-in-polygon tests for both OSM countries
 and official/regional languages for countries and admin 1 boundaries
 respectively. So, for example, Spanish is the default language in Spain but
-in different regions e.g. Catalunya, Galicia, the Basque region, the respective
+in different regions e.g. Catalunya, Galicia, the Basque region, the respective
 regional languages are the default. Dictionary-based disambiguation is employed in
 cases where the regional language is non-default e.g. Welsh, Breton, Occitan.
 The dictionaries are also used to abbreviate canonical phrases like "Calle" => "C/"
 (performed on both the language classifier and the address parser training sets)

-- **Numeric expression parsing** ("twenty first" => 21st,
+- **Numeric expression parsing** ("twenty first" => 21st,
 "quatre-vingt-douze" => 92, again using data provided in CLDR), supports > 30
 languages. Handles languages with concatenated expressions e.g.
 milleottocento => 1800. Optionally normalizes Roman numerals regardless of the
@@ -543,9 +544,9 @@ strips accent marks e.g. à => a and/or applies Latin-ASCII transliteration.

 - **Transliteration**: e.g. улица => ulica or ulitsa. Uses all
 [CLDR transforms](http://www.unicode.org/repos/cldr/trunk/common/transforms/), the exact same source data as used by [ICU](http://site.icu-project.org/),
-though libpostal doesn't require pulling in all of ICU (might conflict
+though libpostal doesn't require pulling in all of ICU (might conflict
 with your system's version). Note: some languages, particularly Hebrew, Arabic
-and Thai may not include vowels and thus will not often match a transliteration
+and Thai may not include vowels and thus will not often match a transliteration
 done by a human. It may be possible to implement statistical transliterators
 for some of these languages.

@@ -570,7 +571,7 @@ places derived from terabytes of web pages from the [Common Crawl](http://common
 The Common Crawl is published monthly, and so even merging the results of
 two crawls produces significant duplicates.

-Deduping is a relatively well-studied field, and for text documents
+Deduping is a relatively well-studied field, and for text documents
 like web pages, academic papers, etc. there exist pretty decent approximate
 similarity methods such as [MinHash](https://en.wikipedia.org/wiki/MinHash).

@@ -603,9 +604,9 @@ So it's not a geocoder?
 -----------------------

 If the above sounds a lot like geocoding, that's because it is in a way,
-only in the OpenVenues case, we have to geocode without a UI or a user
-to select the correct address in an autocomplete dropdown. Given a database
-of source addresses such as OpenAddresses or OpenStreetMap (or all of the above),
+only in the OpenVenues case, we have to geocode without a UI or a user
+to select the correct address in an autocomplete dropdown. Given a database
+of source addresses such as OpenAddresses or OpenStreetMap (or all of the above),
 libpostal can be used to implement things like address deduping and server-side
 batch geocoding in settings like MapReduce or stream processing.

@@ -614,7 +615,7 @@ document search engines like Elasticsearch using giant synonyms files, scripting
 custom analyzers, tokenizers, and the like, geocoding can look like this:

 1. Run the addresses in your database through libpostal's expand_address
-2. Store the normalized string(s) in your favorite search engine, DB,
+2. Store the normalized string(s) in your favorite search engine, DB,
 hashtable, etc.
 3. Run your user queries or fresh imports through libpostal and search
 the existing database using those strings
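
The three steps in the final hunk above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not code from the README or from this commit: `build_index`, `lookup`, and `toy_expand` are hypothetical names, and `toy_expand` is a deliberately tiny stand-in for libpostal's expand_address so the example runs without the library or its data files. With the Python binding installed, the intent is that the real `expand_address` function (which returns a list of normalized strings) would be passed in its place.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

# Any function that maps a raw address to one or more normalized strings.
Expander = Callable[[str], List[str]]


def toy_expand(address: str) -> List[str]:
    """Hypothetical stand-in for expand_address (NOT libpostal's logic).

    Lowercases, strips commas and periods, and expands three English
    abbreviations. Real expansions are far richer and may return several
    candidate strings per input.
    """
    abbreviations = {"st": "street", "ave": "avenue", "rd": "road"}
    tokens = address.lower().replace(",", " ").replace(".", " ").split()
    return [" ".join(abbreviations.get(token, token) for token in tokens)]


def build_index(addresses: Dict[str, str], expand: Expander) -> Dict[str, Set[str]]:
    """Steps 1 and 2: normalize every stored address and index it by each form."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for record_id, raw in addresses.items():
        for normalized in expand(raw):
            index[normalized].add(record_id)
    return index


def lookup(query: str, index: Dict[str, Set[str]], expand: Expander) -> Set[str]:
    """Step 3: normalize the query the same way and collect matching record ids."""
    matches: Set[str] = set()
    for normalized in expand(query):
        matches |= index.get(normalized, set())
    return matches


if __name__ == "__main__":
    index = build_index({"a": "123 Main St.", "b": "45 Oak Ave"}, toy_expand)
    print(lookup("123 main street", index, toy_expand))
```

Passing the expander in as a parameter keeps the indexing logic independent of how normalization is done, so the same in-memory dictionary could be swapped for a search engine or database keyed on the normalized strings.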
