geocode

Geocodes a location against an updatable local copy of the Geonames cities & the Maxmind GeoLite2 databases — with caching and multi-threading, this offline path geocodes up to 360,000 records/sec! Can also geocode online (forward & reverse) via the OpenCage geocoder.

Table of Contents | Source: src/cmd/geocode.rs | 📇🧠🚀🌐🔣👆🌎

Description | Examples | Usage | Arguments | Geocode Options | Suggest Only Options | Reverse Only Option | Opencage Only Options | Dynamic Formatting Options | Cache-Prune Only Option | Index-Update Only Options | Common Options

Description

Geocodes a location in CSV data against an updatable local copy of the Geonames cities index and a local copy of the MaxMind GeoLite2 City database.

The Geonames cities index can be retrieved and updated using the geocode index-* subcommands.

The GeoLite2 City database will need to be MANUALLY downloaded from MaxMind. Though it is free, you will need to create a MaxMind account to download the GeoIP2 Binary database (mmdb) from https://www.maxmind.com/en/accounts/current/geoip/downloads. Copy the GeoLite2-City.mmdb file to the ~/.qsv-cache/ directory or point to it using the QSV_GEOIP2_FILENAME environment variable.

When you run the command for the first time, it will download a prebuilt Geonames cities index from the qsv GitHub repo and use it going forward. You can operate on the local index using the geocode index-* subcommands.

By default, the prebuilt index uses the Geonames Gazetteer cities15000.zip file with English names. It contains cities with populations > 15,000 (about 26k cities). See https://download.geonames.org/export/dump/ for more information.

It has twelve major subcommands:

  • suggest - given a partial City name, return the closest City's location metadata per the local Geonames cities index (Jaro-Winkler distance)
  • suggestnow - same as suggest, but using a partial City name from the command line, instead of CSV data.
  • reverse - given a WGS-84 location coordinate, return the closest City's location metadata per the local Geonames cities index. (Euclidean distance - shortest distance "as the crow flies")
  • reversenow - same as reverse, but using a coordinate from the command line, instead of CSV data.
  • countryinfo - returns the country information for the ISO-3166 2-letter country code (e.g. US, CA, MX, etc.)
  • countryinfonow - same as countryinfo, but using a country code from the command line, instead of CSV data.
  • iplookup - given an IP address or URL, return the closest City's location metadata per the local Maxmind GeoLite2 City database.
  • iplookupnow - same as iplookup, but using an IP address or URL from the command line, instead of CSV data.
  • opencage - ONLINE forward/reverse geocoding using the OpenCage API. Forward-geocodes a free-form address, or reverse-geocodes a "lat, long" coordinate. Requires an OpenCage API key.
  • opencagenow - same as opencage, but using an address/coordinate from the command line, instead of CSV data.
  • index-* - operations to update the local Geonames cities index. (index-check, index-update, index-load & index-reset)
  • cache-* - operations to manage the persistent on-disk OpenCage result cache. (cache-clear, cache-prune & cache-info)

Suggest

Suggest a Geonames city based on a partial city name. It returns the closest Geonames city record based on the Jaro-Winkler distance between the partial city name and the Geonames city name.
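
To illustrate how Jaro-Winkler matching with a minimum score could work, here is a minimal Python sketch. It is not qsv's implementation: the function names, the lowercase normalization and the three-city list are assumptions made for illustration, and the 0.8 threshold simply mirrors the --min-score default.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def suggest(partial, city_names, min_score=0.8):
    """Return (name, score) for the best match, or None if below min_score."""
    if not city_names:
        return None
    score, name = max(
        (jaro_winkler(partial.lower(), n.lower()), n) for n in city_names
    )
    return (name, score) if score >= min_score else None


# Hypothetical sample names, not the real Geonames index.
CITIES = ["New York", "Newark", "Paris"]
print(suggest("New Yrok", CITIES))
```

A misspelled query such as "New Yrok" still scores highly against "New York" because most characters match within the window and the first four characters agree.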

The geocoded information is formatted based on --formatstr, returning it in '%location' format (i.e. "(lat, long)") if not specified.

Use the --new-column option if you want to keep the location column, e.g.

Geocode file.csv city column and set the geocoded value to a new column named lat_long.

$ qsv geocode suggest city --new-column lat_long file.csv

Limit suggestions to the US, Canada and Mexico.

$ qsv geocode suggest city --country us,ca,mx file.csv

Limit suggestions to New York State and California, with matches in New York State having higher priority as it's listed first.

$ qsv geocode suggest city --country us --admin1 "New York,US.CA" file.csv

If we use admin1 codes, we can omit --country as it will be inferred from the admin1 code prefix.

$ qsv geocode suggest city --admin1 "US.NY,US.CA" file.csv
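
The prefix inference described above can be pictured with a short Python sketch. The helper name is hypothetical, and the rule that inference only applies when every entry looks like a "CC.xxx" code is an assumption, not something this page states.

```python
def countries_from_admin1(admin1_list: str) -> list:
    """Infer country codes from admin1 codes such as "US.NY,US.CA".

    Assumption: if any entry is not a code (e.g. "New York"), nothing is
    inferred and an empty list is returned.
    """
    countries = []
    for entry in admin1_list.split(","):
        prefix, dot, rest = entry.strip().partition(".")
        looks_like_code = bool(dot) and bool(rest) and len(prefix) == 2 and prefix.isalpha()
        if not looks_like_code:
            return []
        country = prefix.upper()
        if country not in countries:
            countries.append(country)
    return countries


print(countries_from_admin1("US.NY,US.CA"))
```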

Geocode file.csv city column with --formatstr=%state and set the geocoded value to a new column named state.

$ qsv geocode suggest city --formatstr %state --new-column state file.csv

Use dynamic formatting to create a custom format.

$ qsv geocode suggest city -f "{name}, {admin1}, {country} in {timezone}" file.csv

Use French place names. You'll need to rebuild the index with the --languages option first.

$ qsv geocode suggest city -f "{name}, {admin1}, {country} in {timezone}" -l fr file.csv

Suggestnow

Accepts the same options as suggest, but does not require an input file. Its default format is more verbose - "{name}, {admin1} {country}: {latitude}, {longitude}"

$ qsv geocode suggestnow "New York"
$ qsv geocode suggestnow --country US -f %cityrecord "Paris"
$ qsv geocode suggestnow --admin1 "US.OH" "Athens"

Reverse

Reverse geocode a WGS 84 coordinate to the nearest City. It returns the closest Geonames city record based on the Euclidean distance between the coordinate and the nearest city. It accepts "lat, long" or "(lat, long)" format.
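
As a rough picture of the lookup, the following Python sketch parses both accepted coordinate formats and picks the city with the smallest Euclidean distance. It is an illustration only: the function names and the two-row city table are assumptions, and the real index is much larger and not searched by a linear scan like this.

```python
import math

# Illustrative sample rows (name, latitude, longitude); not the real index.
CITIES = [
    ("New York", 40.71427, -74.00597),
    ("Athens", 39.32924, -82.10126),
]


def parse_coordinate(text: str):
    """Parse "lat, long" or "(lat, long)" into a (lat, lon) tuple of floats."""
    lat_text, lon_text = text.strip().strip("()").split(",")
    lat, lon = float(lat_text), float(lon_text)
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinate out of range: {text!r}")
    return lat, lon


def nearest_city(coordinate, cities=CITIES):
    """Return the row whose (lat, lon) is closest in plain Euclidean terms."""
    lat, lon = coordinate
    return min(cities, key=lambda c: math.hypot(c[1] - lat, c[2] - lon))


print(nearest_city(parse_coordinate("(39.3, -82.1)"))[0])
```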

The geocoded information is formatted based on --formatstr, returning it in '%city-admin1' format if not specified, e.g.

Reverse geocode file.csv LatLong column. Set the geocoded value to a new column named City.

$ qsv geocode reverse LatLong -c City file.csv

Reverse geocode file.csv LatLong column and set the geocoded value to a new column named CityState, output to a file named file_with_citystate.csv.

$ qsv geocode reverse LatLong -c CityState file.csv -o file_with_citystate.csv

The same as above, but get the timezone instead of the city and state.

$ qsv geocode reverse LatLong -f %timezone -c tz file.csv -o file_with_tz.csv

Reversenow

Accepts the same options as reverse, but does not require an input file.

$ qsv geocode reversenow "40.71427, -74.00597"
$ qsv geocode reversenow --country US -f %cityrecord "40.71427, -74.00597"
$ qsv geocode reversenow "(39.32924, -82.10126)"

Countryinfo

Returns the country information for the specified ISO-3166 2-letter country code.

$ qsv geocode countryinfo country_col data.csv
$ qsv geocode countryinfo --formatstr "%json" country_col data.csv
$ qsv geocode countryinfo -f "%continent" country_col data.csv
$ qsv geocode countryinfo -f "{country_name} ({fips}) in {continent}" country_col data.csv

Countryinfonow

Accepts the same options as countryinfo, but does not require an input file.

$ qsv geocode countryinfonow US
$ qsv geocode countryinfonow --formatstr "%pretty-json" US
$ qsv geocode countryinfonow -f "%continent" US
$ qsv geocode countryinfonow -f "{country_name} ({fips}) in {continent}" US

Iplookup

Given an IP address or URL, return the closest City's location metadata per the local Maxmind GeoLite2 City database.

$ qsv geocode iplookup IP_col data.csv
$ qsv geocode iplookup --formatstr "%json" IP_col data.csv
$ qsv geocode iplookup -f "%cityrecord" IP_col data.csv

Iplookupnow

Accepts the same options as iplookup, but does not require an input file.

$ qsv geocode iplookupnow 140.174.222.253
$ qsv geocode iplookupnow https://amazon.com
$ qsv geocode iplookupnow --formatstr "%json" 140.174.222.253
$ qsv geocode iplookupnow -f "%cityrecord" 140.174.222.253

Opencage

Online forward or reverse geocoding using the OpenCage Geocoding API (https://opencagedata.com). Unlike the suggest/reverse subcommands which use the local Geonames index, opencage geocodes real street addresses online.

Requires an OpenCage API key. Set it with --api-key or the QSV_OPENCAGE_API_KEY environment variable (the --api-key flag takes precedence). Get a free key at https://opencagedata.com/users/sign_up.

The <column> may contain either a free-form address (forward geocoding) or a "lat, long" / "(lat, long)" WGS-84 coordinate (reverse geocoding). The mode is auto-detected per row; pass --reverse to force reverse geocoding.
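
One way to picture per-row mode detection is the Python sketch below. The regular expression and the function name are assumptions; this page does not state qsv's actual detection rules.

```python
import re

# Matches "lat, long" or "(lat, long)" made of two signed decimal numbers.
COORD_RE = re.compile(
    r"^\(?\s*[-+]?\d+(?:\.\d+)?\s*,\s*[-+]?\d+(?:\.\d+)?\s*\)?$"
)


def detect_mode(value: str, force_reverse: bool = False) -> str:
    """Return "reverse" for coordinate-like input, otherwise "forward"."""
    if force_reverse or COORD_RE.match(value.strip()):
        return "reverse"
    return "forward"


for query in ["Brooklyn, NY", "40.71427, -74.00597", "(39.32924, -82.10126)"]:
    print(query, "->", detect_mode(query))
```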

OpenCage's Terms of Service explicitly allow caching, so results are cached in a persistent on-disk cache (see --cache-ttl & --no-cache). Re-runs and duplicate queries do NOT re-hit the API. The free tier allows 2,500 requests/day at 1 request/second; rows are processed sequentially and rate-limited (see --rate-limit).
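
The effect of a time-to-live cache on repeated queries can be sketched in Python as follows. This is an in-memory stand-in, not the on-disk cache qsv uses; the class name, the injectable clock and the fetch callback are assumptions for illustration.

```python
import time


class TtlCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self.entries = {}           # query -> (stored_at, result)

    def get_or_fetch(self, query, fetch):
        """Return a fresh cached result, or call fetch(query) and store it."""
        now = self.clock()
        hit = self.entries.get(query)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        result = fetch(query)
        self.entries[query] = (now, result)
        return result


cache = TtlCache(ttl_seconds=60)
lookups = []
fake_api = lambda q: lookups.append(q) or f"result for {q}"
cache.get_or_fetch("Brooklyn, NY", fake_api)
cache.get_or_fetch("Brooklyn, NY", fake_api)   # served from the cache
print(len(lookups))
```

Only the first lookup reaches the (fake) API; the duplicate is answered from the cache until the entry expires.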

The --country option, if set, restricts results to the given ISO 3166-1 alpha-2 country code(s). The --timeout, --language, --invalid-result, --new-column, --rename and --output options behave as they do for the other subcommands.

The --formatstr option supports these OpenCage-specific formats:

  • '%+' | '%formatted' - the OpenCage formatted address (default)
  • '%lat-long' - <latitude>, <longitude>
  • '%location' - (<latitude>, <longitude>)
  • '%city' - the city/town/village
  • '%state' | '%admin1' - the state/province
  • '%county' | '%admin2' - the county
  • '%country' - the ISO 3166-1 alpha-2 country code
  • '%country_name' - the country name
  • '%postcode' - the postal code
  • '%confidence' - the OpenCage confidence score (0-10)
  • '%json' - the first OpenCage result as JSON
  • '%pretty-json' - the first OpenCage result as pretty JSON

Dynamic formatting is also supported, using dotted keys, e.g. "{components.city}, {components.country}" or "{annotations.timezone.name}". Available keys: formatted, lat, lng, confidence, components.<field> and annotations.<dotted.path>.

The special "%dyncols:" format is also supported, adding multiple columns to the output CSV. Set --formatstr to "%dyncols:" followed by a comma-delimited list of "{col_name:key}" pairs, where key is one of the dynamic keys above, e.g. "%dyncols: {city:components.city}, {tz:annotations.timezone.name}". Like the other subcommands, "%dyncols:" cannot be combined with --new-column.
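
To make the "{col_name:key}" syntax concrete, here is a small Python sketch that splits a "%dyncols:" string into column/key pairs and resolves dotted keys against a nested result. The function names, the sample result and the choice to return an empty string for a missing key are assumptions, not qsv behavior.

```python
import re

# One "{col_name:key}" pair; neither part may contain braces.
PAIR_RE = re.compile(r"\{([^:{}]+):([^{}]+)\}")


def parse_dyncols(formatstr: str):
    """Split "%dyncols: {a:x.y}, {b:z}" into [("a", "x.y"), ("b", "z")]."""
    prefix = "%dyncols:"
    if not formatstr.startswith(prefix):
        raise ValueError("not a %dyncols: format string")
    pairs = PAIR_RE.findall(formatstr[len(prefix):])
    return [(col.strip(), key.strip()) for col, key in pairs]


def lookup(result: dict, dotted_key: str) -> str:
    """Follow a dotted key through nested dicts; "" if any part is missing."""
    value = result
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return ""
        value = value[part]
    return str(value)


# Hypothetical result shaped like the dotted keys used in this section.
sample = {
    "components": {"city": "Paris"},
    "annotations": {"timezone": {"name": "Europe/Paris"}},
}
spec = "%dyncols: {city:components.city}, {tz:annotations.timezone.name}"
print({col: lookup(sample, key) for col, key in parse_dyncols(spec)})
```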

$ qsv geocode opencage address --api-key YOURKEY file.csv
$ qsv geocode opencage address --country us -f '%json' file.csv
$ qsv geocode opencage coord_col --reverse -c city file.csv
$ qsv geocode opencage address -f '{components.city}, {components.country}' file.csv
$ qsv geocode opencage address -f '%dyncols: {city:components.city}, {pc:components.postcode}' file.csv

Opencagenow

Accepts the same options as opencage, but does not require an input file.

$ qsv geocode opencagenow --api-key YOURKEY "Brooklyn, NY"
$ qsv geocode opencagenow "40.71427, -74.00597"
$ qsv geocode opencagenow -f '%pretty-json' "Eiffel Tower, Paris"

Index-*

Manage the local Geonames cities index used by the geocode command.

It has four operations:

  • check - checks if the local Geonames index is up-to-date compared to the Geonames website. Returns the index file's metadata JSON to stdout.
  • update - updates the local Geonames index with the latest changes from the Geonames website. Use this command judiciously, as it downloads about 200mb of data from Geonames and rebuilds the index from scratch using the --languages option. If you don't need a language other than English, use the index-load subcommand instead, as it's faster and will not download any data from Geonames.
  • reset - resets the local Geonames index to the default prebuilt, English-only Geonames cities index (cities15000) - downloading it from the qsv GitHub repo for the current qsv version.
  • load - load a Geonames cities index from a file, making it the default index going forward. If set to 15000, it will download the prebuilt English-only cities15000 Geonames index rkyv file from the qsv GitHub repo for the current qsv version.

Update the Geonames cities index with the latest changes.

$ qsv geocode index-update

Rebuild the index using the latest Geonames data w/ English, French, German & Spanish place names

$ qsv geocode index-update --languages en,fr,de,es

Load an alternative Geonames cities index from a file, making it the default index going forward.

$ qsv geocode index-load my_geonames_index.rkyv

Cache-*

Manage the persistent on-disk OpenCage result cache used by the opencage subcommands. This cache is separate from the Geonames cities index and is only populated by the opencage/opencagenow subcommands. It lives in {cache-dir}/geocode-opencage_v1.

It has three operations:

  • clear - wipe the entire OpenCage disk cache, removing all cached results.
  • prune - delete cache entries older than the --older-than value. The value is either an absolute date/datetime (e.g. 2025-01-31, "2025-01-31 12:00:00") or a relative age with a unit suffix - s(econds), m(inutes), h(ours), d(ays) or w(eeks). e.g. 30d, 2w, 48h, 90m, 3600s.
  • info - report the cache directory, entry count, on-disk size and the oldest/newest cached entry timestamps. Emits a JSON summary to stdout.
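
The two accepted --older-than forms can be illustrated with a short Python sketch that turns either one into a cutoff timestamp. The function name is hypothetical, and treating absolute values as naive ISO-style datetimes is an assumption; qsv's own parsing and timezone handling are not described here.

```python
from datetime import datetime, timedelta

# Seconds per unit suffix: seconds, minutes, hours, days, weeks.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def cutoff_from_older_than(value: str, now: datetime) -> datetime:
    """Entries stored before the returned datetime would be pruned."""
    value = value.strip()
    unit = value[-1:].lower()
    amount = value[:-1]
    if unit in UNIT_SECONDS and amount.isdigit():
        # Relative age such as 30d, 2w, 48h, 90m or 3600s.
        return now - timedelta(seconds=int(amount) * UNIT_SECONDS[unit])
    # Absolute date or datetime such as 2025-01-31 or "2025-01-31 12:00:00".
    return datetime.fromisoformat(value)


now = datetime(2025, 3, 1)
for value in ["30d", "2w", "2025-01-31", "2025-01-31 12:00:00"]:
    print(value, "->", cutoff_from_older_than(value, now))
```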

Wipe the entire OpenCage cache.

$ qsv geocode cache-clear

Delete cached entries older than 30 days.

$ qsv geocode cache-prune --older-than 30d

Delete cached entries created before a specific date.

$ qsv geocode cache-prune --older-than 2025-01-01

Show cache statistics.

$ qsv geocode cache-info

Examples

For US locations, you can retrieve the us_state_fips_code and us_county_fips_code fields of a US City to help with Census data enrichment.

qsv geocode suggest city_col --country US -f \
"%dyncols: {geocoded_city_col:name}, {state_col:admin1}, {county_col:admin2}, {state_fips_code:us_state_fips_code}, {county_fips_code:us_county_fips_code}" \
input_data.csv -o output_data_with_fips.csv

For US locations, you can reverse geocode a WGS 84 coordinate to retrieve the us_state_fips_code and us_county_fips_code fields of the nearest US City to help with Census data enrichment. The coordinate can be in "lat, long" or "(lat, long)" format.

qsv geocode reverse wgs84_coordinate_col --country US -f \
"%dyncols: {geocoded_city_col:name}, {state_col:admin1}, {county_col:admin2}, {state_fips_code:us_state_fips_code}, {county_fips_code:us_county_fips_code}" \
input_data.csv -o output_data_with_fips.csv

For more examples, see tests.

See also https://github.com/dathere/qsv/wiki/Geospatial#geocode

Usage

qsv geocode suggest [--formatstr=<string>] [options] <column> [<input>]
qsv geocode suggestnow [options] <location>
qsv geocode reverse [--formatstr=<string>] [options] <column> [<input>]
qsv geocode reversenow [options] <location>
qsv geocode countryinfo [options] <column> [<input>]
qsv geocode countryinfonow [options] <location>
qsv geocode iplookup [options] <column> [<input>]
qsv geocode iplookupnow [options] <location>
qsv geocode opencage [--formatstr=<string>] [options] <column> [<input>]
qsv geocode opencagenow [options] <location>
qsv geocode index-load <index-file>
qsv geocode index-check
qsv geocode index-update [--languages=<lang>] [--cities-url=<url>] [--force] [--timeout=<seconds>]
qsv geocode index-reset
qsv geocode cache-clear [options]
qsv geocode cache-prune --older-than=<val> [options]
qsv geocode cache-info [options]
qsv geocode --help

Arguments

   Argument    Description
 <input>  The input file to read from. If not specified, reads from stdin.
 <column>  The column to geocode. Used by the suggest, reverse, countryinfo, iplookup & opencage subcommands. For suggest, it must be a column with a City string pattern. For reverse, it must be a column using WGS 84 coordinates in "lat, long" or "(lat, long)" format. For countryinfo, it must be a column with an ISO 3166-1 alpha-2 country code. For iplookup, it must be a column with an IP address or a URL. For opencage, it may be a free-form address OR a WGS 84 coordinate. Note that you can use column selector syntax to select the column, but only the first column will be used. See select --help for more information.
 <location>  The location to geocode for the suggestnow, reversenow, countryinfonow, iplookupnow and opencagenow subcommands. For suggestnow, it's a City string pattern. For reversenow, it must be a WGS 84 coordinate. For countryinfonow, it must be an ISO 3166-1 alpha-2 code. For iplookupnow, it must be an IP address or a URL. For opencagenow, it must be an address OR a WGS 84 coordinate.
 <index-file>  The alternate geonames index file to use. It must be a .rkyv file. For convenience, if this is set to 15000, it will download the prebuilt English-only cities15000 Geonames index rkyv file from the qsv GitHub repo for the current qsv version and use it. Only used by the index-load subcommand.

Geocode Options

     Option      Type Description Default
 ‑c, ‑‑new‑column  string Put the transformed values in a new column instead. Not valid when using the '%dyncols:' --formatstr option.
 ‑r, ‑‑rename  string New name for the transformed column.
 ‑‑country  string The comma-delimited, case-insensitive list of countries to filter for. Country is specified as an ISO 3166-1 alpha-2 (two-letter) country code. https://en.wikipedia.org/wiki/ISO_3166-2

Suggest Only Options

     Option      Type Description Default
 ‑‑min‑score  float The minimum Jaro-Winkler distance score. 0.8
 ‑‑admin1  string The comma-delimited, case-insensitive list of admin1s to filter for.

Reverse Only Option

     Option      Type Description Default
 ‑k, ‑‑k_weight  string Use population-weighted distance for reverse subcommand. (i.e. nearest.distance - k * city.population) Larger values will favor more populated cities. If not set (default), the population is not used and the nearest city is returned.
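
The population-weighted formula above (distance - k * population) can be tried out with a minimal Python sketch. The function name, the two made-up cities and the sample k value are assumptions for illustration; the sketch only shows how a larger k shifts the choice toward a more populous city.

```python
import math


def weighted_nearest(lat, lon, cities, k=0.0):
    """Pick the city with the lowest score = distance - k * population."""
    def score(city):
        _name, city_lat, city_lon, population = city
        return math.hypot(city_lat - lat, city_lon - lon) - k * population
    return min(cities, key=score)


# Made-up rows: (name, latitude, longitude, population).
cities = [
    ("Smalltown", 0.0, 1.0, 1_000),
    ("Metropolis", 0.0, 2.0, 1_000_000),
]
print(weighted_nearest(0.0, 0.0, cities)[0])            # k = 0: plain nearest
print(weighted_nearest(0.0, 0.0, cities, k=1e-5)[0])    # population now matters
```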

Opencage Only Options

      Option       Type Description Default
 ‑‑api‑key  string The OpenCage API key for the opencage/opencagenow subcommands. If set, it takes precedence over the QSV_OPENCAGE_API_KEY environment variable. Get a free key at https://opencagedata.com/users/sign_up.
 ‑‑rate‑limit  integer Maximum number of OpenCage API requests per second. The free tier allows 1 request/second (2,500/day). 1
 ‑‑reverse  flag Force reverse geocoding for opencage/opencagenow (treat the query as a "lat, long" WGS-84 coordinate). If not set, forward and reverse mode is auto-detected per row.
 ‑‑no‑annotations  flag Omit OpenCage annotations (timezone, currency, etc.) from the result and from %json output.
 ‑‑cache‑ttl  integer Time-to-live, in seconds, for the persistent on-disk OpenCage result cache. The default of 1209600 seconds is 14 days. 1209600
 ‑‑no‑cache  flag Disable the persistent on-disk OpenCage cache. Duplicate queries within a run are still de-duplicated.

Dynamic Formatting Options

      Option       Type Description Default
 ‑l, ‑‑language  string The language to use when geocoding. The language is specified as an ISO 639-1 code. Note that the Geonames index must have been built with the specified language using the index-update subcommand with the --languages option. If the language is not available, the first language in the index is used. en
 ‑‑invalid‑result  string The string to return when the geocode result is empty/invalid. If not set, the original value is used.
 ‑j, ‑‑jobs  integer The number of jobs to run in parallel. When not set, the number of jobs is set to the number of CPUs detected.
 ‑b, ‑‑batch  integer The number of rows per batch to load into memory, before running in parallel. Set to 0 to load all rows in one batch. 50000
 ‑‑timeout  integer Timeout for downloading Geonames cities index. 120
 ‑‑cache‑dir  string The directory to use for caching the Geonames cities index and the persistent on-disk OpenCage result cache. If the directory does not exist, qsv will attempt to create it. If the QSV_CACHE_DIR envvar is set, it will be used instead. ~/.qsv-cache

Cache-Prune Only Option

     Option      Type Description Default
 ‑‑older‑than  string Delete OpenCage cache entries older than this value. Accepts an absolute date/datetime (e.g. 2025-01-31) or a relative age with a unit suffix (s/m/h/d/w = seconds, minutes, hours, days or weeks; e.g. 30d, 2w, 48h). Required by the cache-prune subcommand.

Index-Update Only Options

     Option      Type Description Default
 ‑‑languages  string The comma-delimited, case-insensitive list of languages to use when building the Geonames cities index. The languages are specified as a comma-separated list of ISO 639-2 codes. See https://download.geonames.org/export/dump/iso-languagecodes.txt to look up codes and https://download.geonames.org/export/dump/alternatenames/ for the supported language files. 253 languages are currently supported. en
 ‑‑cities‑url  string The URL to download the Geonames cities file from. Several are available at https://download.geonames.org/export/dump/ including cities500.zip (population > 500; ~200k cities, 56mb), cities1000.zip (population > 1000; ~140k cities, 44mb), cities5000.zip (population > 5000; ~53k cities, 21mb) and cities15000.zip (population > 15000; ~26k cities, 13mb). Note that the more cities are included, the larger the local index file will be, the slower lookups will be, and the search results will be different. For convenience, if this is set to 500, 1000, 5000 or 15000, it will be converted to a geonames cities URL. https://download.geonames.org/export/dump/cities15000.zip
 ‑‑force  flag Force update the Geonames cities index. If not set, qsv will check if there are updates available at Geonames.org before updating the index.
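
The --cities-url shortcut described in the table can be pictured with this small Python sketch. The function name is hypothetical and the sketch only mirrors the described conversion of 500, 1000, 5000 or 15000 into a Geonames dump URL; any other value is passed through unchanged as an assumption.

```python
GEONAMES_DUMP = "https://download.geonames.org/export/dump/"
SHORTCUTS = {"500", "1000", "5000", "15000"}


def resolve_cities_url(value: str) -> str:
    """Expand a population shortcut into a full citiesNNN.zip URL."""
    value = value.strip()
    if value in SHORTCUTS:
        return f"{GEONAMES_DUMP}cities{value}.zip"
    # Assumption: anything else is already a URL and is used as given.
    return value


print(resolve_cities_url("5000"))
```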

Common Options

     Option      Type Description Default
 ‑h, ‑‑help  flag Display this message
 ‑o, ‑‑output  string Write output to <file> instead of stdout.
 ‑d, ‑‑delimiter  string The field delimiter for reading CSV data. Must be a single character. (default: ,)
 ‑p, ‑‑progressbar  flag Show progress bars. Will also show the cache hit rate upon completion. Not valid for stdin.

Source: src/cmd/geocode.rs | Table of Contents | README