
PostGIS GTFS importer

This tool imports GTFS Schedule data into a PostGIS database using gtfs-via-postgres. It allows running a production service (e.g. an API) on top of programmatically re-imported data from a periodically changing GTFS feed without downtime.

Because it works as atomically as possible with PostgreSQL, the import pipeline stays robust even if an import fails or if multiple imports are started simultaneously.

The ghcr.io/mobidata-bw/postgis-gtfs-importer Docker image is built automatically from this repo.

How it works

First, the GTFS data is downloaded to, unzipped into, and cleaned within /tmp/gtfs; you can specify a custom path using $GTFS_TMP_DIR.

Each GTFS import gets its own PostgreSQL database, named $GTFS_IMPORTER_DB_PREFIX_$unix_timestamp_$sha256_digest. Once an import has succeeded, the importer records its DB name in a table latest_successful_imports within a "meta bookkeeping database"; this is how it keeps track of the most recent successful imports.
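A minimal sketch of how such a DB name could be composed; the exact separator and digest length used by the importer may differ, and the values below are illustrative.

```shell
#!/bin/sh
# Hypothetical sketch: compose an import DB name from the configured prefix,
# the current Unix timestamp, and the dataset's SHA-256 digest.
GTFS_IMPORTER_DB_PREFIX='gtfs'
unix_timestamp="$(date +%s)"
sha256_digest='9f86d081' # illustrative, shortened SHA-256 digest
db_name="${GTFS_IMPORTER_DB_PREFIX}_${unix_timestamp}_${sha256_digest}"
echo "$db_name"
```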

The newly downloaded GTFS data will only get imported if it has changed since the last import. This is determined using a SHA-256 digest of the GTFS dataset (and of the post-processing scripts, if configured, see below).
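The change detection described above can be sketched roughly as follows; the function name and paths are assumptions, not the importer's actual internals.

```shell
#!/bin/sh
# Hypothetical sketch of change detection: hash the downloaded GTFS zip
# together with the post-processing scripts (if any), and only import when
# the digest differs from that of the latest successful import.
compute_digest() {
	# $1: GTFS zip, $2: postprocessing.d directory (may be absent)
	{
		cat "$1"
		if [ -d "$2" ]; then cat "$2"/*; fi
	} | sha256sum | cut -d' ' -f1
}

# stand-in dataset so the sketch runs end-to-end
printf 'stop_id,stop_name\n1,example\n' > /tmp/example-gtfs.zip

previous_digest='0000' # illustrative; would be read from the bookkeeping DB
current_digest="$(compute_digest /tmp/example-gtfs.zip /etc/gtfs/postprocessing.d)"
if [ "$current_digest" = "$previous_digest" ]; then
	echo 'GTFS unchanged, skipping import'
else
	echo 'GTFS changed, importing'
fi
```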

Before each import, the importer also deletes all but the two most recent successful imports; this ensures that your disk won't fill up, while a rollback to the previous import always remains possible.
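Because the Unix timestamp is embedded in each DB name, the retention rule can be expressed as a simple sort; this is an illustrative sketch, not the importer's actual cleanup code.

```shell
#!/bin/sh
# Hypothetical sketch of the retention rule: given the names of successful
# import DBs (prefix_timestamp_digest), keep the two most recent and print
# the rest as candidates for DROP DATABASE.
keep_latest_two() {
	# sort by the embedded Unix timestamp (field 2), newest first,
	# then drop the first two lines, i.e. the two imports to keep
	sort -t_ -k2,2nr | tail -n +3
}
printf '%s\n' \
	'gtfs_1700000300_aaa' \
	'gtfs_1700000100_bbb' \
	'gtfs_1700000200_ccc' \
| keep_latest_two
```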

Because the entire import script runs in a transaction, and because it acquires an exclusive lock on latest_successful_imports at the beginning, it should be safe to abort an import at any time, or to (accidentally) run more than one process in parallel. Because creating and deleting DBs is not possible within a transaction, the importer opens a separate DB connection to do that; therefore, aborting an import might leave behind an empty DB (not yet marked as the latest), which will be cleaned up as part of the next import (see above).
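The transaction skeleton might look roughly like the SQL below; the column name db_name and the exact lock mode are assumptions, printed here rather than executed since no database is available.

```shell
#!/bin/sh
# Hypothetical sketch of the bookkeeping transaction: the exclusive lock on
# latest_successful_imports serialises concurrent importer runs, so a second
# importer blocks until the first commits or aborts.
sql="$(cat <<'SQL'
BEGIN;
LOCK TABLE latest_successful_imports IN ACCESS EXCLUSIVE MODE;
-- download, import & post-processing happen while the lock is held
INSERT INTO latest_successful_imports (db_name) VALUES ('gtfs_1700000000_9f86d081');
COMMIT;
SQL
)"
printf '%s\n' "$sql"
```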

After the GTFS has been imported, but before the import is marked as successful, all post-processing scripts in /etc/gtfs/postprocessing.d (this path can be changed using $GTFS_POSTPROCESSING_D_PATH) are run, if provided. This way, you can customise or augment the imported data. These scripts are executed within the same transaction (in the bookkeeping DB) as the GTFS import. Files ending in .sql are run using psql; all other files are assumed to be executable scripts. Note that the post-processing scripts are also hashed into the $sha256_digest, so if they change, the GTFS data will be imported again.
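The dispatch between .sql files and executable scripts can be sketched as below; this is an assumption about the shape of the logic, with error handling omitted.

```shell
#!/bin/sh
# Hypothetical sketch of the post-processing step: run every file in the
# postprocessing.d directory, feeding *.sql files to psql and executing
# everything else directly.
run_postprocessing() {
	dir="$1"
	[ -d "$dir" ] || return 0
	for f in "$dir"/*; do
		case "$f" in
			*.sql) psql -v ON_ERROR_STOP=1 -f "$f" ;;
			*) "$f" ;; # assumed to be an executable script
		esac
	done
}

# demo with a throwaway directory containing one executable script
mkdir -p /tmp/postprocessing.d
printf '#!/bin/sh\necho post-processing ran\n' > /tmp/postprocessing.d/01-demo
chmod +x /tmp/postprocessing.d/01-demo
run_postprocessing /tmp/postprocessing.d
```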

Usage

Prerequisites

You can configure access to the bookkeeping DB using the standard $PG… environment variables.

export PGDATABASE=''
export PGUSER=''
#

Note: postgis-gtfs-importer requires a database user/role that is allowed to create new databases (CREATEDB privilege).
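If your user lacks that privilege, a superuser can grant it; the role name below is an assumption, and the statement is printed rather than executed since no database is assumed here.

```shell
#!/bin/sh
# Illustrative one-off setup: grant CREATEDB to the importer's role
# (role name gtfs_importer is hypothetical); run this as a superuser.
setup_sql='ALTER ROLE gtfs_importer CREATEDB;'
echo "$setup_sql"
```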

Importing Data

The following commands demonstrate how to use the importer using Docker.

mkdir gtfs-tmp
docker run --rm -it \
	-v $PWD/gtfs-tmp:/tmp/gtfs \
	-e 'GTFS_DOWNLOAD_USER_AGENT=…' \
	-e 'GTFS_DOWNLOAD_URL=…' \
	ghcr.io/mobidata-bw/postgis-gtfs-importer:v5

Note: We mount a gtfs-tmp directory so that the GTFS dataset is not re-downloaded on every run when it hasn't changed.

You can configure access to PostgreSQL by passing the standard PG* environment variables into the container.

If you run with GTFSTIDY_BEFORE_IMPORT=false, gtfsclean (a fork of gtfstidy) will not be used.

Writing a DSN file

If you set $PATH_TO_DSN_FILE to a file path, the importer will also write a PostgreSQL key/value connection string (DSN) to that path. Note that you must also provide $POSTGREST_USER & $POSTGREST_PASSWORD in this case.

This feature is intended to be used with PgBouncer for "dynamic" routing of PostgreSQL clients to the database containing the latest GTFS import.
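A sketch of what such a DSN file could contain; the exact keys written by the importer may differ, and the host, DB name, and credentials below are illustrative.

```shell
#!/bin/sh
# Hypothetical sketch: write a PostgreSQL key/value connection string (DSN)
# pointing at the latest import DB, e.g. for PgBouncer to pick up.
PATH_TO_DSN_FILE=/tmp/latest-import.dsn
POSTGREST_USER='postgrest'
POSTGREST_PASSWORD='secret'
latest_db='gtfs_1700000000_9f86d081'
printf 'host=localhost dbname=%s user=%s password=%s\n' \
	"$latest_db" "$POSTGREST_USER" "$POSTGREST_PASSWORD" \
	> "$PATH_TO_DSN_FILE"
cat "$PATH_TO_DSN_FILE"
```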

Breaking Changes

A new major version of postgis-gtfs-importer does not clean up imports done by the previous (major) versions.
