-
Notifications
You must be signed in to change notification settings - Fork 13
Importing CSV rows with geography parsing in cities table (EWKT WKT GeoJSON)
One-time large import — keep every row, set location = NULL when missing/malformed, no audit table.
This page documents the concept, the complete SQL schema, and a line-by-line explanation of the key INSERT logic of a GIS attribute, in our case, 'location'. We have a 'City' (later, cities) table in our DB that would contain at most all the cities in the world. For any future logic/use case that needs the comparison of a city's coordinates with the user's location coordinates, we take the initiative to include the location/coordinates in the cities table.
The location in the users table is of geography type, which is inserted dynamically. Since the 'cities' table is static, to make querying easy, we are considering adding a staging table that we have lon-lan, two separate attributes, which are further changed into one point in the insertion process.
Input CSV contains columns: city_id, state_id, city_name, city_point(where location is text-EWKT, WKT, or GeoJSON), and last_update_date. Example EWKT: "SRID=4326;POINT(-74.0060 40.7128)"
Goal: Insert every CSV row into cities (we keep CSV ID & name for all rows). If the location text is missing, malformed, not a POINT, or has out-of-range coords, we set cities.city_point = NULL.
Ensure PostGIS is enabled
CREATE EXTENSION IF NOT EXISTS postgis;
Main Table
CREATE TABLE IF NOT EXISTS cities(
city_id VARCHAR(255) PRIMARY KEY,
state_id VARCHAR(255),
city_name VARCHAR(255),
city_point geography(Point,4326), -- parsed geography or NULL
last_update_time timestamptz DEFAULT now()
);
Fast unlogged staging table (text for location)
An unlogged table is a special kind of table where PostgreSQL does not write to the Write-Ahead Log (WAL). With that, inserts/updates are much faster (great for staging or bulk loading). But the data is not crash-safe: If the server crashes, the table’s content is lost. On restart, PostgreSQL automatically truncates the table to empty. The tables cannot be replicated to a standby server via streaming replication and are often used as staging tables for temporary imports, ETL jobs, or intermediate computations, which is perfect for our task.
CREATE UNLOGGED TABLE staging_cities (
city_id VARCHAR(255) PRIMARY KEY,
state_id VARCHAR(255),
city_name VARCHAR(255),
city_point_wkt text
last_update_time timestamptz DEFAULT now()
);
Load CSV into staging (server-side). If CSV is on local, use \copy instead.
COPY staging_places (city_id, state_id, city_name, city_point_wkt, last_update_time)
FROM '/path/on/dbserver/places_ewkt.csv' CSV HEADER;
Insert ALL rows into the CITIES table
- Parsing strategy (fallbacks): ST_GeomFromEWKT -> ST_GeomFromText (strip SRID) -> ST_GeomFromGeoJSON
- Validation: must be POINT and coordinates in range; otherwise, location = NULL
INSERT INTO places (city_id, state_id, city_name, city_point, last_update_time)
SELECT
s.city_id,
s.state_id,
s.city_name,
CASE
WHEN parsed_geom IS NULL THEN NULL
WHEN GeometryType(parsed_geom) <> 'POINT' THEN NULL
WHEN NOT (ST_X(parsed_geom) BETWEEN -180 AND 180 AND ST_Y(parsed_geom) BETWEEN -90 AND 90) THEN NULL
ELSE ST_SetSRID(parsed_geom, COALESCE(NULLIF(ST_SRID(parsed_geom),0),4326))::geography
END AS city_point,
COALESCE(s.last_update_time, now()) AS last_update_time
FROM (
SELECT
city_id,
state_id,
city_name,
last_update_time,
COALESCE(
ST_GeomFromEWKT(city_point_wkt), -- EWKT: "SRID=4326;POINT(...)"
ST_GeomFromText(REGEXP_REPLACE(city_point_wkt, '^SRID=\\d+;', '')), -- WKT after stripping SRID prefix
ST_GeomFromGeoJSON(city_point_wkt) -- GeoJSON fallback
) AS parsed_geom
FROM staging_cities
) s;
Create spatial index if needed(after load)
CREATE INDEX IF NOT EXISTS idx_places_location ON cities USING GIST (city_point);
Refresh planner statistics
VACUUM ANALYZE places is run after a big load to (1) collect up-to-date table statistics so Postgres can make good query plans, and (2) perform basic vacuum work (mark visibility, free space, maintain visibility map).
VACUUM ANALYZE places;
Optional: drop staging when done
DROP TABLE IF EXISTS staging_cities;