Modernize rebuild scripts by davenquinn · Pull Request #231 · UW-Macrostrat/macrostrat

davenquinn · 2025-12-17T10:32:51Z

Modernize stratigraphy rebuild scripts

This should proceed similarly to modernizing the map rebuild scripts (Map ingestion #7) and Macrostrat API v2 updates.
Currently, these are run with macrostrat v1 rebuild <step> or macrostrat v1 rebuild all.
This builds on MariaDB to PostgreSQL data migration #60.
It probably makes sense to tackle this alongside 01-add-foreign-keys.sql delete statements causing data loss after mariadb migration #228.

Scripts

Tasks for each script

Convert the database access from MariaDB to PostgreSQL
Clean up control flow and use more modern parameter binding
Other fixes for usability

Some of these scripts are outdated, so they may not work without some modification. But we should start with direct SQL translation where possible.

Many run lots of SQL and so might benefit from being run/tested against a local database, or on the cluster.

Our new macrostrat.database library should make the SQL a lot terser and easier to read.

Overall process

Convert all scripts to PostgreSQL
Check for correct output in macrostrat schema and API results (number of rows, output structure)
Move the ~finalized scripts out of the v1 schema

davenquinn · 2025-12-17T16:22:10Z

While working on the first few scripts (autocomplete and lookup_strat_names), it seems that one of the only major SQL changes needed to migrate from MariaDB to PostgreSQL syntax is a change from the
UPDATE <table> LEFT JOIN <other table> SET ... WHERE ... syntax
to the UPDATE <table> SET ... FROM <other table> WHERE ... <join_condition> syntax. This may change which rows are updated, since the FROM form behaves like an inner join (i.e., rows with no match, which the LEFT JOIN form would see as NULLs, may be skipped); we'll have to monitor this.
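
To make the NULL concern concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than MariaDB or PostgreSQL. The units and unit_ages tables and their rows are hypothetical. A correlated subquery stands in for the old LEFT JOIN form (unmatched rows are set to NULL), and the same subquery guarded by EXISTS stands in for the UPDATE ... FROM form (unmatched rows are left alone).

```python
import sqlite3

# Hypothetical tables: unit 2 has no matching row in unit_ages.
SCHEMA = """
CREATE TABLE units (id INTEGER PRIMARY KEY, age TEXT);
CREATE TABLE unit_ages (unit_id INTEGER, age TEXT);
INSERT INTO units VALUES (1, 'stale'), (2, 'stale');
INSERT INTO unit_ages VALUES (1, 'Cambrian');
"""

# Correlated lookup of the new value for each unit (NULL when nothing matches).
NEW_AGE = "(SELECT a.age FROM unit_ages a WHERE a.unit_id = units.id)"


def run_update(sql):
    """Apply one UPDATE to a fresh in-memory database and return all rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(sql)
    rows = conn.execute("SELECT id, age FROM units ORDER BY id").fetchall()
    conn.close()
    return rows


# LEFT JOIN-like: every row is updated, so unit 2 becomes NULL.
left_join_like = run_update(f"UPDATE units SET age = {NEW_AGE}")

# UPDATE ... FROM-like: only matched rows are touched, so unit 2 keeps 'stale'.
update_from_like = run_update(
    f"UPDATE units SET age = {NEW_AGE} "
    "WHERE EXISTS (SELECT 1 FROM unit_ages a WHERE a.unit_id = units.id)"
)

print(left_join_like)    # [(1, 'Cambrian'), (2, None)]
print(update_from_like)  # [(1, 'Cambrian'), (2, 'stale')]
```

If a script relied on the old behavior to clear stale values, the translated statement would need an explicit reset step or a subquery form like the first one.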

davenquinn · 2026-02-20T21:06:34Z

OK, I finished a round of updates on several of the scripts

lookup-strat-names
lookup-units
lookup-unit-attrs-api
lookup-unit-intervals (appears to be sort of legacy)
autocomplete

Generally, I improved them to be much more streamlined in their approach and avoid loops for row-by-row updates in Python (which was mostly the original approach).

The remaining "rebuild" scripts appear to be fairly straightforward to migrate. I will create a new pull request for them.
The "match" and "process" scripts are generally more aligned with the mapping process and have been mostly superseded.

davenquinn · 2026-02-20T22:08:50Z

@amyfromandi already found an error in the new version. It's small but emphasizes that we want to proceed carefully here. #256

amyfromandi · 2026-02-23T21:39:06Z

FYI: in v2, macrostrat.units has max(id) = 138596, which is a real unit, but v1 prod has max(id) = 138597, which is ‘test_delete_me’. This test unit was added as a hack because some of the v1 rebuild scripts do not process the last record. We will need to ensure that the last real unit record shows up without the test_delete_me unit being present, OR just add the test unit back into the database.

amyfromandi · 2026-02-24T19:38:01Z

Since the pbdb script references a pbdb_coll_matrix table that does not exist, and per Shanan's response below, I am skipping the pbdb_matches revision.
Per Shanan: my sense is that you should skip any build scripts that involve pbdb data for now.

amyfromandi · 2026-02-25T23:25:50Z

Using these queries to diff the strat_name_footprints table.

--1
SELECT
  (SELECT count(*) FROM macrostrat.strat_name_footprints_new)      AS new_n,
  (SELECT count(*) FROM macrostrat.strat_name_footprints)  AS old_n;
--2
SELECT COUNT(*) AS new_minus_old
FROM (
  SELECT strat_name_id, name_no_lith, rank_name, concept_id, concept_names, best_t_age, best_b_age
  FROM macrostrat.strat_name_footprints_new
  EXCEPT
  SELECT strat_name_id, name_no_lith, rank_name, concept_id, concept_names, best_t_age, best_b_age
  FROM macrostrat.strat_name_footprints
) x;
SELECT COUNT(*) AS old_minus_new
FROM (
  SELECT strat_name_id, name_no_lith, rank_name, concept_id, concept_names, best_t_age, best_b_age
  FROM macrostrat.strat_name_footprints
  EXCEPT
  SELECT strat_name_id, name_no_lith, rank_name, concept_id, concept_names, best_t_age, best_b_age
  FROM macrostrat.strat_name_footprints_new
) x;

--3a
SELECT COUNT(*) AS geom_not_equal
FROM macrostrat.strat_name_footprints_new n
JOIN macrostrat.strat_name_footprints o USING (strat_name_id)
WHERE NOT ST_Equals(o.geom, n.geom);
--3b
SELECT COUNT(*) AS geom_binary_diff
FROM macrostrat.strat_name_footprints_new n
JOIN macrostrat.strat_name_footprints o USING (strat_name_id)
WHERE ST_AsBinary(o.geom) <> ST_AsBinary(n.geom);


--4
SELECT
  o.strat_name_id,
  o.concept_id AS old_concept_id,
  n.concept_id AS new_concept_id,
  ST_Equals(o.geom, n.geom) AS geom_equal,
  (ST_AsBinary(o.geom) = ST_AsBinary(n.geom)) AS geom_binary_equal,
  o.best_t_age AS old_best_t_age,
  n.best_t_age AS new_best_t_age,
  o.best_b_age AS old_best_b_age,
  n.best_b_age AS new_best_b_age
FROM macrostrat.strat_name_footprints_new n
JOIN macrostrat.strat_name_footprints o USING (strat_name_id)
WHERE n.concept_id IS DISTINCT FROM o.concept_id
   OR n.concept_names IS DISTINCT FROM o.concept_names
   OR n.best_t_age IS DISTINCT FROM o.best_t_age
   OR n.best_b_age IS DISTINCT FROM o.best_b_age
   OR NOT ST_Equals(o.geom, n.geom)
LIMIT 100;
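
The second pair of queries above (the EXCEPT checks in both directions) is a pattern that could be reused for other rebuilt tables. Below is a minimal sketch, assuming a hypothetical helper named except_count_sql (it is not part of macrostrat.database) and demonstrated against an in-memory SQLite database with made-up footprints_old and footprints_new tables.

```python
import sqlite3


def except_count_sql(left, right, columns):
    """Build a query counting rows in `left` that are absent from `right`.

    Table and column names are interpolated directly, so they must come from
    trusted code, never from user input.
    """
    cols = ", ".join(columns)
    return (
        "SELECT count(*) FROM ("
        f"SELECT {cols} FROM {left} EXCEPT SELECT {cols} FROM {right}"
        ") x"
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE footprints_old (strat_name_id INTEGER, concept_id INTEGER);
CREATE TABLE footprints_new (strat_name_id INTEGER, concept_id INTEGER);
INSERT INTO footprints_old VALUES (1, 10), (2, NULL);
INSERT INTO footprints_new VALUES (1, 10), (2, NULL), (3, 30);
""")

columns = ["strat_name_id", "concept_id"]
new_minus_old = conn.execute(
    except_count_sql("footprints_new", "footprints_old", columns)
).fetchone()[0]
old_minus_new = conn.execute(
    except_count_sql("footprints_old", "footprints_new", columns)
).fetchone()[0]

print(new_minus_old, old_minus_new)  # 1 0
```

EXCEPT treats NULLs as equal and removes duplicate rows, so a table that differs only by repeated rows will show a count of zero here even when the plain row counts in query 1 disagree.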

amyfromandi · 2026-02-25T23:29:35Z

I tried optimizing the strat_name_footprints query (even though it was already written in PostgreSQL), but only knocked 3 minutes off the execution time. The query takes about 22-25 minutes to run because geometries are calculated for ~54k rows of data. The query works, but maybe there is another way to make clauses such as SELECT COALESCE(ST_Union(ST_MakeValid(geom)), 'SRID=4326;POLYGON EMPTY') as geom execute faster.
Need to test https://staging.macrostrat.org/api/v2/geologic_units/burwell which uses the strat_name_footprints table.

amyfromandi · 2026-02-26T22:56:16Z

I refactored the rebuild scripts into the command-line interface. Below is an example of how to list the available scripts:

afromandi@Amys-MacBook-Air macrostrat % macrostrat rebuild scripts --list
  autocomplete
  lookup-strat-names
  lookup-unit-attrs-api
  lookup-unit-intervals
  lookup-units
  stats
  strat-name-footprints
  unit-boundaries

I ran all of the rebuild scripts in staging; the diffs are below. We need to review why autocomplete has a negative diff. I also wonder if the -1 and -2 diffs could be due to #231 (comment).

dataset	staging	v2 (production)	diff (staging − v2)
autocomplete	55,538	58,303	-2,765
lookup_strat_names	51,229	51,229	0
lookup_unit_attrs_api	133,417	133,418	-1
lookup_unit_intervals	133,417	133,419	-2
lookup_units	133,417	133,419	-2
stats	11	8	3
strat_name_footprints	54,293	48,536	5,757
unit_boundaries	144,251	144,250	1
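
The diff column can be recomputed from the two count columns. A minimal sketch, with the counts copied from the table above (the variable names are only illustrative):

```python
# Row counts copied from the table above: (staging, v2 production).
counts = {
    "autocomplete": (55_538, 58_303),
    "lookup_strat_names": (51_229, 51_229),
    "lookup_unit_attrs_api": (133_417, 133_418),
    "lookup_unit_intervals": (133_417, 133_419),
    "lookup_units": (133_417, 133_419),
    "stats": (11, 8),
    "strat_name_footprints": (54_293, 48_536),
    "unit_boundaries": (144_251, 144_250),
}

# diff = staging - v2, matching the last column of the table.
diffs = {name: staging - v2 for name, (staging, v2) in counts.items()}

# Only datasets whose row counts disagree need a closer look.
mismatched = {name: diff for name, diff in diffs.items() if diff != 0}

# Largest absolute differences first.
for name, diff in sorted(mismatched.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:24s}{diff:+,d}")
```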

davenquinn · 2026-02-26T23:38:13Z

> I tried optimizing the strat_name_footprints query (even though it was already written in PostgreSQL), but only knocked 3 minutes off the execution time. The query takes about 22-25 minutes to run because geometries are calculated for ~54k rows of data. The query works, but maybe there is another way to make clauses such as SELECT COALESCE(ST_Union(ST_MakeValid(geom)), 'SRID=4326;POLYGON EMPTY') as geom execute faster. Need to test https://staging.macrostrat.org/api/v2/geologic_units/burwell which uses the strat_name_footprints table.

We don't have to do this now, but I expect that if we pre-validate the geometry column and pre-populate empty polygons where necessary, we can add a spatial index that will improve things. There are other ways to optimize this using topological relationships as well.

davenquinn · 2026-02-26T23:39:37Z

Also, it would be nice if there was a macrostrat rebuild all or macrostrat rebuild --all as there was in v1.

amyfromandi · 2026-02-27T16:11:47Z

> Also, it would be nice if there was a macrostrat rebuild all or macrostrat rebuild --all as there was in v1.

Executing macrostrat rebuild scripts without any additional flags runs all of the scripts at once.

amyfromandi · 2026-02-27T17:09:51Z

I found that the query below (from the autocomplete rebuild script) is stale, so we're not actually joining the strat_name_orphans. In MariaDB, concept_id was 0 for these rows, but after the migration (I suspect the foreign key migration) concept_id became NULL in PostgreSQL. I updated the autocomplete query, but we may want to define the concept_id in the future.

(SELECT id,
         CONCAT(strat_name, ' ', rank) AS name,
         'strat_name_orphans'          AS type,
         'strat_name'                  AS category
  FROM strat_names
  WHERE concept_id is null)
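
A minimal sketch of why the orphan rows dropped out, assuming the stale filter was concept_id = 0 (the thread only says that MariaDB stored 0 here). It uses Python's built-in sqlite3 module and two made-up rows; the NULL comparison behaves the same way in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE strat_names (id INTEGER PRIMARY KEY, strat_name TEXT, concept_id INTEGER);
INSERT INTO strat_names VALUES (1, 'Example A', 10), (2, 'Example B', NULL);
""")


def matching_ids(where_clause):
    """Return ids of strat_names rows that satisfy the given WHERE clause."""
    sql = f"SELECT id FROM strat_names WHERE {where_clause} ORDER BY id"
    return [row[0] for row in conn.execute(sql)]


stale = matching_ids("concept_id = 0")                # NULL = 0 is never true
fixed = matching_ids("concept_id IS NULL")            # matches the migrated orphans
either = matching_ids("COALESCE(concept_id, 0) = 0")  # tolerates both encodings

print(stale, fixed, either)  # [] [2] [2]
```

The COALESCE form is one option if some tables still carry 0 while others carry NULL.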

davenquinn

Some proposed changes to how the code is wrapped together.

Take a look at https://github.com/UW-Macrostrat/macrostrat/blob/main/py-modules/map-integration/macrostrat/map_integration/process/__init__.py for an idea of the proposed structure (this is the root of the map scripts that have already been ported over).

Delete the old rebuild scripts from the v1 directory or move to an archive if you haven't already done that.

davenquinn · 2026-02-27T17:18:14Z

py-modules/cli/macrostrat/cli/subsystems/rebuild/__init__.py

+    }
+
+
+@cli.command()


I'd use the name="all" argument here to maintain parallelism with the old version (macrostrat rebuild all is easy to remember)

davenquinn · 2026-02-27T17:20:17Z

py-modules/cli/macrostrat/cli/subsystems/rebuild/scripts.py

+# ---------------------------------------------------------------------------
+
+
+class Autocomplete:


It looks like most/all of these scripts could just be functions rather than classes, which would be simpler and allow easier integration with Typer
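
A minimal sketch of the function-based layout, using only the standard library. The function bodies, the SCRIPTS registry, and run_scripts are hypothetical placeholders, not the code in this PR.

```python
from typing import Callable, Dict, List, Optional


def autocomplete() -> str:
    # Placeholder body; the real script would run its SQL here.
    return "autocomplete rebuilt"


def lookup_units() -> str:
    # Placeholder body; the real script would run its SQL here.
    return "lookup-units rebuilt"


# Insertion order doubles as the order used when everything runs in sequence.
SCRIPTS: Dict[str, Callable[[], str]] = {
    "autocomplete": autocomplete,
    "lookup-units": lookup_units,
}


def run_scripts(names: Optional[List[str]] = None) -> List[str]:
    """Run the named scripts, or every registered script when none are named."""
    selected = names or list(SCRIPTS)
    unknown = [name for name in selected if name not in SCRIPTS]
    if unknown:
        raise KeyError(f"Unknown rebuild script(s): {', '.join(unknown)}")
    return [SCRIPTS[name]() for name in selected]


print(run_scripts())                  # runs both, in registry order
print(run_scripts(["lookup-units"]))  # runs just one
```

Plain functions carry no per-instance state, so each one can later be registered directly as a CLI command without a wrapper.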

davenquinn · 2026-02-27T17:22:57Z

py-modules/cli/macrostrat/cli/subsystems/rebuild/scripts.py

+
+
+# ---------------------------------------------------------------------------
+# Shared helpers (from lookup_units.py)


Maybe put these in a utils file? They are kind of less important overall.

davenquinn · 2026-02-27T17:24:56Z

py-modules/cli/macrostrat/cli/subsystems/rebuild/__init__.py

+        UnitBoundaries,
+    )
+
+    return {


Consider making all of these scripts CLI commands in their own right, so that Typer's semantics can be used and arguments can be added easily to each.

The scripts command can be retained to run all of them in sequence.

davenquinn · 2026-02-27T17:25:22Z

py-modules/cli/macrostrat/cli/subsystems/rebuild/__init__.py

+from rich.console import Console
+from typer import Option, Typer
+
+cli = Typer(help="Rebuild database tools")


Use no_args_is_help=True
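
Taken together, the suggestions in this review (one command per script, a command named all, and no_args_is_help=True) might look like the sketch below. The command bodies are placeholders, and the wiring is an assumption about how the subsystem could be laid out, not the merged implementation. It requires the third-party typer package.

```python
import typer
from typer.testing import CliRunner

# no_args_is_help=True makes a bare invocation print usage instead of failing.
cli = typer.Typer(help="Rebuild database tools", no_args_is_help=True)


@cli.command(name="autocomplete")
def autocomplete():
    """Rebuild the autocomplete table (placeholder body)."""
    typer.echo("autocomplete rebuilt")


@cli.command(name="lookup-units")
def lookup_units():
    """Rebuild the lookup_units table (placeholder body)."""
    typer.echo("lookup-units rebuilt")


@cli.command(name="all")
def rebuild_all():
    """Run every rebuild script in sequence, mirroring the v1 'rebuild all'."""
    autocomplete()
    lookup_units()


# Exercise the 'all' command in-process, as a test would.
result = CliRunner().invoke(cli, ["all"])
print(result.output)
```

Because each script is its own command, options can later be added to a single script's function signature without touching the others.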

amyfromandi · 2026-02-27T21:15:53Z

In reviewing the other scripts, I found these data parity issues behind the 1-2 row count variances:

2 rows were deleted for a col_id of 0:
one was a test_delete_me unit
the other was Lane Shale, unit_id 42143
Unit 43138 does not show up in the unit_liths table. We can either add a clay lith or ignore it.

amyfromandi · 2026-02-27T21:55:33Z

> I found that the query below (from the autocomplete rebuild script) is stale, so we're not actually joining the strat_name_orphans. In MariaDB, concept_id was 0 for these rows, but after the migration (I suspect the foreign key migration) concept_id became NULL in PostgreSQL. I updated the autocomplete query, but we may want to define the concept_id in the future.
> (SELECT id,
>          CONCAT(strat_name, ' ', rank) AS name,
>          'strat_name_orphans'          AS type,
>          'strat_name'                  AS category
>   FROM strat_names
>   WHERE concept_id is null)

I ran into this same issue for strat_name_footprints. It is fixed, and now there is a 2k variance. This variance arises because there are 3054 duplicate strat_name_ids after the migration from MariaDB. The rebuild script actually fixes this issue.

select (count(strat_name_footprints.strat_name_id) - count(distinct strat_name_footprints.strat_name_id)) from macrostrat.strat_name_footprints
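
A small sketch of what this count measures, run against an in-memory SQLite table with made-up ids: the subtraction gives the number of surplus rows, while a GROUP BY ... HAVING query lists which ids are repeated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE strat_name_footprints (strat_name_id INTEGER);
INSERT INTO strat_name_footprints VALUES (1), (2), (2), (3), (3), (3);
""")

# Same arithmetic as the query above: non-null ids minus distinct ids.
surplus = conn.execute(
    "SELECT count(strat_name_id) - count(DISTINCT strat_name_id) "
    "FROM strat_name_footprints"
).fetchone()[0]

# Which ids are duplicated, and how many copies each has.
duplicated = conn.execute(
    "SELECT strat_name_id, count(*) FROM strat_name_footprints "
    "GROUP BY strat_name_id HAVING count(*) > 1 ORDER BY strat_name_id"
).fetchall()

print(surplus)     # 3
print(duplicated)  # [(2, 2), (3, 3)]
```

Here two ids are duplicated but three rows are surplus, so the two numbers answer different questions.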

davenquinn and others added 6 commits December 17, 2025 01:25

Update autocomplete script for PostgreSQL

30a47a7

Start working on lookup_strat_names

863c144

Updated lookup_strat_names script, not quite finished

6deba42

Update lookup-strat-names script

a596910

Partially working lookup-strat-names script

39b3a38

Format code and sort imports

cf5d25c

davenquinn marked this pull request as draft December 17, 2025 10:39

davenquinn assigned davenquinn and amyfromandi Dec 17, 2025

davenquinn added 2 commits December 17, 2025 10:10

Reverted control flow error

e7232ed

Merge remote-tracking branch 'origin/modernize-rebuild-scripts' into …

478e917

…modernize-rebuild-scripts * origin/modernize-rebuild-scripts: Format code and sort imports

davenquinn and others added 8 commits December 17, 2025 13:37

Small control flow improvements

83fecc3

Format code and sort imports

1756b84

Partially working lookup-strat-names script

d8eae43

Insert strat names

2f0bb33

Greatly speed up strat name insertion

69de255

Mark foreign key scripts as alpha

3910dd1

Improved linking scripts

6b60924

Updated lookup_strat_names script; still quite slow

34038f6

davenquinn mentioned this pull request Dec 22, 2025

Strat names with multiple parents #233

Open

davenquinn added 9 commits December 23, 2025 15:27

Added more queries to notes

0e22d72

Merge remote-tracking branch 'origin/modernize-rebuild-scripts' into …

ed1c4fc

…modernize-rebuild-scripts * origin/modernize-rebuild-scripts: Format code and sort imports

Moved sql

8969bad

Got rid of unused, partially migrated rebuild script

897ee8e

Starting point for lookup_units migration

e0d0551

Lookup units script sort of works

ba8c768

Convert to a cached approach to unit age updates

a275c07

davenquinn added 2 commits February 20, 2026 15:11

Fix submodule reference

c630760

davenquinn force-pushed the modernize-rebuild-scripts branch from 7bd3ea8 to 4baf4ae Compare February 20, 2026 21:23

Format code and sort imports

4c20d79

davenquinn marked this pull request as ready for review February 20, 2026 21:42

davenquinn mentioned this pull request Feb 20, 2026

Error in new version of rebuild scripts #256

Closed

davenquinn added 2 commits February 20, 2026 18:43

Fix #256 with where clause update

0cc9f02

Merge branch 'modernize-rebuild-scripts' of https://github.com/UW-Mac…

5f9bcc1

…rostrat/macrostrat into modernize-rebuild-scripts * 'modernize-rebuild-scripts' of https://github.com/UW-Macrostrat/macrostrat: Format code and sort imports

amyfromandi and others added 4 commits February 25, 2026 17:57

updated mariadb queries to postgresql

35ad677

Format code and sort imports

b7d8b59

added to macrostrat cli

277adb8

Format code and sort imports

5345e1e

updated autocomplete table to include strat_name_orphans

c1f4bd9

davenquinn commented Feb 27, 2026

View reviewed changes

fixed concept_id issue. data variance issue now is duplicates after the mariadb migration. this is fixed after the rebuild script is ran

1a274cd

amyfromandi merged commit f9c0dad into main Feb 27, 2026
2 checks passed

amyfromandi deleted the modernize-rebuild-scripts branch February 27, 2026 22:18
