Modernize rebuild scripts #231

Merged
amyfromandi merged 53 commits into main from modernize-rebuild-scripts
Feb 27, 2026

@davenquinn
Member

@davenquinn davenquinn commented Dec 17, 2025

Modernize stratigraphy rebuild scripts

Scripts

  • autocomplete
  • lookup_strat_names
  • lookup_unit_attrs_api
  • lookup_unit_intervals
  • lookup_units
  • pbdb_matches
  • stats
  • strat_name_footprints
  • unit_boundaries

Tasks for each script

  • Convert the database access from MariaDB to PostgreSQL
  • Clean up control flow and use more modern parameter binding
  • Other fixes for usability

Some of these scripts are outdated, so they may not work without some modification. But we should start with direct SQL translation where possible.

Many run lots of SQL and so might benefit from being run/tested against a local database, or on the cluster.

Our new macrostrat.database library should make the SQL a lot terser and easier to read.

Overall process

  • Convert all scripts to PostgreSQL
  • Check for correct output in macrostrat schema and API results (number of rows, output structure)
  • Move the ~finalized scripts out of the v1 schema

@davenquinn davenquinn marked this pull request as draft December 17, 2025 10:39
…modernize-rebuild-scripts

* origin/modernize-rebuild-scripts:
  Format code and sort imports
@davenquinn
Member Author

In working on the first few scripts (autocomplete and lookup_strat_names), it seems that one of the only major SQL changes needed to migrate from MariaDB to PostgreSQL syntax is a change from the
UPDATE <table> LEFT JOIN <other table> SET ... WHERE ... syntax
to the UPDATE <table> SET ... FROM <other table> WHERE ... <join_condition> syntax. Because the PostgreSQL form only updates rows that satisfy the join condition, this may change which rows are touched (i.e., whether rows with NULLs or no match are included). We'll have to monitor this.
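
A minimal sketch of the NULL concern, run against an in-memory SQLite database rather than PostgreSQL so that it is self-contained. The tables, columns, and values are hypothetical. In PostgreSQL, UPDATE ... FROM only touches rows that find a match; the second statement below imitates that with an EXISTS filter. The first statement uses a correlated subquery, which, like the MariaDB LEFT JOIN form, overwrites unmatched rows with NULL.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
SETUP = """
CREATE TABLE units (id INTEGER PRIMARY KEY, strat_name_id INTEGER, strat_name TEXT);
CREATE TABLE strat_names (id INTEGER PRIMARY KEY, strat_name TEXT);
INSERT INTO units VALUES (1, 10, 'stale'), (2, NULL, 'stale'), (3, 99, 'stale');
INSERT INTO strat_names VALUES (10, 'Example Fm');
"""

# Keeps LEFT JOIN semantics: every row is updated, unmatched rows become NULL.
LEFT_JOIN_LIKE = """
UPDATE units
SET strat_name = (SELECT sn.strat_name FROM strat_names sn
                  WHERE sn.id = units.strat_name_id)
"""

# Imitates UPDATE ... FROM: only rows with a matching strat_names row are updated.
INNER_JOIN_LIKE = LEFT_JOIN_LIKE + """
WHERE EXISTS (SELECT 1 FROM strat_names sn WHERE sn.id = units.strat_name_id)
"""


def run_update(sql: str) -> dict:
    """Apply one UPDATE to a fresh copy of the sample data; return {id: strat_name}."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SETUP)
        conn.execute(sql)
        return dict(conn.execute("SELECT id, strat_name FROM units ORDER BY id"))
    finally:
        conn.close()


left_join_result = run_update(LEFT_JOIN_LIKE)
inner_join_result = run_update(INNER_JOIN_LIKE)
print(left_join_result)
print(inner_join_result)
```

If a script relied on the old form to reset unmatched rows to NULL, the correlated-subquery form (or an explicit reset before the UPDATE ... FROM) would be needed to preserve that behavior.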

…modernize-rebuild-scripts

* origin/modernize-rebuild-scripts:
  Format code and sort imports
…ipts

* origin/main:
  Format code and sort imports
  Reorganize rockd schema code
  Removed unused Rockd subsystem
  Break apart Rockd migrations and validate
  Updated Rockd migrations
  Basic migration is at least planned out
  Use submodule version of tile utils
  Updated ordering of migration checks
  returning tilejson if ANY lines polygons or points are available
  updating postgrest view permissions
  updating map ingest endpoint permissions
  updated sources postgrest endpoit
  Format code and sort imports
  add sources to postgrest
  Format code and sort imports
  fixing api v3
* stratigraphy-ingestion: (190 commits)
  Basic loading works
  Updated logging utils
  Refactor column units preparation
  Starting point for column ingestion
  Improve lithologies
  All tests pass
  Added failing tests for lithology ingestion
  Successfully integrate lithologies
  basic lithology tests pass
  Basic tests of lithology matching
  Starting points for database inserts
  Added basic database file
  Started working with metadata
  Start managing units table
  add a no-op cli
  Updated some typer dependencies
  Updated pyproject toml file
  Format code and sort imports
  Updated tileserver for paleogepgraphy layers
  Remove .idea project files from tracking
  ...
@davenquinn
Member Author

OK, I finished a round of updates on several of the scripts

  • lookup-strat-names
  • lookup-units
  • lookup-unit-attrs-api
  • lookup-unit-intervals (appears to be sort of legacy)
  • autocomplete

Generally, I improved them to be much more streamlined in their approach and avoid loops for row-by-row updates in Python (which was mostly the original approach).

  • The remaining "rebuild" scripts appear to be fairly straightforward to migrate. I will create a new pull request for them.
  • The "match" and "process" scripts are generally more aligned with the mapping process and have been mostly superseded.

* main:
  Format code and sort imports
  Add new commits to submodules
  Format code and sort imports
  updated image ext to .jpg. need to update photo url based on parameter inputs
  Format code and sort imports
  made the convert endpoint match CheckinData from the rockd create-edit-checkin route
  Format code and sort imports
  updating convert endpoint to accept all planar orientations in an observation
@davenquinn davenquinn force-pushed the modernize-rebuild-scripts branch from 7bd3ea8 to 4baf4ae Compare February 20, 2026 21:23
@davenquinn davenquinn marked this pull request as ready for review February 20, 2026 21:42
@davenquinn
Member Author

davenquinn commented Feb 20, 2026

@amyfromandi already found an error in the new version. It's small but emphasizes that we want to proceed carefully here. #256

@amyfromandi
Collaborator

FYI: in v2, macrostrat.units has max(id) = 138596, which is a real unit, but in v1 prod max(id) = 138597, which is ‘test_delete_me’. This test unit was added as a hack because some of the v1 rebuild scripts do not complete the last record. We will need to ensure that the last real unit record shows up without the test_delete_me unit being present, OR just add the test unit back into the database.

@amyfromandi
Collaborator

Since the pbdb script references a pbdb_coll_matrix table that does not exist, and per Shanan's response below, I am skipping the pbdb_matches revision.
Per Shanan: "my sense is that you should skip any build scripts that involve pbdb data for now."

@amyfromandi
Collaborator

Using these queries to diff the strat_name_footprints table.

--1
SELECT
  (SELECT count(*) FROM macrostrat.strat_name_footprints_new)      AS new_n,
  (SELECT count(*) FROM macrostrat.strat_name_footprints)  AS old_n;
--2
SELECT COUNT(*) AS new_minus_old
FROM (
  SELECT strat_name_id, name_no_lith, rank_name, concept_id, concept_names, best_t_age, best_b_age
  FROM macrostrat.strat_name_footprints_new
  EXCEPT
  SELECT strat_name_id, name_no_lith, rank_name, concept_id, concept_names, best_t_age, best_b_age
  FROM macrostrat.strat_name_footprints
) x;
SELECT COUNT(*) AS old_minus_new
FROM (
  SELECT strat_name_id, name_no_lith, rank_name, concept_id, concept_names, best_t_age, best_b_age
  FROM macrostrat.strat_name_footprints
  EXCEPT
  SELECT strat_name_id, name_no_lith, rank_name, concept_id, concept_names, best_t_age, best_b_age
  FROM macrostrat.strat_name_footprints_new
) x;

--3a
SELECT COUNT(*) AS geom_not_equal
FROM macrostrat.strat_name_footprints_new n
JOIN macrostrat.strat_name_footprints o USING (strat_name_id)
WHERE NOT ST_Equals(o.geom, n.geom);
--3b
SELECT COUNT(*) AS geom_binary_diff
FROM macrostrat.strat_name_footprints_new n
JOIN macrostrat.strat_name_footprints o USING (strat_name_id)
WHERE ST_AsBinary(o.geom) <> ST_AsBinary(n.geom);


--4
SELECT
  o.strat_name_id,
  o.concept_id AS old_concept_id,
  n.concept_id AS new_concept_id,
  ST_Equals(o.geom, n.geom) AS geom_equal,
  (ST_AsBinary(o.geom) = ST_AsBinary(n.geom)) AS geom_binary_equal,
  o.best_t_age AS old_best_t_age,
  n.best_t_age AS new_best_t_age,
  o.best_b_age AS old_best_b_age,
  n.best_b_age AS new_best_b_age
FROM macrostrat.strat_name_footprints_new n
JOIN macrostrat.strat_name_footprints o USING (strat_name_id)
WHERE n.concept_id IS DISTINCT FROM o.concept_id
   OR n.concept_names IS DISTINCT FROM o.concept_names
   OR n.best_t_age IS DISTINCT FROM o.best_t_age
   OR n.best_b_age IS DISTINCT FROM o.best_b_age
   OR NOT ST_Equals(o.geom, n.geom)
LIMIT 100;
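
The two EXCEPT queries in step 2 differ only in direction, so they could be generated by one helper. The sketch below is an assumption about how that might look, not code from the PR; it runs on in-memory SQLite with hypothetical table names and rows. Two properties of EXCEPT are worth keeping in mind when reading the counts: NULLs compare as equal, and duplicate rows are collapsed, so this check alone will not reveal duplicated strat_name_id values.

```python
import sqlite3


def except_count(conn, left: str, right: str, columns: list) -> int:
    """Count distinct rows (over `columns`) that are in `left` but not in `right`."""
    cols = ", ".join(columns)
    # Table and column names are interpolated, so they must come from trusted code.
    sql = (
        "SELECT count(*) FROM ("
        f"SELECT {cols} FROM {left} EXCEPT SELECT {cols} FROM {right}"
        ") AS x"
    )
    return conn.execute(sql).fetchone()[0]


# Hypothetical sample data: the old table holds a duplicate row, the new one an extra id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE footprints_old (strat_name_id INTEGER, concept_id INTEGER, best_t_age REAL);
CREATE TABLE footprints_new (strat_name_id INTEGER, concept_id INTEGER, best_t_age REAL);
INSERT INTO footprints_old VALUES (1, 5, 10.0), (2, NULL, 20.0), (2, NULL, 20.0);
INSERT INTO footprints_new VALUES (1, 5, 10.0), (2, NULL, 20.0), (3, 7, 30.0);
""")

columns = ["strat_name_id", "concept_id", "best_t_age"]
new_minus_old = except_count(conn, "footprints_new", "footprints_old", columns)
old_minus_new = except_count(conn, "footprints_old", "footprints_new", columns)
print(new_minus_old, old_minus_new)
```

In this sample the duplicated row for strat_name_id 2 produces no difference in either direction, which is why a plain row count (step 1) is still needed alongside the EXCEPT checks.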

@amyfromandi
Collaborator

amyfromandi commented Feb 25, 2026

I tried optimizing the strat_name_footprints query (even though it was already written for PostgreSQL), but only knocked 3 minutes off the execution time. The query takes about 22-25 minutes to run because geometries are calculated for ~54k rows of data. The query works, but maybe there is another way to make clauses like SELECT COALESCE(ST_Union(ST_MakeValid(geom)), 'SRID=4326;POLYGON EMPTY') as geom execute faster.
Need to test https://staging.macrostrat.org/api/v2/geologic_units/burwell which uses the strat_name_footprints table.

@amyfromandi
Collaborator

amyfromandi commented Feb 26, 2026

I refactored the rebuild scripts into the command-line interface. Below is an example of how to list the available scripts:

afromandi@Amys-MacBook-Air macrostrat % macrostrat rebuild scripts --list
  autocomplete
  lookup-strat-names
  lookup-unit-attrs-api
  lookup-unit-intervals
  lookup-units
  stats
  strat-name-footprints
  unit-boundaries

I ran all of the rebuild scripts in staging; the diffs are below. We need to review why autocomplete has a negative diff. I also wonder if the scripts that have a -1 or -2 diff could be due to #231 (comment).

dataset                staging  v2 (production)  diff (staging − v2)
autocomplete            55,538           58,303               -2,765
lookup_strat_names      51,229           51,229                    0
lookup_unit_attrs_api  133,417          133,418                   -1
lookup_unit_intervals  133,417          133,419                   -2
lookup_units           133,417          133,419                   -2
stats                       11                8                    3
strat_name_footprints   54,293           48,536                5,757
unit_boundaries        144,251          144,250                    1
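
As a small sketch of how this comparison could be automated, the snippet below recomputes the diff column from the counts above and flags datasets whose difference exceeds a tolerance. The tolerance of 2 is an assumption chosen to match the -1/-2 differences tentatively attributed to test rows; the function names are hypothetical.

```python
# Row counts copied from the table above: (staging, v2 production).
ROW_COUNTS = {
    "autocomplete": (55_538, 58_303),
    "lookup_strat_names": (51_229, 51_229),
    "lookup_unit_attrs_api": (133_417, 133_418),
    "lookup_unit_intervals": (133_417, 133_419),
    "lookup_units": (133_417, 133_419),
    "stats": (11, 8),
    "strat_name_footprints": (54_293, 48_536),
    "unit_boundaries": (144_251, 144_250),
}


def row_diffs(counts: dict) -> dict:
    """Return staging minus production for each dataset."""
    return {name: staging - prod for name, (staging, prod) in counts.items()}


def needs_review(diffs: dict, tolerance: int = 2) -> list:
    """List datasets whose absolute difference is larger than `tolerance`."""
    return sorted(name for name, diff in diffs.items() if abs(diff) > tolerance)


diffs = row_diffs(ROW_COUNTS)
flagged = needs_review(diffs)
print(diffs)
print(flagged)
```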

@davenquinn
Member Author

I tried optimizing the strat_name_footprints query (even though it was already written for PostgreSQL), but only knocked 3 minutes off the execution time. The query takes about 22-25 minutes to run because geometries are calculated for ~54k rows of data. The query works, but maybe there is another way to make clauses like SELECT COALESCE(ST_Union(ST_MakeValid(geom)), 'SRID=4326;POLYGON EMPTY') as geom execute faster. Need to test https://staging.macrostrat.org/api/v2/geologic_units/burwell which uses the strat_name_footprints table.

We don't have to do this now, but I expect that if we pre-validate the geometry column and pre-populate empty polygons where necessary, we can add a spatial index that will improve things. There are other ways to optimize this using topological relationships as well.

@davenquinn
Member Author

Also, it would be nice if there was a macrostrat rebuild all or macrostrat rebuild --all as there was in v1.

@amyfromandi
Collaborator

Also, it would be nice if there was a macrostrat rebuild all or macrostrat rebuild --all as there was in v1.

Executing macrostrat rebuild scripts without any additional arguments runs all of the scripts at once.

@amyfromandi
Collaborator

I found that the query below (from the autocomplete rebuild script) is a stale query, so we're actually not joining the strat_name_orphans. In MariaDB the concept_id was 0, but after the migration (I'm thinking a foreign key migration) the concept_id became NULL in PostgreSQL. I updated the autocomplete query, but we may want to define the concept_id in the future.

(SELECT id,
         CONCAT(strat_name, ' ', rank) AS name,
         'strat_name_orphans'          AS type,
         'strat_name'                  AS category
  FROM strat_names
  WHERE concept_id is null)
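
A minimal sketch of why such a filter goes stale, assuming the earlier version selected orphans with concept_id = 0 (the comment implies this but does not show the old query). It uses in-memory SQLite with hypothetical rows; the NULL comparison rules shown are the same in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE strat_names (id INTEGER PRIMARY KEY, strat_name TEXT, concept_id INTEGER);
-- Hypothetical rows: after the migration, orphans carry NULL instead of 0.
INSERT INTO strat_names VALUES (1, 'Alpha', 12), (2, 'Beta', NULL), (3, 'Gamma', NULL);
""")


def orphan_ids(where_clause: str) -> list:
    """Return ids of strat_names rows matching a trusted, hard-coded WHERE clause."""
    sql = f"SELECT id FROM strat_names WHERE {where_clause} ORDER BY id"
    return [row[0] for row in conn.execute(sql)]


stale = orphan_ids("concept_id = 0")                 # NULL = 0 is never true
fixed = orphan_ids("concept_id IS NULL")             # matches the migrated data
tolerant = orphan_ids("COALESCE(concept_id, 0) = 0")  # accepts either convention
print(stale, fixed, tolerant)
```

The COALESCE form is one option if both conventions might coexist during the transition; otherwise IS NULL states the intent more directly.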

Member Author

@davenquinn davenquinn left a comment

Some proposed changes to how the code is wrapped together.

Take a look at https://github.com/UW-Macrostrat/macrostrat/blob/main/py-modules/map-integration/macrostrat/map_integration/process/__init__.py for an idea of the proposed structure (this is the root of the map scripts that have already been ported over).

Delete the old rebuild scripts from the v1 directory or move to an archive if you haven't already done that.

}


@cli.command()
Member Author

I'd use the name="all" argument here to maintain parallelism with the old version (macrostrat rebuild all is easy to remember)

# ---------------------------------------------------------------------------


class Autocomplete:
Member Author

It looks like most/all of these scripts could just be functions rather than classes, which would be simpler and allow easier integration with Typer



# ---------------------------------------------------------------------------
# Shared helpers (from lookup_units.py)
Member Author

Maybe put these in a utils file? They are kind of less important overall.

UnitBoundaries,
)

return {
Member Author

Consider making all of these scripts CLI commands in their own right, so that Typer's semantics can be used and arguments can be added easily to each.

The scripts command can be retained to run all of them in sequence.
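
One way to picture the suggested structure, sketched with the standard library only so it runs anywhere: each rebuild script is a plain function registered under its CLI name, and a single helper runs them all in sequence. The registry, decorator, and placeholder bodies are hypothetical and not taken from the PR. With Typer, the same functions could presumably be exposed by decorating them with the app's command decorator, with the run-everything entry point registered under name="all" as suggested in the review.

```python
from typing import Callable, Dict, List

# Registry of rebuild steps, kept in the order they should run.
REBUILD_SCRIPTS: Dict[str, Callable[[], str]] = {}


def rebuild_script(name: str):
    """Register a plain function as a named rebuild script."""
    def decorator(func: Callable[[], str]) -> Callable[[], str]:
        REBUILD_SCRIPTS[name] = func
        return func
    return decorator


@rebuild_script("lookup-strat-names")
def lookup_strat_names() -> str:
    # Placeholder body; a real script would execute its SQL here.
    return "lookup-strat-names rebuilt"


@rebuild_script("autocomplete")
def autocomplete() -> str:
    # Placeholder body; a real script would execute its SQL here.
    return "autocomplete rebuilt"


def run(names: List[str]) -> List[str]:
    """Run the named scripts in the order given, rejecting unknown names up front."""
    unknown = [n for n in names if n not in REBUILD_SCRIPTS]
    if unknown:
        raise KeyError(f"Unknown rebuild script(s): {', '.join(unknown)}")
    return [REBUILD_SCRIPTS[n]() for n in names]


def run_all() -> List[str]:
    """Run every registered script in registration order."""
    return run(list(REBUILD_SCRIPTS))


print(run_all())
```

Keeping the scripts as functions means each one can later grow its own arguments without changing how the run-all path discovers it.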

from rich.console import Console
from typer import Option, Typer

cli = Typer(help="Rebuild database tools")
Member Author

Use no_args_is_help=True

@amyfromandi
Collaborator

amyfromandi commented Feb 27, 2026

In reviewing the other scripts, I found these data parity issues behind the 1-2 row count variances:

  • 2 rows deleted for a col_id of 0:
    one was test_delete_me,
    the other was Lane Shale, unit_id 42143.
  • Unit 43138 does not show up in the unit_liths table. We can either add a clay lith or ignore it.

@amyfromandi
Collaborator

I found that the query below (from the autocomplete rebuild script) is a stale query, so we're actually not joining the strat_name_orphans. In MariaDB the concept_id was 0, but after the migration (I'm thinking a foreign key migration) the concept_id became NULL in PostgreSQL. I updated the autocomplete query, but we may want to define the concept_id in the future.

(SELECT id,
         CONCAT(strat_name, ' ', rank) AS name,
         'strat_name_orphans'          AS type,
         'strat_name'                  AS category
  FROM strat_names
  WHERE concept_id is null)

Ran into this same issue for strat_name_footprints. Fixed, and now there is a 2k variance. This variance is because there are 3054 duplicate strat_name_ids after the migration from MariaDB. The rebuild script actually fixes this issue.

select (count(strat_name_footprints.strat_name_id) - count(distinct strat_name_footprints.strat_name_id)) from macrostrat.strat_name_footprints
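
A small sketch of what this count measures, on in-memory SQLite with hypothetical rows. The expression count(x) - count(DISTINCT x) gives the number of surplus rows, not the number of ids that are duplicated, so a GROUP BY ... HAVING query is a useful companion for seeing which ids are affected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE strat_name_footprints (strat_name_id INTEGER, name_no_lith TEXT);
-- Hypothetical rows: id 2 appears twice, id 3 appears three times.
INSERT INTO strat_name_footprints VALUES
  (1, 'Alpha'), (2, 'Beta'), (2, 'Beta'),
  (3, 'Gamma'), (3, 'Gamma'), (3, 'Gamma');
""")

# Same shape as the query above: total rows minus distinct ids.
surplus_rows = conn.execute(
    "SELECT count(strat_name_id) - count(DISTINCT strat_name_id) "
    "FROM strat_name_footprints"
).fetchone()[0]

# Which ids are duplicated, and how many copies each has.
duplicated = conn.execute("""
    SELECT strat_name_id, count(*) AS n
    FROM strat_name_footprints
    GROUP BY strat_name_id
    HAVING count(*) > 1
    ORDER BY strat_name_id
""").fetchall()

print(surplus_rows, duplicated)
```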

…he mariadb migration. this is fixed after the rebuild script is ran
@amyfromandi amyfromandi merged commit f9c0dad into main Feb 27, 2026
2 checks passed
@amyfromandi amyfromandi deleted the modernize-rebuild-scripts branch February 27, 2026 22:18