Add CdmGenericLoader: database-to-OMOP CDM ETL via JSON mapping and pyomop-migrate CLI #260

Merged
dermatologist merged 9 commits into develop from copilot/add-generic-database-loader on Feb 23, 2026

Contributor

Copilot AI commented Feb 19, 2026

CdmCsvLoader is coupled to a flat FHIR-derived CSV. This PR adds CdmGenericLoader, which reads from any async SQLAlchemy source database (SQLite/Postgres/MySQL) and loads into an OMOP CDM target, guided by the same JSON mapping convention.

Core implementation

  • src/pyomop/migrate/pyomop_migrate.py: CdmGenericLoader class; create_source_engine(), load_mapping(), build_source_url(), and extract_schema_to_markdown() helpers; and the pyomop-migrate CLI entry point
  • src/pyomop/migrate/__init__.py: sub-package re-exporting all public symbols
  • src/pyomop/generic_loader.py: thin backward-compatibility shim (deprecated; re-exports from pyomop.migrate)
  • src/pyomop/mapping.generic.example.json: annotated example mapping (patients → person, encounters → visit_occurrence, diagnoses → condition_occurrence, lab results → measurement, medications → drug_exposure)

Mapping format

Extends the existing mapping.default.json convention with a source_table key per entry. Filters translate to SQL WHERE clauses (pushed to the source DB, not pandas masks).

{
  "tables": [
    {
      "source_table": "patients",
      "name": "person",
      "filters": [{"column": "active", "equals": 1}],
      "columns": {
        "person_id":           "id",
        "gender_source_value": "gender",
        "gender_concept_id":   {"const": 0},
        "year_of_birth":       {"const": 0},
        "race_concept_id":     {"const": 0},
        "ethnicity_concept_id":{"const": 0}
      }
    }
  ]
}
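
As a rough illustration of how a mapping entry like the one above could become a filtered query, the sketch below builds a parameterized SELECT from the filters list and applies the columns mapping row by row. It uses the standard-library sqlite3 module instead of async SQLAlchemy so that it stays self-contained, and the helper names build_select and map_row are hypothetical, not part of pyomop.

```python
import sqlite3

# One mapping entry in the format shown above (trimmed to three columns).
ENTRY = {
    "source_table": "patients",
    "name": "person",
    "filters": [{"column": "active", "equals": 1}],
    "columns": {
        "person_id": "id",
        "gender_source_value": "gender",
        "gender_concept_id": {"const": 0},
    },
}


def build_select(entry):
    """Return (sql, params) with each filter as a bound WHERE condition."""
    sql = f'SELECT * FROM "{entry["source_table"]}"'
    filters = entry.get("filters", [])
    if filters:
        conditions = " AND ".join(f'"{f["column"]}" = ?' for f in filters)
        sql += f" WHERE {conditions}"
    return sql, [f["equals"] for f in filters]


def map_row(entry, row):
    """Map one source row to target columns; dict specs carry a constant."""
    out = {}
    for target, spec in entry["columns"].items():
        out[target] = spec["const"] if isinstance(spec, dict) else row[spec]
    return out


# Tiny in-memory source database for demonstration only.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE patients (id INTEGER, gender TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(1, "F", 1), (2, "M", 0), (3, "M", 1)],
)

sql, params = build_select(ENTRY)
rows = [map_row(ENTRY, r) for r in conn.execute(sql, params)]
print(sql)
print(rows)
```

Filter values are bound as parameters, so the source database does the filtering; table and column names come straight from the mapping file and are assumed to be trusted here.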

The same 5 post-load steps as CdmCsvLoader run afterwards, including person_id FK normalisation, birth-date backfill, gender concept mapping, and concept-code lookups. A missing source or target table produces a warning and is skipped rather than aborting the run.
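
The warn-and-skip behaviour for missing tables might look roughly like the following. This is only a sketch against sqlite3 (the real loader works through SQLAlchemy), and table_exists and plan_tables are hypothetical names chosen for illustration.

```python
import logging
import sqlite3

logger = logging.getLogger("migrate_sketch")


def table_exists(conn, table):
    """Check the SQLite catalog for a table with this exact name."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


def plan_tables(src, dst, entries):
    """Keep entries whose source and target tables exist; warn on the rest."""
    runnable = []
    for entry in entries:
        if not table_exists(src, entry["source_table"]):
            logger.warning("Source table %r not found; skipping", entry["source_table"])
            continue
        if not table_exists(dst, entry["name"]):
            logger.warning("Target table %r not found; skipping", entry["name"])
            continue
        runnable.append(entry)
    return runnable


src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.execute("CREATE TABLE patients (id INTEGER)")
dst.execute("CREATE TABLE person (person_id INTEGER)")

entries = [
    {"source_table": "patients", "name": "person"},
    {"source_table": "encounters", "name": "visit_occurrence"},  # missing in src
]
runnable = plan_tables(src, dst, entries)
print([e["name"] for e in runnable])
```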

pyomop-migrate CLI script

All migration functionality lives in a dedicated pyomop-migrate entry point, keeping main.py (pyomop) unchanged. Source-database connection details use --src-* options; the target OMOP CDM database uses the standard connection options.

# Migrate SQLite source → SQLite OMOP target
pyomop-migrate --migrate \
  --src-dbtype sqlite --src-name hospital.sqlite \
  --dbtype sqlite --name omop.sqlite \
  --mapping ehr_to_omop.json

# PostgreSQL source → PostgreSQL OMOP target
pyomop-migrate --migrate \
  --src-dbtype pgsql --src-host srchost --src-user reader --src-pw secret --src-name ehr \
  --dbtype pgsql --host omophost --user writer --pw secret --name omop \
  --mapping ehr_to_omop.json --batch-size 500

# Extract source database schema to Markdown
pyomop-migrate --extract-schema \
  --src-dbtype sqlite --src-name hospital.sqlite \
  --schema-output hospital_schema.md

--extract-schema command

Introspects the source database and writes a Markdown document with full schema information (table names, column names, data types, nullable flags, PK/FK relationships, and row counts). This output is designed to be fed to AI agents to auto-generate a mapping JSON file.
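
To give a feel for the kind of introspection involved, here is a minimal sketch that renders table names, columns, types, nullable flags, PK/FK information, and row counts as Markdown. It is SQLite-only and relies on PRAGMA statements, whereas the PR's extract_schema_to_markdown() works on a SQLAlchemy engine; the function name schema_to_markdown and the exact output layout are assumptions.

```python
import sqlite3


def schema_to_markdown(conn):
    """Render tables, columns, PK/FK info and row counts as Markdown text."""
    lines = ["# Source schema", ""]
    tables = [
        r[0]
        for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' "
            "AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    for table in tables:
        count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        lines += [
            f"## {table} (rows: {count})",
            "",
            "| Column | Type | Nullable | PK |",
            "|---|---|---|---|",
        ]
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        for _, name, ctype, notnull, _, pk in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            nullable = "no" if notnull else "yes"
            lines.append(f"| {name} | {ctype} | {nullable} | {'yes' if pk else ''} |")
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        fks = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        if fks:
            lines.append("")
            lines += [f"- FK: {table}.{fk[3]} -> {fk[2]}.{fk[4]}" for fk in fks]
        lines.append("")
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, gender TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE encounters ("
    "id INTEGER PRIMARY KEY, patient_id INTEGER REFERENCES patients(id))"
)
conn.execute("INSERT INTO patients VALUES (1, 'F')")

markdown = schema_to_markdown(conn)
print(markdown)
```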

Environment variable support

All source database connection parameters can be supplied via environment variables instead of CLI flags, keeping credentials out of shell history:

Environment variable    CLI option
SRC_DB_HOST             --src-host
SRC_DB_PORT             --src-port
SRC_DB_USER             --src-user
SRC_DB_PASSWORD         --src-pw
SRC_DB_NAME             --src-name

This applies to both --migrate and --extract-schema.
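
One way such a fallback could work is sketched below: explicit values win, and the SRC_DB_* variables fill in whatever is missing. The precedence order, the function names resolve_source_params and sketch_source_url, and the driver names in the URLs (aiosqlite, asyncpg) are assumptions for illustration; the actual build_source_url() signature is not shown in this PR description.

```python
import os
from urllib.parse import quote_plus

# Environment variable consulted when the matching CLI value is missing.
ENV_FALLBACKS = {
    "host": "SRC_DB_HOST",
    "port": "SRC_DB_PORT",
    "user": "SRC_DB_USER",
    "password": "SRC_DB_PASSWORD",
    "name": "SRC_DB_NAME",
}


def resolve_source_params(cli_values, environ=None):
    """Prefer explicit CLI values; fall back to the SRC_DB_* variables."""
    environ = os.environ if environ is None else environ
    return {
        key: cli_values.get(key) or environ.get(var)
        for key, var in ENV_FALLBACKS.items()
    }


def sketch_source_url(dbtype, params):
    """Assemble a SQLAlchemy-style async URL for two illustrative backends."""
    if dbtype == "sqlite":
        return f"sqlite+aiosqlite:///{params['name']}"
    if dbtype == "pgsql":
        # Percent-encode the password so special characters survive in the URL.
        auth = f"{params['user']}:{quote_plus(params['password'])}"
        port = params["port"] or "5432"
        return f"postgresql+asyncpg://{auth}@{params['host']}:{port}/{params['name']}"
    raise ValueError(f"Unsupported source dbtype: {dbtype}")


# A plain dict stands in for os.environ to keep the example deterministic.
env = {
    "SRC_DB_HOST": "db.hospital.org",
    "SRC_DB_USER": "readonly",
    "SRC_DB_PASSWORD": "s3cret!",
    "SRC_DB_NAME": "ehr_db",
}
params = resolve_source_params({}, env)
url = sketch_source_url("pgsql", params)
print(url)
```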

Tests & docs

  • tests/test_pyomop_migrate.py — 26 unit tests covering row loading, gender/birth backfill, SQL filters, missing-table skip, multi-table mapping, batch correctness, CLI migrate, schema extraction (content, PK/FK, CLI end-to-end), URL building (all backends + env var override), and error cases
  • docs/pyomop_migrate.md — full usage, API reference, CLI option tables with examples, schema extraction section, and environment variable reference
  • notes/pyomop_migrate.md — design decisions, env var security model, schema extraction design, and future work
  • README.md — new quick-start section, schema extraction example, env var note, and updated command-line reference
  • __init__.py / mkdocs.yml / pyproject.toml — export, navigation, and entry point wired up
Original prompt

This section details the original issue you should resolve

<issue_title>Generate a generic database loader for OMOP CDM.</issue_title>
<issue_description># Generate a generic database loader for OMOP CDM.

Description

  • Currently, loader.py loads a csv file generated from FHIR into any database with the OMOP CDM schema, using a mapping JSON file to map the CSV columns to the database columns.
  • Read loader.py carefully and understand how it works.
  • Now create a generic_loader.py that can connect to any database using SQLAlchemy and load the tables into the OMOP CDM schema, using a mapping JSON file. In other words, the source can be any database with any schema, and the target is any database with the OMOP CDM schema. The mapping JSON file will specify how to map the source database tables and columns to the target database tables and columns.
  • The generic_loader.py should be able to handle different types of databases (e.g. MySQL, PostgreSQL, SQLite) and should be able to load data in batches for efficiency.
  • The mapping JSON file should follow the same structure as the current mapping.default.json file.
  • The generic_loader.py should also have error handling and logging to track the loading process and any issues that arise.
  • Write unit tests for the generic_loader.py to ensure it works correctly.
  • Provide documentation on how to use the generic_loader.py, including how to create the mapping JSON file and how to run the loader. Brief instructions should go in the README.md file, and more detailed instructions should go in a separate documentation file, docs/generic_loader.md. Implementation details and future plans should go into notes/generic_loader.md.
  • Check if the new generic_loader.py aligns in functionality with the existing loader.py
  • Do a final pass to ensure the code is clean, of good quality, well-documented, and follows best practices.</issue_description>

Comments on the Issue (you are @copilot in this section)


Co-authored-by: dermatologist <6259644+dermatologist@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add generic database loader for OMOP CDM" to "Add CdmGenericLoader: database-to-OMOP CDM ETL via JSON mapping" on Feb 19, 2026
Copilot AI requested a review from dermatologist February 19, 2026 13:49
@dermatologist
Owner

@copilot
Provide a --migrate command line option that will migrate data from the source database to the target database. All options for source and target database connections should be provided as command-line arguments. The mapping JSON file should also be provided as a command-line argument. The loader should connect to the source database, read the data according to the mapping, and then connect to the target database and load the data into the OMOP CDM schema. Update documentation as needed. Do a final pass to ensure the code is clean, of good quality, well-documented, and follows best practices.

Co-authored-by: dermatologist <6259644+dermatologist@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 19, 2026

@copilot
Provide a --migrate command line option that will migrate data from the source database to the target database. All options for source and target database connections shoul...

Done in dc34df0. Added --migrate to the pyomop CLI with:

  • --src-dbtype / --src-host / --src-port / --src-user / --src-pw / --src-name / --src-schema for the source database connection
  • --mapping FILE (-m) for the JSON mapping file (required with --migrate)
  • --batch-size (default 1000) for INSERT batch tuning
  • Existing --dbtype / --host / --port / --user / --pw / --name / --schema for the OMOP CDM target
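
For intuition about what --batch-size tunes, the sketch below copies rows from a source to a target in fixed-size chunks so that only one batch is held in memory at a time. It uses sqlite3 with fetchmany/executemany rather than the loader's async SQLAlchemy code, and copy_in_batches is a hypothetical helper, not the PR's implementation.

```python
import sqlite3


def copy_in_batches(src, dst, select_sql, insert_sql, batch_size=1000):
    """Stream rows from src to dst, inserting batch_size rows at a time."""
    cursor = src.execute(select_sql)
    total = 0
    batches = 0
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        dst.executemany(insert_sql, batch)
        dst.commit()  # one commit per batch keeps transactions bounded
        total += len(batch)
        batches += 1
    return total, batches


src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.execute("CREATE TABLE patients (id INTEGER, gender TEXT)")
src.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [(i, "F" if i % 2 else "M") for i in range(2500)],
)
dst.execute("CREATE TABLE person (person_id INTEGER, gender_source_value TEXT)")

total, batches = copy_in_batches(
    src,
    dst,
    "SELECT id, gender FROM patients",
    "INSERT INTO person (person_id, gender_source_value) VALUES (?, ?)",
    batch_size=1000,
)
print(total, batches)
```

Smaller batches lower peak memory and shorten each transaction at the cost of more round trips, which is presumably why the second CLI example overrides the default with 500.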

Example usage:

# SQLite → SQLite
pyomop --migrate \
  --src-dbtype sqlite --src-name hospital.sqlite \
  --dbtype sqlite --name omop.sqlite \
  --mapping ehr_to_omop.json

# PostgreSQL → PostgreSQL
pyomop --migrate \
  --src-dbtype pgsql --src-host srchost --src-user reader --src-pw secret --src-name ehr \
  --dbtype pgsql --host omophost --user writer --pw secret --name omop \
  --mapping ehr_to_omop.json --batch-size 500

Full CLI reference and examples added to docs/generic_loader.md and README.md. Three new CLI tests cover the end-to-end flow, the missing-mapping error, and the unsupported source DB type error — all 13 tests pass.

Copilot AI changed the title from "Add CdmGenericLoader: database-to-OMOP CDM ETL via JSON mapping" to "Add CdmGenericLoader: database-to-OMOP CDM ETL via JSON mapping and --migrate CLI" on Feb 19, 2026
@codecov

codecov bot commented Feb 21, 2026

Codecov Report

❌ Patch coverage is 64.96914% with 227 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.3%. Comparing base (e9265ad) to head (b69cd5e).
⚠️ Report is 1 commit behind head on develop.

Files with missing lines                 Patch %   Lines
src/pyomop/migrate/pyomop_migrate.py     64.8%     187 Missing and 40 partials ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##           develop    #260     +/-   ##
=========================================
- Coverage     77.1%   75.3%   -1.8%     
=========================================
  Files           19      21      +2     
  Lines         3334    3982    +648     
  Branches       268     384    +116     
=========================================
+ Hits          2572    3002    +430     
- Misses         667     836    +169     
- Partials        95     144     +49     

@dermatologist
Owner

@copilot

  • Add a new feature to generic_loader.py that will introspect the source database and generate a markdown file with schema information, including table names, column names, data types and any relationships between tables. Also include PK/FK information if available. This will help AI agents to understand the source database schema and generate the appropriate mapping JSON file for the migration process.
  • Add an --extract-schema command line option that will trigger this functionality. The generated markdown file should be saved to a specified location provided as a command-line argument. All options for source database connection should be provided as command-line arguments.
  • Update generic_loader.py to look for environment variables for database connection parameters (e.g. DB_HOST, DB_PORT, DB_USER, DB_PASSWORD, DB_NAME) first, as an alternative to command line arguments for better security and flexibility. Update the documentation to reflect this new option for database connection parameters. This should apply to both the --migrate and --extract-schema functionalities.
  • Generate appropriate unit tests for this new feature to ensure it works correctly. Improve coverage for the existing code as well.
  • Provide documentation on how to use the --extract-schema option, including how to specify the source database connection and the output location for the generated markdown file in the README.md file and in a separate documentation file docs/generic_loader.md. Implementation details and future plans should go into notes/generic_loader.md.
  • Do a final pass to ensure the code is clean, of good quality, well-documented, and follows best practices.

Co-authored-by: dermatologist <6259644+dermatologist@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 21, 2026

@copilot

  • Add a new feature to generic_loader.py that will introspect the source database and generate a markdown file with schema information, including table names, column names, data types and an...

Done in e3a0515. Here's what was added:

Schema extraction (--extract-schema)

  • extract_schema_to_markdown(engine, output_path) in generic_loader.py introspects the source DB and writes a Markdown file containing a summary table (table name, row count, primary keys) and per-table sections with column name, type, nullable, default, PK/FK annotations, and a foreign key relationship table.
  • --extract-schema / --schema-output CLI options added to pyomop:
pyomop --extract-schema \
  --src-dbtype sqlite --src-name hospital.sqlite \
  --schema-output hospital_schema.md

Environment variable support

  • build_source_url() helper reads SRC_DB_HOST, SRC_DB_PORT, SRC_DB_USER, SRC_DB_PASSWORD, SRC_DB_NAME as fallbacks, so credentials never need to appear in shell history.
  • Click envvar= is also set on each --src-* option so they're resolved at the CLI level too.
  • Both --migrate and --extract-schema benefit from this.
export SRC_DB_HOST=db.hospital.org
export SRC_DB_USER=readonly
export SRC_DB_PASSWORD=secret
export SRC_DB_NAME=ehr_db

pyomop --extract-schema --src-dbtype pgsql --schema-output ehr_schema.md

Tests: 9 new tests covering schema Markdown content, PK/FK rendering, the --extract-schema CLI end-to-end, bad dbtype error, and all URL-building paths including env var override. Total: 22 → 26 tests, all passing.

Copilot AI changed the title from "Add CdmGenericLoader: database-to-OMOP CDM ETL via JSON mapping and --migrate CLI" to "Add CdmGenericLoader: database-to-OMOP CDM ETL via JSON mapping, --migrate and --extract-schema CLI" on Feb 21, 2026
@dermatologist
Owner

@copilot

  • Create a new module in src/pyomop/migrate and move the generic_loader.py to that module. This will help to organize the code better and separate the migration functionality.
  • Rename generic_loader.py to pyomop_migrate.py.
  • Create a new command-line script called pyomop-migrate and move the --migrate and --extract-schema command line arguments from main.py to this script.
  • The main.py should be reverted to its original state before the migration functionality was added (revert main.py to the state in the develop branch).
  • The new commands will be pyomop-migrate --migrate and pyomop-migrate --extract-schema.
  • Rename generic_loader.md to pyomop_migrate.md to reflect the new module and command-line script name. Change links in mkdocs.yml to the new file name as well.
  • Update the documentation to reflect the new module and command-line script in docs/pyomop_migrate.md and notes/pyomop_migrate.md. Make required name changes only in README.md to reflect the new command-line script name.
  • Update unit tests as required.
  • Do a final pass to ensure the code is clean, of good quality, well-documented, and follows best practices.

Co-authored-by: dermatologist <6259644+dermatologist@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 22, 2026

@copilot

  • Create a new module in src/pyomop/migrate and move the generic_loader.py to that module. This will help to organize the code better and separate the migration functionality.
  • Rename gene...

Done in 117a1b5. Here's what was changed:

New module src/pyomop/migrate/

  • pyomop_migrate.py — the full ETL code (previously generic_loader.py) plus the new migrate_cli Click command and main() entry point
  • __init__.py — re-exports CdmGenericLoader, build_source_url, create_source_engine, extract_schema_to_markdown, load_mapping

pyomop-migrate CLI script
Registered as an entry point in pyproject.toml. Commands:

pyomop-migrate --migrate --src-dbtype sqlite --src-name hospital.sqlite \
  --dbtype sqlite --name omop.sqlite --mapping mapping.json

pyomop-migrate --extract-schema --src-dbtype sqlite --src-name hospital.sqlite \
  --schema-output schema.md

main.py reverted to its original state (no --migrate/--extract-schema options).

src/pyomop/generic_loader.py replaced with a thin deprecation shim that re-exports from pyomop.migrate.pyomop_migrate to preserve backward compatibility.

Renamed files:

  • docs/generic_loader.md → docs/pyomop_migrate.md
  • notes/generic_loader.md → notes/pyomop_migrate.md
  • tests/test_generic_loader.py → tests/test_pyomop_migrate.py (imports updated to use pyomop.migrate.pyomop_migrate)

mkdocs.yml nav, __init__.py import, and README.md command references all updated accordingly. All 26 tests pass.

Copilot AI changed the title from "Add CdmGenericLoader: database-to-OMOP CDM ETL via JSON mapping, --migrate and --extract-schema CLI" to "Add CdmGenericLoader: database-to-OMOP CDM ETL via JSON mapping and pyomop-migrate CLI" on Feb 22, 2026
@dermatologist dermatologist marked this pull request as ready for review February 22, 2026 23:24
@dermatologist dermatologist merged commit 212437e into develop Feb 23, 2026
2 of 3 checks passed

Development

Successfully merging this pull request may close these issues.

Generate a generic database loader for OMOP CDM.
