Skip to content

hms-dbmi/picsure-dictionary-etl

Repository files navigation

Code Formatting

Before contributing code, please set up our git hook:

  cp code-formatting/pre-commit.sh .git/hooks/pre-commit
  chmod +x .git/hooks/pre-commit

To skip formatting on a block of code, wrap in spotless:off, spotless:on comments

Uploading a TSV file for /anvil/upload-tsv data ingest

  curl -X POST \
      -H "Content-Type: text/plain" \
      --data-binary @path/to/your/file.tsv \
      http://localhost:8080/anvil/upload-tsv

Local Database for Development

To set up a local development database, follow these steps:

  1. Run the docker-compose.postgres.yml file to start the required services:
   docker-compose -f docker-compose.postgres.yml up --build -d
  1. Execute the db/schema.sql script to create the dict schema and all necessary database tables.

  2. (Optional) If you have a dump file generated with pg_dump, you can restore it using pg_restore. If the backup is from our BDC database, the following command will work:

   pg_restore \
      --host="localhost" \
      --port="5432" \
      --username="username" \
      --dbname="dictionary_db" \
      --jobs=10 \
      --verbose \
      --no-owner \
      ./dictionary_db

You will be prompted to enter the database password after executing this command.

  1. Now you can run the Spring Boot application with dev as the active profile.

Installing pg_restore and pg_dump on Mac

If you do not have pg_restore and pg_dump, you can install them by following these steps:

  1. Install postgresql@16, which includes both pg_restore and pg_dump:
  brew install postgresql@16
  1. Link the installed PostgreSQL version:
   brew link postgresql@16
  1. Add PostgreSQL to your zshrc or bash profile. For zshrc, use the following command:
  echo 'export PATH="/opt/homebrew/opt/postgresql@16/bin:$PATH"' >> ~/.zshrc
  1. Verify the versions of pg_dump and pg_restore:
  pg_dump --version
  pg_restore --version

If the command doesn’t work immediately, you may need to restart your terminal.

Accessing the docker database

  docker exec -it postgres-db psql -U username -d dictionary_db

From within, you can execute set search_path to dict; to search within the dict schema (similar to running use dict; in mysql databases).

FHIR Controller

The FHIR Controller provides endpoints to interact with a FHIR (Fast Healthcare Interoperability Resources) API for fetching and processing research study metadata.

Configuration:

Add the following properties to your application configuration file (e.g., application-dev.properties):

fhir.api.base.url=https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1
fhir.api.bulk.endpoint=/ResearchStudy

Endpoints:

  1. GET /api/fhir/research-studies/

    • Fetches all research studies from the FHIR API.
    • Returns a list of ResearchStudy objects containing study metadata.
    • Uses pagination (500 results per page) to handle large datasets.
    curl -X GET http://localhost:8080/api/fhir/research-studies/
  2. GET /api/fhir/research-studies/find-dbgap

    • Extracts distinct phs (dbGaP study) values from all research studies.
    • Returns a list of unique study identifiers.
    curl -X GET http://localhost:8080/api/fhir/research-studies/find-dbgap
  3. PATCH /api/fhir/load/metadata/refresh

    • Refreshes dataset metadata by fetching studies from the FHIR API.
    • Updates dataset records based on study identifiers and metadata.
    • Returns a summary of datasets updated and metadata records processed.
    curl -X PATCH http://localhost:8080/api/fhir/load/metadata/refresh
  4. PATCH /api/fhir/load/metadata/mapping

    • Maps FHIR metadata keys to dataset metadata fields.
    • Reads a JSON mapping configuration to associate metadata attributes with datasets.
    • Requires fhir.api.url.to.key.map.json configuration property.
    curl -X PATCH http://localhost:8080/api/fhir/load/metadata/mapping

Implementation notes:

  • The service uses pagination with a page size of 500 results to stay within the default 10MB WebClient buffer limit.
  • Each paginated request fetches approximately 9.7MB of data, safely below the buffer threshold.
  • The Meta model supports versionId, lastUpdated, and source fields returned by the FHIR API.

Facet Loader

This project includes a Facet Loader that ingests a JSON payload describing Facet Categories, their Facets (including nested children), and mapping rules to automatically associate facets with concept nodes based on a concept path.

What it does:

  • Upserts facet categories and facets by name (idempotent). If a name already exists it is updated; otherwise it is created.
  • Builds a facet hierarchy using parent_id based on the JSON nesting of facets.
  • Evaluates expressions to map facets to concept nodes (rows in dict.concept_node) and writes relationships into dict.facet__concept_node.

API endpoint:

  • POST /api/facet/loader/load
  • Request body is a JSON array of objects each containing a Facet_Category.
  • Returns a JSON object with counts: categoriesCreated, categoriesUpdated, facetsCreated, facetsUpdated.

Example request body (minimal):

[
  {
    "Facet_Category": {
      "Name": "Consortium_Curated_Facets",
      "Display": "Consortium Curated Facets",
      "Description": "Consortium Curated Facets Description",
      "Metadata": [
        {
          "Key": "order",
          "Value": "1"
        }
      ],
      "Facets": [
        {
          "Name": "Recover Adult",
          "Display": "RECOVER Adult",
          "Description": "Recover adult parent facet.",
          "Facets": [
            {
              "Name": "Infected",
              "Display": "Infected",
              "Description": "Infected Facet Description"
            }
          ]
        }
      ]
    }
  }
]

Example with expressions (maps facets to concept nodes by concept_path):

  • Concept path sample: \phs003436\Recover_Adult\visits\inf\12\pasc_cc_2024\
  • Nodes (by index):
    • 0: phs003436
    • 1: Recover_Adult
    • 2: biostats_derived
    • 3: visits
    • 4: inf
    • 5: 12
    • 6: pasc_cc_2024
  • Negative indices are allowed (e.g., -1 is last node, -2 second to last, etc.).

Example payload fragment using OR expression groups (OR-of-ANDs):

[
  {
    "Facet_Category": {
      "Name": "Consortium_Curated_Facets",
      "Facets": [
        {
          "Name": "Recover Adult Infected or PASC",
          "Expression_Groups": [
            [
              { "regex": "(?i)\\binf(ected)?\\b", "node": -3 }
            ],
            [
              { "contains": "pasc", "node": -1 }
            ]
          ]
        },
        {
          "Name": "Same using camelCase alias",
          "expressionGroups": [
            [ { "regex": "(?i)\\binf(ected)?\\b", "node": -3 } ],
            [ { "contains": "pasc", "node": -1 } ]
          ]
        }
      ]
    }
  }
]

Expression evaluation rules

  • Supported keys per expression entry: exactly, contains, regex (use one or more).
  • Two ways to define rules:
    • Expression_Groups (new): an array of groups; OR across groups, AND within each group.
  • node is optional for any entry:
    • If node is provided, we evaluate only that node (zero-based; negatives allowed, e.g., -1 is last node).
    • If node is omitted, the expression scans all nodes in the concept_path and evaluates true if any node matches.
  • Evaluation semantics:
    • If Expression_Groups is present and non-empty: the facet matches a concept if ANY group matches; each group requires ALL of its entries to match (OR-of-ANDs).
    • Within a single entry, all provided keys (exactly/contains/regex) must match the same node value.
  • Regex uses Java syntax; inline flags like (?i) are supported. Exactly/contains are literal string matches.
  • Out-of-bounds node indices or invalid regex cause that entry to evaluate to false.
  • Inheritance to children:
    • Child facets inherit their parents' effective groups. Effective groups are formed as a cross-product: each parent group is AND-combined with each child group.
    • If a parent has groups and a child defines no groups, the child inherits the parent groups as-is. If neither parent nor child has groups, no automatic mapping occurs.
  • If a facet contains no groups at any level (no Expression_Groups on it or any ancestor), it is not automatically mapped to any concept nodes (it is still created in the hierarchy).

Alias handling in payload:

  • Facet fields support camelCase aliases: Name/name, Display/display, Description/description.
  • For grouped rules, Expression_Groups is preferred; expressionGroups and ExpressionGroups are also accepted as aliases.
  • Each expression entry supports keys: exactly, contains, regex, and optional node.

Pre-requisites for mapping:

  • dict.concept_node should already be populated (e.g., via the Dictionary Loader) so that concept_path values exist to evaluate.

cURL examples:

Save payload to payload.json and run:

  curl -X POST \
      -H "Content-Type: application/json" \
      --data-binary @payload.json \
      http://localhost:8080/api/facet/loader/load

Sample response:

{
  "categoriesCreated": 1,
  "categoriesUpdated": 0,
  "facetsCreated": 5,
  "facetsUpdated": 0
}

Idempotency and updates

  • Re-posting the same payload will update existing categories/facets and keep the hierarchy; counts will reflect updates rather than creations.
  • Facet-to-concept mappings avoid duplicates using ON CONFLICT semantics in the repository/service layer.

Running tests

Tests you can run (Docker required for Testcontainers):

  • mvn -Dtest=edu.harvard.dbmi.avillach.dictionaryetl.facet.FacetExpressionEvaluatorTest test
  • mvn -Dtest=edu.harvard.dbmi.avillach.dictionaryetl.facet.FacetLoaderClearIntegrationTest test
  • mvn -Dtest=edu.harvard.dbmi.avillach.dictionaryetl.facet.FacetLoaderControllerTest test
  • mvn -Dtest=edu.harvard.dbmi.avillach.dictionaryetl.facet.FacetLoaderMappingIntegrationTest test
  • mvn -Dtest=edu.harvard.dbmi.avillach.dictionaryetl.facet.FacetLoaderServiceTest test

Or run all tests:

  • mvn test

Troubleshooting

  • If no concept nodes are mapped after posting, verify that concept_node rows exist and your node positions/regex align with your concept_path structure.
  • Ensure the app is running on the expected port and CORS origin (Facet Loader controller is configured with CrossOrigin for http://localhost:8081 by default).

Clear facets and categories

Use this endpoint to remove facet categories and/or specific facets (including all their descendants) by name. The service will also remove facet-to-concept mappings first to preserve referential integrity.

  • Endpoint: POST /api/facet/loader/clear
  • Request body (JSON):
{
  "Facet_Categories": ["CategoryName1", "CategoryName2"],
  "Facets": ["FacetNameA", "FacetNameB"]
}
  • Both properties are optional. Provide one or both:
    • facetCategories: Deletes each listed category by name, all facets within, and all mappings. Returns counts.
    • facets: Deletes each listed facet by name and all of its descendants in the hierarchy, plus all related mappings. Other categories/facets remain intact.

Response body (JSON):

{
  "categoriesDeleted": 1,
  "facetsDeleted": 7,
  "mappingsDeleted": 42
}

Behavior notes:

  • Deletions are name-based and case-sensitive to match stored names.
  • For categories, the service deletes mappings, then facets, then the category record and any associated category metadata.
  • For facets, the service deletes mappings, then deletes the facet and all descendants in a breadth-first traversal.
  • Requests are idempotent: nonexistent names are ignored (counted as zero deletions).

cURL examples:

Clear by category:

  curl -X POST \
      -H "Content-Type: application/json" \
      --data-binary '{
        "Facet_Categories": ["Consortium_Curated_Facets"]
      }' \
      http://localhost:8080/api/facet/loader/clear

Clear by facet names:

  curl -X POST \
      -H "Content-Type: application/json" \
      --data-binary '{
        "facets": ["Recover Adult", "Infected"]
      }' \
      http://localhost:8080/api/facet/loader/clear

Testing requests with FacetLoader.http

A FacetLoader.http file is included at the repository root with ready-to-run sample requests compatible with the IntelliJ HTTP Client.

How to use:

  • Open FacetLoader.http in IntelliJ IDEA or any IDE that supports the HTTP Client.
  • Adjust the @host variable at the top if your application runs on a different port (e.g., 8086 in Docker).
  • Click the gutter icons to execute requests for loading facets and clearing by category/facet names.

Running dev application locally on command line

  mvn spring-boot:run -Dspring-boot.run.arguments="--spring.profiles.active=dev"

You can reduce logging level by setting these properties in the applications-dev.properties file.

spring.jpa.show-sql=false
logging.level.org.hibernate.type=INFO
logging.level.org.hibernate.SQL=INFO
logging.level.org.hibernate.type.descriptor.sql=INFO

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages