GitHub - hms-dbmi/picsure-dictionary-etl

Code Formatting
Uploading a TSV file for /anvil/upload-tsv data ingest
Local Database for Development
- Installing pg_restore and pg_dump on Mac
- Accessing the docker database
FHIR Controller
Facet Loader
Running dev application locally on command line

Code Formatting

Before contributing code, please set up our git hook:

  cp code-formatting/pre-commit.sh .git/hooks/pre-commit
  chmod +x .git/hooks/pre-commit

To skip formatting on a block of code, wrap in spotless:off, spotless:on comments

Uploading a TSV file for `/anvil/upload-tsv` data ingest

  curl -X POST \
      -H "Content-Type: text/plain" \
      --data-binary @path/to/your/file.tsv \
      http://localhost:8080/anvil/upload-tsv

Local Database for Development

To set up a local development database, follow these steps:

Run the docker-compose.postgres.yml file to start the required services:

   docker-compose -f docker-compose.postgres.yml up --build -d

Execute the db/schema.sql script to create the dict schema and all necessary database tables.
(Optional) If you have a dump file generated with pg_dump, you can restore it using pg_restore. If the backup is from our BDC database, the following command will work:

   pg_restore \
      --host="localhost" \
      --port="5432" \
      --username="username" \
      --dbname="dictionary_db" \
      --jobs=10 \
      --verbose \
      --no-owner \
      ./dictionary_db

You will be prompted to enter the database password after executing this command.

Now you can run the Spring Boot application with dev as the active profile.

Installing `pg_restore` and `pg_dump` on Mac

If you do not have pg_restore and pg_dump, you can install them by following these steps:

Install postgresql@16, which includes both pg_restore and pg_dump:

  brew install postgresql@16

Link the installed PostgreSQL version:

   brew link postgresql@16

Add PostgreSQL to your zshrc or bash profile. For zshrc, use the following command:

  echo 'export PATH="/opt/homebrew/opt/postgresql@16/bin:$PATH"' >> ~/.zshrc

Verify the versions of pg_dump and pg_restore:

  pg_dump --version
  pg_restore --version

If the command doesn’t work immediately, you may need to restart your terminal.

Accessing the docker database

  docker exec -it postgres-db psql -U username -d dictionary_db

From within, you can execute set search_path to dict; to search within the dict schema (similar to running use dict; in mysql databases).

FHIR Controller

The FHIR Controller provides endpoints to interact with a FHIR (Fast Healthcare Interoperability Resources) API for fetching and processing research study metadata.

Configuration:

Add the following properties to your application configuration file (e.g., application-dev.properties):

fhir.api.base.url=https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1
fhir.api.bulk.endpoint=/ResearchStudy

Endpoints:

GET /api/fhir/research-studies/
- Fetches all research studies from the FHIR API.
- Returns a list of ResearchStudy objects containing study metadata.
- Uses pagination (500 results per page) to handle large datasets.
```
curl -X GET http://localhost:8080/api/fhir/research-studies/
```
GET /api/fhir/research-studies/find-dbgap
- Extracts distinct phs (dbGaP study) values from all research studies.
- Returns a list of unique study identifiers.
```
curl -X GET http://localhost:8080/api/fhir/research-studies/find-dbgap
```
PATCH /api/fhir/load/metadata/refresh
- Refreshes dataset metadata by fetching studies from the FHIR API.
- Updates dataset records based on study identifiers and metadata.
- Returns a summary of datasets updated and metadata records processed.
```
curl -X PATCH http://localhost:8080/api/fhir/load/metadata/refresh
```
PATCH /api/fhir/load/metadata/mapping
- Maps FHIR metadata keys to dataset metadata fields.
- Reads a JSON mapping configuration to associate metadata attributes with datasets.
- Requires fhir.api.url.to.key.map.json configuration property.
```
curl -X PATCH http://localhost:8080/api/fhir/load/metadata/mapping
```

Implementation notes:

The service uses pagination with a page size of 500 results to stay within the default 10MB WebClient buffer limit.
Each paginated request fetches approximately 9.7MB of data, safely below the buffer threshold.
The Meta model supports versionId, lastUpdated, and source fields returned by the FHIR API.

Facet Loader

This project includes a Facet Loader that ingests a JSON payload describing Facet Categories, their Facets (including nested children), and mapping rules to automatically associate facets with concept nodes based on a concept path.

What it does:

Upserts facet categories and facets by name (idempotent). If a name already exists it is updated; otherwise it is created.
Builds a facet hierarchy using parent_id based on the JSON nesting of facets.
Evaluates expressions to map facets to concept nodes (rows in dict.concept_node) and writes relationships into dict.facet__concept_node.

API endpoint:

POST /api/facet/loader/load
Request body is a JSON array of objects each containing a Facet_Category.
Returns a JSON object with counts: categoriesCreated, categoriesUpdated, facetsCreated, facetsUpdated.

Example request body (minimal):

[
  {
    "Facet_Category": {
      "Name": "Consortium_Curated_Facets",
      "Display": "Consortium Curated Facets",
      "Description": "Consortium Curated Facets Description",
      "Metadata": [
        {
          "Key": "order",
          "Value": "1"
        }
      ],
      "Facets": [
        {
          "Name": "Recover Adult",
          "Display": "RECOVER Adult",
          "Description": "Recover adult parent facet.",
          "Facets": [
            {
              "Name": "Infected",
              "Display": "Infected",
              "Description": "Infected Facet Description"
            }
          ]
        }
      ]
    }
  }
]

Example with expressions (maps facets to concept nodes by concept_path):

Concept path sample: \phs003436\Recover_Adult\visits\inf\12\pasc_cc_2024\
Nodes (by index):
- 0: phs003436
- 1: Recover_Adult
- 2: biostats_derived
- 3: visits
- 4: inf
- 5: 12
- 6: pasc_cc_2024
Negative indices are allowed (e.g., -1 is last node, -2 second to last, etc.).

Example payload fragment using OR expression groups (OR-of-ANDs):

[
  {
    "Facet_Category": {
      "Name": "Consortium_Curated_Facets",
      "Facets": [
        {
          "Name": "Recover Adult Infected or PASC",
          "Expression_Groups": [
            [
              { "regex": "(?i)\\binf(ected)?\\b", "node": -3 }
            ],
            [
              { "contains": "pasc", "node": -1 }
            ]
          ]
        },
        {
          "Name": "Same using camelCase alias",
          "expressionGroups": [
            [ { "regex": "(?i)\\binf(ected)?\\b", "node": -3 } ],
            [ { "contains": "pasc", "node": -1 } ]
          ]
        }
      ]
    }
  }
]

Expression evaluation rules

Supported keys per expression entry: exactly, contains, regex (use one or more).
Two ways to define rules:
- Expression_Groups (new): an array of groups; OR across groups, AND within each group.
node is optional for any entry:
- If node is provided, we evaluate only that node (zero-based; negatives allowed, e.g., -1 is last node).
- If node is omitted, the expression scans all nodes in the concept_path and evaluates true if any node matches.
Evaluation semantics:
- If Expression_Groups is present and non-empty: the facet matches a concept if ANY group matches; each group requires ALL of its entries to match (OR-of-ANDs).
- Within a single entry, all provided keys (exactly/contains/regex) must match the same node value.
Regex uses Java syntax; inline flags like (?i) are supported. Exactly/contains are literal string matches.
Out-of-bounds node indices or invalid regex cause that entry to evaluate to false.
Inheritance to children:
- Child facets inherit their parents' effective groups. Effective groups are formed as a cross-product: each parent group is AND-combined with each child group.
- If a parent has groups and a child defines no groups, the child inherits the parent groups as-is. If neither parent nor child has groups, no automatic mapping occurs.
If a facet contains no groups at any level (no Expression_Groups on it or any ancestor), it is not automatically mapped to any concept nodes (it is still created in the hierarchy).

Alias handling in payload:

Facet fields support camelCase aliases: Name/name, Display/display, Description/description.
For grouped rules, Expression_Groups is preferred; expressionGroups and ExpressionGroups are also accepted as aliases.
Each expression entry supports keys: exactly, contains, regex, and optional node.

Pre-requisites for mapping:

dict.concept_node should already be populated (e.g., via the Dictionary Loader) so that concept_path values exist to evaluate.

cURL examples:

Save payload to payload.json and run:

  curl -X POST \
      -H "Content-Type: application/json" \
      --data-binary @payload.json \
      http://localhost:8080/api/facet/loader/load

Sample response:

{
  "categoriesCreated": 1,
  "categoriesUpdated": 0,
  "facetsCreated": 5,
  "facetsUpdated": 0
}

Idempotency and updates

Re-posting the same payload will update existing categories/facets and keep the hierarchy; counts will reflect updates rather than creations.
Facet-to-concept mappings avoid duplicates using ON CONFLICT semantics in the repository/service layer.

Running tests

Tests you can run (Docker required for Testcontainers):

mvn -Dtest=edu.harvard.dbmi.avillach.dictionaryetl.facet.FacetExpressionEvaluatorTest test
mvn -Dtest=edu.harvard.dbmi.avillach.dictionaryetl.facet.FacetLoaderClearIntegrationTest test
mvn -Dtest=edu.harvard.dbmi.avillach.dictionaryetl.facet.FacetLoaderControllerTest test
mvn -Dtest=edu.harvard.dbmi.avillach.dictionaryetl.facet.FacetLoaderMappingIntegrationTest test
mvn -Dtest=edu.harvard.dbmi.avillach.dictionaryetl.facet.FacetLoaderServiceTest test

Or run all tests:

mvn test

Troubleshooting

If no concept nodes are mapped after posting, verify that concept_node rows exist and your node positions/regex align with your concept_path structure.
Ensure the app is running on the expected port and CORS origin (Facet Loader controller is configured with CrossOrigin for http://localhost:8081 by default).

Clear facets and categories

Use this endpoint to remove facet categories and/or specific facets (including all their descendants) by name. The service will also remove facet-to-concept mappings first to preserve referential integrity.

Endpoint: POST /api/facet/loader/clear
Request body (JSON):

{
  "Facet_Categories": ["CategoryName1", "CategoryName2"],
  "Facets": ["FacetNameA", "FacetNameB"]
}

Both properties are optional. Provide one or both:
- facetCategories: Deletes each listed category by name, all facets within, and all mappings. Returns counts.
- facets: Deletes each listed facet by name and all of its descendants in the hierarchy, plus all related mappings. Other categories/facets remain intact.

Response body (JSON):

{
  "categoriesDeleted": 1,
  "facetsDeleted": 7,
  "mappingsDeleted": 42
}

Behavior notes:

Deletions are name-based and case-sensitive to match stored names.
For categories, the service deletes mappings, then facets, then the category record and any associated category metadata.
For facets, the service deletes mappings, then deletes the facet and all descendants in a breadth-first traversal.
Requests are idempotent: nonexistent names are ignored (counted as zero deletions).

cURL examples:

Clear by category:

  curl -X POST \
      -H "Content-Type: application/json" \
      --data-binary '{
        "Facet_Categories": ["Consortium_Curated_Facets"]
      }' \
      http://localhost:8080/api/facet/loader/clear

Clear by facet names:

  curl -X POST \
      -H "Content-Type: application/json" \
      --data-binary '{
        "facets": ["Recover Adult", "Infected"]
      }' \
      http://localhost:8080/api/facet/loader/clear

Testing requests with FacetLoader.http

A FacetLoader.http file is included at the repository root with ready-to-run sample requests compatible with the IntelliJ HTTP Client.

How to use:

Open FacetLoader.http in IntelliJ IDEA or any IDE that supports the HTTP Client.
Adjust the @host variable at the top if your application runs on a different port (e.g., 8086 in Docker).
Click the gutter icons to execute requests for loading facets and clearing by category/facet names.

Running dev application locally on command line

  mvn spring-boot:run -Dspring-boot.run.arguments="--spring.profiles.active=dev"

You can reduce logging level by setting these properties in the applications-dev.properties file.

spring.jpa.show-sql=false
logging.level.org.hibernate.type=INFO
logging.level.org.hibernate.SQL=INFO
logging.level.org.hibernate.type.descriptor.sql=INFO

Name		Name	Last commit message	Last commit date
Latest commit History 120 Commits
.github/workflows		.github/workflows
code-formatting		code-formatting
src		src
wrapper		wrapper
.gitignore		.gitignore
Dockerfile		Dockerfile
FacetLoader.http		FacetLoader.http
LICENSE		LICENSE
README.md		README.md
docker-compose.postgres.yml		docker-compose.postgres.yml
docker-compose.yml		docker-compose.yml
mvnw		mvnw
mvnw.cmd		mvnw.cmd
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Code Formatting

Uploading a TSV file for `/anvil/upload-tsv` data ingest

Local Database for Development

Installing `pg_restore` and `pg_dump` on Mac

Accessing the docker database

FHIR Controller

Facet Loader

Expression evaluation rules

Idempotency and updates

Running tests

Troubleshooting

Clear facets and categories

Testing requests with FacetLoader.http

Running dev application locally on command line

About

Uh oh!

Releases 5

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Code Formatting

Uploading a TSV file for /anvil/upload-tsv data ingest

Local Database for Development

Installing pg_restore and pg_dump on Mac

Accessing the docker database

FHIR Controller

Facet Loader

Expression evaluation rules

Idempotency and updates

Running tests

Troubleshooting

Clear facets and categories

Testing requests with FacetLoader.http

Running dev application locally on command line

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Uploading a TSV file for `/anvil/upload-tsv` data ingest

Installing `pg_restore` and `pg_dump` on Mac

Packages