feature/documentation #3

Open
wants to merge 19 commits into
base: main

@bwalsh bwalsh commented Apr 14, 2025

This PR:

  • creates a "documentation first" approach
  • deprecates g3t collaborator/projects in favor of git-sync (similar to synapse-sync used for bridge2ai)
  • addresses "remote bucket" use cases not covered by git-lfs
  • addresses "metadata" use cases not covered by git-lfs
  • includes epics and sprints

@bwalsh bwalsh requested a review from Copilot April 14, 2025 18:25

@Copilot Copilot AI left a comment

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

docs/README.md:9

  • [nitpick] Consider adding a space after the numeral and period (e.g., '0. [README-comparison]') for more consistent and readable list formatting.
0.[README-comparison](README-comparison.md)

docs/README-comparison.md:12

  • The citation marker 'citeturn0search0' appears to be an unintended placeholder; consider removing or replacing it with a proper citation if needed.
- **Pointer-Based Storage:** Replaces large files in a Git repo with lightweight text pointers, while storing the actual file contents on a remote server. citeturn0search0

lbeckman314 commented Apr 14, 2025

Note to self: @lbeckman314 🔮

Review this PR

initial draft

Co-authored-by: Copilot <[email protected]>
@bwalsh bwalsh force-pushed the feature/documentation branch from 568c2a2 to 31723c0 Compare April 16, 2025 17:09

bwalsh commented Apr 28, 2025

@kellrott Does this capture intent?


Comparison of git-gen3 vs Git LFS

| Feature | git-gen3 | Git LFS |
|---------|----------|---------|
| Purpose | Manage external document references in research projects (esp. Gen3/DRS/genomics data) | Manage large binary files directly attached to git repositories |
| Tracking Method | Metadata about files (e.g., path, etag, MD5, SHA256, multiple remote locations) | LFS pointer files (`.gitattributes`, `.git/lfs/objects`) point to large file storage |
| Download on Clone | No automatic download; metadata only on clone. An explicit `git drs pull` is needed to retrieve files. | Automatically downloads necessary objects when needed, or lazily during checkout |
| State Management | Tracks file states: Remote (R), Local (L), Modified (M), Untracked (U), Git-tracked (G) | Files either exist in the repo checkout or not; no explicit remote vs. local state tracking |
| Adding Files | Add files to the metadata index (`git drs add`); choose between upload, symlink, external S3, or DRS refs | `git lfs track` the files, then `git add` and `git push` to send objects to the LFS server (Gen3 backend via client-side transport customization) |
| Remote Options | Supports multiple remote backends: Gen3 DRS, S3, local filesystems, others | Client-side transport customization required to redirect to alternate backends |
| Push Behavior | Push uploads only modified files; unchanged references remain metadata-only | Push uploads any committed LFS objects |
| Symlink Support | Native symlink references supported (`git drs add -l`) | No native symlink tracking; must be handled manually |
| Flexibility with External Sources | Easy to reference existing DRS URIs, S3 paths, shared file paths | Requires either (a) large objects to be added locally or (b) separate handling for existing references plus transport customization |
| Intended Usage Domain | Scientific data, genomics workflows, distributed datasets | General-purpose large file versioning (source code, game assets, media files, etc.) |
| Integration with Git Tools | Acts as a git plugin (`git drs`), not a transparent layer | Fully integrated into Git plumbing; transparent after setup |
| Maturity & Ecosystem | Early stage, focused on Calypr and Gen3 integrations | Mature, standardized, wide tooling ecosystem |
| Integration with clinical metadata | Requires integration | Requires integration |

bwalsh commented Apr 28, 2025

README-associating-biomedical-entities.md

📄 Associating Files with Biomedical Entities

Overview

In genomics, imaging, and clinical research, it is essential to associate data files with key biomedical entities such as:

  • Patient — the individual from whom data was collected.
  • Specimen — the biological sample (e.g., blood, tissue).
  • Assay — the experimental or clinical test performed (modeled as a ServiceRequest in FHIR).

git-gen3 supports direct tagging of files with these identifiers, enabling rich traceability and enhancing downstream data integration.


How to Associate Files

When adding a file using git drs add, you can attach biomedical entity identifiers:

git drs add path/to/file.vcf \
    --patient-id "Patient-12345" \
    --specimen-id "Specimen-67890" \
    --assay-id "ServiceRequest-ABCDE"

TODO - detail exact specification of .drs/ content

This command records the tags alongside the file metadata.

Example Metadata Entry

{
  "path": "path/to/file.vcf",
  "etag": "a7c1c0...",
  "size": 4534678,
  "remote": "s3://bucket/path/to/file.vcf",
  "patient_id": "Patient-12345",
  "specimen_id": "Specimen-67890",
  "assay_id": "ServiceRequest-ABCDE"
}

Tagging Fields

| Field | Description | Example |
|-------|-------------|---------|
| patient_id | Unique identifier for the Patient | Patient-12345 |
| specimen_id | Unique identifier for the Specimen | Specimen-67890 |
| assay_id | Unique identifier for the Assay (ServiceRequest) | ServiceRequest-ABCDE |

All fields are optional but highly recommended for structured datasets.
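
As a rough illustration of the entry format, the sketch below assembles a metadata record with optional tags. It assumes only the field names shown in the example entry; `build_entry` and `TAG_FIELDS` are hypothetical names, not part of git-gen3, and the real `.drs/` layout is still marked TODO above.

```python
import json
from typing import Optional

# Tag fields from the table above; all of them are optional.
TAG_FIELDS = ("patient_id", "specimen_id", "assay_id")


def build_entry(path: str, etag: str, size: int, remote: str,
                **tags: Optional[str]) -> dict:
    """Hypothetical helper: combine file metadata with optional tags."""
    unknown = set(tags) - set(TAG_FIELDS)
    if unknown:
        raise ValueError(f"unknown tag fields: {sorted(unknown)}")
    entry = {"path": path, "etag": etag, "size": size, "remote": remote}
    # Record only the tags that were actually supplied.
    for field in TAG_FIELDS:
        if tags.get(field):
            entry[field] = tags[field]
    return entry


entry = build_entry(
    "path/to/file.vcf", "a7c1c0...", 4534678, "s3://bucket/path/to/file.vcf",
    patient_id="Patient-12345", assay_id="ServiceRequest-ABCDE",
)
print(json.dumps(entry, indent=2))
```

Omitted tags are simply left out of the record rather than stored as empty strings, which keeps "not tagged" distinguishable from "tagged with a blank value".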


Bulk Association

To associate large numbers of files efficiently, see:
👉 [Bulk Tagging Files with Biomedical Identifiers](./README-bulk-association.md)


Best Practices

  • Use anonymized identifiers — avoid PHI/PII (e.g., no names, DOBs).
  • Consistent formats — align Patient, Specimen, and Assay IDs with external databases if applicable.
  • Version control — if an entity's data changes over time, use versioned identifiers or snapshots.
  • Metadata auditing — periodically validate the presence and consistency of biomedical tags.
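
A periodic audit along these lines could be sketched as follows. The ID patterns are an assumption inferred from the example identifiers in this document (`Patient-…`, `Specimen-…`, `ServiceRequest-…`); a real project would substitute its own conventions, and `audit_entry` is a hypothetical helper.

```python
import re

# Assumed ID conventions, taken only from the examples in this document.
ID_PATTERNS = {
    "patient_id": re.compile(r"^Patient-[A-Za-z0-9]+$"),
    "specimen_id": re.compile(r"^Specimen-[A-Za-z0-9]+$"),
    "assay_id": re.compile(r"^ServiceRequest-[A-Za-z0-9]+$"),
}


def audit_entry(entry: dict) -> list[str]:
    """Return a list of human-readable problems for one metadata entry."""
    problems = []
    for field, pattern in ID_PATTERNS.items():
        value = entry.get(field)
        if not value:
            problems.append(f"{entry['path']}: missing {field}")
        elif not pattern.match(value):
            problems.append(
                f"{entry['path']}: unexpected {field} format {value!r}")
    return problems


entries = [
    {"path": "a.vcf", "patient_id": "Patient-12345",
     "specimen_id": "Specimen-67890", "assay_id": "ServiceRequest-ABCDE"},
    # A free-text name in patient_id is flagged as a format problem.
    {"path": "b.vcf", "patient_id": "Jane Doe", "assay_id": "ServiceRequest-1"},
]
report = [problem for e in entries for problem in audit_entry(e)]
for line in report:
    print(line)
```

A format check like this catches accidental free-text values, but it cannot prove that an identifier is free of PHI; that still needs review against the project's de-identification policy.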

Future Features

  • Search by Tags: Filter files by patient, specimen, or assay with upcoming git drs ls --filter commands.
  • Validation Hooks: Pre-push validation to ensure biomedical identifiers are populated correctly.

✅ Summary

Tagging files with biomedical entity identifiers using git-gen3:

  • Improves dataset traceability.
  • Enables clinical-grade data management.
  • Facilitates FAIR principles (Findable, Accessible, Interoperable, Reusable).

bwalsh commented Apr 28, 2025

README-bulk-association.md

📄 Bulk Tagging Files with Biomedical Identifiers

Overview

In large research projects, it is often necessary to associate hundreds or thousands of files with biomedical identifiers such as Patient, Specimen, or Assay (ServiceRequest) IDs.

To streamline this process, git-gen3 supports bulk tagging using a simple manifest file.

This document describes how to prepare, import, and manage bulk file associations.

Preparing a Bulk Manifest

The manifest must be a CSV file with the following columns:

| Column Name | Description |
|-------------|-------------|
| file_path | Relative path to the file in the repository |
| patient_id | (Optional) Patient identifier |
| specimen_id | (Optional) Specimen identifier |
| assay_id | (Optional) Assay (ServiceRequest) identifier |

Example Manifest

file_path,patient_id,specimen_id,assay_id
path/to/data1.vcf,Patient-12345,Specimen-67890,ServiceRequest-ABCDE
path/to/data2.vcf,Patient-54321,Specimen-09876,ServiceRequest-EDCBA
path/to/image1.dcm,Patient-11223,,ServiceRequest-ZYXWV
path/to/notes.txt,,,ServiceRequest-98765
  • Only file_path is required.
  • Leave a field blank if no identifier applies.
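
To make the manifest rules concrete, here is a minimal parsing sketch using Python's standard `csv` module. It assumes the four column names listed above; `read_manifest` is a hypothetical helper and does not reflect how `git drs import-manifest` is actually implemented.

```python
import csv
import io

MANIFEST = """\
file_path,patient_id,specimen_id,assay_id
path/to/data1.vcf,Patient-12345,Specimen-67890,ServiceRequest-ABCDE
path/to/image1.dcm,Patient-11223,,ServiceRequest-ZYXWV
path/to/notes.txt,,,ServiceRequest-98765
"""

TAG_COLUMNS = ("patient_id", "specimen_id", "assay_id")


def read_manifest(handle) -> list[dict]:
    """Parse manifest rows, dropping identifiers that were left blank."""
    rows = []
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        file_path = (row.get("file_path") or "").strip()
        if not file_path:
            raise ValueError(f"line {line_no}: file_path is required")
        tags = {c: row[c].strip() for c in TAG_COLUMNS
                if (row.get(c) or "").strip()}
        rows.append({"file_path": file_path, **tags})
    return rows


rows = read_manifest(io.StringIO(MANIFEST))
for row in rows:
    print(row)
```

Blank cells are dropped during parsing, so later steps can treat "key absent" as "no identifier supplied for this file".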

Importing the Manifest

Use the git drs import-manifest command:

git drs import-manifest path/to/manifest.csv

This will:

  • Look up each file_path in drs_metadata.json.
  • Attach or update the patient_id, specimen_id, and assay_id fields.
  • Skip entries for non-existent files with a warning.

Notes

  • Files must already exist in the repository or have been tracked via git drs add.
  • No overwriting of other metadata (e.g., etag, remote link) occurs.
  • Existing biomedical identifiers will be overwritten if the manifest provides new values.
  • Identifiers should be anonymous and non-identifiable to comply with data privacy guidelines.
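
The import rules above can be sketched as a small update function. This assumes, purely for illustration, that the metadata is a mapping from file path to an entry dict; the actual structure of `drs_metadata.json` is not specified here, and `apply_manifest` is a hypothetical name.

```python
import warnings

TAG_FIELDS = ("patient_id", "specimen_id", "assay_id")


def apply_manifest(metadata: dict, rows: list[dict]) -> int:
    """Update tag fields in place; return the number of entries updated.

    Assumes `metadata` maps file_path -> entry dict (hypothetical layout).
    """
    updated = 0
    for row in rows:
        entry = metadata.get(row["file_path"])
        if entry is None:
            # Non-existent files are skipped with a warning.
            warnings.warn(f"skipping untracked file: {row['file_path']}")
            continue
        # Touch only the tag fields; etag, remote, etc. stay unchanged.
        for field in TAG_FIELDS:
            if row.get(field):
                entry[field] = row[field]
        updated += 1
    return updated


metadata = {
    "path/to/data1.vcf": {"etag": "a7c1c0...", "patient_id": "Patient-OLD"},
}
rows = [
    {"file_path": "path/to/data1.vcf", "patient_id": "Patient-12345"},
    {"file_path": "path/to/missing.vcf", "patient_id": "Patient-54321"},
]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    count = apply_manifest(metadata, rows)
print(count, len(caught), metadata["path/to/data1.vcf"])
```

In this sketch a blank manifest cell leaves any existing identifier untouched, since only values the manifest actually provides are written.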

Example Workflow

# 1. Track your files as usual
git drs add path/to/data1.vcf
git drs add path/to/data2.vcf

# 2. Prepare a manifest.csv linking files to biomedical IDs

# 3. Import the manifest
git drs import-manifest manifest.csv

# 4. (Optional) Verify updates
git drs ls --patient-id Patient-12345

Future Enhancements (Planned)

  • Validation of patient/specimen/assay ID formats.
  • Support for JSON manifest input (for non-tabular workflows).
  • Manifest-driven selective git drs pull (download files matching a manifest).

| Function | Description |
|-----------------------------|-------------|
| `fetch_github_teams()` | Get org teams, members, and slugs |
| `map_to_gen3_roles()` | Transform GitHub teams → Gen3 roles |

@matthewpeterkort matthewpeterkort Apr 29, 2025

Not sure what you mean by this. Are you implying that all data repos in github would be under one organization and permissions per user would be done in that way in github and then synched to gen3 equivalent terms ?

Author

@matthewpeterkort

Given the typical GitHub repository roles:

  • Read: Recommended for non-code contributors who want to view or discuss your project
  • Triage: Recommended for contributors who need to proactively manage issues, discussions, and pull requests without write access
  • Write: Recommended for contributors who actively push to your project
  • Maintain: Recommended for project managers who need to manage the repository without access to sensitive or destructive actions
  • Admin: Recommended for people who need full access to the project, including sensitive and destructive actions like managing security or deleting a repository

Mapping:

  • Read, Triage, Maintain: Mapped to gen3 read-only access
  • Admin, Write: Mapped to gen3 submitter, sower access
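
The proposed mapping could be expressed as a simple lookup table. This is a sketch only: the Gen3 access names are placeholders copied from the bullets above, and `gen3_access` is a hypothetical function name.

```python
# GitHub repository role -> Gen3 access, per the mapping proposed above.
# The Gen3 access names are placeholders taken from the bullets.
GITHUB_TO_GEN3 = {
    "read": ["read-only"],
    "triage": ["read-only"],
    "maintain": ["read-only"],
    "write": ["submitter", "sower"],
    "admin": ["submitter", "sower"],
}


def gen3_access(github_role: str) -> list[str]:
    """Translate one GitHub role name into a list of Gen3 access names."""
    try:
        return GITHUB_TO_GEN3[github_role.lower()]
    except KeyError:
        raise ValueError(f"unknown GitHub role: {github_role!r}") from None


members = {"alice": "Admin", "bob": "Triage"}
access = {user: gen3_access(role) for user, role in members.items()}
print(access)
```

Rejecting unknown roles outright, rather than defaulting them to read-only, makes it harder for a newly introduced GitHub role to grant access silently.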


```yaml
projects:
  project-xyz:
```

@matthewpeterkort matthewpeterkort Apr 29, 2025

this makes it seem like each gen3 project is a github organization

Author

Exact mapping is TBD, but point taken: does github.organization map to gen3.program 🤔

| | |
+------------+---------------+------------------------------+
|
v

Looking at this without much background knowledge, it is unclear to me what the RoleSourceAdapter interface aims to do.

Author

At a high level, GitLab, Bitbucket, and Git Enterprise have different APIs. The RoleSourceAdapter would adapt each of them to a standard interface.
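
One way to picture such an adapter is sketched below. Everything except the name `RoleSourceAdapter` is an assumption made for illustration: the `Membership` record, the `fetch_memberships` method, and the in-memory `StaticAdapter` stand-in are hypothetical, and no real host API is called.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Membership:
    """Provider-neutral record: who has which role on which repository."""
    user: str
    repository: str
    role: str  # e.g. "read", "write", "admin"


class RoleSourceAdapter(ABC):
    """Hypothetical interface: each Git host supplies one implementation."""

    @abstractmethod
    def fetch_memberships(self) -> list[Membership]:
        """Return memberships in the provider-neutral form."""


class StaticAdapter(RoleSourceAdapter):
    """Stand-in adapter backed by an in-memory list (no network calls)."""

    def __init__(self, raw: list[dict]):
        self._raw = raw

    def fetch_memberships(self) -> list[Membership]:
        # A real adapter would call the host's API here and translate
        # its response into Membership records.
        return [Membership(r["login"], r["repo"], r["permission"].lower())
                for r in self._raw]


def sync(adapter: RoleSourceAdapter) -> list[Membership]:
    """Downstream code depends only on the interface, not on the host."""
    return adapter.fetch_memberships()


adapter = StaticAdapter(
    [{"login": "alice", "repo": "project-xyz", "permission": "Admin"}])
memberships = sync(adapter)
print(memberships)
```

The point of the pattern is that the sync logic never sees host-specific field names; supporting another host would mean adding one more adapter class rather than changing `sync`.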


### 📦 Track Remote File
```bash
lfs-meta track-remote s3://my-bucket/data/foo.vcf \
```

@matthewpeterkort matthewpeterkort Apr 29, 2025

I would look to simplify this further if possible. I understand that path is a subset of the full bucket path s3://my-bucket/data/foo.vcf but why is it needed?


### 🧬 Generate FHIR Metadata
```bash
lfs-meta init-meta \
```

@matthewpeterkort matthewpeterkort Apr 29, 2025

This looks like a cool pattern. I'm guessing it would keep our existing pattern of pushing metadata from the META directory backwards compatible?

Author

Yes. At the end of the day, I don't anticipate many (any?) changes to publish.

exit 0
fi

lfs-meta validate --file .lfs-meta/metadata.json || {

I missed it: what is the validate command doing here?

Author

Same as g3t meta validate: is the current metadata complete?

@@ -0,0 +1,301 @@
# Overview `git-sync`

@matthewpeterkort matthewpeterkort Apr 29, 2025

I'm guessing this is another microservice running server-side in a pod. I understand there would be some sort of sync operation that takes GitHub-style inputs and reflects them in Gen3. I'd be curious to see a simplified OpenAPI spec showing exactly what this microservice would look like.

5 participants