feature/documentation #3

bwalsh · 2025-04-14T18:25:40Z

This PR:

- creates a "documentation first" approach
- deprecates g3t collaborator/projects in favor of git-sync (similar to synapse-sync used for bridge2ai)
- addresses "remote bucket" use cases not covered by git-lfs
- addresses "metadata" use cases not covered by git-lfs
- includes epics and sprints

Copilot

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

docs/README.md:9

[nitpick] Consider adding a space after the numeral and period (e.g., '0. [README-comparison]') for more consistent and readable list formatting.

0.[README-comparison](README-comparison.md)

docs/README-comparison.md:12

The citation marker 'citeturn0search0' appears to be an unintended placeholder; consider removing or replacing it with a proper citation if needed.

- **Pointer-Based Storage:** Replaces large files in a Git repo with lightweight text pointers, while storing the actual file contents on a remote server. citeturn0search0

docs/README-comparison.md

lbeckman314 · 2025-04-14T19:20:36Z

Note to self: @lbeckman314 🔮

Review this PR

initial draft Co-authored-by: Copilot <[email protected]>

bwalsh · 2025-04-28T15:20:10Z

@kellrott Does this capture intent?

Comparison of git-gen3 vs Git LFS

| Feature | git-gen3 | Git LFS |
|---|---|---|
| Purpose | Manage external document references in research projects (esp. Gen3/DRS/Genomics data) | Manage large binary files directly attached to git repositories |
| Tracking Method | Metadata about files (e.g., path, etag, MD5, SHA256, multiple remote locations) | LFS pointer files (.gitattributes, .git/lfs/objects) point to large file storage |
| Download on Clone | No automatic download; metadata only on clone. Explicit git drs pull needed to retrieve files. | Automatically downloads necessary objects when needed, or lazily during checkout |
| State Management | Tracks file states: Remote (R), Local (L), Modified (M), Untracked (U), Git-tracked (G) | Files either exist in repo checkout or not; no explicit remote vs. local state tracking |
| Adding Files | Add files to metadata index (git drs add), choose between upload, symlink, external S3 or DRS refs. | git lfs track files, then git add, commit, and push to upload objects to the LFS server (Gen3 backend via client-side `transport customization`) |
| Remote Options | Supports multiple remote backends: Gen3 DRS, S3, local filesystems, others | Client-side `transport customization` required to redirect to alternate backends |
| Push Behavior | Push uploads only modified files; unchanged references remain metadata-only | Push uploads any committed LFS objects |
| Symlink Support | Native symlink references supported (git drs add -l) | No native symlink tracking; must be handled manually |
| Flexibility with External Sources | Easy to reference existing DRS URIs, S3 paths, shared file paths | Requires a) large objects to be added locally or b) separate handling for existing references, `transport customization` |
| Intended Usage Domain | Scientific data, genomics workflows, distributed datasets | General-purpose large file versioning (source code, game assets, media files, etc.) |
| Integration with Git Tools | Acts as a git plugin (git drs), not a transparent layer | Fully integrated into Git plumbing; transparent after setup |
| Maturity & Ecosystem | Early stage, focused on Calypr and Gen3 integrations | Mature, standardized, wide tooling ecosystem |
| Integration with clinical metadata | Requires integration | Requires integration |

bwalsh · 2025-04-28T15:50:48Z

README-associating-biomedical-entities.md

📄 Associating Files with Biomedical Entities

Overview

In genomics, imaging, and clinical research, it is essential to associate data files with key biomedical entities such as:

- Patient: the individual from whom data was collected.
- Specimen: the biological sample (e.g., blood, tissue).
- Assay: the experimental or clinical test performed (modeled as a ServiceRequest in FHIR).

git-gen3 supports direct tagging of files with these identifiers, enabling rich traceability and enhancing downstream data integration.

How to Associate Files

When adding a file using git drs add, you can attach biomedical entity identifiers:

git drs add path/to/file.vcf \
    --patient-id "Patient-12345" \
    --specimen-id "Specimen-67890" \
    --assay-id "ServiceRequest-ABCDE"

TODO - detail exact specification of .drs/ content

This command records the tags alongside the file metadata.

Example Metadata Entry

{
  "path": "path/to/file.vcf",
  "etag": "a7c1c0...",
  "size": 4534678,
  "remote": "s3://bucket/path/to/file.vcf",
  "patient_id": "Patient-12345",
  "specimen_id": "Specimen-67890",
  "assay_id": "ServiceRequest-ABCDE"
}

Tagging Fields

| Field | Description | Example |
|---|---|---|
| `patient_id` | Unique identifier for the Patient | `Patient-12345` |
| `specimen_id` | Unique identifier for the Specimen | `Specimen-67890` |
| `assay_id` | Unique identifier for the Assay (ServiceRequest) | `ServiceRequest-ABCDE` |

All fields are optional but highly recommended for structured datasets.

Bulk Association

To associate large numbers of files efficiently, see:
👉 [Bulk Tagging Files with Biomedical Identifiers](./README-bulk-association.md)

Best Practices

- Use anonymized identifiers: avoid PHI/PII (e.g., no names, DOBs).
- Consistent formats: align Patient, Specimen, and Assay IDs with external databases if applicable.
- Version control: if an entity's data changes over time, use versioned identifiers or snapshots.
- Metadata auditing: periodically validate the presence and consistency of biomedical tags.
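
As a rough illustration of the metadata auditing practice, the check could look something like the sketch below. It assumes each metadata entry is a dict shaped like the Example Metadata Entry above; `audit_tags` and `TAG_FIELDS` are hypothetical names, not part of any existing `git drs` command, and the on-disk layout of `.drs/` is still TODO.

```python
# Sketch only: entries are assumed to be dicts shaped like the example
# metadata entry; audit_tags is a hypothetical helper, not a git drs command.
TAG_FIELDS = ("patient_id", "specimen_id", "assay_id")


def audit_tags(entries):
    """Map each file path to the tag fields that are absent or blank."""
    report = {}
    for entry in entries:
        missing = [f for f in TAG_FIELDS if not str(entry.get(f) or "").strip()]
        if missing:
            report[entry["path"]] = missing
    return report


entries = [
    {"path": "path/to/file.vcf", "patient_id": "Patient-12345",
     "specimen_id": "Specimen-67890", "assay_id": "ServiceRequest-ABCDE"},
    {"path": "path/to/notes.txt", "assay_id": "ServiceRequest-98765"},
]
print(audit_tags(entries))
# prints {'path/to/notes.txt': ['patient_id', 'specimen_id']}
```

Since all three fields are optional, a report like this is advisory rather than an error; a project could decide which fields it wants to require.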

Future Features

- Search by Tags: Filter files by patient, specimen, or assay with upcoming `git drs ls --filter` commands.
- Validation Hooks: Pre-push validation to ensure biomedical identifiers are populated correctly.

✅ Summary

Tagging files with biomedical entity identifiers using git-gen3:

- Improves dataset traceability.
- Enables clinical-grade data management.
- Facilitates FAIR principles (Findable, Accessible, Interoperable, Reusable).

bwalsh · 2025-04-28T15:56:26Z

README-bulk-association.md

📄 Bulk Tagging Files with Biomedical Identifiers

Overview

In large research projects, it is often necessary to associate hundreds or thousands of files with biomedical identifiers such as Patient, Specimen, or Assay (ServiceRequest) IDs.

To streamline this process, git-gen3 supports bulk tagging using a simple manifest file.

This document describes how to prepare, import, and manage bulk file associations.

Preparing a Bulk Manifest

The manifest must be a CSV file with the following columns:

| Column Name | Description |
|---|---|
| `file_path` | Relative path to the file in the repository |
| `patient_id` | (Optional) Patient identifier |
| `specimen_id` | (Optional) Specimen identifier |
| `assay_id` | (Optional) Assay (ServiceRequest) identifier |

Example Manifest

file_path,patient_id,specimen_id,assay_id
path/to/data1.vcf,Patient-12345,Specimen-67890,ServiceRequest-ABCDE
path/to/data2.vcf,Patient-54321,Specimen-09876,ServiceRequest-EDCBA
path/to/image1.dcm,Patient-11223,,ServiceRequest-ZYXWV
path/to/notes.txt,,,ServiceRequest-98765

Only file_path is required.
Leave a field blank if no identifier applies.

Importing the Manifest

Use the git drs import-manifest command:

git drs import-manifest path/to/manifest.csv

This will:

- Look up each `file_path` in `drs_metadata.json`.
- Attach or update the `patient_id`, `specimen_id`, and `assay_id` fields.
- Skip entries for non-existent files with a warning.

Notes

- Files must already exist in the repository or have been tracked via `git drs add`.
- No overwriting of other metadata (e.g., etag, remote link) occurs.
- Existing biomedical identifiers will be overwritten if the manifest provides new values.
- Identifiers should be anonymous and non-identifiable to comply with data privacy guidelines.
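
To make the import rules above concrete, here is a minimal sketch of how they might fit together. It is not the actual `git drs import-manifest` implementation: it assumes the metadata is a dict keyed by file path (the real `drs_metadata.json` layout is not specified yet), and it assumes a blank manifest field leaves any existing value untouched. `import_manifest` is a hypothetical function name.

```python
import csv
import io
import warnings

# Assumption: only these three fields are ever touched by the import.
TAG_FIELDS = ("patient_id", "specimen_id", "assay_id")


def import_manifest(manifest_file, metadata):
    """Apply manifest rows to `metadata` (dict keyed by file path, assumed layout).

    Returns the number of entries updated. Unknown paths are skipped with a
    warning; other metadata such as etag or remote is never modified.
    """
    updated = 0
    for row in csv.DictReader(manifest_file):
        path = (row.get("file_path") or "").strip()
        entry = metadata.get(path)
        if entry is None:
            warnings.warn(f"skipping {path!r}: file is not tracked")
            continue
        for field in TAG_FIELDS:
            value = (row.get(field) or "").strip()
            if value:  # blank means "no identifier supplied": keep what is there
                entry[field] = value
        updated += 1
    return updated


metadata = {
    "path/to/data1.vcf": {"etag": "a7c1c0...", "patient_id": "Patient-00000"},
}
manifest = io.StringIO(
    "file_path,patient_id,specimen_id,assay_id\n"
    "path/to/data1.vcf,Patient-12345,Specimen-67890,ServiceRequest-ABCDE\n"
    "path/to/missing.vcf,Patient-54321,,\n"
)
print(import_manifest(manifest, metadata))  # one row applied, one skipped
```

Whether a blank field should instead clear an existing identifier is an open design question that the notes above do not settle.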

Example Workflow

# 1. Track your files as usual
git drs add path/to/data1.vcf
git drs add path/to/data2.vcf

# 2. Prepare a manifest.csv linking files to biomedical IDs

# 3. Import the manifest
git drs import-manifest manifest.csv

# 4. (Optional) Verify updates
git drs ls --patient-id Patient-12345

Future Enhancements (Planned)

- Validation of patient/specimen/assay ID formats.
- Support for JSON manifest input (for non-tabular workflows).
- Manifest-driven selective `git drs pull` (download files matching a manifest).

matthewpeterkort · 2025-04-29T20:05:35Z

docs/README-git-sync.md

+| Function                     | Description |
+|-----------------------------|-------------|
+| `fetch_github_teams()`      | Get org teams, members, and slugs |
+| `map_to_gen3_roles()`       | Transform GitHub teams → Gen3 roles |


Not sure what you mean by this. Are you implying that all data repos in GitHub would be under one organization, with per-user permissions managed in GitHub and then synced to the equivalent Gen3 terms?

@matthewpeterkort

Given the typical GitHub repository roles:

Read: Recommended for non-code contributors who want to view or discuss your project

Triage: Recommended for contributors who need to proactively manage issues, discussions, and pull requests without write access

Write: Recommended for contributors who actively push to your project

Maintain: Recommended for project managers who need to manage the repository without access to sensitive or destructive actions

Admin: Recommended for people who need full access to the project, including sensitive and destructive actions like managing security or deleting a repository

Mapping:

Read, Triage, Maintain: Mapped to gen3 read-only access

Admin, Write: Mapped to gen3 submitter, sower access
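
For discussion, the mapping could be expressed as a small lookup table. This is only a sketch: the exact mapping is still TBD, the Gen3-side role strings ("read-only", "submitter", "sower") are taken from the wording above rather than from a real Gen3 policy file, and `gen3_roles_for` is a hypothetical helper, not the signature of `map_to_gen3_roles()`.

```python
# Sketch of the proposed mapping; role strings on the Gen3 side are assumptions.
GITHUB_TO_GEN3 = {
    "read": ["read-only"],
    "triage": ["read-only"],
    "maintain": ["read-only"],
    "write": ["submitter", "sower"],
    "admin": ["submitter", "sower"],
}


def gen3_roles_for(github_role):
    """Return the Gen3 roles for one GitHub repository role (case-insensitive)."""
    key = github_role.strip().lower()
    if key not in GITHUB_TO_GEN3:
        raise ValueError(f"unknown GitHub role: {github_role!r}")
    return list(GITHUB_TO_GEN3[key])  # copy so callers cannot mutate the table


print(gen3_roles_for("Maintain"))  # prints ['read-only']
print(gen3_roles_for("write"))     # prints ['submitter', 'sower']
```

Rejecting unknown roles, rather than defaulting them to read-only, is a deliberate choice in this sketch so that a new GitHub role never silently grants access.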

matthewpeterkort · 2025-04-29T20:06:37Z

docs/README-git-sync.md

+
+```yaml
+projects:
+  project-xyz:


this makes it seem like each gen3 project is a github organization

Exact mapping is TBD, but point taken: does github.organization map to gen3.program 🤔

matthewpeterkort · 2025-04-29T20:10:11Z

docs/README-git-sync.md

+                |                            |                              |
+                +------------+---------------+------------------------------+
+                             |
+                             v


Looking at this without much background knowledge, it is unclear to me what the RoleSourceAdapter Interface aims to do.

At a high level GitLab, Bitbucket, Git Enterprise have different APIs. The RoleSourceAdapter would adapt to standard
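
A sketch of the idea, with heavy caveats: everything below except the name RoleSourceAdapter is hypothetical (`RoleAssignment`, `fetch_assignments`, the record shapes, and both adapter classes). The adapters read in-memory records instead of calling any real API. GitLab's numeric access levels (10 Guest through 50 Owner) are real, but how they should line up with GitHub-style role names is an assumption made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RoleAssignment:
    """One normalized (user, project, role) triple, whatever the source."""
    user: str
    project: str
    role: str  # GitHub-style name: read, triage, write, maintain, admin


class RoleSourceAdapter(ABC):
    """Each git hosting service gets one adapter that emits the same shape."""

    @abstractmethod
    def fetch_assignments(self) -> List[RoleAssignment]:
        ...


class GitHubAdapter(RoleSourceAdapter):
    def __init__(self, records):
        self._records = records  # stand-in for API responses

    def fetch_assignments(self):
        return [RoleAssignment(r["login"], r["repo"], r["permission"].lower())
                for r in self._records]


class GitLabAdapter(RoleSourceAdapter):
    # Assumed translation of GitLab access levels to GitHub-style names.
    LEVELS = {10: "read", 20: "triage", 30: "write", 40: "maintain", 50: "admin"}

    def __init__(self, records):
        self._records = records  # stand-in for API responses

    def fetch_assignments(self):
        return [RoleAssignment(r["username"], r["project"], self.LEVELS[r["access_level"]])
                for r in self._records]


sources = [
    GitHubAdapter([{"login": "alice", "repo": "project-xyz", "permission": "Admin"}]),
    GitLabAdapter([{"username": "bob", "project": "project-xyz", "access_level": 20}]),
]
for source in sources:
    print(source.fetch_assignments())
```

Downstream code that maps roles to Gen3 would then only ever see `RoleAssignment` values and would not need to know which service they came from.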

matthewpeterkort · 2025-04-29T20:42:23Z

docs/README-gitlfs-remote-buckets.md

+
+### 📦 Track Remote File
+```bash
+lfs-meta track-remote s3://my-bucket/data/foo.vcf \


I would look to simplify this further if possible. I understand that path is a subset of the full bucket path s3://my-bucket/data/foo.vcf but why is it needed?

matthewpeterkort · 2025-04-29T20:43:45Z

docs/README-gitlfs-remote-buckets.md

+
+### 🧬 Generate FHIR Metadata
+```bash
+lfs-meta init-meta \


This looks like a cool pattern. Guessing it would make our existing "push metadata from the META directory" pattern backwards compatible?

Yes at the end of the day, I don't anticipate many (any?) changes to publish

matthewpeterkort · 2025-04-29T20:45:57Z

docs/README-gitlfs-template-project.md

+  exit 0
+fi
+
+lfs-meta validate --file .lfs-meta/metadata.json || {


I missed it: what is the validate command doing here?

Same as g3t meta validate: is the current metadata complete?

matthewpeterkort · 2025-04-29T20:55:00Z

docs/README-git-sync.md

@@ -0,0 +1,301 @@
+# Overview `git-sync`
+


Guessing this is another server-side micro-service running in a pod. I understand there would be some sort of sync operation that takes GitHub-style inputs and reflects them in Gen3. I'd be curious to see a simplified OpenAPI spec showing exactly what this micro-service would look like.

bwalsh requested a review from Copilot April 14, 2025 18:25

Copilot AI reviewed Apr 14, 2025


bwalsh requested review from kellrott, teslajoy, matthewpeterkort, jordan2lee and quinnwai April 14, 2025 18:40

initial draft

31723c0

initial draft Co-authored-by: Copilot <[email protected]>

bwalsh force-pushed the feature/documentation branch from 568c2a2 to 31723c0 Compare April 16, 2025 17:09

bwalsh and others added 9 commits April 16, 2025 13:05

adds sequence diagram

ce76709

adds sequence diagram txt

41c23c5

adds overview

7e9b3c6

adds trackremote

16e2082

cleanup

59d2967

Starting to add user story based on DRS

0fc9231

Fixing small text issue

dfadf12

Adding more details about interfacing with a DRS server

04d8def

format table

2986b34

bwalsh added 4 commits April 28, 2025 10:10

Update README.md

db3ad45

adds hybrid-oid sha256

3f07306

Update README.md

b218d75

Update README-epic.md

e9fac72

matthewpeterkort reviewed Apr 29, 2025


kellrott and others added 13 commits May 7, 2025 09:13

Experimenting with integrating patterns from git-lfs

c6f9ff8

Adding DRS query test to code

be0294c

Starting to outline the DRS/indexd client support

c3c70a3

1st draft DRS query and download, make sure to setup .drsconfig

317750b

make DownloadFile more loosely coupled

90f6b63

drafted README requirements

4761c89

typos

a6b254a

clarify requirements, provide git lfs vs git drs comparison

cda55f9

Adds quickstart

dc107bc

Adds quickstart

d1eb8ea

Adds quickstart

00f927c

Adds quickstart

8734d0d

Adds quickstart

3972500
