Commit 31723c0

bwalsh and Copilot committed
initial draft

Co-authored-by: Copilot <[email protected]>
1 parent a75df59 commit 31723c0

10 files changed, +1536 -0 lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
.idea/
.DS_Store

docs/README-comparison.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
# Comparison: Git LFS and g3t Integrated Data Platform (ACED-IDP)

A comparative overview of two distinct approaches to managing and storing large project data files: Git Large File Storage (Git LFS) and the ACED Integrated Data Platform (ACED-IDP).

---

## Git Large File Storage (Git LFS)

**Purpose:** Git LFS is an open-source Git extension designed to handle large files efficiently within Git repositories.

**Key Features:**

- **Pointer-Based Storage:** Replaces large files (e.g., audio, video, datasets) in the Git repository with lightweight text pointers, while storing the actual file contents on a remote server.

- **Seamless Git Integration:** Allows developers to use standard Git commands (`add`, `commit`, `push`, `pull`) without altering their workflow.

- **Selective File Tracking:** Developers specify which file types to track using `.gitattributes`, enabling granular control over large file management.

- **Storage Efficiency:** By offloading large files, it keeps the Git repository size manageable, improving performance for cloning and fetching operations.
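
As a rough illustration of the pointer mechanism described above, the sketch below parses the text of a Git LFS pointer file in Python. The `version`, `oid`, and `size` lines follow the published Git LFS pointer format; the helper name `parse_lfs_pointer` and the sample values are assumptions made for this example only. For reference, running `git lfs track "*.bam"` adds a line such as `*.bam filter=lfs diff=lfs merge=lfs -text` to `.gitattributes`.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the space-separated key/value lines of a Git LFS pointer file.

    Hypothetical helper for illustration; not part of Git LFS itself.
    """
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid line has the form "<algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algorithm": algorithm,
        "oid": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


# Sample pointer text (illustrative values).
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
"""

info = parse_lfs_pointer(POINTER)
print(info["oid_algorithm"], info["size"])
```

The repository stores only this small text file; the object identified by `oid` lives on the LFS server and is fetched on checkout.
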

**Use Cases:**

- Software development projects involving large binary assets, such as game development, multimedia applications, or data science projects.

---

## ACED Integrated Data Platform (ACED-IDP)

**Purpose:** ACED-IDP is a specialized data commons platform developed by the International Alliance for Cancer Early Detection (ACED) to facilitate secure and structured sharing of research data among member institutions.

**Key Features:**

- **Gen3-Based Infrastructure:** Utilizes Gen3, an open-source data commons framework, to manage data submission, storage, and access.

- **Command-Line Interface (CLI):** Provides the `gen3-tracker (g3t)` CLI tool for researchers to create projects, upload files, and associate metadata incrementally.

- **FHIR Metadata Integration:** Supports the addition of Fast Healthcare Interoperability Resources (FHIR) metadata, enhancing data interoperability and standardization.

- **Role-Based Access Control:** Implements fine-grained access controls to ensure data security and compliance with privacy regulations.

- **Data Exploration and Querying:** Offers tools for data exploration and querying, facilitating collaborative research and analysis.

**Use Cases:**

- Biomedical research projects requiring secure, standardized, and collaborative data management, particularly in multi-institutional settings.

---

## Comparative Summary

| Feature                    | Git LFS                                               | ACED-IDP                                            |
|----------------------------|-------------------------------------------------------|-----------------------------------------------------|
| **Primary Use Case**       | Managing large files in software development projects | Collaborative biomedical research data management   |
| **Integration**            | Seamless with Git workflows                           | Built on Gen3 framework with specialized CLI tools  |
| **Data Storage**           | Remote storage with Git pointers                      | Structured data commons with metadata support       |
| **Access Control**         | Inherits Git repository permissions                   | Role-based access control for data security         |
| **Metadata Support**       | Limited                                               | Comprehensive, including FHIR standards             |
| **Collaboration Features** | Standard Git collaboration tools                      | Enhanced tools for data exploration and querying    |

---

**Conclusion:**

- **Git LFS** suits developers who want to manage large files within their existing Git workflows. It needs only an LFS-capable remote (for example, a Git host such as GitHub or GitLab) rather than a separate data platform.

- **ACED-IDP** caters to the complex needs of collaborative biomedical research, providing a platform for secure data sharing, standardized metadata integration, and data exploration.

The choice between Git LFS and ACED-IDP depends on the specific requirements of the project, including the nature of the data, collaboration needs, and compliance considerations.

docs/README-epic.md

Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
# 🚀 Epic: Develop `git-gen3` Tool for Git-Based Gen3 Integration

> Create a Git-native utility to track and synchronize remote object metadata, generate FHIR-compliant metadata, and manage Gen3 access control using `git-sync`.

---

## 🧭 Sprint 0: Architecture Spike

### 🎯 Goal:
De-risk implementation by validating core architectural assumptions and tool compatibility.

### 🔬 Tasks:
| ID     | Task Description                                                             | Est. |
|--------|------------------------------------------------------------------------------|------|
| SPK-1  | Prototype `track-remote` to fetch metadata (e.g., ETag, size) from S3/GCS    | 1d   |
| SPK-2  | Simulate `.lfs-meta/metadata.json` usage in Git repo + commit/push           | 0.5d |
| SPK-3  | Test `init-meta` to produce `DocumentReference.ndjson` via `g3t`-style logic | 1d   |
| SPK-4  | Validate `git-sync` role mappings and diffs against the Gen3 Fence API       | 1d   |
| SPK-5  | Evaluate GitHub template DX: hooks, portability, local usage                 | 0.5d |

### ✅ Deliverables:
- Prototype CLI for `track-remote`
- Sample `.lfs-meta/metadata.json` and generated `META/DocumentReference.ndjson`
- Credential access matrix (S3, GCS, Azure)
- Feasibility report for Git-driven role syncing via `git-sync`
- Recommendation on proceeding with full implementation
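
A minimal sketch of what the SPK-1 prototype might look like for S3, written in Python. It assumes the metadata comes from an S3 `HeadObject` call, whose boto3 response includes `ContentLength` and `ETag`, so the object itself is never downloaded. The function names, the returned field names, and the `FakeS3Client` stand-in (used so the example runs offline) are hypothetical; a real prototype would pass `boto3.client("s3")` instead.

```python
from urllib.parse import urlparse


def split_s3_url(url: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not an s3://bucket/key URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def fetch_remote_metadata(client, url: str) -> dict:
    """Read size and ETag via a HEAD request; the object body is not fetched."""
    bucket, key = split_s3_url(url)
    head = client.head_object(Bucket=bucket, Key=key)
    return {
        "url": url,
        "size": head["ContentLength"],
        "etag": head["ETag"].strip('"'),  # S3 returns the ETag wrapped in quotes
    }


class FakeS3Client:
    """Offline stand-in for boto3.client("s3"); returns canned values."""

    def head_object(self, Bucket, Key):
        return {"ContentLength": 1024, "ETag": '"abc123"'}


record = fetch_remote_metadata(FakeS3Client(), "s3://example-bucket/data/sample.bam")
print(record)
```

Passing the client in as a parameter keeps the credential handling (environment variables, profiles) outside the function, which also makes the logic easy to unit test.
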

---

## 🧭 Sprint 1: CLI Bootstrapping & Remote File Tracking

### 🎯 Goal:
Create the `git-gen3` CLI structure and implement the ability to track remote cloud objects in Git without downloading them.

### 🔨 Tasks:
| ID   | Task Description                                          | Est. |
|------|-----------------------------------------------------------|------|
| S1-1 | Scaffold `git-gen3` CLI with Click (Python) or Cobra (Go) | 2d   |
| S1-2 | Implement `track` and `track-remote` subcommands          | 2d   |
| S1-3 | Write to `.lfs-meta/metadata.json`                        | 1d   |
| S1-4 | Support auth with AWS, GCS, Azure (env vars + profiles)   | 1d   |
| S1-5 | Add `pre-push` hook to validate metadata before push      | 1d   |
| S1-6 | Unit tests for `track-remote` and metadata structure      | 1d   |

### ✅ Deliverables:
- Functional CLI command: `git-gen3 track-remote s3://...`
- `.lfs-meta/metadata.json` updated and committed in Git
- Git hook active for metadata validation
- CI-ready foundation for next sprint
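
The sketch below illustrates, in Python, how tasks S1-3 and S1-5 could fit together: one function records an entry in `.lfs-meta/metadata.json`, and another performs the kind of completeness check a `pre-push` hook might run. The JSON layout (a mapping from object URL to `size` and `etag`) is an assumption for illustration; the epic does not define the schema, and both function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = ("size", "etag")  # assumed schema, not defined by the epic


def record_remote_object(repo_root: Path, entry: dict) -> dict:
    """Insert or update one entry in <repo>/.lfs-meta/metadata.json."""
    meta_path = repo_root / ".lfs-meta" / "metadata.json"
    meta_path.parent.mkdir(parents=True, exist_ok=True)
    index = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    index[entry["url"]] = {"size": entry["size"], "etag": entry["etag"]}
    # Sorted keys and a trailing newline keep Git diffs small and stable.
    meta_path.write_text(json.dumps(index, indent=2, sort_keys=True) + "\n")
    return index


def validate_index(index: dict) -> list[str]:
    """Return human-readable problems; an empty list means the index is valid."""
    problems = []
    for url, fields in index.items():
        for name in REQUIRED_FIELDS:
            if name not in fields:
                problems.append(f"{url}: missing {name}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    index = record_remote_object(
        Path(tmp),
        {"url": "s3://example-bucket/data/sample.bam", "size": 1024, "etag": "abc123"},
    )
    problems = validate_index(index)
    print(problems or "metadata OK")
```

A hook script could call `validate_index` and exit non-zero when the list is not empty, which would block the push.
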

---

## 🧭 Sprint 2: Metadata Initialization + FHIR Generation

### 🎯 Goal:
Transform `.lfs-meta/metadata.json` entries into Gen3-compatible `DocumentReference.ndjson` metadata using FHIR structure.

### 🔨 Tasks:
| ID   | Task Description                                                 | Est. |
|------|------------------------------------------------------------------|------|
| S2-1 | Implement `init-meta` to emit `META/DocumentReference.ndjson`    | 2d   |
| S2-2 | Populate FHIR fields: `subject`, `context.related`, `attachment` | 1d   |
| S2-3 | Create `validate-meta` command to check metadata completeness    | 1d   |
| S2-4 | Write tests for `init-meta` and FHIR formatting                  | 1d   |
| S2-5 | Document schema, CLI usage, and FHIR integration points          | 1d   |

### ✅ Deliverables:
- `git-gen3 init-meta` produces valid FHIR NDJSON
- Tool handles patient/specimen references
- Tests validate output conformance
- Documentation aligns with `g3t upload` workflows
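
To make the `init-meta` transformation more concrete, here is a small Python sketch that turns one tracked object into a FHIR R4 `DocumentReference` and serializes a list of resources as NDJSON (one JSON object per line). The fields used (`status`, `subject`, `context.related`, `content[].attachment`) exist in FHIR R4; the function names, the UUID-based `id` scheme, and the sample identifiers are assumptions, and the real tool may follow `g3t` conventions that differ from this.

```python
import json
import uuid
from typing import Optional


def to_document_reference(
    url: str, size: int, patient_id: str, specimen_id: Optional[str] = None
) -> dict:
    """Build a minimal FHIR R4 DocumentReference for one remote object."""
    resource = {
        "resourceType": "DocumentReference",
        # Deterministic id derived from the URL (an illustrative choice).
        "id": str(uuid.uuid5(uuid.NAMESPACE_URL, url)),
        "status": "current",
        "subject": {"reference": f"Patient/{patient_id}"},
        "content": [
            {"attachment": {"url": url, "size": size, "title": url.rsplit("/", 1)[-1]}}
        ],
    }
    if specimen_id is not None:
        # context.related links the document to other resources in R4.
        resource["context"] = {"related": [{"reference": f"Specimen/{specimen_id}"}]}
    return resource


def to_ndjson(resources) -> str:
    """Serialize resources as newline-delimited JSON."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in resources)


doc = to_document_reference(
    "s3://example-bucket/data/sample.bam", 1024, patient_id="P001", specimen_id="S001"
)
ndjson = to_ndjson([doc])
print(ndjson, end="")
```

Deriving the `id` from the URL means re-running the command on unchanged input produces identical output, which keeps the generated file stable under version control.
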

---

## 🧭 Sprint 3: Git-Sync Integration & Access Control

### 🎯 Goal:
Replace `collaborator` and `project-management` with Git-based role assignments using `git-sync` and the Gen3 Fence APIs.

### 🔨 Tasks:
| ID   | Task Description                                                | Est. |
|------|-----------------------------------------------------------------|------|
| S3-1 | Integrate `git-sync` YAML/CSV parser into `git-gen3 sync-users` | 2d   |
| S3-2 | Implement dry-run and apply modes for syncing to Gen3 Fence     | 1d   |
| S3-3 | Add change auditing (diff viewer from Git commits)              | 1d   |
| S3-4 | End-to-end test: Git → Gen3 user role propagation               | 1d   |
| S3-5 | Write user guide and governance documentation                   | 1d   |

### ✅ Deliverables:
- `git-gen3 sync-users` CLI reads Git-tracked access config
- Git diffs capture permission changes over time
- Gen3 access control (via Fence) is synced reliably
- Finalized documentation for institutional onboarding
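
The dry-run mode in S3-2 amounts to computing a difference between the desired role assignments (from the Git-tracked config) and the current ones. The Python sketch below shows only that comparison step, under the assumption that both sides can be reduced to a mapping of user to roles. The data shapes, role names, and the function `plan_role_changes` are hypothetical, and no Fence API calls are shown because the epic does not specify them.

```python
def plan_role_changes(desired: dict, current: dict) -> dict:
    """Compare desired and current role assignments.

    Both arguments map a user identifier to an iterable of role names.
    Returns the (user, role) pairs to grant and to revoke, in sorted order.
    """
    grants, revokes = [], []
    for user in sorted(set(desired) | set(current)):
        want = set(desired.get(user, ()))
        have = set(current.get(user, ()))
        grants += [(user, role) for role in sorted(want - have)]
        revokes += [(user, role) for role in sorted(have - want)]
    return {"grant": grants, "revoke": revokes}


# Illustrative data: "desired" would come from the Git-tracked config,
# "current" from the access-control service.
desired = {"alice@example.org": ["reader", "writer"], "bob@example.org": ["reader"]}
current = {"alice@example.org": ["reader"], "carol@example.org": ["reader"]}

plan = plan_role_changes(desired, current)
for action, pairs in plan.items():
    for user, role in pairs:
        print(f"[dry-run] {action} {role} -> {user}")
```

In apply mode, the same plan would be handed to whatever client performs the changes, so the dry-run output and the applied changes come from one code path.
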

---

## 📅 Sprint Timeline Summary

| Sprint | Focus                           | Duration | Deliverables                                  |
|--------|---------------------------------|----------|-----------------------------------------------|
| 0      | Architecture validation (spike) | 1 week   | Prototypes + greenlight for implementation    |
| 1      | Remote file tracking            | 2 weeks  | `track-remote`, `.lfs-meta`, validation hooks |
| 2      | Metadata generation (FHIR)      | 2 weeks  | FHIR output, `init-meta`, validation tooling  |
| 3      | Git-based access control        | 2 weeks  | `sync-users`, Git audit trail, Fence sync     |

---

## 🛠 Toolchain

| Purpose                | Tool/Stack                   |
|------------------------|------------------------------|
| CLI Language           | Python (Click) or Go (Cobra) |
| Object Store APIs      | boto3 (S3), gcsfs, Azure SDK |
| Metadata Serialization | JSON, FHIR NDJSON            |
| Access Sync            | git-sync + Gen3 Fence        |
| Testing                | `pytest` or `go test`        |
| Docs                   | Markdown, GitHub Pages       |

---

## 🧭 Sprint 4: User Testing, Documentation, and Release Planning

### 🎯 Goal:
Conduct functional and usability testing, finalize user documentation, and prepare for internal/external release of the `git-gen3` tool.

---

### 🔨 Tasks:
| ID   | Task Description                                                             | Est. |
|------|------------------------------------------------------------------------------|------|
| S4-1 | Recruit early adopters from internal teams or pilot projects                 | 0.5d |
| S4-2 | Collect and triage feedback via GitHub issues or survey                      | 1d   |
| S4-3 | Perform functional validation of all workflows (track, init-meta, sync)      | 1d   |
| S4-4 | Finalize and polish all CLI command help strings and usage messages          | 0.5d |
| S4-5 | Write end-user guide (markdown or GitHub Pages) with examples and FAQs       | 1d   |
| S4-6 | Create changelog and release notes for v1.0                                  | 0.5d |
| S4-7 | Define release checklist and governance process (e.g., approval flow)        | 0.5d |
| S4-8 | Tag first release, publish GitHub release, optionally register PyPI/Homebrew | 0.5d |

---

### ✅ Deliverables:
- End-user documentation published and linked from the repo
- Feedback collected from test users and incorporated as GitHub issues
- Final `v1.0.0` tag and release notes
- Optional: Package published to PyPI (Python) or Homebrew (Go binary)

---

### 📅 Sprint Timeline Summary (Updated)

| Sprint | Focus                           | Duration | Deliverables                                  |
|--------|---------------------------------|----------|-----------------------------------------------|
| 0      | Architecture validation (spike) | 1 week   | Prototypes + greenlight for implementation    |
| 1      | Remote file tracking            | 2 weeks  | `track-remote`, `.lfs-meta`, validation hooks |
| 2      | Metadata generation (FHIR)      | 2 weeks  | FHIR output, `init-meta`, validation tooling  |
| 3      | Git-based access control        | 2 weeks  | `sync-users`, Git audit trail, Fence sync     |
| 4      | Testing, docs, release planning | 1 week   | Docs, feedback, `v1.0.0` release              |

---
