ghreplica should be built as an event-driven GitHub mirror, not as a thin HTTP wrapper around a single set of GitHub-shaped tables.
The core flow is:
- ingest GitHub changes from webhooks and explicit backfills or repairs
- persist raw source data
- normalize into a canonical internal model
- project into API-ready read models
- serve GitHub-compatible responses through Echo
- optionally export the same data into analytics systems
Design goals:
- Mirror GitHub repository data with high fidelity.
- Stay backend-agnostic so storage can be swapped without rewriting domain logic.
- Support near-real-time sync via webhooks and explicit correctness repair via targeted jobs.
- Expose a GitHub-compatible API for tooling, agents, and triage systems.
- Make analytics exports first-class, but not the primary transactional store.
Non-goals for the first version:
- Full GitHub parity on day one.
- Using a dataset store as the primary write path.
- Re-implementing every GitHub behavior before the core replication loop works.
Use an event-driven replication pipeline with explicit storage ports.
That gives you:
- idempotent webhook ingestion
- replayability when parsers change
- repairability when webhooks are missing or out of order
- multiple downstream representations of the same source data
- a clean separation between "what GitHub said" and "how we serve it"
This is materially better than a direct webhook -> database row -> API design because GitHub webhooks can arrive out of order, can be redelivered, and do not always cover every repair scenario you need.
The intended application stack is:
- Echo for the HTTP surface
- GORM for persistence models and database access
- Postgres as the primary transactional backend
How to apply that choice cleanly:
- use GORM models for canonical tables, raw sync tables, and projection tables
- keep model boundaries explicit instead of letting handlers query tables ad hoc
- keep schema changes versioned and reviewable even if GORM is used for model management
- reserve direct SQL for exceptional cases where GORM would make a query or migration path unclear
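As a rough sketch of what a GORM model in this layout can look like, here is a raw ingestion table for webhook deliveries; the struct and field names are illustrative assumptions, not a fixed schema:

```go
package models

import "time"

// WebhookDelivery is a raw ingestion row: one record per GitHub delivery,
// persisted verbatim before any normalization or projection happens.
type WebhookDelivery struct {
	ID           uint      `gorm:"primaryKey"`
	DeliveryID   string    `gorm:"uniqueIndex;size:64"` // X-GitHub-Delivery header, the idempotency key
	Event        string    `gorm:"index;size:64"`       // X-GitHub-Event header, e.g. "issues"
	RepoFullName string    `gorm:"index;size:255"`      // "owner/name" scope, if present in the payload
	Payload      []byte    `gorm:"type:jsonb"`          // raw body exactly as received
	ObservedAt   time.Time `gorm:"index"`               // when the delivery hit the ingester
}
```

`db.AutoMigrate(&WebhookDelivery{})` is convenient during development, but schema changes should still land as versioned, reviewable migrations as noted above.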
So should the system still have tables? Yes, when the backend is relational.
The important point is that not all tables serve the same purpose. You should avoid one flat schema that is both the ingestion store and the API surface.
Recommended table groups:
- raw ingestion tables
  - webhook deliveries
  - crawl responses
  - request metadata and headers
- canonical domain tables
  - repositories
  - users
  - issues
  - pull requests
  - comments
  - labels
  - commits
  - checks
- projection tables
  - issue list views
  - pull request detail views
  - label indexes
  - commit status summaries
- control tables
  - cursors
  - sync checkpoints
  - leases
  - outbox/jobs
  - health/status
So the design is still table-based. The difference is that tables are split by role in the replication pipeline.
Datasets are a good export target and analytics substrate, not a good OLTP replication store.
The write path needs:
- low-latency upserts
- idempotency keys
- cursors and leases
- transactional projector updates
- efficient point reads and filtered queries
That is database work. If Hugging Face integration is useful, treat it as a sink:
- primary store: Postgres or another transactional backend
- blob store: S3, buckets, or filesystem
- analytics sink: Parquet or JSONL exports to datasets or buckets
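To make the OLTP requirement concrete, here is a minimal sketch of the idempotent upsert the primary store has to support, using GORM's ON CONFLICT clause; the Issue struct and column names are assumptions:

```go
package canonical

import (
	"gorm.io/gorm"
	"gorm.io/gorm/clause"
)

// Issue is a minimal stand-in for the canonical issue model.
type Issue struct {
	ID           uint   `gorm:"primaryKey"`
	GithubID     int64  `gorm:"column:github_id;uniqueIndex"`
	RepoFullName string `gorm:"index;size:255"`
	Number       int
	State        string
	Title        string
}

// UpsertIssue converges redelivered or out-of-order events onto one row keyed
// by the GitHub ID instead of inserting duplicates.
func UpsertIssue(db *gorm.DB, issue *Issue) error {
	return db.Clauses(clause.OnConflict{
		Columns:   []clause.Column{{Name: "github_id"}},
		DoUpdates: clause.AssignmentColumns([]string{"state", "title"}),
	}).Create(issue).Error
}
```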
Two ingestion paths feed the same replication log:
Webhook ingester:
- Receives GitHub webhook deliveries.
- Verifies signatures.
- Deduplicates by delivery ID.
- Persists the raw payload before doing anything else.
Backfill and repair ingester:
- Runs only when explicitly requested by policy or operator action.
- Fetches bounded GitHub REST resources.
- Repairs missed or insufficient webhook state.
Webhooks should be the default path. Backfills and repairs should be explicit and bounded.
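A sketch of the webhook ingester as an Echo handler: verify the signature, deduplicate by delivery ID, and persist the raw payload before anything else. The RawDelivery type and RawStore port are assumptions standing in for the raw event log:

```go
package webhooks

import (
	"context"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"io"
	"net/http"
	"time"

	"github.com/labstack/echo/v4"
)

// RawDelivery and RawStore are illustrative stand-ins for the raw event log port.
type RawDelivery struct {
	DeliveryID string
	Event      string
	Payload    []byte
	ObservedAt time.Time
}

type RawStore interface {
	// AppendIfNew persists the delivery unless its DeliveryID was already seen.
	AppendIfNew(ctx context.Context, d RawDelivery) error
}

// Handler verifies the signature, deduplicates by delivery ID, and persists the
// raw payload before any normalization work happens.
func Handler(secret []byte, store RawStore) echo.HandlerFunc {
	return func(c echo.Context) error {
		body, err := io.ReadAll(c.Request().Body)
		if err != nil {
			return c.NoContent(http.StatusBadRequest)
		}

		// GitHub signs the body with HMAC SHA-256 in X-Hub-Signature-256.
		mac := hmac.New(sha256.New, secret)
		mac.Write(body)
		want := "sha256=" + hex.EncodeToString(mac.Sum(nil))
		got := c.Request().Header.Get("X-Hub-Signature-256")
		if !hmac.Equal([]byte(want), []byte(got)) {
			return c.NoContent(http.StatusUnauthorized)
		}

		d := RawDelivery{
			DeliveryID: c.Request().Header.Get("X-GitHub-Delivery"),
			Event:      c.Request().Header.Get("X-GitHub-Event"),
			Payload:    body,
			ObservedAt: time.Now().UTC(),
		}
		if err := store.AppendIfNew(c.Request().Context(), d); err != nil {
			return c.NoContent(http.StatusInternalServerError)
		}
		// Normalization and projection run asynchronously off the raw log.
		return c.NoContent(http.StatusAccepted)
	}
}
```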
Store raw inputs exactly as received:
- webhook payloads
- crawl responses
- headers relevant to rate limits, pagination, and caching
- fetch metadata such as observed_at, source endpoint, delivery ID, and installation or repo scope
This log is the source material for the rest of the system. Never make the normalized tables your only copy.
Normalize GitHub data into an internal model that is stable across sources.
Examples:
- repositories
- users
- issues
- pull requests
- issue comments
- pull review comments
- pull reviews
- labels
- commits
- branches
- check suites
- check runs
- statuses
- releases
Important rule: keep GitHub identifiers and local identifiers separate.
- external IDs: GitHub numeric IDs, node IDs, full names, SHAs
- internal IDs: local surrogate keys if needed
The canonical model should also track:
- source_version or observed_at
- tombstones and deletions
- partial vs complete hydration state
- provenance: webhook, crawl, or manual import
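A sketch of a canonical entity with these rules applied; field names are illustrative assumptions:

```go
package model

import "time"

// Issue keeps GitHub identifiers alongside a local surrogate key, and every
// row carries provenance and hydration state.
type Issue struct {
	ID           uint   `gorm:"primaryKey"`                   // internal surrogate key
	GithubID     int64  `gorm:"column:github_id;uniqueIndex"` // GitHub numeric ID
	NodeID       string `gorm:"size:64"`                      // GitHub GraphQL node ID
	RepoFullName string `gorm:"index;size:255"`
	Number       int    `gorm:"index"`

	State string
	Title string
	Body  string

	// Replication metadata.
	SourceVersion string     // e.g. the delivery or crawl that produced this version
	ObservedAt    time.Time  // when this version was observed
	Provenance    string     // "webhook", "crawl", or "manual"
	Hydration     string     // "partial" or "complete"
	TombstonedAt  *time.Time // deletions are recorded, not erased
}
```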
Project canonical entities into read models optimized for API serving.
Examples:
- issue list view with joins already resolved
- pull request detail view
- timeline and event view
- commit status summary
- label index per repo
Projectors must be:
- idempotent
- replayable
- version-aware
- independently runnable
This is what lets you change how responses are shaped without changing ingestion.
Webhook projectors should be event-specific:
- repository
- issues
- issue_comment
- pull_request
- pull_request_review
- pull_request_review_comment
The default should be:
- apply what the webhook already tells us
- schedule targeted repair only if the payload is insufficient
- never trigger a full repo bootstrap because a single event arrived
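A sketch of those rules as an event-specific projector entry point. The Canonical and Jobs interfaces are assumptions, and the issue_comment case is trimmed to the repair-scheduling decision:

```go
package project

import (
	"context"
	"encoding/json"
)

// RawDelivery mirrors the ingester sketch; Canonical and Jobs are assumed ports.
type RawDelivery struct {
	Event   string
	Payload []byte
}

type Canonical interface {
	UpsertIssueFromWebhook(ctx context.Context, repo string, issue json.RawMessage) error
	HasIssue(ctx context.Context, repo string, number int) (bool, error)
}

type Job struct {
	Type   string
	Repo   string
	Number int
}

type Jobs interface {
	Enqueue(ctx context.Context, job Job) error
}

// ApplyDelivery applies what the webhook already carries and schedules a narrow
// repair only when the payload is not enough. It never starts a repo bootstrap.
func ApplyDelivery(ctx context.Context, d RawDelivery, canon Canonical, jobs Jobs) error {
	switch d.Event {
	case "issues":
		var ev struct {
			Issue json.RawMessage `json:"issue"`
			Repo  struct {
				FullName string `json:"full_name"`
			} `json:"repository"`
		}
		if err := json.Unmarshal(d.Payload, &ev); err != nil {
			return err
		}
		// The issues payload carries the full issue object, so upsert it directly.
		return canon.UpsertIssueFromWebhook(ctx, ev.Repo.FullName, ev.Issue)

	case "issue_comment":
		var ev struct {
			Issue struct {
				Number int `json:"number"`
			} `json:"issue"`
			Repo struct {
				FullName string `json:"full_name"`
			} `json:"repository"`
		}
		if err := json.Unmarshal(d.Payload, &ev); err != nil {
			return err
		}
		// The comment upsert itself is omitted from this sketch; what matters here
		// is that a missing parent issue triggers a targeted repair, nothing wider.
		ok, err := canon.HasIssue(ctx, ev.Repo.FullName, ev.Issue.Number)
		if err != nil {
			return err
		}
		if !ok {
			return jobs.Enqueue(ctx, Job{Type: "repair_issue", Repo: ev.Repo.FullName, Number: ev.Issue.Number})
		}
		return nil

	default:
		// Unhandled events stay in the raw log and can be replayed later.
		return nil
	}
}
```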
Serve GitHub-like endpoints from read models using Echo.
Keep this layer thin:
- parse GitHub-shaped requests
- translate filters and pagination into projection queries
- render GitHub-shaped JSON responses
- expose matching headers where practical
Handlers should depend on query services backed by read models. They should not reach directly into raw ingestion storage.
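A sketch of such a handler; IssueQueries and IssueView are assumptions for the projection-backed query service:

```go
package githubrest

import (
	"context"
	"net/http"
	"strconv"

	"github.com/labstack/echo/v4"
)

// IssueView is the API-ready read model row; the handler never touches raw storage.
type IssueView struct {
	Number int    `json:"number"`
	State  string `json:"state"`
	Title  string `json:"title"`
	// ...the remaining GitHub-shaped fields come straight from the projection row
}

type IssueQueries interface {
	ListIssues(ctx context.Context, owner, repo, state string, page, perPage int) ([]IssueView, error)
}

// ListIssues parses GitHub-shaped query parameters, translates them into a
// projection query, and renders the stored read model as JSON.
func ListIssues(q IssueQueries) echo.HandlerFunc {
	return func(c echo.Context) error {
		state := c.QueryParam("state")
		if state == "" {
			state = "open" // GitHub's default filter
		}
		page, _ := strconv.Atoi(c.QueryParam("page"))
		if page < 1 {
			page = 1
		}
		perPage, _ := strconv.Atoi(c.QueryParam("per_page"))
		if perPage < 1 || perPage > 100 {
			perPage = 30
		}

		issues, err := q.ListIssues(c.Request().Context(),
			c.Param("owner"), c.Param("repo"), state, page, perPage)
		if err != nil {
			return c.NoContent(http.StatusInternalServerError)
		}
		return c.JSON(http.StatusOK, issues)
	}
}
```

It would be registered as something like `e.GET("/v1/github/repos/:owner/:repo/issues", githubrest.ListIssues(queries))`.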
Fan out the canonical model or projections into secondary systems:
- HF datasets for batch analysis
- buckets or S3 for archival JSON or Parquet
- DuckDB or Parquet for offline analytics
- Kafka or NATS if other services want change streams
These sinks should be asynchronous and disposable. They are not the system of record.
The storage abstraction should be capability-based, not one giant repository interface.
Define narrow ports such as:
- EventLog: append raw ingress records; read by offset, time, or scope
- CursorStore: persist crawl cursors, ETags, sync checkpoints, and leases
- CanonicalStore: upsert normalized entities and relationships
- ProjectionStore: store and query API-ready read models
- BlobStore: persist large payloads, diffs, archives, and compressed JSON
- Outbox: drive projectors and async export workers reliably
This is what backend-agnostic should mean in practice: the domain depends on these ports, and concrete adapters implement them for Postgres, SQLite, object storage, or HF-backed sinks where appropriate.
In the default implementation, the relational adapters should be GORM-backed.
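A sketch of those ports as Go interfaces; the method sets and the stub types are illustrative, not a final contract:

```go
package ports

import (
	"context"
	"time"
)

// Stub domain types so the port sketch stands alone; the real definitions
// live in the domain model package.
type (
	RawRecord struct {
		DeliveryID string
		Event      string
		Payload    []byte
		ObservedAt time.Time
	}
	Issue     struct{ GithubID int64 }
	IssueView struct{ Number int }
	Job       struct {
		ID   int64
		Type string
		Repo string
	}
)

// EventLog appends raw ingress records and reads them back for replay.
type EventLog interface {
	Append(ctx context.Context, rec RawRecord) error
	ReadSince(ctx context.Context, offset int64, limit int) ([]RawRecord, error)
}

// CursorStore persists crawl cursors, ETags, sync checkpoints, and leases.
type CursorStore interface {
	Get(ctx context.Context, key string) (string, error)
	Set(ctx context.Context, key, value string) error
	AcquireLease(ctx context.Context, key string, ttl time.Duration) (bool, error)
}

// CanonicalStore upserts normalized entities and relationships.
type CanonicalStore interface {
	UpsertIssue(ctx context.Context, issue Issue) error
	// ...one narrow upsert per canonical entity family
}

// ProjectionStore stores and queries API-ready read models.
type ProjectionStore interface {
	PutIssueView(ctx context.Context, repo string, view IssueView) error
	ListIssueViews(ctx context.Context, repo, state string, page, perPage int) ([]IssueView, error)
}

// BlobStore persists large payloads, diffs, archives, and compressed JSON.
type BlobStore interface {
	Put(ctx context.Context, key string, data []byte) error
	Get(ctx context.Context, key string) ([]byte, error)
}

// Outbox drives projectors and async export workers reliably.
type Outbox interface {
	Enqueue(ctx context.Context, job Job) error
	Claim(ctx context.Context, n int) ([]Job, error)
	Ack(ctx context.Context, jobID int64) error
}
```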
For the first real version:
- Postgres for cursors, canonical entities, projections, and outbox
- GORM for relational persistence and model mapping
- S3, MinIO, or buckets for large raw payloads and archives
- Redis only if distributed queues or caching become necessary
This gives you the simplest reliable baseline. Then add alternative adapters later:
- SQLite for local single-node development
- HF buckets as blob storage
- HF datasets as async export targets
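On the relational side, keeping the backend swappable can be as small as choosing the GORM driver in one place. A minimal sketch, assuming the driver comes from config:

```go
package storage

import (
	"gorm.io/driver/postgres"
	"gorm.io/driver/sqlite"
	"gorm.io/gorm"
)

// OpenDB selects the relational backend behind a single constructor:
// Postgres is the default, SQLite covers local single-node development.
// The GORM models and port implementations stay the same either way.
func OpenDB(driver, dsn string) (*gorm.DB, error) {
	switch driver {
	case "sqlite":
		return gorm.Open(sqlite.Open(dsn), &gorm.Config{})
	default:
		return gorm.Open(postgres.Open(dsn), &gorm.Config{})
	}
}
```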
Every repo mirror should have explicit policy and explicit job types.
Per-repo policy should decide whether a repo is:
- webhook_only
- webhook_plus_backfill
- manual_only
Jobs should be typed and narrow. Prefer:
- apply_webhook_delivery
- repair_issue
- repair_pull_request
- repair_issue_comments
- repair_pull_request_reviews
- repair_pull_request_review_comments
- backfill_issues_page
- backfill_pulls_page
Avoid one generic refresh repo job as the main primitive.
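A sketch of those policies and job types as explicit constants; the Job struct shape is an assumption:

```go
package jobs

// SyncPolicy decides how a mirrored repo is kept up to date.
type SyncPolicy string

const (
	PolicyWebhookOnly         SyncPolicy = "webhook_only"
	PolicyWebhookPlusBackfill SyncPolicy = "webhook_plus_backfill"
	PolicyManualOnly          SyncPolicy = "manual_only"
)

// JobType enumerates the narrow job primitives; there is deliberately no
// generic "refresh repo" entry.
type JobType string

const (
	ApplyWebhookDelivery     JobType = "apply_webhook_delivery"
	RepairIssue              JobType = "repair_issue"
	RepairPullRequest        JobType = "repair_pull_request"
	RepairIssueComments      JobType = "repair_issue_comments"
	RepairPullRequestReviews JobType = "repair_pull_request_reviews"
	RepairPRReviewComments   JobType = "repair_pull_request_review_comments"
	BackfillIssuesPage       JobType = "backfill_issues_page"
	BackfillPullsPage        JobType = "backfill_pulls_page"
)

// Job is narrow by construction: one type, one repo scope, one optional
// resource identifier or page cursor.
type Job struct {
	Type   JobType
	Repo   string // "owner/name"
	Number int    // issue or PR number where relevant
	Page   int    // page cursor for backfill jobs
}
```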
When a broader sync is required, make it explicit:
- backfill: historical fetch for timelines, comments, reviews, or commits
- repair: targeted reconciliation of missing or suspicious objects
- incremental_backfill: bounded page-by-page catch-up for one resource family
Expose lag and health metadata:
- last successful webhook delivery time
- last successful repair or backfill time
- last projector lag
- last consistency repair result
Clients need to know how stale the mirror is.
Trying to replicate all of GitHub immediately will kill the project. Start with the repo-scoped read API that triage and agent systems actually use:
- GET /v1/github/repos/{owner}/{repo}
- GET /v1/github/repos/{owner}/{repo}/issues
- GET /v1/github/repos/{owner}/{repo}/issues/{number}
- GET /v1/github/repos/{owner}/{repo}/issues/{number}/comments
- GET /v1/github/repos/{owner}/{repo}/pulls
- GET /v1/github/repos/{owner}/{repo}/pulls/{number}
- GET /v1/github/repos/{owner}/{repo}/pulls/{number}/comments
- GET /v1/github/repos/{owner}/{repo}/pulls/{number}/reviews
- GET /v1/github/repos/{owner}/{repo}/commits
- GET /v1/github/repos/{owner}/{repo}/commits/{sha}
- GET /v1/github/repos/{owner}/{repo}/labels
- GET /v1/github/repos/{owner}/{repo}/check-runs
Then add:
- timelines and events
- search-like local indexes
- GraphQL compatibility for the subset you actually need
- optional write-through endpoints later if ever needed
Read-only parity is the right first target.
Compatibility is the product, so test it directly.
For a fixed fixture repo:
- fetch from GitHub
- fetch from ghreplica
- compare status codes
- compare headers you care about
- compare pagination behavior
- compare normalized JSON fields
Build endpoint contract tests before claiming compatibility.
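A sketch of one such contract test; the local base URL, the fixture repo acme/fixture, and the compared fields are all assumptions:

```go
package contract

import (
	"encoding/json"
	"io"
	"net/http"
	"testing"
)

// fetchJSON returns the status code and decoded JSON object for a URL.
func fetchJSON(t *testing.T, url string) (int, map[string]any) {
	t.Helper()
	resp, err := http.Get(url)
	if err != nil {
		t.Fatalf("GET %s: %v", url, err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	var parsed map[string]any
	if err := json.Unmarshal(body, &parsed); err != nil {
		t.Fatalf("decode %s: %v", url, err)
	}
	return resp.StatusCode, parsed
}

// TestIssueParity fetches the same issue from ghreplica and from GitHub and
// compares the status code plus the normalized fields the mirror supports.
func TestIssueParity(t *testing.T) {
	gotStatus, got := fetchJSON(t, "http://localhost:8080/v1/github/repos/acme/fixture/issues/1")
	wantStatus, want := fetchJSON(t, "https://api.github.com/repos/acme/fixture/issues/1")

	if gotStatus != wantStatus {
		t.Fatalf("status: got %d, want %d", gotStatus, wantStatus)
	}
	for _, field := range []string{"number", "state", "title", "body"} {
		if got[field] != want[field] {
			t.Errorf("field %q: got %v, want %v", field, got[field], want[field])
		}
	}
}
```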
Expect these failure cases:
- duplicate webhook deliveries
- missing webhook deliveries
- out-of-order deliveries
- force-pushes and rebases
- deleted branches or comments
- GitHub API pagination changes
- temporary rate limiting
- partial crawls that stop mid-stream
The architecture should survive all of them through raw logging, idempotent projectors, and scheduled repair crawls.
Suggested package layout:
cmd/
ghreplica/
main.go
internal/
api/
echo/
githubrest/
auth/
config/
domain/
model/
normalize/
syncstate/
ingest/
webhooks/
crawler/
project/
canonical/
projections/
jobs/
storage/
ports/
gorm/
postgres/
sqlite/
blob/
export/
parquet/
hf/
Overall data flow:
GitHub webhook or API
-> ingesters
-> raw event log
-> normalizer
-> canonical store
-> outbox
-> projectors
-> projection store
-> Echo API handlers
-> GitHub-compatible responses
Suggested build order:
- Define canonical entities and storage ports.
- Implement GORM-backed Postgres models plus blob storage adapters.
- Implement webhook receiver and raw event persistence.
- Implement repo bootstrap crawler.
- Implement normalizers for repos, issues, PRs, comments, labels, and commits.
- Implement projection workers for the first read endpoints.
- Implement GitHub-compatible REST handlers.
- Add contract tests against real GitHub fixture repos.
- Add async export adapters for HF datasets and buckets.
The recommended architecture is:
- Echo for the HTTP surface
- GORM for relational persistence
- event-driven ingestion from both webhooks and crawls
- canonical internal model
- replayable projections
- capability-based storage ports
- Postgres plus object storage as the initial reliable backend
- HF datasets or buckets as optional downstream sinks, not the primary transactional database