
DM-54689: Load dashboard templates from GitHub #232

@jonathansick


Metadata

| Field | Value |
| --- | --- |
| Jira Key | DM-54689 |
| Jira URL | https://rubinobs.atlassian.net/browse/DM-54689 |

Problem Statement

As a Docverse tenant administrator, I want my organization's and projects' documentation dashboards to use templates that I control in a GitHub repository, so that my site's look, structure, and behavior match our own branding and conventions without requiring a new Docverse deployment.

As a template author, I want pushes to my template repository to automatically update the live dashboards that use that template, so that I can iterate on templates through normal PR workflows rather than contacting Docverse operators.

As a Docverse operator, I want the server to keep rendering dashboards from the last known-good template if a sync fails, so that a broken commit or GitHub outage does not take documentation down.

As a project author whose documentation and dashboard template live in the same repository, I want Docverse to re-sync only when files under the template's subdirectory actually change, so that unrelated commits to my docs repo do not churn the dashboard.

Solution

Docverse grows a new module — dashboard template sync — that sits behind the existing `TemplateSource` protocol and keeps a local database cache of templates fetched from GitHub.

From an org admin's point of view:

  1. Org admin installs the Docverse GitHub App on the repository hosting their template(s).
  2. Org admin sets the org's default dashboard-template binding via `PUT /orgs/{org_slug}/dashboard-template` with `owner`, `repo`, `ref` (branch or tag), and `root_path` (default `"/"`).
  3. Docverse immediately enqueues an initial `dashboard_sync`, which fetches the template tree, validates `template.toml`, stores the bytes in Postgres, then fans out `dashboard_build` jobs for every project whose effective template resolves to that binding.
  4. Subsequent pushes to the tracked ref fire a GitHub webhook. Docverse matches the push against all bindings for `(owner, repo, ref)`, filters out bindings whose `root_path` was not touched by the push (with a GitHub API fallback when the payload's changed-file list is truncated), and enqueues one `dashboard_sync` per surviving binding.
  5. Individual projects can override the org default via `PUT /orgs/{org_slug}/projects/{project_slug}/dashboard-template`; the override shadows the default for that project only.

Resolution order when rendering is: project override → org default → `BuiltInTemplateSource`. The built-in stays as the universal last-resort fallback so new orgs and the test suite work without any GitHub setup.
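The resolution order can be sketched in a few lines. This is a minimal illustration, not the planned `TemplateResolver`: the `Binding` dataclass, `resolve_template`, and the `BUILT_IN` sentinel are hypothetical names. It also assumes that a binding which has not yet synced any content resolves straight to the built-in; the text above does not say whether an unsynced project override should instead fall through to the org default.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Binding:
    """Hypothetical, simplified view of a dashboard template binding."""

    org: str
    project: Optional[str]  # None means "org default"
    content_id: Optional[int]  # None until the first successful sync


BUILT_IN = "built-in"  # stand-in for BuiltInTemplateSource


def resolve_template(
    org: str, project: str, bindings: list[Binding]
) -> Union[int, str]:
    """Pick project override, then org default, then the built-in."""
    override = next(
        (b for b in bindings if b.org == org and b.project == project), None
    )
    default = next(
        (b for b in bindings if b.org == org and b.project is None), None
    )
    chosen = override if override is not None else default
    # Assumption: a binding without synced content renders the built-in.
    if chosen is None or chosen.content_id is None:
        return BUILT_IN
    return chosen.content_id
```

With an org default pointing at content 10 and one project override pointing at content 20, the overridden project resolves to 20, its siblings resolve to 10, and a project in an org with no bindings resolves to the built-in.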

User Stories

  1. As an org admin, I want to register a GitHub repo + ref + root path as the default template for my organization so every project in it inherits that template without further configuration.
  2. As an org admin, I want to override the default template on a specific project so one noisy project can experiment without affecting the rest of the org.
  3. As an org admin, I want to see the last-sync status, last-synced commit SHA, and last-sync time for each of my bindings so I can diagnose whether a template change has landed.
  4. As an org admin, I want a sync failure surfaced with a human-readable reason (invalid `template.toml`, app not installed, GitHub 5xx, missing `root_path`, etc.) so I know how to fix it.
  5. As an org admin, I want to unset a binding (delete) so projects fall back to the org default or built-in.
  6. As an org admin, I want the binding endpoints scoped under my own org so I don't need super-admin privileges for routine template operations.
  7. As a Docverse super-admin, I want a `POST /admin/dashboard-templates/{id}/sync` endpoint so I can force a re-sync when debugging a stuck binding.
  8. As a template author, I want a push to the tracked branch of my template repo to re-render every dependent dashboard within seconds, so that PR merges feel live.
  9. As a template author, I want a push that only touches paths outside my template's `root_path` to be a no-op, so that working in a multi-template or monorepo repository doesn't trigger wasteful renders.
  10. As a template author, I want Docverse to read template sources and assets from the path I specified (not assume repo root) so I can organize a single repo into multiple template variants.
  11. As a Docverse operator, I want template bytes to live in Postgres alongside the binding so I do not need to provision an additional storage backend for the template system.
  12. As a Docverse operator, I want template content keyed by `(owner, repo, ref, root_path)` and written only when the GitHub ETag/tree SHA changes, so repeated identical pushes do not churn the table or re-render dashboards.
  13. As a Docverse operator, I want sync failures to keep the previously-cached bytes in place so a bad commit upstream does not blank out production dashboards.
  14. As a Docverse operator, I want one GitHub App per Docverse instance so tenants install it on their own repos and credentials stay scoped to my deployment.
  15. As a Docverse operator, I want GitHub App secrets (app ID, private key, webhook secret) loaded from `Config` using `SecretStr` so they follow the same handling as the existing Slack webhook secret.
  16. As a developer, I want a `GithubTemplateSource` that implements the existing `TemplateSource` protocol so the renderer pipeline and built-in source remain unchanged.
  17. As a developer, I want `dashboard_sync` to follow the shape of existing arq jobs (`dashboard_build`, `publish_edition`) — `async def f(ctx, payload) -> str`, queue-job progress tracking, structured logging, advisory-lock acquisition — so nothing about the worker runtime is surprising.
  18. As a developer, I want a new `DASHBOARD_TEMPLATE` lock class on `LockKey` so concurrent syncs of the same `(owner, repo, ref, root_path)` content serialize cleanly without contending with per-project build locks.
  19. As a developer, I want creation or update of a binding to enqueue an initial sync automatically, so bindings become usable without a separate admin action.
  20. As a developer, I want `DashboardPublisher` to resolve the `TemplateSource` for a project at render time (not at construction time) so per-project overrides work without bespoke wiring.
  21. As a developer, I want a respx-based fake GitHub in the test suite (app-JWT issuance, installation-token exchange, repo tree fetch, blob fetch, webhook HMAC verification) modeled after `respx` patterns already used for `DiscoveryClient`, so tests exercise the GitHub flow without network calls.
  22. As a developer, I want `dashboard_sync` tests that drive the worker function directly with a seeded binding and verify both the content upsert and the fan-out (`dashboard_build` jobs landing in `MockArqQueue`), mirroring the existing `tests/worker/dashboard_build_test.py`.
  23. As a developer, I want the webhook handler to use `safir.github`'s `GitHubWebhookRouter` (or the equivalent registration API) for payload validation and dispatch, so Docverse does not hand-roll HMAC or event parsing.
  24. As a developer, I want a project's effective template resolution (project override → org default → built-in) encapsulated in a single `TemplateResolver`-style service, so the worker and the admin API compute it identically.
  25. As a developer, I want a clean way to list "projects affected by this template content" given a `(owner, repo, ref, root_path)` — this is the fan-out input — and I want it expressed in terms of bindings and resolution order, not a denormalized reverse-index column (the table has no "rendered_with_template_id" on projects).
  26. As a documentation reader, I don't care which template was used — but I want the dashboard I'm looking at to reflect the latest template the org admin merged, ideally within a minute of the merge.
  27. As a Docverse tenant, I want a clear error (not a silent fallback) when I bind to a repo the Docverse GitHub App is not installed on, so I know to install the app.
  28. As a QA engineer, I want built-in templates to keep working with zero configuration so every test that previously rendered a dashboard continues to pass without DB fixtures for template bindings.

Implementation Decisions

Storage model

  • Binding configuration (`dashboard_template_bindings`) is deliberately separated from synced content (`dashboard_template_contents`, with per-file bytes in `dashboard_template_content_files`) so that multiple bindings pointing at the same `(owner, repo, ref, root_path)` share one cached content set.
  • `dashboard_template_bindings`: `id` PK, `org_id` FK, `project_id` FK (nullable; null = org default), `github_owner`, `github_repo`, `github_ref`, `root_path` (default `"/"`), `content_id` FK (nullable until first successful sync), `last_sync_status` (`pending`/`succeeded`/`failed`), `last_sync_error`, `date_created`, `date_updated`. Unique constraint on `(org_id, project_id)`.
  • `dashboard_template_contents`: `id` PK, `github_owner`, `github_repo`, `github_ref`, `root_path`, `commit_sha`, `etag` (GitHub ETag or tree SHA), `template_toml` (BYTEA), `date_synced`. Unique constraint on `(github_owner, github_repo, github_ref, root_path)` — the content dedup key.
  • `dashboard_template_content_files`: `id` PK, `content_id` FK, `relative_path`, `is_text` (Jinja/TOML/CSS/JS vs binary asset), `data` (BYTEA), `size_bytes`. Unique constraint on `(content_id, relative_path)`.
  • Overwrite-in-place: no per-commit history. Rollback means re-binding at an older ref.
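The overwrite-in-place behaviour of the content dedup key can be illustrated with an upsert. The sketch below uses SQLite from the standard library purely as a stand-in for Postgres, keeps only a subset of the columns listed above, and introduces a hypothetical helper, `upsert_content`; the real implementation would go through SQLAlchemy and an Alembic-managed schema.

```python
import sqlite3

# In-memory SQLite as a stand-in for Postgres (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dashboard_template_contents (
        id INTEGER PRIMARY KEY,
        github_owner TEXT NOT NULL,
        github_repo TEXT NOT NULL,
        github_ref TEXT NOT NULL,
        root_path TEXT NOT NULL DEFAULT '/',
        commit_sha TEXT NOT NULL,
        etag TEXT NOT NULL,
        UNIQUE (github_owner, github_repo, github_ref, root_path)
    )
    """
)


def upsert_content(owner, repo, ref, root_path, commit_sha, etag) -> int:
    """Keep exactly one row per (owner, repo, ref, root_path)."""
    conn.execute(
        """
        INSERT INTO dashboard_template_contents
            (github_owner, github_repo, github_ref, root_path, commit_sha, etag)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT (github_owner, github_repo, github_ref, root_path)
        DO UPDATE SET commit_sha = excluded.commit_sha, etag = excluded.etag
        """,
        (owner, repo, ref, root_path, commit_sha, etag),
    )
    row = conn.execute(
        """
        SELECT id FROM dashboard_template_contents
        WHERE github_owner = ? AND github_repo = ?
          AND github_ref = ? AND root_path = ?
        """,
        (owner, repo, ref, root_path),
    ).fetchone()
    return row[0]
```

Calling `upsert_content` twice with the same key and a newer commit returns the same row ID, so every binding whose `content_id` points at that row sees the new snapshot without being rewritten.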

Domain & service layout

  • New `services/dashboard_templates/` module:
    • `resolver.py` — `TemplateResolver.resolve_for_project(project) → ResolvedTemplate`: project override → org default → built-in.
    • `sync.py` — `DashboardTemplateSyncer`: fetches a tree from GitHub, validates `template.toml`, upserts content + files.
    • `fanout.py` — `DashboardRebuildFanout`: given a content row, enqueues `dashboard_build` for every project whose resolved template equals that content.
    • `template_source.py` — `DbTemplateSource` implementing the existing `TemplateSource` protocol, constructed from a `content_id`.
  • Renderer pipeline (`DashboardPublisher` and the four renderers) is unchanged; the worker resolves the right `TemplateSource` per project at render time.
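The fan-out input, "projects whose resolved template equals this content", can be computed from bindings and the resolution order alone, with no reverse-index column on projects. The following is a rough sketch under simplifying assumptions: projects are plain `(org, project)` pairs, bindings are in memory, and `Binding` and `projects_affected_by` are hypothetical names rather than the planned `DashboardRebuildFanout` API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Binding:
    """Hypothetical, simplified binding row."""

    org: str
    project: Optional[str]  # None means "org default"
    content_id: Optional[int]


def projects_affected_by(
    content_id: int,
    projects: list[tuple[str, str]],
    bindings: list[Binding],
) -> list[tuple[str, str]]:
    """Return the (org, project) pairs whose effective template is content_id."""
    overrides = {
        (b.org, b.project): b for b in bindings if b.project is not None
    }
    defaults = {b.org: b for b in bindings if b.project is None}
    affected = []
    for org, project in projects:
        # A project override shadows the org default for that project only.
        effective = overrides.get((org, project)) or defaults.get(org)
        if effective is not None and effective.content_id == content_id:
            affected.append((org, project))
    return affected
```

A project with its own override is excluded from the org default's fan-out even when the default's content changes, which is what keeps an experimenting project isolated from the rest of the org.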

GitHub client

  • New `storage/github/` module:
    • `app_client.py` — thin wrapper around `safir.github.GitHubAppClientFactory` reading `github_app_id`, `github_app_private_key`, `github_webhook_secret` from `Config`. Mirrors Times Square / Semaphore.
    • `tree_fetcher.py` — `GitHubTreeFetcher`: uses GitHub REST tree API (`recursive=1`) to discover files within `root_path`, fetches blob contents, captures `ETag` and commit SHA.
    • `changed_paths.py` — extracts changed files from a push webhook payload, with a fallback to `GET /repos/{owner}/{repo}/compare/{before}...{after}` when the payload signals truncation (GitHub truncates at 20 commits / 3000 files).
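As a rough sketch of the extraction step in `changed_paths.py`: each commit in a GitHub push payload carries `added`, `modified`, and `removed` path lists, which can be unioned into one set. How the payload "signals truncation" is not spelled out above, so the heuristic here (treat the list as possibly truncated once the commit count reaches a limit) and the names `changed_paths_from_push` and `commit_limit` are assumptions for illustration only.

```python
def changed_paths_from_push(
    payload: dict, commit_limit: int = 20
) -> tuple[set[str], bool]:
    """Collect changed paths from a push payload.

    Returns the set of paths and a flag saying whether the caller should
    fall back to the compare API because the list may be incomplete.
    """
    paths: set[str] = set()
    commits = payload.get("commits", [])
    for commit in commits:
        for key in ("added", "modified", "removed"):
            paths.update(commit.get(key, []))
    # Assumption: reaching the limit means the payload may be truncated.
    maybe_truncated = len(commits) >= commit_limit
    return paths, maybe_truncated
```

When the flag is set, the caller would ignore the partial set and ask the compare endpoint for the full file list between `before` and `after`.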

Webhook + admin API

  • New webhook handler at `/docverse/webhooks/github` using Safir's GitHub webhook router. Dispatches `push` events to a `PushEventProcessor` that:
    1. Finds all bindings matching `(github_owner, github_repo, github_ref)`.
    2. Filters bindings by intersection of the push's changed-path set with each binding's `root_path` (with GitHub compare API fallback on truncation).
    3. Enqueues one `dashboard_sync` per surviving binding.
  • New org-admin-scoped handlers:
    • `GET/PUT/DELETE /orgs/{org_slug}/dashboard-template` — org default binding.
    • `GET/PUT/DELETE /orgs/{org_slug}/projects/{project_slug}/dashboard-template` — project override.
    • `PUT` creates/updates and enqueues an initial `dashboard_sync`.
  • New super-admin handler:
    • `POST /admin/dashboard-templates/{id}/sync` — force re-sync of a specific binding.
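The `root_path` filter in the push processor comes down to a path-prefix test. The sketch below shows one way to write it so that sibling directories with a shared name prefix do not match each other; `root_path_touched` is a hypothetical helper name, and it assumes changed paths are repository-relative without a leading slash, as they appear in push payloads.

```python
from collections.abc import Iterable


def root_path_touched(root_path: str, changed_paths: Iterable[str]) -> bool:
    """Return True if any changed path lies at or under root_path."""
    root = root_path.strip("/")
    if not root:
        # root_path "/" covers the whole repository: any change counts.
        return any(True for _ in changed_paths)
    prefix = root + "/"
    # Compare on directory boundaries so "templates/a" does not match
    # "templates/ab/...".
    return any(p == root or p.startswith(prefix) for p in changed_paths)
```

A binding with `root_path="templates/a"` is kept for a push that modifies `templates/a/template.toml` and dropped for one that only touches `templates/ab/index.html` or `docs/index.rst`.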

Arq job: `dashboard_sync`

  • Signature `async def dashboard_sync(ctx, payload) -> str` matching existing jobs. Payload carries `binding_id`, `queue_job_id`, `queue_job_public_id`.
  • Execution:
    1. Acquire `LockKey.for_dashboard_template(...)`.
    2. Resolve the installation token.
    3. Fetch the tree.
    4. Compare the ETag with the existing content row.
    5. On change, upsert content + files atomically.
    6. Update the binding.
    7. Fan out `dashboard_build` jobs.
  • On failure: mark the binding `last_sync_status=failed` with the error text and leave `content_id` pointing at the previous content, so dashboards keep rendering from last-good.
  • Register in `WorkerSettings.functions` alongside the existing jobs.
  • New `LockClass.DASHBOARD_TEMPLATE` and `LockKey.for_dashboard_template` on the existing `LockService`.
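The two properties that matter most in the job, the ETag short-circuit and the last-known-good guarantee, can be shown with an in-memory sketch. Everything here is hypothetical (`Content`, `BindingState`, `sync_binding`, and the injected `fetch` callable); locking, token exchange, database writes, and fan-out are omitted, and validation is reduced to checking that `template.toml` is present.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Content:
    etag: str
    files: dict[str, bytes]


@dataclass
class BindingState:
    content: Optional[Content] = None
    last_sync_status: str = "pending"
    last_sync_error: Optional[str] = None


def sync_binding(
    state: BindingState,
    fetch: Callable[[], tuple[str, dict[str, bytes]]],
) -> str:
    """Return "updated", "unchanged", or "failed"."""
    try:
        etag, files = fetch()
        if "template.toml" not in files:
            raise ValueError("template.toml is missing")
    except Exception as exc:
        # Record the failure but keep the previous content untouched.
        state.last_sync_status = "failed"
        state.last_sync_error = str(exc)
        return "failed"
    state.last_sync_status = "succeeded"
    state.last_sync_error = None
    if state.content is not None and state.content.etag == etag:
        return "unchanged"  # nothing to write and nothing to fan out
    state.content = Content(etag=etag, files=files)
    return "updated"
```

Only the `"updated"` outcome would lead to `dashboard_build` fan-out; `"unchanged"` and `"failed"` both leave the stored snapshot exactly as it was.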

Config additions

  • Add `github_app_id`, `github_app_private_key` (`SecretStr`), `github_webhook_secret` (`SecretStr`) to `Config`. All are optional; when they are absent the feature is disabled: binding endpoints return 503 and the webhook endpoint returns 404. Follows the existing `slack_webhook` pattern.
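The feature gate implied by these optional settings might look like the sketch below. It uses a plain dataclass as a stand-in for the real `Config` (which would use `SecretStr` for the two secrets), and it assumes the feature counts as enabled only when all three values are present; `GitHubAppSettings` and `disabled_status` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GitHubAppSettings:
    """Hypothetical stand-in for the three optional Config fields."""

    github_app_id: Optional[str] = None
    github_app_private_key: Optional[str] = None
    github_webhook_secret: Optional[str] = None

    @property
    def enabled(self) -> bool:
        # Assumption: all three values are required for the feature.
        return None not in (
            self.github_app_id,
            self.github_app_private_key,
            self.github_webhook_secret,
        )


def disabled_status(
    settings: GitHubAppSettings, *, webhook: bool
) -> Optional[int]:
    """Return the HTTP status for a disabled feature, or None when enabled."""
    if settings.enabled:
        return None
    return 404 if webhook else 503
```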

Migrations

  • One Alembic revision creating the three new tables, modelled on the existing multi-step add-column migrations. No backfill required.

Reuse

  • `TemplateSource` protocol (existing): renderers untouched.
  • `DashboardBuildEnqueuer` (existing): fanout enqueues through it rather than reimplementing queue-job creation.
  • `LockService` (existing): extended with a new lock class only.
  • `HandlerFactory` (existing): new `create_*` methods for the new services, mirroring `create_dashboard_build_enqueuer`.
  • `RequestContext` / transaction conventions (CLAUDE.md): no change; new handlers own their transactions.
  • `safir.github` GitHub-app framework: no hand-rolled JWT or HMAC.

GitHub identity stability

Bindings and content rows carry both the GitHub display names (`github_owner`, `github_repo`) and the stable GitHub numeric IDs (`github_owner_id`, `github_repo_id`, `github_installation_id`; all nullable). Public API remains name-keyed; internal lookups prefer numeric IDs with a name fallback for bindings that have not yet completed their first sync.

Robustness against GitHub rename / transfer follows two layers:

  • Numeric IDs as stable internal keys. The sync job captures the repo/owner/installation IDs from the GitHub API response on first sync and writes them to the binding and content rows. The push event processor matches incoming events by `(github_repo_id, github_ref)` first and falls back to the name-based composite key for un-synced bindings.
  • Rename webhooks keep display names in sync. The webhook handler registers `repository.renamed`, `repository.transferred`, and `organization.renamed` events and rewrites name columns on bindings and content rows keyed by the stable ID carried in the payload. Content is not re-synced — only display strings change. Installation events (`created` / `deleted` / `suspend` / `unsuspend`) update a reachability flag on affected bindings so operators can see when a tenant has uninstalled the app.

Schema-level: three nullable columns (`github_owner_id`, `github_repo_id`, `github_installation_id`) land on `dashboard_template_bindings`; two (`github_owner_id`, `github_repo_id`) land on `dashboard_template_contents`. A non-unique composite index on `(github_repo_id, github_ref)` backs the ID-preferential push lookup. The existing name-based content dedup key is preserved — rename handlers rewrite names on existing rows rather than inserting new ones.
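The ID-preferential lookup can be sketched as follows. The in-memory `Binding` and `match_bindings` are hypothetical, and the sketch assumes the event's ref has already been normalized to the same form the binding stores; the real processor would express this as a query backed by the `(github_repo_id, github_ref)` index.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Binding:
    """Hypothetical, simplified binding row."""

    id: int
    github_owner: str
    github_repo: str
    github_ref: str
    github_repo_id: Optional[int] = None  # None until the first sync


def match_bindings(
    bindings: list[Binding],
    *,
    repo_id: int,
    owner: str,
    repo: str,
    ref: str,
) -> list[Binding]:
    """Match a push event to bindings, preferring the stable repo ID."""
    matched = []
    for b in bindings:
        if b.github_ref != ref:
            continue
        if b.github_repo_id is not None:
            # Synced bindings match on the numeric ID only, so a rename
            # or transfer does not break them.
            if b.github_repo_id == repo_id:
                matched.append(b)
        elif (b.github_owner, b.github_repo) == (owner, repo):
            # Un-synced bindings have no ID yet: fall back to names.
            matched.append(b)
    return matched
```

A binding synced under `old-org/templates` with repo ID 42 still matches a push that arrives as `new-org/templates` with the same ID, while a synced binding whose stored name happens to equal the event's name but whose ID differs does not.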

Testing Decisions

What makes a good test here

  • Assert on observable outcomes — rows written, arq jobs enqueued, HTTP status codes, uploaded bytes, log fields — not on internal method calls.
  • Use the real Postgres (testcontainers) and a real async SQLAlchemy session. Do not mock the DB.
  • Use `respx` to stub GitHub endpoints (JWT → installation token, tree fetch, blob fetch, compare API), mirroring the `mock_discovery` pattern in `tests/conftest.py`.
  • Use the existing `MockArqQueue` to assert fan-out counts and payloads.

Modules that get tests

  • `TemplateResolver` — resolution order (override exists, no override, no org default, content row missing → built-in fallback).
  • `DashboardTemplateSyncer` — (a) content + files written, (b) ETag short-circuit on unchanged re-sync, (c) validation failure leaves prior content intact and marks binding failed.
  • `DashboardRebuildFanout` — one content row + three dependent projects ⇒ three `dashboard_build` jobs enqueued.
  • `GitHubTreeFetcher` — tested via the syncer's respx setup, plus direct tests of the truncation-fallback branch.
  • `PushEventProcessor` — only bindings whose `root_path` intersects changed paths get synced; exercises both in-payload and truncated-payload paths.
  • `dashboard_sync` arq function — modelled on `tests/worker/dashboard_build_test.py`.
  • Webhook handler — signed POST, assert the expected `dashboard_sync` jobs land on `MockArqQueue`.
  • Admin + org binding endpoints — shape matches `tests/handlers/dashboard_test.py`: auth headers, status codes, JSON shapes, PUT idempotency, initial-sync enqueue.
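For the "signed POST" in the webhook handler test, the test side needs to produce a valid signature even though production verification stays with Safir's router. GitHub signs the raw request body with HMAC-SHA256 and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The helpers below are a hypothetical test utility (`sign_github_payload` and `build_signed_push` are not existing fixtures), sketched with the standard library only.

```python
import hashlib
import hmac
import json
import uuid


def sign_github_payload(secret: str, body: bytes) -> str:
    """Compute the X-Hub-Signature-256 value for a raw request body."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def build_signed_push(
    secret: str, payload: dict
) -> tuple[bytes, dict[str, str]]:
    """Return the body and headers for a simulated push delivery."""
    body = json.dumps(payload).encode()
    headers = {
        "Content-Type": "application/json",
        "X-GitHub-Event": "push",
        "X-GitHub-Delivery": str(uuid.uuid4()),
        "X-Hub-Signature-256": sign_github_payload(secret, body),
    }
    return body, headers
```

The body must be posted byte-for-byte as returned (for example via `content=body` on the HTTP client), because re-serializing the JSON would invalidate the signature.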

Prior art

  • `tests/conftest.py` — `mock_discovery`, `app`, `client`, `db_session`, `seed_org_with_admin`.
  • `tests/worker/dashboard_build_test.py` — direct invocation of an arq function with a fabricated `ctx` dict and seeded DB.
  • `tests/handlers/dashboard_test.py` — `AsyncClient` + `X-Auth-Request-User` admin-header pattern.

Out of Scope

  • Building the `lsst-sqre/docverse-templates` repository itself (tracked as a separate Jira story; this PRD uses fixture trees served by respx).
  • Periodic reconciliation / drift-audit job (deferred — MVP triggers are webhook + manual + on-binding-write).
  • Object-store or hybrid backing for template bytes (Postgres-only for this ticket).
  • Immutable per-commit template version history; only the current snapshot is retained per content row.
  • Squarebot/Kafka event delivery path (the design doc mentions it; this ticket delivers the direct-webhook path only).
  • Docverse-client CLI support for template bindings.
  • Template-level RBAC beyond "org admins manage bindings, super-admins manage manual sync".
  • Migration of the existing built-in templates into the GitHub layout; built-in stays as the last-resort fallback and is untouched.
  • UI changes — management is via the admin API.

Further Notes

  • Resolution order — project override → org default → built-in — is a strict superset of SQR-112: the built-in fallback is intentional and permanent so tests, new-org onboarding, and unconfigured bindings keep working.
  • The design doc's single `DashboardTemplate` table has been split into `dashboard_template_bindings` + `dashboard_template_contents` + `dashboard_template_content_files` to (a) deduplicate shared content across bindings pointing at the same repo/ref/path and (b) keep file bytes out of the row handlers hit for CRUD.
  • Because content dedup keys include `root_path`, a single repository can host multiple template variants in sibling directories, each with its own `template.toml`. A project's dashboard template can even live alongside its docs in the same repo.
  • The changed-path filter is an optimization, not a correctness requirement: it only avoids unnecessary syncs. Large pushes may have truncated file lists, in which case the processor falls back to the GitHub compare API. Tests must cover both branches.
  • Sync failures never clear stored content or `content_id` — dashboards render from the last-good snapshot. Failure surfaces through `QueueJob` status, the binding's `last_sync_status` / `last_sync_error`, and structured logs.
  • Per-tenant GitHub App installation matches the Jira wording ("installed by tenants of the corresponding Docverse instance") and the Times Square / Semaphore operational model.
  • Rename-robustness follow-up tasks extend this PRD: #240 (schema: add nullable GitHub numeric ID columns to the dashboard-template foundation), #241 (capture GitHub numeric IDs on sync and match by ID in the push processor), and #242 (GitHub rename and installation webhook event handlers). They were added after the core breakdown landed, in response to a Times Square incident where a GitHub organization rename silently broke a name-only integration.
