DM-54426: Implement build uploads #139

Merged

jonathansick merged 59 commits into main from tickets/DM-54426 on Mar 27, 2026

Conversation

@jonathansick
Member

Summary

Implements minimal end-to-end support for build uploads on both the client and server side, following the tarball-to-object-store upload model described in SQR-112.

Server-side

  • Credential encryption: Fernet-based CredentialEncryptor with key rotation support via MultiFernet. Organization credentials are encrypted at rest and stored as BYTEA in Postgres.
  • Organization credentials API: CRUD endpoints for managing encrypted object store credentials scoped to organizations.
  • Object store abstraction: ObjectStore protocol with async context manager support, S3-compatible implementation (S3ObjectStore via aiobotocore), mock implementation for testing, and a factory that routes by service type (s3, r2, minio).
  • Presigned upload URL: Build creation (POST) returns a presigned upload URL pointing to the staging location in the org's object store.
  • Build processing worker: arq worker function that downloads the staged tarball, stream-unpacks it, uploads individual files to __builds/{public_id}/ with bounded concurrency, records inventory, and transitions the build to completed. Staging tarball cleanup is best-effort.
  • Factory subclasses: HandlerFactory (with ArqQueueBackend) for HTTP handlers and WorkerFactory (with NullQueueBackend) for arq workers, preventing accidental recursive job enqueueing.

Client-side

  • DocverseClient: async httpx client implementing the full upload workflow — create_build(), upload_tarball(), complete_upload(), get_queue_job(), wait_for_job() with exponential backoff.
  • docverse upload CLI: Click command wrapping the client for use in CI pipelines and shell scripts, with env var support and git ref auto-detection.
  • Tarball creation: create_tarball() utility that creates a .tar.gz and computes the SHA-256 content hash in a single pass.
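The single-pass hash-while-writing approach behind create_tarball() can be sketched like this; the signature and the _HashingWriter helper are assumptions, since the PR does not show the client source:

```python
import hashlib
import tarfile
from pathlib import Path


class _HashingWriter:
    """File-like wrapper that tees every write into a SHA-256 hash."""

    def __init__(self, fh) -> None:
        self._fh = fh
        self.sha256 = hashlib.sha256()

    def write(self, data: bytes) -> int:
        self.sha256.update(data)
        return self._fh.write(data)

    def flush(self) -> None:
        self._fh.flush()


def create_tarball(source_dir: Path, output: Path) -> str:
    """Write output (.tar.gz) from source_dir and return the hex SHA-256
    of the compressed archive, computed while writing (no second read pass)."""
    with output.open("wb") as raw:
        writer = _HashingWriter(raw)
        with tarfile.open(fileobj=writer, mode="w:gz") as tar:
            tar.add(source_dir, arcname=".")
    return writer.sha256.hexdigest()
```

Note that arcname="." gives every member a ./ prefix, which a later commit in this PR has to normalize away when building S3 keys.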

Review fixes

  • Build prefix uses Crockford Base32 public ID instead of internal integer PK
  • ObjectStore protocol includes __aenter__/__aexit__ for async context manager usage
  • Factory and objectstore factory return ObjectStore protocol type instead of S3ObjectStore
  • get_credential handler delegates to CredentialService.get_by_label() instead of directly accessing the store
  • CredentialEncryptor.encrypt() uses MultiFernet.encrypt() instead of redundant _current attribute
  • Removed unused public create_queue_backend method from HandlerFactory
  • Documented BuildStatus.uploaded as a signal-only value

Test plan

  • uv run --only-group=lint pre-commit run --all-files passes
  • uv run --only-group=nox nox -s typing passes (0 errors)
  • uv run --only-group=nox nox -s client_test passes (15/15)
  • uv run --only-group=nox nox -s test passes

Commits

Implement per-organization encrypted credential storage for connecting to external object stores (S3, R2, GCS). Credentials are encrypted at rest using Fernet with support for key rotation via a retired key.

- CredentialEncryptor wrapping MultiFernet (encrypt/decrypt/rotate)
- SqlOrganizationCredential table with (org_id, label) unique constraint
- publishing_credential_label and staging_credential_label on organizations
- OrganizationCredentialStore, CredentialService, domain model
- Client models for credential API
- Admin API: POST/GET/DELETE /orgs/{org}/credentials
- S3ObjectStore (aiobotocore) and objectstore factory
- Factory methods for credential service and objectstore resolution
- Alembic migration for new table and organization columns
- Tests for encryptor round-trip/rotation, credential store CRUD, and credential handler endpoints

Extend the ObjectStore protocol with download_object for the worker to fetch staging tarballs. Add an in-memory MockObjectStore for use in unit tests.

Replace the placeholder upload_url in post_build with a real presigned URL generated from the org's configured object store credential. Orgs without a credential configured return upload_url: null.

Implement the build_processing arq task that downloads a staged tarball, unpacks it, and uploads extracted files to the object store under the __builds/{build_id}/ prefix with bounded concurrency. On success it updates inventory counts and transitions the build to completed; on failure it transitions to failed.

- build_processing worker function with tar.gz streaming unpack
- Worker startup initializes CredentialEncryptor from config
- NullQueueBackend for worker contexts that don't enqueue
- get_by_id methods on BuildStore and OrganizationStore
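The bounded-concurrency upload step above can be sketched with an asyncio.Semaphore; upload_all and upload_one are hypothetical names standing in for the worker's internals:

```python
import asyncio


async def upload_all(
    files: dict[str, bytes],
    upload_one,                    # async callable (key, body) -> None
    max_concurrency: int = 10,
) -> int:
    """Upload every extracted file with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def _guarded(key: str, body: bytes) -> None:
        # The semaphore bounds how many uploads run concurrently,
        # keeping memory and connection use predictable.
        async with sem:
            await upload_one(key, body)

    await asyncio.gather(*(_guarded(k, b) for k, b in files.items()))
    return len(files)
```

The returned count corresponds to the object_count inventory figure the worker records before marking the build completed.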

Split Factory into an ABC base with a template method for queue backend creation, letting handlers use ArqQueueBackend and workers use NullQueueBackend without manual service wiring.

Separate object storage modules from database store modules by grouping them into a dedicated subpackage with public re-exports via __init__.py.

Implement the `docverse upload` command for uploading documentation builds, backed by DocverseClient (async httpx). Includes tarball creation, presigned URL upload, and job polling with exponential backoff. The default API URL is https://roundtable.lsst.cloud/docverse/api (env var: DOCVERSE_API). CLI help shows env vars via show_envvar=True.

Add client_test to default nox sessions.
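The polling loop behind wait_for_job() might look like the following sketch. The parameter names are assumptions; the terminal status names are taken from the QueueJob states mentioned in later commits (in_progress, completed, failed):

```python
import asyncio


async def wait_for_job(
    fetch_status,                  # async callable () -> str
    *,
    initial_delay: float = 0.5,
    max_delay: float = 30.0,
    timeout: float = 600.0,
) -> str:
    """Poll fetch_status with exponential backoff until the job reaches
    a terminal state, or raise TimeoutError."""
    delay = initial_delay
    deadline = asyncio.get_running_loop().time() + timeout
    while True:
        status = await fetch_status()
        if status in {"completed", "failed"}:
            return status
        if asyncio.get_running_loop().time() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout}s")
        await asyncio.sleep(delay)
        # Double the delay each round, capped so long jobs are not
        # polled less than once per max_delay seconds.
        delay = min(delay * 2, max_delay)
```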

Return ObjectStore protocol type instead of S3ObjectStore from create_objectstore factory, enabling future non-S3 implementations.

Add __aenter__/__aexit__ to the ObjectStore protocol and MockObjectStore so callers can use the protocol type as an async context manager.

…-effort cleanup

- Use serialize_base32_id(build.public_id) for the build prefix instead of the internal integer PK, correlating S3 paths with API identifiers.
- Type _process_build's object_store parameter as ObjectStore instead of Any.
- Wrap staging tarball deletion in try/except so a deletion failure doesn't fail an otherwise-successful build.

Add get_by_label() to CredentialService for fetching credentials by org slug and label. Rewrite the get_credential handler to delegate to the service instead of directly accessing the store, matching the pattern used by the other credential endpoints.

Add a docstring clarifying that 'uploaded' is never persisted to the database — the server transitions directly from pending to processing.

Remove the unused public create_queue_backend method from HandlerFactory — the private _create_queue_backend is sufficient. Also update create_objectstore_for_org to return ObjectStore protocol type instead of S3ObjectStore.

Remove the redundant _current attribute. MultiFernet.encrypt() already delegates to the first key, so self._fernet.encrypt() is equivalent.
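The MultiFernet claim in the commit above can be demonstrated directly, assuming the cryptography package:

```python
from cryptography.fernet import Fernet, MultiFernet

current = Fernet(Fernet.generate_key())
retired = Fernet(Fernet.generate_key())

# encrypt() always uses the first key in the list, so no separate
# "current" handle is needed; decrypt() tries each key in order.
mf = MultiFernet([current, retired])

old_token = retired.encrypt(b"s3cr3t")        # written under the retired key
assert mf.decrypt(old_token) == b"s3cr3t"     # still readable via MultiFernet

new_token = mf.rotate(old_token)              # re-encrypt under the current key
assert current.decrypt(new_token) == b"s3cr3t"
```

This is the key-rotation model the CredentialEncryptor wraps: new writes use the current key, old ciphertexts stay readable, and rotate() migrates them forward.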
jonathansick added a commit to lsst-sqre/phalanx that referenced this pull request Mar 19, 2026

The client's PEP 695 `type Base32Id = ...` statement creates a TypeAliasType that Pydantic 2.5 cannot unwrap, breaking the client_test_oldest nox session. Rather than bumping the minimum Pydantic version, move Base32Id to the server domain package (using a TypeAlias assignment for compatibility) and simplify client models to use plain `str` for IDs. Clients only ever receive and return serialized base32 strings — validation and integer storage are server-side concerns.

- Move full base32id implementation into src/docverse/domain/base32id.py
- Change client Build.id and QueueJob.id from Base32Id to str
- Remove base32-lib from client dependencies
- Pass pre-serialized ID strings in handler from_domain methods
- Update client test assertions to compare strings
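For illustration, a Crockford base32 codec looks like this. The real code uses base32-lib; serialize_base32_id is the name used elsewhere in this PR, parse_base32_id is hypothetical, and this decode map omits Crockford's usual I/L/O aliasing:

```python
# Crockford's alphabet drops I, L, O, U to avoid ambiguous characters.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
_DECODE = {c: i for i, c in enumerate(CROCKFORD)}


def serialize_base32_id(n: int) -> str:
    """Encode a non-negative integer PK as a Crockford base32 string."""
    if n < 0:
        raise ValueError("IDs must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 32)
        digits.append(CROCKFORD[rem])
    return "".join(reversed(digits))


def parse_base32_id(s: str) -> int:
    """Inverse of serialize_base32_id (case-insensitive)."""
    n = 0
    for c in s.upper():
        n = n * 32 + _DECODE[c]
    return n
```

Using these opaque strings in S3 prefixes and API responses keeps the internal integer PK from leaking, which is the point of the earlier review fix.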

Replace the single organization_credentials table with a three-layer model: credentials (provider-level encrypted secrets), services (non-secret config referencing a credential), and organization slot assignments (publishing_store, staging_store, cdn_service, dns_service).

Credentials are now typed with a CredentialProvider enum and use discriminated union Pydantic models. Services have typed configs per ServiceProvider with provider-credential compatibility validation. Organization GET responses embed lightweight service summaries via HATEOAS links; POST/PATCH use label references.

Includes Alembic migration, full CRUD endpoints and service layer for organization services, updated objectstore factory for two-step resolution (service label -> config + credential -> decrypt -> build), and tests for all new storage and handler code.

The Claude Code sandbox requires TC_HOST=localhost and TESTCONTAINERS_RYUK_DISABLED=true to run testcontainers. Bake these into the documented server test command.

Configure Alembic logging in alembic.ini so migrations emit INFO-level output, and add structured log messages to the update_db_schema command that report current and target revisions before and after migration.

Remove `docverse init` from start-service.sh so pods no longer stamp the database on every startup. This prevents schema drift when new code starts before the Argo CD migration job runs. Add a guard to the `init` CLI command that refuses to run on an already-initialized database (unless --reset is passed), ensuring `docverse update-db-schema` is the sole path for production schema changes.

The init_database function calls Base.metadata.create_all(), which attempts to create GIN indexes using gin_trgm_ops. This fails on databases (e.g. CloudSQL) where the pg_trgm extension is not pre-installed, since the extension was only created in the Alembic migration path, not in the init path.

Extract get_current_revision() into database.py as a public function and use it to log the app version and current Alembic revision in the FastAPI lifespan, arq worker startup, and the init/update-db-schema CLI commands. This makes it easier to diagnose deployment failures caused by schema mismatches.

Add a `members` field to `OrganizationCreate` so that the admin org creation endpoint can atomically seed memberships. This solves the bootstrap problem where the membership API requires an existing admin but a new org has no members. Duplicate (principal_type, principal) entries are silently deduplicated (first wins). Simplify the test helper `seed_org_with_admin` to use the new API field instead of direct DB access.

Users with the `admin:docverse` Gafaelfawr scope are now treated as org admins everywhere, without needing explicit membership. Add `get_scopes()` to the `UserInfoStore` protocol, short-circuit `AuthorizationService.resolve_role()` when the super admin scope is present, and wire scopes through the `OrgRoleDependency`.

The scope-based super admin detection via SUPERADMIN_SCOPE was architecturally broken: Gafaelfawr's delegated internal token only carries the fixed ingress scopes (exec:docverse), not the user's original scopes like admin:docverse. Replace with a config-based list of super admin usernames (DOCVERSE_SUPERADMIN_USERNAMES) checked against the authenticated X-Auth-Request-User header. Remove get_scopes() from the UserInfoStore protocol and delete the constants module.

Implement the UserInfoStore protocol using GafaelfawrClient from rubin-gafaelfawr to resolve user group memberships via the Gafaelfawr API. Wire it into the application lifespan using safir's shared httpx client. Tests override the store back to StubUserInfoStore to avoid requiring a live Gafaelfawr.

pydantic-settings v2 tries to JSON-parse complex types from env vars. Add a BeforeValidator that splits comma-separated strings into lists so the field works with both JSON arrays and plain comma-separated values as documented.
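The validator's core can be sketched as a plain function; per the commit it would be attached with Annotated[list[str], BeforeValidator(...)], and split_csv is a hypothetical name:

```python
def split_csv(value):
    """BeforeValidator-style hook: pass lists through unchanged, split a
    comma-separated string into a stripped, non-empty list of items."""
    if isinstance(value, str):
        return [item.strip() for item in value.split(",") if item.strip()]
    return value
```

Running before type coercion, this lets DOCVERSE_SUPERADMIN_USERNAMES be set as either `["alice","bob"]` or `alice,bob`.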

Test the full HTTP path through OrgRoleDependency and UserInfoStore.get_groups() for group-type memberships, which was previously only covered at the storage layer.

The server and client packages both declared a "docverse" script, causing the server's entrypoint to shadow the client's in the uv workspace. Rename the server CLI to "docverse-admin" so the unqualified "docverse" command resolves to the user-facing client CLI (docverse upload).

Use org/project slugs and base32 build IDs instead of internal integer IDs in structured log keys for better operator traceability.

Cloudflare R2 and MinIO use S3-compatible access keys, not provider-specific API tokens. The new `s3` credential type (access_key_id + secret_access_key) fixes the KeyError that occurred when the objectstore factory tried to read access keys from a `cloudflare` credential payload that only has `api_token`.

Adds a -v/--verbose option that prints request and response details to stderr, with bearer tokens masked for safe display.

Enable updating credential_label on an existing service without deleting and recreating it. The endpoint validates that the new credential exists and is provider-compatible before applying the change.

The enqueue method was spreading the payload dict as **kwargs, but the worker function signature expects a single `payload` kwarg. This caused a TypeError when the worker tried to call the function.

The web app was enqueuing jobs to arq's default "arq:queue" while the worker listens on "docverse:queue" (config.arq_queue_name). Thread the configured queue name from config through ContextDependency, HandlerFactory, and ArqQueueBackend so jobs are enqueued to the correct queue.

Restructure build_processing() into separate transaction phases so the error-handling transaction runs after the main transaction's context manager has fully exited. Previously, rolling back and starting a new transaction inside an active context manager caused SQLAlchemy's InvalidRequestError. Remove explicit commit()/rollback() calls in favor of letting async with session.begin() manage transaction lifecycle.
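The phase split can be illustrated with a stand-in session whose begin() mimics SQLAlchemy's commit-on-exit, rollback-on-error behavior; FakeSession and the function body are assumptions, not the actual worker code:

```python
import asyncio
import contextlib


class FakeSession:
    """Minimal stand-in for an AsyncSession: begin() commits on clean
    exit and rolls back if the block raises."""

    def __init__(self) -> None:
        self.log: list[str] = []

    @contextlib.asynccontextmanager
    async def begin(self):
        try:
            yield self
            self.log.append("commit")
        except Exception:
            self.log.append("rollback")
            raise


async def build_processing(session: FakeSession, work) -> str:
    # Phase 1: main transaction. Let `async with session.begin()` own
    # commit/rollback instead of calling them explicitly.
    try:
        async with session.begin():
            await work()
        return "completed"
    except Exception:
        pass  # phase 1 has fully exited before phase 2 starts
    # Phase 2: the error-handling transaction runs outside the first
    # context manager, avoiding SQLAlchemy's InvalidRequestError.
    async with session.begin():
        session.log.append("mark build failed")
    return "failed"
```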

botocore 1.36+ defaults to adding CRC32 checksums with aws-chunked transfer encoding for PutObject, which Cloudflare R2 doesn't support, causing SignatureDoesNotMatch errors on uploads.

Allows looking up a QueueJob by its arq backend job ID, which is needed for the worker to find and update the corresponding QueueJob record.

The worker now looks up the QueueJob by arq job ID and transitions it through in_progress, completed, or failed states alongside the build.

Disable payload signing to prevent botocore from using the STREAMING-UNSIGNED-PAYLOAD-TRAILER signing protocol, which R2 does not support. This forces UNSIGNED-PAYLOAD for PutObject, avoiding the aws-chunked transfer encoding that causes signature verification failures. Also suppress response checksum validation warnings on GetObject responses from R2.

The payload_signing_enabled=False config was ineffective because botocore 1.36+ sets requestChecksumRequired on PutObject, causing aws-chunked transfer encoding with checksum trailers regardless. Register a before-parameter-build event hook to strip ChecksumAlgorithm from PutObject params, preventing the STREAMING-UNSIGNED-PAYLOAD-TRAILER code path that R2 does not support.

The previous before-parameter-build hook fired too early — resolve_request_checksum_algorithm runs after it and re-adds a CRC32 trailing checksum because PutObject has requestChecksumRequired=true in the botocore service model. Switch to a before-call hook that clears context["checksum"]["request_algorithm"] after the checksum context is resolved but before apply_request_checksum wraps the body in AwsChunkedWrapper with STREAMING-UNSIGNED-PAYLOAD-TRAILER.

Botocore dispatches event handlers via handler(**kwargs), so parameter names must match the kwarg names exactly. The _params name (underscore prefix added for ruff ARG004) didn't match the params kwarg, causing a TypeError at runtime. Switch to **kwargs and extract context by name.

httpx is needed for uploading objects to R2 via presigned URLs, bypassing aiobotocore's signing pipeline, which produces STREAMING-UNSIGNED-PAYLOAD-TRAILER that R2 rejects.

aiobotocore's PutObject uses STREAMING-UNSIGNED-PAYLOAD-TRAILER, which Cloudflare R2 does not support. Three prior attempts to suppress this via botocore event hooks all failed. Bypass the signing pipeline entirely: when an httpx client is available, upload_object generates a presigned PUT URL (pure crypto, no network call) and uploads via httpx. The direct put_object path is retained as a fallback. The httpx.AsyncClient is created in the arq worker startup and threaded through Factory → create_objectstore → S3ObjectStore.

The presigned-URL upload in upload_object signed Content-Type as a query parameter, causing 403 SignatureDoesNotMatch on R2 for MIME types other than application/gzip. Remove ContentType from the presigned URL params so it is sent as an unsigned header instead. Also add error-body logging before raise_for_status() to capture R2's XML error response for future debugging.

Tarball members have ./ prefixes (e.g., ./index.html) because the client creates tarballs with arcname=".". When used directly in S3 keys, botocore signs the path with ./ included, but R2 normalizes the path by removing ./ before verifying the signature, causing 403 SignatureDoesNotMatch errors on every upload.
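The fix amounts to normalizing member names before building keys; this sketch uses a hypothetical helper name, object_key, not the actual function from the PR:

```python
import posixpath


def object_key(prefix: str, member_name: str) -> str:
    """Map a tar member name (often './index.html' when the tarball was
    built with arcname='.') to a clean object store key under prefix."""
    # normpath collapses './' and 'a/../b' segments; lstrip guards
    # against a leading '/' producing a double slash in the key.
    clean = posixpath.normpath(member_name).lstrip("/")
    if clean in (".", ""):
        raise ValueError(f"tar member {member_name!r} has no usable path")
    return f"{prefix}/{clean}"
```

With clean keys, what botocore signs and what R2 verifies are byte-identical, so the signature mismatch disappears.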

Set phase="unpacking" with progress messages on the QueueJob so clients polling the job endpoint can see what the worker is doing. Report final object_count and total_size_bytes before marking complete.

Wrap the raw file object in a `with` statement so it is always closed, even if tarfile.open() or tar.add() raises. Previously, an exception would skip raw.close() and leak the handle.

Note that both the tarball download and file extraction hold all data in memory at once. Streaming would reduce peak memory usage for large documentation builds.
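As a sketch of the streaming direction suggested above, tarfile's pipe mode ("r|gz") reads the archive sequentially from a file object, so only one member's data needs to be buffered at a time (each member here is still read whole; chunked reads would be a further refinement):

```python
import tarfile


def iter_tar_members(fileobj):
    """Stream-read a gzipped tarball without materializing the whole
    archive: mode 'r|gz' forbids seeking and reads members in order."""
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if member.isfile():
                extracted = tar.extractfile(member)
                if extracted is not None:
                    yield member.name, extracted.read()
```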

Missing keys now raise a descriptive ValueError instead of a bare KeyError, making object store misconfiguration easier to diagnose.

A build that completes with object_count=0 likely indicates a corrupted or empty upload. Log a warning with the staging_key so operators can detect and investigate these cases.

Replace inline structlog.get_logger() call with a bound logger passed through the constructor, matching the pattern used by all other services and stores. This ensures error logs from presigned upload failures carry request/worker context (org, project, build).
@jonathansick jonathansick merged commit 51d81f5 into main Mar 27, 2026
15 checks passed
@jonathansick jonathansick deleted the tickets/DM-54426 branch March 27, 2026 15:24