Merged
Implement per-organization encrypted credential storage for connecting
to external object stores (S3, R2, GCS). Credentials are encrypted at
rest using Fernet with support for key rotation via a retired key.
- CredentialEncryptor wrapping MultiFernet (encrypt/decrypt/rotate)
- SqlOrganizationCredential table with (org_id, label) unique constraint
- publishing_credential_label and staging_credential_label on organizations
- OrganizationCredentialStore, CredentialService, domain model
- Client models for credential API
- Admin API: POST/GET/DELETE /orgs/{org}/credentials
- S3ObjectStore (aiobotocore) and objectstore factory
- Factory methods for credential service and objectstore resolution
- Alembic migration for new table and organization columns
- Tests for encryptor round-trip/rotation, credential store CRUD,
and credential handler endpoints
Extend the ObjectStore protocol with download_object for the worker to fetch staging tarballs. Add an in-memory MockObjectStore for use in unit tests.
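A minimal dict-backed mock like the one this commit describes might look like the following sketch. The `download_object`, `upload_object`, and `__aenter__`/`__aexit__` names come from the PR; the exact signatures are assumptions:

```python
class MockObjectStore:
    """Illustrative in-memory object store backed by a plain dict."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    async def __aenter__(self) -> "MockObjectStore":
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        return None

    async def upload_object(self, key: str, body: bytes) -> None:
        self._objects[key] = body

    async def download_object(self, key: str) -> bytes:
        try:
            return self._objects[key]
        except KeyError as e:
            raise ValueError(f"No object stored under key {key!r}") from e
```

Because the protocol is structural, tests can pass this anywhere an `ObjectStore` is expected without touching real S3 credentials.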
Replace the placeholder upload_url in post_build with a real presigned URL generated from the org's configured object store credential. Orgs without a credential configured return upload_url: null.
Implement the build_processing arq task that downloads a staged tarball,
unpacks it, and uploads extracted files to the object store under the
__builds/{build_id}/ prefix with bounded concurrency. On success it
updates inventory counts and transitions the build to completed; on
failure it transitions to failed.
- build_processing worker function with tar.gz streaming unpack
- Worker startup initializes CredentialEncryptor from config
- NullQueueBackend for worker contexts that don't enqueue
- get_by_id methods on BuildStore and OrganizationStore
Split Factory into an ABC base with a template method for queue backend creation, letting handlers use ArqQueueBackend and workers use NullQueueBackend without manual service wiring.
Separate object storage modules from database store modules by grouping them into a dedicated subpackage with public re-exports via __init__.py.
Implement the `docverse upload` command for uploading documentation builds, backed by DocverseClient (async httpx). Includes tarball creation, presigned URL upload, and job polling with exponential backoff. The default API URL is https://roundtable.lsst.cloud/docverse/api (env var: DOCVERSE_API). CLI help shows env vars via show_envvar=True. Add client_test to default nox sessions.
Return ObjectStore protocol type instead of S3ObjectStore from create_objectstore factory, enabling future non-S3 implementations. Add __aenter__/__aexit__ to the ObjectStore protocol and MockObjectStore so callers can use the protocol type as an async context manager.
Best-effort cleanup
- Use `serialize_base32_id(build.public_id)` for the build prefix instead of the internal integer PK, correlating S3 paths with API identifiers.
- Type `_process_build`'s `object_store` parameter as `ObjectStore` instead of `Any`.
- Wrap staging tarball deletion in try/except so a deletion failure doesn't fail an otherwise-successful build.
Add get_by_label() to CredentialService for fetching credentials by org slug and label. Rewrite the get_credential handler to delegate to the service instead of directly accessing the store, matching the pattern used by the other credential endpoints.
Add docstring clarifying that 'uploaded' is never persisted to the database — the server transitions directly from pending to processing.
Remove the unused public create_queue_backend method from HandlerFactory — the private _create_queue_backend is sufficient. Also update create_objectstore_for_org to return ObjectStore protocol type instead of S3ObjectStore.
Remove the redundant _current attribute. MultiFernet.encrypt() already delegates to the first key, so self._fernet.encrypt() is equivalent.
jonathansick added a commit to lsst-sqre/phalanx that referenced this pull request (Mar 19, 2026).

The client's PEP 695 `type Base32Id = ...` statement creates a TypeAliasType that Pydantic 2.5 cannot unwrap, breaking the client_test_oldest nox session. Rather than bumping the minimum Pydantic version, move Base32Id to the server domain package (using a TypeAlias assignment for compatibility) and simplify client models to use plain `str` for IDs. Clients only ever receive and return serialized base32 strings; validation and integer storage are server-side concerns.
- Move the full base32id implementation into src/docverse/domain/base32id.py
- Change client Build.id and QueueJob.id from Base32Id to str
- Remove base32-lib from client dependencies
- Pass pre-serialized ID strings in handler from_domain methods
- Update client test assertions to compare strings
Replace the single organization_credentials table with a three-layer model: credentials (provider-level encrypted secrets), services (non-secret config referencing a credential), and organization slot assignments (publishing_store, staging_store, cdn_service, dns_service). Credentials are now typed with a CredentialProvider enum and use discriminated union Pydantic models. Services have typed configs per ServiceProvider with provider-credential compatibility validation. Organization GET responses embed lightweight service summaries via HATEOAS links; POST/PATCH use label references. Includes Alembic migration, full CRUD endpoints and service layer for organization services, updated objectstore factory for two-step resolution (service label -> config + credential -> decrypt -> build), and tests for all new storage and handler code.
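The provider-credential compatibility validation could be sketched as a simple lookup table. The enum members and the mapping below are illustrative assumptions (the real mapping lives in the PR's service-layer validation), though the later `s3`-credentials-for-R2 fix suggests R2 accepts S3-style keys:

```python
from enum import Enum


class CredentialProvider(str, Enum):
    S3 = "s3"
    CLOUDFLARE = "cloudflare"
    GOOGLE = "google"


class ServiceProvider(str, Enum):
    S3 = "s3"
    R2 = "r2"
    GCS = "gcs"


# Hypothetical table: which credential types each service provider accepts.
COMPATIBLE: dict[ServiceProvider, set[CredentialProvider]] = {
    ServiceProvider.S3: {CredentialProvider.S3},
    ServiceProvider.R2: {CredentialProvider.S3, CredentialProvider.CLOUDFLARE},
    ServiceProvider.GCS: {CredentialProvider.GOOGLE},
}


def check_compatible(service: ServiceProvider, credential: CredentialProvider) -> None:
    """Reject a service config that references an incompatible credential."""
    if credential not in COMPATIBLE[service]:
        raise ValueError(
            f"Credential provider {credential.value!r} is not compatible "
            f"with service provider {service.value!r}"
        )
```

Centralizing the table means the POST/PATCH endpoints and the objectstore factory can share one source of truth for compatibility.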
The Claude Code sandbox requires TC_HOST=localhost and TESTCONTAINERS_RYUK_DISABLED=true to run testcontainers. Bake these into the documented server test command.
Force-pushed from 304d810 to ed55cc5.
Configure Alembic logging in alembic.ini so migrations emit INFO-level output, and add structured log messages to the update_db_schema command that report current and target revisions before and after migration.
Force-pushed from 98796b2 to 113ab81.
Remove `docverse init` from start-service.sh so pods no longer stamp the database on every startup. This prevents schema drift when new code starts before the Argo CD migration job runs. Add a guard to the `init` CLI command that refuses to run on an already-initialized database (unless --reset is passed), ensuring `docverse update-db-schema` is the sole path for production schema changes.
The init_database function calls Base.metadata.create_all(), which attempts to create GIN indexes using gin_trgm_ops. This fails on databases (e.g. CloudSQL) where the pg_trgm extension is not pre-installed, since the extension was only created in the Alembic migration path, not in the init path.
Extract get_current_revision() into database.py as a public function and use it to log the app version and current Alembic revision in the FastAPI lifespan, arq worker startup, and the init/update-db-schema CLI commands. This makes it easier to diagnose deployment failures caused by schema mismatches.
Add a `members` field to `OrganizationCreate` so that the admin org creation endpoint can atomically seed memberships. This solves the bootstrap problem where the membership API requires an existing admin but a new org has no members. Duplicate (principal_type, principal) entries are silently deduplicated (first wins). Simplify the test helper `seed_org_with_admin` to use the new API field instead of direct DB access.
Users with the `admin:docverse` Gafaelfawr scope are now treated as org admins everywhere, without needing explicit membership. Add `get_scopes()` to the `UserInfoStore` protocol, short-circuit `AuthorizationService.resolve_role()` when the super admin scope is present, and wire scopes through the `OrgRoleDependency`.
The scope-based super admin detection via SUPERADMIN_SCOPE was architecturally broken: Gafaelfawr's delegated internal token only carries the fixed ingress scopes (exec:docverse), not the user's original scopes like admin:docverse. Replace with a config-based list of super admin usernames (DOCVERSE_SUPERADMIN_USERNAMES) checked against the authenticated X-Auth-Request-User header. Remove get_scopes() from the UserInfoStore protocol and delete the constants module.
Implement the UserInfoStore protocol using GafaelfawrClient from rubin-gafaelfawr to resolve user group memberships via the Gafaelfawr API. Wire it into the application lifespan using safir's shared httpx client. Tests override the store back to StubUserInfoStore to avoid requiring a live Gafaelfawr.
pydantic-settings v2 tries to JSON-parse complex types from env vars. Add a BeforeValidator that splits comma-separated strings into lists so the field works with both JSON arrays and plain comma-separated values as documented.
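The coercion logic might look like this stdlib-only sketch. The function name is made up; in the real code it would be wrapped in a pydantic `BeforeValidator` on the `list[str]` settings field:

```python
import json


def split_str_list(value: object) -> object:
    """Accept either a JSON array string or a comma-separated string.

    pydantic-settings v2 hands env var values to validators as strings;
    non-string values (already-parsed lists, etc.) pass through untouched
    so normal validation still applies.
    """
    if isinstance(value, str):
        stripped = value.strip()
        if stripped.startswith("["):
            return json.loads(stripped)  # e.g. '["alice","bob"]'
        return [item.strip() for item in stripped.split(",") if item.strip()]
    return value
```

This keeps both documented forms working: `DOCVERSE_SUPERADMIN_USERNAMES='["alice"]'` and `DOCVERSE_SUPERADMIN_USERNAMES=alice,bob`.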
Test the full HTTP path through OrgRoleDependency and UserInfoStore.get_groups() for group-type memberships, which was previously only covered at the storage layer.
The server and client packages both declared a "docverse" script, causing the server's entrypoint to shadow the client's in the uv workspace. Rename the server CLI to "docverse-admin" so the unqualified "docverse" command resolves to the user-facing client CLI (docverse upload).
Use org/project slugs and base32 build IDs instead of internal integer IDs in structured log keys for better operator traceability.
Cloudflare R2 and MinIO use S3-compatible access keys, not provider-specific API tokens. The new `s3` credential type (access_key_id + secret_access_key) fixes the KeyError that occurred when the objectstore factory tried to read access keys from a `cloudflare` credential payload that only has `api_token`.
Adds a -v/--verbose option that prints request and response details to stderr, with bearer tokens masked for safe display.
Enable updating credential_label on an existing service without deleting and recreating it. The endpoint validates that the new credential exists and is provider-compatible before applying the change.
The enqueue method was spreading the payload dict as **kwargs, but the worker function signature expects a single `payload` kwarg. This caused a TypeError when the worker tried to call the function.
The web app was enqueuing jobs to arq's default "arq:queue" while the worker listens on "docverse:queue" (config.arq_queue_name). Thread the configured queue name from config through ContextDependency, HandlerFactory, and ArqQueueBackend so jobs are enqueued to the correct queue.
Restructure build_processing() into separate transaction phases so the error-handling transaction runs after the main transaction's context manager has fully exited. Previously, rolling back and starting a new transaction inside an active context manager caused SQLAlchemy's InvalidRequestError. Remove explicit commit()/rollback() calls in favor of letting async with session.begin() manage transaction lifecycle.
botocore 1.36+ defaults to adding CRC32 checksums with aws-chunked transfer encoding for PutObject, which Cloudflare R2 doesn't support, causing SignatureDoesNotMatch errors on uploads.
Allows looking up a QueueJob by its arq backend job ID, which is needed for the worker to find and update the corresponding QueueJob record.
The worker now looks up the QueueJob by arq job ID and transitions it through in_progress, completed, or failed states alongside the build.
Disable payload signing to prevent botocore from using the STREAMING-UNSIGNED-PAYLOAD-TRAILER signing protocol, which R2 does not support. This forces UNSIGNED-PAYLOAD for PutObject, avoiding the aws-chunked transfer encoding that causes signature verification failures. Also suppress response checksum validation warnings on GetObject responses from R2.
The payload_signing_enabled=False config was ineffective because botocore 1.36+ sets requestChecksumRequired on PutObject, causing aws-chunked transfer encoding with checksum trailers regardless. Register a before-parameter-build event hook to strip ChecksumAlgorithm from PutObject params, preventing the STREAMING-UNSIGNED-PAYLOAD-TRAILER code path that R2 does not support.
The previous before-parameter-build hook fired too early — resolve_request_checksum_algorithm runs after it and re-adds a CRC32 trailing checksum because PutObject has requestChecksumRequired=true in the botocore service model. Switch to a before-call hook that clears context["checksum"]["request_algorithm"] after the checksum context is resolved but before apply_request_checksum wraps the body in AwsChunkedWrapper with STREAMING-UNSIGNED-PAYLOAD-TRAILER.
Botocore dispatches event handlers via handler(**kwargs), so parameter names must match the kwarg names exactly. The _params name (underscore prefix added for ruff ARG004) didn't match the params kwarg, causing a TypeError at runtime. Switch to **kwargs and extract context by name.
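The dispatch mismatch is easy to reproduce without botocore. A stand-in dispatcher that calls `handler(**kwargs)` shows why the underscore-prefixed parameter name fails while a `**kwargs` handler works:

```python
def dispatch(handler, **kwargs):
    """Mimics botocore-style event dispatch: handler is called with
    keyword arguments only, so parameter names must match exactly."""
    return handler(**kwargs)


def broken_hook(_params, context, **kwargs):
    # Renamed to _params to satisfy a linter (ruff ARG004); it no longer
    # matches the 'params' kwarg, so dispatch raises TypeError.
    return context


def working_hook(**kwargs):
    # Accept everything and extract what we need by name.
    return kwargs["context"]
```

The fix in this commit is exactly the `working_hook` shape: take `**kwargs` and pull `context` out by key.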
httpx is needed for uploading objects to R2 via presigned URLs, bypassing aiobotocore's signing pipeline which produces STREAMING-UNSIGNED-PAYLOAD-TRAILER that R2 rejects.
aiobotocore's PutObject uses STREAMING-UNSIGNED-PAYLOAD-TRAILER which Cloudflare R2 does not support. Three prior attempts to suppress this via botocore event hooks all failed. Bypass the signing pipeline entirely: when an httpx client is available, upload_object generates a presigned PUT URL (pure crypto, no network call) and uploads via httpx. The direct put_object path is retained as a fallback. The httpx.AsyncClient is created in the arq worker startup and threaded through Factory → create_objectstore → S3ObjectStore.
The presigned-URL upload in upload_object signed Content-Type as a query parameter, causing 403 SignatureDoesNotMatch on R2 for MIME types other than application/gzip. Remove ContentType from the presigned URL params so it is sent as an unsigned header instead. Also add error-body logging before raise_for_status() to capture R2's XML error response for future debugging.
Tarball members have ./ prefixes (e.g., ./index.html) because the client creates tarballs with arcname=".". When used directly in S3 keys, botocore signs the path with ./ included, but R2 normalizes the path by removing ./ before verifying the signature, causing 403 SignatureDoesNotMatch errors on every upload.
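A sketch of the key normalization fix, assuming a small helper that maps tar member names to object keys (the function name is hypothetical):

```python
def object_key_for_member(build_prefix: str, member_name: str) -> str:
    """Turn a tar member name like './index.html' into a clean S3 key.

    Stripping the leading './' (produced by the client's arcname='.')
    means the path botocore signs and the path R2 verifies agree.
    """
    name = member_name
    while name.startswith("./"):
        name = name[2:]
    return f"{build_prefix}/{name}"
```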
Set phase="unpacking" with progress messages on the QueueJob so clients polling the job endpoint can see what the worker is doing. Report final object_count and total_size_bytes before marking complete.
Wrap the raw file object in a `with` statement so it is always closed, even if tarfile.open() or tar.add() raises. Previously, an exception would skip raw.close() and leak the handle.
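The pattern described might look like this sketch of a `create_tarball` helper (the hashing here is a separate read pass for simplicity; the summary describes the real utility hashing in a single pass):

```python
import hashlib
import tarfile
from pathlib import Path


def create_tarball(src: Path, dest: Path) -> str:
    """Create dest as a .tar.gz of src and return its SHA-256 hex digest."""
    # The raw handle is managed by `with`, so it is closed even if
    # tarfile.open() or tar.add() raises partway through.
    with dest.open("wb") as raw:
        with tarfile.open(fileobj=raw, mode="w:gz") as tar:
            tar.add(src, arcname=".")
    digest = hashlib.sha256()
    with dest.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Note `arcname="."`: this is what gives members the `./` prefix the server later strips when building object keys.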
Note that both the tarball download and file extraction hold all data in memory at once. Streaming would reduce peak memory usage for large documentation builds.
Missing keys now raise a descriptive ValueError instead of a bare KeyError, making object store misconfiguration easier to diagnose.
A build that completes with object_count=0 likely indicates a corrupted or empty upload. Log a warning with the staging_key so operators can detect and investigate these cases.
Replace inline structlog.get_logger() call with a bound logger passed through the constructor, matching the pattern used by all other services and stores. This ensures error logs from presigned upload failures carry request/worker context (org, project, build).
Summary
Implements minimal end-to-end support for build uploads on both the client and server side, following the tarball-to-object-store upload model described in SQR-112.
Server-side
- `CredentialEncryptor` with key rotation support via `MultiFernet`. Organization credentials are encrypted at rest and stored as BYTEA in Postgres.
- `ObjectStore` protocol with async context manager support, an S3-compatible implementation (`S3ObjectStore` via aiobotocore), a mock implementation for testing, and a factory that routes by service type (`s3`, `r2`, `minio`).
- The build creation endpoint (`POST`) returns a presigned upload URL pointing to the staging location in the org's object store.
- The build-processing worker uploads unpacked files to `__builds/{public_id}/` with bounded concurrency, records inventory, and transitions the build to `completed`. Staging tarball cleanup is best-effort.
- `HandlerFactory` (with `ArqQueueBackend`) for HTTP handlers and `WorkerFactory` (with `NullQueueBackend`) for arq workers, preventing accidental recursive job enqueueing.
Client-side
- `DocverseClient`: async httpx client implementing the full upload workflow: `create_build()`, `upload_tarball()`, `complete_upload()`, `get_queue_job()`, and `wait_for_job()` with exponential backoff.
- `docverse upload` CLI: Click command wrapping the client for use in CI pipelines and shell scripts, with env var support and git ref auto-detection.
- `create_tarball()` utility that creates a `.tar.gz` and computes the SHA-256 content hash in a single pass.
Review fixes
- The `ObjectStore` protocol includes `__aenter__`/`__aexit__` for async context manager usage.
- Factory methods return the `ObjectStore` protocol type instead of `S3ObjectStore`.
- The `get_credential` handler delegates to `CredentialService.get_by_label()` instead of directly accessing the store.
- `CredentialEncryptor.encrypt()` uses `MultiFernet.encrypt()` instead of the redundant `_current` attribute.
- Removed the unused `create_queue_backend` method from `HandlerFactory`.
- Documented `BuildStatus.uploaded` as a signal-only value.
Test plan
- `uv run --only-group=lint pre-commit run --all-files` passes
- `uv run --only-group=nox nox -s typing` passes (0 errors)
- `uv run --only-group=nox nox -s client_test` passes (15/15)
- `uv run --only-group=nox nox -s test` passes