TOOLS-2592 Tooling for shipping Triton service images from monitor-reef#21

Open
nshalman wants to merge 73 commits into main from how-to-ship

@nshalman nshalman commented Mar 31, 2026

Portions generated by: Claude Opus 4.5 and 4.6 <noreply@anthropic.com>

Add tritonadm CLI, SAPI/IMGAPI/NAPI/PAPI API conversions, and zone image build infrastructure

Summary

  • New API trait conversions: SAPI, IMGAPI, NAPI, and PAPI, each with a full 5-phase conversion (plan → API trait → client → CLI → validation)
  • tritonadm CLI: New operator administration tool with subcommands for post-setup (grafana, portal, common-external-nics), image management (list, import, import-remote, delete),
    dc-maint status, and dev teardown helpers
  • Zone image build infrastructure: Design doc, Makefile-based build system (images/), and a triton-api service with SMF manifests and SAPI metadata
  • triton-tls crate: Portable TLS cert loading that works on both illumos and other platforms
  • Auto-discovery: tritonadm discovers SAPI/VMAPI URLs from Triton headnode config files
  • Client generator improvements: Error schema patches for all Node.js Triton API clients, new client registrations for SAPI/IMGAPI/NAPI/PAPI
  • Restify conversion skill improvements: Updated guidance based on lessons from SAPI conversion

Test plan

  • make package-build PACKAGE=tritonadm builds successfully
  • make package-test PACKAGE=sapi-cli / imgapi-cli / napi-cli / papi-cli pass
  • make openapi-check confirms generated specs are up-to-date
  • make clients-check confirms generated client code is up-to-date
  • make audit passes (with known pre-existing exceptions)
  • Verify tritonadm post-setup commands work against a Triton headnode
  • Verify tritonadm image import/list/delete against IMGAPI

nshalman and others added 3 commits March 26, 2026 12:44
Outlines the images/ directory approach for building multiple Triton
zone images from a single Rust monorepo, including per-service
Makefiles, SAPI integration, and a jenkins-joylib enhancement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document which services don't ship as images (bugview, jira-stub),
list the reference repos needed to understand the design, and add a
prerequisites checklist for the jenkins-joylib change and SmartOS
testing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduce triton-api, a Dropshot API service that will eventually
replace cloudapi. For now it has a single /ping endpoint. This also
establishes the images/ directory structure for building zone images
from the monorepo.

- apis/triton-api: API trait with /ping endpoint
- services/triton-api-server: service implementation
- images/triton-api: zone image Makefile, SMF manifests, SAPI
  manifests, and boot script
- images/image.defs.mk: shared image build definitions, sets
  ENGBLD_REPO_ROOT for eng Makefile compatibility
- deps/eng: updated to include ENGBLD_REPO_ROOT monorepo support
- .gitignore: add image build artifact patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
setup.sh was committed without the execute bit, which would cause
SMF postboot to fail to start. Also move smf_include.sh source
before the first-boot marker check so $SMF_EXIT_OK is available
for the early exit path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nshalman and others added 24 commits April 6, 2026 16:46
$(shell) swallows exit codes, so git rev-parse and git submodule
update failures would leave ENGBLD_REPO_ROOT empty and eng includes
broken with confusing errors. Add explicit guards with clear messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update all REPO_ROOT references in code blocks to ENGBLD_REPO_ROOT
to match the actual implementation. Renumber open questions (was
1, 3, 4, 5; now 1, 2, 3, 4). Reframe the eng Makefile compatibility
question to reflect that ENGBLD_REPO_ROOT already addresses the root
issue. Remove local filesystem path from TODO.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Files were untracked artifacts, not committed to the branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add status and healthy fields to PingResponse matching VMAPI pattern.
Move types to types/ module for consistency with other API crates.
Add Clone derive and crate-level doc comment. Update server to
return populated response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add triton-api dependency and ManagedApiConfig entry so make
openapi-generate and openapi-check cover the new API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document that the bind address should come from the SAPI-generated
config file once this service is ready for production deployment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The .gitignore patterns *.tar.gz, bits/, proto/, and make_stamps/ were
repo-wide but only needed for image builds. Scope them to images/*/ to
avoid accidentally hiding legitimate files elsewhere.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rust successor to sdcadm for Triton datacenter administration.
All 16 top-level commands and 47 subcommands scaffolded as stubs
returning "not yet implemented". Shell completion works. Design doc
covers architecture, API client strategy, and first target
(post-setup portal).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Internal Triton APIs get the full trait-based pipeline (API trait →
OpenAPI spec → Progenitor client), not hand-written minimal clients.
This builds toward correct specs from day one and means the trait is
ready when we rewrite the Node.js services. jira-client is the sole
exception, as a large external API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 5 API clients needed for post-setup portal also unlock services,
instances, avail, check-config, and check-health as low-hanging fruit.
Reordered priority list to reflect this.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Grafana has a known-working sdcadm implementation to validate against.
Same APIs needed, but we can compare results on a real DC before
applying the pattern to a brand-new service (portal).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three patches applied to sapi-api.json:
- GET /mode: returns plain string, not ModeResponse JSON object
- POST /mode: returns 204 no content, not 200 with JSON body
- POST /loglevel: returns empty 200, not JSON body

Updated client-generator to use patched spec, regenerated client,
and fixed CLI to handle the new response types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Trait changes (canonical type fixes):
- Create endpoints return 200 (matching Node.js Restify default), not 201
- LogLevelResponse.level is serde_json::Value (Bunyan returns integer)
- SetLogLevelBody.level is serde_json::Value (accepts string or integer)
- Add uuid and master fields to all create body types

Patch additions:
- GET /ping 500: documented as known limitation (Node.js returns
  PingResponse on 500, Progenitor can't handle multiple response types)
- Create status code safety net patch (no-op since trait already fixed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove unused ModeResponse type (GET /mode is patched to return string)
- Add StorageType enum for PingResponse.stor_type field
- Change PingResponse.mode from String to SapiMode enum
- Change get_mode trait to return SapiMode (patched to string in spec)
- Change set_mode trait to HttpResponseUpdatedNoContent (native 204)
- Remove dead UpdateAttributesBody re-export from sapi-client
- Simplify post_mode patch to no-op (trait now generates 204 natively)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 1: Add sections for enum identification, Restify response
pattern cataloging, patch requirements, and hidden request fields.

Phase 2: Add guidance on using Phase 1 enums, matching Restify
response patterns to Dropshot types (200 not 201 for creates),
and avoiding dead wrapper types.

Phase 5: Add enum wire-value verification, status code checking,
dead schema detection, and remaining String→enum scan.

Reference: Add Restify response pattern table, Progenitor
limitations section (multiple body types, text/plain, empty bodies).

Orchestrator: Add Step 2b for applying OpenAPI spec patches
between API generation and client generation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add sapi-client and vmapi-client dependencies to tritonadm. Convert
main to async with tokio. Implement `services` (alias `svcs`) and
`instances` (alias `insts`) as the first real commands, replacing their
stubs. `services` output matches sdcadm columns (type, uuid, name, image,
insts). `instances` enriches SAPI data with VM alias, state, and image
from VMAPI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nshalman and others added 28 commits April 6, 2026 20:12
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Checks SAPI service metadata for CLOUDAPI_READONLY and DOCKER_READONLY
to determine maintenance mode. Also reads DC_MAINT_MESSAGE and
DC_MAINT_ETA from the sdc application metadata. Matches sdcadm output
format. Supports --json for machine-readable output.
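
As a rough illustration of the check described above, the sketch below folds the two read-only flags into a single status. The struct, its method names, and the combination rule (a missing key counts as false; either flag set means maintenance) are assumptions for illustration, not the tritonadm code.

```rust
/// Hypothetical view of the two flags read from SAPI service metadata.
#[derive(Debug, PartialEq)]
struct MaintStatus {
    cloudapi_readonly: bool,
    docker_readonly: bool,
}

impl MaintStatus {
    /// Assumption: a metadata key that is absent is treated as `false`.
    fn from_metadata(cloudapi_readonly: Option<bool>, docker_readonly: Option<bool>) -> Self {
        MaintStatus {
            cloudapi_readonly: cloudapi_readonly.unwrap_or(false),
            docker_readonly: docker_readonly.unwrap_or(false),
        }
    }

    /// Assumption: the DC counts as "in maintenance" if either flag is set.
    fn in_maintenance(&self) -> bool {
        self.cloudapi_readonly || self.docker_readonly
    }
}
```

Calling `MaintStatus::from_metadata(Some(true), None).in_maintenance()` yields `true` under these assumptions.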

Also refactors main.rs to resolve API URLs eagerly before the match,
avoiding borrow-checker issues when match arms destructure cli.command.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
First real write command in tritonadm. Creates the grafana SAPI service,
provisions a first instance on the headnode, and optionally adds a manta
NIC. Handles re-runs (reprovision if image changed). Supports --yes,
--dry-run, --server, and --image flags.

Uses all five API clients (SAPI, IMGAPI, VMAPI, PAPI, NAPI). Image
lookup is local IMGAPI only for now (no updates server download).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The --image flag now defaults to "latest", which queries
updates.tritondatacenter.com for the newest grafana image and imports
it into local IMGAPI if not already present. Use --image current for
local-only lookup (the previous behavior). Adds --channel/-C for channel
selection and --updates-url for overriding the updates server.

Channel resolution: --channel flag > SAPI update_channel metadata >
remote default. Import uses IMGAPI's import-remote action with polling
until the image reaches active state.
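
The precedence order above maps naturally onto `Option::or`. The following is a minimal sketch; the function name and parameter names are hypothetical, and `None` is assumed to mean "send no channel and let the updates server apply its default".

```rust
/// Pick the update channel by precedence: CLI flag first, then the
/// SAPI `update_channel` metadata value. Returning `None` leaves the
/// choice to the remote default (an assumption of this sketch).
fn resolve_channel(cli_flag: Option<&str>, sapi_update_channel: Option<&str>) -> Option<String> {
    cli_flag
        .or(sapi_update_channel)
        .map(str::to_owned)
}
```

For example, `resolve_channel(None, Some("release"))` returns `Some("release".to_string())`, while passing a flag always wins over the metadata value.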

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The real Node.js IMGAPI expects source and skip_owner_check as query
parameters on POST /images/:uuid?action=import-remote&source=...
Our TypedClient was putting them in the request body, causing a 404.

Use a direct reqwest POST with .query() to match the wire format the
IMGAPI server actually expects. Also removes the unused TypedClient
for local IMGAPI.
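
To make the wire format concrete, here is a standard-library-only sketch that builds the request path with `source` and `skip_owner_check` in the query string. Both helper names are hypothetical, and the hand-rolled percent-encoder stands in for what reqwest's `.query()` does in the real code.

```rust
/// Percent-encode everything except RFC 3986 unreserved characters.
fn percent_encode(s: &str) -> String {
    let mut out = String::new();
    for b in s.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Build the path and query for the import-remote action.
/// All three inputs travel in the query string; the body stays empty.
fn import_remote_path(uuid: &str, source: &str, skip_owner_check: bool) -> String {
    format!(
        "/images/{}?action=import-remote&source={}&skip_owner_check={}",
        uuid,
        percent_encode(source),
        skip_owner_check
    )
}
```

With a placeholder source of `https://updates.example.com`, the `source` parameter is sent as `https%3A%2F%2Fupdates.example.com`.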

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds external NICs to imgapi and adminui zones, matching sdcadm's
command. Required before IMGAPI can reach the updates server to import
images. Refactors NIC addition into reusable add_nic_if_missing and
get_service_instances helpers, shared with the manta NIC logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The external NIC must be primary so the zone's default gateway routes
through the external network, allowing IMGAPI to reach the internet
(e.g., updates.tritondatacenter.com). Without primary=true the admin
network remains the default route and external DNS resolution fails.

Changed AddNicsRequest.networks from Vec<Uuid> to Vec<serde_json::Value>
to support VMAPI's object form: {"uuid": "...", "primary": true}.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New `tritonadm dev` subcommand group with:
- `remove-external-nics`: undo common-external-nics (remove external
  NICs from imgapi/adminui)
- `remove-grafana`: undo post-setup grafana (destroy VM, delete SAPI
  instance and service)

These are development helpers not present in sdcadm, for iterating on
post-setup commands without manual cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The real Node.js IMGAPI returns {"code": ..., "message": ...} errors
without a request_id field. Our Progenitor-generated Error type required
request_id, causing deserialization failures on 404s during import
polling.

Fix: add IMGAPI error schema patch (same approach as CloudAPI) making
request_id optional and using "code" instead of "error_code". Point
imgapi-client at the patched spec. Also fix wait_for_image_active to
tolerate 404s during the async import workflow and add a 4-minute
timeout.
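
The polling behavior described above can be sketched as a loop that treats "not found" as "not yet" until a deadline passes. The `Probe` and `WaitError` types, the function name, and the closure-based probe are all hypothetical; the real `wait_for_image_active` calls IMGAPI, and would pass a 4-minute timeout.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical result of one status check against IMGAPI.
#[derive(Debug, PartialEq)]
enum Probe {
    NotFound,       // 404: the async import has not created the image yet
    State(String),  // image exists; carries its current state
    Failed(String), // any other error: give up immediately
}

#[derive(Debug, PartialEq)]
enum WaitError {
    Timeout,
    Fatal(String),
}

/// Poll `probe` until it reports the "active" state, tolerating 404s.
fn wait_for_active<F: FnMut() -> Probe>(
    mut probe: F,
    timeout: Duration,
    interval: Duration,
) -> Result<(), WaitError> {
    let deadline = Instant::now() + timeout;
    loop {
        match probe() {
            Probe::State(s) if s == "active" => return Ok(()),
            // Not there yet, or present but not active: keep polling.
            Probe::NotFound | Probe::State(_) => {}
            Probe::Failed(msg) => return Err(WaitError::Fatal(msg)),
        }
        if Instant::now() >= deadline {
            return Err(WaitError::Timeout);
        }
        sleep(interval);
    }
}
```

Passing the probe as a closure keeps the loop testable without a live IMGAPI. A caller matching the description above would use `Duration::from_secs(240)` as the timeout.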

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All Node.js Triton services return errors as {"code": ..., "message": ...}
without request_id. Dropshot generates an Error schema with required
request_id, causing deserialization failures when clients receive error
responses from real services.

Extracted a shared patch_node_triton_error_schema function and applied
it to all six Triton API clients: CloudAPI, SAPI, IMGAPI, NAPI, PAPI,
VMAPI. Only bugview-client (our Dropshot service) and jira-client
(external API) retain the Dropshot error format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move all 33 IMGAPI CLI commands into tritonadm as `tritonadm image <cmd>`.
This consolidates image management into the admin tool rather than
shipping a separate imgapi-cli binary. Uses tritonadm's existing
--imgapi-url global flag for URL resolution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All IMGAPI CLI functionality is now available via `tritonadm image`.
Remove the standalone imgapi-cli crate from the workspace.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract ServiceConfig struct and cmd_add_service from the grafana
implementation. Both grafana and portal now call the same function
with different configs. Portal defaults to --image current since
images are locally built, not on the updates server.

Portal config: name=portal, image=user-portal, package=sdc_1024,
no delegated dataset, firewall enabled, no manta NIC.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single command to import a manifest + file, matching sdc-imgadm's
import -m -f workflow. Reads the manifest JSON, imports it to IMGAPI,
uploads the image file, and activates the image.

Usage: tritonadm image import -m <manifest> -f <file> [-c gzip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add 'list' as alias for 'list-images'. Update default table format to
match sdc-imgadm: UUID, NAME, VERSION, FLAGS, OS, PUBLISHED columns.
Flags: I=unactivated, D=disabled, P=private (non-public), X=other.
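
A minimal sketch of how the FLAGS column could be derived follows. The `ImageSummary` struct and its field names are assumptions for illustration, and the `X` flag is left out because the text above does not say what "other" covers.

```rust
/// Hypothetical subset of an image manifest; field names are assumptions.
struct ImageSummary {
    state: String, // e.g. "active" or "unactivated"
    disabled: bool,
    public: bool,
}

/// Build the FLAGS column: I=unactivated, D=disabled, P=private.
fn image_flags(img: &ImageSummary) -> String {
    let mut flags = String::new();
    if img.state == "unactivated" {
        flags.push('I');
    }
    if img.disabled {
        flags.push('D');
    }
    if !img.public {
        flags.push('P');
    }
    flags
}
```

An active, enabled, public image therefore shows an empty FLAGS cell, and each flag is appended independently of the others.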

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The real Node.js IMGAPI reads the action from the query string, not
the request body. The TypedClient was wrapping the action in an
ActionBody struct and sending it in the body, causing 404s.

Fix image_action_json to pass the action via the Progenitor builder's
.action() method (which sets the query parameter). Remove the
ActionBody wrapper from all image action methods. This fixes import,
activate, and all other image actions.

Also reverts the direct-HTTP workaround in post_setup.rs import-remote
since the TypedClient now works correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ImportImageRequest struct drops fields like origin, tags, and
requirements that the real IMGAPI expects. Send the raw JSON value
as the body instead of parsing into a typed struct first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The image import command now checks if the manifest's origin image
exists locally, and if not, imports it from the updates server before
importing the manifest. This matches sdc-imgadm's behavior of
resolving origin chains automatically.

Also moved DEFAULT_UPDATES_URL to a single constant in main.rs,
used by both post_setup.rs and image.rs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The real IMGAPI expects source and skip_owner_check as query parameters
for the import-remote action, not body fields. Added these to
ImageActionQuery in the API trait so the Progenitor builder exposes
.source() and .skip_owner_check() methods. Updated the TypedClient's
import_remote_image to use them directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
IMGAPI requires a compression parameter when uploading image files.
Read it from the manifest's files[0].compression as a fallback when
--compression is not explicitly passed on the command line.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add remove-portal shortcut and a generic remove-service that takes a
service name. All three (remove-grafana, remove-portal, remove-service)
use the same shared cmd_remove_service function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Portal needs USER_PORTAL_JWT_SECRET, USER_PORTAL_KEY_ID, and
USER_PORTAL_DATACENTERS in SAPI service metadata so config-agent can
render the config template. Generate JWT secret from /dev/urandom,
read SSH key fingerprint from headnode, build datacenter list from
SDC config.

Also refactors cmd_add_service args into SetupOpts struct.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Read /root/.ssh/sdc.id_rsa during post-setup portal and store it as
USER_PORTAL_SDC_KEY in the SAPI service metadata. Config-agent renders
this into /opt/smartdc/portal/etc/sdc_key via the sdc-key manifest,
which the portal uses to sign requests to CloudAPI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Keys provisioned via config-agent templates or manual copy-paste can
acquire leading/trailing artifacts (BOM, blank lines, shell prompts)
that cause strict RFC 7468 parsers to reject otherwise valid PEM data.

Add normalize_pem() which extracts the -----BEGIN to -----END block,
and call it in both LegacyPrivateKey::from_pem() and
KeyLoader::load_from_file() before the data reaches strict parsers.
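
The extraction step can be sketched with plain string searches. This is an illustrative version under stated assumptions, not the committed `normalize_pem()`: the `Option<String>` return type is a guess, it keeps only the first PEM block, and it does not rewrite CRLF line endings inside the block.

```rust
/// Return the first `-----BEGIN ...` through `-----END ...-----` block,
/// dropping anything before or after it (BOM, blank lines, prompts).
/// Returns `None` when no complete block is present.
fn normalize_pem(input: &str) -> Option<String> {
    const BEGIN: &str = "-----BEGIN";
    const END: &str = "-----END";
    const DASHES: &str = "-----";

    let start = input.find(BEGIN)?;
    let rest = &input[start..];

    // Locate the END marker, then the dashes that close the END line.
    let end_marker = rest.find(END)?;
    let after_end = &rest[end_marker + END.len()..];
    let close = after_end.find(DASHES)?;
    let end = end_marker + END.len() + close + DASHES.len();

    // Emit exactly one trailing newline after the block.
    Some(format!("{}\n", &rest[..end]))
}
```

Running this on input that has a leading BOM and a trailing shell prompt returns just the PEM block, which can then be handed to a strict parser.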

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Port sdcadm's post-setup cloudapi into tritonadm using the existing
cmd_add_service infrastructure. The cloudapi service is created during
headnode setup, so this command just creates a first instance (or
reprovisions an existing one with a newer image).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes CloudAPI instances without deleting the SAPI service
definition, since CloudAPI is a core service created during headnode
setup. This allows re-running post-setup cloudapi for development
iteration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@nshalman nshalman marked this pull request as ready for review April 7, 2026 18:50
@nshalman nshalman requested a review from a team April 7, 2026 18:50