Update with main by RyaliNvidia · Pull Request #694 · NVIDIA/OSMO

RyaliNvidia · 2026-03-11T22:36:06Z

Issue #None

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

* Remove OTEL collector sidecar, expose Prometheus metrics directly
  The OTEL collector sidecar's only role was to receive OTLP metrics from the app and re-expose them as a Prometheus scrape endpoint. By switching to opentelemetry-exporter-prometheus, each service now exposes metrics directly on port 9464, eliminating the need for the sidecar entirely.
  - Replace OTLPMetricExporter with PrometheusMetricReader in metrics.py; update MetricsCreatorConfig to use metrics_prometheus_port instead of the OTLP collector host/port/interval fields
  - Remove otel-collector sidecar container, ConfigMap, and volume from all service and backend-operator deployments
  - Add a named 'metrics' port (9464) to each app container so PodMonitors can scrape it directly
  - Make PodMonitors always-on (no longer gated on otel.enabled); add an envoy-admin endpoint to the service PodMonitor so Envoy's /stats/prometheus is still scraped after the sidecar is gone
  - Remove sidecars.otel / sidecars.OTEL values blocks from all charts
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix PrometheusMetricReader usage: start HTTP server explicitly
  PrometheusMetricReader has no 'port' argument; use prometheus_client.start_http_server() to bind the scrape endpoint. Also add prometheus-client as an explicit BUILD dep.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Expose Prometheus metrics on 0.0.0.0 via explicit start_server()
  Move prometheus_client.start_http_server out of MetricCreator.__init__ into a start_server() method, binding to 0.0.0.0 so Prometheus can scrape from outside the container. Each service now calls start_server() explicitly after construction.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Clean up MetricCreator call sites to not go through get_meter_instance()
  Call start_server() directly on the constructor result instead of chaining through the static get_meter_instance() accessor.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Add prometheus_client to mypy ignore_missing_imports
  prometheus_client lacks type stubs visible to the Bazel mypy integration, so suppress the import-not-found error the same way other untyped packages are handled.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Move start_server() out of configure_app() into main()
  configure_app() is called by tests, which must not bind a Prometheus HTTP server. Move start_server() to main() so it only runs in production.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Add podMonitor.enabled toggle to service and backend-operator charts
  Users without the Prometheus Operator CRD (monitoring.coreos.com/v1) can set podMonitor.enabled=false to skip PodMonitor creation. Defaults to true.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Move start_server() to main() in backend_worker; disable podMonitor in quick-start
  - Move BackendWorker.start_server() call from __init__ to main() for consistency with other services
  - Disable PodMonitor in quick-start chart since the CRD is not installed in a basic quick-start deployment
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Update chart READMEs: replace OTEL sidecar docs with podMonitor docs
  Remove references to sidecars.otel/OTEL.enabled and associated OTEL collector parameters; add podMonitor.enabled documentation reflecting the switch to Prometheus pull-based metrics.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
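The split between constructing MetricCreator and calling start_server() described in the commits above can be sketched roughly as follows. This is a minimal illustration, not the code in metrics.py: the class body, the `server_started` flag, and the injectable `http_server_starter` parameter are assumptions added so the sketch runs without binding a real port. The only library call relied on is `prometheus_client.start_http_server(port, addr=...)`.

```python
from typing import Callable, Optional


class MetricCreator:
    """Hypothetical stand-in for the PR's MetricCreator; only the start-up split is shown."""

    def __init__(self, metrics_prometheus_port: int = 9464) -> None:
        # Construction only records configuration, so tests that build the
        # object (for example through configure_app()) never bind a socket.
        self.metrics_prometheus_port = metrics_prometheus_port
        self.server_started = False  # assumption: bookkeeping added for this sketch

    def start_server(
        self, http_server_starter: Optional[Callable[..., object]] = None
    ) -> None:
        """Bind the scrape endpoint on all interfaces; intended to be called from main()."""
        if self.server_started:
            return
        if http_server_starter is None:
            # Imported lazily so the sketch loads even without prometheus-client.
            import prometheus_client

            http_server_starter = prometheus_client.start_http_server
        # 0.0.0.0 lets Prometheus scrape from outside the container.
        http_server_starter(self.metrics_prometheus_port, addr="0.0.0.0")
        self.server_started = True
```

Under these assumptions, a service's main() would construct the object and then call `start_server()`, while tests would construct it without ever starting the server.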

Prevents port conflicts when running all services locally via bazel. Core: 9464, worker: 9465, delayed_job_monitor: 9466, backend_listener: 9467, backend_worker: 9468. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…Skill (#631) * Update Skill to Autoscale Workflow Submissions based on workflow parameters and README contents * Parameterize torchrun workflow * Remove cookbook.md, just fetch from cookbook README * Revert changes

* Properly fix unique Prometheus ports per service (redo of #649)
  PR #649 fixed port conflicts only in the bazel run scripts, meaning the problem persisted when services were launched directly via their bazel targets. This commit fixes it at the source by overriding metrics_prometheus_port in each service's config class, so the correct port is used regardless of how the service is started.
  Port assignments (core stays at the base default of 9464):
  - worker: 9465
  - delayed_job_monitor: 9466
  - backend_listener: 9467
  - backend_worker: 9468
  Kubernetes manifests are updated to match these new defaults.
  Also fixes a pre-existing port name collision: the oauth2-proxy sidecar declared its metrics port as "metrics", conflicting with the OSMO service container port of the same name in the same pod. Renamed to "oauth2-metrics" across all three chart sidecar helpers (service, router, web-ui).
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Revert run script port overrides made redundant by previous commit
  The --metrics_prometheus_port flags added to the bazel run scripts by #649 are now superseded by the per-service config class defaults. Remove them to keep the scripts clean.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Add oauth2-metrics endpoint to service PodMonitor
  The oauth2-proxy sidecar port was renamed from "metrics" to "oauth2-metrics" to avoid a pod-level port name collision. Add a corresponding PodMonitor endpoint so Prometheus continues scraping oauth2-proxy metrics when the sidecar is enabled.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix pylint missing-class-docstring in WorkerConfig
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
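The per-service override of metrics_prometheus_port described above might look something like the sketch below. It uses plain dataclasses as a stand-in for whatever config framework the services actually use. MetricsCreatorConfig and WorkerConfig are named in the commit messages; the other class names, the `SERVICE_CONFIGS` table, and the `default_metrics_ports()` helper are hypothetical and exist only to illustrate the port assignments.

```python
from dataclasses import dataclass
from typing import Dict, Type


@dataclass
class MetricsCreatorConfig:
    """Base config; core keeps this default (9464 per the commit message)."""
    metrics_prometheus_port: int = 9464


@dataclass
class WorkerConfig(MetricsCreatorConfig):
    """Worker service config."""
    metrics_prometheus_port: int = 9465


@dataclass
class DelayedJobMonitorConfig(MetricsCreatorConfig):
    """Hypothetical class name for delayed_job_monitor."""
    metrics_prometheus_port: int = 9466


@dataclass
class BackendListenerConfig(MetricsCreatorConfig):
    """Hypothetical class name for backend_listener."""
    metrics_prometheus_port: int = 9467


@dataclass
class BackendWorkerConfig(MetricsCreatorConfig):
    """Hypothetical class name for backend_worker."""
    metrics_prometheus_port: int = 9468


# Hypothetical lookup table, used here only to check the defaults.
SERVICE_CONFIGS: Dict[str, Type[MetricsCreatorConfig]] = {
    "core": MetricsCreatorConfig,
    "worker": WorkerConfig,
    "delayed_job_monitor": DelayedJobMonitorConfig,
    "backend_listener": BackendListenerConfig,
    "backend_worker": BackendWorkerConfig,
}


def default_metrics_ports() -> Dict[str, int]:
    """Return each service's default port, failing if any two collide."""
    ports = {name: cls().metrics_prometheus_port for name, cls in SERVICE_CONFIGS.items()}
    if len(set(ports.values())) != len(ports):
        raise ValueError("two services share a Prometheus metrics port")
    return ports
```

Because the default lives on the config class, the port is the same whether a service is started from a run script or directly from its bazel target, and an explicit value can still override it.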

* Revert "new fix for prometheus ports (#670)"
  This reverts commit 122af26.
* Revert "Assign unique Prometheus metrics ports per service in bazel mode (#649)"
  This reverts commit ef45396.
* Fix OAuth2 metrics
* Skip Prometheus metrics server in dev mode for all services
  Prevents port conflicts when running multiple services locally with --method=dev. Consistent with the pattern already applied to worker.py.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Skip metrics server startup when otel metrics are disabled
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Revert "Skip Prometheus metrics server in dev mode for all services"
  This reverts commit 9211ccc.
* Disable metrics by default

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
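The end state of this series (metrics off by default, and no scrape server started while they are off) could be sketched as a small guard like the one below. The function name, the `metrics_enabled` parameter, and the injected `start_http_server` callable are assumptions for illustration; the commit messages do not show the actual flag or call site.

```python
from typing import Callable


def maybe_start_metrics_server(
    start_http_server: Callable[..., object],
    port: int = 9464,
    metrics_enabled: bool = False,  # assumption: mirrors "Disable metrics by default"
) -> bool:
    """Start the Prometheus scrape endpoint only when metrics are enabled.

    Returns True if the server was started, False if start-up was skipped.
    """
    if not metrics_enabled:
        # Nothing binds a port, so several services can run side by side locally.
        return False
    start_http_server(port, addr="0.0.0.0")
    return True
```

In production the callable passed in would presumably be `prometheus_client.start_http_server`; passing it as an argument keeps the sketch runnable without that dependency.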

The osmo_workspace+ directory containing cross-workspace dependencies may exist inside the runfiles directory rather than at the image root. Unify the pythonpath append to use local_runfiles_dir so it resolves correctly in both cases.

Fix minor bugs in the AWS deployment script:
- Upgrade the RDS database to Postgres 15.12 (15.4 is no longer supported)
- Fix CLI parameters not being correctly propagated to Terraform
- Add support for NGC_API_KEY to pull non-public helm charts and images
- Fix bug where dry-run would not stop after Terraform and would throw errors

* Fix user/pool filters in occupancy page * Add occupancy to pools quicklink * Add cross link from occupancy page to workflows, refactor for better code reuse * Coderabbit * Send `all_users` and `all_pools` for occupancy page when necessary * Include initializing into default filters for occupancy * Coderabbit

coderabbitai · 2026-03-11T22:36:17Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8f7d8b31-398e-46cc-91f7-0678e15a0707

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

fernandol-nvidia and others added 30 commits March 4, 2026 15:20

Pool Utilization/Occupancy UI Updates (#624)

e3ea952

* Implement Pool Summary Cards * Carve out free/used on pools.. no additional toggle * Rename to PoolGpuSummaryCard * Code simplify

Remove setuptools warning from CLI (#625)

701078e

remove log agent (#612)

8dad0be

* remove log agent and all refs in charts * remove references in docs * remove log agent ref from deploy script * update log format to json * remove tee * remove command and extra log dir

Fix various API issues after orval migration (#632)

7f1c5ce

Remove f-strings from key decryption + update uek creation (#636)

50f8374

Add integration tests and fixtures for Golang and Authz Sidecar (#637)

07fb199

Update KAI to v0.12 in scripts and docs (#639)

7d421d3

Update KAI scheduler to v0.12 in scripts and docs to support topology aware scheduling

Add Redis to Oauth2-Proxy (#638)

9931f98

* Oauth2 Proxy Redis Based Sessions * Add TLS and password

Make Redis Oauth2 Session Store Enabled by Default and Update READMEs (…

dcb4f6a

…#641) * Update readme for helm charts * Make redis session store enabled by default

Fix template in backend test runner (#648)

d44e5b3

Resolve issues creating default admin (#642)

d3193e8

* Fix generating default admin info * Update docs

charts: Fix XHR calls so they return 401 if not auth (#654)

c7f2f24

Previously it would do a redirect to the OIDC provider, which the UI did not present in a way that made sense for the user.

Update password flow to fetch client_id from service (#653)

5f067f8

* Update password flow to fetch client_id from service * lint

Add auto coderabbit (#656)

c535301

* Add auto coderabbit * chill

Add regex to user create validation (#660)

9b770be

Fix device_client_id typo (#664)

56c391f

Occupancy Page in UI (#640)

0448f6b

* Occupancy Page * Tweak column widths * Update icon

Refactor and fix data-table variable height rows (#667)

82d0ff9

Ethany/support large workflows (#655)

88fd200

* Initial commit * Remove timing logs and linta * Remove timing logs

Add default filter (Status) to occupancy page and compact bytes unit (#…

211404d

…668) * Add default filter (STATUS) to occupancy page * Use compact bytes unit in Occupancy page

Misc UI improvements and polish (#669)

00b527c

Catch error when logs time out (#662)

5501daa

* Catch error when timing out * update

Fix auto-refresh on workflow details page (#671)

2fff7c4

Fix bad cli call in the docs (#679)

d09e694

ecolternv and others added 12 commits March 11, 2026 12:57

Speedup UpdateGroup Jobs (#676)

e58bd4f

readme: add pointer to azure reference arch (#685)

5c7deab

Rudimentary secret redaction (#678)

a882e83

Add basic redaction of secrets in workflow specs

Fix: Connection test to not just check 200 (#688)

71969cc

Fix auxilary pod cleanup (#687)

bcb07fa

Fix authz sidecar evaluating tokens (#683)

ba252dc

fix backend listener (#681)

a484225

* fix backend listener * lint

Add some color to the filters and update occupancy links (#689)

a684427

Pool Quota: Address Case when Pools have No Nodes (#691)

19c3e76

#686 - Add verbose option for config show POOL (#690)

5edeab4

* Add verbose mode for config show * Fix for past revisions

RyaliNvidia requested a review from a team March 11, 2026 22:36

RyaliNvidia requested a review from a team as a code owner March 11, 2026 22:36

RyaliNvidia closed this Mar 11, 2026

RyaliNvidia wants to merge 42 commits into release/6.2 from main
