Update with main#694

Closed
RyaliNvidia wants to merge 42 commits into release/6.2 from main

Conversation

@RyaliNvidia
Contributor

@RyaliNvidia RyaliNvidia commented Mar 11, 2026

Issue #None

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

fernandol-nvidia and others added 30 commits March 4, 2026 15:20
* Implement Pool Summary Cards

* Carve out free/used on pools.. no additional toggle

* Rename to PoolGpuSummaryCard

* Code simplify
* remove log agent and all refs in charts

* remove references in docs

* remove log agent ref from deploy script

* update log format to json

* remove tee

* remove command and extra log dir
* Remove OTEL collector sidecar, expose Prometheus metrics directly

The OTEL collector sidecar's only role was to receive OTLP metrics from
the app and re-expose them as a Prometheus scrape endpoint. By switching
to opentelemetry-exporter-prometheus, each service now exposes metrics
directly on port 9464, eliminating the need for the sidecar entirely.

- Replace OTLPMetricExporter with PrometheusMetricReader in metrics.py;
  update MetricsCreatorConfig to use metrics_prometheus_port instead of
  the OTLP collector host/port/interval fields
- Remove otel-collector sidecar container, ConfigMap, and volume from
  all service and backend-operator deployments
- Add a named 'metrics' port (9464) to each app container so PodMonitors
  can scrape it directly
- Make PodMonitors always-on (no longer gated on otel.enabled); add an
  envoy-admin endpoint to the service PodMonitor so Envoy's
  /stats/prometheus is still scraped after the sidecar is gone
- Remove sidecars.otel / sidecars.OTEL values blocks from all charts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix PrometheusMetricReader usage: start HTTP server explicitly

PrometheusMetricReader has no 'port' argument; use
prometheus_client.start_http_server() to bind the scrape endpoint.
Also add prometheus-client as an explicit BUILD dep.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Expose Prometheus metrics on 0.0.0.0 via explicit start_server()

Move prometheus_client.start_http_server out of MetricCreator.__init__
into a start_server() method, binding to 0.0.0.0 so Prometheus can
scrape from outside the container. Each service now calls start_server()
explicitly after construction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
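
A minimal sketch of the pattern the commit above describes: construction has no side effects, and the scrape endpoint is bound only when start_server() is called. To stay runnable without third-party packages, this sketch uses the standard library's http.server as a stand-in for prometheus_client.start_http_server; the class name MetricCreatorSketch, the stop_server() helper, and the placeholder payload are assumptions made for illustration, not the project's code.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional


class _MetricsHandler(BaseHTTPRequestHandler):
    """Serves a fixed placeholder payload on /metrics (illustration only)."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = b"# placeholder scrape payload\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args: object) -> None:
        # Silence per-request logging in this sketch.
        return


class MetricCreatorSketch:
    """Hypothetical stand-in for MetricCreator: __init__ binds nothing."""

    def __init__(self, port: int = 9464) -> None:
        self.port = port
        self._server: Optional[HTTPServer] = None  # no socket is open yet

    def start_server(self, addr: str = "0.0.0.0") -> int:
        """Bind the scrape endpoint and return the port actually in use."""
        if self._server is None:
            self._server = HTTPServer((addr, self.port), _MetricsHandler)
            thread = threading.Thread(
                target=self._server.serve_forever, daemon=True
            )
            thread.start()
        return self._server.server_address[1]

    def stop_server(self) -> None:
        """Shut the endpoint down again (assumed helper, handy in tests)."""
        if self._server is not None:
            self._server.shutdown()
            self._server.server_close()
            self._server = None
```

Binding to 0.0.0.0 rather than the loopback address is what lets a scraper outside the container reach the endpoint; a caller would construct the object first and call start_server() as a separate, explicit step.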

* Clean up MetricCreator call sites to not go through get_meter_instance()

Call start_server() directly on the constructor result instead of
chaining through the static get_meter_instance() accessor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add prometheus_client to mypy ignore_missing_imports

prometheus_client lacks type stubs visible to the Bazel mypy
integration, so suppress the import-not-found error the same way
other untyped packages are handled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Move start_server() out of configure_app() into main()

configure_app() is called by tests, which must not bind a Prometheus
HTTP server. Move start_server() to main() so it only runs in
production.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
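
The split between configure_app() and main() described above can be sketched roughly as follows. FakeMetrics and the route list are hypothetical placeholders; the only point illustrated is that code paths shared with tests never start the metrics server, while the production entry point does.

```python
from typing import List


class FakeMetrics:
    """Hypothetical metrics object that only records whether it was started."""

    def __init__(self) -> None:
        self.started = False

    def start_server(self) -> None:
        # A real implementation would bind the Prometheus scrape port here.
        self.started = True


def configure_app(metrics: FakeMetrics) -> List[str]:
    """Build the app (here just a list of route names); binds no ports."""
    return ["/health", "/api"]


def main(metrics: FakeMetrics) -> List[str]:
    """Production entry point: configure first, then start the metrics server."""
    app = configure_app(metrics)
    metrics.start_server()
    return app
```

A test can call configure_app() as often as it likes without opening a socket; only main() has the side effect.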

* Add podMonitor.enabled toggle to service and backend-operator charts

Users without the Prometheus Operator CRD (monitoring.coreos.com/v1)
can set podMonitor.enabled=false to skip PodMonitor creation.
Defaults to true.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Move start_server() to main() in backend_worker; disable podMonitor in quick-start

- Move BackendWorker.start_server() call from __init__ to main() for
  consistency with other services
- Disable PodMonitor in quick-start chart since the CRD is not installed
  in a basic quick-start deployment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Update chart READMEs: replace OTEL sidecar docs with podMonitor docs

Remove references to sidecars.otel/OTEL.enabled and associated OTEL
collector parameters; add podMonitor.enabled documentation reflecting
the switch to Prometheus pull-based metrics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Update KAI scheduler to v0.12 in scripts and docs to support
topology-aware scheduling
* Oauth2 Proxy Redis Based Sessions

* Add TLS and password
…#641)

* Update readme for helm charts

* Make redis session store enabled by default
* Fix generating default admin info

* Update docs
Prevents port conflicts when running all services locally via bazel.
Core: 9464, worker: 9465, delayed_job_monitor: 9466, backend_listener: 9467, backend_worker: 9468.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously it would do a redirect to the OIDC provider, which the UI did
not present in a way that made sense for the user.
* Update password flow to fetch client_id from service

* lint
* Add auto coderabbit

* chill
…Skill (#631)

* Update Skill to Autoscale Workflow Submissions based on workflow parameters and README contents

* Parameterize torchrun workflow

* Remove cookbook.md, just fetch from cookbook README

* Revert changes
* Occupancy Page

* Tweak column widths

* Update icon
* Initial commit

* Remove timing logs and lint

* Remove timing logs
…668)

* Add default filter (STATUS) to occupancy page

* Use compact bytes unit in Occupancy page
* Properly fix unique Prometheus ports per service (redo of #649)

PR #649 fixed port conflicts only in the bazel run scripts, meaning the
problem persisted when services were launched directly via their bazel
targets. This commit fixes it at the source by overriding
metrics_prometheus_port in each service's config class, so the correct
port is used regardless of how the service is started.

Port assignments (core stays at the base default of 9464):
  - worker:            9465
  - delayed_job_monitor: 9466
  - backend_listener:  9467
  - backend_worker:    9468

Kubernetes manifests are updated to match these new defaults.

Also fixes a pre-existing port name collision: the oauth2-proxy sidecar
declared its metrics port as "metrics", conflicting with the OSMO
service container port of the same name in the same pod. Renamed to
"oauth2-metrics" across all three chart sidecar helpers (service,
router, web-ui).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
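
One way to picture the per-service override described above is a base config class with the default port and one subclass per service. This is a sketch under assumptions: only metrics_prometheus_port, WorkerConfig, and the port numbers come from the commit messages; the other class names and the SERVICE_CONFIGS table are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Type


@dataclass
class BaseServiceConfig:
    """Base default; per the commit message, core stays on this port."""
    metrics_prometheus_port: int = 9464


@dataclass
class WorkerConfig(BaseServiceConfig):
    """Worker service config with its own metrics port."""
    metrics_prometheus_port: int = 9465


@dataclass
class DelayedJobMonitorConfig(BaseServiceConfig):
    """Hypothetical class name for the delayed_job_monitor service."""
    metrics_prometheus_port: int = 9466


@dataclass
class BackendListenerConfig(BaseServiceConfig):
    """Hypothetical class name for the backend_listener service."""
    metrics_prometheus_port: int = 9467


@dataclass
class BackendWorkerConfig(BaseServiceConfig):
    """Hypothetical class name for the backend_worker service."""
    metrics_prometheus_port: int = 9468


# Illustrative lookup table, not part of the original change.
SERVICE_CONFIGS: Dict[str, Type[BaseServiceConfig]] = {
    "core": BaseServiceConfig,
    "worker": WorkerConfig,
    "delayed_job_monitor": DelayedJobMonitorConfig,
    "backend_listener": BackendListenerConfig,
    "backend_worker": BackendWorkerConfig,
}


def port_table() -> Dict[str, int]:
    """Return the default metrics port each service would use."""
    return {
        name: cls().metrics_prometheus_port
        for name, cls in SERVICE_CONFIGS.items()
    }
```

Because the default lives in the config class rather than in a launch script, the same port applies however the service is started, which is the motivation the commit gives for the change.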

* Revert run script port overrides made redundant by previous commit

The --metrics_prometheus_port flags added to the bazel run scripts by
#649 are now superseded by the per-service config class defaults. Remove
them to keep the scripts clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add oauth2-metrics endpoint to service PodMonitor

The oauth2-proxy sidecar port was renamed from "metrics" to
"oauth2-metrics" to avoid a pod-level port name collision. Add a
corresponding PodMonitor endpoint so Prometheus continues scraping
oauth2-proxy metrics when the sidecar is enabled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix pylint missing-class-docstring in WorkerConfig

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Catch error when timing out

* update
* Revert "new fix for prometheus ports (#670)"

This reverts commit 122af26.

* Revert "Assign unique Prometheus metrics ports per service in bazel mode (#649)"

This reverts commit ef45396.

* Fix OAuth2 metrics

* Skip Prometheus metrics server in dev mode for all services

Prevents port conflicts when running multiple services locally with
--method=dev. Consistent with the pattern already applied to worker.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Skip metrics server startup when otel metrics are disabled

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "Skip Prometheus metrics server in dev mode for all services"

This reverts commit 9211ccc.

* Disable metrics by default

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The osmo_workspace+ directory containing cross-workspace dependencies
may exist inside the runfiles directory rather than at the image root.
Unify the pythonpath append to use local_runfiles_dir so it resolves
correctly in both cases.
ecolternv and others added 12 commits March 11, 2026 12:57
Fix minor bugs in aws deployment script

- Upgrade database for RDS to postgres 15.12 (15.4 is not supported anymore)
- Fix CLI parameters not being correctly propagated to terraform
- Add support for NGC_API_KEY to pull non-public helm charts and images
- Fix bug where dry-run would not stop after terraform and would throw errors
Add basic redaction of secrets in workflow specs
* Fix user/pool filters in occupancy page

* Add occupancy to pools quicklink

* Add cross link from occupancy page to workflows, refactor for better code reuse

* Coderabbit

* Send `all_users` and `all_pools` for occupancy page when necessary

* Include initializing into default filters for occupancy

* Coderabbit
* fix backend listener

* lint
* Add verbose mode for config show

* Fix for past revisions
@RyaliNvidia RyaliNvidia requested a review from a team March 11, 2026 22:36
@RyaliNvidia RyaliNvidia requested a review from a team as a code owner March 11, 2026 22:36
@coderabbitai

coderabbitai bot commented Mar 11, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8f7d8b31-398e-46cc-91f7-0678e15a0707

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

9 participants