Closed
* Implement Pool Summary Cards * Carve out free/used on pools.. no additional toggle * Rename to PoolGpuSummaryCard * Code simplify
* remove log agent and all refs in charts * remove references in docs * remove log agent ref from deploy script * update log format to json * remove tee * remove command and extra log dir
* Remove OTEL collector sidecar, expose Prometheus metrics directly

  The OTEL collector sidecar's only role was to receive OTLP metrics from the app and re-expose them as a Prometheus scrape endpoint. By switching to opentelemetry-exporter-prometheus, each service now exposes metrics directly on port 9464, eliminating the need for the sidecar entirely.

  - Replace OTLPMetricExporter with PrometheusMetricReader in metrics.py; update MetricsCreatorConfig to use metrics_prometheus_port instead of the OTLP collector host/port/interval fields
  - Remove otel-collector sidecar container, ConfigMap, and volume from all service and backend-operator deployments
  - Add a named 'metrics' port (9464) to each app container so PodMonitors can scrape it directly
  - Make PodMonitors always-on (no longer gated on otel.enabled); add an envoy-admin endpoint to the service PodMonitor so Envoy's /stats/prometheus is still scraped after the sidecar is gone
  - Remove sidecars.otel / sidecars.OTEL values blocks from all charts

* Fix PrometheusMetricReader usage: start HTTP server explicitly

  PrometheusMetricReader has no 'port' argument; use prometheus_client.start_http_server() to bind the scrape endpoint. Also add prometheus-client as an explicit BUILD dep.

* Expose Prometheus metrics on 0.0.0.0 via explicit start_server()

  Move prometheus_client.start_http_server out of MetricCreator.__init__ into a start_server() method, binding to 0.0.0.0 so Prometheus can scrape from outside the container. Each service now calls start_server() explicitly after construction.

* Clean up MetricCreator call sites to not go through get_meter_instance()

  Call start_server() directly on the constructor result instead of chaining through the static get_meter_instance() accessor.

* Add prometheus_client to mypy ignore_missing_imports

  prometheus_client lacks type stubs visible to the Bazel mypy integration, so suppress the import-not-found error the same way other untyped packages are handled.

* Move start_server() out of configure_app() into main()

  configure_app() is called by tests, which must not bind a Prometheus HTTP server. Move start_server() to main() so it only runs in production.

* Add podMonitor.enabled toggle to service and backend-operator charts

  Users without the Prometheus Operator CRD (monitoring.coreos.com/v1) can set podMonitor.enabled=false to skip PodMonitor creation. Defaults to true.

* Move start_server() to main() in backend_worker; disable podMonitor in quick-start

  - Move BackendWorker.start_server() call from __init__ to main() for consistency with other services
  - Disable PodMonitor in quick-start chart since the CRD is not installed in a basic quick-start deployment

* Update chart READMEs: replace OTEL sidecar docs with podMonitor docs

  Remove references to sidecars.otel/OTEL.enabled and associated OTEL collector parameters; add podMonitor.enabled documentation reflecting the switch to Prometheus pull-based metrics.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
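The commits above separate constructing the metrics object from binding the scrape endpoint, so that tests calling configure_app() never open a port. The following is a minimal sketch of that pattern only. It uses the standard library's http.server as a stand-in for prometheus_client.start_http_server(), and the `inc()` and `render()` helpers, the `counters` and `server` attributes, and the body of `main()` are hypothetical names chosen for illustration, not the project's actual code.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Dict, Optional


class MetricCreator:
    """Hypothetical stand-in: collects counters and serves them on request."""

    def __init__(self, port: int = 9464) -> None:
        # Construction only stores state. No socket is opened here, so code
        # that merely builds the object (for example, tests) binds nothing.
        self.port = port
        self.counters: Dict[str, int] = {}
        self.server: Optional[ThreadingHTTPServer] = None

    def inc(self, name: str, amount: int = 1) -> None:
        self.counters[name] = self.counters.get(name, 0) + amount

    def render(self) -> str:
        # Simplified Prometheus text format: one "name value" sample per line.
        return "".join(f"{k} {v}\n" for k, v in sorted(self.counters.items()))

    def start_server(self, addr: str = "0.0.0.0") -> int:
        """Bind the scrape endpoint; returns the port actually bound."""
        creator = self

        class Handler(BaseHTTPRequestHandler):
            def do_GET(self) -> None:
                body = creator.render().encode()
                self.send_response(200)
                self.send_header("Content-Type", "text/plain; version=0.0.4")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)

            def log_message(self, format, *args) -> None:
                pass  # keep the sketch quiet

        self.server = ThreadingHTTPServer((addr, self.port), Handler)
        threading.Thread(target=self.server.serve_forever, daemon=True).start()
        return self.server.server_address[1]


def main() -> None:
    # Only the production entry point binds the port.
    metrics = MetricCreator()
    metrics.start_server()
```

Binding to 0.0.0.0 rather than localhost is what lets a scraper outside the container reach the endpoint, which is the reason the commit message gives for the explicit address.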
Update KAI scheduler to v0.12 in scripts and docs to support topology aware scheduling
* Oauth2 Proxy Redis Based Sessions * Add TLS and password
…#641) * Update readme for helm charts * Make redis session store enabled by default
* Fix generating default admin info * Update docs
Prevents port conflicts when running all services locally via bazel. Core: 9464, worker: 9465, delayed_job_monitor: 9466, backend_listener: 9467, backend_worker: 9468. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously it would do a redirect to the OIDC provider, which the UI did not present in a way that made sense for the user.
* Update password flow to fetch client_id from service * lint
* Add auto coderabbit * chill
…Skill (#631) * Update Skill to Autoscale Workflow Submissions based on workflow parameters and README contents * Parameterize torchrun workflow * Remove cookbook.md, just fetch from cookbook README * Revert changes
* Occupancy Page * Tweak column widths * Update icon
* Initial commit * Remove timing logs and lint * Remove timing logs
…668) * Add default filter (STATUS) to occupancy page * Use compact bytes unit in Occupancy page
* Properly fix unique Prometheus ports per service (redo of #649)

  PR #649 fixed port conflicts only in the bazel run scripts, meaning the problem persisted when services were launched directly via their bazel targets. This commit fixes it at the source by overriding metrics_prometheus_port in each service's config class, so the correct port is used regardless of how the service is started.

  Port assignments (core stays at the base default of 9464):

  - worker: 9465
  - delayed_job_monitor: 9466
  - backend_listener: 9467
  - backend_worker: 9468

  Kubernetes manifests are updated to match these new defaults.

  Also fixes a pre-existing port name collision: the oauth2-proxy sidecar declared its metrics port as "metrics", conflicting with the OSMO service container port of the same name in the same pod. Renamed to "oauth2-metrics" across all three chart sidecar helpers (service, router, web-ui).

* Revert run script port overrides made redundant by previous commit

  The --metrics_prometheus_port flags added to the bazel run scripts by #649 are now superseded by the per-service config class defaults. Remove them to keep the scripts clean.

* Add oauth2-metrics endpoint to service PodMonitor

  The oauth2-proxy sidecar port was renamed from "metrics" to "oauth2-metrics" to avoid a pod-level port name collision. Add a corresponding PodMonitor endpoint so Prometheus continues scraping oauth2-proxy metrics when the sidecar is enabled.

* Fix pylint missing-class-docstring in WorkerConfig

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
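The per-service override described above can be sketched with plain dataclasses. This is an illustration under assumptions, not the project's config system: only `metrics_prometheus_port`, `WorkerConfig`, and the port numbers come from the commit message, while `BaseMetricsConfig`, the other class names, `SERVICE_CONFIGS`, and `resolve_ports()` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Type


@dataclass
class BaseMetricsConfig:
    """Hypothetical base config; core keeps this default."""
    metrics_prometheus_port: int = 9464


@dataclass
class WorkerConfig(BaseMetricsConfig):
    """Worker overrides the default so it never collides with core."""
    metrics_prometheus_port: int = 9465


@dataclass
class DelayedJobMonitorConfig(BaseMetricsConfig):
    """Hypothetical class name for delayed_job_monitor."""
    metrics_prometheus_port: int = 9466


@dataclass
class BackendListenerConfig(BaseMetricsConfig):
    """Hypothetical class name for backend_listener."""
    metrics_prometheus_port: int = 9467


@dataclass
class BackendWorkerConfig(BaseMetricsConfig):
    """Hypothetical class name for backend_worker."""
    metrics_prometheus_port: int = 9468


SERVICE_CONFIGS: Dict[str, Type[BaseMetricsConfig]] = {
    "core": BaseMetricsConfig,
    "worker": WorkerConfig,
    "delayed_job_monitor": DelayedJobMonitorConfig,
    "backend_listener": BackendListenerConfig,
    "backend_worker": BackendWorkerConfig,
}


def resolve_ports(configs: Dict[str, Type[BaseMetricsConfig]]) -> Dict[str, int]:
    """Return each service's default port, failing loudly on a collision."""
    ports = {name: cls().metrics_prometheus_port for name, cls in configs.items()}
    if len(set(ports.values())) != len(ports):
        raise ValueError(f"duplicate Prometheus ports: {ports}")
    return ports
```

Because the default lives on the class, the same port applies whether a service is started through a run script or directly from its bazel target, which is the gap the commit message says #649 left open.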
* Catch error when timing out * update
* Revert "new fix for prometheus ports (#670)"

  This reverts commit 122af26.

* Revert "Assign unique Prometheus metrics ports per service in bazel mode (#649)"

  This reverts commit ef45396.

* Fix OAuth2 metrics

* Skip Prometheus metrics server in dev mode for all services

  Prevents port conflicts when running multiple services locally with --method=dev. Consistent with the pattern already applied to worker.py.

* Skip metrics server startup when otel metrics are disabled

* Revert "Skip Prometheus metrics server in dev mode for all services"

  This reverts commit 9211ccc.

* Disable metrics by default

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The osmo_workspace+ directory containing cross-workspace dependencies may exist inside the runfiles directory rather than at the image root. Unify the pythonpath append to use local_runfiles_dir so it resolves correctly in both cases.
Fix minor bugs in AWS deployment script
- Upgrade the RDS database to Postgres 15.12 (15.4 is no longer supported)
- Fix CLI parameters not being correctly propagated to Terraform
- Add support for NGC_API_KEY to pull non-public Helm charts and images
- Fix bug where dry-run would not stop after Terraform and would throw errors
Add basic redaction of secrets in workflow specs
* Fix user/pool filters in occupancy page * Add occupancy to pools quicklink * Add cross link from occupancy page to workflows, refactor for better code reuse * Coderabbit * Send `all_users` and `all_pools` for occupancy page when necessary * Include initializing into default filters for occupancy * Coderabbit
* fix backend listener * lint
* Add verbose mode for config show * Fix for past revisions
Issue #None
Checklist