Commit a610855

add manifests and documentation for observability
Signed-off-by: sallyom <somalley@redhat.com>
1 parent 1505851 commit a610855

26 files changed

Lines changed: 2889 additions & 0 deletions

kubernetes/observability/README.md

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
# Monitor Llamastack & vLLM in OpenShift

Follow this README to configure an observability stack in OpenShift to visualize Llamastack telemetry and vLLM metrics.

## OpenShift Observability Operators

The following operators are available from OperatorHub and must be installed before proceeding with this example.

### Operator descriptions

1. **Red Hat Build of OpenTelemetry**: Provides the OpenTelemetry Collector (OTC).
The OTC distributes metrics and traces to various backends. In this example, Tempo is deployed as the tracing backend.

2. **Tempo Operator**: Provides the `TempoStack` custom resource, which is the backend for distributed tracing.
S3-compatible storage (Minio) is paired with Tempo.

3. **Cluster Observability Operator (COO)**: Provides the PodMonitor and ServiceMonitor custom resources, which are necessary for
the user-workload-monitoring Prometheus to scrape workload metrics. The COO also provides UIPlugins for viewing telemetry.

4. **(optional) Grafana Operator**: Provides Grafana APIs, including `GrafanaDashboard`, `Grafana`, and `GrafanaDataSource`, that are used to visualize telemetry.

## Create PodMonitor or ServiceMonitor for any AI Workload that exposes a metrics endpoint

To enable collection of user-workload metrics for any workload within OpenShift, create a `PodMonitor` or a `ServiceMonitor`.
A PodMonitor ensures that metrics from all pods with matching selectors are scraped by the user-workload-monitoring Prometheus, while a ServiceMonitor
scrapes every pod that runs under a particular service.

* [Example PodMonitor](./podmonitor-example-0.yaml)
* [Example ServiceMonitor](./servicemonitor-example.yaml)

Once either is created, metrics are scraped and become visible in the console `Observe -> Metrics` dashboards.

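As a rough sketch of what such resources can look like, the manifests below define a PodMonitor and a ServiceMonitor for a vLLM workload. The resource names, the `llama-serve` namespace, the `app: vllm` label, and the `http` port name are assumptions for illustration and are not taken from the linked example files. The `monitoring.coreos.com/v1` API group is assumed here because it is the one the user-workload-monitoring Prometheus watches.

```yaml
# Sketch only: names, labels, and port names are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-podmonitor          # hypothetical name
  namespace: llama-serve         # assumed workload namespace
spec:
  selector:
    matchLabels:
      app: vllm                  # must match the labels on the vLLM pods
  podMetricsEndpoints:
    - port: http                 # named container port that serves metrics
      path: /metrics
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-servicemonitor      # hypothetical name
  namespace: llama-serve
spec:
  selector:
    matchLabels:
      app: vllm                  # must match the labels on the Service
  endpoints:
    - port: http                 # named Service port that serves metrics
      path: /metrics
      interval: 30s
```

Only one of the two is needed per workload: use the PodMonitor when the pods have no Service in front of them, and the ServiceMonitor otherwise.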
## Create custom resources and configurations for a central observability hub

Create the observability hub namespace `observability-hub`. If you use a different namespace, be sure to update the resource YAML files accordingly.

```bash
oc create ns observability-hub
```

### Tracing Backend (Tempo with Minio for S3 storage)

```bash
# edit storageClassName & secret as necessary
# secret and storage for testing only
oc apply --kustomize ./tempo -n observability-hub
```

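To make the "storageClassName & secret" comment concrete, here is a minimal sketch of the two pieces that usually need editing: the object-storage Secret and the `TempoStack` that references it. The resource names, the Minio service URL, the bucket, and the credentials are placeholders chosen for illustration, and the actual manifests in `./tempo` may differ.

```yaml
# Sketch only: all names, the endpoint, and the credentials are assumptions.
apiVersion: v1
kind: Secret
metadata:
  name: minio-tempo                      # hypothetical Secret name
  namespace: observability-hub
stringData:
  endpoint: http://minio.observability-hub.svc.cluster.local:9000   # assumed Minio Service
  bucket: tempo                          # assumed bucket name
  access_key_id: tempo                   # test-only credentials
  access_key_secret: supersecret
---
apiVersion: tempo.grafana.com/v1alpha1
kind: TempoStack
metadata:
  name: tempostack                       # hypothetical name
  namespace: observability-hub
spec:
  storageSize: 1Gi
  storageClassName: <your-storage-class> # set to a StorageClass available in your cluster
  storage:
    secret:
      name: minio-tempo                  # must match the Secret above
      type: s3
```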
### OpenTelemetryCollector deployment

The OpenTelemetry Collector aggregates telemetry from various workloads, processes individual signals, and exports
them to various backends. Here it collects traces from various workloads and exports them all as a single
authenticated stream to the in-cluster TempoStack. For in-cluster use only, the opentelemetry-collector is not necessary to collect
metrics: the in-cluster user-workload-monitoring Prometheus scrapes them once the PodMonitors and ServiceMonitors are created.
However, when exporting off-cluster to a third-party observability vendor, the collector is necessary for all signals.
It provides a single place to receive telemetry from various workloads and export it as a single authenticated and
secure OTLP stream.

To create a central opentelemetry-collector, update
[otel-collector/otel-collector.yaml](./otel-collector/otel-collector.yaml) to match your requirements and then apply it.

```bash
oc apply --kustomize ./otel-collector -n observability-hub
```

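For orientation, a central collector that receives OTLP traces and forwards them as one authenticated stream to the TempoStack gateway might look roughly like the sketch below. It is not the content of `otel-collector.yaml`. The collector name, the TempoStack name `tempostack`, and the tenant `dev` are assumptions, and the sketch also assumes that the TempoStack has multitenancy enabled and that the collector's service account is allowed to write traces for that tenant.

```yaml
# Sketch only: collector name, gateway hostname, and tenant are assumptions.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector                   # hypothetical name
  namespace: observability-hub
spec:
  mode: deployment
  config:
    extensions:
      bearertokenauth:
        # Authenticate to the Tempo gateway with the pod's service account token
        filename: /var/run/secrets/kubernetes.io/serviceaccount/token
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    processors:
      batch: {}
    exporters:
      otlp/tempo:
        endpoint: tempo-tempostack-gateway.observability-hub.svc.cluster.local:8090
        tls:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
        auth:
          authenticator: bearertokenauth
        headers:
          X-Scope-OrgID: dev             # assumed tenant name
    service:
      extensions: [bearertokenauth]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp/tempo]
```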
### OpenTelemetryCollector Sidecars deployment

You can add individual metrics endpoints to the central otel-collector in observability-hub, but
another way is to add otel-collector sidecar containers to individual deployments throughout the
cluster. Any deployment whose pod template carries the annotation below gets a sidecar that receives and exports telemetry as configured in
[otel-collector-vllm-sidecar.yaml](./otel-collector/otel-collector-vllm-sidecar.yaml).

The example here adds an otel-collector sidecar custom resource to the `llama-serve` namespace.
To trigger injection of a sidecar container, annotate any deployment's `template.metadata.annotations` with:
`sidecar.opentelemetry.io/inject: vllm-otelsidecar`

```bash
oc apply -f ./otel-collector/otel-collector-vllm-sidecar.yaml

# Then, annotate whatever vllm deployment you'd like to collect metrics from
# Or, add the annotation to the deployment's `template.metadata.annotations` from the console.
oc patch deployment <deployment-name> \
  -n <namespace> \
  --type='merge' \
  -p '{"spec":{"template":{"metadata":{"annotations":{"sidecar.opentelemetry.io/inject":"vllm-otelsidecar"}}}}}'
```

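To illustrate how the annotation value ties back to a collector resource, the sketch below shows a sidecar-mode `OpenTelemetryCollector` named `vllm-otelsidecar`. The annotation value has to match this name. The receivers, the vLLM port `8000`, and the central collector address are assumptions made for this example and may not match the repository's `otel-collector-vllm-sidecar.yaml`.

```yaml
# Sketch only: receivers, scrape target, and exporter endpoint are assumptions.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: vllm-otelsidecar                 # referenced by the inject annotation
  namespace: llama-serve
spec:
  mode: sidecar
  config:
    receivers:
      otlp:                              # traces pushed by the workload
        protocols:
          grpc: {}
          http: {}
      prometheus:                        # metrics scraped from the same pod
        config:
          scrape_configs:
            - job_name: vllm
              scrape_interval: 15s
              static_configs:
                - targets: ['localhost:8000']   # assumed vLLM metrics port
    processors:
      batch: {}
    exporters:
      otlp:
        # Assumed address of a central collector named "otel-collector"
        endpoint: otel-collector-collector.observability-hub.svc.cluster.local:4317
        tls:
          insecure: true                 # in-cluster testing only
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
        metrics:
          receivers: [prometheus]
          processors: [batch]
          exporters: [otlp]
```

The annotation goes on the pod template rather than on the Deployment's own metadata, because the operator injects the sidecar when pods are created.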
### Grafana

This deploys a Grafana instance along with Prometheus and Tempo DataSources.
The Prometheus datasource is the user-workload-monitoring Prometheus running in the `openshift-user-workload-monitoring` namespace.
The Grafana console is configured with `username: rhel, password: rhel`.

```bash
cd grafana
./deploy-grafana.sh
```
Upon success, you can explore metrics and traces from the Grafana route.

#### GrafanaDashboard to visualize cluster metrics and traces

Check out [github.com/kevchu3/openshift4-grafana](https://github.com/kevchu3/openshift4-grafana/tree/master/dashboards/crds) for a list of
dashboards to deploy on OpenShift.

Here's an example to download and deploy a GrafanaDashboard for OpenShift 4.16 cluster metrics.
The dashboard is slightly modified from https://github.com/kevchu3/openshift4-grafana/blob/master/dashboards/json_raw/cluster_metrics.ocp416.json

```bash
oc apply -n observability-hub -f cluster-metrics-dashboard/cluster-metrics.yaml
```

### Cluster Observability Operator Tracing UIPlugin

The Jaeger frontend feature of TempoStack is no longer supported by Red Hat. It has been replaced by the COO UIPlugin. Before creating the UIPlugin for
Tracing, ensure that the TempoStack described above has been created, since it is a prerequisite. Then, all that's necessary to view traces from
the OpenShift console at `Observe -> Traces` is to create the following [Tracing UIPlugin resource](./tracing-ui-plugin.yaml).

```bash
oc apply -f ./tracing-ui-plugin.yaml
```
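For reference, a Tracing UIPlugin resource is typically very small. The sketch below shows the general shape under the assumption that the COO's `observability.openshift.io/v1alpha1` API is installed. It is not copied from `tracing-ui-plugin.yaml`, which may set additional fields.

```yaml
# Sketch only: assumes the Cluster Observability Operator is installed.
apiVersion: observability.openshift.io/v1alpha1
kind: UIPlugin
metadata:
  name: distributed-tracing   # the operator expects this name for the tracing plugin
spec:
  type: DistributedTracing
```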
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
kind: GrafanaDashboard
apiVersion: grafana.integreatly.org/v1beta1
metadata:
  name: cluster-metrics
  labels:
    app: grafana
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana # This label matches the grafana Grafana instance
  # This json was copied and modified from https://github.com/kevchu3/openshift4-grafana/blob/master/dashboards/json_raw/cluster_metrics.ocp416.json
  url: https://raw.githubusercontent.com/redhat-et/edge-ocp-observability/refs/heads/main/observability-hub/grafana/cluster-metrics-dashboard/cluster_metrics_ocp.json
