Description
Hello everyone! Currently, I'm a Trainee under the CNCF's Community Bridge program, working for the KubeVirt community.
My project is to improve KubeVirt's observability by implementing new metrics that represent what is going on in a KubeVirt environment. After I got a little more comfortable with Open Source development, I also started looking for other projects that need help with observability and found Flux.
I've noticed that the first step before implementing anything in open source is to write a proposal, so other people can comment, share their thoughts, and suggest changes based on best practices. However, when discussing new metrics, we can't seem to reach an agreement on some points, mostly because we cannot find a curated guideline for the issues I will present below.
This might be a really extensive issue, but I'd like to provide as much information as possible to present my thoughts 🤔
Hopefully, I won't bore you guys too much 😬
Metrics granularity
Let me give an example. Flux can synchronize Kubernetes manifests in a git repository with a Kubernetes cluster. To do so, when configuring Flux, one needs to specify which remote git repository, which branch, and which directory paths Flux will look at when doing that synchronization.
Flux has a metric called `flux_daemon_sync_manifests` with a label `success=true|false`, which indicates the number of manifests being synchronized between the repository and the cluster and whether the sync is successful or not.
When working with a single git repository, this may be enough, but let's say that someone is using Flux to sync multiple repositories with a multi-cluster/multi-cloud environment. It will be hard to pinpoint exactly where the problem is when receiving an alert on `flux_daemon_sync_manifests{success="false"} > 1`.
To solve this, we can take one of two approaches:
- Tell users to use different Prometheus/Alertmanager servers for each Flux deployment
- Add new labels to flux metrics that will help to pinpoint problems and set up better alerts.
I think everyone agrees that we should stick with the latter. But how much information is too much?
The most obvious labels are `git_repository`, `git_branch`, and `path`, but we could also add some not so obvious ones like `manifest`, which would tell exactly which file failed/succeeded the synchronization. We could also split the metric into `flux_daemon_sync_manifests_fails` and `flux_daemon_sync_manifests_success` and add an `error` label to the first one indicating what error was encountered when doing the sync.
Of course, adding new labels always helps us build more detailed monitoring solutions, but every extra label also increases cardinality, which means more storage capacity and possibly more performance issues with PromQL (I'm not sure about the last one).
I'm almost sure that a user can drop unnecessary labels from metrics at scrape time, but I don't think that justifies developers adding every label that could be useful to every single use case.
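To make the label discussion concrete, here is a minimal sketch of what the proposal could look like with Prometheus' Go client. The metric and label names just mirror the ones discussed above; this is not Flux's actual implementation:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical version of the sync metric with the extra identifying labels.
var syncedManifests = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "flux_daemon_sync_manifests",
		Help: "Number of manifests synchronized between the git repository and the cluster.",
	},
	[]string{"git_repository", "git_branch", "path", "success"},
)

func main() {
	// Example observation: one manifest under ./deploy failed to sync.
	syncedManifests.WithLabelValues(
		"git@github.com:example/infra.git", "master", "./deploy", "false",
	).Inc()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With labels like these, an alert such as `flux_daemon_sync_manifests{success="false", git_repository="git@github.com:example/infra.git"} > 1` already tells the operator which repository is misbehaving, at the cost of one extra time series per label combination.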
Differentiating between what should be a metric and what should be a label
I could use the same example as above. Should `flux_daemon_sync_manifests` be a single metric with the `success` label? Or should it be split in two: `flux_daemon_sync_manifests_fails` and `flux_daemon_sync_manifests_success`?
But let me give you another example:
KubeVirt is capable of deploying Virtual Machines on top of a Kubernetes cluster, and it exposes metrics regarding the VMs' performance and resource usage.
Looking at node exporter's approach with disk metrics, it exposes one metric for `read` operations and another one for `write` operations. For example:
- `node_disk_ops_reads` - the total number of read operations from a disk device
- `node_disk_ops_written` - the total number of write operations on a disk device

KubeVirt's approach, on the other hand, is to expose a single disk metric, but with a `type=read|write` label:
- `kubevirt_vmi_storage_iops_total` - the total number of operations on a disk device, with a label to differentiate read and write ops
Both approaches work just fine, but which one is better? Do they have any differences performance-wise?
Every time a developer knows that a given label's value will ALWAYS be within a pre-defined set of values, the developer can choose whether to implement several metrics or just a single one with an extra identifying label.
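Here's a small sketch (assuming Prometheus' Go client, with made-up `example_*` metric names) of what the two modeling options look like side by side:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Option 1 (node exporter style): one metric per operation type.
var (
	diskReads = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_disk_ops_reads_total",
		Help: "Total read operations on a disk device.",
	})
	diskWrites = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_disk_ops_writes_total",
		Help: "Total write operations on a disk device.",
	})
)

// Option 2 (KubeVirt style): a single metric with a bounded `type` label.
var diskOps = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "example_storage_iops_total",
	Help: "Total operations on a disk device, split by type.",
}, []string{"type"})

func main() {
	prometheus.MustRegister(diskReads, diskWrites, diskOps)

	diskReads.Inc()                       // option 1: series are distinguished by metric name
	diskOps.WithLabelValues("read").Inc() // option 2: series are distinguished by the label value
	diskOps.WithLabelValues("write").Inc()
}
```

Both end up producing the same number of time series; the main practical difference shows up on the query side, where the labeled version can be aggregated with `sum by` or filtered with a label matcher, while the split version needs either two queries or a `__name__` regex match.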
How to work with historical data
To better explain this one, I guess I will have to show you the problem I'm facing with KubeVirt.
As previously said, KubeVirt deploys VMs on top of Kubernetes. KubeVirt can also migrate VMs between nodes, which is necessary when a node becomes slow or unresponsive. Virtual Machine Instance (VMI) and Virtual Machine Instance Migration (VMIM) are implemented as Kubernetes Custom Resources.
A VMIM carries some useful information, like Start and End timestamps, the target node the VMI is being migrated to, the source node the VMI is being migrated from, and which migration method is being used, and it can be in different stages: `Succeeded`, `Running`, and `Failed`.
Every VMI and VMIM that is posted to the K8s API is stored in etcd, and as long as the VMI still exists in the cluster, we can retrieve its data and expose it easily. Once we delete a VMI, the VMI and all VMIMs related to it are deleted from etcd, and then we can't expose information about them anymore.
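To illustrate why the information disappears along with the object, here's a simplified, hypothetical collector (this is not KubeVirt's actual code, and the metric/label names are made up) that rebuilds its series from whatever VMIM objects still exist at scrape time:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// migration is a stand-in for the fields we can read from a VMIM object
// while it still exists in the cluster.
type migration struct {
	vmiName, sourceNode, targetNode, phase string
	durationSeconds                        float64
}

// migrationCollector re-reads the currently existing VMIM objects on every
// scrape. Once a VMIM is deleted from etcd, it simply stops showing up here,
// which is why its series vanishes from the /metrics output.
type migrationCollector struct {
	list func() []migration // in a real setup this would be backed by an informer cache
	desc *prometheus.Desc
}

func newMigrationCollector(list func() []migration) *migrationCollector {
	return &migrationCollector{
		list: list,
		desc: prometheus.NewDesc(
			"kubevirt_vmi_migration_duration_seconds", // hypothetical metric name
			"Duration of a VMI migration, one series per existing VMIM object.",
			[]string{"vmi", "source_node", "target_node", "phase"}, nil,
		),
	}
}

func (c *migrationCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *migrationCollector) Collect(ch chan<- prometheus.Metric) {
	for _, m := range c.list() {
		ch <- prometheus.MustNewConstMetric(
			c.desc, prometheus.GaugeValue, m.durationSeconds,
			m.vmiName, m.sourceNode, m.targetNode, m.phase,
		)
	}
}
```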
However, users want to analyze and correlate problems from previous VMI migrations with the existing ones, so they can identify why some migrations are taking more time than others and why they fail or succeed.
Let me try to explain it like this:
- Each row is a time-series for a VMI migration metric
- Each column represents 1h in the timeline
- Let's assume that the Prometheus server was configured to keep samples in the HEAD block for 3h before persisting them to the TSDB on disk
- `o` represents that the metric was collected at that particular moment in time
- `-` represents that the metric was not collected at that particular moment in time
- `[ ]` represents where the Prometheus HEAD is pointed to for a particular time series
We can either:
- Keep VMIM objects in etcd and always expose old migration metrics
  - Requires more disk space/memory for both metric storage and etcd
  - Keeps every migration in memory, thus easy to analyze
/\
| o o o o o o o [o] #migration 1 UID=1
| o o o o o o o [o] #migration 2 UID=2
| o o o o o o o [o] #migration 3 UID=3 (last one for that particular VMI)
------------------------->
Or
- Remove migration information about old VMIs from etcd
  - Requires less storage capacity
  - Will have to deal with historical data on the Prometheus side
/\
| o o o - - - - - #migration 1 UID=1
| - - - o o [o] - - #migration 2 UID=2
| - - - - - - o [o] #migration 3 UID=3 (last one for that particular VMI)
------------------------->
Let's say that I want to create a dashboard with information about all migrations that have ever happened within my cluster. With the first approach, a simple query like `kubevirt_vmi_migration_metric_example` would be enough, since everything is in memory.
Once a time series is removed from Prometheus' HEAD, I will have to work with queries over time ranges, most probably with remote storage like Thanos or InfluxDB as well. These queries will not return single metric values anymore, but rather metric vectors, which have to be treated differently. It's not an impossible thing to do, but it surely must be thought through carefully.
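For reference, here's a rough sketch of what such a range query could look like with Prometheus' Go API client (the endpoint address and the metric name are just placeholders from the example above; against Thanos the query would go to its Query frontend instead):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Query the last 30 days in 1h steps instead of just the current value.
	result, warnings, err := promAPI.QueryRange(ctx, "kubevirt_vmi_migration_metric_example", v1.Range{
		Start: time.Now().Add(-30 * 24 * time.Hour),
		End:   time.Now(),
		Step:  time.Hour,
	})
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("warnings:", warnings)
	}

	// A range query returns a matrix: one stream of (timestamp, value) samples
	// per series, which has to be iterated instead of read as a single value.
	matrix, ok := result.(model.Matrix)
	if !ok {
		log.Fatalf("unexpected result type %T", result)
	}
	for _, series := range matrix {
		fmt.Printf("%s has %d samples\n", series.Metric, len(series.Values))
	}
}
```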
I'm sure there are good solutions for everything I'm bringing up in this issue, and I'm also sure there are several other problems that I couldn't think of right now. What I'm asking for is a centralized place with documentation and guidelines for developers who are trying to improve their applications' observability.
Perhaps it could be study material for a future Monitoring and Observability Certification, by CNCF. 👀
But anyway, this would greatly help anyone who is writing proposals for Open Source projects or anyone who is trying to follow CNCF's guidelines for Cloud Native Observability.