Proposal about observability #8841

Open
@YunWZ

Proposal about observability

Background

Although Nacos has thorough logging and some metrics monitoring, this data can only be viewed in the Nacos console or in log files and cannot be integrated seamlessly with existing monitoring and operations platforms (for example, Prometheus). In addition, Nacos is usually deployed on multiple nodes for high availability, which makes locating problems and troubleshooting failures harder.

It is therefore necessary to improve the observability of Nacos and give developers metrics viewing, monitoring and alerting, and distributed tracing.

Intent

Introduce OpenTelemetry (hereinafter OTel) into Nacos and build on it to provide metrics, end-to-end tracing, logs, and the corresponding data export, so that developers can feed Nacos telemetry into different monitoring and operations platforms.

Why choose OpenTelemetry?

In the observability space, besides OpenTelemetry there were OpenTracing and OpenCensus; the latter two merged into a single project in May 2019, and that project is OpenTelemetry.

OpenTelemetry therefore combines the strengths of both.

Benefits of using OpenTelemetry in Nacos:

  1. OTel provides SDKs in multiple languages (Java, Go, Python, C++, and others). These cover the languages already supported by the Nacos server and clients (Node.js and C# are not covered for now) as well as other languages, leaving room for future Nacos clients in more languages;
  2. Telemetry data can be exposed over HTTP or gRPC, with both push and pull models supported;
  3. Telemetry export supports mainstream monitoring platforms such as Prometheus, Zipkin, and Jaeger, and data can also be exposed over OTLP (OpenTelemetry Protocol);
  4. For tracing, multiple distributed trace header formats are supported.

Implementation

End-to-end tracing (distributed tracing)

  1. Introduce the OTel SDK in the client and server to produce trace data for HTTP/gRPC requests and responses, and store it in the context;
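
To make the "store it in the context" step concrete, here is a minimal sketch of what travels between client and server, assuming the W3C Trace Context `traceparent` header is used (one of the header formats OTel can propagate). It relies only on the JDK; `TraceParentDemo`, `SpanContext`, and `childOf` are hypothetical names for illustration and are not Nacos or OTel SDK APIs. In a real integration the OTel SDK's propagators would do this work.

```java
import java.security.SecureRandom;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative helper for the W3C "traceparent" header; not part of Nacos or the OTel SDK. */
final class TraceParentDemo {
    // Layout: version "00" - 32 hex trace id - 16 hex span id - 2 hex trace flags
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Hypothetical value object holding the ids carried by one request. */
    static final class SpanContext {
        final String traceId;
        final String spanId;
        final boolean sampled;

        SpanContext(String traceId, String spanId, boolean sampled) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.sampled = sampled;
        }

        /** Serializes the context back into a traceparent header value. */
        String toHeader() {
            return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
        }
    }

    /** Parses an incoming header; empty if it is malformed or uses all-zero ids. */
    static Optional<SpanContext> parse(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        if (traceId.matches("0+") || spanId.matches("0+")) {
            return Optional.empty(); // all-zero ids are invalid in the W3C format
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) != 0;
        return Optional.of(new SpanContext(traceId, spanId, sampled));
    }

    /** Creates the context for an outgoing call: same trace id, fresh span id. */
    static SpanContext childOf(SpanContext parent) {
        return new SpanContext(parent.traceId, randomHex(8), parent.sampled);
    }

    private static String randomHex(int byteCount) {
        byte[] bytes = new byte[byteCount];
        RANDOM.nextBytes(bytes);
        StringBuilder sb = new StringBuilder(byteCount * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
        parse(incoming).ifPresent(parent ->
                System.out.println("outgoing traceparent: " + childOf(parent).toHeader()));
    }
}
```

The server would parse the header on each HTTP/gRPC request, and any downstream call would carry a child context with the same trace id, which is what lets a request be followed across nodes.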

Metrics

The server already records metrics, but it does so with Micrometer; this instrumentation needs to be migrated to OTel.

Logs

  1. Add trace information (traceId, spanId, and so on) to the existing log output.

NOTE: The logging part of the OpenTelemetry specification is still experimental, so logs will not be integrated with OpenTelemetry for now.
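
As a rough illustration of adding trace information to log lines without routing logs through OTel, the sketch below uses a `java.util.logging` formatter that prefixes each message with the current ids. The `CURRENT` thread-local holder and the output layout are assumptions made for this example; Nacos's actual logging setup and the way the active span is looked up would differ.

```java
import java.util.logging.Formatter;
import java.util.logging.Level;
import java.util.logging.LogRecord;

/** Illustrative only: shows trace ids being stamped onto existing log lines. */
final class TraceLogDemo {
    /**
     * Hypothetical per-thread holder of {traceId, spanId}. A real integration
     * would read these from the active span instead.
     */
    static final ThreadLocal<String[]> CURRENT = new ThreadLocal<>();

    /** Formatter that prepends traceId/spanId, or "-" when no trace is active. */
    static final class TraceFormatter extends Formatter {
        @Override
        public String format(LogRecord record) {
            String[] ids = CURRENT.get();
            String traceId = ids == null ? "-" : ids[0];
            String spanId = ids == null ? "-" : ids[1];
            return String.format("%s [traceId=%s spanId=%s] %s%n",
                    record.getLevel(), traceId, spanId, formatMessage(record));
        }
    }

    public static void main(String[] args) {
        CURRENT.set(new String[] {"4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"});
        try {
            LogRecord record = new LogRecord(Level.INFO, "instance registered");
            System.out.print(new TraceFormatter().format(record));
        } finally {
            CURRENT.remove(); // avoid leaking ids into unrelated work on pooled threads
        }
    }
}
```

With ids on every line, whatever tool collects the log files can group the entries that belong to one request, even when they come from different nodes.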

Telemetry data export

This part can be implemented with the export functionality that OTel already provides.

Trace data export

  1. Support export over OTLP;
  2. Support export to Jaeger;
  3. Support export to Zipkin;

Metrics export

  1. Support export over OTLP;
  2. Support export to Prometheus;
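
One possible way to select these exporters, assuming the OTel Java SDK's autoconfigure module is used, is through its standard configuration properties. The proposal does not say how Nacos would expose this configuration, and the service name and endpoint below are placeholder assumptions.

```properties
# Assumed service name; not defined by the proposal
otel.service.name=nacos-server

# Traces: export over OTLP (a Zipkin deployment could use "zipkin" instead)
otel.traces.exporter=otlp
# Assumed collector address (4317 is the conventional OTLP/gRPC port)
otel.exporter.otlp.endpoint=http://localhost:4317

# Metrics: expose a Prometheus scrape endpoint (pull model)
otel.metrics.exporter=prometheus
otel.exporter.prometheus.port=9464

# Logs stay file-based for now, so no OTel log exporter
otel.logs.exporter=none
```

The same keys can also be supplied as environment variables (for example `OTEL_TRACES_EXPORTER`), which may suit containerized multi-node deployments better.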

Log data export

Logs continue to be written to files as they are today and are collected by developers through other means (for example, ELK).
