Description
Proposal about observability
Background
Although Nacos has comprehensive logging and some metrics monitoring, this data can only be viewed in the Nacos console or in log files, and it cannot be integrated seamlessly with existing monitoring and operations platforms (for example, Prometheus). In addition, Nacos is often deployed across multiple nodes for high availability, which makes locating problems and troubleshooting faults harder.
It is therefore necessary to improve the observability of Nacos and give developers metrics viewing, monitoring and alerting, and distributed tracing.
Intent
Introduce OpenTelemetry (hereinafter OTel) into Nacos and use it to provide metrics, end-to-end tracing, logs, and the corresponding data export, so that developers can connect Nacos telemetry data to different monitoring and operations platforms.
Why choose OpenTelemetry?
In the observability field, besides OpenTelemetry there were also OpenTracing and OpenCensus, and the latter two merged into a single project in May 2019. (That project is *OpenTelemetry*!)
OpenTelemetry therefore inherits the strengths of both.
Benefits of using OpenTelemetry in Nacos:
- OTel provides SDKs in multiple languages (Java, Go, Python, C++, and others). These cover the languages that the Nacos server and clients already support (Node.js and C# are not supported for now) as well as other languages, which leaves room for future Nacos clients in more languages;
- Telemetry data can be exposed over HTTP and gRPC, with both push and pull models supported;
- Telemetry data can be exported to mainstream monitoring platforms such as Prometheus, Zipkin, and Jaeger, or exposed through OTLP (OpenTelemetry Protocol);
- For tracing, multiple distributed trace header formats are supported:
- "tracecontext": W3C Trace Context (add baggage as well to include W3C baggage)
- "baggage": W3C Baggage
- "b3": B3 Single
- "b3multi": B3 Multi
- "jaeger": Jaeger (includes Jaeger baggage)
- "xray": AWS X-Ray
- "ottrace": OT Trace
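To make the header formats above concrete, here is a minimal sketch that writes the same trace and span ids as a W3C `traceparent` header and as a B3 single header, and parses the former back. It uses only the Python standard library rather than the OTel SDK; the function names are illustrative assumptions, and only `traceparent` version `00` is handled.

```python
import re
import secrets

# Version-00 traceparent: 00-<32 hex trace id>-<16 hex parent/span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_ids():
    """Return a random 16-byte trace id and 8-byte span id as lowercase hex."""
    return secrets.token_hex(16), secrets.token_hex(8)


def to_traceparent(trace_id, span_id, sampled=True):
    """Format the W3C Trace Context `traceparent` header (version 00)."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def to_b3_single(trace_id, span_id, sampled=True):
    """Format the B3 single header: {TraceId}-{SpanId}-{SamplingState}."""
    return f"{trace_id}-{span_id}-{'1' if sampled else '0'}"


def parse_traceparent(value):
    """Parse a version-00 `traceparent` header; return None if it is malformed."""
    match = TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero ids are invalid in W3C Trace Context.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # The lowest flag bit is the "sampled" flag.
    return {"trace_id": trace_id, "span_id": span_id, "sampled": int(flags, 16) & 1 == 1}
```

In a real integration the OTel propagators would do this work; the sketch only shows that the formats carry the same identifiers in different layouts.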
How to implement
End-to-end tracing (distributed tracing)
- Introduce the OTel SDK into the client and server to produce Trace data for HTTP/gRPC requests and responses, and store it in the context;
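The propagation step described here can be sketched as follows. This is a standard-library illustration of the idea, not the OTel SDK API: `SpanContext`, `extract`, `start_span`, and `inject` are hypothetical names, the context is kept in a `contextvars.ContextVar`, and only the `traceparent` header is considered.

```python
import contextvars
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None


# Holds the span that is active for the current request or task.
_current_span = contextvars.ContextVar("current_span", default=None)


def extract(headers):
    """Read the caller's context from a `traceparent` header, if present."""
    value = headers.get("traceparent")
    if not value:
        return None
    parts = value.split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    return SpanContext(trace_id=parts[1], span_id=parts[2])


def start_span(parent=None):
    """Start a span: reuse the parent's trace id, or begin a new trace."""
    parent = parent or _current_span.get()
    span = SpanContext(
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_span_id=parent.span_id if parent else None,
    )
    _current_span.set(span)  # save to context
    return span


def inject(headers):
    """Write the active span into outgoing request headers."""
    span = _current_span.get()
    if span is not None:
        headers["traceparent"] = f"00-{span.trace_id}-{span.span_id}-01"
    return headers
```

A client would call `start_span()` and `inject(headers)` before sending a request; the server would call `start_span(extract(headers))` on receipt, so that both spans share one trace id.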
Metrics
The server already records metrics, but it does so with Micrometer, so this part needs to be migrated to OTel.
Logs
- Add Trace information (traceId, spanId, and so on) to the existing log messages.
NOTE: The logging part of the OpenTelemetry specification is still experimental, so logs will not be connected to OpenTelemetry for now.
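Adding trace information to existing log messages can be illustrated with a small sketch. It uses Python's standard `logging` module as a stand-in for the real logging framework; `current_trace`, `TraceContextFilter`, `build_logger`, and the log layout are assumptions made for illustration.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active trace context: (trace_id, span_id) or None.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Attach traceId/spanId to every record so the formatter can print them."""

    def filter(self, record):
        trace_id, span_id = current_trace.get() or ("-", "-")
        record.trace_id = trace_id
        record.span_id = span_id
        return True


def build_logger(stream):
    """Create a logger whose lines carry the trace context of the caller."""
    logger = logging.getLogger("nacos.demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [traceId=%(trace_id)s spanId=%(span_id)s] %(message)s")
    )
    handler.addFilter(TraceContextFilter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
    log.info("config published")
    print(buffer.getvalue(), end="")
```

Because the ids are written into ordinary log lines, a file-based collector can later correlate logs with traces without any OTel logging support.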
Telemetry data export
This part can be implemented with the functionality that OTel provides.
Trace data export
- Support export over the OTLP protocol;
- Support export to Jaeger;
- Support export to Zipkin;
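As an illustration of what a trace export carries, the sketch below builds a span in the Zipkin v2 JSON shape and encodes a batch as the JSON array that Zipkin's `/api/v2/spans` endpoint accepts. In practice an OTel exporter would produce this payload; the helper names, the service name, and the tag values are hypothetical.

```python
import json


def to_zipkin_span(trace_id, span_id, parent_id, name, start_us, duration_us,
                   service_name, tags=None):
    """Build one span as a Zipkin v2 JSON object (times are in microseconds)."""
    span = {
        "traceId": trace_id,
        "id": span_id,
        "name": name,
        "kind": "SERVER",
        "timestamp": start_us,      # epoch microseconds
        "duration": duration_us,    # microseconds
        "localEndpoint": {"serviceName": service_name},
        "tags": dict(tags or {}),   # string keys and string values
    }
    if parent_id:
        span["parentId"] = parent_id
    return span


def encode_batch(spans):
    """Encode spans as the JSON array expected in the request body."""
    return json.dumps(spans, separators=(",", ":")).encode("utf-8")
```

The body returned by `encode_batch` would be sent with `Content-Type: application/json`; sending it is left out so the sketch stays free of network calls.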
Metrics data export
- Support export over the OTLP protocol;
- Support export to Prometheus;
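For the Prometheus case, the data ends up in the Prometheus text exposition format that a scrape endpoint returns. The sketch below renders a counter in that format using only the standard library; the metric name `nacos_http_requests_total` and its labels are hypothetical, and a real setup would rely on the OTel Prometheus exporter instead.

```python
def escape_label_value(value):
    """Escape backslash, newline, and double quote as the text format requires."""
    return value.replace("\\", "\\\\").replace("\n", "\\n").replace('"', '\\"')


def render_counter(name, help_text, samples):
    """Render a counter; `samples` is a list of (labels dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(str(val))}"'
            for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"
```

For example, `render_counter("nacos_http_requests_total", "Total HTTP requests.", [({"method": "GET", "code": "200"}, 3)])` yields a `# HELP` line, a `# TYPE` line, and one sample line with the labels sorted by name.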
Log data export
Logs are written to files as they are today, and developers collect them by other means (for example, ELK).