…tricManager code structure
- Adjusted naming conventions in .clang-tidy for better readability and consistency.
- Refactored AlarmManager to enhance code clarity and maintainability, including changes to method structures and variable handling.
- Improved MetricManager by refining metric event handling and ensuring proper memory management.
- Updated SelfMonitorServer to utilize move semantics for efficiency in metric event processing.
- Cleaned up includes and removed unnecessary dependencies across various files for better organization.

- Removed unnecessary includes and dependencies in Monitor.cpp, Monitor.h, SelfMonitorServer.cpp, and ProfileSender.cpp.
- Simplified ProfileSender's handling of profile project names by replacing FlusherSLS with a string map for region project names.
- Enhanced unit tests in OnetimeConfigUpdateUnittest to ensure accurate expiration time validation and improved readability.
- Overall code organization and clarity improvements across multiple files.

- Introduced mechanisms for writing alarms to a disk buffer during startup when the alarm pipeline is not ready.
- Added methods to read alarms from the disk buffer and process them into PipelineEventGroups.
- Implemented logic to manage the alarm disk buffer file, including checks for file size limits and time windows for writing.
- Enhanced unit tests for AlarmManager to validate the new disk buffer functionality and ensure no data loss or duplication occurs.
- Updated AlarmManager and SelfMonitorServer to integrate the new alarm handling logic effectively.

…e clarity
- Changed the maximum size for the alarm disk buffer file from MB to bytes for better precision.
- Simplified the iteration over alarm messages using structured bindings for improved readability.
- Updated logging to reflect the new size unit in error messages, enhancing clarity in file size limits.

- Replaced the Provider include with FlusherSLS to streamline alarm processing.
- Added instance_id and hostname to alarm event content for better traceability.
- Updated logic to handle multiple projects when sending alarms, ensuring all relevant regions are notified.
- Enhanced alarm key construction to include additional metadata such as IP, OS, version, instance_id, and hostname.
- Improved unit tests to validate the new alarm handling logic and ensure accurate state management during tests.

- Introduced a set to track unique regions for alarm notifications, ensuring each region receives the alarm only once.
- Updated the SendAlarm method to iterate over distinct regions instead of projects, improving efficiency and clarity in alarm handling.
- Removed redundant comments and cleaned up method declarations for better code organization.
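Design Document: Alarm Reporting Path During Startup
1. Overview
1.1 Background
During startup, LoongCollector may raise alarms before the AlarmPipeline is fully ready. Because the alarm sending mechanism depends on:
As a result, alarms raised early in startup (while the Pipeline is not yet ready) cannot be sent normally, and important alarm information may be lost.
1.2 Solution
This change implements a startup-phase disk buffer for alarms. Alarms raised before the AlarmPipeline is ready are persisted to disk, then automatically restored and sent once the Pipeline is ready. The mechanism has the following characteristics:
1.3 Core components
2. Flowchart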
graph TB
    Start([System startup]) --> Init[AlarmManager initialization<br/>mAlarmPipelineReady = false<br/>File path: alarm_disk_buffer.json]
    Init --> ReceiveAlarm[Receive alarm: SendAlarm]
    ReceiveAlarm --> CheckPipeline{Is AlarmPipeline<br/>ready?}
    CheckPipeline -->|Ready| WriteBuffer[Write to in-memory buffer<br/>deduplicate and count]
    CheckPipeline -->|Not ready| CheckLevel{Does the alarm level<br/>meet the condition?}
    CheckLevel -->|No| Discard1[Discard alarm]
    CheckLevel -->|Yes| CheckWindow{Within the startup<br/>window?}
    CheckWindow -->|Yes| CheckSize{Does the file size<br/>exceed the limit?}
    CheckWindow -->|No| Discard2
    CheckSize -->|Yes| Discard2[Stop writing<br/>log an error]
    CheckSize -->|No| OpenFile[Open the file handle on first write]
    OpenFile --> WriteFile[Write to disk file<br/>alarm_disk_buffer.json]
    Discard2 --> CloseFile1[Close file handle]
    WriteBuffer --> End1([Alarm handling complete])
    WriteFile --> End1
    Discard1 --> End1
    CloseFile1 --> End1
    PipelineReady[AlarmPipeline ready] --> SendAlarms[SelfMonitorServer::SendAlarms]
    SendAlarms --> CheckFirst{First send?<br/>CheckAndSetAlarmPipelineReady}
    CheckFirst -->|Yes, returns false| CloseFile2[Close write file handle<br/>ensure data is flushed]
    CloseFile2 --> ReadFile[Read disk file<br/>AlarmManager::ReadAlarmsFromFile<br/>group by key and sum count<br/>also return the raw JSON strings]
    ReadFile --> ParseFile[Parse JSON file<br/>build PipelineEventGroup<br/>keep raw JSON grouped by region]
    ParseFile --> SendFile[Send alarms from the file]
    SendFile --> LogError{Send failed?}
    LogError -->|Yes| RecordError[Log an error<br/>including the raw JSON string]
    LogError -->|No| Continue[Continue with the next group]
    RecordError --> Continue
    Continue --> DeleteFile[Delete disk file<br/>AlarmManager::DeleteAlarmFile]
    DeleteFile --> SendBuffer[Send alarms from the in-memory buffer]
    CheckFirst -->|No, returns true| SendBuffer
    SendBuffer --> FlushBuffer[FlushAllRegionAlarm<br/>fetch alarms from the in-memory buffer]
    FlushBuffer --> PushQueue[Push to the processing queue]
    PushQueue --> End2([Send complete])
    style WriteFile fill:#e1f5ff
    style ReadFile fill:#fff4e1
    style SendFile fill:#e8f5e9
    style Discard1 fill:#ffebee
    style Discard2 fill:#ffebee
    style OpenFile fill:#f3e5f5
    style CloseFile1 fill:#f3e5f5
    style CloseFile2 fill:#f3e5f5

3. Design advantages
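To make the properties in this section concrete, the write-side branches of the flowchart in section 2 can be sketched as a single decision function plus an atomic check-and-set. This is a minimal sketch, not the project's implementation: `StartupAlarmGate`, `StartupAlarmLimits`, `StartupAlarmAction`, and the comparison directions (level at or above the minimum, size strictly above the limit) are assumptions made for illustration. Only `mAlarmPipelineReady`, `CheckAndSetAlarmPipelineReady`, and the three flag names come from this document.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical names for the outcomes of the write-side branches.
enum class StartupAlarmAction {
    kBufferInMemory,  // pipeline ready: normal in-memory path
    kWriteToDisk,     // pipeline not ready: append to alarm_disk_buffer.json
    kDiscard,         // alarm level does not meet the condition
    kStopWriting      // outside the window or file too large: log and close the handle
};

// Hypothetical holder for the three flags listed in section 4.
// Types and any concrete values are assumptions.
struct StartupAlarmLimits {
    std::int64_t windowSeconds;  // logtail_startup_alarm_window_seconds
    std::int64_t fileMaxBytes;   // logtail_startup_alarm_file_max_size (bytes)
    int minLevel;                // logtail_startup_alarm_file_min_level
};

class StartupAlarmGate {
public:
    StartupAlarmGate(const StartupAlarmLimits& limits, std::int64_t startTimeSec)
        : mLimits(limits), mStartTimeSec(startTimeSec), mAlarmPipelineReady(false) {}

    // Marks the pipeline ready and returns the previous value.
    // The first caller sees false and is the one that drains the disk file.
    bool CheckAndSetAlarmPipelineReady() { return mAlarmPipelineReady.exchange(true); }

    // Lock-free readiness check followed by the level, window, and size checks.
    StartupAlarmAction Decide(int alarmLevel,
                              std::int64_t nowSec,
                              std::int64_t currentFileBytes) const {
        if (mAlarmPipelineReady.load()) {
            return StartupAlarmAction::kBufferInMemory;
        }
        if (alarmLevel < mLimits.minLevel) {  // assumed: higher number = more severe
            return StartupAlarmAction::kDiscard;
        }
        if (nowSec - mStartTimeSec > mLimits.windowSeconds) {
            return StartupAlarmAction::kStopWriting;
        }
        if (currentFileBytes > mLimits.fileMaxBytes) {
            return StartupAlarmAction::kStopWriting;
        }
        return StartupAlarmAction::kWriteToDisk;
    }

private:
    StartupAlarmLimits mLimits;
    std::int64_t mStartTimeSec;
    std::atomic<bool> mAlarmPipelineReady;
};
```

Using `exchange` for the readiness flag means the "first send" test and the state change happen in one atomic step, so two concurrent senders cannot both decide to drain the file.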
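3.1 Efficiency
- atomic_bool is used for lock-free state checks, avoiding lock contention
3.2 Low resource overhead
3.3 Controlled risk
3.4 Data integrity
4. Configuration parameters
- logtail_startup_alarm_window_seconds
- logtail_startup_alarm_file_max_size
- logtail_startup_alarm_file_min_level
5. File format
5.1 File naming
Format:
alarm_disk_buffer.json
File path:
{AgentDataDir}/alarm_disk_buffer.json
Note: the file path is fixed and does not include a timestamp. Across consecutive restarts, all alarms are appended to the same file; when the file is read, alarms are grouped by key and their counts are summed.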
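The read-side behaviour described in the note above (group by key, sum the counts) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code of `AlarmManager::ReadAlarmsFromFile`: the `BufferedAlarm` struct, the `MakeAlarmKey` and `GroupAlarmsByKey` helpers, the exact set of fields that form the key, and the choice to keep the latest timestamp are all hypothetical. JSON parsing is left out; the input is assumed to be already-parsed lines.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// One parsed line of alarm_disk_buffer.json (hypothetical in-memory form).
struct BufferedAlarm {
    std::string region;
    std::string alarmType;
    std::string alarmLevel;
    std::string alarmMessage;
    std::string projectName;     // optional, may be empty
    std::string category;        // optional, may be empty
    std::string config;          // optional, may be empty
    std::int64_t count = 1;      // alarm_count, "1" per written line
    std::int64_t timestamp = 0;  // Unix timestamp
};

// Assumed grouping key: every identifying field except count and timestamp.
using AlarmKey = std::tuple<std::string, std::string, std::string, std::string,
                            std::string, std::string, std::string>;

inline AlarmKey MakeAlarmKey(const BufferedAlarm& a) {
    return std::make_tuple(a.region, a.alarmType, a.alarmLevel, a.alarmMessage,
                           a.projectName, a.category, a.config);
}

// Merges lines that share a key: counts are summed and the latest
// timestamp is kept (the timestamp policy is an assumption).
inline std::map<AlarmKey, BufferedAlarm> GroupAlarmsByKey(
    const std::vector<BufferedAlarm>& records) {
    std::map<AlarmKey, BufferedAlarm> grouped;
    for (const BufferedAlarm& r : records) {
        std::pair<std::map<AlarmKey, BufferedAlarm>::iterator, bool> result =
            grouped.emplace(MakeAlarmKey(r), r);
        if (!result.second) {  // key already present: accumulate
            BufferedAlarm& merged = result.first->second;
            merged.count += r.count;
            merged.timestamp = std::max(merged.timestamp, r.timestamp);
        }
    }
    return grouped;
}
```

Because every written line carries a count of one, the same alarm appended across several restarts collapses into a single entry whose count equals the number of lines written for it.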
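5.2 JSON format
Each line holds one JSON object with the following fields:
{ "region": "cn-hangzhou", "alarm_type": "USER_CONFIG_ALARM", "alarm_level": "1", "alarm_message": "Config file not found", "alarm_count": "1", "timestamp": 1704067200, "project_name": "my-project", "category": "my-category", "config": "my-config" }
Field descriptions:
- region: required, the region the alarm belongs to
- alarm_type: required, the alarm type
- alarm_level: required, the alarm level (1=WARNING, 2=ERROR, 3=CRITICAL)
- alarm_message: required, the alarm message
- alarm_count: required, the alarm count (fixed at "1" when written; summed per key when read)
- timestamp: required, the alarm timestamp (Unix timestamp)
- project_name: optional, the project name
- category: optional, the category
- config: optional, the config name
Important notes:
- alarm_count is fixed at "1"
6. Test coverage
Unit test file:
core/unittest/monitor/AlarmDiskBufferUnittest.cpp
Test cases include:
7. Summary
The startup alarm reporting path uses a disk buffer to prevent alarms from being lost while the Pipeline is not ready. The design preserves data integrity while relying on efficient file operations, resource limits, and fault tolerance to keep alarm persistence and recovery low in overhead and low in risk.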