Pull request overview
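This PR adds the ProcessEntityRunner component for collecting process entity information, replacing the unfinished ProcessEntityCollector. It uses a hybrid mode that combines full reporting (every 3600 seconds by default) with incremental collection (every 10 seconds by default) to track process lifecycles efficiently.
Key changes:
- Add ProcessEntityRunner and the accompanying Timer event mechanism for periodic scheduling and process entity collection
- Implement process caching, change detection (new process / exited process / PID reuse), and multi-dimensional filtering (kernel threads, running time, whitelist/blacklist)
- Delete the unused ProcessEntityCollector and its unit tests to clean up old code
- Add full unit test coverage and monitoring metrics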
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
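| docs/cn/plugins/input/native/input-host-meta.md | Adds complete usage documentation for process entity collection, covering configuration parameters, collection modes, and filtering |
| core/host_monitor/entity/ProcessEntityRunner.h | Defines the core ProcessEntityRunner class and its data structures (ProcessEntityInfo, ProcessFilterConfig, etc.) |
| core/host_monitor/entity/ProcessEntityRunner.cpp | Implements the core logic: process collection, filtering, change detection, and event generation |
| core/host_monitor/entity/ProcessEntityTimerEvent.h/cpp | Implements the Timer event callback mechanism that periodically triggers process collection |
| core/plugin/input/InputHostMeta.h/cpp | Extends the InputHostMeta plugin with process entity collection settings and ProcessEntityRunner integration |
| core/unittest/host_monitor/ProcessEntityRunnerUnittest.cpp | Adds comprehensive unit tests covering collection, filtering, change detection, and other core functionality |
| core/unittest/host_monitor/ProcessEntityCollectorUnittest.cpp | Deletes the old ProcessEntityCollector unit tests |
| core/host_monitor/collector/ProcessEntityCollector.h/cpp | Deletes the unused ProcessEntityCollector implementation |
| core/monitor/metric_constants/MetricConstants.h | Adds monitoring metric definitions for ProcessEntityRunner |
| core/monitor/metric_constants/HostMonitorMetrics.cpp | Implements the monitoring metric constants |
| core/monitor/AlarmManager.h | Adds the PROCESS_ENTITY_ALARM alarm type |
| core/common/ProcParser.h/cpp | Adds parsing of the flags field and the IsKernelThread helper method |
| core/host_monitor/Constants.h | Adds the DEFAULT_HOST_NAME_LABEL constant |
| core/collection_pipeline/plugin/PluginRegistry.cpp | Adjusts the registration parameters of the InputHostMeta plugin |
| core/host_monitor/HostMonitorInputRunner.cpp | Removes the ProcessEntityCollector registration |
| core/CMakeLists.txt | Adds the host_monitor/entity subdirectory to the build system |
| core/unittest/host_monitor/CMakeLists.txt | Updates the unit test build configuration |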
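
The excerpt does not show how `IsKernelThread` is implemented. One common approach on Linux, sketched here as an assumption rather than as the PR's code, is to read the `flags` field (the ninth field of `/proc/<pid>/stat`) and test the `PF_KTHREAD` bit (`0x00200000` in the kernel's `linux/sched.h`). The helper names `ParseStatFlags` and `HasKernelThreadFlag` are hypothetical.

```cpp
#include <sstream>
#include <string>

// PF_KTHREAD as defined in the Linux kernel's include/linux/sched.h.
constexpr unsigned int kPfKthread = 0x00200000;

// Hypothetical helper: extract the flags field from one /proc/<pid>/stat line.
// The comm field (field 2) may contain spaces and parentheses, so parsing
// starts after the LAST ')' in the line.
inline bool ParseStatFlags(const std::string& statLine, unsigned int& flags) {
    const auto close = statLine.rfind(')');
    if (close == std::string::npos) {
        return false;
    }
    std::istringstream in(statLine.substr(close + 1));
    std::string state;                                        // field 3
    long long ppid = 0, pgrp = 0, session = 0, ttyNr = 0, tpgid = 0;  // fields 4-8
    unsigned long long rawFlags = 0;                          // field 9
    if (!(in >> state >> ppid >> pgrp >> session >> ttyNr >> tpgid >> rawFlags)) {
        return false;
    }
    flags = static_cast<unsigned int>(rawFlags);
    return true;
}

// Hypothetical helper: true when the PF_KTHREAD bit is set.
inline bool HasKernelThreadFlag(unsigned int flags) {
    return (flags & kPfKthread) != 0;
}
```

Checking the flag avoids heuristics such as an empty command line, which user processes can also exhibit.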
| SystemInformation systemInfo; | ||
| if (SystemInterface::GetInstance()->GetSystemInformation(systemInfo)) { | ||
| // Compute the process's running time | ||
| int64_t startTimeSec = info.startTime / SYSTEM_HERTZ + systemInfo.bootTime; | ||
| int64_t runningTime = now - startTimeSec; | ||
| if (runningTime < config.minRunningTimeSeconds) { | ||
| return false; | ||
| } |
ShouldCollectProcess calls SystemInterface::GetInstance()->GetSystemInformation(systemInfo) again. The same call has already been made in GetProcessEntityInfo (line 622), and its result is already stored in info.startTimeUnix.
Consider computing the running time directly from info.startTimeUnix:
if (config.minRunningTimeSeconds > 0) {
time_t now = time(nullptr);
int64_t runningTime = now - info.startTimeUnix;
if (runningTime < config.minRunningTimeSeconds) {
return false;
}
}
This avoids the redundant system call and improves performance.
| SystemInformation systemInfo; | |
| if (SystemInterface::GetInstance()->GetSystemInformation(systemInfo)) { | |
| // Compute the process's running time | |
| int64_t startTimeSec = info.startTime / SYSTEM_HERTZ + systemInfo.bootTime; | |
| int64_t runningTime = now - startTimeSec; | |
| if (runningTime < config.minRunningTimeSeconds) { | |
| return false; | |
| } | |
| // Use the process start time (Unix seconds) already computed in GetProcessEntityInfo | |
| int64_t runningTime = now - info.startTimeUnix; | |
| if (runningTime < config.minRunningTimeSeconds) { | |
| return false; |
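The idea in this comment, convert the start time once and reuse it in the filter, can be sketched as follows. This is a minimal illustration, not the PR's code: `ProcessTimes`, `ToStartTimeUnix`, and `MeetsMinRunningTime` are hypothetical names, and `hertz` and `bootTimeUnix` stand in for `SYSTEM_HERTZ` and `systemInfo.bootTime`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two time fields discussed in the review.
struct ProcessTimes {
    uint64_t startTicks;    // start time in clock ticks since boot
    int64_t startTimeUnix;  // start time in Unix seconds, computed once
};

// Convert ticks-since-boot to Unix seconds.
// hertz: clock ticks per second; bootTimeUnix: boot time in Unix seconds.
inline int64_t ToStartTimeUnix(uint64_t startTicks, int64_t hertz, int64_t bootTimeUnix) {
    assert(hertz > 0);
    return static_cast<int64_t>(startTicks / static_cast<uint64_t>(hertz)) + bootTimeUnix;
}

// Filter check that reuses the cached Unix start time instead of
// querying system information again. A non-positive minimum disables it.
inline bool MeetsMinRunningTime(const ProcessTimes& p, int64_t nowUnix, int64_t minRunningTimeSeconds) {
    if (minRunningTimeSeconds <= 0) {
        return true;
    }
    return nowUnix - p.startTimeUnix >= minRunningTimeSeconds;
}
```

With this split, the boot time is only needed where the process info is first populated; the filter works on plain integers.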
| // If (schedule time - steady clock) and (metric time - system time) differ by more than 60 seconds, | ||
| // the system clock has jumped | ||
| auto expectedDiff = std::chrono::duration_cast<std::chrono::seconds>(mScheduleTime - steadyClockNow).count(); | ||
| auto actualDiff = mMetricTime - systemTimeT; | ||
| | ||
| if (std::abs(expectedDiff - actualDiff) > 60) { |
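The difference computation in the clock-jump check may overflow. On lines 83-84, expectedDiff and actualDiff are both int64_t, but one is converted from std::chrono::seconds and the other comes from subtracting time_t values.
Problems:
- When mScheduleTime < steadyClockNow, expectedDiff is negative
- When mMetricTime < systemTimeT, actualDiff is also negative
- Subtracting two large, possibly negative values may overflow or behave unexpectedly
Consider comparing absolute values, or making sure both operands are positive: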
auto expectedDiff = std::chrono::duration_cast<std::chrono::seconds>(
std::chrono::abs(mScheduleTime - steadyClockNow)).count();
auto actualDiff = std::abs(static_cast<int64_t>(mMetricTime - systemTimeT));
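Another way to address the same concern, shown here only as a sketch, is to keep the signed differences, widen the time_t operands to int64_t before subtracting, and compare the drift against the threshold in both directions. `ClockJumped` and its parameters are hypothetical; the 60-second default mirrors the threshold in the hunk.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <ctime>

// Hypothetical helper: returns true when the wall clock has moved relative to
// the steady clock by more than thresholdSeconds, in either direction.
inline bool ClockJumped(std::chrono::steady_clock::time_point scheduleTime,
                        std::chrono::steady_clock::time_point steadyNow,
                        std::time_t metricTime,
                        std::time_t systemNow,
                        int64_t thresholdSeconds = 60) {
    assert(thresholdSeconds >= 0);
    // Offset of the schedule from "now" on the steady clock (may be negative).
    const int64_t expectedDiff
        = std::chrono::duration_cast<std::chrono::seconds>(scheduleTime - steadyNow).count();
    // Offset of the metric time from "now" on the wall clock (may be negative).
    const int64_t actualDiff = static_cast<int64_t>(metricTime) - static_cast<int64_t>(systemNow);
    // Both offsets are small second counts in practice, so this subtraction
    // stays far from the int64_t limits.
    const int64_t drift = expectedDiff - actualDiff;
    return drift > thresholdSeconds || drift < -thresholdSeconds;
}
```

Keeping the signs means a forward jump and a backward jump of the wall clock are both reported, while equal offsets on both clocks are not.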
core/plugin/input/InputHostMeta.cpp
Outdated
| // Read the whitelist | ||
| if (config.isMember("WhitelistPatterns") && config["WhitelistPatterns"].isArray()) { | ||
| for (const auto& pattern : config["WhitelistPatterns"]) { | ||
| if (pattern.isString()) { | ||
| mWhitelistPatterns.push_back(pattern.asString()); | ||
| } | ||
| } | ||
| } | ||
| | ||
| // Read the blacklist | ||
| if (config.isMember("BlacklistPatterns") && config["BlacklistPatterns"].isArray()) { | ||
| for (const auto& pattern : config["BlacklistPatterns"]) { | ||
| if (pattern.isString()) { | ||
| mBlacklistPatterns.push_back(pattern.asString()); | ||
| } | ||
| } | ||
| } |
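The whitelist and blacklist parsing has no error handling or logging for non-string elements. When the configuration contains elements of another type (numbers, objects, and so on), the code silently skips them (the if (pattern.isString()) checks on lines 143 and 152).
This makes configuration mistakes hard for users to notice. Consider adding a warning log: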
if (config.isMember("WhitelistPatterns") && config["WhitelistPatterns"].isArray()) {
for (const auto& pattern : config["WhitelistPatterns"]) {
if (pattern.isString()) {
mWhitelistPatterns.push_back(pattern.asString());
} else {
LOG_WARNING(sLogger,
("invalid whitelist pattern type", "expected string")
("config", mContext->GetConfigName()));
}
}
}
| void ProcessEntityRunner::PushNextTimerEvent(ProcessEntityCollectContextPtr context) { | ||
| // Check whether the system clock has jumped | ||
| if (context->CheckClockRolling()) { | ||
| context->Reset(); | ||
| } else { | ||
| auto now = std::chrono::steady_clock::now(); | ||
| std::chrono::steady_clock::time_point nextScheduleTime | ||
| = context->GetScheduleTime() + context->mIncrementalInterval; | ||
| time_t nextMetricTime = context->GetMetricTime() + context->mIncrementalInterval.count(); | ||
| | ||
| int64_t skipCount = 0; | ||
| if (now > nextScheduleTime) { | ||
| // The delay is too large; compute how many runs to skip | ||
| skipCount = (now - nextScheduleTime) / context->mIncrementalInterval; | ||
| nextScheduleTime += (skipCount + 1) * context->mIncrementalInterval; | ||
| nextMetricTime += (skipCount + 1) * context->mIncrementalInterval.count(); | ||
| | ||
| LOG_WARNING(sLogger, | ||
| ("ProcessEntity skip collect", | ||
| "may cause data loss")("config", context->mConfigName)("skip count", skipCount + 1)); | ||
| } | ||
| | ||
| context->SetTime(nextScheduleTime, nextMetricTime); | ||
| } | ||
| | ||
| // Create and push the Timer event | ||
| auto event = std::make_unique<ProcessEntityTimerEvent>(context); | ||
| Timer::GetInstance()->PushEvent(std::move(event)); | ||
| } |
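The catch-up arithmetic in the hunk above can be isolated into a small pure function, which makes it easier to test. This is a sketch under assumptions: `NextSchedule` and `AdvanceSchedule` are hypothetical names, and the interval is taken as `std::chrono::seconds`.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical result type: the next slot and how many slots were skipped.
struct NextSchedule {
    std::chrono::steady_clock::time_point scheduleTime;
    int64_t skipped;
};

// Advance by one interval; if that slot is already in the past, jump to the
// first slot strictly after `now` and report how many slots were skipped.
inline NextSchedule AdvanceSchedule(std::chrono::steady_clock::time_point lastSchedule,
                                    std::chrono::steady_clock::time_point now,
                                    std::chrono::seconds interval) {
    assert(interval.count() > 0);
    auto next = lastSchedule + interval;
    int64_t skipped = 0;
    if (now > next) {
        // Dividing one duration by another yields a truncated integer count.
        const int64_t behind = (now - next) / interval;
        skipped = behind + 1;
        next += skipped * interval;
    }
    return {next, skipped};
}
```

For example, with a 10-second interval and a runner that is 35 seconds past its last slot, the function skips 3 slots and schedules the next run at the 40-second mark.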
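The clock-jump handling is not robust enough. In PushNextTimerEvent, when a clock jump is detected, context->Reset() is called, but the code then goes on to create the Timer event without checking the result of Reset().
If a large delay is detected right after a clock jump, time management can become inconsistent. Suggestions:
- Handle clock-jump detection first: after Reset(), create the event and return immediately
- Or skip the delay computation after Reset()
Suggested change: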
if (context->CheckClockRolling()) {
context->Reset();
// After a clock jump, push the event immediately; no delay computation is needed
auto event = std::make_unique<ProcessEntityTimerEvent>(context);
Timer::GetInstance()->PushEvent(std::move(event));
return;
}
// Normal time-advance logic...
| // Clean up exited processes | ||
| std::vector<ProcessPrimaryKey> toRemove; | ||
| for (const auto& [key, detail] : cache) { | ||
| if (currentPidMap.find(key.pid) == currentPidMap.end()) { |
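The cleanup of exited processes during full collection is flawed. On lines 413-417, the code only checks whether the PID exists in currentPidMap; it does not check whether startTime matches.
Problem scenario:
- The old process exits; the cache holds (pid=100, startTime=1000)
- A new process reuses PID 100, but with startTime=2000
- currentPidMap.find(key.pid) finds pid=100, so the old process is not cleaned up
- Result: the cache holds two process entities with pid=100 at the same time
Suggested change: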
for (const auto& [key, detail] : cache) {
auto it = currentPidMap.find(key.pid);
if (it == currentPidMap.end() || it->second.startTime != key.startTime) {
// The process no longer exists, or its PID has been reused (startTime differs)
toRemove.push_back(key);
}
}
| // Clean up exited processes | |
| std::vector<ProcessPrimaryKey> toRemove; | |
| for (const auto& [key, detail] : cache) { | |
| if (currentPidMap.find(key.pid) == currentPidMap.end()) { | |
| // Clean up exited processes (including PID reuse: a different startTime marks the old process) | |
| std::vector<ProcessPrimaryKey> toRemove; | |
| for (const auto& [key, detail] : cache) { | |
| auto it = currentPidMap.find(key.pid); | |
| if (it == currentPidMap.end() || it->second.startTime != key.startTime) { |
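The PID-reuse scenario described in this thread can be reproduced with standard containers. The sketch below is illustrative only: `ProcessKey`, `ProcessCache`, `CurrentPidMap`, `FindStaleKeys`, and `PruneCache` are hypothetical stand-ins for the PR's types, with the cached detail reduced to a string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for ProcessPrimaryKey: PID plus start time, so a
// reused PID produces a different key.
struct ProcessKey {
    int pid;
    int64_t startTime;
    bool operator<(const ProcessKey& other) const {
        return pid != other.pid ? pid < other.pid : startTime < other.startTime;
    }
};

using ProcessCache = std::map<ProcessKey, std::string>;  // key -> cached detail (placeholder)
using CurrentPidMap = std::map<int, int64_t>;            // pid -> start time seen in this scan

// Collect keys whose process has exited or whose PID now belongs to a
// process with a different start time.
inline std::vector<ProcessKey> FindStaleKeys(const ProcessCache& cache, const CurrentPidMap& current) {
    std::vector<ProcessKey> stale;
    for (const auto& entry : cache) {
        const ProcessKey& key = entry.first;
        const auto it = current.find(key.pid);
        if (it == current.end() || it->second != key.startTime) {
            stale.push_back(key);
        }
    }
    return stale;
}

// Erase the stale entries and return how many were removed.
inline std::size_t PruneCache(ProcessCache& cache, const CurrentPidMap& current) {
    const std::vector<ProcessKey> stale = FindStaleKeys(cache, current);
    for (const ProcessKey& key : stale) {
        const auto erased = cache.erase(key);
        assert(erased == 1);  // every stale key was taken from the cache
        (void)erased;
    }
    return stale.size();
}
```

Collecting the keys first and erasing afterwards keeps the loop from modifying the map while iterating over it, matching the toRemove pattern in the hunk.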
Why is a separate Runner needed instead of continuing to extend the existing ProcessEntityCollector plugin?
| bool operator!=(const ProcessPrimaryKey& other) const { return !(*this == other); } | ||
| }; | ||
| | ||
| struct ProcessEntityInfo { |
I don't see Listening Ports information, and no language field has been reserved.
… can start process entity collector