feat:collect_process_entity #2491

Open
StartE wants to merge 18 commits into alibaba:main from StartE:yili/process_entity

StartE (Collaborator) commented Dec 17, 2025

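  1. Add ProcessEntityRunner to collect process entities; the collection flow follows hostMonitorRunner.
  2. Entity collection is split into full reporting (default period: 3600 s) and incremental reporting (period: 10 s).
  3. Entity attributes are split into mutable attributes (collected every full-report period) and immutable attributes (collected when the process is first discovered, then served from cache).
  4. During incremental reporting the runner performs process discovery, PID-reuse detection, and process cleanup.
  5. Process filtering: kernel processes are not collected by default, and processes with now - startTime < minRunningTimeSeconds (20 s) are filtered out by default; whitelist and blacklist filtering is supported (regex match on name or binary).
  6. Remove the stale, unused processCollector and its unit tests.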
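For readers following the thread, here is a minimal sketch of the filtering rules in item 5. `ProcInfo`, `FilterConfig`, `MatchesAny`, and `ShouldCollect` are hypothetical stand-ins, not the PR's types, and the precedence between whitelist and blacklist (whitelist first, blacklist as a veto) is an assumption; the description does not state it.

```cpp
#include <cstdint>
#include <regex>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the PR's ProcessEntityInfo.
struct ProcInfo {
    std::string name;
    std::string binary;
    int64_t startTimeUnix;  // process start time in Unix seconds
    bool isKernelThread;
};

// Hypothetical, simplified stand-in for the PR's ProcessFilterConfig.
struct FilterConfig {
    bool excludeKernelThreads = true;
    int64_t minRunningTimeSeconds = 20;
    std::vector<std::regex> whitelist;  // empty means "allow everything"
    std::vector<std::regex> blacklist;
};

// True if any pattern matches either the process name or its binary path.
inline bool MatchesAny(const std::vector<std::regex>& patterns, const ProcInfo& p) {
    for (const auto& re : patterns) {
        if (std::regex_search(p.name, re) || std::regex_search(p.binary, re)) {
            return true;
        }
    }
    return false;
}

inline bool ShouldCollect(const ProcInfo& p, const FilterConfig& cfg, int64_t nowUnix) {
    if (cfg.excludeKernelThreads && p.isKernelThread) {
        return false;
    }
    if (cfg.minRunningTimeSeconds > 0 && nowUnix - p.startTimeUnix < cfg.minRunningTimeSeconds) {
        return false;  // too young; may be a short-lived process
    }
    if (!cfg.whitelist.empty() && !MatchesAny(cfg.whitelist, p)) {
        return false;  // assumption: a non-empty whitelist is exclusive
    }
    if (MatchesAny(cfg.blacklist, p)) {
        return false;  // assumption: the blacklist vetoes whitelisted entries
    }
    return true;
}
```

With the defaults above, a process started 5 seconds ago is skipped and the same process is picked up once it has been running for at least 20 seconds.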

Copilot AI (Contributor) left a comment

Pull request overview

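This PR adds the ProcessEntityRunner component for collecting process entity information, replacing the previously unfinished ProcessEntityCollector. It combines full reporting (default 3600 seconds) with incremental collection (default 10 seconds) in a hybrid mode to track process lifecycles efficiently.

Key changes:

  • Add ProcessEntityRunner and its Timer event mechanism for scheduled process entity collection
  • Implement process caching, change detection (new / exited / PID reuse), and multi-dimensional filtering (kernel threads, running time, whitelist/blacklist)
  • Remove the unused ProcessEntityCollector and its unit tests, cleaning up old code
  • Add complete unit test coverage and monitoring metrics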
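One way to picture the hybrid mode is a single timer that fires every incremental interval and decides per tick whether a full report is due. The sketch below is an assumption about the shape of that decision, not the PR's scheduling code; `ReportSchedule`, `CollectMode`, and `Tick` are hypothetical names.

```cpp
#include <cstdint>

enum class CollectMode { Full, Incremental };

// Hypothetical helper: the caller invokes Tick() once per incremental interval
// (10 s by default in the PR description).
struct ReportSchedule {
    int64_t fullIntervalSec = 3600;  // default full-report period from the description
    int64_t lastFullReportSec = -1;  // -1 means no full report has been sent yet

    CollectMode Tick(int64_t nowSec) {
        if (lastFullReportSec < 0 || nowSec - lastFullReportSec >= fullIntervalSec) {
            lastFullReportSec = nowSec;
            return CollectMode::Full;  // refresh mutable attributes for every entity
        }
        return CollectMode::Incremental;  // only discovery, PID reuse, cleanup
    }
};
```

The first tick is always a full report, so a freshly started agent publishes every entity once before switching to incremental rounds.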

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 14 comments.

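| File | Description |
| --- | --- |
| docs/cn/plugins/input/native/input-host-meta.md | Adds full usage documentation for process entity collection, covering configuration parameters, collection modes, and filtering |
| core/host_monitor/entity/ProcessEntityRunner.h | Defines the core ProcessEntityRunner class and data structures (ProcessEntityInfo, ProcessFilterConfig, etc.) |
| core/host_monitor/entity/ProcessEntityRunner.cpp | Implements the core logic: process collection, filtering, change detection, and event generation |
| core/host_monitor/entity/ProcessEntityTimerEvent.h/cpp | Implements the Timer event callback mechanism that triggers process collection on schedule |
| core/plugin/input/InputHostMeta.h/cpp | Extends the InputHostMeta plugin with process entity collection configuration and ProcessEntityRunner integration |
| core/unittest/host_monitor/ProcessEntityRunnerUnittest.cpp | Adds comprehensive unit tests covering collection, filtering, change detection, and other core functionality |
| core/unittest/host_monitor/ProcessEntityCollectorUnittest.cpp | Removes the old ProcessEntityCollector unit tests |
| core/host_monitor/collector/ProcessEntityCollector.h/cpp | Removes the unused ProcessEntityCollector implementation |
| core/monitor/metric_constants/MetricConstants.h | Adds metric definitions for ProcessEntityRunner |
| core/monitor/metric_constants/HostMonitorMetrics.cpp | Implements the metric constant definitions |
| core/monitor/AlarmManager.h | Adds the PROCESS_ENTITY_ALARM alarm type |
| core/common/ProcParser.h/cpp | Adds flags field parsing and the IsKernelThread helper |
| core/host_monitor/Constants.h | Adds the DEFAULT_HOST_NAME_LABEL constant |
| core/collection_pipeline/plugin/PluginRegistry.cpp | Adjusts the InputHostMeta plugin registration parameters |
| core/host_monitor/HostMonitorInputRunner.cpp | Removes the ProcessEntityCollector registration |
| core/CMakeLists.txt | Adds the host_monitor/entity subdirectory to the build |
| core/unittest/host_monitor/CMakeLists.txt | Updates the unit test build configuration |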

Comment on lines 695 to 702
SystemInformation systemInfo;
if (SystemInterface::GetInstance()->GetSystemInformation(systemInfo)) {
    // Compute the process running time
    int64_t startTimeSec = info.startTime / SYSTEM_HERTZ + systemInfo.bootTime;
    int64_t runningTime = now - startTimeSec;
    if (runningTime < config.minRunningTimeSeconds) {
        return false;
    }

Copilot AI Dec 21, 2025

ShouldCollectProcess calls SystemInterface::GetInstance()->GetSystemInformation(systemInfo) a second time. GetProcessEntityInfo already makes this call (line 622), and the resulting start time is already stored in info.startTimeUnix.

Consider computing the running time directly from info.startTimeUnix:

if (config.minRunningTimeSeconds > 0) {
    time_t now = time(nullptr);
    int64_t runningTime = now - info.startTimeUnix;
    if (runningTime < config.minRunningTimeSeconds) {
        return false;
    }
}

This avoids the redundant system call and improves performance.

Suggested change
- SystemInformation systemInfo;
- if (SystemInterface::GetInstance()->GetSystemInformation(systemInfo)) {
-     // Compute the process running time
-     int64_t startTimeSec = info.startTime / SYSTEM_HERTZ + systemInfo.bootTime;
-     int64_t runningTime = now - startTimeSec;
-     if (runningTime < config.minRunningTimeSeconds) {
-         return false;
-     }
+ // Use the process start time (Unix seconds) already computed in GetProcessEntityInfo
+ int64_t runningTime = now - info.startTimeUnix;
+ if (runningTime < config.minRunningTimeSeconds) {
+     return false;
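To make the arithmetic behind this suggestion concrete, the sketch below converts a start time in clock ticks since boot into Unix seconds once, then reuses it for the running-time check. It mirrors the `info.startTime / SYSTEM_HERTZ + systemInfo.bootTime` expression quoted above; the function names and parameters are hypothetical, and the tick rate is passed in rather than read from the system.

```cpp
#include <cstdint>

// startTimeTicks: start time in clock ticks since boot, as reported in
// /proc/<pid>/stat on Linux. ticksPerSecond plays the role of SYSTEM_HERTZ.
inline int64_t ToStartTimeUnix(uint64_t startTimeTicks, int64_t ticksPerSecond, int64_t bootTimeUnix) {
    return static_cast<int64_t>(startTimeTicks / static_cast<uint64_t>(ticksPerSecond)) + bootTimeUnix;
}

// Running-time filter that works on the cached Unix start time only,
// so no system information lookup is needed at filter time.
inline bool RunsLongEnough(int64_t startTimeUnix, int64_t nowUnix, int64_t minRunningTimeSeconds) {
    if (minRunningTimeSeconds <= 0) {
        return true;  // filter disabled
    }
    return nowUnix - startTimeUnix >= minRunningTimeSeconds;
}
```

Computing the Unix start time at discovery and caching it fits the PR's split between immutable and mutable attributes, since a process start time never changes.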

Copilot uses AI. Check for mistakes.
Comment on lines 81 to 86
// If (schedule time - steady clock) and (metric time - system time) differ by more than 60 seconds,
// the system clock has jumped
auto expectedDiff = std::chrono::duration_cast<std::chrono::seconds>(mScheduleTime - steadyClockNow).count();
auto actualDiff = mMetricTime - systemTimeT;

if (std::abs(expectedDiff - actualDiff) > 60) {

Copilot AI Dec 21, 2025

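The difference computation in the clock-jump check may overflow. On lines 83-84, expectedDiff and actualDiff are both int64_t, but one is converted from std::chrono::seconds and the other comes from subtracting two time_t values.

Issues:

  1. When mScheduleTime < steadyClockNow, expectedDiff is negative
  2. When mMetricTime < systemTimeT, actualDiff is also negative
  3. Subtracting two large values that may both be negative can overflow or behave unexpectedly

Consider comparing absolute values, or making sure both operands are positive:

auto expectedDiff = std::chrono::duration_cast<std::chrono::seconds>(
    std::chrono::abs(mScheduleTime - steadyClockNow)).count();
auto actualDiff = std::abs(static_cast<int64_t>(mMetricTime - systemTimeT));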
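For context, the check under discussion compares how far ahead the next run is on the monotonic clock with how far ahead it is on the wall clock; if the two disagree by more than a tolerance, the wall clock has moved. The sketch below is a standalone version that takes all four timestamps as parameters so it can be exercised without touching real clocks. `ClockJumped` is a hypothetical name, and it keeps the signed differences as the quoted code does.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <ctime>

inline bool ClockJumped(std::chrono::steady_clock::time_point scheduleTime,
                        std::chrono::steady_clock::time_point steadyNow,
                        std::time_t metricTime,
                        std::time_t systemNow,
                        int64_t toleranceSec = 60) {
    // Seconds until the scheduled run, measured on the monotonic clock.
    const int64_t expectedDiff
        = std::chrono::duration_cast<std::chrono::seconds>(scheduleTime - steadyNow).count();
    // Seconds until the scheduled run, measured on the wall clock.
    const int64_t actualDiff = static_cast<int64_t>(metricTime) - static_cast<int64_t>(systemNow);
    // Both values are small in normal operation, so their difference stays far from the int64_t limits.
    return std::llabs(expectedDiff - actualDiff) > toleranceSec;
}
```

Keeping the signs matters for this formulation: a wall clock that jumps forward and one that jumps backward both produce a large absolute gap, as the two jump cases in a test would show.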

Comment on lines 140 to 156
// Read the whitelist
if (config.isMember("WhitelistPatterns") && config["WhitelistPatterns"].isArray()) {
    for (const auto& pattern : config["WhitelistPatterns"]) {
        if (pattern.isString()) {
            mWhitelistPatterns.push_back(pattern.asString());
        }
    }
}

// Read the blacklist
if (config.isMember("BlacklistPatterns") && config["BlacklistPatterns"].isArray()) {
    for (const auto& pattern : config["BlacklistPatterns"]) {
        if (pattern.isString()) {
            mBlacklistPatterns.push_back(pattern.asString());
        }
    }
}

Copilot AI Dec 21, 2025

The whitelist and blacklist parsing has no error handling or logging for non-string elements. When the configuration contains elements of another type (numbers, objects, and so on), the code silently skips them (the if (pattern.isString()) checks on lines 143 and 152).

That makes configuration mistakes hard to spot. Consider logging a warning:

if (config.isMember("WhitelistPatterns") && config["WhitelistPatterns"].isArray()) {
    for (const auto& pattern : config["WhitelistPatterns"]) {
        if (pattern.isString()) {
            mWhitelistPatterns.push_back(pattern.asString());
        } else {
            LOG_WARNING(sLogger, 
                ("invalid whitelist pattern type", "expected string")
                ("config", mContext->GetConfigName()));
        }
    }
}
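Since the same loop is needed for both lists, one option is a small helper that returns the accepted patterns together with the positions of rejected elements, leaving the logging to the caller. The sketch below does not use the project's JSON types; `ConfigValue`, `ParsedPatterns`, and `ParsePatterns` are hypothetical stand-ins chosen to keep the example self-contained.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one element of a JSON array.
struct ConfigValue {
    enum class Type { String, Number, Object } type;
    std::string text;  // only meaningful when type == Type::String
};

struct ParsedPatterns {
    std::vector<std::string> patterns;        // accepted string elements, in order
    std::vector<std::size_t> invalidIndexes;  // positions of non-string elements
};

inline ParsedPatterns ParsePatterns(const std::vector<ConfigValue>& values) {
    ParsedPatterns out;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (values[i].type == ConfigValue::Type::String) {
            out.patterns.push_back(values[i].text);
        } else {
            out.invalidIndexes.push_back(i);  // the caller can warn once per index
        }
    }
    return out;
}
```

Reporting the index of each rejected element gives the warning enough detail for a user to find the offending entry in the configuration.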

Comment on lines 298 to 326
void ProcessEntityRunner::PushNextTimerEvent(ProcessEntityCollectContextPtr context) {
    // Check whether the system clock has jumped
    if (context->CheckClockRolling()) {
        context->Reset();
    } else {
        auto now = std::chrono::steady_clock::now();
        std::chrono::steady_clock::time_point nextScheduleTime
            = context->GetScheduleTime() + context->mIncrementalInterval;
        time_t nextMetricTime = context->GetMetricTime() + context->mIncrementalInterval.count();

        int64_t skipCount = 0;
        if (now > nextScheduleTime) {
            // Running too late: compute how many rounds to skip
            skipCount = (now - nextScheduleTime) / context->mIncrementalInterval;
            nextScheduleTime += (skipCount + 1) * context->mIncrementalInterval;
            nextMetricTime += (skipCount + 1) * context->mIncrementalInterval.count();

            LOG_WARNING(sLogger,
                        ("ProcessEntity skip collect",
                         "may cause data loss")("config", context->mConfigName)("skip count", skipCount + 1));
        }

        context->SetTime(nextScheduleTime, nextMetricTime);
    }

    // Create and push the Timer event
    auto event = std::make_unique<ProcessEntityTimerEvent>(context);
    Timer::GetInstance()->PushEvent(std::move(event));
}

Copilot AI Dec 21, 2025

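The clock-jump handling is not robust enough. In PushNextTimerEvent, a detected clock jump triggers context->Reset(), but the code then goes on to create the Timer event without checking the result of Reset().

If an excessive delay is detected right after a clock jump, the timing state can become inconsistent. Suggestions:

  1. Handle the clock-jump case first: after Reset(), create the event and return immediately
  2. Or skip the delay computation after Reset()

Suggested change:

if (context->CheckClockRolling()) {
    context->Reset();
    // After a clock jump, push the event immediately; no delay computation is needed
    auto event = std::make_unique<ProcessEntityTimerEvent>(context);
    Timer::GetInstance()->PushEvent(std::move(event));
    return;
}
// Normal time-advance logic...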
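The normal time-advance branch is easier to review in isolation. The sketch below restates the catch-up arithmetic from the quoted function as a pure helper: given the last schedule time, the current monotonic time, and the interval, it returns the next schedule time and how many rounds were skipped. `NextTick` and `ComputeNextTick` are hypothetical names, and the metric-time update is left out for brevity.

```cpp
#include <chrono>
#include <cstdint>

struct NextTick {
    std::chrono::steady_clock::time_point scheduleTime;
    int64_t skippedRounds;  // 0 when the runner is on time
};

inline NextTick ComputeNextTick(std::chrono::steady_clock::time_point lastSchedule,
                                std::chrono::steady_clock::time_point now,
                                std::chrono::seconds interval) {
    NextTick next{lastSchedule + interval, 0};
    if (now > next.scheduleTime) {
        // Dividing two durations yields a plain integer count of whole intervals.
        const int64_t missed = (now - next.scheduleTime) / interval;
        next.skippedRounds = missed + 1;
        // Advance past "now" so the next run is always in the future.
        next.scheduleTime += next.skippedRounds * interval;
    }
    return next;
}
```

For example, with a 10-second interval, a last schedule at t=0, and a current time of t=35, the helper skips three rounds and schedules the next run at t=40.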

Comment on lines 411 to 414
// Remove processes that have exited
std::vector<ProcessPrimaryKey> toRemove;
for (const auto& [key, detail] : cache) {
    if (currentPidMap.find(key.pid) == currentPidMap.end()) {

Copilot AI Dec 21, 2025

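The logic that removes exited processes during full collection has a problem. On lines 413-417 the code only checks whether the PID exists in currentPidMap; it does not check whether startTime matches.

Problem scenario:

  1. The old process exits; the cache holds (pid=100, startTime=1000)
  2. A new process reuses PID 100, but with startTime=2000
  3. currentPidMap.find(key.pid) finds pid=100, so the old process is not removed
  4. Result: the cache holds two process entities with pid=100

Suggested change:

for (const auto& [key, detail] : cache) {
    auto it = currentPidMap.find(key.pid);
    if (it == currentPidMap.end() || it->second.startTime != key.startTime) {
        // The process is gone, or the PID was reused (startTime differs)
        toRemove.push_back(key);
    }
}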
Suggested change
- // Remove processes that have exited
- std::vector<ProcessPrimaryKey> toRemove;
- for (const auto& [key, detail] : cache) {
-     if (currentPidMap.find(key.pid) == currentPidMap.end()) {
+ // Remove processes that have exited (including PID reuse: a different startTime means an old process)
+ std::vector<ProcessPrimaryKey> toRemove;
+ for (const auto& [key, detail] : cache) {
+     auto it = currentPidMap.find(key.pid);
+     if (it == currentPidMap.end() || it->second.startTime != key.startTime) {
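The scenario above can be reproduced with a small standalone sketch. `ProcessKey` and `FindStaleKeys` are hypothetical simplifications: the cache is keyed by (pid, startTime) as in the PR, but the cached value is a placeholder `int` and the current snapshot maps each PID directly to its start time.

```cpp
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

// Simplified stand-in for the PR's ProcessPrimaryKey.
struct ProcessKey {
    int pid;
    uint64_t startTime;
    bool operator<(const ProcessKey& o) const {
        return pid != o.pid ? pid < o.pid : startTime < o.startTime;
    }
};

// cache: entities known from earlier rounds. current: pid -> startTime from this scan.
// A cached entry is stale when its PID is gone or now belongs to a process
// with a different start time (PID reuse).
inline std::vector<ProcessKey> FindStaleKeys(const std::map<ProcessKey, int>& cache,
                                             const std::unordered_map<int, uint64_t>& current) {
    std::vector<ProcessKey> stale;
    for (const auto& entry : cache) {
        const ProcessKey& key = entry.first;
        auto it = current.find(key.pid);
        if (it == current.end() || it->second != key.startTime) {
            stale.push_back(key);
        }
    }
    return stale;
}
```

With a cache holding (100, 1000), (200, 1500), and (300, 1700) and a scan that sees PID 100 with startTime 2000 and PID 200 with startTime 1500, the helper flags (100, 1000) as reused and (300, 1700) as exited, while (200, 1500) stays cached.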

Collaborator

Why is a standalone Runner needed instead of continuing to enhance the existing ProcessEntityCollector plugin?

Collaborator Author

This has been optimized.

bool operator!=(const ProcessPrimaryKey& other) const { return !(*this == other); }
};

struct ProcessEntityInfo {
Collaborator

I don't see Listening Ports information, and no language field has been reserved.
