@rakibhossainctr rakibhossainctr commented Nov 17, 2025

What is this change?

This change introduces automatic cleanup of unused agent runtime assets by purging assets that have not been accessed within a configurable time window, along with improved event handling for failed asset downloads.

Why is this change necessary?

Several customers (including TD Bank, Ingenico, and others) have requested a scalable mechanism for cleaning up unused assets. Current workarounds, such as manually deleting the directory set by --cache-dir, are not viable and often lead to operational issues, including:

  • DDoS-style load against asset hosts (e.g., Artifactory) due to forced redownloads
  • False-positive events caused by failed (timed out) asset downloads, resulting in exit status 127 (“command not found”)

This change reduces unnecessary cache growth, prevents redundant asset downloads, and improves the reliability of event reporting related to asset failures.

Does your change need a Changelog entry?

Yes. This change adds new functionality to the sensu-agent and modifies its runtime behavior.

Do you need clarification on anything?

N/A at this time.

Were there any complications while making this change?

  • Refactoring was required to track last-accessed timestamps for runtime assets.
  • The interfaces for asset status and event error handling required updates to support improved reporting for failed asset downloads.
  • Care was taken to ensure asset purge operations do not interfere with in-use assets during runtime.
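
One way to keep a purge from touching in-use assets, as the last point above requires, is to count active users per asset and refuse removal while the count is non-zero. The sketch below is only an assumption about how such a guard could look; `usageTracker`, `acquire`, and `tryPurge` are hypothetical names, not the PR's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// usageTracker is a hypothetical guard that counts active users per asset.
type usageTracker struct {
	mu    sync.Mutex
	inUse map[string]int
}

func newUsageTracker() *usageTracker {
	return &usageTracker{inUse: make(map[string]int)}
}

// acquire marks an asset as in use and returns a release function.
// The release function is safe to call more than once.
func (t *usageTracker) acquire(name string) (release func()) {
	t.mu.Lock()
	t.inUse[name]++
	t.mu.Unlock()

	var once sync.Once
	return func() {
		once.Do(func() {
			t.mu.Lock()
			defer t.mu.Unlock()
			t.inUse[name]--
			if t.inUse[name] == 0 {
				delete(t.inUse, name)
			}
		})
	}
}

// tryPurge calls remove only when nothing holds the asset. The lock is held
// while remove runs, so a concurrent acquire cannot slip in mid-deletion.
func (t *usageTracker) tryPurge(name string, remove func() error) (bool, error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.inUse[name] > 0 {
		return false, nil
	}
	if err := remove(); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	tracker := newUsageTracker()
	release := tracker.acquire("check-cpu")

	purged, _ := tracker.tryPurge("check-cpu", func() error { return nil })
	fmt.Println("purged while in use:", purged)

	release()
	purged, _ = tracker.tryPurge("check-cpu", func() error { return nil })
	fmt.Println("purged after release:", purged)
}
```

Holding the mutex for the duration of `remove` trades some concurrency for simplicity; a real implementation might use per-asset locks instead.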

Have you reviewed and updated the documentation for this change? Is new documentation required?

Yes

How did you verify this change?

  • Added unit and integration tests validating:
    • asset purge logic based on timestamp threshold
    • no interference with actively used assets
    • improved event reporting for failed asset downloads
  • Manually verified via end-to-end tests on a Linux agent using multiple assets with staged access timestamps
  • Confirmed events no longer incorrectly report exit status 127 when the failure is due to an asset download timeout

Is this change a patch?

No

… management

- Add LastAccessed field to RuntimeAsset struct to track asset usage
- Update boltDBAssetManager.Get() to record timestamps on asset access
- Add updateLastAccessed() helper method for atomic timestamp updates
- Update asset creation to set initial LastAccessed timestamp
- Update all test files to include LastAccessed field in struct initialization
- Fix TestGetAllError to expect empty slice instead of nil for Go consistency
- Add comprehensive test coverage for timestamp functionality

This change enables tracking of asset usage patterns as the foundation
for implementing asset cleanup functionality based on last access time.
The implementation maintains backward compatibility and follows existing
BoltDB transaction patterns for data consistency.

Signed-off-by: rakibhossainctr <[email protected]>
- exposed new method to delete and find unused assets

Signed-off-by: rakibhossainctr <[email protected]>
- restructure agent db and cache dir creation
- created a method to fetch db connection from enterprise edition
- restructure start asset manager process

Signed-off-by: rakibhossainctr <[email protected]>
- fixed: AfterAgentRun and BeforeAgentRun can now send all errors in case of a fatal plugin error
- removed local references in go.mod file

Signed-off-by: rakibhossainctr <[email protected]>