Skip to content

Conversation

x-INFiN1TY-x
Copy link

@x-INFiN1TY-x x-INFiN1TY-x commented Sep 17, 2025

Add FileCache Prune REST API

Introduces a REST API for on-demand pruning of OpenSearch's file cache. This production-ready implementation enables operators to efficiently manage disk space across warm node clusters.

Issue

#19322

Implementation Architecture

Implements OpenSearch's advanced TransportNodesAction pattern for sophisticated multi-node coordination:

REST Layer → Node Transport → Node Resolution → Parallel FileCache Operations

API Specification

Endpoint

POST /_cache/filecache/prune

Parameters

  • nodes (optional) — Comma-separated list of node IDs or node selectors (e.g., warm:true)
  • node (optional) — Single node ID for targeted operations
  • timeout (optional) — Operation timeout (e.g., 30s, 2m, 1h)

Response Format

{
  "acknowledged": true,
  "total_pruned_bytes": 2097152,
  "summary": {
    "total_nodes_targeted": 3,
    "successful_nodes": 2,
    "failed_nodes": 1,
    "total_cache_capacity": 32212254720
  },
  "nodes": {
    "warm-node-1": {
      "name": "opensearch-warm-1",
      "transport_address": "10.0.1.101:9300",
      "host": "warm-1.cluster.local",
      "ip": "10.0.1.101",
      "pruned_bytes": 1048576,
      "cache_capacity": 10737418240
    },
    "warm-node-2": {
      "name": "opensearch-warm-2", 
      "transport_address": "10.0.1.102:9300",
      "host": "warm-2.cluster.local",
      "ip": "10.0.1.102",
      "pruned_bytes": 1048576,
      "cache_capacity": 10737418240
    }
  },
  "failures": [
    {
      "node_id": "warm-node-3",
      "reason": "FileCache prune operation failed",
      "caused_by": "RuntimeException"
    }
  ]
}

Files Added

server/src/main/java/org/opensearch/action/admin/cluster/cache/
├── PruneCacheAction.java
├── PruneCacheRequest.java
├── PruneCacheResponse.java
├── NodePruneCacheResponse.java  [NEW - Advanced node response model]
├── TransportPruneCacheAction.java
└── package-info.java

server/src/main/java/org/opensearch/rest/action/admin/cluster/
└── RestPruneCacheAction.java

Files Modified

server/src/main/java/org/opensearch/action/ActionModule.java       # Action & REST handler registration
server/src/main/java/org/opensearch/node/Node.java                 # FileCache dependency injection
server/src/test/java/org/opensearch/action/ActionModuleTests.java  # Framework integration tests

Advanced Capabilities

Intelligent Node Targeting

# Target all warm nodes (default)
POST /_cache/filecache/prune

# Target specific warm nodes
POST /_cache/filecache/prune?nodes=warm-node-1,warm-node-2

# Target single node
POST /_cache/filecache/prune?node=warm-node-1

# With operation timeout
POST /_cache/filecache/prune?timeout=60s

Test Architecture

  • PruneCacheRequestResponseTests.java
  • TransportPruneCacheActionTests.java
  • RestPruneCacheActionTests.java

Behavior Details

  • Warm Nodes: Executes FileCache.prune() and returns actual freed bytes with capacity metrics
  • Non-Warm Nodes: Safely filtered out.
  • Null FileCache: Graceful handling returning zero metrics.
  • Exception Handling: FileCache layer exceptions properly surfaced with node context

Sequence Flow Diagram

mermaid-diagram-2025-09-23-141826

- Implements POST /_cache/remote/prune endpoint for manual cache cleanup
- Adds comprehensive action, transport, and REST handler implementation
- Includes full test coverage for all components
@x-INFiN1TY-x x-INFiN1TY-x marked this pull request as ready for review September 17, 2025 07:12
@x-INFiN1TY-x x-INFiN1TY-x requested a review from a team as a code owner September 17, 2025 07:12
Copy link
Contributor

❌ Gradle check result for 7788eb6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@gaobinlong
Copy link
Contributor

Mind opening an issue to describe the purpose of this new API?

@x-INFiN1TY-x
Copy link
Author

Gradle Check failed due to Flaky Tests : #17486

@x-INFiN1TY-x x-INFiN1TY-x marked this pull request as draft September 18, 2025 11:07
Copy link
Contributor

❕ Gradle check result for 5f88ec4: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link

codecov bot commented Sep 19, 2025

Codecov Report

❌ Patch coverage is 84.13793% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.97%. Comparing base (7e0c6ea) to head (8fb90d0).
⚠️ Report is 35 commits behind head on main.

Files with missing lines Patch % Lines
...admin/cluster/cache/TransportPruneCacheAction.java 71.05% 8 Missing and 3 partials ⚠️
...on/admin/cluster/cache/NodePruneCacheResponse.java 63.63% 8 Missing ⚠️
...action/admin/cluster/cache/PruneCacheResponse.java 94.23% 0 Missing and 3 partials ⚠️
...est/action/admin/cluster/RestPruneCacheAction.java 93.75% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##               main   #19321    +/-   ##
==========================================
  Coverage     72.96%   72.97%            
- Complexity    69825    69870    +45     
==========================================
  Files          5667     5673     +6     
  Lines        320555   320700   +145     
  Branches      46348    46364    +16     
==========================================
+ Hits         233908   234035   +127     
- Misses        67769    67782    +13     
- Partials      18878    18883     +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

❌ Gradle check result for ca35f02: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

✅ Gradle check result for 8fb90d0: SUCCESS

@x-INFiN1TY-x x-INFiN1TY-x marked this pull request as ready for review September 23, 2025 16:40
@x-INFiN1TY-x
Copy link
Author

x-INFiN1TY-x commented Sep 23, 2025

Requesting Review

CC : @Harsh-87

Copy link
Contributor

@Gagan6164 Gagan6164 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from these minor comments, rest of the changes LGTM.

@Override
public boolean equals(Object o) {
if (this == o) return true;
if (!(o instanceof NodePruneCacheResponse)) return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The convention is to use == false instead of !.

Comment on lines +132 to +144
public boolean isPartiallySuccessful() {
return !getNodes().isEmpty() && !failures().isEmpty();
}

/**
* Check if the operation was completely successful (all targeted nodes succeeded)
*
* @return true if all targeted nodes succeeded
*/
public boolean isCompletelySuccessful() {
return !getNodes().isEmpty() && failures().isEmpty();
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using these methods anywhere else except in tests? If not we can mark these @VisibleForTesting instead of making them public.

);
}

if (warmNodes.isEmpty() && nodeIds.length == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nodeIds.length == 0 check is redundant here.


@Override
protected NodePruneCacheResponse nodeOperation(NodeRequest nodeRequest) {
PruneCacheRequest request = nodeRequest.getRequest();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't see request getting used here.

if (fileCache != null) {
b.bind(FileCache.class).toInstance(fileCache);
} else {
b.bind(FileCache.class).toProvider(() -> null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
b.bind(FileCache.class).toProvider(() -> null);
b.bind(FileCache.class).toProvider(Providers.of(null));

assert request.concreteNodes() == null : "request concreteNodes shouldn't be set";

String[] nodeIds = clusterState.nodes().resolveNodes(request.nodesIds());
List<DiscoveryNode> warmNodes = Arrays.stream(nodeIds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can first collect Warm nodes and then take intersection with the nodeIds from request and we can reuse the same list at Line 85.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if this PR get's merged first then we can get the same list directly fromDiscoveryNodes.getWarmNodes() as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants