Skip to content

ES-10063 Add multi-project support for more stats APIs #127650

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

PeteGillinElastic
Copy link
Member

@PeteGillinElastic PeteGillinElastic commented May 2, 2025

This affects the following APIs:

  • GET _nodes/stats:
    • For indices, it now prefixes the index name with the project ID (in multi-project clusters). Previously, it didn't tell you which project an index was in, and it failed if two projects had the same index name.
    • For ingest, it now gets the pipeline and processor stats for all projects, and prefixes the pipeline ID with the project ID (in multi-project clusters). Previously, it only got them for the default project.
  • GET /_cluster/stats:
    • For ingest, it now aggregates the pipeline and processor stats for all projects. Previously, it only got them for the default project.
  • GET /_info:
    • For ingest, same as for GET /_nodes/stats.

It does not add multi-project support for the repositories section on _nodes/stats, since repositories aren't MP-ready yet AFAIK.

This is done by making IndicesService.stats() and IngestService.stats() include project IDs in the NodeIndicesStats and IngestStats objects they return, and making those stats objects incorporate the project IDs when converting to XContent.

The transitive callers of these two methods are rather extensive (including all callers to NodeService.stats(), all callers of TransportNodesStatsAction, and so on). To ensure the change is safe, the callers were all checked out, and they fall into the following cases:

  • The behaviour change is one of the desired enhancements described above.
  • There is no behaviour change because it was getting node stats but neither indices nor ingest stats were requested.
  • There is no behaviour change because it was getting indices and/or ingest stats but only using aggregate values.
  • In MachineLearningUsageTransportAction and TransportGetTrainedModelsStatsAction, the IngestStats returned will return stats from all projects instead of just the default with this change, but they have been changed to filter the non-default project stats out, so this change is a noop there. (These actions are not MP-ready yet.)
  • MonitoringService will be affected, but this is the legacy monitoring module which is not in use anywhere that MP is going to be enabled. (If anything, the behaviour is probably improved by this change, as it will now include project IDs, rather than producing ambiguous unqualified results and failing in the case of duplicates.)

This affects the following APIs:
 - `GET _nodes/stats`:
   - For `indices`, it now prefixes the index name with the project ID (for non-default projects). Previously, it didn't tell you which project an index was in, and it failed if two projects had the same index name.
   - For `ingest`, it now gets the pipeline and processor stats for all projects, and prefixes the pipeline ID with the project ID. Previously, it only got them for the default project.
 - `GET /_cluster/stats`:
   - For `ingest`, it now aggregates the pipeline and processor stats for all projects. Previously, it only got them for the default project.
 - `GET /_info`:
   - For `ingest`, same as for `GET /_nodes/stats`.

This is done by making `IndicesService.stats()` and `IngestService.stats()` include project IDs in the `NodeIndicesStats` and `IngestStats` objects they return, and making those stats objects incorporate the project IDs when converting to XContent.

The transitive callers of these two methods are rather extensive (including all callers to `NodeService.stats()`, all callers of `TransportNodesStatsAction`, and so on). To ensure the change is safe, the callers were all checked out, and they fall into the following cases:
 - The behaviour change is one of the desired enhancements described above.
 - There is no behaviour change because it was getting node stats but neither `indices` nor `ingest` stats were requested.
 - There is no behaviour change because it was getting `indices` and/or `ingest` stats but only using aggregate values.
 - In `MachineLearningUsageTransportAction` and `TransportGetTrainedModelsStatsAction`, the `IngestStats` returned will return stats from all projects instead of just the default with this change, but they have been changed to filter the non-default project stats out, so this change is a noop there. (These actions are not MP-ready yet.)
 - `MonitoringService` will be affected, but this is the legacy monitoring module which is not in use anywhere that MP is going to be enabled. (If anything, the behaviour is probably improved by this change, as it will now include project IDs, rather than producing ambiguous unqualified results and failing in the case of duplicates.)
@PeteGillinElastic PeteGillinElastic added >non-issue :Data Management/Stats Statistics tracking and retrieval APIs v9.1.0 labels May 2, 2025
@PeteGillinElastic PeteGillinElastic marked this pull request as ready for review May 2, 2025 16:40
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label May 2, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@PeteGillinElastic PeteGillinElastic changed the title Add multi-project support for more stats APIs ES-10063 Add multi-project support for more stats APIs May 2, 2025
Copy link
Contributor

@nielsbauman nielsbauman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for working on this, Pete! I did a first round of review by just looking at the production files. I'll have a look at the tests later on. This feels like the right direction; my comments are mostly small and/or thinking out loud.

Comment on lines 322 to 324
if (Objects.equals(projectId, Metadata.DEFAULT_PROJECT_ID)) {
return index.getName();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little bit on the fence about this approach. Thinking out loud: on the one hand, this maintains BwC with little effort. On the other hand, if we ever have indices in the default project this'll be a bit off. It's been mentioned a handful of times that we might want to consider (so, super hypothetically) having a shared cluster-level GeoIP index to avoid downloading the same DB over again. We shouldn't make this decision purely based on the GeoIP index, but the idea of the faint possibility of cluster-level indices is something to think about. At the same time, this approach doesn't "break" anything in that case either, it just makes the output inconsistent. So, we either try to account for that situation proactively now, or we decide to go with this approach and we solve that issue if/when it comes. Any thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the options are:

  1. Include the default/ prefix on every index in single-project mode.
  2. Live with the fact that people will need to know that the absence of a prefix indicates the default project in multi-project mode.
  3. Somehow pass in information about whether we're in single- or multi-project mode. I guess that we could go back to making the map nullable and use the null value to indicate that we're in single-project mode and don't need the prefixes.

What I've currently implement is option 2. I think this is certainly better than option 1. But if we're seriously considering having indices in the default project when in MP mode (which I hadn't realized was a possibility) and if we think that people might not realize that no prefix means default (which I'm not sure about) then we could switch to option 3 easily enough. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On reflection, I think I'm leaning towards option 3.

This would also mean that, when deserializing an instance from an older node, we could make the map be null and so not show project prefixes: I think this would be valid, because I assume that we will never have to support BWC with nodes older than the current main in MP clusters (there's no way that could possibly work!).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, ideally, I'd want to go with option 3 too (not showing the project ID in single-project mode and showing the project ID for all indices in multi-project mode), for consistency. But let's park this discussion for a bit until we resolve #127650 (comment) (or choose not to resolve it and live with it).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed a commit switching to option 3. I know you said to park it, but it's an easy change, and I think it might help the other discussion (esp. for folks who haven't seen the PR before we raise it) if we're looking at the closest we have to a proposal.

Copy link
Member Author

@PeteGillinElastic PeteGillinElastic May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, ugh, it now occurs to me: IngestStats still behaves the old way (prefix project ID if non-default). They should be consistent. Switching IngestStats to behave the new way looks like a little more work because there's no obvious way to indicate to it whether it's in MP mode or not, we'd have to explicitly pass that in.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether XContent params might be an option (for both). I've not used those so I don't know what limitations there might be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some YAML tests from the ingest-common module are muted in this file because they depend on stats APIs:

    '^test/ingest/15_info_ingest/*', // uses cluster info API
    '^test/ingest/70_bulk/Test bulk request with default pipeline', // uses stats API
    '^test/ingest/70_bulk/Test bulk request without default pipeline', // uses stats API

However, because we decided to prefix the project ID, we won't be able to unmute those tests... I don't immediately have a good suggestion, but I thought I'd mention it so we're aware. If we decide that that is OK, we should move these lines to the bottom so they're next to the comment explaining why some stats API tests are muted in MP.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, thanks. I've made myself a note to come back to this.

Comment on lines 576 to 579
return indices.values()
.stream()
.map(AbstractIndexComponent::index)
.collect(Collectors.toMap(index -> index, index -> clusterService.state().metadata().projectFor(index).id()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking out loud: by using Metadata#projectFor, we throw an exception if the index is not found. I'm not familiar enough with the IndicesService to know whether there is a possibility of the service being "behind" the state in ClusterService. For instance, when an index is deleted from the cluster state, is the indices service updated atomically? Or if an index is added, is it first added to the cluster state or the IndicesService? If you don't already have an answer to those questions (which is completely fine), I think it's worth having a look to be sure what kind of behavior we could see here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh. That's a good question.

Looking at the creation flow, I believe that creating the IndexService and adding it to the indices map in IndicesService happens as part of an applyClusterState method in a ClusterStateApplier, and looking at ClusterApplierService it seems that this will be invoked before the state is updated. So it seems plausible that we could call IndicesService.stats on another thread and observe it in a state where this fails for an index which is in the process of being created on this node.

At any rate, I think that good defensive coding principles dictate that we should make things work as well as possible in these circumstances.

I think the best option might be to switch to using lookupProject(Index). That means that we'll just omit the index from the map in that case. Then we have to figure out how to handle that. Right now, it would use null for the project ID, which is probably confusing. We could make it use something like <unknown>, which is sort of horrible but at least it's explicit. (We already have the possibility that the map is empty because it came from an older node, although at least that situation is only temporary.)

I don't like that, but I'm struggling to think of any better alternatives right now.

It would be simple if the IndexService knew its project ID. I think you said it was a deliberate design decision not to do that... Or did was the deliberate decision just that Index and/or IndexMetadata wouldn't know their project ID? I've had a glance, and it would be a moderately messy change: there are dozens of code paths we'd have to wire the project ID through to get back to somewhere it's available (not counting tests).

The refactoring to ensure that things happen in a different order which guarantees it would be okay is waaaaay to large and risky to justify.

So I'm going with the <unknown> option for now. But I'd be very interested to hear your thoughts!

Copy link
Member Author

@PeteGillinElastic PeteGillinElastic May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(We already have the possibility that the map is empty because it came from an older node, although at least that situation is only temporary.)

N.B. If we go with the thing I called option 3 in the other thread, this comment will not be true. The only way that I think this the index could be missing from the map is if we hit this rare timing issue. So I think the question becomes: Do we want to do the plumbing to make an IndexService know its project ID in order to avoid this? Or is there a better solution that I'm not seeing?

(There is, incidentally, a risk that the stats call could fail if there are two indices with the same name in different projects both of which hit this rare timing issue at the same time, causing a key collision. I'm inclined not to worry about that, as it would be transient, although it could be avoided if we just add a counter to the prefix.)

Copy link
Contributor

@samxbr samxbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall! I haven't gone over the tests yet, will run another pass for the tests tomorrow.

public NodeIndicesStats(
CommonStats oldStats,
Map<Index, CommonStats> statsByIndex,
Map<Index, List<IndexShardStats>> statsByShard,
Map<Index, ProjectId> projectsByIndex,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I like this approach of passing in the index to project map

Copy link
Member Author

@PeteGillinElastic PeteGillinElastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I just pushed a commit which responds to most of the comments here. There's some discussion below.

Comment on lines 576 to 579
return indices.values()
.stream()
.map(AbstractIndexComponent::index)
.collect(Collectors.toMap(index -> index, index -> clusterService.state().metadata().projectFor(index).id()));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh. That's a good question.

Looking at the creation flow, I believe that creating the IndexService and adding it to the indices map in IndicesService happens as part of an applyClusterState method in a ClusterStateApplier, and looking at ClusterApplierService it seems that this will be invoked before the state is updated. So it seems plausible that we could call IndicesService.stats on another thread and observe it in a state where this fails for an index which is in the process of being created on this node.

At any rate, I think that good defensive coding principles dictate that we should make things work as well as possible in these circumstances.

I think the best option might be to switch to using lookupProject(Index). That means that we'll just omit the index from the map in that case. Then we have to figure out how to handle that. Right now, it would use null for the project ID, which is probably confusing. We could make it use something like <unknown>, which is sort of horrible but at least it's explicit. (We already have the possibility that the map is empty because it came from an older node, although at least that situation is only temporary.)

I don't like that, but I'm struggling to think of any better alternatives right now.

It would be simple if the IndexService knew its project ID. I think you said it was a deliberate design decision not to do that... Or did was the deliberate decision just that Index and/or IndexMetadata wouldn't know their project ID? I've had a glance, and it would be a moderately messy change: there are dozens of code paths we'd have to wire the project ID through to get back to somewhere it's available (not counting tests).

The refactoring to ensure that things happen in a different order which guarantees it would be okay is waaaaay to large and risky to justify.

So I'm going with the <unknown> option for now. But I'd be very interested to hear your thoughts!

Comment on lines 322 to 324
if (Objects.equals(projectId, Metadata.DEFAULT_PROJECT_ID)) {
return index.getName();
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the options are:

  1. Include the default/ prefix on every index in single-project mode.
  2. Live with the fact that people will need to know that the absence of a prefix indicates the default project in multi-project mode.
  3. Somehow pass in information about whether we're in single- or multi-project mode. I guess that we could go back to making the map nullable and use the null value to indicate that we're in single-project mode and don't need the prefixes.

What I've currently implement is option 2. I think this is certainly better than option 1. But if we're seriously considering having indices in the default project when in MP mode (which I hadn't realized was a possibility) and if we think that people might not realize that no prefix means default (which I'm not sure about) then we could switch to option 3 easily enough. What do you think?

@@ -564,6 +571,15 @@ IndexShardStats indexShardStats(final IndicesService indicesService, final Index
);
}

private Map<Index, ProjectId> projectsByIndex() {
Map<Index, ProjectId> map = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
Map<Index, ProjectId> map = new HashMap<>();
Map<Index, ProjectId> map = new HashMap<>(indices.size());

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason github won't let me commit your suggestion — it says "Sorry, the changes couldn't be applied. Refresh and try again." repeatedly — so I'm making this change myself.

@@ -66,6 +70,7 @@ public class NodeIndicesStats implements Writeable, ChunkedToXContent {
private final CommonStats stats;
private final Map<Index, List<IndexShardStats>> statsByShard;
private final Map<Index, CommonStats> statsByIndex;
private final Map<Index, ProjectId> projectsByIndex;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a shame we have to pass in this map ourselves (and (de)serialize it), as it's pretty much just the Metadata.ProjectLookup... It looks like we only need it for the XContent serialization, but I can't think of a way to pass the ProjectLookup to that serialization method. I do think it's worth thinking about that some more (and maybe asking others for suggestions). This is going to result in a lot more serialization over the wire.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's a lot more. It's adding one ProjectId where there's already a CommonStats and a List<IndexShardStats>. But I agree it's unfortunate to be sharing copies of this around everywhere. I'll start a thread on the channel, as you suggest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, the serialization works a bit different in this class, it doesn't just serialize a full map. Hm ok that makes it a lot less worse, indeed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I just realized I'm mixing things up. I was thinking of the serialization of IngestStats. In NodeIndicesStats, this map will result in quite a bit more serialization: a whole map where we serialize every index (not sure if it's every index on the node or really all indices, but still) and every project ID.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, it's every index on the node. I still think that ProjectId is small compared with other stuff we're already storing for each index on the node...

@PeteGillinElastic
Copy link
Member Author

I just pushed a commit to switch it to using ToXContent.Params to distinguish between single- and multi-project behaviour, as discussed elsewhere. I will look into changing NodeIndicesStats to using the nested structure we talked about next.

@PeteGillinElastic PeteGillinElastic marked this pull request as draft May 19, 2025 18:30
@PeteGillinElastic PeteGillinElastic marked this pull request as ready for review May 20, 2025 12:09
@PeteGillinElastic
Copy link
Member Author

PeteGillinElastic commented May 20, 2025

Those last few commits make some MP REST YAML tests work in the cases where the response contains a project ID prefix.

The way this works isn't the most elegant thing in the world, but I think it's okay. There are two parts to it: We populate an entry in the 'stash' for the _project_id_prefix_, and we subtly change the way that the subject of YAML assertions are interpreted.

Here's how the subject of YAML assertions were interpreted before:

  1. If there were no references, e.g. items.0.index.executed_pipelines, it was interpreted as a path to look up in the response.
  2. If the whole thing was a simple reference, e.g. $body, it was looked up in the stash instead — in this case, it asserts on the stashed value, it is not interpreted as a path to look up in the response.
  3. If it contained a simple reference somewhere inside it, e.g. nodes.$master.ingest.total.count, the reference was looked up in the stash and substituted, and the result interpreted as a path to look up in the response.
  4. If it contained an extended reference, e.g. abc${foo}xyz, the reference was looked up in the stash and substitued, and the assertion was done on the result — it was not interpreted as a path to look up in the response. 👈 This is the thing that's changing in this PR.

This final case is surprising: it means that nodes.$master.ingest.total.count and nodes.${master}.ingest.total.count would be treated differently, with the former being substituted and then interpreted as a path in the response and the latter being substituted and then interpreted as the literal subject of the assertion. I don't think this can have been the intention. No tests seem to rely on it. This changes it so that these are both substituted and then treated as paths.

Copy link
Contributor

@samxbr samxbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, I have posted some minor comments, it should be good to merge very soon :)

.nodesStats(nodesStatsRequest, new RestRefCountedChunkedToXContentListener<>(channel, xContentParamsForRequest(request)));
}

private ToXContent.DelegatingMapParams xContentParamsForRequest(RestRequest request) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, didn't know about ToXContent.Params. Curious why use ToXContent.DelegatingMapParams here but ToXContent.MapParams in RestClusterInfoAction? Do we expect the request object to contain the MULTI_PROJECT_ENABLED_XCONTENT_PARAM_KEY key here?

* Otherwise, it should return {@code true} to indicate the cluster can accommodate multiple projects regardless
* how many project it current has.
*/
default boolean supportsMultipleProjects() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reason for this move so ActionModule can access this method? I think ActionModule is constructed with a ProjectResolver instance in NodeConstruciton, so changing the ActionModule.ProjectIdResolver property to ActionModule.ProjectResolver should make the method accessible. Just feel a little bit more fitting forsupportsMultipleProjects to stay in ProjectResolver.

- match: { "ingest.pipelines.${_project_id_prefix_}ingest_info_pipeline.current": 0 }
- match: { "ingest.pipelines.${_project_id_prefix_}ingest_info_pipeline.failed": 0 }
- gt: { "ingest.pipelines.${_project_id_prefix_}ingest_info_pipeline.ingested_as_first_pipeline_in_bytes": 0 }
- gt: { "ingest.pipelines.${_project_id_prefix_}ingest_info_pipeline.produced_as_first_pipeline_in_bytes": 0 }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very familiar, could you help me understand how does the ${...} part get parsed and resolved to the value in the stash?

'^indices.get_alias/10_basic/Get alias against closed indices', // Does NOT work with security enabled, see also core-rest-tests-with-security
'^indices.recovery/*/*',
'^indices.resolve_cluster/*/*',
'^indices.resolve_cluster/*/*/*',
'^indices.shard_stores/*/*',
'^test/ingest/15_info_ingest/*', // uses cluster info API
'^test/ingest/70_bulk/Test bulk request with default pipeline', // uses stats API
'^test/ingest/70_bulk/Test bulk request without default pipeline', // uses stats API
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for making these tests work in MP!

),
// Adding a non inference processor that should be ignored
new IngestStats.ProcessorStat("grok", "grok", new IngestStats.Stats(100, 100, 100, 100))
ProjectId.DEFAULT,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use randomProjectIdOrDefault() to make unit tests a little more generic to MP

@samxbr
Copy link
Contributor

samxbr commented May 21, 2025

  1. If the whole thing was a simple reference, e.g. $body, it was looked up in the stash instead — in this case, it asserts on the stashed value, it is not interpreted as a path to look up in the response.

This makes sense, but it does feel like a little different behaviour from the other 3 cases. I will probably confuse myself on this later, is this behaviour documented somewhere in the code base?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Stats Statistics tracking and retrieval APIs >non-issue Team:Data Management Meta label for data/management team v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants