SOLR-16458: Migrate NodeHealthAPI from homegrown @EndPoint to JAX-RS#4171

Merged
epugh merged 35 commits into apache:main from epugh:copilot/migrate-node-health-api
Mar 21, 2026

Conversation

@epugh
Contributor

@epugh epugh commented Feb 28, 2026

Migrates NodeHealthAPI — the last node-level V2 API still using Solr's homegrown @EndPoint annotation — to standard JAX-RS, following the same pattern as NodeLogging, GetPublicKey, etc.

Design

The logic stays in HealthCheckHandler (minimising diff surface). NodeHealthAPI is a thin JAX-RS wrapper (~60 lines) that delegates entirely to it.

Key changes

  • solr/api — New NodeHealthApi interface (@Path, @GET, @Operation) and NodeHealthResponse model (status, message, num_cores_unhealthy)
  • NodeHealthAPI — Replaces @EndPoint with JAX-RS; injects CoreContainer, delegates to HealthCheckHandler
  • HealthCheckHandler — Logic unchanged; adds public NodeHealthResponse checkNodeHealth(Boolean, Integer) as the shared entry point for both v1 (handleRequestBody) and v2 (NodeHealthAPI); switches to getJerseyResources() / empty getApis()
  • V2NodeAPIMappingTest — Removes the now-obsolete @EndPoint/ApiBag routing test for health
  • NodeHealthAPITest — New Mockito unit tests for the API class
  • NodeHealthAPITest2 — New mock-free integration tests: cloud-mode via real MiniSolrCloudCluster, legacy mode via embedded CoreContainer built from NodeConfig
  • implicit-requesthandlers.adoc — Health section now links to both HealthCheckHandler (v1) and NodeHealthAPI (v2) javadocs
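A rough sketch of what the new solr/api interface described above might look like, based only on the bullet points in this PR description. The annotation choices, package names, path, and the response model shape are assumptions, not the merged code:

```java
// Hypothetical sketch of the NodeHealthApi JAX-RS interface described above.
// The real annotations, path, and package names may differ in the merged code.
import io.swagger.v3.oas.annotations.Operation;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.QueryParam;

@Path("/node/health")
public interface NodeHealthApi {

  @GET
  @Operation(summary = "Report the health of this Solr node")
  NodeHealthResponse healthcheck(
      @QueryParam("requireHealthyCores") Boolean requireHealthyCores);
}

// Response model with the fields named in the PR description.
class NodeHealthResponse {
  public String status;
  public String message;
  public Integer num_cores_unhealthy;
}
```

The implementation class (NodeHealthAPI) would then implement this interface and delegate to HealthCheckHandler, keeping the contract in solr/api separate from the logic.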

Copilot AI and others added 4 commits February 22, 2026 12:52
Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
…ef guide

Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
@epugh
Contributor Author

epugh commented Feb 28, 2026

Some outstanding questions: 1) There are more-mock and less-mock versions of the same test; which do we prefer? 2) Does this seem like a reasonable pattern for the conversion?

@epugh
Contributor Author

epugh commented Feb 28, 2026

I wish I didn't have TWO ways of writing tests, one for cloud and one for standalone... sigh.

@epugh
Contributor Author

epugh commented Mar 1, 2026

tests all pass!

@epugh epugh changed the title Migrate NodeHealthAPI from homegrown @EndPoint to JAX-RS; add ref guide link Migrate NodeHealthAPI from homegrown @EndPoint to JAX-RS Mar 2, 2026
@epugh epugh removed the no-changelog label Mar 3, 2026
@gerlowskija
Contributor

@epugh - this should probably be attached to one of the v2 JIRA tickets or another. Maybe https://issues.apache.org/jira/browse/SOLR-16458?

@epugh
Contributor Author

epugh commented Mar 3, 2026

@epugh - this should probably be attached to one of the v2 JIRA tickets or another. Maybe https://issues.apache.org/jira/browse/SOLR-16458?

Sure... I suppose I could be crosslinking all of these to various JIRAs?

@epugh epugh changed the title Migrate NodeHealthAPI from homegrown @EndPoint to JAX-RS SOLR-16458: Migrate NodeHealthAPI from homegrown @EndPoint to JAX-RS Mar 3, 2026
* mocks.
*
* <p>Cloud-mode tests use a real {@link org.apache.solr.cloud.MiniSolrCloudCluster} and get a
* {@link CoreContainer} directly from a {@link JettySolrRunner}. Legacy (standalone) mode tests
Contributor

[Q] Do we have other tests that do standalone testing within a SolrCloudTestCase?

It feels weird conceptually. And in practical terms SolrCloudTestCase does some work that makes it much slower on a per-test basis than our other base classes. Doing standalone testing in a SolrCloudTestCase is going to end up paying that runtime cost for no reason.

Contributor Author

Dunno... Do you think we should have TWO tests? The setup is only done once in beforeClass, not per test.

Contributor

Do you think we should have TWO tests?

Yes, probably. NodeHealthAPIStandaloneTest and NodeHealthAPICloudTest or something like that.

public void testCloudMode_RequireHealthyCoresReturnOkWhenAllCoresHealthy() {
CoreContainer coreContainer = cluster.getJettySolrRunner(0).getCoreContainer();

// requireHealthyCores=true should succeed on a node with no unhealthy cores
Contributor

[-1] You haven't actually created any cores!!

Can you create a collection or something that'll cause this test to actually exercise the per-core logic currently in HealthcheckHandler?

Contributor Author

I added something, and one thing is we may want to collapse our tests from HealthCheckHandlerTest at some point into this test case.

v2: `api/node/health` |{solr-javadocs}/core/org/apache/solr/handler/admin/HealthCheckHandler.html[HealthCheckHandler] |
v2: `api/node/health` |v1: {solr-javadocs}/core/org/apache/solr/handler/admin/HealthCheckHandler.html[HealthCheckHandler]

v2: {solr-javadocs}/core/org/apache/solr/handler/admin/api/NodeHealthAPI.html[NodeHealthAPI] |
Contributor

[Q] Is it the implementation Javadocs we want to point people to here, or would the solr/api interface docs be more helpful?

Or more broadly - is there much value even in pointing to either Javadoc on the v2 side? HealthcheckHandler has a nice good blurb, but neither NodeHealthAPI nor NodeHealthApi have much of anything that's worth pointing a user at IMO...

Contributor Author

Good question.... I think we need to decide what the pattern should be for these new apis... Where DO we want these apis to link to and be documented? I don't have a strong opinion, and honestlhy, if we punted this to a future JIRA I could live with that too.

Copilot AI and others added 2 commits March 9, 2026 19:24
…kHandler delegates to V2

- NodeHealthAPI now owns all business logic (cloud mode, legacy mode,
  isWithinGenerationLag, findUnhealthyCores) using strongly-typed
  NodeHealthResponse / NodeStatus throughout.
- HealthCheckHandler becomes a thin V1 bridge: handleRequestBody()
  creates NodeHealthAPI(coreContainer).checkNodeHealth(...) and
  squashes the typed response into SolrQueryResponse.
- findUnhealthyCores() moved to NodeHealthAPI as a public static util;
  HealthCheckHandler keeps a @deprecated delegation shim so existing
  callers continue to compile.
- HealthCheckHandlerTest updated to call NodeHealthAPI.findUnhealthyCores()
  directly.
- Utils.getReflectWriter() now serialises Enum values as their .name()
  string so that NodeStatus.OK round-trips as "OK" through
  NamedList/javabin, keeping HealthCheckHandlerTest assertions passing.
- Fixed pre-existing bug in isWithinGenerationLag: condition was
  `generationDiff < maxGenerationLag` (wrong); corrected to
  `generationDiff > maxGenerationLag` with the return values adjusted
  so the method returns true=healthy / false=lagging-too-far.
- Fixed missing slf4j log arguments in the negative-diff warning.

Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
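The corrected comparison described in the commit message can be distilled into a self-contained sketch. The method shape and the treatment of a negative diff are assumptions based on the discussion in this PR, not the merged code:

```java
// Standalone sketch of the fixed isWithinGenerationLag logic:
// true = follower is healthy, false = lagging too far (or in an odd state).
public class GenerationLagCheck {

  public static boolean isWithinGenerationLag(
      long leaderGeneration, long followerGeneration, int maxGenerationLag) {
    long generationDiff = leaderGeneration - followerGeneration;
    if (generationDiff < 0) {
      // Follower reports a newer generation than the leader, e.g. the
      // leader's index was wiped; treated as unhealthy here (assumption,
      // matching the pre-existing behavior on main discussed below).
      return false;
    } else if (generationDiff > maxGenerationLag) {
      // Follower has fallen further behind than the allowed lag.
      return false;
    }
    return true;
  }
}
```

The original bug was that the else-if condition used `<` where `>` was meant, so the return values and the log message disagreed; the double reversal above keeps the semantics the log message always claimed.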
@epugh
Contributor Author

epugh commented Mar 10, 2026

Okay, at this point down to a decision on Enum impact, migrating all the business logic out of HealthCheckHandler, and Javadocs and then this may be good to go!

Chatted with @gerlowskija on phone, and we're going to take a stab at migrating business logic. Javadocs we won't do anything more creative than what we have for now. Forgot to ask about Enum.

Got a fix for handling the Enum comparison. I migrated the business logic, and removed a duplicate test in the process...

@gerlowskija I think this is ready for final review!

@epugh
Contributor Author

epugh commented Mar 10, 2026

Oh, and the reason, based on spelunking, why we don't support maxGenerationLag in v2 is that it's a feature of the old leader/follower capability only. See TestHealthCheckHandlerLegacyMode.

epugh added 2 commits March 11, 2026 07:07
I used claude for this regression test and I don't love how verbose it is. I tried a mock approach first and it was worse.
@gerlowskija
Contributor

why we don't support maxGenerationLag in v2 is that it's a feature of the old leader/follower capability only

I'm not sure I follow. Are you saying that you didn't port it to v2 because it's a standalone-only parameter @epugh ?

@epugh
Contributor Author

epugh commented Mar 17, 2026

why we don't support maxGenerationLag in v2 is that it's a feature of the old leader/follower capability only

I'm not sure I follow. Are you saying that you didn't port it to v2 because it's a standalone-only parameter @epugh ?

And that it wasn't in the source V2 api either!

Contributor

@gerlowskija left a comment

Still some things that need to be sorted out IMO, but this is getting much closer!

Could you also expand a bit on the decision to not bring maxGenerationLag into the v2 API? You mentioned there'd been discussion of it, but I can't find it for whatever reason.

// from the old index
if (generationDiff < 0) {
log.warn("core:[{}], generation lag:[{}] is negative.", core, generationDiff);
} else if (generationDiff > maxGenerationLag) {
Contributor

[Q/-1] There's something funky going on in this conditional branch.

The boolean condition and the ret-value are opposite what they are on main. Which I think is OK - the double-reversal "cancels out" so to speak.

But shouldn't the log messages here change in some way? Were the initial log messages here inaccurate and by reversing the condition that triggers them you've "fixed" things? Or is it the opposite - were the initial log messages correct and are now misleading given the change in the triggering logic?

Contributor Author

I didn't look deeply, but the log messages here look like they make sense. So, bug fix?

Contributor

OK, on further thought:

I think you (and Copilot) are correct that the original else-if condition here was out-of-sync with the log message. With those conditions and ret-vals fixed, the log message looks "correct" to me.

If we're suggesting that this is an error case though, maybe the message should be a "warn" to make that clearer. But otherwise I think this is good 👍

Contributor Author

I looked to see if there are any formatting patterns beyond the log.warn in messages and there aren't... can you be more specific? Otherwise I think this is similar to other logged warnings.

// 2) Leader's index is wiped clean and the follower is still showing commit generation
// from the old index
if (generationDiff < 0) {
log.warn("core:[{}], generation lag:[{}] is negative.", core, generationDiff);
Contributor

[Q/-1] The code on main returns false in this condition because the "catch-all" retval on L254 was "false".

Was that an intentional change? Or is this a mistake?

Contributor Author

I checked and copilot highlighted "Corrected a pre-existing bug in isWithinGenerationLag: the condition was generationDiff < maxGenerationLag (inverted); corrected to > maxGenerationLag with correct return values (true = healthy, false = lagging). Also fixed missing slf4j log arguments."

Contributor

[-1] Your response seems to be talking about the codeblock that starts on L217 a line or two below this one, which compares generationDiff and maxGenerationLag. Did you maybe leave this reply in the wrong place?

This codeblock is checking whether generationDiff is negative. So your reply above and the Copilot rationale aren't really relevant AFAICT?

@Test
public void testCloudMode_UnhealthyWhenZkClientClosed() throws Exception {
// Use a fresh node so closing its ZK client does not break the primary cluster node
JettySolrRunner newJetty = cluster.startJettySolrRunner();
Contributor

[-1] I bet @dsmiley would know for sure, but...

AFAIK there's nothing in cluster.startJettySolrRunner() that ensures the new SolrCloud node is connected to ZK, a part of live_nodes, etc. when it returns. For this reason many callers of cluster.startJettySolrRunner() also call cluster.waitForAllNodes() or cluster.waitForNode right after.

So this code has a race condition - if the new node is slow to connect to ZK the healthcheck API call here will (correctly!) fail.

Contributor

Assuming my understanding here is correct (which it may not be - really hoping David corrects me) - I wonder if there's anything we could do to catch this with a linting tool of some kind....

Contributor Author

Isn't there another method that sets up a cluster that could include the correct waitForAllNodes? So we don't get surprised?

And yes, we need linting or more rules in Agents to capture these things.

Contributor Author

I'm asking copilot to look overall at our code base... and fixing this specific case.

*/
@Test
public void testCloudMode_NotInLiveNodes_ThrowsServiceUnavailable() throws Exception {
JettySolrRunner newJetty = cluster.startJettySolrRunner();
Contributor

[-1] Ditto my comment on L77 above about starting a new SolrCloud node but not waiting to ensure it joins the cluster.


assertNotNull(response);
assertEquals(OK, response.status);
assertTrue(
Contributor

[0] Small nit, but this won't actually tell a developer what the 'message' field actually was.

I typically use Hamcrest matchers in these scenarios, as they have really nice error messages on failure by default. e.g.

assertThat(response.message, containsString("maxGenerationLag isn't specified"));

Contributor Author

You need to unleash claude on making this change once and for all across our code base ;-). It's just tokens!


assertEquals("Expected 503 SERVICE_UNAVAILABLE", 503, response.statusCode());
assertTrue(
"v1 error response body must contain status=FAILURE so body-inspecting clients get"
Contributor

[0] Again, Hamcrest matchers would make this a bunch cleaner:

assertThat(response.body, containsString("FAILURE"));

// Enums serialized as their declared name so that javabin/NamedList consumers
// (e.g. HealthCheckHandlerTest comparing against CommonParams.OK == "OK") see
// a plain string rather than "pkg.EnumClass:NAME".
if (o instanceof Enum) {
Contributor

[0] You can avoid the cast by using a newer bit of Java syntax that declares an already-cast variable:

if (o instanceof Enum e) {
    return e.name();
}
...

Contributor Author

That is nicer!
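For reference, the pattern-matching form suggested above can be shown as a tiny self-contained sketch; the class and enum names here are illustrative only, not Solr's:

```java
// Sketch of serializing enums by name using pattern-matching instanceof
// (Java 16+), avoiding an explicit cast.
public class EnumSerializationSketch {

  enum NodeStatus { OK, FAILURE }

  static Object normalize(Object o) {
    if (o instanceof Enum<?> e) {
      return e.name(); // plain "OK", not "pkg.EnumClass:OK"
    }
    return o; // non-enum values pass through unchanged
  }
}
```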

@gerlowskija
Contributor

And that it wasn't in the source V2 api either!

Ah - unless you can find specific discussion of it in JIRA or on the original PR, I'd consider that more of a bug than something we want to replicate. One of the huge shortcomings of the original v2 implementation is that nothing ever kept the v1 APIs in sync with their v2 counterparts. So it's extremely common to find cases where:

  1. A v2 API was added back in 2016 or so.
  2. Years later the v1 API evolved, got new functionality, gained some new parameters, etc.
  3. The v2 API wasn't updated at the time and fell behind.

This dynamic is another huge argument for v2-ifying the actual implementation logic for these APIs. When the v1 code is just a shim around the v2 code, there's no way for the "v2" endpoint to fall behind the way we've seen in the past.

@epugh
Contributor Author

epugh commented Mar 19, 2026

@gerlowskija I have responded/fixed everything EXCEPT the generation lag..

Do you want me to resolve conversations/comments as I fix them, or push up the fix and let you resolve them?

Copilot AI and others added 2 commits March 19, 2026 11:04
- NodeHealthApi: add @QueryParam("maxGenerationLag") Integer maxGenerationLag
  with @parameter description to healthcheck()
- NodeHealth: update healthcheck() to accept and forward maxGenerationLag;
  remove now-redundant checkNodeHealth() bridge method
- HealthCheckHandler: call healthcheck() directly (no more checkNodeHealth())
- NodeApi (generated SolrJ): regenerated - Healthcheck gains setMaxGenerationLag()
  setter and includes the param in getParams()
- NodeHealthStandaloneTest: remove FIXME; test negative-maxGenerationLag via
  the real V2 HTTP path using NodeApi.Healthcheck.setMaxGenerationLag(-1)

Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
- Add new "Monitoring Follower Replication Lag" section to
  user-managed-index-replication.adoc with V1+V2 API examples,
  example responses (success and failure), and a warning about
  using maxGenerationLag=0 in production.
- Update implicit-requesthandlers.adoc Health entry: remove the
  inaccurate "available only in SolrCloud mode" qualifier; add a
  concise description of both SolrCloud params (requireHealthyCores)
  and legacy-mode params (maxGenerationLag) with a cross-reference
  to the new monitoring section.

Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
@epugh
Contributor Author

epugh commented Mar 19, 2026

And we have generation lag! Plus some actual docs about it since it was a completely NEW parameter to me.

@gerlowskija
Contributor

Do you want me to resolve conversations/comments as I fix them, or push up the fix and let you resolve them?

I'm 👍 with you clicking "Resolve Conversation". I trust you're not doing it on stuff that's controversial or whatever, and it saves me from a bunch of additional reading each time I open this PR back up.

(When I'm on the other side of things and doing the editing it also makes for a nice little built-in "To Do" list too... not sure if other users use it that way or not)

Anyway, re-reviewing now...

Contributor

@gerlowskija left a comment

This is super close - I think there's just one last thing that needs to be addressed in NodeHealth.java. Feel free to ignore any of my comments that aren't [-1]

type: added
authors:
- name: Eric Pugh
- name: Jason Gerlowski
Contributor

[Q] (General question; not PR specific)

Is it our convention to add reviewers here as well?

I'm all for that but it'd be great to document in dev-docs/changelog.adoc if that's a convention we'd like to cement!

Contributor Author

I added you because you went beyond review; I mean, this PR changed a lot! Just because you didn't type characters doesn't mean you didn't have a lot of input! I think you authored this as much as I did.


public class NodeHealthTest extends SolrCloudTestCase {

/**
Contributor

[-0] I love that this test is now SolrCloud only, but should its name be changed to indicate that? And maybe it should have Javadocs that link to its standalone counterpart.

e.g.

/**
 * Tests for the node-health API, on SolrCloud clusters
 * 
 * @see NodeHealthStandaloneTest
 */
public class NodeHealthSolrCloudTest extends SolrCloudTestCase { 


@epugh epugh merged commit 5d2eb05 into apache:main Mar 21, 2026
3 of 5 checks passed

Labels

cat:api, client:solrj, documentation, tests


4 participants