Skip to content

Conversation

@gupadhyaya
Copy link
Contributor

@gupadhyaya gupadhyaya commented Nov 4, 2025

Summary

Adds bidirectional cross-version API compatibility tests using tastora to detect breaking API changes between versions.

How It Works

  1. Current client → Old server: Uses current codebase's RPC client to connect to an old server version (via Docker image)
  2. Old client → Current server: Compiles client code from an old version inside Docker and connects to current server (built from current branch)

Both directions must pass to ensure API compatibility.

Test Coverage

Tests all major API modules: Node, Header, State, P2P, Share, DAS, and Blob APIs across both bridge and light nodes.

Changes

  • Added cross-version compatibility test suite (nodebuilder/tests/tastora/api/cross_version_client_test.go)
  • Enhanced tastora framework with version-aware node creation (NewBridgeNodeWithVersion, NewLightNodeWithVersion)
  • Added buildCurrentNodeImage() to build Docker images from current codebase
  • Added GitHub Actions workflow for CI integration
  • Reverted local changes to transaction_test.go to match upstream

Running Tests

cd nodebuilder/tests/tastora/api
go test -tags=integration -v -run TestCrossVersionClientTestSuite/TestCrossVersionBidirectional -timeout 30m

@github-actions github-actions bot added the kind:break! Attached to breaking PRs label Nov 4, 2025
renaynay and others added 26 commits November 17, 2025 09:37
…ctx is now used to control lifecycle of tx workers (#4634)
…ctx is now used to control lifecycle of tx workers (#4635)
- Add TearDownSuite method to BlobTestSuite for proper cleanup
- Add comprehensive parallel transaction test in TransactionTestSuite
- Add WithTxWorkerAccounts option to Framework for configuring worker accounts
- Add TxWorkerAccounts field to Config struct with validation
- Test verifies --tx.worker.accounts feature works correctly
- Includes proper error handling and detailed logging for debugging
- Update to use v0.28.1-arabica which contains fixes
- This version includes the queued submission feature with bug fixes
@gupadhyaya gupadhyaya requested a review from walldiss as a code owner November 18, 2025 16:03
@gupadhyaya gupadhyaya changed the base branch from rene/queued_sub_test to main November 18, 2025 16:11
Copy link
Member

@renaynay renaynay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments here. The code in cross_version_client is a bit confusing and i'd like to reduce it to the bare minimum necessary to run the tests.

The tests definitely pass locally?

return rms.Stat(), nil
stat := rms.Stat()

// Sanitize peer IDs to ensure backward compatibility with old clients that cannot decode certain peer ID formats.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait what happened here?

// Use a longer timeout to handle Docker builds and all test combinations
ctx, cancel := context.WithTimeout(context.Background(), 20*time.Minute)
defer cancel()
oldVersion := "v0.28.3-arabica"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we could grab this from somewhere (inject from CI for example) bc otherwise we'll have to hardcode it constantly. This can be a FLUP though - but pls track it.


// waitForNodeReadyAndSynced waits for a node to be fully ready and synced before testing APIs.
// This ensures APIs will work without needing individual timeouts.
func (s *CrossVersionClientTestSuite) waitForNodeReadyAndSynced(ctx context.Context, client *rpcclient.Client, nodeName string, timeout time.Duration) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a method on the framework itself

and it can definitely be cleaned up: instead of polling, just use SyncWait.

Also shares available not a necessary pre-check for this test.

please also remove all the AI-generated comments from the code 😭

Comment on lines 275 to 293
// Share APIs - node should be synced by now, but handle "data not available" gracefully for compatibility
err = client.Share.SharesAvailable(ctx, head.Height())
if err != nil && strings.Contains(err.Error(), "data not available") {
s.T().Logf("Share.SharesAvailable: data not available (known limitation with old light servers): %v", err)
} else {
require.NoError(s.T(), err)
}
_, err = client.Share.GetNamespaceData(ctx, head.Height(), namespace)
if err != nil && strings.Contains(err.Error(), "data not available") {
s.T().Logf("Share.GetNamespaceData: data not available (known limitation with old light servers): %v", err)
} else {
require.NoError(s.T(), err)
}
_, err = client.Share.GetEDS(ctx, head.Height())
if err != nil && strings.Contains(err.Error(), "data not available") {
s.T().Logf("Share.GetEDS: data not available (known limitation with old light servers): %v", err)
} else {
require.NoError(s.T(), err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we should test for expected success case and failure case (errors expected for example)

instead of allowing silent failure.

}

waitCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
err = client.DAS.WaitCatchUp(waitCtx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we don't need to test the wait methods bc they're just blocking methods that return err in case ctx deadline exceeded.

}

tmpDir := s.T().TempDir()
testProgram := s.generateTestClientProgram(serverRPCAddr, skipGetRow)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

Copy link
Collaborator

@chatton chatton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mainly just reviewing the approach of dynamic code generation and image building in this PR in cross_version_client_test.go

If my understanding is correct, what we are attempting to achieve here is creating a set of tests that allows us to use old client code with a new server version, and the other way around.

I don't think the current approach is the way this problem should be solved.

  • It adds a lot of docker logic that isn't really relevant for the scope of the test
  • the dynamic go code is kind of hard to follow and we don't get any type safety with it.
  • it's also a bit weird and unintuitive.

In my opinion, a cleaner approach would be to create a separate binary that is a simple CLI application that performs these checks. It would be built and tagged on every PR/tag. (This would need backported and be built for previous versions, but that would be a one time thing )

It could be run like docker run ghcr.io/celestiaorg/celestia-node-compat-test:v6.1.2-arabica compat-test --rpc-url <some-url>

With this --rpc-url command, all you need is 2 tags, one for your celestia-node and one for the compat-test container.

In this PR, there could be a type added that embeds the tastora *container.Node. This is a type that allows easy access to all the docker volume / exec functionality so none of that needs to be implemented here.

The go code in this PR, can just create an instance of CompatTester or whatever the struct is called. Run it with the --rpc-url of whatever node version you want, and then just check the exit code and stderr of the container. If all assertions pass it will be exit code 0 and the test passes, otherwise you get an error message to display and the go test will fail.

@renaynay
Copy link
Member

Thank you for the feedback here @chatton

Copy link
Collaborator

@chatton chatton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking a lot better! Left a few suggestions for additional improvements

s.T().Logf("Failed to pull image %s: %v", imageName, pullErr)
s.T().Logf("Attempting to build image locally as fallback...")

if buildErr := s.buildCompatTestImageLocally(ctx, clientVersion, imageName); buildErr != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for compat tests, I would imagine we never actually want to build the image locally, it should always be something that has been built from a tag, or this PR (prior to this go test running, so as a previous step in the workflow)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we intend to run compatibility test for every PR then we may need to build image locally right? PR branch and the branch on top of which it is based on? not necessarily always against a release version.

NetworkMode: container.NetworkMode(networkID),
}

createResp, err := dockerClient.ContainerCreate(ctx, config, hostConfig, nil, nil, containerName)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than doing a raw container creation, take a look at how the hermes struct is implemented, you can create a type embedding *container.Node and get lots of functionality out of the box

@@ -0,0 +1,192 @@
package main
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this is much nicer! I would recommend however that you pull this out into a separate PR, so that it is easy to backport to all the relevant release branches.

s.testAllAPIsWithOptions(ctx, client, false)
}

func (s *CrossVersionClientTestSuite) testAllAPIsWithOptions(ctx context.Context, client *rpcclient.Client, skipGetRow bool) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need this anymore because we can replace it with using the container of whatever this PR is, that way all this compat logic is just in one place.

Comment on lines +329 to +370
func (s *CrossVersionClientTestSuite) buildCompatTestImageLocally(ctx context.Context, version, imageName string) error {
s.T().Logf("Building compat-test image locally for version %s...", version)

repoRoot := s.getRepoRoot()
currentCommitCmd := exec.CommandContext(ctx, "git", "rev-parse", "HEAD")
currentCommitCmd.Dir = repoRoot
currentCommit, err := currentCommitCmd.Output()
if err != nil {
return fmt.Errorf("failed to get current commit: %w", err)
}
currentCommitStr := strings.TrimSpace(string(currentCommit))

defer func() {
restoreCmd := exec.CommandContext(ctx, "git", "checkout", currentCommitStr)
restoreCmd.Dir = repoRoot
_ = restoreCmd.Run()
}()

checkoutCmd := exec.CommandContext(ctx, "git", "checkout", version)
checkoutCmd.Dir = repoRoot
if output, err := checkoutCmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to checkout version %s: %w. Output: %s. Note: compat-test code may need to be backported to this version", version, err, string(output))
}

dockerfilePath := filepath.Join(repoRoot, "cmd/compat-test/Dockerfile")
if _, err := os.Stat(dockerfilePath); os.IsNotExist(err) {
return fmt.Errorf("cmd/compat-test/Dockerfile not found in version %s. The compat-test code needs to be backported to this version", version)
}

buildCmd := exec.CommandContext(ctx, "docker", "build",
"-f", "cmd/compat-test/Dockerfile",
"-t", imageName,
".",
)
buildCmd.Dir = repoRoot
output, err := buildCmd.CombinedOutput()
if err != nil {
return fmt.Errorf("docker build failed: %w\nOutput: %s", err, string(output))
}

return nil
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this at all, we shouldn't need to execute any docker/git commands via cli like this, IMO the cleanest approach is just making sure that the images exist before the test runs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind:break! Attached to breaking PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants