@GPortas GPortas commented Oct 1, 2025

What this PR does / why we need it:

This pull request introduces pagination mechanisms for the dataset and datafile version summary/difference endpoints.

While the initial scope was limited to adding pagination, a preliminary investigation revealed significant underlying issues that would have made a direct implementation inefficient and unsustainable. The core problems identified were:

  • High Coupling and Poor Separation of Concerns: The existing code lacked clear architectural layering and encapsulation for these use cases, leading to tightly coupled components that are difficult to maintain and extend.

  • Severe Performance Bottlenecks: The implementation relied on fetching bulk data from the database and then post-processing it in Java using multiple nested loops. This approach caused significant performance degradation, especially for datasets or files with a large number of versions.

  • Low Test Coverage: The lack of a comprehensive test suite made it risky to extend/alter the existing functionality without introducing regressions.

Changes Made
Given the issues discovered, the decision was made to perform a comprehensive, end-to-end refactoring of these features. This ensures that the new pagination functionality is built upon a robust, performant, and maintainable foundation.

The key changes include:

  • Architectural Realignment: The entire workflow, from the API endpoint to the data access layer, has been refactored to align with the established Dataverse architecture using Commands and Services. This improves modularity and clarifies responsibilities within the code.

  • Performance Optimization with JPA Criteria: All data processing has been pushed down to the database layer. In-memory processing with loops has been replaced with specific, performant JPA Criteria-based queries. This dramatically improves response times for entities with extensive version histories.

  • Improved Test Coverage: New unit tests have been introduced to cover the refactored code, addressing critical logic that was previously untested. This ensures the stability of the new implementation and simplifies future development.
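As a rough illustration of the "push processing down to the database" idea described above, here is a minimal sketch. It is not the PR's code: it uses Python's `sqlite3` with plain SQL `LIMIT`/`OFFSET` as a stand-in for JPA Criteria queries, and the `version` table, its columns, and the newest-first ordering are assumptions made only for this example.

```python
import sqlite3

# Hypothetical schema: one row per dataset version (not Dataverse's real tables).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE version ("
    "id INTEGER PRIMARY KEY, dataset_id INTEGER, version_number INTEGER)"
)
conn.executemany(
    "INSERT INTO version (dataset_id, version_number) VALUES (?, ?)",
    [(1, n) for n in range(1, 11)],  # dataset 1 has versions 1..10
)


def page_in_memory(dataset_id, limit, offset):
    """Bulk-fetch style: load every row, then slice in application code."""
    rows = conn.execute(
        "SELECT version_number FROM version WHERE dataset_id = ? "
        "ORDER BY version_number DESC",
        (dataset_id,),
    ).fetchall()
    return [row[0] for row in rows][offset:offset + limit]


def page_in_database(dataset_id, limit, offset):
    """Pushed-down style: the database returns only the requested page."""
    rows = conn.execute(
        "SELECT version_number FROM version WHERE dataset_id = ? "
        "ORDER BY version_number DESC LIMIT ? OFFSET ?",
        (dataset_id, limit, offset),
    ).fetchall()
    return [row[0] for row in rows]
```

Both functions return the same page, but the second one never materializes the full version history in application memory, which is the property that matters for entities with many versions.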

The intermittent 500 errors reported in #11561 have also been resolved. These errors were caused by a null pointer exception resulting from null datafiles. Below, the error trace is shown, followed by a screenshot of the error being reproduced, and then another screenshot taken after deploying this PR branch.

Reproduced error:
[Screenshot 2025-10-14 at 16 40 13]
[Screenshot 2025-10-14 at 16 40 47]

Fixed:

Which issue(s) this PR closes:

Sharing some thoughts...:

Let's keep our focus on making sure our codebase stays healthy and easy to work with.

Sticking to our layered architecture is key. It's what makes it way easier for everyone to jump in and make changes without breaking things.

Let's also be serious about our tests. Good unit tests prove the little pieces work, and API tests make sure the whole thing hangs together. Both are necessary.

Finally, let's all try to follow the 'campsite rule' with tech debt: leave the code a little cleaner than you found it. If you spot something you can quickly improve, do it. It's a small effort that saves us from huge headaches down the road.

Suggestions on how to test this:

To test the performance enhancements, run this branch on an installation that has a dataset or file with a large number of versions, call the endpoints below without pagination, and compare the response times with develop.

You can control pagination of the results using the following optional query parameters.

  • limit: The maximum number of version differences to return.
  • offset: The number of version differences to skip from the beginning of the list. Used for retrieving subsequent pages of results.

For example, to get the second page of results, with 2 items per page, you would use limit=2 and offset=2 (skipping the first two results).
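The relationship between a page number and the `offset` parameter can be sketched as follows. The helper name `page_to_offset` and the 1-based page numbering are assumptions for illustration; the API itself only takes `limit` and `offset`.

```python
def page_to_offset(page, limit):
    """Return the offset for a 1-based page number and a page size.

    Hypothetical helper: page 1 starts at offset 0, page 2 at offset
    `limit`, and so on.
    """
    if page < 1:
        raise ValueError("page must be 1 or greater")
    if limit < 1:
        raise ValueError("limit must be 1 or greater")
    return (page - 1) * limit
```

With `limit=2`, `page_to_offset(2, 2)` gives `2`, matching the example above.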

For datasets:

curl -H "X-Dataverse-key: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X GET "https://demo.dataverse.org/api/datasets/:persistentId/versions/compareSummary?persistentId=doi:10.5072/FK2/BCCP9Z&limit=2&offset=2"

For files:

curl -H "X-Dataverse-key: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X GET "https://demo.dataverse.org/api/files/1234/versionDifferences?limit=2&offset=2"
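For repeated timing comparisons against develop, the curl calls above could be scripted. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the URL paths and the `X-Dataverse-key` header are taken from the curl examples. It does not assume anything about the shape of the response body.

```python
import time
import urllib.parse
import urllib.request


def _with_paging(url, params, limit=None, offset=None):
    """Append optional limit/offset query parameters to a URL."""
    params = dict(params)
    if limit is not None:
        params["limit"] = limit
    if offset is not None:
        params["offset"] = offset
    # Keep ":" and "/" unescaped so a DOI reads as in the curl example.
    query = urllib.parse.urlencode(params, safe=":/")
    return f"{url}?{query}" if query else url


def compare_summary_url(base_url, persistent_id, limit=None, offset=None):
    """Build the dataset version summary URL shown in the curl example."""
    url = f"{base_url}/api/datasets/:persistentId/versions/compareSummary"
    return _with_paging(url, {"persistentId": persistent_id}, limit, offset)


def version_differences_url(base_url, file_id, limit=None, offset=None):
    """Build the file version differences URL shown in the curl example."""
    url = f"{base_url}/api/files/{file_id}/versionDifferences"
    return _with_paging(url, {}, limit, offset)


def timed_get(url, api_token):
    """Issue a GET request and return (elapsed_seconds, raw_body_bytes)."""
    request = urllib.request.Request(url, headers={"X-Dataverse-key": api_token})
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = response.read()
    return time.perf_counter() - start, body
```

Calling `timed_get(compare_summary_url(base_url, persistent_id), api_token)` against a develop deployment and then against this branch, each several times, gives a simple basis for comparing unpaginated response times.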

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No

Is there a release notes update needed for this change?:

Yes, attached.

@GPortas GPortas force-pushed the 11855-version-summaries-pagination branch from 943da66 to b3c4fbd on October 2, 2025 18:21

@GPortas GPortas changed the title from "Version summaries pagination" to "Version summaries enhancements and pagination" on Oct 6, 2025

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:11855-version-summaries-pagination
ghcr.io/gdcc/configbaker:11855-version-summaries-pagination

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.
