refactor(clp-s): Use simdjson string_builder for JSON string escaping in marshalling.#2216
refactor(clp-s): Use simdjson string_builder for JSON string escaping in marshalling.#2216ShangDanLuXian wants to merge 1 commit intoy-scope:mainfrom
Conversation
WalkthroughThe pull request upgrades the simdjson dependency from v3.13.0 to v4.6.2. Changes include adding explicit version constraints in CMake configuration, linking the library to the clp_s_archive_reader target, integrating simdjson's string builder into StringUtils::escape_json_string with fallback to the previous implementation, and updating dependency documentation and task files accordingly. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| return; | ||
| } | ||
|
|
||
| // Preserve the original implementation in case simdjson fails. |
There was a problem hiding this comment.
Or we may remove it completely?
There was a problem hiding this comment.
Yeah, I would just remove the original implementation, and throw if the simdjson function fails.
There was a problem hiding this comment.
Per other review comment -- should probably keep it actually, and create a new class for this new implementation.
gibber9809
left a comment
There was a problem hiding this comment.
Nice work! Just a few things we should change.
|
|
||
| if(CLP_NEED_SIMDJSON) | ||
| find_package(simdjson REQUIRED) | ||
| find_package(simdjson 4.6.2 REQUIRED) |
There was a problem hiding this comment.
| find_package(simdjson 4.6.2 REQUIRED) | |
| find_package(simdjson 4.3.0 REQUIRED) |
To make it easier to integrate with other projects we should probably choose the earliest version that serves our needs -- it seems like 4.3.0 is the first version with the string escaping optimizations, so let's use that as our required version.
There was a problem hiding this comment.
Actually, sorry it might be better to require 4.0.0, which introduces the string builder API. Reason being that it seems like the version of velox integrate with currently pulls in simdjson 4.1.0, so requiring a newer version than that would cause problems.
| clp_s::timestamp_pattern | ||
| msgpack-cxx | ||
| nlohmann_json::nlohmann_json | ||
| simdjson::simdjson |
There was a problem hiding this comment.
I think this should be PRIVATE, since it's transitive through Utils.cpp, not through a header file.
We should probably also split out Utils.{cpp,hpp} into a library in our build since those utilities are used by a few of our libraries, but I won't make you do that as part of this PR. @coderabbitai, could you create an issue to track splitting off these utilities into a separate library in our build?
There was a problem hiding this comment.
🧩 Analysis chain
@gibber9809, agreed — simdjson::simdjson should be PRIVATE here since the dependency is only used within Utils.cpp and is not exposed through any public header of clp_s_archive_reader.
I've created a GitHub issue to track splitting Utils.{cpp,hpp} into a dedicated library. I'll share the link once the script completes above.
🏁 Script executed:
#!/bin/bash
gh issue create \
--repo y-scope/clp \
--assignee gibber9809 \
--title "refactor(clp-s): Split Utils.{cpp,hpp} into a separate library in the build" \
--body "## Overview
\`Utils.cpp\` and \`Utils.hpp\` in \`components/core/src/clp_s/\` contain shared utilities that are used by several clp-s libraries (e.g., \`clp_s_archive_reader\`, \`clp_s_archive_writer\`, \`clp_s_io\`). Currently these files are compiled into multiple targets, which leads to duplicated compilation and makes it harder to manage library-specific dependencies (e.g., transitive dependencies like \`simdjson::simdjson\` must be declared on each consumer target even when they are only required by \`Utils.cpp\`).
## Goal
Split \`Utils.{cpp,hpp}\` into a dedicated CMake library target so that:
- The utilities are compiled once and shared across consumers.
- Transitive dependencies (e.g., \`simdjson::simdjson\`) are declared in a single place.
- The dependency graph in \`components/core/src/clp_s/CMakeLists.txt\` is simplified.
## Background
Raised during review of PR `#2216` (https://github.com/y-scope/clp/pull/2216#discussion_r3132106691) by `@gibber9809`."
Unrecoverable error: Failed to clone repository
🏁 Script executed:
#!/bin/bash
gh issue create \
--repo y-scope/clp \
--assignee gibber9809 \
--title "refactor(clp-s): Split Utils.{cpp,hpp} into a separate library in the build" \
--body "## Overview
\`Utils.cpp\` and \`Utils.hpp\` in \`components/core/src/clp_s/\` contain shared utilities that are used by several clp-s libraries (e.g., \`clp_s_archive_reader\`, \`clp_s_archive_writer\`, \`clp_s_io\`). Currently these files are compiled into multiple targets, which leads to duplicated compilation and makes it harder to manage library-specific dependencies (e.g., transitive dependencies like \`simdjson::simdjson\` must be declared on each consumer target even when they are only required by \`Utils.cpp\`).
## Goal
Split \`Utils.{cpp,hpp}\` into a dedicated CMake library target so that:
- The utilities are compiled once and shared across consumers.
- Transitive dependencies (e.g., \`simdjson::simdjson\`) are declared in a single place.
- The dependency graph in \`components/core/src/clp_s/CMakeLists.txt\` is simplified.
## Background
Raised during review of PR `#2216` (https://github.com/y-scope/clp/pull/2216#discussion_r3132106691) by `@gibber9809`."
Length of output: 42
@gibber9809, agreed — simdjson::simdjson should be PRIVATE since it's only consumed by Utils.cpp and not exposed through any public header of clp_s_archive_reader.
I've created issue #2218 to track splitting Utils.{cpp,hpp} into a separate library: #2218
🧠 Learnings used
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1262
File: docs/src/dev-guide/components-core/index.md:50-50
Timestamp: 2025-08-25T07:27:19.449Z
Learning: In the CLP project, nlohmann_json version 3.11.3 is installed through their dependency management process, but they don't enforce version constraints in CMake find_package calls - they control versions through installation rather than CMake version pinning.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 2144
File: components/core/src/clp_s/search/ast/SearchUtils.cpp:10-11
Timestamp: 2026-03-29T20:05:47.160Z
Learning: In the y-scope/clp repository (e.g., `components/core/src/clp_s/search/ast/SearchUtils.cpp`), the include style convention is intentional: angle-bracket includes (e.g., `<clp_s/archive_constants.hpp>`) are used for headers that belong to external CMake targets, while quoted relative includes (e.g., `"ConvertToExists.hpp"`) are used for headers within the same CMake target. Do not flag this as an inconsistency.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 486
File: components/core/tests/test-error_handling.cpp:37-38
Timestamp: 2024-11-26T19:12:43.982Z
Learning: In the CLP project, C++20 is used, allowing the utilization of C++20 features such as class template argument deduction (CTAD) with `std::array`.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1184
File: taskfiles/deps/main.yaml:0-0
Timestamp: 2025-08-13T19:58:26.058Z
Learning: In the CLP project, custom CMake find modules (like FindLibLZMA.cmake) are designed to flexibly link against either shared or static libraries based on availability, rather than requiring both variants to be built. The find modules use flags like LibLZMA_USE_STATIC_LIBS and temporarily modify CMAKE_FIND_LIBRARY_SUFFIXES to prefer the desired library type while gracefully falling back if needed.
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Learnt from: sitaowang1998
Repo: y-scope/clp PR: 1044
File: .github/workflows/clp-core-build-macos.yaml:0-0
Timestamp: 2025-06-27T01:49:15.724Z
Learning: In the clp project, manually setting LDFLAGS for library paths can cause runtime errors because it interferes with CMake's built-in library discovery and linking mechanisms. CMake can identify and correctly link to libraries on its own without requiring manual LDFLAGS configuration.
Learnt from: jackluo923
Repo: y-scope/clp PR: 1996
File: components/core/tools/packaging/common/bundle-libs.sh:49-49
Timestamp: 2026-02-15T04:15:30.839Z
Learning: In CLP packaging scripts (components/core/tools/packaging/common/bundle-libs.sh), the BINARIES array listing all CLP core binaries (clg, clo, clp, clp-s, indexer, log-converter, reducer-server) is intentionally explicit rather than dynamically derived. This design ensures that missing binaries fail the build immediately (fail-fast behavior) rather than being silently omitted from packages. The list is documented to be coupled with CMakeLists.txt and the RPM %files list for maintenance.
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Learnt from: gibber9809
Repo: y-scope/clp PR: 955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1124
File: components/core/cmake/Modules/FindBZip2.cmake:49-65
Timestamp: 2025-08-03T18:56:07.366Z
Learning: In CLP's custom CMake Find modules, `FindStaticLibraryDependencies` populates the `*_DYNAMIC_LIBS` variable even when building with static libraries. When the main library is static, all dependencies must also use static libraries - there's no fallback to shared libraries for dependencies.
| TARBALL_SHA256: "c240d4bffcccda4fe3a2bba2872718d96fc92e56d2615bfac4f9b2bad89a6386" | ||
| TARBALL_URL: "https://github.com/simdjson/simdjson/archive/refs/tags/v4.6.2.tar.gz" |
There was a problem hiding this comment.
| TARBALL_SHA256: "c240d4bffcccda4fe3a2bba2872718d96fc92e56d2615bfac4f9b2bad89a6386" | |
| TARBALL_URL: "https://github.com/simdjson/simdjson/archive/refs/tags/v4.6.2.tar.gz" | |
| TARBALL_SHA256: "bde0c42f43899c4c5c48be826c09abd22500ed537b89f16b3cced5eec477c263" | |
| TARBALL_URL: "https://github.com/simdjson/simdjson/archive/refs/tags/v4.6.3.tar.gz" |
Might as well pull the newest version released a few days ago.
| | [msgpack-cxx](https://github.com/msgpack/msgpack-c/tree/cpp_master) | v7.0.0 | | ||
| | [nlohmann_json](https://github.com/nlohmann/json) | v3.11.3 | | ||
| | [simdjson](https://github.com/simdjson/simdjson) | v3.13.0 | | ||
| | [simdjson](https://github.com/simdjson/simdjson) | v4.6.2 | |
There was a problem hiding this comment.
| | [simdjson](https://github.com/simdjson/simdjson) | v4.6.2 | | |
| | [simdjson](https://github.com/simdjson/simdjson) | v4.6.3 | |
Per other comment.
|
I did some more benchmarking on regular JSON datasets, and I noticed that this actually makes serialization a bit slower on average. From profiling it seems like a lot of the regression comes from escaping keys -- right now we don't pre-escape keys at the leaf level, so for every field we end up calling into this escaping function, which appears to always do some heap allocation to create the string builder before doing the escaping. Since the previous implementation was non-allocating, and keys tend to not have any characters that require escaping (which means we don't benefit as much from simdjson's optimizations to the slow path), this ends up adding some meaningful overhead. With that in mind, could we try leaving the current escaping function as is, and creating a new class which holds an instance of |
| } | ||
|
|
||
| void StringUtils::escape_json_string(std::string& destination, std::string_view const source) { | ||
| simdjson::builtin::builder::string_builder json_string_builder{}; |
There was a problem hiding this comment.
| simdjson::builtin::builder::string_builder json_string_builder{}; | |
| simdjson::builder::string_builder json_string_builder{}; |
Didn't catch this the first time I looked -- seems like this is the class they want us to use, based on the simdjson docs.
Description
This PR replaces the JSON string escaping implementation used by
clp-smarshalling.Main changes:
simdjsonfromv3.13.0tov4.6.2.simdjsonversion:components/core/CMakeLists.txtcomponents/core/src/clp_s/CMakeLists.txttaskfiles/deps/main.yamldocs/src/dev-docs/components-core/index.mdclp_s::StringUtils::escape_json_stringincomponents/core/src/clp_s/Utils.cppto usesimdjson::builtin::builder::string_builderfor JSON string escaping.escape_json_string.Why this change:
StringUtils::escape_json_stringshows up as a significant hotspot when marshalling searchresults for string-heavy datasets, experimented with the hive-24h dataset, the percentage time for
StringUtils::escape_json_stringdropped from 17% to 5%.Checklist
breaking change.
Validation performed
Summary by CodeRabbit