Skip to content

refactor(clp-s): Use simdjson string_builder for JSON string escaping in marshalling.#2216

Open
ShangDanLuXian wants to merge 1 commit intoy-scope:mainfrom
ShangDanLuXian:simdjson_marshalling
Open

refactor(clp-s): Use simdjson string_builder for JSON string escaping in marshalling.#2216
ShangDanLuXian wants to merge 1 commit intoy-scope:mainfrom
ShangDanLuXian:simdjson_marshalling

Conversation

@ShangDanLuXian
Copy link
Copy Markdown
Contributor

@ShangDanLuXian ShangDanLuXian commented Apr 22, 2026

Description

This PR replaces the JSON string escaping implementation used by
clp-s marshalling.

Main changes:

  • Bump simdjson from v3.13.0 to v4.6.2.
  • Update build/dependency wiring for the new simdjson version:
    • components/core/CMakeLists.txt
    • components/core/src/clp_s/CMakeLists.txt
    • taskfiles/deps/main.yaml
    • docs/src/dev-docs/components-core/index.md
  • Update clp_s::StringUtils::escape_json_string in components/core/src/clp_s/Utils.cpp to use
    simdjson::builtin::builder::string_builder for JSON string escaping.
  • Keep the previous scalar implementation as a fallback inside escape_json_string.

Why this change:

  • StringUtils::escape_json_string shows up as a significant hotspot when marshalling search
    results for string-heavy datasets, experimented with the hive-24h dataset, the percentage time for StringUtils::escape_json_string dropped from 17% to 5%.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Performed sanity check on search results for the hive-24h and mongodb datasets.

Summary by CodeRabbit

  • Chores
    • Upgraded core dependency to version 4.6.2 with stricter version constraints in the build configuration.
    • Enhanced JSON handling integration within the archive reader component.

@ShangDanLuXian ShangDanLuXian requested review from a team and gibber9809 as code owners April 22, 2026 16:35
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 22, 2026

Walkthrough

The pull request upgrades the simdjson dependency from v3.13.0 to v4.6.2. Changes include adding explicit version constraints in CMake configuration, linking the library to the clp_s_archive_reader target, integrating simdjson's string builder into StringUtils::escape_json_string with fallback to the previous implementation, and updating dependency documentation and task files accordingly.

Changes

Cohort / File(s) Summary
CMake Build Configuration
components/core/CMakeLists.txt, components/core/src/clp_s/CMakeLists.txt
Added explicit minimum version constraint (4.6.2) to simdjson dependency discovery and linked simdjson::simdjson library to clp_s_archive_reader target.
String Escaping Implementation
components/core/src/clp_s/Utils.cpp
Modified StringUtils::escape_json_string to attempt JSON escaping using simdjson's built-in string builder with automatic fallback to the original character-by-character escaping logic on failure.
Dependency Management
docs/src/dev-docs/components-core/index.md, taskfiles/deps/main.yaml
Updated simdjson dependency specification from v3.13.0 to v4.6.2, including tarball URL and integrity hash in task configuration and documentation version reference.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically describes the main change: replacing JSON string escaping logic with simdjson's string_builder in marshalling operations.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

return;
}

// Preserve the original implementation in case simdjson fails.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we may remove it completely?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I would just remove the original implementation, and throw if the simdjson function fails.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per other review comment -- should probably keep it actually, and create a new class for this new implementation.

Copy link
Copy Markdown
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Just a few things we should change.


if(CLP_NEED_SIMDJSON)
find_package(simdjson REQUIRED)
find_package(simdjson 4.6.2 REQUIRED)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
find_package(simdjson 4.6.2 REQUIRED)
find_package(simdjson 4.3.0 REQUIRED)

To make it easier to integrate with other projects we should probably choose the earliest version that serves our needs -- it seems like 4.3.0 is the first version with the string escaping optimizations, so let's use that as our required version.

Copy link
Copy Markdown
Contributor

@gibber9809 gibber9809 Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, sorry it might be better to require 4.0.0, which introduces the string builder API. Reason being that it seems like the version of velox integrate with currently pulls in simdjson 4.1.0, so requiring a newer version than that would cause problems.

clp_s::timestamp_pattern
msgpack-cxx
nlohmann_json::nlohmann_json
simdjson::simdjson
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be PRIVATE, since it's transitive through Utils.cpp, not through a header file.

We should probably also split out Utils.{cpp,hpp} into a library in our build since those utilities are used by a few of our libraries, but I won't make you do that as part of this PR. @coderabbitai, could you create an issue to track splitting off these utilities into a separate library in our build?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

@gibber9809, agreed — simdjson::simdjson should be PRIVATE here since the dependency is only used within Utils.cpp and is not exposed through any public header of clp_s_archive_reader.

I've created a GitHub issue to track splitting Utils.{cpp,hpp} into a dedicated library. I'll share the link once the script completes above.


🏁 Script executed:

#!/bin/bash
gh issue create \
  --repo y-scope/clp \
  --assignee gibber9809 \
  --title "refactor(clp-s): Split Utils.{cpp,hpp} into a separate library in the build" \
  --body "## Overview

\`Utils.cpp\` and \`Utils.hpp\` in \`components/core/src/clp_s/\` contain shared utilities that are used by several clp-s libraries (e.g., \`clp_s_archive_reader\`, \`clp_s_archive_writer\`, \`clp_s_io\`). Currently these files are compiled into multiple targets, which leads to duplicated compilation and makes it harder to manage library-specific dependencies (e.g., transitive dependencies like \`simdjson::simdjson\` must be declared on each consumer target even when they are only required by \`Utils.cpp\`).

## Goal

Split \`Utils.{cpp,hpp}\` into a dedicated CMake library target so that:
- The utilities are compiled once and shared across consumers.
- Transitive dependencies (e.g., \`simdjson::simdjson\`) are declared in a single place.
- The dependency graph in \`components/core/src/clp_s/CMakeLists.txt\` is simplified.

## Background

Raised during review of PR `#2216` (https://github.com/y-scope/clp/pull/2216#discussion_r3132106691) by `@gibber9809`."

Unrecoverable error: Failed to clone repository


🏁 Script executed:

#!/bin/bash
gh issue create \
  --repo y-scope/clp \
  --assignee gibber9809 \
  --title "refactor(clp-s): Split Utils.{cpp,hpp} into a separate library in the build" \
  --body "## Overview

\`Utils.cpp\` and \`Utils.hpp\` in \`components/core/src/clp_s/\` contain shared utilities that are used by several clp-s libraries (e.g., \`clp_s_archive_reader\`, \`clp_s_archive_writer\`, \`clp_s_io\`). Currently these files are compiled into multiple targets, which leads to duplicated compilation and makes it harder to manage library-specific dependencies (e.g., transitive dependencies like \`simdjson::simdjson\` must be declared on each consumer target even when they are only required by \`Utils.cpp\`).

## Goal

Split \`Utils.{cpp,hpp}\` into a dedicated CMake library target so that:
- The utilities are compiled once and shared across consumers.
- Transitive dependencies (e.g., \`simdjson::simdjson\`) are declared in a single place.
- The dependency graph in \`components/core/src/clp_s/CMakeLists.txt\` is simplified.

## Background

Raised during review of PR `#2216` (https://github.com/y-scope/clp/pull/2216#discussion_r3132106691) by `@gibber9809`."

Length of output: 42


@gibber9809, agreed — simdjson::simdjson should be PRIVATE since it's only consumed by Utils.cpp and not exposed through any public header of clp_s_archive_reader.

I've created issue #2218 to track splitting Utils.{cpp,hpp} into a separate library: #2218


🧠 Learnings used
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1262
File: docs/src/dev-guide/components-core/index.md:50-50
Timestamp: 2025-08-25T07:27:19.449Z
Learning: In the CLP project, nlohmann_json version 3.11.3 is installed through their dependency management process, but they don't enforce version constraints in CMake find_package calls - they control versions through installation rather than CMake version pinning.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 2144
File: components/core/src/clp_s/search/ast/SearchUtils.cpp:10-11
Timestamp: 2026-03-29T20:05:47.160Z
Learning: In the y-scope/clp repository (e.g., `components/core/src/clp_s/search/ast/SearchUtils.cpp`), the include style convention is intentional: angle-bracket includes (e.g., `<clp_s/archive_constants.hpp>`) are used for headers that belong to external CMake targets, while quoted relative includes (e.g., `"ConvertToExists.hpp"`) are used for headers within the same CMake target. Do not flag this as an inconsistency.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 486
File: components/core/tests/test-error_handling.cpp:37-38
Timestamp: 2024-11-26T19:12:43.982Z
Learning: In the CLP project, C++20 is used, allowing the utilization of C++20 features such as class template argument deduction (CTAD) with `std::array`.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1184
File: taskfiles/deps/main.yaml:0-0
Timestamp: 2025-08-13T19:58:26.058Z
Learning: In the CLP project, custom CMake find modules (like FindLibLZMA.cmake) are designed to flexibly link against either shared or static libraries based on availability, rather than requiring both variants to be built. The find modules use flags like LibLZMA_USE_STATIC_LIBS and temporarily modify CMAKE_FIND_LIBRARY_SUFFIXES to prefer the desired library type while gracefully falling back if needed.

Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Learnt from: sitaowang1998
Repo: y-scope/clp PR: 1044
File: .github/workflows/clp-core-build-macos.yaml:0-0
Timestamp: 2025-06-27T01:49:15.724Z
Learning: In the clp project, manually setting LDFLAGS for library paths can cause runtime errors because it interferes with CMake's built-in library discovery and linking mechanisms. CMake can identify and correctly link to libraries on its own without requiring manual LDFLAGS configuration.

Learnt from: jackluo923
Repo: y-scope/clp PR: 1996
File: components/core/tools/packaging/common/bundle-libs.sh:49-49
Timestamp: 2026-02-15T04:15:30.839Z
Learning: In CLP packaging scripts (components/core/tools/packaging/common/bundle-libs.sh), the BINARIES array listing all CLP core binaries (clg, clo, clp, clp-s, indexer, log-converter, reducer-server) is intentionally explicit rather than dynamically derived. This design ensures that missing binaries fail the build immediately (fail-fast behavior) rather than being silently omitted from packages. The list is documented to be coupled with CMakeLists.txt and the RPM %files list for maintenance.

Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Learnt from: gibber9809
Repo: y-scope/clp PR: 955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1124
File: components/core/cmake/Modules/FindBZip2.cmake:49-65
Timestamp: 2025-08-03T18:56:07.366Z
Learning: In CLP's custom CMake Find modules, `FindStaticLibraryDependencies` populates the `*_DYNAMIC_LIBS` variable even when building with static libraries. When the main library is static, all dependencies must also use static libraries - there's no fallback to shared libraries for dependencies.

Comment thread taskfiles/deps/main.yaml
Comment on lines +486 to +487
TARBALL_SHA256: "c240d4bffcccda4fe3a2bba2872718d96fc92e56d2615bfac4f9b2bad89a6386"
TARBALL_URL: "https://github.com/simdjson/simdjson/archive/refs/tags/v4.6.2.tar.gz"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TARBALL_SHA256: "c240d4bffcccda4fe3a2bba2872718d96fc92e56d2615bfac4f9b2bad89a6386"
TARBALL_URL: "https://github.com/simdjson/simdjson/archive/refs/tags/v4.6.2.tar.gz"
TARBALL_SHA256: "bde0c42f43899c4c5c48be826c09abd22500ed537b89f16b3cced5eec477c263"
TARBALL_URL: "https://github.com/simdjson/simdjson/archive/refs/tags/v4.6.3.tar.gz"

Might as well pull the newest version released a few days ago.

| [msgpack-cxx](https://github.com/msgpack/msgpack-c/tree/cpp_master) | v7.0.0 |
| [nlohmann_json](https://github.com/nlohmann/json) | v3.11.3 |
| [simdjson](https://github.com/simdjson/simdjson) | v3.13.0 |
| [simdjson](https://github.com/simdjson/simdjson) | v4.6.2 |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| [simdjson](https://github.com/simdjson/simdjson) | v4.6.2 |
| [simdjson](https://github.com/simdjson/simdjson) | v4.6.3 |

Per other comment.

@gibber9809
Copy link
Copy Markdown
Contributor

I did some more benchmarking on regular JSON datasets, and I noticed that this actually makes serialization a bit slower on average.

From profiling it seems like a lot of the regression comes from escaping keys -- right now we don't pre-escape keys at the leaf level, so for every field we end up calling into this escaping function, which appears to always do some heap allocation to create the string builder before doing the escaping. Since the previous implementation was non-allocating, and keys tend to not have any characters that require escaping (which means we don't benefit as much from simdjson's optimizations to the slow path), this ends up adding some meaningful overhead.

With that in mind, could we try leaving the current escaping function as is, and creating a new class which holds an instance of simdjson::string_builder w/ an escaping function that reuses that string builder instance? That way we can avoid the allocations on every invocation of the escaping function by just passing around an instance of that new class. It may also be worth testing whether we get better decompression speed if we only use the simdjson-based implementation for string values, or if it makes sense to use it to escape both keys and values.

}

void StringUtils::escape_json_string(std::string& destination, std::string_view const source) {
simdjson::builtin::builder::string_builder json_string_builder{};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
simdjson::builtin::builder::string_builder json_string_builder{};
simdjson::builder::string_builder json_string_builder{};

Didn't catch this the first time I looked -- seems like this is the class they want us to use, based on the simdjson docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants