Vectorize CSV parsing#1056
Conversation
3bb7da4 to
2b0e405
Compare
Add csv-scanner-simd.{c,h} as a self-contained library of AVX2 helpers
that the CSV scanner will use:
* csv_simd_parse() — full record pre-parse for the simple
comma-delimited / RFC4180 case, producing an array of (start, end,
quoted) field offsets in one pass over the input.
* csv_simd_find_either{,3,4}() — first-match search for up to four
single-byte needles, with vector-domain test-zero short-circuit on
blank chunks and per-needle bitmaps materialized only on hit.
* csv_simd_trim_leading_whitespace() / csv_simd_trim_trailing_whitespace()
— 32-byte-at-a-time SIMD whitespace trim on offset ranges.
The classifier broadcasts are spelled out as _mm256_set1_epi8 literals
inside csv_simd_classify_min_64(), which is static inline; with AVX2
and inlining the compiler folds them into RIP-relative loads and hoists
them out of the chunk loop.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Szilard Parrag <szilard.parrag@axoflow.com>
… input_end The hot inner-loop scans in _parse_characters_with_quotation and _parse_value_with_whitespace_and_delimiter used strchrnul()-pair + MIN() to find the first interesting byte (escape/quote/delimiter). That walks the haystack one byte at a time per needle, twice or thrice over the same memory. Replace with csv_simd_find_either() / csv_simd_find_either3() — single pass, 64 bytes at a time, vector-domain test-zero short-circuit on blank chunks. Drop-in semantic equivalent for the strchrnul case. Also cache input_end (= input + strlen(input)) on the scanner at init time and use input_end - src instead of strlen(self->src) per call. strlen on a long syslog record is O(N) and we were paying it for every field. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Szilard Parrag <szilard.parrag@axoflow.com>
Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Szilard Parrag <szilard.parrag@axoflow.com>
Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Szilard Parrag <szilard.parrag@axoflow.com>
…st-path records Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Szilard Parrag <szilard.parrag@axoflow.com>
Signed-off-by: Ferencz, Balint <balint.ferencz@axoflow.com>
Signed-off-by: Ferencz, Balint <balint.ferencz@axoflow.com>
- Conditionally compile SIMD types which are available only on selected platforms. The code shall compile on non-SIMD capable toolchains. The conditions are provided externally (ie. CMake) - Implement and guard SIMD-aware implementations of parser functions - Determine SIMD capabilities runtime and gating the fast path Signed-off-by: Ferencz, Balint <balint.ferencz@axoflow.com>
Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Szilard Parrag <szilard.parrag@axoflow.com>
2b0e405 to
5d38dd3
Compare
Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ferencz, Balint <balint.ferencz@axoflow.com>
ce85180 to
e082452
Compare
a34886a to
d5d1dee
Compare
|
Thanks for the Light related update. I can approve that. |
We may test the fallback path when the code has compiled-in manual SIMD optimization, but at runtime the CPU doesn't support it. Signed-off-by: Ferencz, Balint <balint.ferencz@axoflow.com>
d5d1dee to
b9c8b06
Compare
MrAnno
left a comment
There was a problem hiding this comment.
This PR is not completely ready for review, as it contains a huge amount of probably LLM-generated noise, confusion, and dead code. Let's make some cleanup first.
My first review round is about this cleanup, I'll do a separate review round which will be much more meaningful and I'll concentrate on the actual implementation and perf aspects there.
Sorry for the short and dry comments. They may sound harsh, but I just wanted to get through the LLM-generated stuff as fast as possible and concentrate on the important things afterwards (where I'll be much more thorough and polite, of course).
| #include "csv-scanner-simd.h" | ||
| #include "compat/glib.h" | ||
|
|
||
| /* Per-function target attribute lets GCC and clang emit AVX2 code */ |
There was a problem hiding this comment.
The best comment is the one that is not needed. Clean, self-explanatory code can spare a lot of comments; which has the advantage of not requiring constant updates as the code changes (it won't mislead the reader once it goes stale).
"Comments should not duplicate the code" - based on this "rule", I'll mark the comments which I found unnecessary with *c.
|
|
||
| /* Parser-level chunk: two AVX2 ymm loads (32 B each) → one 64-bit hit | ||
| * bitmap that we iterate with __builtin_ctzll. Not a SIMD-imposed | ||
| * constant; the 64 falls out of "two movemasks fit in a uint64_t". */ |
There was a problem hiding this comment.
*c
I don't really understand what this means. Looks like a generated comment, with a unicode → symbol. I'm sure we can explain this more clearly with code or with a simpler comment near to the code that uses this constant.
I'd be less confused without this comment.
|
|
||
| /* Per-chunk hit bitmaps: bit i is set iff chunk[i] matches the class. | ||
| * `quotes` collapses '"' and '\'' into a single mask; `spaces` does the | ||
| * same for ' ' and '\t'. */ |
There was a problem hiding this comment.
I think the non-redundant valuable information here would be stating that these are bitmaps for finding the important characters within a given chunk.
| __m256i lo = _mm256_loadu_si256((const __m256i *) chunk); | ||
| __m256i hi = _mm256_loadu_si256((const __m256i *) (chunk + SIMD_PARSER_CHUNK_BYTES / 2)); | ||
|
|
||
| /* commas: single character */ |
There was a problem hiding this comment.
*c
Move the declaration of comma next to this call instead.
| self->commas = _combine_mm(_mm256_cmpeq_epi8(lo, comma), | ||
| _mm256_cmpeq_epi8(hi, comma)); | ||
|
|
||
| /* quotes: '"' | '\'' - fuse the OR in vector domain, movemask once per lane */ |
There was a problem hiding this comment.
*c
Move the declaration of (s)quote next to this call instead.
|
|
||
| AVX2_FN | ||
| void | ||
| csv_simd_find_delim_quote_space(const char *start, gsize max_len, gchar delimiter, gchar quote_char, |
|
|
||
| AVX2_FN | ||
| gint | ||
| csv_simd_find_delimiter_or_quote(const char *start, gsize max_len, gchar delimiter, gchar quote_char) |
| if (result.offset >= 0) | ||
| nexthop = self->src + result.offset; | ||
| else | ||
| nexthop = self->src + remaining; /* no match: scan to end of input */ |
There was a problem hiding this comment.
This breaks compatibility on platforms where SIMD operations are not available.
| return -1; | ||
| } | ||
|
|
||
| #else |
There was a problem hiding this comment.
Optional:
I think it's possible to refactor the SIMD code's entry points so that we use only 3-4 short calls that either perform the optimization or signal an error, letting the scanner fall back to the slow path when calling into the SIMD methods from the CSV scanner side. That way, we would no longer need some many NOOP functions, and bugs like the one I found in the previous comment would be easier to catch.
For example:
- Extracting the SIMD-related logic from
_parse_value_with_whitespace_and_delimiter()into a separate function and moving it tocsv-scanner-simd.c. _scan_next_from_fast_path,_trim_fast_path_fields_whitespace, and_can_use_simd_fast_pathshould also go intocsv-scanner-simd.c.
The only code remaining inside csv-scanner.c would be the check for SIMD availability and short one-line calls into the SIMD-optimized paths. Passing CSVScanner *self to these functions is fine. If we want to separate the generic SIMD code that doesn't operate on CSVScanner, we can add a separate simd.c unit, which can be built conditionally as a whole.
|
|
||
| #ifdef SYSLOG_NG_MANUAL_NEON | ||
|
|
||
| #include ??? |
There was a problem hiding this comment.
#error "NEON instructions are currently not supported"
Or even better, we should get rid of all SYSLOG_NG_MANUAL_NEON branches in this PR so that we can compile and use the slow path until we have an actual working implementation.
A TODO comment should be enough.
|
|
||
| /* SIMD fast path for simple unquoted values with default delimiter */ | ||
| if (csv_is_simd_available() && !self->current_quote && !self->options->string_delimiters && !self->options->delimiters && | ||
| self->options->dialect == CSV_SCANNER_ESCAPE_NONE && !(self->options->flags & CSV_SCANNER_STRIP_WHITESPACE)) |
There was a problem hiding this comment.
This can be hidden inside a csv_scanner_can_use_simd_fast_path() or anything similar.
| /* Fall through to handle quote */ | ||
| self->current_quote = *self->src; | ||
| self->src++; | ||
| goto continuation; |
There was a problem hiding this comment.
_parse_unquoted_literal_characters() detects escaping on DEFAULT_DELIM_CHAR, which seems to be missing from here.
What happens with foo"bar,baz? (Klaudia found this one)
The old code didn't allow starting a quote in the middle of the value, the new one does, so it can mess things up.
| gsize remaining = self->input_end - self->src; | ||
| CSVSimdFindResult result; | ||
| /* Find delimiter, double-quote, or single-quote in a single SIMD pass */ | ||
| csv_simd_find_either3(self->src, remaining, DEFAULT_DELIM_CHAR, '"', '\'', &result); |
There was a problem hiding this comment.
quotes_start and quotes_end are configurable. Shouldn't we add this as a condition to running the fast path?
| else if (result.offset == 0) | ||
| { | ||
| /* Delimiter at start */ | ||
| self->src++; |
There was a problem hiding this comment.
We stop at quotes, not just delimiters. I think we shouldn't skip this character unconditionally.
| _find_n(const char *start, gsize max_len, const gchar *chars, int n, | ||
| CSVSimdFindResult *result) | ||
| { | ||
| if (G_LIKELY(cpu_supports_simd)) |
There was a problem hiding this comment.
I think the runtime SIMD detection should be done only once outside of this unit and we should fall back to the original CSV scanner implementation, not to this partial fallback (_scalar_search_impl).
To summarize it:
We shouldn't see multiple cpu_supports_simd and csv_is_simd_available() checks, we should disable the whole thing in the scanner.
|
|
||
| /* Pre-allocate fast_path_fields for at least 16 columns to avoid early reallocations */ | ||
| scanner->fast_path_fields = g_array_sized_new(FALSE, FALSE, sizeof(CSVFieldOfs), 16); | ||
| csv_detect_cpu_features(); |
There was a problem hiding this comment.
Optional: You started caching the CPU feature check, which is a good idea.
This means we could move the detection call into a csv_scanner_global_init() function, which you could call from app_startup(). That runs only once, csv_scanner_init() is called whenever we want to scan a new CSV string.
|
|
||
| AVX2_FN | ||
| void | ||
| csv_simd_find_either(const char *start, gsize max_len, gchar c1, gchar c2, CSVSimdFindResult *result) |
There was a problem hiding this comment.
if max_len is gsize, then all offsets returned should also be gsize.
| * offsets; subsequent scan_next calls dequeue from that array. Bails out | ||
| * to the generic path on cases the classifier's view of quoting can't | ||
| * express (unterminated quoted field, literal content after closing quote | ||
| * like "foo"bar). */ |
This PR adds vectorization to some CSV parsing cases.
You may opt-out with the usage of
-DENABLE_MANUAL_SIMD=Off(CMake) or--enable-manual-simd=no(autotools) flags.The build system then checks the available and implemented SIMD instructions sets and compiles the appropriate code.
At runtime the SIMD blocks may be disabled if the CPU doesn't support them.
You may test this with the following command:
EXTRA_ARGS="--run-under=qemu --qemu-cpu=qemu64 -k panos" make light-check(qemu-user package needs to be installed on Ubuntu)
have_simdbranching from hot path#ifdefsout fromcsv-scanner.c