Skip to content

Vectorize CSV parsing#1056

Draft
fernya wants to merge 11 commits into
mainfrom
simd_aggressive_optimization
Draft

Vectorize CSV parsing#1056
fernya wants to merge 11 commits into
mainfrom
simd_aggressive_optimization

Conversation

@fernya
Copy link
Copy Markdown
Contributor

@fernya fernya commented May 4, 2026

This PR adds vectorization to some CSV parsing cases.

You may opt-out with the usage of -DENABLE_MANUAL_SIMD=Off (CMake) or --enable-manual-simd=no (autotools) flags.
The build system then checks the available and implemented SIMD instructions sets and compiles the appropriate code.

At runtime the SIMD blocks may be disabled if the CPU doesn't support them.
You may test this with the following command: EXTRA_ARGS="--run-under=qemu --qemu-cpu=qemu64 -k panos" make light-check
(qemu-user package needs to be installed on Ubuntu)

  • Remove have_simd branching from hot path
  • Move all #ifdefs out from csv-scanner.c

@fernya fernya force-pushed the simd_aggressive_optimization branch 2 times, most recently from 3bb7da4 to 2b0e405 Compare May 6, 2026 12:53
OverOrion and others added 9 commits May 6, 2026 15:07
Add csv-scanner-simd.{c,h} as a self-contained library of AVX2 helpers
that the CSV scanner will use:

  * csv_simd_parse() — full record pre-parse for the simple
    comma-delimited / RFC4180 case, producing an array of (start, end,
    quoted) field offsets in one pass over the input.

  * csv_simd_find_either{,3,4}() — first-match search for up to four
    single-byte needles, with vector-domain test-zero short-circuit on
    blank chunks and per-needle bitmaps materialized only on hit.

  * csv_simd_trim_leading_whitespace() / csv_simd_trim_trailing_whitespace()
    — 32-byte-at-a-time SIMD whitespace trim on offset ranges.

The classifier broadcasts are spelled out as _mm256_set1_epi8 literals
inside csv_simd_classify_min_64(), which is static inline; with AVX2
and inlining the compiler folds them into RIP-relative loads and hoists
them out of the chunk loop.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Szilard Parrag <szilard.parrag@axoflow.com>
… input_end

The hot inner-loop scans in
  _parse_characters_with_quotation and
  _parse_value_with_whitespace_and_delimiter used strchrnul()-pair +
  MIN() to find the first interesting byte (escape/quote/delimiter).
  That walks the haystack one byte at a time per needle, twice or thrice
  over the same memory.

  Replace with csv_simd_find_either() / csv_simd_find_either3() — single
  pass, 64 bytes at a time, vector-domain test-zero short-circuit on
  blank chunks.  Drop-in semantic equivalent for the strchrnul case.

  Also cache input_end (= input + strlen(input)) on the scanner at init
  time and use input_end - src instead of strlen(self->src) per call.
  strlen on a long syslog record is O(N) and we were paying it for every
  field.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Szilard Parrag <szilard.parrag@axoflow.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Szilard Parrag <szilard.parrag@axoflow.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Szilard Parrag <szilard.parrag@axoflow.com>
…st-path records

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Szilard Parrag <szilard.parrag@axoflow.com>
Signed-off-by: Ferencz, Balint <balint.ferencz@axoflow.com>
Signed-off-by: Ferencz, Balint <balint.ferencz@axoflow.com>
- Conditionally compile SIMD types which are available only on selected platforms. The code shall compile on non-SIMD capable toolchains. The conditions are provided externally (ie. CMake)
- Implement and guard SIMD-aware implementations of parser functions
- Determine SIMD capabilities runtime and gating the fast path

Signed-off-by: Ferencz, Balint <balint.ferencz@axoflow.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Szilard Parrag <szilard.parrag@axoflow.com>
@OverOrion OverOrion force-pushed the simd_aggressive_optimization branch from 2b0e405 to 5d38dd3 Compare May 6, 2026 13:09
Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ferencz, Balint <balint.ferencz@axoflow.com>
@fernya fernya force-pushed the simd_aggressive_optimization branch 3 times, most recently from ce85180 to e082452 Compare May 6, 2026 19:29
@fernya fernya marked this pull request as ready for review May 7, 2026 05:47
@fernya fernya requested a review from mitzkia May 7, 2026 10:48
@fernya fernya force-pushed the simd_aggressive_optimization branch from a34886a to d5d1dee Compare May 7, 2026 12:01
@mitzkia
Copy link
Copy Markdown
Contributor

mitzkia commented May 7, 2026

Thanks for the Light related update. I can approve that.

We may test the fallback path when the code has
compiled-in manual SIMD optimization, but at
runtime the CPU doesn't support it.

Signed-off-by: Ferencz, Balint <balint.ferencz@axoflow.com>
@fernya fernya force-pushed the simd_aggressive_optimization branch from d5d1dee to b9c8b06 Compare May 7, 2026 12:12
Copy link
Copy Markdown
Contributor

@MrAnno MrAnno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is not completely ready for review, as it contains a huge amount of probably LLM-generated noise, confusion, and dead code. Let's make some cleanup first.

My first review round is about this cleanup, I'll do a separate review round which will be much more meaningful and I'll concentrate on the actual implementation and perf aspects there.

Sorry for the short and dry comments. They may sound harsh, but I just wanted to get through the LLM-generated stuff as fast as possible and concentrate on the important things afterwards (where I'll be much more thorough and polite, of course).

#include "csv-scanner-simd.h"
#include "compat/glib.h"

/* Per-function target attribute lets GCC and clang emit AVX2 code */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best comment is the one that is not needed. Clean, self-explanatory code can spare a lot of comments; which has the advantage of not requiring constant updates as the code changes (it won't mislead the reader once it goes stale).

"Comments should not duplicate the code" - based on this "rule", I'll mark the comments which I found unnecessary with *c.


/* Parser-level chunk: two AVX2 ymm loads (32 B each) → one 64-bit hit
* bitmap that we iterate with __builtin_ctzll. Not a SIMD-imposed
* constant; the 64 falls out of "two movemasks fit in a uint64_t". */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*c

I don't really understand what this means. Looks like a generated comment, with a unicode symbol. I'm sure we can explain this more clearly with code or with a simpler comment near to the code that uses this constant.

I'd be less confused without this comment.


/* Per-chunk hit bitmaps: bit i is set iff chunk[i] matches the class.
* `quotes` collapses '"' and '\'' into a single mask; `spaces` does the
* same for ' ' and '\t'. */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the non-redundant valuable information here would be stating that these are bitmaps for finding the important characters within a given chunk.

__m256i lo = _mm256_loadu_si256((const __m256i *) chunk);
__m256i hi = _mm256_loadu_si256((const __m256i *) (chunk + SIMD_PARSER_CHUNK_BYTES / 2));

/* commas: single character */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*c

Move the declaration of comma next to this call instead.

self->commas = _combine_mm(_mm256_cmpeq_epi8(lo, comma),
_mm256_cmpeq_epi8(hi, comma));

/* quotes: '"' | '\'' - fuse the OR in vector domain, movemask once per lane */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*c

Move the declaration of (s)quote next to this call instead.


AVX2_FN
void
csv_simd_find_delim_quote_space(const char *start, gsize max_len, gchar delimiter, gchar quote_char,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dead code


AVX2_FN
gint
csv_simd_find_delimiter_or_quote(const char *start, gsize max_len, gchar delimiter, gchar quote_char)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dead code

if (result.offset >= 0)
nexthop = self->src + result.offset;
else
nexthop = self->src + remaining; /* no match: scan to end of input */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks compatibility on platforms where SIMD operations are not available.

return -1;
}

#else
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional:

I think it's possible to refactor the SIMD code's entry points so that we use only 3-4 short calls that either perform the optimization or signal an error, letting the scanner fall back to the slow path when calling into the SIMD methods from the CSV scanner side. That way, we would no longer need some many NOOP functions, and bugs like the one I found in the previous comment would be easier to catch.

For example:

  • Extracting the SIMD-related logic from _parse_value_with_whitespace_and_delimiter() into a separate function and moving it to csv-scanner-simd.c.
  • _scan_next_from_fast_path, _trim_fast_path_fields_whitespace, and _can_use_simd_fast_path should also go into csv-scanner-simd.c.

The only code remaining inside csv-scanner.c would be the check for SIMD availability and short one-line calls into the SIMD-optimized paths. Passing CSVScanner *self to these functions is fine. If we want to separate the generic SIMD code that doesn't operate on CSVScanner, we can add a separate simd.c unit, which can be built conditionally as a whole.


#ifdef SYSLOG_NG_MANUAL_NEON

#include ???
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#error "NEON instructions are currently not supported"

Or even better, we should get rid of all SYSLOG_NG_MANUAL_NEON branches in this PR so that we can compile and use the slow path until we have an actual working implementation.

A TODO comment should be enough.


/* SIMD fast path for simple unquoted values with default delimiter */
if (csv_is_simd_available() && !self->current_quote && !self->options->string_delimiters && !self->options->delimiters &&
self->options->dialect == CSV_SCANNER_ESCAPE_NONE && !(self->options->flags & CSV_SCANNER_STRIP_WHITESPACE))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be hidden inside a csv_scanner_can_use_simd_fast_path() or anything similar.

/* Fall through to handle quote */
self->current_quote = *self->src;
self->src++;
goto continuation;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_parse_unquoted_literal_characters() detects escaping on DEFAULT_DELIM_CHAR, which seems to be missing from here.


What happens with foo"bar,baz? (Klaudia found this one)
The old code didn't allow starting a quote in the middle of the value, the new one does, so it can mess things up.

gsize remaining = self->input_end - self->src;
CSVSimdFindResult result;
/* Find delimiter, double-quote, or single-quote in a single SIMD pass */
csv_simd_find_either3(self->src, remaining, DEFAULT_DELIM_CHAR, '"', '\'', &result);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quotes_start and quotes_end are configurable. Shouldn't we add this as a condition to running the fast path?

else if (result.offset == 0)
{
/* Delimiter at start */
self->src++;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We stop at quotes, not just delimiters. I think we shouldn't skip this character unconditionally.

_find_n(const char *start, gsize max_len, const gchar *chars, int n,
CSVSimdFindResult *result)
{
if (G_LIKELY(cpu_supports_simd))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the runtime SIMD detection should be done only once outside of this unit and we should fall back to the original CSV scanner implementation, not to this partial fallback (_scalar_search_impl).

To summarize it:
We shouldn't see multiple cpu_supports_simd and csv_is_simd_available() checks, we should disable the whole thing in the scanner.


/* Pre-allocate fast_path_fields for at least 16 columns to avoid early reallocations */
scanner->fast_path_fields = g_array_sized_new(FALSE, FALSE, sizeof(CSVFieldOfs), 16);
csv_detect_cpu_features();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: You started caching the CPU feature check, which is a good idea.
This means we could move the detection call into a csv_scanner_global_init() function, which you could call from app_startup(). That runs only once, csv_scanner_init() is called whenever we want to scan a new CSV string.


AVX2_FN
void
csv_simd_find_either(const char *start, gsize max_len, gchar c1, gchar c2, CSVSimdFindResult *result)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if max_len is gsize, then all offsets returned should also be gsize.

* offsets; subsequent scan_next calls dequeue from that array. Bails out
* to the generic path on cases the classifier's view of quoting can't
* express (unterminated quoted field, literal content after closing quote
* like "foo"bar). */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not true.

@OverOrion OverOrion marked this pull request as draft May 12, 2026 08:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants