README.md: 0 additions & 61 deletions
@@ -3,67 +3,6 @@
simdcsv is a CSV parser that evaluates 64 bytes at a time. There are many kinds of CSV files; this project adheres to the format described
in [RFC 4180](https://www.rfc-editor.org/rfc/rfc4180.html).
**Introduction**
Every character in a CSV file falls into one of four classes: COMMA, QUOTATION, NEW_LINE, or OTHER. We can build a perfect lookup table and use the ARM NEON intrinsic `vqtbl1q_u8` to classify 16 characters at once. Daniel Lemire calls this "vectorized classification" in the simdjson paper. [[code pointer]](https://github.com/friendlymatthew/simdcsv/blob/main/src/classifier.rs)
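
To make the lookup-table idea concrete, here is a minimal scalar sketch of the same look-up-then-compare step, one byte at a time. It is not the code in `src/classifier.rs`: the names `Class`, `TABLE`, `classify`, and `classify_chunk` are hypothetical, the table layout is an assumption, and NEW_LINE is assumed to mean `\n` only. A NEON version would presumably do the lookup for 16 lanes with `vqtbl1q_u8` and then compare the result against the input.

```rust
/// The four character classes named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Class {
    Comma,
    Quotation,
    NewLine,
    Other,
}

const X: u8 = 0xFF; // filler for unused slots

/// Hypothetical 16-entry table indexed by the low nibble of a byte.
/// '"' = 0x22, '\n' = 0x0A and ',' = 0x2C have distinct low nibbles
/// (0x2, 0xA, 0xC), so each one gets its own slot.
const TABLE: [u8; 16] = [
    X, X, b'"', X, X, X, X, X, X, X, b'\n', X, b',', X, X, X,
];

fn classify(byte: u8) -> Class {
    // Look up by low nibble, then compare the table entry with the input byte.
    if TABLE[(byte & 0x0F) as usize] != byte {
        return Class::Other;
    }
    match byte {
        b',' => Class::Comma,
        b'"' => Class::Quotation,
        b'\n' => Class::NewLine,
        _ => Class::Other, // 0xFF equals the filler in slot 0xF; it is not structural
    }
}

/// Classify 16 bytes, the width of one NEON register.
fn classify_chunk(chunk: &[u8; 16]) -> [Class; 16] {
    let mut out = [Class::Other; 16];
    for (slot, &byte) in out.iter_mut().zip(chunk.iter()) {
        *slot = classify(byte);
    }
    out
}

fn main() {
    let classes = classify_chunk(b"a,\"b\"\nccccccccc,");
    println!("{:?}", classes);
}
```

The compare step is what makes the table "perfect": many bytes share a low nibble with `,` (for example `0x1C`), but only the byte stored in the slot passes the equality check.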
Once every character is classified, we can build a bitset for each class. We chunk through 64 characters at a time, building one `u64` per class for every chunk. Here is a naive case:
Now we can [count the number of leading zeros](https://doc.rust-lang.org/std/primitive.u64.html#method.leading_zeros) in the comma bitset to find each comma and pull out the CSV entries.
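
As a rough illustration of the bitset and leading-zeros steps, the sketch below builds a comma bitset for one chunk and walks it with `leading_zeros`. It rests on an assumption that is not stated in the text: byte 0 of the chunk maps to the most significant bit, so that `leading_zeros` returns the index of the first comma. The helper names `bitset` and `split_fields` are hypothetical, and quoting is ignored here.

```rust
/// Build a bitset marking every occurrence of `target` in a chunk of up to 64 bytes.
/// Assumption: byte 0 maps to the most significant bit of the `u64`.
fn bitset(chunk: &[u8], target: u8) -> u64 {
    assert!(chunk.len() <= 64);
    let mut bits = 0u64;
    for (i, &b) in chunk.iter().enumerate() {
        if b == target {
            bits |= 1u64 << (63 - i);
        }
    }
    bits
}

/// Split an unquoted chunk on commas by repeatedly taking `leading_zeros`.
fn split_fields(chunk: &[u8]) -> Vec<&[u8]> {
    let mut commas = bitset(chunk, b',');
    let mut fields = Vec::new();
    let mut start = 0usize;
    while commas != 0 {
        let pos = commas.leading_zeros() as usize; // index of the next comma
        fields.push(&chunk[start..pos]);
        start = pos + 1;
        commas &= !(1u64 << (63 - pos)); // clear the bit we just consumed
    }
    fields.push(&chunk[start..]); // whatever follows the last comma
    fields
}

fn main() {
    for field in split_fields(b"aaa,bbb,ccc") {
        println!("{}", String::from_utf8_lossy(field));
    }
}
```

With the opposite bit order (byte 0 in the least significant bit), the same loop would use `trailing_zeros` instead.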
A bitset is powerful when we want to check whether a symbol exists, count the number of symbols, or remove escaped symbols.
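
Each of those three operations is a single integer instruction on a `u64`. A small sketch, with hypothetical helper names (`has_any`, `count`, `without`) and made-up bit patterns:

```rust
/// Is any symbol of this class present in the chunk?
fn has_any(bits: u64) -> bool {
    bits != 0
}

/// How many symbols of this class are in the chunk?
fn count(bits: u64) -> u32 {
    bits.count_ones()
}

/// Drop the symbols flagged in `escaped` from `bits`.
fn without(bits: u64, escaped: u64) -> u64 {
    bits & !escaped
}

fn main() {
    // Made-up bitsets for illustration only.
    let quotes: u64 = 0b1011_0001;
    let escaped: u64 = 0b0011_0000;

    assert!(has_any(quotes));
    assert_eq!(count(quotes), 4);
    assert_eq!(without(quotes, escaped), 0b1000_0001);
    println!("remaining quotes: {:#010b}", without(quotes, escaped));
}
```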
**Detecting Escaped Quotations and Commas**
Consider the CSV row: `"aaa,norm","b""bb","ccc"`
In CSV, quotes are escaped by doubling them (`""`). The `""` in `b""bb` is field content, not a structural delimiter. We detect escaped pairs by finding adjacent quotes:
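
One way to find adjacent quotes is to AND the quote bitset with a copy of itself shifted by one position. The sketch below is a naive version under stated assumptions: byte 0 maps to the most significant bit, the helper names (`bitset`, `escaped_quotes`, `positions`) are hypothetical, and it does not tell an escaped pair apart from an empty quoted field (`""`) or handle runs of three or more quotes.

```rust
/// Build a bitset marking every occurrence of `target` in a chunk of up to 64 bytes.
/// Assumption: byte 0 maps to the most significant bit of the `u64`.
fn bitset(chunk: &[u8], target: u8) -> u64 {
    assert!(chunk.len() <= 64);
    let mut bits = 0u64;
    for (i, &b) in chunk.iter().enumerate() {
        if b == target {
            bits |= 1u64 << (63 - i);
        }
    }
    bits
}

/// Mask covering both quotes of every adjacent `""` pair.
fn escaped_quotes(quotes: u64) -> u64 {
    // Shifting left moves the bit for index i+1 onto index i, so the AND
    // keeps index i only when both i and i+1 are quotes.
    let first = quotes & (quotes << 1);
    // Add the second quote of each pair.
    first | (first >> 1)
}

/// Byte indices of the set bits, in order.
fn positions(mut bits: u64) -> Vec<usize> {
    let mut out = Vec::new();
    while bits != 0 {
        let pos = bits.leading_zeros() as usize;
        out.push(pos);
        bits &= !(1u64 << (63 - pos));
    }
    out
}

fn main() {
    let row = br#""aaa,norm","b""bb","ccc""#;
    let quotes = bitset(row, b'"');
    let escaped = escaped_quotes(quotes);
    println!("escaped quotes at    {:?}", positions(escaped));
    println!("structural quotes at {:?}", positions(quotes & !escaped));
}
```

For the row above, the only adjacent pair is the `""` inside `b""bb`; masking it out of the quote bitset leaves just the quotes that open and close fields.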