Commit 817de67

Merge pull request #176 from ashvardanian/main-dev
Replacements & Better Documentation
2 parents 87fae70 + a83f948 commit 817de67

10 files changed: +594 -132 lines changed

.github/workflows/prerelease.yml

Lines changed: 4 additions & 4 deletions
@@ -20,7 +20,7 @@ permissions:
 jobs:
   versioning:
     name: Update Version
-    runs-on: ubuntu-24
+    runs-on: ubuntu-24.04
     steps:
       - name: Checkout
         uses: actions/checkout@v4
@@ -402,7 +402,7 @@ jobs:
 
   test_alpine:
     name: Alpine Linux
-    runs-on: ubuntu-24
+    runs-on: ubuntu-24.04
     container:
       image: alpine:latest
       options: --privileged # If needed for certain Docker operations
@@ -451,7 +451,7 @@ jobs:
     ]
     strategy:
       matrix:
-        os: [ubuntu-24, macos-13, windows-2022]
+        os: [ubuntu-24.04, macos-13, windows-2022]
         python-version: ["36", "37", "38", "39", "310", "311", "312"]
     steps:
       - uses: actions/checkout@v4
@@ -462,7 +462,7 @@ jobs:
 
       # We only need QEMU for Linux builds
       - name: Setup QEMU
-        if: matrix.os == 'ubuntu-24'
+        if: matrix.os == 'ubuntu-24.04'
         uses: docker/setup-qemu-action@v3
       - name: Install cibuildwheel
         run: python -m pip install cibuildwheel

.github/workflows/release.yml

Lines changed: 5 additions & 5 deletions
@@ -19,7 +19,7 @@ permissions:
 jobs:
   versioning:
     name: Update Version
-    runs-on: ubuntu-24
+    runs-on: ubuntu-24.04
     steps:
       - name: Checkout
         uses: actions/checkout@v4
@@ -49,7 +49,7 @@ jobs:
 
   rebase:
     name: Rebase Dev. Branch
-    runs-on: ubuntu-24
+    runs-on: ubuntu-24.04
     if: github.ref == 'refs/heads/main'
     needs: versioning
     steps:
@@ -78,7 +78,7 @@ jobs:
     needs: versioning
     strategy:
       matrix:
-        os: [ubuntu-24, macos-13, windows-2022]
+        os: [ubuntu-24.04, macos-13, windows-2022]
         python-version: ["36", "37", "38", "39", "310", "311", "312"]
     steps:
       - uses: actions/checkout@v4
@@ -90,7 +90,7 @@ jobs:
         with:
           python-version: 3.x
       - name: Setup QEMU
-        if: matrix.os == 'ubuntu-24' # We only need QEMU for Linux builds
+        if: matrix.os == 'ubuntu-24.04' # We only need QEMU for Linux builds
         uses: docker/setup-qemu-action@v3
       - name: Install cibuildwheel
         run: python -m pip install cibuildwheel
@@ -153,7 +153,7 @@ jobs:
   # publish_javascript:
   # name: Publish JavaScript
   # needs: versioning
-  # runs-on: ubuntu-24
+  # runs-on: ubuntu-24.04
   # steps:
   # - uses: actions/checkout@v4
   # with:

.vscode/settings.json

Lines changed: 2 additions & 0 deletions
@@ -55,7 +55,9 @@
     "Hirschberg's",
     "Horspool",
     "Hyyro",
+    "illformed",
     "initproc",
+    "inplace",
     "intp",
     "isprintable",
     "itemsize",

README.md

Lines changed: 53 additions & 0 deletions
@@ -186,6 +186,28 @@ __Who is this for?__
       <span style="color:#ABABAB;">arm:</span> <b>25.8</b> MB/s
     </td>
   </tr>
+  <!-- Mapping Characters with Look-Up Table Transforms -->
+  <tr>
+    <td colspan="4" align="center">Mapping Characters with Look-Up Table Transforms</td>
+  </tr>
+  <tr>
+    <td align="center">⚪</td>
+    <td align="center">
+      <code>transform</code><br/>
+      <span style="color:#ABABAB;">x86:</span> <b>3.81</b> &centerdot;
+      <span style="color:#ABABAB;">arm:</span> <b>2.65</b> GB/s
+    </td>
+    <td align="center">
+      <code>str.translate</code><br/>
+      <span style="color:#ABABAB;">x86:</span> <b>260.0</b> &centerdot;
+      <span style="color:#ABABAB;">arm:</span> <b>140.0</b> MB/s
+    </td>
+    <td align="center">
+      <code>sz_look_up_transform</code><br/>
+      <span style="color:#ABABAB;">x86:</span> <b>21.2</b> &centerdot;
+      <span style="color:#ABABAB;">arm:</span> <b>8.5</b> GB/s
+    </td>
+  </tr>
   <!-- Sorting -->
   <tr>
     <td colspan="4" align="center">Get sorted order, ≅ 8 million English words <sup>6</sup></td>
@@ -373,6 +395,25 @@ x: Strs = text.split_charset(separator='chars', maxsplit=sys.maxsize, keepsepara
 x: Strs = text.rsplit_charset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
 ```
 
+You can also transform the string using Look-Up Tables (LUTs), mapping it to a different character set.
+This would result in a copy - `str` for `str` inputs and `bytes` for other types.
+
+```py
+x: str = text.translate('chars', {}, start=0, end=sys.maxsize, inplace=False)
+x: bytes = text.translate(b'chars', {}, start=0, end=sys.maxsize, inplace=False)
+```
+
+For efficiency reasons, pass the LUT as a string or bytes object, not as a dictionary.
+This can be useful in high-throughput applications dealing with binary data, including bioinformatics and image processing.
+Here is an example:
+
+```py
+import stringzilla as sz
+look_up_table = bytes(range(256)) # Identity LUT
+image = open("/image/path.jpeg", "rb").read()
+sz.translate(image, look_up_table, inplace=True)
+```
+
 ### Collection-Level Operations
 
 Once split into a `Strs` object, you can sort, shuffle, and reorganize the slices, with minimum memory footprint.
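
Reviewer note on the hunk above: the README text recommends passing the look-up table as a 256-entry `bytes` object. The sketch below shows what such a table does, using only the Python standard library (`bytes.translate`) as a stand-in; it does not call StringZilla, and the DNA-complement mapping is an illustrative assumption rather than something taken from the commit.

```python
# Start from the identity table, as in the README's `bytes(range(256))`,
# then override only the byte values we want to remap.
lut = bytearray(range(256))
for src, dst in zip(b"ACGT", b"TGCA"):
    lut[src] = dst  # e.g. ord("A") -> ord("T")
lut = bytes(lut)

# Every input byte is replaced by lut[byte]; unmapped bytes stay unchanged.
dna = b"GATTACA"
complement = dna.translate(lut)
print(complement)
```

The same table can be produced with `bytes.maketrans(b"ACGT", b"TGCA")`; building it by hand just makes the "index by byte value" idea explicit.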
@@ -1024,6 +1065,18 @@ char uuid[36];
 sz::randomize(sz::string_span(uuid, 36), "0123456789abcdef-"); // Overwrite any buffer
 ```
 
+### Bulk Replacements
+
+In text processing, it's often necessary to replace all occurrences of a specific substring or set of characters within a string.
+Standard library functions may not offer the most efficient or convenient methods for performing bulk replacements, especially when dealing with large strings or performance-critical applications.
+
+- `haystack.replace_all(needle_string, replacement_string)`
+- `haystack.replace_all(sz::char_set(""), replacement_string)`
+- `haystack.try_replace_all(needle_string, replacement_string)`
+- `haystack.try_replace_all(sz::char_set(""), replacement_string)`
+- `haystack.transform(sz::look_up_table::identity())`
+- `haystack.transform(sz::look_up_table::identity(), haystack.data())`
+
 ### Levenshtein Edit Distance and Alignment Scores
 
 Levenshtein and Hamming edit distance are provided for both byte-strings and UTF-8 strings.
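
Reviewer note on the "Bulk Replacements" hunk above: the list distinguishes replacing a substring from replacing any member of a character set. The Python sketch below only illustrates that semantic difference; the two helper functions are hypothetical names made up for this note and are not the C++ `replace_all` overloads from the commit.

```python
def replace_all_substring(haystack: str, needle: str, replacement: str) -> str:
    """Replace every occurrence of the whole `needle` substring."""
    return haystack.replace(needle, replacement)


def replace_all_charset(haystack: str, charset: str, replacement: str) -> str:
    """Replace every single character that belongs to `charset`."""
    members = set(charset)
    return "".join(replacement if ch in members else ch for ch in haystack)


# Substring form: only the exact two-character sequence "--" matches.
print(replace_all_substring("a--b--c", "--", "+"))
# Character-set form: each "-" or "_" is replaced individually.
print(replace_all_charset("a-b_c", "-_", " "))
```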

include/stringzilla/stringzilla.h

Lines changed: 14 additions & 0 deletions
@@ -149,6 +149,17 @@
 #endif // SZ_DYNAMIC_DISPATCH
 #endif // SZ_DYNAMIC
 
+/**
+ * @brief Alignment macro for 64-byte alignment.
+ */
+#if defined(_MSC_VER)
+#define SZ_ALIGN64 __declspec(align(64))
+#elif defined(__GNUC__) || defined(__clang__)
+#define SZ_ALIGN64 __attribute__((aligned(64)))
+#else
+#define SZ_ALIGN64
+#endif
+
 #ifdef __cplusplus
 extern "C" {
 #endif
@@ -172,6 +183,9 @@ typedef ptrdiff_t sz_ssize_t; // Signed version of `sz_size_t`, 32 or 64 bits
 
 #else // if SZ_AVOID_LIBC:
 
+// ! The C standard doesn't specify the signedness of char.
+// ! On x86 char is signed by default while on Arm it is unsigned by default.
+// ! That's why we don't define `sz_char_t` and generally use explicit `sz_i8_t` and `sz_u8_t`.
 typedef signed char sz_i8_t; // Always 8 bits
 typedef unsigned char sz_u8_t; // Always 8 bits
 typedef unsigned short sz_u16_t; // Always 16 bits
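
Reviewer note on the signedness comment added above: one place where this plausibly matters is indexing a 256-entry look-up table by character value. The Python sketch below (an illustration, not code from the commit) decodes the same byte as a signed and as an unsigned 8-bit integer with the standard `struct` module, which is the distinction `sz_i8_t` and `sz_u8_t` make explicit.

```python
import struct

raw = b"\xff"  # a single byte with all bits set

# "b" decodes a signed 8-bit integer, "B" an unsigned one.
as_signed = struct.unpack("b", raw)[0]
as_unsigned = struct.unpack("B", raw)[0]
print(as_signed, as_unsigned)

# Only the unsigned interpretation is a valid index into a 256-entry table.
print(as_signed in range(256), as_unsigned in range(256))
```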

include/stringzilla/stringzilla.hpp

Lines changed: 4 additions & 2 deletions
@@ -1962,6 +1962,7 @@ class basic_string_slice {
  * * `try_` exception-free "try" operations that returning non-zero values on success,
  * * `replace_all` and `erase_all` similar to Boost,
  * * `edit_distance` - Levenshtein distance computation reusing the allocator,
+ * * `translate` - character mapping,
  * * `randomize`, `random` - for fast random string generation.
  *
  * Functions defined for `basic_string_slice`, but not present in `basic_string`:
@@ -3413,7 +3414,8 @@ class basic_string {
     }
 
     /**
-     * @brief Maps all chatacters in the current string into another buffer using the provided lookup table.
+     * @brief Maps all characters in the current string into another buffer using the provided lookup table.
+     * @param output The buffer to write the transformed string into.
      */
     void transform(look_up_table const &table, pointer output) const noexcept {
         sz_ptr_t start;
@@ -3875,7 +3877,7 @@ void transform(basic_string_slice<char_type_> string, basic_look_up_table<char_t
 }
 
 /**
- * @brief Maps all chatacters in the current string into another buffer using the provided lookup table.
+ * @brief Maps all characters in the current string into another buffer using the provided lookup table.
  */
 template <typename char_type_>
 void transform(basic_string_slice<char_type_ const> source, basic_look_up_table<char_type_> const &table,
