Commit

Merge pull request #176 from ashvardanian/main-dev
Replacements & Better Documentation
ashvardanian authored Oct 13, 2024
2 parents 87fae70 + a83f948 commit 817de67
Showing 10 changed files with 594 additions and 132 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/prerelease.yml
@@ -20,7 +20,7 @@ permissions:
jobs:
versioning:
name: Update Version
runs-on: ubuntu-24
runs-on: ubuntu-24.04
steps:
- name: Checkout
uses: actions/checkout@v4
@@ -402,7 +402,7 @@ jobs:

test_alpine:
name: Alpine Linux
runs-on: ubuntu-24
runs-on: ubuntu-24.04
container:
image: alpine:latest
options: --privileged # If needed for certain Docker operations
@@ -451,7 +451,7 @@ jobs:
]
strategy:
matrix:
os: [ubuntu-24, macos-13, windows-2022]
os: [ubuntu-24.04, macos-13, windows-2022]
python-version: ["36", "37", "38", "39", "310", "311", "312"]
steps:
- uses: actions/checkout@v4
@@ -462,7 +462,7 @@ jobs:

# We only need QEMU for Linux builds
- name: Setup QEMU
if: matrix.os == 'ubuntu-24'
if: matrix.os == 'ubuntu-24.04'
uses: docker/setup-qemu-action@v3
- name: Install cibuildwheel
run: python -m pip install cibuildwheel
10 changes: 5 additions & 5 deletions .github/workflows/release.yml
@@ -19,7 +19,7 @@ permissions:
jobs:
versioning:
name: Update Version
runs-on: ubuntu-24
runs-on: ubuntu-24.04
steps:
- name: Checkout
uses: actions/checkout@v4
@@ -49,7 +49,7 @@ jobs:

rebase:
name: Rebase Dev. Branch
runs-on: ubuntu-24
runs-on: ubuntu-24.04
if: github.ref == 'refs/heads/main'
needs: versioning
steps:
@@ -78,7 +78,7 @@ jobs:
needs: versioning
strategy:
matrix:
os: [ubuntu-24, macos-13, windows-2022]
os: [ubuntu-24.04, macos-13, windows-2022]
python-version: ["36", "37", "38", "39", "310", "311", "312"]
steps:
- uses: actions/checkout@v4
@@ -90,7 +90,7 @@ jobs:
with:
python-version: 3.x
- name: Setup QEMU
if: matrix.os == 'ubuntu-24' # We only need QEMU for Linux builds
if: matrix.os == 'ubuntu-24.04' # We only need QEMU for Linux builds
uses: docker/setup-qemu-action@v3
- name: Install cibuildwheel
run: python -m pip install cibuildwheel
@@ -153,7 +153,7 @@ jobs:
# publish_javascript:
# name: Publish JavaScript
# needs: versioning
# runs-on: ubuntu-24
# runs-on: ubuntu-24.04
# steps:
# - uses: actions/checkout@v4
# with:
2 changes: 2 additions & 0 deletions .vscode/settings.json
@@ -55,7 +55,9 @@
"Hirschberg's",
"Horspool",
"Hyyro",
"illformed",
"initproc",
"inplace",
"intp",
"isprintable",
"itemsize",
53 changes: 53 additions & 0 deletions README.md
@@ -186,6 +186,28 @@ __Who is this for?__
<span style="color:#ABABAB;">arm:</span> <b>25.8</b> MB/s
</td>
</tr>
<!-- Mapping Characters with Look-Up Table Transforms -->
<tr>
<td colspan="4" align="center">Mapping Characters with Look-Up Table Transforms</td>
</tr>
<tr>
<td align="center">⚪</td>
<td align="center">
<code>transform</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>3.81</b> &centerdot;
<span style="color:#ABABAB;">arm:</span> <b>2.65</b> GB/s
</td>
<td align="center">
<code>str.translate</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>260.0</b> &centerdot;
<span style="color:#ABABAB;">arm:</span> <b>140.0</b> MB/s
</td>
<td align="center">
<code>sz_look_up_transform</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>21.2</b> &centerdot;
<span style="color:#ABABAB;">arm:</span> <b>8.5</b> GB/s
</td>
</tr>
<!-- Sorting -->
<tr>
<td colspan="4" align="center">Get sorted order, ≅ 8 million English words <sup>6</sup></td>
@@ -373,6 +395,25 @@ x: Strs = text.split_charset(separator='chars', maxsplit=sys.maxsize, keepsepara
x: Strs = text.rsplit_charset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
```

You can also transform the string using Look-Up Tables (LUTs), mapping it to a different character set.
By default (`inplace=False`), this returns a copy: `str` for `str` inputs and `bytes` for other types.

```py
x: str = text.translate('chars', {}, start=0, end=sys.maxsize, inplace=False)
x: bytes = text.translate(b'chars', {}, start=0, end=sys.maxsize, inplace=False)
```

For efficiency reasons, pass the LUT as a string or bytes object, not as a dictionary.
This can be useful in high-throughput applications dealing with binary data, including bioinformatics and image processing.
Here is an example:

```py
import stringzilla as sz
look_up_table = bytes(range(256)) # Identity LUT
image = open("/image/path.jpeg", "rb").read()
sz.translate(image, look_up_table, inplace=True)
```
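
To make the idea of a LUT concrete, here is a minimal sketch that uses only the Python standard library. The DNA-complement mapping and the helper name `build_complement_lut` are illustrative assumptions, not part of StringZilla. The built-in `bytes.translate` stands in for the library call, since it also takes a 256-byte table.

```py
# Standard-library sketch of a 256-byte look-up table (LUT).
# Assumption: the DNA-complement mapping below is purely illustrative.

def build_complement_lut() -> bytes:
    """Build a LUT that swaps A<->T and C<->G and leaves every other byte unchanged."""
    lut = bytearray(range(256))  # Start from the identity mapping
    for src, dst in zip(b"ACGTacgt", b"TGCAtgca"):
        lut[src] = dst  # Override only the bytes we want to remap
    return bytes(lut)

lut = build_complement_lut()
sequence = b"GATTACA"

# The built-in `bytes.translate` maps every input byte through the table.
complement = sequence.translate(lut)
print(complement)
```

Because the table is a plain 256-byte `bytes` object, it has the same shape as the identity `look_up_table` in the example above, so it could presumably be passed to `sz.translate` in the same way.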

### Collection-Level Operations

Once split into a `Strs` object, you can sort, shuffle, and reorganize the slices, with minimum memory footprint.
@@ -1024,6 +1065,18 @@ char uuid[36];
sz::randomize(sz::string_span(uuid, 36), "0123456789abcdef-"); // Overwrite any buffer
```

### Bulk Replacements

In text processing, it's often necessary to replace all occurrences of a specific substring or set of characters within a string.
Standard library functions may not offer the most efficient or convenient methods for performing bulk replacements, especially when dealing with large strings or performance-critical applications.

- `haystack.replace_all(needle_string, replacement_string)`
- `haystack.replace_all(sz::char_set(""), replacement_string)`
- `haystack.try_replace_all(needle_string, replacement_string)`
- `haystack.try_replace_all(sz::char_set(""), replacement_string)`
- `haystack.transform(sz::look_up_table::identity())`
- `haystack.transform(sz::look_up_table::identity(), haystack.data())`

### Levenshtein Edit Distance and Alignment Scores

Levenshtein and Hamming edit distance are provided for both byte-strings and UTF-8 strings.
14 changes: 14 additions & 0 deletions include/stringzilla/stringzilla.h
@@ -149,6 +149,17 @@
#endif // SZ_DYNAMIC_DISPATCH
#endif // SZ_DYNAMIC

/**
* @brief Alignment macro for 64-byte alignment.
*/
#if defined(_MSC_VER)
#define SZ_ALIGN64 __declspec(align(64))
#elif defined(__GNUC__) || defined(__clang__)
#define SZ_ALIGN64 __attribute__((aligned(64)))
#else
#define SZ_ALIGN64
#endif

#ifdef __cplusplus
extern "C" {
#endif
@@ -172,6 +183,9 @@ typedef ptrdiff_t sz_ssize_t; // Signed version of `sz_size_t`, 32 or 64 bits

#else // if SZ_AVOID_LIBC:

// ! The C standard doesn't specify the signedness of char.
// ! On x86 char is signed by default while on Arm it is unsigned by default.
// ! That's why we don't define `sz_char_t` and generally use explicit `sz_i8_t` and `sz_u8_t`.
typedef signed char sz_i8_t; // Always 8 bits
typedef unsigned char sz_u8_t; // Always 8 bits
typedef unsigned short sz_u16_t; // Always 16 bits
6 changes: 4 additions & 2 deletions include/stringzilla/stringzilla.hpp
@@ -1962,6 +1962,7 @@ class basic_string_slice {
* * `try_` exception-free "try" operations that return non-zero values on success,
* * `replace_all` and `erase_all` similar to Boost,
* * `edit_distance` - Levenshtein distance computation reusing the allocator,
* * `translate` - character mapping,
* * `randomize`, `random` - for fast random string generation.
*
* Functions defined for `basic_string_slice`, but not present in `basic_string`:
@@ -3413,7 +3414,8 @@ class basic_string {
}

/**
* @brief Maps all chatacters in the current string into another buffer using the provided lookup table.
* @brief Maps all characters in the current string into another buffer using the provided lookup table.
* @param output The buffer to write the transformed string into.
*/
void transform(look_up_table const &table, pointer output) const noexcept {
sz_ptr_t start;
@@ -3875,7 +3877,7 @@ void transform(basic_string_slice<char_type_> string, basic_look_up_table<char_t
}

/**
* @brief Maps all chatacters in the current string into another buffer using the provided lookup table.
* @brief Maps all characters in the current string into another buffer using the provided lookup table.
*/
template <typename char_type_>
void transform(basic_string_slice<char_type_ const> source, basic_look_up_table<char_type_> const &table,