Big update #23

emilmelnikov · 2025-04-28T08:49:11Z

- Almost all loads and stores are now aligned.
- The inner loop of `conv` functions now works in 2 steps: unroll up to the specified factor first, then finish with unroll factor 1. This is similar to how LLVM unrolls loops.
- Buffers now need to be padded only up to a multiple of the vector size. This simplifies buffer management.
- `fastfilters2::simd` now communicates via separately defined structs.
- Split ufuncs into separate functions.
- Add some comments that try to explain what is going on here.
- During module import, select the fastest implementation (autotune).
- Remove kernel coefficients tests: it is enough to check that `gaussian_smoothing` is identical; can probably add C++ tests later.
- Revert to testing on randomly sampled data.
- Add some tests for valid and invalid filter inputs.
- Improve CMakeLists.
- Add .clangd for IDEs/tools that support it (without this config clangd couldn't properly index Highway code).
- Move to FetchContent instead of git submodules for managing deps.
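
The two-step inner loop and the relaxed padding rule can be sketched roughly as follows. This is only an illustration in Python, not the actual C++ code: `padded_len`, `plan_steps`, and the lane/unroll values are hypothetical names and numbers chosen for the example.

```python
def padded_len(n: int, lanes: int) -> int:
    """Round n up to the next multiple of the vector size (in lanes)."""
    return (n + lanes - 1) // lanes * lanes


def plan_steps(n: int, lanes: int, unroll: int) -> list[tuple[int, int]]:
    """Return (start, width) pairs that cover a buffer padded to a multiple of `lanes`."""
    end = padded_len(n, lanes)
    wide = lanes * unroll
    steps = []
    i = 0
    # Step 1: as many iterations at the requested unroll factor as fit.
    while i + wide <= end:
        steps.append((i, wide))
        i += wide
    # Step 2: finish with unroll factor 1, i.e. one vector per iteration.
    while i < end:
        steps.append((i, lanes))
        i += lanes
    return steps


# 70 elements, 8 lanes per vector, unroll factor 4:
print(plan_steps(n=70, lanes=8, unroll=4))  # [(0, 32), (32, 32), (64, 8)]
```

Because the remainder step works on whole vectors, the buffer only has to be padded to a multiple of `lanes`, not to a multiple of `lanes * unroll`.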



Changes:

- If there are no files to upload, it is most certainly an error
- Disable compression because sdists and wheels are archives themselves
- Allow overwriting artifacts with the same name; this should probably be changed later if `setuptools-scm` or a similar tool updates the artifact version


This causes slowdowns on x86, and hardware exceptions on ARM. It is possible to copy unaligned data to a temporary aligned buffer, but unaligned arrays are quite rare, and it is a good practice to keep arrays aligned to their dtypes anyway.

k-dominik · 2025-04-29T07:58:22Z

AMAZING! can't wait to check it out!!


It is practically impossible to obtain results that are precisely equal to the old implementation without losing performance and implementation flexibility. Therefore, some numerical differences between the implementations are allowed.

Previously, we compared maximum absolute differences between filter outputs, using the full uint8 dynamic range (mapped to float32). The reasoning was to be as close to real-world usage as possible. This resulted in significant differences between the implementations.

However, the old implementation used much laxer tests to verify its outputs w.r.t. the Vigra library. First, it used values in the [0; 1) range, where precision is much higher. Second, it used mean absolute difference (*not* max) for 3D multi-channel filters, where numerical differences are especially pronounced.

Now we use the same strategy for compatibility testing. The tests are a bit different: we test more values of sigma (scale), but skip testing the structure tensor with a different derivative scale (because it is almost always derived from the main outer scale). Finally, the script `resultcompare.py` has been updated to match the test code.
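
To illustrate why the choice of metric and value range matters, here is a minimal sketch in plain Python. The sample values and helper names are made up for this example and are not taken from the test suite or from `resultcompare.py`.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def mean_abs_diff(a, b):
    """Average element-wise absolute difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical outputs of two implementations; only one element differs.
old = [0.25, 0.5, 0.75, 0.5]
new = [0.25, 0.5, 0.75, 0.625]

print(max_abs_diff(old, new))   # 0.125: dominated by the single outlier
print(mean_abs_diff(old, new))  # 0.03125: the outlier is averaged out

# The same data stretched to the uint8 range: absolute differences grow by 255x.
old255 = [255 * x for x in old]
new255 = [255 * x for x in new]
print(max_abs_diff(old255, new255))  # 31.875
```

A fixed absolute tolerance is therefore much stricter when combined with the max metric and the [0; 255] range than with the mean metric and the [0; 1) range.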


cibuildwheel changes:

- Don't build wheels for the musl libc on Linux: they double the build time, and probably won't be used (we can always enable them again later)
- Try `delvewheel` on Windows

GitHub Actions changes:

- Split a single workflow into `build` and `release`: this is cleaner than a single workflow that usually skips the upload step
- Add Linux ARM and macOS Intel runners
- Use trusted publishing instead of username/password or token on PyPI: https://docs.pypi.org/trusted-publishers/
- Fix artifact uploading by giving them unique names

emilmelnikov · 2025-04-29T18:22:05Z

> AMAZING! can't wait to check it out!!

Thanks! :)

With the latest batch of changes, and after fixing possible CI issues, this PR should be ready.

I've tried to plug this into ilastik as `import fastfilters2.compat as fastfilters` in `lazyflow/operators/opPixelFeaturesPresmoothed.py` and `lazyflow/operators/filterOperators.py`, and it seems to work fine.

I tested it on a random 1024^3 uint8 dataset, and the speedup is not really that great, only 1.19x. This was measured in a somewhat hacky way by doing this (only the latest `cumulative_time` values are compared):

```python
import time


class OpPixelFeaturesPresmoothed(Operator):
    def __init__(self, *args, **kwargs):
        ...
        # Total time spent computing the Output slot, in seconds.
        self._cumulative_time = 0

    def execute(self, slot, subindex, slot_roi, target):
        if slot == self.Features:
            ...
        elif slot == self.Output:
            t0 = time.perf_counter()
            ...
            t1 = time.perf_counter()
            self._cumulative_time += t1 - t0
            print(f"OpPixelFeaturesPresmoothed: cumulative_time = {self._cumulative_time}")
```

This is because the main speedups came from vectorizing the border handling code in v1, but with presmoothing most borders are only 1-2 pixels long. I think it's still possible to do better by fusing some passes together (it's easier to do if the border size is statically known), but maybe later :)

However, I suspect that we can score some easy wins by modifying block shapes. I've monitored the exact block shapes that are passed into the library, and they usually have ~32 pixels in one dim and ~200 pixels in the other dims, which is not really optimal. I'm not sure where exactly these block shapes are defined; I tried to update `OpFeatureSelection.setupOutputs`, but the performance tanked. This is probably due to a shape mismatch in the block cache, or something like that?..

- Fetch the entire history during checkout. This is necessary for `setuptools-scm`. It may be possible to fetch only the most recent tag, and all commits from this tag up to the current commit, but this requires more testing.
- Build sdist in the build workflow, instead of the release workflow: it is good to test sdist packaging too, and it is cheap compared to building multiple wheels.
- Add a minimum deployment target override for macOS on ARM in order to suppress CI warnings

- setuptools-scm is convenient, but behaves in non-intuitive ways. Moreover, publishing to conda-forge requires specifying a static version in the recipe anyway. In the end, it is probably easier to write a dev script that patches version strings in a couple of places and triggers a new build.
- Need to "append" to a TOML table in cibuildwheel override section

emilmelnikov · 2025-04-30T14:16:53Z

@k-dominik PyPI introduced a new way to publish packages: trusted publishing. It is better because you don't have to copy-paste tokens or credentials, but it needs some configuration on the part of the project owner org on PyPI.

You need to add a trusted publisher in the publishing tab.

Owner: ilastik
Repository name: fastfilters2
Workflow name: release.yaml
Environment name: pypi

Not 100% sure, but I think this is all you have to do.

emilmelnikov added 9 commits March 21, 2023 11:23

Update Highway to 1.0.3

a7975d4

Add Google Test dependency

9ac62fe

Large refactoring, add support for 3D filters

45e384a

Skip redundant computations

c373002

When computing certain filters, results of intermediate separable convolutions stored in temporary buffers could be reused.
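
The idea can be sketched in a few lines of Python. This is not the library's code: the 3-tap kernels, the clamped border handling, and all function names below are assumptions made for the example. Several outputs that share the same pass along x reuse its cached result instead of recomputing it.

```python
from functools import lru_cache

# Hypothetical 3-tap kernels, indexed by derivative order.
KERNELS = {0: (0.25, 0.5, 0.25), 1: (-0.5, 0.0, 0.5), 2: (1.0, -2.0, 1.0)}


def correlate_rows(img, k):
    """Correlate every row with a 3-tap kernel, clamping indices at the border."""
    w = len(img[0])
    return tuple(
        tuple(sum(k[j] * row[min(max(x + j - 1, 0), w - 1)] for j in range(3))
              for x in range(w))
        for row in img
    )


def transpose(img):
    return tuple(zip(*img))


def make_filter_bank(img):
    passes = {"count": 0}

    @lru_cache(maxsize=None)
    def x_pass(order_x):
        # Intermediate result: computed once per derivative order along x.
        passes["count"] += 1
        return correlate_rows(img, KERNELS[order_x])

    def apply(order_y, order_x):
        tmp = x_pass(order_x)  # reused when order_x was already requested
        passes["count"] += 1
        return transpose(correlate_rows(transpose(tmp), KERNELS[order_y]))

    return apply, passes


img = ((0, 1, 2, 3), (1, 2, 3, 4), (2, 3, 4, 5))
apply, passes = make_filter_bank(img)
outputs = {(oy, 0): apply(oy, 0) for oy in (0, 1, 2)}
print(passes["count"])  # 4 passes instead of 6
```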

Add a compatibility test for small inputs

402f390

This test is intended to catch padding issues in internal buffers.

Use CMake FetchContent to pull C++ dependencies

7b2d28a

Set the target directory of FetchContent to a special directory in the project. This preserves downloaded dependencies if the build directory is gone.

Build wheels only for CPython

73b2e54

Previously, we *excluded* PyPy builds, now we *include* CPython builds.

Update actions and runners versions in CI

cdcef35

emilmelnikov force-pushed the refactoring-2 branch from c22e2ce to cdcef35 Compare April 28, 2025 09:14

emilmelnikov added 2 commits April 28, 2025 11:23

Add missing includes

dbba1eb

On some platforms, these headers might have been transitively included by other headers.

Specify test project dir in cibuildwheel config

028e85d

emilmelnikov force-pushed the refactoring-2 branch from 970ff39 to 028e85d Compare April 28, 2025 10:04

emilmelnikov added 5 commits April 28, 2025 12:36

Split tests into the main part and compat part

b5c9b35

This is necessary because `pytest.importorskip` skips the entire test file, and it's more readable. Also move parametrized test name generation into a hook in `conftest.py`.

Revert to a previous Highway commit

a2d4995

The current workaround for the LLVM inliner breaks MSVC. See google/highway@c02c0e3

Always compile the _internal module with PIC

99fcf12

Position-independent code (PIC) means that all code addresses are relative, which is required for a linkable static or shared library.

Make some narrowing conversions explicit

4641bc5

Converting double to float may lose precision. In our case this is not a problem, but making the conversions explicit makes MSVC with /W4 happier. Also fix a critical bug where `convolve_axis` did not return anything. Why did this even work in the first place?..

Add dependency on ext module for stubgen target

14bb296

emilmelnikov force-pushed the refactoring-2 branch from 96a91f3 to 14bb296 Compare April 28, 2025 12:37

emilmelnikov added 4 commits April 28, 2025 15:42

Temporarily disable stubgen

de7b3ed

Stubgen errors out on windows-2022.

Replace memcpy/memset with Highway alternatives

cc18d0b

`hwy::CopyBytes` and `hwy::ZeroBytes` use intrinsics on GCC/Clang.

Align stack-allocated arrays with HWY_ALIGN_MAX

bce2139

On some non-x86 platforms the minimum alignment can be smaller than the vector size.

This was referenced Apr 28, 2025

Add 3D filters #22

Closed

Bump pypa/cibuildwheel from 2.12.1 to 2.13.1 #21

Closed

Bump pypa/gh-action-pypi-publish from 1.8.1 to 1.8.6 #19

Closed

emilmelnikov added 3 commits April 28, 2025 21:03

Link SIMD static lib with hwy_contrib

b310910

`hwy_contrib` contains VQSort which is used internally by auto-tuner. Previously, it was erroneously linked with the nanobind module.

Make all loads in contiguous conv pass unaligned

881ad41

This is important because the address of `row_buf + radius` is not generally aligned to the vector size. Note that the store can be aligned in both contiguous and strided cases.
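
A quick way to see the alignment arithmetic, assuming 32-byte vectors and float32 elements (these sizes and the base address are hypothetical, chosen only for the example):

```python
VECTOR_BYTES = 32  # assumed vector size (e.g. 256-bit vectors)
ITEM_BYTES = 4     # float32


def is_aligned(addr: int, alignment: int = VECTOR_BYTES) -> bool:
    return addr % alignment == 0


row_buf = 0x1000  # hypothetical base address, aligned to the vector size
for radius in (1, 3, 8, 16):
    addr = row_buf + radius * ITEM_BYTES
    print(radius, is_aligned(addr))
# 1 False
# 3 False
# 8 True
# 16 True
```

With these numbers, `row_buf + radius` is aligned only when `radius` happens to be a multiple of the lane count (8), so the loads cannot assume alignment in general.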

emilmelnikov added 5 commits April 29, 2025 11:17

Add support for ASan and UBSan

27588b7

In order to work well, ASan and UBSan need both compile-time and runtime adjustments.

Improve and cleanup some comments

d4024f1

Also, use `hwy::VectorBytes()` to obtain the number of lanes instead of a custom function.

Improve custom parametrized test name formatting

2f28598

Now name checks that trigger custom formatting are more relaxed. This should cover cases when the name should be more precise (e.g. `src_shape`).
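
For reference, pytest lets `conftest.py` customize parametrized test IDs through the `pytest_make_parametrize_id` hook. The sketch below shows one way such a relaxed name check might look; the matching rule (any argument name containing `shape`) and the output format are assumptions, not the project's actual code.

```python
def pytest_make_parametrize_id(config, val, argname):
    """Return a readable ID for shape-like parameters, or None for pytest's default."""
    # Relaxed check: matches "shape", "src_shape", "dst_shape", ...
    if "shape" in argname and isinstance(val, tuple):
        return f"{argname}=" + "x".join(str(n) for n in val)
    return None


# The hook can be exercised directly, without running pytest:
print(pytest_make_parametrize_id(None, (64, 32), "src_shape"))  # src_shape=64x32
print(pytest_make_parametrize_id(None, 1.5, "sigma"))           # None
```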

emilmelnikov added 3 commits April 30, 2025 12:58

Add dynamic versioning via setuptools-scm

b18f40f

Fix called workflow path in release.yaml

2cd9be1

emilmelnikov had a problem deploying to pypi April 30, 2025 13:07 — with GitHub Actions Failure
