# Big update #23

Open · wants to merge 32 commits into `main`

## Conversation

**emilmelnikov** (Member)

- Almost all loads and stores are now aligned.
- The inner loop of `conv` functions now works in two steps: unroll up to
  the specified factor first, then finish with an unroll factor of 1.
  This is similar to how LLVM unrolls loops.
- Buffers now need to be padded only up to a multiple of the vector
  size. This simplifies buffer management.
- `fastfilters2::simd` now communicates via separately defined structs.
- Split ufuncs into separate functions.
- Add some comments that try to explain what is going on.
- During module import, select the fastest implementation (autotune).
- Remove the kernel coefficient tests: it is enough to check that
  `gaussian_smoothing` is identical; C++ tests can probably be added later.
- Revert to testing on randomly sampled data.
- Add some tests for valid and invalid filter inputs.
- Improve CMakeLists.
- Add `.clangd` for IDEs/tools that support it (without this config,
  clangd couldn't properly index Highway code).
- Move to FetchContent instead of git submodules for managing dependencies.
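The two-step inner loop can be sketched as follows. This is a minimal Python illustration, not the actual fastfilters2 code: `LANES`, `UNROLL`, and `accumulate` are made-up stand-ins for the vector width, unroll factor, and one vectorized convolution step.

```python
import numpy as np

LANES = 8    # assumed vector width in elements
UNROLL = 4   # assumed unroll factor

def accumulate(dst, src):
    # Stand-in for one vectorized convolution step on a single vector.
    dst += src

def inner_loop(dst, src):
    # Buffers are padded to a multiple of LANES, so no scalar tail is needed.
    n = len(src)
    assert n % LANES == 0
    i = 0
    # Step 1: process UNROLL vectors per iteration while enough data remains.
    step = UNROLL * LANES
    while i + step <= n:
        for k in range(UNROLL):
            accumulate(dst[i + k * LANES : i + (k + 1) * LANES],
                       src[i + k * LANES : i + (k + 1) * LANES])
        i += step
    # Step 2: finish with unroll factor 1, one vector at a time.
    while i < n:
        accumulate(dst[i : i + LANES], src[i : i + LANES])
        i += LANES

n = 10 * LANES  # padded to a multiple of the vector size
src = np.ones(n)
dst = np.zeros(n)
inner_loop(dst, src)
```

With `n = 80`, step 1 handles two blocks of 32 elements and step 2 sweeps up the remaining 16, so every element is touched exactly once.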

When computing certain filters, results of intermediate separable
convolutions stored in temporary buffers could be reused.
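The reuse opportunity can be illustrated with a toy NumPy sketch: two separable filters that share the same first-axis smoothing pass. The `conv1d_axis` helper and the kernels are hypothetical, not the fastfilters2 API.

```python
import numpy as np

def conv1d_axis(data, kernel, axis):
    # Naive separable pass: convolve along one axis (same-size output).
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), axis, data)

smooth = np.array([0.25, 0.5, 0.25])  # toy smoothing kernel
deriv = np.array([0.5, 0.0, -0.5])    # toy derivative kernel

img = np.random.rand(16, 16)

# The y-smoothed intermediate is computed once and shared.
tmp = conv1d_axis(img, smooth, axis=0)
smoothed = conv1d_axis(tmp, smooth, axis=1)  # smoothing in x after y
grad_x = conv1d_axis(tmp, deriv, axis=1)     # x-derivative reuses tmp
```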
This test is intended to catch padding issues in internal buffers.

Set the target directory of FetchContent to a special directory in the
project. This preserves downloaded dependencies if the build directory
is gone.

Previously, we *excluded* PyPy builds; now we *include* CPython builds.

On some platforms, these headers might have been transitively included
by other headers.

This is necessary because `pytest.importorskip` skips the entire test
file, and it's more readable.

Also move parametrized test name generation into a hook in
`conftest.py`.
The current workaround for the LLVM inliner breaks MSVC.

See google/highway@c02c0e3

Position-independent code (PIC) means that all code addresses are
relative, which is required for a linkable static or shared library.

Converting a double to a float may lose precision. In our case this is
not a problem, but it makes MSVC with /W4 happier.

Also fix a critical bug where `convolve_axis` did not return anything.
Why did this even work in the first place?

Stubgen errors out on windows-2022.
Changes:
- If there are no files to upload, it is most certainly an error
- Disable compression, because sdists and wheels are archives themselves
- Allow overwriting artifacts with the same name; this should probably
  be changed later if `setuptools-scm` or a similar tool updates the
  artifact version

`hwy::CopyBytes` and `hwy::ZeroBytes` use intrinsics on GCC/Clang.

On some non-x86 platforms, the minimum alignment can be smaller than
the vector size.

`hwy_contrib` contains VQSort, which is used internally by the
auto-tuner. Previously, it was erroneously linked with the nanobind
module.

This is important because the address of `row_buf + radius` is not
generally aligned to the vector size. Note that the store can be aligned
in both the contiguous and strided cases.

This causes slowdowns on x86 and hardware exceptions on ARM. It is
possible to copy unaligned data to a temporary aligned buffer, but
unaligned arrays are quite rare, and it is good practice to keep
arrays aligned to their dtypes anyway.
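A minimal NumPy sketch of the alignment check and the optional copy-to-aligned-buffer fallback mentioned above. The 64-byte alignment value and helper names are assumptions for illustration, not the library's actual API.

```python
import numpy as np

ALIGN = 64  # assumed required alignment in bytes (vector size)

def is_aligned(arr, align=ALIGN):
    # An array is aligned if its data pointer is a multiple of `align`.
    return arr.ctypes.data % align == 0

def as_aligned(arr, align=ALIGN):
    """Return arr if aligned; otherwise copy it into an aligned buffer."""
    if is_aligned(arr, align):
        return arr
    # Over-allocate raw bytes and slice from the first aligned offset.
    buf = np.empty(arr.nbytes + align, dtype=np.uint8)
    offset = (-buf.ctypes.data) % align
    aligned = buf[offset : offset + arr.nbytes].view(arr.dtype).reshape(arr.shape)
    aligned[...] = arr
    return aligned

x = np.arange(1000, dtype=np.float32)
y = as_aligned(x)
```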
**k-dominik**

AMAZING! can't wait to check it out!!

In order to work well, ASan and UBSan need both compile-time and runtime
adjustments.
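A rough sketch of what such adjustments look like with Clang; the source file name and the specific option values here are assumptions, not the project's actual CI setup.

```sh
# Compile-time: instrument the code (the flags must also be passed at link time).
clang++ -fsanitize=address,undefined -fno-omit-frame-pointer -g -O1 \
    -o test_filters test_filters.cpp

# Runtime: tune the sanitizers' behavior via environment variables.
ASAN_OPTIONS=detect_leaks=1 UBSAN_OPTIONS=halt_on_error=1 ./test_filters
```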
Also, use `hwy::VectorBytes()` to obtain the vector size instead of a
custom function.
It is practically impossible to obtain results that are precisely equal
to the old implementation without a loss in performance and
implementation flexibility. Therefore, some amount of numerical
difference between implementations is allowed.

Previously, we tried to compare maximum absolute differences between
filter outputs, using the full uint8 dynamic range (mapped to float32).
The reasoning was to be as close to the real-world usage as possible.
This resulted in significant differences between implementations.

However, the old implementation used much laxer tests to verify its
outputs w.r.t. the Vigra library. First, it used values in the [0; 1)
range, where precision is much higher. Second, it used the mean
absolute difference (*not* the max) for 3D multi-channel filters, where
numerical differences are especially pronounced.

Now, we use the same strategy for compatibility testing. The tests are
a bit different: we test more values of sigma (scale), but skip testing
the structure tensor with a different derivative scale (because it is
almost always derived from the main outer scale).

Finally, the script `resultcompare.py` has been updated to match the
test code.
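The comparison strategy described above can be sketched as follows; `filter_a` and `filter_b` are hypothetical stand-ins for the two implementations, and the tolerance is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((32, 32, 32), dtype=np.float32)  # values in [0, 1)

def filter_a(x):
    # Stand-in for the reference implementation.
    return x * np.float32(0.5)

def filter_b(x):
    # Stand-in for the new implementation, with tiny numeric drift.
    return x * np.float32(0.5) + np.float32(1e-7)

# Mean absolute difference (not max) tolerates isolated large deviations.
mad = np.mean(np.abs(filter_a(data) - filter_b(data)))
```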
Name checks that trigger custom formatting are now more relaxed. This
should cover cases where the name needs to be more precise (e.g.
`src_shape`).
cibuildwheel changes:
- Don't build wheels for the musl libc on Linux: they double the build
  time, and probably won't be used (we can always enable them again
  later)
- Try `delvewheel` on Windows

GitHub Actions changes:
- Split a single workflow into `build` and `release`: this is cleaner
  than a single workflow that usually skips the upload step
- Add Linux ARM and macOS Intel runners
- Use trusted publishing instead of username/password or token on PyPI:
  https://docs.pypi.org/trusted-publishers/
- Fix artifact uploading by giving them unique names
**emilmelnikov** (Member, Author)

> AMAZING! can't wait to check it out!!

Thanks! :)

With the latest batch of changes, and after fixing possible CI issues, this PR should be ready.

I've tried to plug this into ilastik as `import fastfilters2.compat as fastfilters` in `lazyflow/operators/opPixelFeaturesPresmoothed.py` and `lazyflow/operators/filterOperators.py`, and it seems to work fine.

I tested it on a random 1024^3 uint8 dataset, and the speedup is not really that great, only 1.19x. This was measured in a somewhat hacky way by doing this (only the latest `cumulative_time` values are compared):

```python
class OpPixelFeaturesPresmoothed(Operator):
    def __init__(self, *args, **kwargs):
        ...
        self._cumulative_time = 0

    def execute(self, slot, subindex, slot_roi, target):
        if slot == self.Features:
            ...
        elif slot == self.Output:
            t0 = time.perf_counter()
            ...
            t1 = time.perf_counter()
            self._cumulative_time += t1 - t0
            print(f"OpPixelFeaturesPresmoothed: cumulative_time = {self._cumulative_time}")
```

This is because the main speedups came from vectorizing the border handling code in v1, but with presmoothing most borders are only 1-2 pixels wide. I think it's still possible to do better by fusing some passes together (which is easier if the border size is statically known), but maybe later :)

However, I suspect that we can score some easy wins by modifying block shapes. I've monitored the exact block shapes that are passed into the library, and they usually have ~32 pixels in one dim and ~200 pixels in the other dims, which is not really optimal. I'm not sure where exactly these block shapes are defined; I tried to update `OpFeatureSelection.setupOutputs`, but the performance tanked. This is probably due to a shape mismatch in the block cache, or something like that?

- Fetch the entire history during checkout. This is necessary for
  `setuptools-scm`. It may be possible to fetch only the most recent
  tag, and all commits from this tag up to the current commit, but this
  requires more testing.
- Build sdist in the build workflow, instead of the release workflow: it
  is good to test sdist packaging too, and it is cheap compared to
  building multiple wheels.
- Add a minimum deployment target override for macOS on ARM in order to
  suppress CI warnings.
- `setuptools-scm` is convenient, but behaves in non-intuitive ways.
  Moreover, publishing to conda-forge requires specifying a static
  version in the recipe anyway. In the end, it is probably easier to
  write a dev script that patches version strings in a couple of places
  and triggers a new build.
- Need to "append" to a TOML table in the cibuildwheel override section.
**emilmelnikov** (Member, Author) commented Apr 30, 2025

@k-dominik PyPI introduced a new way to publish packages: trusted publishing. It is better because you don't have to copy-paste tokens or credentials, but it needs some configuration on the part of the project owner org on PyPI.

You need to add a trusted publisher in the publishing tab.

  • Owner: ilastik
  • Repository name: fastfilters2
  • Workflow name: release.yaml
  • Environment name: pypi

Not 100% sure, but I think this is all you have to do.
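For reference, a hedged sketch of what the corresponding `release.yaml` publish job might look like; the job name, artifact layout, and action versions are assumptions, not necessarily what this repository uses.

```yaml
jobs:
  publish:
    runs-on: ubuntu-latest
    environment: pypi        # must match the environment name configured on PyPI
    permissions:
      id-token: write        # required for OIDC-based trusted publishing
    steps:
      - uses: actions/download-artifact@v4
        with:
          path: dist
          merge-multiple: true
      - uses: pypa/gh-action-pypi-publish@release/v1
```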
