-
Notifications
You must be signed in to change notification settings - Fork 0
Big update #23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Big update #23
Conversation
- Almost all loads and stores are now aligned. - The inner loop of `conv` functions now works in 2 steps: unroll up to the specified factor first, and finish with the unroll factor 1. This is similar to how LLVM unrolls loops. - Now buffers need to be padded only up to the multiple of the vector size. This simplifies buffer management. - `fastfilters2::simd` now communicates via separately defined structs. - Split ufuncs into separate functions. - Add some comments that try to explain what is going on here. - During module import, select the fastest implementation (autotune). - Remove kernel coefficients tests: it is enough to check that `gaussian_smoothing` is identical; can probably add C++ tests later. - Revert back to testing on randomly sampled data. - Add some tests for valid and invalid filter inputs. - Improve CMakeLists. - Add .clangd for IDEs/tools that support it (without this config clangd couldn't properly index Highway code).
When computing certain filters, results of intermediate separable convolutions stored in temporary buffers could be reused.
This test is intended to catch padding issues in internal buffers.
Set the target directory of FetchContent to a special directory in the project. This preserves downloaded dependencies if the build directory is gone.
Previously, we *excluded* PyPy builds, now we *include* CPython builds.
c22e2ce
to
cdcef35
Compare
On some platforms, these headers might have been transitively included by other headers.
970ff39
to
028e85d
Compare
This is necessary because `pytest.importorskip` skips the entire test file, and it's more readable. Also move parametrized test name generation into a hook in `conftest.py`.
The current workaround for the LLVM inliner breaks MSVC. See google/highway@c02c0e3
Position-independent code (PIC) means that all code addresses are relative, which is required for a linkable static or shared library.
Converting double to float may lose precision. In our case this is not a problem, but makes MSVC with /W4 happier. Also fix a critical bug where `convolve_axis` did not return anything. Why this even worked in the first place?..
96a91f3
to
14bb296
Compare
Stubgen errors out on windows-2022.
Changes: - If there are no files to upload, is it most certainly an error - Disable compression because sdists and wheels are archives themselves - Allow overwriting artifacts with the same name; this probably should be changed later if `setuptools-scm` or a similar tool will update the artifact version
`hwy::CopyBytes` and `hwy::ZeroBytes` use intrinsics on GCC/Clang.
On some non-x86 platforms the minimum alignment can be smaller than the vector size.
`hwy_contrib` contains VQSort which is used internally by auto-tuner. Previously, it was erroneously linked with the nanobind module.
This is important because the address of `row_buf + radius` is not generally aligned to the vector size. Note that the store can be aligned in both contiguous and strided cases.
This causes slowdowns on x86, and hardware exceptions on ARM. It is possible to copy unaligned data to a temporary aligned buffer, but unaligned arrays are quite rare, and it is a good practice to keep arrays aligned to their dtypes anyway.
AMAZING! can't wait to check it out!! |
In order to work well, ASan and UBSan need both compile-time and runtime adjustments.
Also, use `hwy:VectorBytes()` to obtain the number of lanes instead of a custom function.
It is practically impossible to obtain results that are precisely equal to the old implementation without a loss in performance and loss in the implementation flexibility. Therefore some amount of numerical differences between implementations is allowed. Previously, we tried to compare maximum absolute differences between filter outputs, using the full uint8 dynamic range (mapped to float32). The reasoning was to be as close to the real-world usage as possible. This resulted in significant differences between implementations. However, the old implementation used much laxer tests in order to verify its outputs w.r.t. to the Vigra library. First, it used values in the [0; 1) range, where precision is much higher. Second, it used mean absolute difference (*not* max) for 3D multi-channel filters, where numerical differences are especially pronounced. Now, we are using the same strategy for the compatibility testing. The tests are a bit different: we test for more values of sigma (scale), but skip testing the structure tensor with the different derivative scale (because it is almost always derived from the main outer scale). Finally, the script `resultcompare.py` has been updated to match the test code.
Now name checks that trigger custom formatting are more relaxed. This should cover cases when the name should be more precise (e.g. `src_shape`).
cibuildwheel changes: - Don't build wheels for the musl libc on Linux: they double the build time, and probably won't be used (we can always enable them again later) - Try `delvewheel` on Windows GitHub Actions changes: - Split a single workflow into `build` and `release`: this is cleaner than a single workflow that usually skips the upload step - Add Linux ARM and macOS Intel runners - Use trusted publishing instead of username/password or token on PyPI: https://docs.pypi.org/trusted-publishers/ - Fix artifact uploading by giving them unique names
Thanks! :) With the latest batch of changes, and after fixing possible CI issues, this PR should be ready. I've tried to plug this into ilastik as I tested it on a random 1024^3 uint8 dataset, and the speedup is not really that great, only 1.19x. This was measured in a somewhat hacky way by doing this (only the latest class OpPixelFeaturesPresmoothed(Operator):
def __init__(self, *args, **kwargs):
...
self._cumulative_time = 0
def execute(self, slot, subindex, slot_roi, target):
if slot == self.Features:
...
elif slot == self.Output:
t0 = time.perf_counter()
...
t1 = time.perf_counter()
self._cumulative_time += t1 - t0
print(f"OpPixelFeaturesPresmoothed: cumulative_time = {self._cumulative_time}") This is because main speedups came by vectorizing the border handling code in v1, but with presmoothing most borders are only 1-2 pixels long. I think it's still possible to do it better by fusing some passes together (it's easier to do if the border size is statically known), but maybe later :) However, I suspect that we can score some easy wins by modifying block shapes. I've monitored the exact block shapes that are passed into the library, and they usually have ~32 pixels in one dim, and ~200 pixels in other dims, which is not really optimal. I'm not sure where exactly these block shapes are defined, tried to update |
- Fetch the entire history during checkout. This is necessary for `setuptools-scm`. It may be possible to fetch only the most recent tag, and all commits from this tag up to the current commit, but this requires more testing. - Build sdist in the build workflow, instead of the release workflow: it is good to test sdist packaging too, and it is cheap compared to building multiple wheels. - Add a minimum deployment target override for macOS on ARM in order to suppress CI warnings
- setuptools-scm is convenient, but behaves in non-intuitive ways. Moreover, publishing to conda-forge requires specifying a static version in the recipe anyway. In the end, it is probably easier to write a dev script that patches version strings in a couple of places and triggers a new build. - Need to "append" to a TOML table in cibuildwheel override section
@k-dominik PyPI introduced a new way to publish packages: trusted publishing. It is better because you don't have to copy-paste tokens or credentials, but it needs some configuration on the part of the project owner org on PyPI. You need to add a trusted publisher in the publishing tab.
Not 100% sure, but I think this is all you have to do. |
conv
functions now works in 2 steps: unroll up tothe specified factor first, and finish with the unroll factor 1.
This is similar to how LLVM unrolls loops.
size. This simplifies buffer management.
fastfilters2::simd
now communicates via separately defined structs.gaussian_smoothing
is identical; can probably add C++ tests later.clangd couldn't properly index Highway code).