GPU implementation (almost complete) #46
Open: kartikmandar wants to merge 24 commits into tyler-a-cox:main from kartikmandar:gpu-implementation
Conversation
- Implement direct GPU evaluation for AiryBeam and UVBeam interpolation
- Achieve 40-2000x speedup while maintaining numerical accuracy
- Test that fftvis (CPU/GPU) and matvis (CPU/GPU) produce identical results
- Fix GPU implementation slice handling and loop structure
- All implementations match within numerical precision
TODO: Fix GPU-CPU accuracy differences (max diff ~56.3)
TODO: Fix tutorial typos and remove debugging cells
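A minimal way to quantify the GPU-CPU mismatch mentioned above is the maximum elementwise absolute difference (a generic helper for illustration, not code from this PR):

```python
def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two equal-length sequences.

    abs() of a complex difference gives its magnitude, so this works for
    complex visibilities as well as real arrays.
    """
    return max(abs(x - y) for x, y in zip(a, b))
```

Tracking this number per fix makes it easy to see when the backends converge to within numerical precision.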
- Update test_use_gpu_function to handle cached GPU availability correctly
- Fix CPU simulation tests to match current fftvis-matvis API compatibility
- Update GPU tests to use current CuPy API (OutOfMemoryError, random.random)
- Increase Ray object store memory to meet minimum requirements
Compare fftvis (CPU/GPU) vs matvis performance across varying sources, times, frequencies, and baselines
- Renamed modules: cpu_beams→beams, cpu_nufft→nufft, etc. (keeping cpu_simulate/gpu_simulate)
- Fixed all imports to use the new module names
- Resolved conflicts in 4 test files
- Fixed CoordinateRotation.select_chunk() call to use the correct number of arguments (2 instead of 3)
- Fixed backend parameter compatibility by separating CPU-specific parameters in the wrapper
- Fixed test to use the correct beam function names for CPU vs GPU comparison
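The backend-parameter separation could be sketched like this (the parameter names here are hypothetical placeholders, not the wrapper's real ones):

```python
# Hypothetical CPU-only parameter names; the actual wrapper's set may differ.
CPU_ONLY_PARAMS = {"nthreads", "eps_cpu"}

def split_backend_kwargs(kwargs):
    """Split keyword arguments into CPU-specific and backend-agnostic dicts,
    so the GPU path never receives parameters it does not understand."""
    cpu_kwargs = {k: v for k, v in kwargs.items() if k in CPU_ONLY_PARAMS}
    common = {k: v for k, v in kwargs.items() if k not in CPU_ONLY_PARAMS}
    return cpu_kwargs, common
```

Filtering once in the wrapper keeps the CPU and GPU simulate functions free of each other's keyword arguments.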
Force-pushed from 63bcecc to 17aeb38.
- Fix type checking to handle BeamInterface-wrapped beams
- Add polarized flag to cache key to prevent data collision
- Fix CuPy compatibility with wrap mode for azimuth coordinates
- GPU and CPU now match within numerical precision (3e-9)
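The cache-key and azimuth fixes can be sketched as follows (names are illustrative, not the actual fftvis API): including the polarized flag in the key keeps polarized and unpolarized results from colliding, and azimuth is wrapped into [0, 2π) before interpolation.

```python
import math

def beam_cache_key(beam_id, freq_hz, polarized):
    # Including `polarized` in the key prevents a polarized evaluation from
    # being served a cached unpolarized result (and vice versa).
    return (beam_id, float(freq_hz), bool(polarized))

def wrap_azimuth(az):
    # Wrap azimuth into [0, 2*pi) so interpolation near the 0/2*pi seam
    # behaves identically on CPU and GPU backends.
    return az % (2.0 * math.pi)
```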
… saving
- Add 4-5 more tiers for each benchmark parameter (sources, times, frequencies, antennas)
- Add GPU vs CPU speedup to all speedup comparison plots
- Add system information collection (CPU, RAM, GPU specs)
- Add comprehensive results saving with timestamped directory
- Include visibility consistency checks between all backends
- Fix baseline calculation to match matvis (all N×N baselines)
- Generate summary reports with key findings and speedup statistics
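Timestamped results saving could look like this (the directory layout is an assumption for illustration, not the benchmark script's actual structure):

```python
import datetime
import pathlib

def make_results_dir(root="benchmark_results"):
    # One directory per run, named by timestamp, so repeated benchmark
    # runs never overwrite earlier results.
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    path = pathlib.Path(root) / stamp
    path.mkdir(parents=True, exist_ok=True)
    return path
```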
Simplified the GPU NUFFT fallback hierarchy by removing the incorrect Type 1+2 decomposition. It now raises an informative error with installation instructions for cufinufft==2.4.0b1 and finufft==2.4.0rc1 when the Plan-based Type 3 transform fails.
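The simplified fallback can be sketched as follows (the package pins are taken from the description above, but `plan_fn` is a placeholder standing in for the cufinufft Plan-based entry point, not the real cufinufft API):

```python
def gpu_nufft_type3(*args, plan_fn=None, **kwargs):
    """Run a Plan-based Type 3 NUFFT, or fail with actionable guidance.

    Instead of silently falling back to an incorrect Type 1+2 decomposition,
    the unavailable case raises an error that tells the user exactly which
    pre-release versions to install.
    """
    if plan_fn is None:
        raise ImportError(
            "Plan-based Type 3 NUFFT unavailable. Install the pinned "
            "pre-releases: pip install cufinufft==2.4.0b1 finufft==2.4.0rc1"
        )
    return plan_fn(*args, **kwargs)
```

Failing loudly with the exact pins is preferable here because the silent fallback produced wrong results.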
- Add conftest.py with centralized GPU availability detection
- Skip GPU tests when cupy is not available
- Add fallbacks for optional dependencies (tabulate, matplotlib)
- Fix conditional imports in GPU test modules
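Centralized GPU detection along these lines (a sketch, not the repo's actual conftest.py) lets every GPU test share one cached check:

```python
import functools

@functools.lru_cache(maxsize=1)
def gpu_available():
    """Return True only when cupy imports and a CUDA device responds.

    lru_cache makes the (potentially slow) device probe run once per
    session, matching the cached-availability behavior described above.
    """
    try:
        import cupy
        cupy.cuda.runtime.getDeviceCount()
        return True
    except Exception:
        return False
```

In conftest.py this would back a `pytest.mark.skipif` marker, so GPU tests are skipped rather than errored on CPU-only machines.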
Author (Collaborator):
I am focusing on increasing test coverage and writing better tests now.
GPU was 2.7x slower than CPU for high frequency counts due to sequential processing. Now batch multiple frequencies per NUFFT call.
- Add batch NUFFT functions for 2D and 3D transforms
- Process frequencies in batches (up to 128) based on GPU memory
- Reduce kernel launches from O(n_freq) to O(n_freq/batch_size)
Results: 240 frequencies now 5.78x faster (79.58s → 13.77s); GPU now 2.5x faster than CPU.
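The batching strategy above can be sketched as: pick a batch size from available GPU memory (capped at the 128 mentioned), then sweep frequencies in contiguous chunks so each NUFFT call covers a whole batch. The function name and memory heuristic are illustrative assumptions, not the PR's actual code.

```python
def frequency_batches(n_freq, free_bytes, bytes_per_freq, max_batch=128):
    """Yield (start, stop) index ranges covering all n_freq frequencies.

    Batch size is limited by how many per-frequency work buffers fit in
    free GPU memory, and capped at max_batch. Launching one NUFFT per
    batch reduces kernel launches from O(n_freq) to O(n_freq / batch).
    """
    fit = max(1, free_bytes // max(1, bytes_per_freq))
    batch = min(max_batch, fit, n_freq)
    for start in range(0, n_freq, batch):
        yield start, min(start + batch, n_freq)
```

For the 240-frequency case above, a 128-frequency cap yields two launches instead of 240 sequential ones.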
It has some CUDA conflicts right now that I have to sort out. While testing cupy and cupy-cuda11x and using CUDA directly, I messed up my environment, so I have to fix that before further testing. I will probably try this on Colab/RunPod if the problem persists.
I will request a review once I feel it is error-free and the tutorial is working fine.