Fast Function Approximations lowering. #8566

Open
wants to merge 84 commits into base: main

84 commits
60c378a
Fast vectorizable atan and atan2 functions.
mcourteaux Aug 10, 2024
aceab1d
Default to not using fast atan versions if on CUDA.
mcourteaux Aug 10, 2024
7b71f17
Finished fast atan/atan2 functions and tests.
mcourteaux Aug 10, 2024
e611a56
Correct attribution.
mcourteaux Aug 10, 2024
5c221d8
Clang-format
mcourteaux Aug 10, 2024
2c1c4b6
Weird WebAssembly limits...
mcourteaux Aug 11, 2024
bef3ee5
Small improvements to the optimization script.
mcourteaux Aug 11, 2024
b6814e6
Polynomial optimization for log, exp, sin, cos with correct ranges.
mcourteaux Aug 11, 2024
69f31f6
Improve fast atan performance tests for GPU.
mcourteaux Aug 12, 2024
cb74486
Bugfix fast_atan approximation. Fix correctness test to exceed the ra…
mcourteaux Aug 12, 2024
3cc41d8
Cleanup
mcourteaux Aug 12, 2024
4e3e589
Enum class instead of enum for ApproximationPrecision.
mcourteaux Aug 12, 2024
ac26269
Weird Metal limits. There should be a better way...
mcourteaux Aug 12, 2024
d519692
Skip test for WebGPU.
mcourteaux Aug 12, 2024
33f8fe4
Fast atan/atan2 polynomials reoptimized. New optimization strategy: ULP.
mcourteaux Aug 13, 2024
d6d2563
Feedback Steven.
mcourteaux Aug 13, 2024
4b6b61c
More comments and test mantissa error.
mcourteaux Aug 14, 2024
44e2b42
Do not error when testing arctan performance on Metal / WebGPU.
mcourteaux Aug 14, 2024
9f94e4b
Rework precision specification. Generalize towards using this for oth…
mcourteaux Nov 11, 2024
9d65630
Clang-format.
mcourteaux Nov 11, 2024
acc1b92
Fix makefile and clang-tidy.
mcourteaux Nov 11, 2024
f0c1e0b
Fix incorrect approximation selection when required precision is not …
mcourteaux Nov 12, 2024
707e0af
Feedback from Steven.
mcourteaux Dec 3, 2024
f2d9bff
Implemented approximation tables for sin, cos, exp, log fast variants…
mcourteaux Feb 4, 2025
c036d72
Clang-format.
mcourteaux Feb 4, 2025
d39bfe7
Move Polynomial Optimizer Python script to tools/ directory.
mcourteaux Feb 4, 2025
98bbfdd
Enable performance test for fast_atan and fast_atan2.
mcourteaux Feb 4, 2025
da504ad
LLVM upper-limit 99 (CMake needs an upper limit).
mcourteaux Feb 4, 2025
cfce723
Add LLVM IR for PTX sin.approx, cos.approx, tanh.approx
mcourteaux Feb 4, 2025
39176d9
Implemented tan. Improved polynomial optimizer performance for MULPE …
mcourteaux Feb 5, 2025
5107cae
Implemented tanh, tan. Many improvements to accuracy test and perform…
mcourteaux Feb 5, 2025
85d000a
Clang-format.
mcourteaux Feb 5, 2025
ed2527f
WIP: Fiddle with strict_float behavior in CSE. Fix fast math precisio…
mcourteaux Feb 7, 2025
0bcce87
Nuke MAE_MULPE. Separate optimized MULPE-corrected sin and cos.
mcourteaux Feb 8, 2025
48db71b
Clang-format
mcourteaux Feb 8, 2025
7a018d0
Some cleanup.
mcourteaux Feb 8, 2025
21e5398
Fix sine.
mcourteaux Feb 8, 2025
5fca1ab
Fix clang-tidy. Mark OpenCL exp() as fast.
mcourteaux Feb 8, 2025
1e6320b
Clang format is annoying me.
mcourteaux Feb 8, 2025
8a18778
Remove my experimental CSE step.
mcourteaux Feb 9, 2025
6ce2ec6
OpenCL performance of fast_exp forced poly is expected to be worse.
mcourteaux Feb 9, 2025
d78fcb2
OpenCL fast functions selected for fast transcendentals.
mcourteaux Feb 9, 2025
b4fbdf4
Lower fast intrinsics on metal to the fast:: namespace versions.
mcourteaux Feb 9, 2025
56e0d12
Split tables for sin and cos, as metal has odd precision for sin. Add…
mcourteaux Feb 9, 2025
5a1f78c
Move range_reduce_log to a header. Drive-by fix listing libOpenCL.so.…
mcourteaux Feb 10, 2025
3aa14b4
Fix API documentation. Improve measuring accuracy. Fix vector_math te…
mcourteaux Feb 10, 2025
a8b4917
Also vectorize on GPU to make sure we test that.
mcourteaux Feb 11, 2025
f997c6a
Add FastMathFunctions.cpp to Makefile
mcourteaux Feb 11, 2025
47915c4
Add support for derivatives for the fast_ intrinsics.
mcourteaux Feb 11, 2025
a814955
Remove unused helper function.
mcourteaux Feb 11, 2025
4e8611d
Add in a grace factor for precision when the system does not support FMA.
mcourteaux Feb 11, 2025
b1128ed
Clang Format.
mcourteaux Feb 11, 2025
e170c6e
Windows doesn't print thousand separators with printf. :(
mcourteaux Feb 11, 2025
4130e44
Remove grace factor, and use safety factor of 5% when selecting a pol…
mcourteaux Feb 16, 2025
d2d05c5
Use 50% tighter constraints when no FMA is available to compensate fo…
mcourteaux Feb 17, 2025
36b81e9
Clang-format.
mcourteaux Feb 17, 2025
8b5b9d9
Working on better optimizations. Improving PR and code.
mcourteaux Mar 12, 2025
bbe7600
Implemented fast_asin() fast_acos(). Slowly redoing coefficients.
mcourteaux Mar 12, 2025
8efc18f
WIP: determine precision of the polynomials.
mcourteaux Mar 13, 2025
bbced27
Revived all tests.
mcourteaux Mar 14, 2025
d71f59c
Clang format
mcourteaux Mar 14, 2025
42bc82d
Implement expm1. Fix accuracy of tanh. Fix lowering of tanh on CUDA. …
mcourteaux Mar 15, 2025
9710ae3
Clang-format
mcourteaux Mar 15, 2025
935c651
Feedback, and remove expm1 test.
mcourteaux Mar 15, 2025
9614851
Fix compilation issues.
mcourteaux Mar 15, 2025
1c2ee24
One more compilation issue.
mcourteaux Mar 15, 2025
08e96f3
Fixed a bracket.
mcourteaux Mar 15, 2025
1dea659
Update some precision info on math intrinsics for Vulkan and Metal.
mcourteaux Mar 17, 2025
591f20d
Fix makefile after I accidentally broke it by sorting files alphabeti…
mcourteaux Apr 9, 2025
4971a0e
Add fast math calls to new extern_function_name_map for OpenCL.
mcourteaux Jun 1, 2025
bc63788
Move fast function calls to extern table for Metal.
mcourteaux Jun 1, 2025
2d2ad60
Try to fix compile/test issues.
mcourteaux Jun 1, 2025
9b063fb
Fix Makefile and symbol visibility issue.
mcourteaux Jun 1, 2025
5ee7c6a
Clang-format
mcourteaux Jun 1, 2025
58bf523
Make use of the new strict_float intrinsics for the fast math functions.
mcourteaux Jun 14, 2025
845d83a
Relax performance tests for GPUs.
mcourteaux Jun 14, 2025
48f2096
Clang-format
mcourteaux Jun 14, 2025
fc53345
Fix incorrect forward declaration.
mcourteaux Jun 14, 2025
9b4c5e4
Fix acos on Metal. Relax perf-test for tanh on OpenCL.
mcourteaux Jun 16, 2025
f58f349
Fix strict float behavior for the fast_tan function. Implemented spli…
mcourteaux Jul 3, 2025
d2604a5
Enable fp16 fast_math functions without promises.
mcourteaux Jul 3, 2025
80feb6a
Clear internal assert, as it assumed SSE floating point behavior, whi…
mcourteaux Jul 3, 2025
acdd764
Let CodeGen_C handle all float-literal printing (also for Float(16) i…
mcourteaux Jul 4, 2025
c05f2cc
Fix internal test for CodeGen_C given the scientific way of printing …
mcourteaux Jul 4, 2025
3 changes: 3 additions & 0 deletions .gitignore
@@ -240,6 +240,9 @@ xcuserdata
  # NeoVim + clangd
  .cache

+ # CCLS
+ .ccls-cache
+
  # Emacs
  tags
  TAGS
88 changes: 46 additions & 42 deletions Makefile
@@ -424,21 +424,24 @@ SOURCE_FILES = \
  AlignLoads.cpp \
  AllocationBoundsInference.cpp \
  ApplySplit.cpp \
+ ApproximationTables.cpp \
  Argument.cpp \
  AssociativeOpsTable.cpp \
  Associativity.cpp \
  AsyncProducers.cpp \
  AutoScheduleUtils.cpp \
+ BoundConstantExtentLoops.cpp \
+ BoundSmallAllocations.cpp \
  BoundaryConditions.cpp \
  Bounds.cpp \
  BoundsInference.cpp \
- BoundConstantExtentLoops.cpp \
- BoundSmallAllocations.cpp \
  Buffer.cpp \
+ CPlusPlusMangle.cpp \
+ CSE.cpp \
  Callable.cpp \
  CanonicalizeGPUVars.cpp \
- Closure.cpp \
  ClampUnsafeAccesses.cpp \
+ Closure.cpp \
  CodeGen_ARM.cpp \
  CodeGen_C.cpp \
  CodeGen_D3D12Compute_Dev.cpp \
@@ -448,20 +451,18 @@ SOURCE_FILES = \
  CodeGen_LLVM.cpp \
  CodeGen_Metal_Dev.cpp \
  CodeGen_OpenCL_Dev.cpp \
- CodeGen_Vulkan_Dev.cpp \
+ CodeGen_PTX_Dev.cpp \
  CodeGen_Posix.cpp \
  CodeGen_PowerPC.cpp \
- CodeGen_PTX_Dev.cpp \
  CodeGen_PyTorch.cpp \
  CodeGen_RISCV.cpp \
+ CodeGen_Vulkan_Dev.cpp \
  CodeGen_WebAssembly.cpp \
  CodeGen_WebGPU_Dev.cpp \
  CodeGen_X86.cpp \
  CompilerLogger.cpp \
  ConstantBounds.cpp \
  ConstantInterval.cpp \
- CPlusPlusMangle.cpp \
- CSE.cpp \
  Debug.cpp \
  DebugArguments.cpp \
  DebugToFile.cpp \
@@ -482,6 +483,7 @@ SOURCE_FILES = \
  Expr.cpp \
  ExtractTileOperations.cpp \
  FastIntegerDivide.cpp \
+ FastMathFunctions.cpp \
  FindCalls.cpp \
  FindIntrinsics.cpp \
  FlattenNestedRamps.cpp \
@@ -493,26 +495,26 @@ SOURCE_FILES = \
  Generator.cpp \
  HexagonOffload.cpp \
  HexagonOptimize.cpp \
- ImageParam.cpp \
- InferArguments.cpp \
- InjectHostDevBufferCopies.cpp \
- Inline.cpp \
- InlineReductions.cpp \
- IntegerDivisionTable.cpp \
- Interval.cpp \
  IR.cpp \
  IREquality.cpp \
  IRMatch.cpp \
  IRMutator.cpp \
  IROperator.cpp \
  IRPrinter.cpp \
  IRVisitor.cpp \
+ ImageParam.cpp \
+ InferArguments.cpp \
+ InjectHostDevBufferCopies.cpp \
+ Inline.cpp \
+ InlineReductions.cpp \
+ IntegerDivisionTable.cpp \
+ Interval.cpp \
  JITModule.cpp \
- Lambda.cpp \
- Lerp.cpp \
  LICM.cpp \
  LLVM_Output.cpp \
  LLVM_Runtime_Linker.cpp \
+ Lambda.cpp \
+ Lerp.cpp \
  LoopCarry.cpp \
  Lower.cpp \
  LowerParallelTasks.cpp \
@@ -535,8 +537,8 @@ SOURCE_FILES = \
  PurifyIndexMath.cpp \
  PythonExtensionGen.cpp \
  Qualify.cpp \
- Random.cpp \
  RDom.cpp \
+ Random.cpp \
  Realization.cpp \
  RealizationOrder.cpp \
  RebaseLoopsToZero.cpp \
@@ -550,28 +552,28 @@ SOURCE_FILES = \
  SelectGPUAPI.cpp \
  Serialization.cpp \
  Simplify.cpp \
+ SimplifyCorrelatedDifferences.cpp \
+ SimplifySpecializations.cpp \
  Simplify_Add.cpp \
  Simplify_And.cpp \
  Simplify_Call.cpp \
  Simplify_Cast.cpp \
- Simplify_Reinterpret.cpp \
  Simplify_Div.cpp \
  Simplify_EQ.cpp \
  Simplify_Exprs.cpp \
- Simplify_Let.cpp \
  Simplify_LT.cpp \
+ Simplify_Let.cpp \
  Simplify_Max.cpp \
  Simplify_Min.cpp \
  Simplify_Mod.cpp \
  Simplify_Mul.cpp \
  Simplify_Not.cpp \
  Simplify_Or.cpp \
+ Simplify_Reinterpret.cpp \
  Simplify_Select.cpp \
  Simplify_Shuffle.cpp \
  Simplify_Stmts.cpp \
  Simplify_Sub.cpp \
- SimplifyCorrelatedDifferences.cpp \
- SimplifySpecializations.cpp \
  SkipStages.cpp \
  SlidingWindow.cpp \
  Solve.cpp \
@@ -623,17 +625,20 @@ HEADER_FILES = \
  AlignLoads.h \
  AllocationBoundsInference.h \
  ApplySplit.h \
+ ApproximationTables.h \
  Argument.h \
  AssociativeOpsTable.h \
  Associativity.h \
  AsyncProducers.h \
  AutoScheduleUtils.h \
+ BoundConstantExtentLoops.h \
+ BoundSmallAllocations.h \
  BoundaryConditions.h \
  Bounds.h \
  BoundsInference.h \
- BoundConstantExtentLoops.h \
- BoundSmallAllocations.h \
  Buffer.h \
+ CPlusPlusMangle.h \
+ CSE.h \
  Callable.h \
  CanonicalizeGPUVars.h \
  ClampUnsafeAccesses.h \
@@ -645,18 +650,16 @@ HEADER_FILES = \
  CodeGen_LLVM.h \
  CodeGen_Metal_Dev.h \
  CodeGen_OpenCL_Dev.h \
- CodeGen_Vulkan_Dev.h \
- CodeGen_Posix.h \
  CodeGen_PTX_Dev.h \
+ CodeGen_Posix.h \
  CodeGen_PyTorch.h \
  CodeGen_Targets.h \
+ CodeGen_Vulkan_Dev.h \
  CodeGen_WebGPU_Dev.h \
  CompilerLogger.h \
  ConciseCasts.h \
- CPlusPlusMangle.h \
  ConstantBounds.h \
  ConstantInterval.h \
- CSE.h \
  Debug.h \
  DebugArguments.h \
  DebugToFile.h \
@@ -681,6 +684,7 @@ HEADER_FILES = \
  ExternFuncArgument.h \
  ExtractTileOperations.h \
  FastIntegerDivide.h \
+ FastMathFunctions.h \
  FindCalls.h \
  FindIntrinsics.h \
  FlattenNestedRamps.h \
@@ -693,6 +697,13 @@ HEADER_FILES = \
  Generator.h \
  HexagonOffload.h \
  HexagonOptimize.h \
+ IR.h \
+ IREquality.h \
+ IRMatch.h \
+ IRMutator.h \
+ IROperator.h \
+ IRPrinter.h \
+ IRVisitor.h \
  ImageParam.h \
  InferArguments.h \
  InjectHostDevBufferCopies.h \
@@ -701,20 +712,12 @@ HEADER_FILES = \
  IntegerDivisionTable.h \
  Interval.h \
  IntrusivePtr.h \
- IR.h \
- IREquality.h \
- IRMatch.h \
- IRMutator.h \
- IROperator.h \
- IRPrinter.h \
- IRVisitor.h \
- WasmExecutor.h \
  JITModule.h \
- Lambda.h \
- Lerp.h \
  LICM.h \
  LLVM_Output.h \
  LLVM_Runtime_Linker.h \
+ Lambda.h \
+ Lerp.h \
  LoopCarry.h \
  LoopPartitioningDirective.h \
  Lower.h \
@@ -740,18 +743,16 @@ HEADER_FILES = \
  PurifyIndexMath.h \
  PythonExtensionGen.h \
  Qualify.h \
+ RDom.h \
  Random.h \
  Realization.h \
- RDom.h \
  RealizationOrder.h \
  RebaseLoopsToZero.h \
  Reduction.h \
  RegionCosts.h \
  RemoveDeadAllocations.h \
  RemoveExternLoops.h \
  RemoveUndef.h \
- runtime/HalideBuffer.h \
- runtime/HalideRuntime.h \
  Schedule.h \
  ScheduleFunctions.h \
  Scope.h \
@@ -785,7 +786,10 @@ HEADER_FILES = \
  Util.h \
  Var.h \
  VectorizeLoops.h \
- WrapCalls.h
+ WasmExecutor.h \
+ WrapCalls.h \
+ runtime/HalideBuffer.h \
+ runtime/HalideRuntime.h

  OBJECTS = $(SOURCE_FILES:%.cpp=$(BUILD_DIR)/%.o)
  HEADERS = $(HEADER_FILES:%.h=$(SRC_DIR)/%.h)
@@ -887,7 +891,7 @@ RUNTIME_CPP_COMPONENTS = \
  windows_yield \
  write_debug_image \
  vulkan \
- x86_cpu_features \
+ x86_cpu_features

  RUNTIME_LL_COMPONENTS = \
  aarch64 \