Fast Function Approximations lowering. #8566

Open
wants to merge 70 commits into
base: main
e6368c7
Fast vectorizable atan and atan2 functions.
mcourteaux Aug 10, 2024
286a3a4
Default to not using fast atan versions if on CUDA.
mcourteaux Aug 10, 2024
acfacfb
Finished fast atan/atan2 functions and tests.
mcourteaux Aug 10, 2024
59af796
Correct attribution.
mcourteaux Aug 10, 2024
c7d8d4d
Clang-format
mcourteaux Aug 10, 2024
fa94b6e
Weird WebAssembly limits...
mcourteaux Aug 11, 2024
64b4c1d
Small improvements to the optimization script.
mcourteaux Aug 11, 2024
328b92d
Polynomial optimization for log, exp, sin, cos with correct ranges.
mcourteaux Aug 11, 2024
a4e95d8
Improve fast atan performance tests for GPU.
mcourteaux Aug 12, 2024
0fc45b5
Bugfix fast_atan approximation. Fix correctness test to exceed the ra…
mcourteaux Aug 12, 2024
8819960
Cleanup
mcourteaux Aug 12, 2024
366d3c4
Enum class instead of enum for ApproximationPrecision.
mcourteaux Aug 12, 2024
1f13dbf
Weird Metal limits. There should be a better way...
mcourteaux Aug 12, 2024
b33ef6f
Skip test for WebGPU.
mcourteaux Aug 12, 2024
90b70ef
Fast atan/atan2 polynomials reoptimized. New optimization strategy: ULP.
mcourteaux Aug 13, 2024
26d4e83
Feedback Steven.
mcourteaux Aug 13, 2024
c6aeecd
More comments and test mantissa error.
mcourteaux Aug 14, 2024
bb7528c
Do not error when testing arctan performance on Metal / WebGPU.
mcourteaux Aug 14, 2024
bfe6c5e
Rework precision specification. Generalize towards using this for oth…
mcourteaux Nov 11, 2024
e060e4a
Clang-format.
mcourteaux Nov 11, 2024
e9dda7c
Fix makefile and clang-tidy.
mcourteaux Nov 11, 2024
4f2173d
Fix incorrect approximation selection when required precision is not …
mcourteaux Nov 12, 2024
a2b02aa
Feedback from Steven.
mcourteaux Dec 3, 2024
fd54514
Implemented approximation tables for sin, cos, exp, log fast variants…
mcourteaux Feb 4, 2025
aa970b0
Clang-format.
mcourteaux Feb 4, 2025
a75322f
Move Polynomial Optimizer Python script to tools/ directory.
mcourteaux Feb 4, 2025
978e037
Enable performance test for fast_atan and fast_atan2.
mcourteaux Feb 4, 2025
0b8e07b
LLVM upper-limit 99 (CMake needs an upper limit).
mcourteaux Feb 4, 2025
6f65de8
Add LLVM IR for PTX sin.approx, cos.approx, tanh.approx
mcourteaux Feb 4, 2025
05437fb
Implemented tan. Improved polynomial optimizer performance for MULPE …
mcourteaux Feb 5, 2025
e8532a6
Implemented tanh, tan. Many improvements to accuracy test and perform…
mcourteaux Feb 5, 2025
604580d
Clang-format.
mcourteaux Feb 5, 2025
e2e83cf
WIP: Fiddle with strict_float behavior in CSE. Fix fast math precisio…
mcourteaux Feb 7, 2025
677aaba
Nuke MAE_MULPE. Separate optimized MULPE-corrected sin and cos.
mcourteaux Feb 8, 2025
8734458
Clang-format
mcourteaux Feb 8, 2025
6f006bd
Some cleanup.
mcourteaux Feb 8, 2025
eb5ed0f
Fix sine.
mcourteaux Feb 8, 2025
ce1147c
Fix clang-tidy. Mark OpenCL exp() as fast.
mcourteaux Feb 8, 2025
762cb53
Clang format is annoying me.
mcourteaux Feb 8, 2025
474595f
Remove my experimental CSE step.
mcourteaux Feb 9, 2025
f19e439
OpenCL performance of fast_exp forced poly is expected to be worse.
mcourteaux Feb 9, 2025
c3e1e8e
OpenCL fast functions selected for fast transcendentals.
mcourteaux Feb 9, 2025
5c0a3a6
Lower fast intrinsics on metal to the fast:: namespace versions.
mcourteaux Feb 9, 2025
92d0fcf
Split tables for sin and cos, as metal has odd precision for sin. Add…
mcourteaux Feb 9, 2025
b0d3a40
Move range_reduce_log to a header. Drive-by fix listing libOpenCL.so.…
mcourteaux Feb 10, 2025
9b33c15
Fix API documentation. Improve measuring accuracy. Fix vector_math te…
mcourteaux Feb 10, 2025
6c7ef2a
Also vectorize on GPU to make sure we test that.
mcourteaux Feb 11, 2025
a8a330e
Remove libOpenCL.so from search list in favor of libOpenCL.so.1
mcourteaux Feb 11, 2025
af06927
Add FastMathFunctions.cpp to Makefile
mcourteaux Feb 11, 2025
4d2ec11
Add support for derivatives for the fast_ intrinsics.
mcourteaux Feb 11, 2025
8b5b841
Remove unused helper function.
mcourteaux Feb 11, 2025
f2135bf
Add in a gracefactor for precision when the system does not support FMA.
mcourteaux Feb 11, 2025
6e7926b
Clang Format.
mcourteaux Feb 11, 2025
cd949ae
Windows doesn't print thousand separators with printf. :(
mcourteaux Feb 11, 2025
39e3a97
Remove grace factor, and use safety factor of 5% when selecting a pol…
mcourteaux Feb 16, 2025
a8fd03b
Use 50% tighter constraints when no FMA is available to compensate fo…
mcourteaux Feb 17, 2025
b30c24f
Clang-format.
mcourteaux Feb 17, 2025
f6f7fd0
Working on better optimizations. Improving PR and code.
mcourteaux Mar 12, 2025
3333ca5
Implemented fast_asin() fast_acos(). Slowly redoing coefficients.
mcourteaux Mar 12, 2025
7d70fdf
WIP: determine precision of the polynomials.
mcourteaux Mar 13, 2025
faa368e
Revived all tests.
mcourteaux Mar 14, 2025
4c08aa3
Clang format
mcourteaux Mar 14, 2025
23f6ff7
Implement expm1. Fix accuracy of tanh. Fix lowering of tanh on CUDA. …
mcourteaux Mar 15, 2025
8b3769a
Clang-format
mcourteaux Mar 15, 2025
bd2c7ac
Feedback, and remove expm1 test.
mcourteaux Mar 15, 2025
1b06a7f
Fix compilation issues.
mcourteaux Mar 15, 2025
062d686
One more compilation issue.
mcourteaux Mar 15, 2025
58a6d7c
Fixed a bracket.
mcourteaux Mar 15, 2025
7000f21
Update some precision info on math intrinsics for Vulkan and Metal.
mcourteaux Mar 17, 2025
a171ec1
Fix makefile after I accidentally broke it by sorting files alphabeti…
mcourteaux Apr 9, 2025
87 changes: 45 additions & 42 deletions Makefile
@@ -421,21 +421,24 @@ SOURCE_FILES = \
AlignLoads.cpp \
AllocationBoundsInference.cpp \
ApplySplit.cpp \
ApproximationTables.cpp \
Argument.cpp \
AssociativeOpsTable.cpp \
Associativity.cpp \
AsyncProducers.cpp \
AutoScheduleUtils.cpp \
BoundConstantExtentLoops.cpp \
BoundSmallAllocations.cpp \
BoundaryConditions.cpp \
Bounds.cpp \
BoundsInference.cpp \
BoundConstantExtentLoops.cpp \
BoundSmallAllocations.cpp \
Buffer.cpp \
CPlusPlusMangle.cpp \
CSE.cpp \
Callable.cpp \
CanonicalizeGPUVars.cpp \
Closure.cpp \
ClampUnsafeAccesses.cpp \
Closure.cpp \
CodeGen_ARM.cpp \
CodeGen_C.cpp \
CodeGen_D3D12Compute_Dev.cpp \
@@ -445,20 +448,18 @@ SOURCE_FILES = \
CodeGen_LLVM.cpp \
CodeGen_Metal_Dev.cpp \
CodeGen_OpenCL_Dev.cpp \
CodeGen_Vulkan_Dev.cpp \
CodeGen_PTX_Dev.cpp \
CodeGen_Posix.cpp \
CodeGen_PowerPC.cpp \
CodeGen_PTX_Dev.cpp \
CodeGen_PyTorch.cpp \
CodeGen_RISCV.cpp \
CodeGen_Vulkan_Dev.cpp \
CodeGen_WebAssembly.cpp \
CodeGen_WebGPU_Dev.cpp \
CodeGen_X86.cpp \
CompilerLogger.cpp \
ConstantBounds.cpp \
ConstantInterval.cpp \
CPlusPlusMangle.cpp \
CSE.cpp \
Debug.cpp \
DebugArguments.cpp \
DebugToFile.cpp \
@@ -479,6 +480,7 @@ SOURCE_FILES = \
Expr.cpp \
ExtractTileOperations.cpp \
FastIntegerDivide.cpp \
FastMathFunctions.cpp \
FindCalls.cpp \
FindIntrinsics.cpp \
FlattenNestedRamps.cpp \
@@ -490,26 +492,26 @@ SOURCE_FILES = \
Generator.cpp \
HexagonOffload.cpp \
HexagonOptimize.cpp \
ImageParam.cpp \
InferArguments.cpp \
InjectHostDevBufferCopies.cpp \
Inline.cpp \
InlineReductions.cpp \
IntegerDivisionTable.cpp \
Interval.cpp \
IR.cpp \
IREquality.cpp \
IRMatch.cpp \
IRMutator.cpp \
IROperator.cpp \
IRPrinter.cpp \
IRVisitor.cpp \
ImageParam.cpp \
InferArguments.cpp \
InjectHostDevBufferCopies.cpp \
Inline.cpp \
InlineReductions.cpp \
IntegerDivisionTable.cpp \
Interval.cpp \
JITModule.cpp \
Lambda.cpp \
Lerp.cpp \
LICM.cpp \
LLVM_Output.cpp \
LLVM_Runtime_Linker.cpp \
Lambda.cpp \
Lerp.cpp \
LoopCarry.cpp \
Lower.cpp \
LowerParallelTasks.cpp \
@@ -532,8 +534,8 @@ SOURCE_FILES = \
PurifyIndexMath.cpp \
PythonExtensionGen.cpp \
Qualify.cpp \
Random.cpp \
RDom.cpp \
Random.cpp \
Realization.cpp \
RealizationOrder.cpp \
RebaseLoopsToZero.cpp \
@@ -547,28 +549,28 @@ SOURCE_FILES = \
SelectGPUAPI.cpp \
Serialization.cpp \
Simplify.cpp \
SimplifyCorrelatedDifferences.cpp \
SimplifySpecializations.cpp \
Simplify_Add.cpp \
Simplify_And.cpp \
Simplify_Call.cpp \
Simplify_Cast.cpp \
Simplify_Reinterpret.cpp \
Simplify_Div.cpp \
Simplify_EQ.cpp \
Simplify_Exprs.cpp \
Simplify_Let.cpp \
Simplify_LT.cpp \
Simplify_Let.cpp \
Simplify_Max.cpp \
Simplify_Min.cpp \
Simplify_Mod.cpp \
Simplify_Mul.cpp \
Simplify_Not.cpp \
Simplify_Or.cpp \
Simplify_Reinterpret.cpp \
Simplify_Select.cpp \
Simplify_Shuffle.cpp \
Simplify_Stmts.cpp \
Simplify_Sub.cpp \
SimplifyCorrelatedDifferences.cpp \
SimplifySpecializations.cpp \
SkipStages.cpp \
SlidingWindow.cpp \
Solve.cpp \
@@ -620,17 +622,20 @@ HEADER_FILES = \
AlignLoads.h \
AllocationBoundsInference.h \
ApplySplit.h \
ApproximationTables.h \
Argument.h \
AssociativeOpsTable.h \
Associativity.h \
AsyncProducers.h \
AutoScheduleUtils.h \
BoundConstantExtentLoops.h \
BoundSmallAllocations.h \
BoundaryConditions.h \
Bounds.h \
BoundsInference.h \
BoundConstantExtentLoops.h \
BoundSmallAllocations.h \
Buffer.h \
CPlusPlusMangle.h \
CSE.h \
Callable.h \
CanonicalizeGPUVars.h \
ClampUnsafeAccesses.h \
@@ -642,18 +647,16 @@ HEADER_FILES = \
CodeGen_LLVM.h \
CodeGen_Metal_Dev.h \
CodeGen_OpenCL_Dev.h \
CodeGen_Vulkan_Dev.h \
CodeGen_Posix.h \
CodeGen_PTX_Dev.h \
CodeGen_Posix.h \
CodeGen_PyTorch.h \
CodeGen_Targets.h \
CodeGen_Vulkan_Dev.h \
CodeGen_WebGPU_Dev.h \
CompilerLogger.h \
ConciseCasts.h \
CPlusPlusMangle.h \
ConstantBounds.h \
ConstantInterval.h \
CSE.h \
Debug.h \
DebugArguments.h \
DebugToFile.h \
@@ -690,6 +693,13 @@ HEADER_FILES = \
Generator.h \
HexagonOffload.h \
HexagonOptimize.h \
IR.h \
IREquality.h \
IRMatch.h \
IRMutator.h \
IROperator.h \
IRPrinter.h \
IRVisitor.h \
ImageParam.h \
InferArguments.h \
InjectHostDevBufferCopies.h \
@@ -698,20 +708,12 @@ HEADER_FILES = \
IntegerDivisionTable.h \
Interval.h \
IntrusivePtr.h \
IR.h \
IREquality.h \
IRMatch.h \
IRMutator.h \
IROperator.h \
IRPrinter.h \
IRVisitor.h \
WasmExecutor.h \
JITModule.h \
Lambda.h \
Lerp.h \
LICM.h \
LLVM_Output.h \
LLVM_Runtime_Linker.h \
Lambda.h \
Lerp.h \
LoopCarry.h \
Lower.h \
LowerParallelTasks.h \
@@ -735,18 +737,16 @@ HEADER_FILES = \
PurifyIndexMath.h \
PythonExtensionGen.h \
Qualify.h \
RDom.h \
Random.h \
Realization.h \
RDom.h \
RealizationOrder.h \
RebaseLoopsToZero.h \
Reduction.h \
RegionCosts.h \
RemoveDeadAllocations.h \
RemoveExternLoops.h \
RemoveUndef.h \
runtime/HalideBuffer.h \
runtime/HalideRuntime.h \
Schedule.h \
ScheduleFunctions.h \
Scope.h \
@@ -780,7 +780,10 @@ HEADER_FILES = \
Util.h \
Var.h \
VectorizeLoops.h \
WrapCalls.h
WasmExecutor.h \
WrapCalls.h \
runtime/HalideBuffer.h \
runtime/HalideRuntime.h

OBJECTS = $(SOURCE_FILES:%.cpp=$(BUILD_DIR)/%.o)
HEADERS = $(HEADER_FILES:%.h=$(SRC_DIR)/%.h)
@@ -882,7 +885,7 @@ RUNTIME_CPP_COMPONENTS = \
windows_yield \
write_debug_image \
vulkan \
x86_cpu_features \
x86_cpu_features

RUNTIME_LL_COMPONENTS = \
aarch64 \