FX: support FP16 by Blinue · Pull Request #1049 · Blinue/Magpie

Blinue · 2025-01-02T14:41:29Z

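An effect can declare support for half-precision floats with //!USE FP16. When the conditions are met, the following changes apply:

MP_FP16 is defined.
The MF family of macros is defined as the min16float family: MF4 is min16float4, MF3x3 is min16float3x3, and so on. When FP16 is not used, these macros are defined as the corresponding float types.
Eligible textures are declared with min16float types. For example, an R16G16B16A16_FLOAT input is declared as Texture2D<min16float4> and an output as RWTexture2D<min16float4>; an R16G16_UNORM input is declared as Texture2D<min16float2> and an output as RWTexture2D<unorm min16float2>. Formats containing 32-bit floats still use float types.

Declaring FP16 support does not guarantee that FP16 is actually used. There are two exceptions: the GPU does not support FP16, or FP16 has been disabled in the developer options.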
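The rules above can be modeled with a small sketch. This is not Magpie's code: the function names, the format table, and the FP32 declarations (for example `RWTexture2D<unorm float2>`) are assumptions made for illustration; only the FP16 declarations and the MF4/MF3x3 macro names come from the description above.

```python
# Hypothetical model of the FP16 rules described above (not Magpie's implementation).

# format name -> (component count, is UNORM, contains 32-bit floats)
FORMATS = {
    "R16G16B16A16_FLOAT": (4, False, False),
    "R16G16_UNORM": (2, True, False),
    "R32G32B32A32_FLOAT": (4, False, True),
}


def fp16_active(effect_declares_fp16, gpu_supports_fp16, disabled_in_dev_options):
    """FP16 is used only if the effect opts in and neither exception applies."""
    return effect_declares_fp16 and gpu_supports_fp16 and not disabled_in_dev_options


def macro_defs(fp16):
    """Return the macros named in the description for the given mode."""
    base = "min16float" if fp16 else "float"
    defs = {"MF4": base + "4", "MF3x3": base + "3x3"}
    if fp16:
        defs["MP_FP16"] = ""  # defined only when FP16 is in use
    return defs


def texture_decl(fmt, fp16, output):
    """Build the HLSL resource type for an input or output texture."""
    components, is_unorm, has_32bit_float = FORMATS[fmt]
    # Formats containing 32-bit floats keep the float type even with FP16.
    base = "min16float" if (fp16 and not has_32bit_float) else "float"
    element = f"{base}{components}"
    if output:
        prefix = "unorm " if is_unorm else ""
        return f"RWTexture2D<{prefix}{element}>"
    return f"Texture2D<{element}>"
```

For instance, `texture_decl("R16G16_UNORM", True, True)` yields `RWTexture2D<unorm min16float2>`, matching the example in the description.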

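A new built-in function, MulAdd, has been added. It is equivalent to a matrix multiplication followed by a vector addition, and lets us switch freely between dp4 and mad. Most machine-learning-based effects currently rely heavily on dp4; in my tests, switching to mad gives a considerable performance gain. With FP16, mad gets faster still, whereas dp4 actually gets slower.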
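To make the dp4/mad distinction concrete, here is a minimal sketch of the two evaluation orders for the same result. It assumes a row-vector convention (result = x·M + b); the actual signature and convention of Magpie's MulAdd are not given in this thread, and both function names below are hypothetical.

```python
def mul_add_dp4(x, m, b):
    """dp4 pattern: one dot product per output component, then add the bias."""
    return [
        sum(x[i] * m[i][j] for i in range(len(x))) + b[j]
        for j in range(len(b))
    ]


def mul_add_mad(x, m, b):
    """mad pattern: start from the bias and accumulate one scaled matrix row
    per input component using multiply-add steps."""
    acc = list(b)
    for i, xi in enumerate(x):
        acc = [a + xi * mij for a, mij in zip(acc, m[i])]
    return acc


x = [1.0, 2.0, 3.0, 4.0]
m = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
b = [0.5, 0.5, 0.5, 0.5]
result_dp4 = mul_add_dp4(x, m, b)
result_mad = mul_add_mad(x, m, b)
```

Both orders compute the same vector; which one is faster depends on how the GPU and driver handle each instruction, which is what the measurements in this thread explore.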

All suitable effects will be adapted to FP16 and MulAdd. Performance comparison:

Effect	Current	This PR	Uses FP16	Improvement
Jinc	0.205ms	0.201ms	No	+2%
Anime4K_3D_(AA_)Upscale_US	0.132ms	0.131ms	Yes	+0.1%
Anime4K_Restore_(Soft_)S	0.161ms	0.154ms	Yes	+4.3%
Anime4K_Restore_(Soft_)M	0.351ms	0.344ms	Yes	+2%
Anime4K_Restore_(Soft_)L	0.559ms	0.433ms	Yes	+22.5%
Anime4K_Restore_(Soft_)VL	1.22ms	0.862ms	Yes	+29.3%
Anime4K_Restore_(Soft_)UL	2.82ms	1.83ms	Yes	+35.1%
Anime4K_Upscale_(Denoise_)S	0.144ms	0.113ms	Yes	+21.5%
Anime4K_Upscale_(Denoise_)L	0.549ms	0.432ms	Yes	+21.3%
Anime4K_Upscale_(Denoise_)VL	1.38ms	1.08ms	Yes	+21.7%
Anime4K_Upscale_(Denoise_)UL	2.44ms	1.92ms	Yes	+21.3%
Anime4K_Upscale_GAN_x2_S	0.82ms	0.689ms	Yes	+16%
Anime4K_Upscale_GAN_x2_M	1.73ms	1.31ms	Yes	+24.3%
Anime4K_Upscale_GAN_x3_L	4.7ms	3.3ms	Yes	+29.8%
CAS	0.016ms	0.015ms	Yes	+6.3%
CuNNy-2x4C-NVL(-DN)	0.167ms	0.132ms	Yes	+21%
CuNNy-3x4C-NVL(-DN)	0.213ms	0.166ms	Yes	+22.1%
CuNNy-4x4C-NVL(-DN)	0.259ms	0.202ms	Yes	+22%
CuNNy-8x4C-NVL(-DN)	0.443ms	0.336ms	Yes	+24.2%
CuNNy-4x8C-NVL(-DN)	0.744ms	0.5ms	Yes	+32.8%
CuNNy-6x8C-NVL(-DN)	1.19ms	0.702ms	Yes	+41%
CuNNy-8x8C-NVL(-DN)	1.52ms	0.907ms	Yes	+40.3%
CuNNy-4x16C-NVL(-DN)	4.8ms	1.8ms	Yes	+62.5%
CuNNy-8x16C-NVL(-DN)	8.6ms	3.35ms	Yes	+61%
CuNNy-16x16C-NVL(-DN)	15.7ms	6.4ms	Yes	+59.2%
FSR_EASU	0.159ms	0.149ms	Yes	+6.3%
FSR_RCAS	0.024ms	0.016ms	Yes	+33.3%
FSR_EASU+FSR_RCAS	0.238ms	0.197ms	Yes	+17.2%
FSRCNNX(_LineArt)	0.403ms	0.363ms	Yes	+9.9%
ACNet	0.612ms	0.514ms	Yes	+16%
NIS	0.181ms	0.173ms	Yes	+4.4%
NVSharpen	0.027ms	0.027ms	Yes	+0%

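The NIS improvement comes from the update to v1.0.3; FP16 itself slightly reduces performance but lowers VRAM usage.

Other changes:

Added a performance test mode to the developer options. When enabled, rendering runs continuously without waiting, which is useful for measuring effect performance.
wil::CreateDirectoryDeepNoThrow is no longer used because it does not support relative paths; use Win32Helper::CreateDir instead.
Inline constants are now implemented as global read-only variables to avoid name conflicts caused by macro definitions, as in Fix effect shader compile error #678.
Introduced rapidhash and removed the existing wyhash implementation. This invalidates existing caches, but it is also a good opportunity to clean up technical debt.
Improved the effect cache logic so that a hash collision no longer causes the wrong cache entry to be read; cache file names have changed.
Effects can include StubDefs.hlsli inside the //!MAGPIE EFFECT block to reduce errors shown in the IDE; this does not affect the compiled result.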
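The cache change in the list above (verify the source on load so a hash collision cannot return the wrong entry) can be sketched as follows. This is an in-memory illustration under stated assumptions, not Magpie's cache: the class and function names are hypothetical, and `hashlib.blake2b` is only a stand-in for rapidhash.

```python
import hashlib


def cache_key(source):
    """Stand-in 64-bit hash of the effect source (the PR itself uses rapidhash)."""
    return hashlib.blake2b(source.encode("utf-8"), digest_size=8).hexdigest()


class EffectCache:
    """Hypothetical cache that stores the source next to the compiled result."""

    def __init__(self):
        self._entries = {}  # key -> (source, compiled)

    def store(self, source, compiled, key=None):
        # `key` can be overridden only to simulate a collision in tests.
        self._entries[key or cache_key(source)] = (source, compiled)

    def load(self, source, key=None):
        entry = self._entries.get(key or cache_key(source))
        if entry is None:
            return None  # cache miss
        cached_source, compiled = entry
        if cached_source != source:
            return None  # same key, different source: treat as a miss
        return compiled
```

Comparing the stored source costs a little extra work on load, but it turns a collision into an ordinary cache miss instead of a silently wrong shader.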

Blinue · 2025-01-03T12:39:04Z

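I tested under identical conditions on an NVIDIA GPU (RTX 4070 Laptop) and an Intel GPU (Intel UHD). Results:

Effect	FP32-N	FP16-N	Improvement	FP32-I	FP16-I	Improvement
ACNet	0.64ms	0.56ms	+12.5%	19.5ms	7.6ms	+61%
Anime4K_Upscale_L	0.55ms	0.61ms	-10.9%	53.9ms	63.3ms	-17.4%
CuNNy-6x8C-NVL	0.93ms	1.1ms	-18.3%	38.6ms	105ms	-172%
Anime4K_Upscale_Denoise_UL	2.67ms	2.85ms	-6.7%	490ms	421ms	+14.1%
Anime4K_Restore_UL	2.95ms	2.81ms	+4.7%	657ms	474ms	+27.9%
Anime4K_Restore_Soft_UL	2.95ms	2.81ms	+4.7%	657ms	470ms	+28.5%
FSRCNNX	0.486ms	0.506ms	-4.1%	14.6ms	6.2ms	+57.5%

On the NVIDIA GPU, only ACNet shows a sizable gain, while the other effects get slower. On the Intel GPU, ACNet and FSRCNNX improve and the others get slower, and the swings are very large. It seems FP16 performance varies widely between GPUs: configured correctly it can greatly improve performance, and otherwise it can greatly reduce it. This is not what I expected; apparently FP16 cannot simply be enabled or disabled globally.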
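The percentage column is not defined in the thread, but the figures are consistent with measuring the time saved relative to the FP32 time. A small sketch under that assumption (the function name is hypothetical):

```python
def improvement_percent(before_ms, after_ms):
    """Time saved as a percentage of the original time; negative means slower."""
    return (before_ms - after_ms) / before_ms * 100.0


# Intel GPU rows from the table above.
acnet_intel = improvement_percent(19.5, 7.6)
cunny_intel = improvement_percent(38.6, 105.0)
```

Under this convention a slowdown can exceed -100%, which is how CuNNy-6x8C-NVL on the Intel GPU reaches roughly -172%.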

Blinue · 2025-01-04T05:41:10Z

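"I chained three of them to push the load up a bit, and the difference between fp16 on and off is within noise... both fluctuate around 22.5xx ms"

That's because those effects hadn't been adapted yet. They all are now.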

hooke007 · 2025-01-04T06:16:56Z

The difference still doesn't seem significant.

Blinue · 2025-01-04T06:40:55Z

Try CuNNy-16x16C-NVL; the difference is quite noticeable on my machine.

[Screenshots: FP16 and FP32 results]

hooke007 · 2025-01-04T07:20:49Z

...It got slower.
[Screenshot: fp16 vs. fp32]

Blinue · 2025-01-06T02:18:41Z

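Since fp16 capability differs between GPUs, we should support enabling or disabling fp16 per effect. I can think of two approaches:

Let users enable fp16 for individual effects
Run automatic performance tests, similar to TensorRT, to decide whether to use fp16

I prefer the second approach. It is complex, but it has major advantages:

Granularity can go down to the pass level: each pass within an effect can be tested separately to decide whether to use fp16
Users don't need to test anything themselves; it works out of the box
It paves the way for TensorRT: the mechanisms are similar, so the code paths can be shared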
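The second approach could look roughly like the sketch below: measure each unit in both precisions and keep fp16 only where it is measurably faster. The function names and the `min_gain` threshold are hypothetical; the sample timings are the NVIDIA figures reported earlier in this thread, used per effect here although the proposal is to decide per pass.

```python
def choose_precision(fp32_ms, fp16_ms, min_gain=0.02):
    """Pick fp16 only if it beats fp32 by more than `min_gain` (a fraction),
    so that measurement noise does not flip the decision."""
    return "fp16" if fp16_ms < fp32_ms * (1.0 - min_gain) else "fp32"


def plan_precisions(measurements, min_gain=0.02):
    """measurements maps a pass or effect name to (fp32_ms, fp16_ms)."""
    return {
        name: choose_precision(fp32_ms, fp16_ms, min_gain)
        for name, (fp32_ms, fp16_ms) in measurements.items()
    }


nvidia_timings = {
    "ACNet": (0.64, 0.56),
    "Anime4K_Upscale_L": (0.55, 0.61),
    "Anime4K_Restore_UL": (2.95, 2.81),
}
nvidia_plan = plan_precisions(nvidia_timings)
```

A real implementation would also have to decide when to re-run the measurements, for example after a driver update, since the thread shows results changing between driver versions.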

hooke007 · 2025-01-06T02:38:26Z

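I did a bit of searching, and it does seem that the difference amounts to little more than measurement noise and the effect of the GPU's clock speed under load.

On NVIDIA consumer cards, fp16 and fp32 appear to run at the same rate; the fp16 Tensor Cores added in the last two generations are not the same thing as plain fp16.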
https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#hardware-precision-matrix

Support Matrix :: NVIDIA Deep Learning TensorRT Documentation
These support matrices provide an overview of the supported platforms, features, and hardware capabilities of the TensorRT APIs, parsers, and layers.

Blinue · 2025-01-06T03:02:40Z

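In terms of raw compute speed, fp16 and fp32 are the same. The main advantage of fp16 is that the driver can pack two fp16 values into one 32-bit VGPR.

If the driver supports it, a single instruction can operate on two fp16 values at once, which effectively halves the time
The number of VGPRs used is halved; using too many VGPRs hurts concurrency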
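The packing idea can be illustrated on the CPU with Python's `struct` module, whose `e` format is IEEE 754 half precision. This only shows the bit layout and the register arithmetic; on a GPU the packing is done by the shader compiler and driver, and the helper names here are hypothetical.

```python
import struct


def pack_half2(lo, hi):
    """Pack two half-precision values into one 32-bit word (lo in the low bits)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


def unpack_half2(word):
    """Recover the two half-precision values from a 32-bit word."""
    return struct.unpack("<2e", struct.pack("<I", word))


def registers_needed(value_count, fp16):
    """32-bit registers needed to hold `value_count` scalars."""
    per_register = 2 if fp16 else 1
    return -(-value_count // per_register)  # ceiling division


packed = pack_half2(1.0, 2.0)
```

Here 1.0 is 0x3C00 and 2.0 is 0x4000 in half precision, so `packed` is 0x40003C00, and sixteen scalars need 16 registers in fp32 but 8 in packed fp16.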

plainround · 2025-01-12T07:19:39Z

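@soi8391 I also find games more stable on the older version.
I'm going to file an issue with Intel about this Magpie performance problem. If they can't fix it, I'll stay on 5972 until I replace the GPU 🤪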

Blinue · 2025-01-12T08:55:34Z

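To summarize @plainround's test results:

Driver version 32.0.101.5972:

Instruction\Precision	fp32	fp16
dp4		30.3ms
mad	55.6ms	12.4ms

Driver version 32.0.101.6319:

Instruction\Precision	fp32	fp16
dp4		49.9ms
mad	52.6ms	52.7ms

It looks like fp16 performance dropped sharply in version 6319.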

Blinue · 2025-02-01T04:33:03Z

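This feature is done. Thanks for your testing!

Since fp16 capability differs between GPUs, we should support enabling or disabling fp16 per effect. I can think of two approaches:

Let users enable fp16 for individual effects

Run automatic performance tests, similar to TensorRT, to decide whether to use fp16

I prefer the second approach. It is complex, but it has major advantages:

Granularity can go down to the pass level: each pass within an effect can be tested separately to decide whether to use fp16

Users don't need to test anything themselves; it works out of the box

It paves the way for TensorRT: the mechanisms are similar, so the code paths can be shared

If mad is slower than dp4 on some devices, this feature is still necessary. So far this has only been observed on Intel discrete GPUs, and it may be a driver bug. More testing is needed.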

Blinue and others added 10 commits January 1, 2025 17:00

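feat: use half-precision floats automatically unless disabled in the developer options

c3f5a5f

feat: add a mode for testing effect performance that renders continuously without waiting

0b54ca1

chore: avoid sharing the same shader header files across different configurations

c6ef833

fix: stop using wil::CreateDirectoryDeepNoThrow because it does not support relative paths

c102cd5

feat: implement inline constants as global read-only variables

ca68838

This avoids name conflicts caused by macro definitions, such as #678

feat: introduce rapidhash and stop using wyhash

1424ddc

This invalidates the effect cache

feat: improve the cache system

2a45fda

Loading the cache now checks that the source matches; cache file names changed

ui: improve the developer options UI

fea02ca

perf: avoid copies

9848df4

feat: use the USE_FP16 directive to declare that an effect supports FP16

2de6df2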

2de6df2

Blinue added enhancement New feature or request area: performance area: effect labels Jan 2, 2025

Blinue and others added 4 commits January 2, 2025 23:27

fix: minor fixes

4df6dee

chore: reword

9aade85

Merge branch 'dev' into feat/fp16

8e49d36

feat: add FP16 support to several effects, but the performance change is less than expected

e1ccbb5

feat: adapt several effects for testing

181e4f9

CuNNy-D16N16

5f65096

Merge branch 'dev' into feat/fp16

9fd81ca

ACNet: switch from mad to dp4

0f6489c

plainround mentioned this pull request Jan 12, 2025

The new driver performs poorly in Magpie IGCIT/Intel-GPU-Community-Issue-Tracker-IGCIT#942

Closed

10 tasks

Blinue and others added 22 commits January 13, 2025 17:08

perf: optimize FSRCNNX

7ca07db

Merge branch 'dev' into feat/fp16

5c767e3

perf: optimize more effects

b16ffb1

perf: optimize more effects

3fba18c

perf: optimize more effects

5715d19

perf: optimize more effects

09ddff9

perf: optimize more effects

1d99f5f

Merge branch 'dev' into feat/fp16

9742f24

fix: correct string resources

b5f3ae6

Merge branch 'dev' into feat/fp16

a1eb9a1

fix: correct string resources

c271fb0

Merge branch 'dev' into feat/fp16

9d88503

Merge branch 'dev' into feat/fp16

600091e

Merge branch 'dev' into feat/fp16

d3fcf60

perf: optimize FSR_EASU

7ba2a69

perf: optimize FSR_RCAS

c17b6d4

perf: update NIS and NVSharpen to v1.0.3

a3eb2de

Also added FP16 support; performance drops very slightly, but VRAM usage drops significantly

feat: show on the overlay whether an effect uses FP16

52c259c

Merge branch 'dev' into feat/fp16

7ef132f

docs: update documentation

4704bd3

docs: improve documentation

c90c6f8

docs: improve documentation

ecb6554

Blinue merged commit a6f86d0 into dev Feb 1, 2025
2 checks passed

Blinue deleted the feat/fp16 branch February 1, 2025 04:33

bacbt9 mentioned this pull request Feb 2, 2025

Conversion(?) to fp16 of updated magpie shaders? funnyplanter/CuNNy#7

Closed

Blinue mentioned this pull request Apr 10, 2025

Does the current version 0.11.2 not support enabling FP16? #1119

Closed

Blinue merged 56 commits into dev from feat/fp16

4 participants