Inference speed is slow even when CUDA is activated #962
chiragpatel39 started this conversation in General
I am following the steps given in https://speech.fish.audio/inference/ and running inference with the `--compile` flag, but it is still very slow: it infers at only about 4.45 tokens/sec, while inference on the fish.audio website is very fast. What am I doing wrong? Could you please shed some light on this?
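To rule out an obvious CPU fallback before looking at the log below, a generic PyTorch sanity check (plain PyTorch, nothing fish-speech specific) looks like this:

```python
# Generic CUDA sanity check (plain PyTorch, not fish-speech code).
import torch

print(torch.__version__, torch.version.cuda)  # PyTorch build and the CUDA toolkit it was built against
print(torch.cuda.is_available())              # must be True, otherwise everything falls back to CPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))      # should name the GPU you expect to be using
```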
My terminal output looks like:
```
2025-04-30 11:10:09.830 | INFO | __main__:main:1056 - Loading model ...
2025-04-30 11:10:17.056 | INFO | __main__:load_model:681 - Restored model from checkpoint
2025-04-30 11:10:17.057 | INFO | __main__:load_model:687 - Using DualARTransformer
2025-04-30 11:10:17.057 | INFO | __main__:load_model:695 - Compiling function...
2025-04-30 11:10:18.856 | INFO | __main__:main:1070 - Time to load model: 9.03 seconds
2025-04-30 11:10:18.886 | INFO | __main__:generate_long:788 - Encoded text: Today president announced additional tarrifs on some countries.
2025-04-30 11:10:18.886 | INFO | __main__:generate_long:806 - Generating sentence 1/1 of sample 1/1
  0%|          | 0/7915 [00:00<?, ?it/s]
/home/crp/miniforge3/envs/fish-speech-new/lib/python3.10/contextlib.py:103: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
  self.gen = func(*args, **kwds)
  0%|          | 1/7915 [00:29<63:59:32, 29.11s/it]
/home/crp/miniforge3/envs/fish-speech-new/lib/python3.10/contextlib.py:103: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
  self.gen = func(*args, **kwds)
  0%|          | 2/7915 [00:29<26:49:17, 12.20s/it]
/home/crp/miniforge3/envs/fish-speech-new/lib/python3.10/contextlib.py:103: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
  self.gen = func(*args, **kwds)
  2%|██▍       | 134/7915 [00:30<29:04, 4.46it/s]
2025-04-30 11:10:49.235 | INFO | __main__:generate_long:851 - Compilation time: 30.35 seconds
2025-04-30 11:10:49.235 | INFO | __main__:generate_long:860 - Generated 136 tokens in 30.35 seconds, 4.48 tokens/sec
2025-04-30 11:10:49.236 | INFO | __main__:generate_long:863 - Bandwidth achieved: 2.86 GB/s
2025-04-30 11:10:49.236 | INFO | __main__:generate_long:868 - GPU Memory used: 1.80 GB
2025-04-30 11:10:49.236 | INFO | __main__:main:1103 - Sampled text: Today president announced additional tarrifs on some countries.
2025-04-30 11:10:49.246 | INFO | __main__:main:1108 - Saved codes to temp/codes_0.npy
2025-04-30 11:10:49.246 | INFO | __main__:main:1109 - Next sample
/home/crp/miniforge3/envs/fish-speech-new/lib/python3.10/site-packages/vector_quantize_pytorch/vector_quantize_pytorch.py:445: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
  @autocast(enabled = False)
/home/crp/miniforge3/envs/fish-speech-new/lib/python3.10/site-packages/vector_quantize_pytorch/vector_quantize_pytorch.py:630: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
  @autocast(enabled = False)
/home/crp/miniforge3/envs/fish-speech-new/lib/python3.10/site-packages/vector_quantize_pytorch/finite_scalar_quantization.py:147: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
  @autocast(enabled = False)
/home/crp/miniforge3/envs/fish-speech-new/lib/python3.10/site-packages/vector_quantize_pytorch/lookup_free_quantization.py:209: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
```
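For what it's worth, the 4.48 tokens/sec reported above is the overall average (136 tokens / 30.35 s), and that window includes the roughly 29 s the first iteration spent compiling; once compilation finishes, the progress bar jumps from 2 to 134 iterations in about one second.

Side note on the repeated FutureWarning: the deprecated `torch.backends.cuda.sdp_kernel()` context manager maps onto the newer API roughly like this (a minimal sketch against plain PyTorch >= 2.3, not fish-speech's actual code; the tensor shapes are arbitrary):

```python
# Sketch of the migration the FutureWarning asks for (plain PyTorch >= 2.3).
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Old, deprecated form:
#   with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False,
#                                       enable_mem_efficient=True):
#       out = F.scaled_dot_product_attention(q, k, v)

# New form: pass the allowed backends explicitly.
with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v)
```

Either form produces the same results; the warning only signals that the old context manager will be removed in a future PyTorch release.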
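The autocast warnings from vector_quantize_pytorch point at an analogous rename. The decorator form migrates like this (again just a sketch of the PyTorch API; `quantize` is a made-up placeholder, not the library's actual function):

```python
# Sketch of the autocast rename the warnings refer to (plain PyTorch 2.x).
import torch
from torch.amp import autocast

# Old, deprecated decorator:
#   @torch.cuda.amp.autocast(enabled=False)

# New form: the device type becomes an explicit first argument.
@autocast('cuda', enabled=False)
def quantize(x: torch.Tensor) -> torch.Tensor:
    # Placeholder body; the real library runs its quantization math here
    # in full precision, which is why autocast is disabled.
    return x
```

These warnings come from the installed vector_quantize_pytorch package rather than from fish-speech itself, so any fix would land in that package, not in the inference script.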