Inference speed is slow even when CUDA is activated #962
chiragpatel39 started this conversation in General
I am following the steps given in https://speech.fish.audio/inference/ and running inference with the `--compile` flag, but it is still very slow: it infers at only about 4.45 tokens/sec, while inference on the fish.audio website is very fast. What am I doing wrong? Could you please shed some light on this?
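To rule out an obvious CPU fallback before looking at the log below, a generic PyTorch sanity check (plain PyTorch, nothing fish-speech specific) looks like this:

```python
# Generic CUDA sanity check (plain PyTorch, not fish-speech code).
import torch

print(torch.__version__, torch.version.cuda)  # PyTorch build and the CUDA toolkit it was built against
print(torch.cuda.is_available())              # must be True, otherwise everything falls back to CPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))      # should name the GPU you expect to be using
```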
My terminal output looks like:
```
2025-04-30 11:10:09.830 | INFO | __main__:main:1056 - Loading model ...
2025-04-30 11:10:17.056 | INFO | __main__:load_model:681 - Restored model from checkpoint
2025-04-30 11:10:17.057 | INFO | __main__:load_model:687 - Using DualARTransformer
2025-04-30 11:10:17.057 | INFO | __main__:load_model:695 - Compiling function...
2025-04-30 11:10:18.856 | INFO | __main__:main:1070 - Time to load model: 9.03 seconds
2025-04-30 11:10:18.886 | INFO | __main__:generate_long:788 - Encoded text: Today president announced additional tarrifs on some countries.
2025-04-30 11:10:18.886 | INFO | __main__:generate_long:806 - Generating sentence 1/1 of sample 1/1
  0%|          | 0/7915 [00:00<?, ?it/s]
/home/crp/miniforge3/envs/fish-speech-new/lib/python3.10/contextlib.py:103: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
  self.gen = func(*args, **kwds)
  0%|          | 1/7915 [00:29<63:59:32, 29.11s/it]
/home/crp/miniforge3/envs/fish-speech-new/lib/python3.10/contextlib.py:103: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
  self.gen = func(*args, **kwds)
  0%|          | 2/7915 [00:29<26:49:17, 12.20s/it]
/home/crp/miniforge3/envs/fish-speech-new/lib/python3.10/contextlib.py:103: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
  self.gen = func(*args, **kwds)
  2%|██▍       | 134/7915 [00:30<29:04, 4.46it/s]
2025-04-30 11:10:49.235 | INFO | __main__:generate_long:851 - Compilation time: 30.35 seconds
2025-04-30 11:10:49.235 | INFO | __main__:generate_long:860 - Generated 136 tokens in 30.35 seconds, 4.48 tokens/sec
2025-04-30 11:10:49.236 | INFO | __main__:generate_long:863 - Bandwidth achieved: 2.86 GB/s
2025-04-30 11:10:49.236 | INFO | __main__:generate_long:868 - GPU Memory used: 1.80 GB
2025-04-30 11:10:49.236 | INFO | __main__:main:1103 - Sampled text: Today president announced additional tarrifs on some countries.
2025-04-30 11:10:49.246 | INFO | __main__:main:1108 - Saved codes to temp/codes_0.npy
2025-04-30 11:10:49.246 | INFO | __main__:main:1109 - Next sample
/home/crp/miniforge3/envs/fish-speech-new/lib/python3.10/site-packages/vector_quantize_pytorch/vector_quantize_pytorch.py:445: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
  @autocast(enabled = False)
/home/crp/miniforge3/envs/fish-speech-new/lib/python3.10/site-packages/vector_quantize_pytorch/vector_quantize_pytorch.py:630: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
  @autocast(enabled = False)
/home/crp/miniforge3/envs/fish-speech-new/lib/python3.10/site-packages/vector_quantize_pytorch/finite_scalar_quantization.py:147: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
  @autocast(enabled = False)
/home/crp/miniforge3/envs/fish-speech-new/lib/python3.10/site-packages/vector_quantize_pytorch/lookup_free_quantization.py:209: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
```
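For what it's worth, the 4.48 tokens/sec reported above is the overall average (136 tokens / 30.35 s), and that window includes the roughly 29 s the first iteration spent compiling; once compilation finishes, the progress bar jumps from 2 to 134 iterations in about one second.

Side note on the repeated FutureWarning: the deprecated `torch.backends.cuda.sdp_kernel()` context manager maps onto the newer API roughly like this (a minimal sketch against plain PyTorch >= 2.3, not fish-speech's actual code; the tensor shapes are arbitrary):

```python
# Sketch of the migration the FutureWarning asks for (plain PyTorch >= 2.3).
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Old, deprecated form:
#   with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False,
#                                       enable_mem_efficient=True):
#       out = F.scaled_dot_product_attention(q, k, v)

# New form: pass the allowed backends explicitly.
with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v)
```

Either form produces the same results; the warning only signals that the old context manager will be removed in a future PyTorch release.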
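The autocast warnings from vector_quantize_pytorch point at an analogous rename. The decorator form migrates like this (again just a sketch of the PyTorch API; `quantize` is a made-up placeholder, not the library's actual function):

```python
# Sketch of the autocast rename the warnings refer to (plain PyTorch 2.x).
import torch
from torch.amp import autocast

# Old, deprecated decorator:
#   @torch.cuda.amp.autocast(enabled=False)

# New form: the device type becomes an explicit first argument.
@autocast('cuda', enabled=False)
def quantize(x: torch.Tensor) -> torch.Tensor:
    # Placeholder body; the real library runs its quantization math here
    # in full precision, which is why autocast is disabled.
    return x
```

These warnings come from the installed vector_quantize_pytorch package rather than from fish-speech itself, so any fix would land in that package, not in the inference script.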