Understanding n_batch in Llama.cpp for a Mobile AI Chat App #125

Helldez · 2025-03-03T07:59:10Z

Hi everyone,
I'm working on a mobile AI assistant app called d.ai (decentralized AI) that runs LLMs locally using Llama.cpp. My app processes chat requests in sequence, meaning there's no parallel request handling—just one user input at a time, generating a response before the next input is sent.

Given this setup, I'm wondering about the relevance of the n_batch parameter.

Does it have any impact in a scenario where requests are processed sequentially?
Is it essentially locked at 1 in such a case, making it an irrelevant parameter?
Or does it still affect things like token generation speed, memory usage, or CPU/GPU efficiency?

Any recommendations or best practices for choosing a batch size on mobile?
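To make the question concrete, here is a minimal sketch of how I currently picture it. It assumes n_batch caps how many prompt tokens are handed to the model per evaluation call, while generation still produces one token per call. The function names (`split_into_batches`, `prompt_eval_calls`) are hypothetical and only illustrate the idea; they are not part of the Llama.cpp API.

```python
def split_into_batches(tokens, n_batch):
    """Split a tokenized prompt into chunks of at most n_batch tokens.

    Assumption: this mirrors how a prompt would be fed to the model when
    n_batch limits the number of tokens per evaluation call.
    """
    if n_batch < 1:
        raise ValueError("n_batch must be >= 1")
    return [tokens[i:i + n_batch] for i in range(0, len(tokens), n_batch)]


def prompt_eval_calls(n_prompt_tokens, n_batch):
    """Number of evaluation calls needed to ingest the prompt (hypothetical model)."""
    return len(split_into_batches(list(range(n_prompt_tokens)), n_batch))


if __name__ == "__main__":
    # A single user message, one request at a time (no parallel requests).
    n_prompt_tokens = 600
    for n_batch in (1, 128, 512):
        calls = prompt_eval_calls(n_prompt_tokens, n_batch)
        print(f"n_batch={n_batch}: {calls} prompt evaluation call(s)")
```

If this picture is right, n_batch would matter for prompt ingestion even with a single sequential request: a larger value means fewer, larger evaluation calls, presumably at the cost of a bigger temporary compute buffer. Please correct me if the assumption is wrong.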

Thanks in advance!

Replies: 0 comments
