Any way to set larger context length when using Ollama models? #5254

rylativity · 2025-01-29T19:47:58Z

rylativity
Jan 29, 2025

I have successfully set up autogen to use Ollama and create agents and teams. I am currently using Ollama as the inference engine behind a MagneticOneGroupChat team and it is working well.

However, when I look at the Ollama model resource usage, it is clear (based on RAM/vRAM usage) that it is only using the default context length of 2048, even though the relevant context extends far beyond that context length limit as the MagneticOne team continues to work and create additional output. There is no error thrown, but I have seen a couple instances where the team attempt to perform tasks they have already completed further up in the team "chat", which (when combined with the evidence from RAM/vRAM usage) leads me to believe that any context beyond 2048 tokens is being silently truncated.

Is there a way to set an explicit, increased context length when instantiating an OpenAIChatCompletionClient, Agent, or Team? (Relevant documentation for using Ollama with Autogen here)

kb- · 2025-01-31T23:50:39Z

kb-
Jan 31, 2025

You can create new models on Ollama, with a larger context. https://ollama.readthedocs.io/en/modelfile/#format

Create a Modelfile

FROM llama3.1:8b-instruct-q4_K_M

PARAMETER num_ctx 8096

Then create your model, derived from a model you have
ollama create llama3.1-8b-instruct-q4_K_M-8k --file Modelfile

You can then use llama3.1-8b-instruct-q4_K_M-8k like other Ollama models.

0 replies

rylativity · 2025-02-01T01:04:43Z

rylativity
Feb 1, 2025
Author

Thanks for your response, @kb-. I want to avoid defining a new Ollama model if possible, so I apologize for not being more specific.

I'd like to be able to set the context length (and ideally other options) at query/inference time. Ollama allows passing a num_ctx key & value in the options object sent in a request body to the /generate and /chat endpoints (example of num_ctx in request shown here).

Open WebUI, for example, takes advantage of this and allows you to set num_ctx on a query-by-query basis without defining a new model in Ollama for every possible context length you might want to use.

I'm wondering if there is a way to pass query-time generation options through to Ollama from Autogen by setting a value when instantiating a CompletionClient or Agent or when running a task.

1 reply

ekzhu Feb 3, 2025
Maintainer

You can include extra arguments in extra_create_args in the create method.

https://microsoft.github.io/autogen/stable/reference/python/autogen_ext.models.openai.html#autogen_ext.models.openai.BaseOpenAIChatCompletionClient.create

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Any way to set larger context length when using Ollama models? #5254

{{title}}

Replies: 2 comments 1 reply

{{title}}

{{title}}

{{title}}

Select a reply

Any way to set larger context length when using Ollama models? #5254

rylativity Jan 29, 2025

Replies: 2 comments · 1 reply

kb- Jan 31, 2025

rylativity Feb 1, 2025 Author

ekzhu Feb 3, 2025 Maintainer

rylativity
Jan 29, 2025

Replies: 2 comments 1 reply

kb-
Jan 31, 2025

rylativity
Feb 1, 2025
Author

ekzhu Feb 3, 2025
Maintainer