
Fix issue 1282 self-hosted model selection when billing is disabled #1318

Open
MujtabaHadi wants to merge 330 commits into khoj-ai:release/1.x from MujtabaHadi:fix-1282-selfhosted-model-selection

Conversation

@MujtabaHadi


Summary

Fix self-hosted model selection so chat_advanced does not override the default chat model path when billing is disabled.

Problem

On self-hosted Khoj, users could be treated as subscribed in the default-model selection path. As a result, chat_advanced could override normal model selection behavior, including the default chat model and user-selected model behavior.

Fix

Update:

  • ConversationAdapters.get_default_chat_model
  • ConversationAdapters.aget_default_chat_model

so the subscribed/advanced branch only applies when state.billing_enabled is true.
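
A minimal sketch of the gating described above, not the actual Khoj code. `ServerSettings`, `pick_default_chat_model` and the `chat_default` field are hypothetical stand-ins for illustration; only `chat_advanced` and the billing flag mirror names used in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerSettings:
    # Hypothetical stand-in for the server chat settings.
    chat_default: Optional[str] = None
    chat_advanced: Optional[str] = None


def pick_default_chat_model(
    settings: ServerSettings, is_subscribed: bool, billing_enabled: bool
) -> Optional[str]:
    """Return the model name to use for the main chat response."""
    # The subscribed/advanced branch only applies when billing is enabled.
    if billing_enabled and is_subscribed and settings.chat_advanced:
        return settings.chat_advanced
    # With billing disabled (self-hosted), fall through to the default model.
    return settings.chat_default


settings = ServerSettings(chat_default="default-model", chat_advanced="advanced-model")
self_hosted = pick_default_chat_model(settings, is_subscribed=True, billing_enabled=False)
cloud = pick_default_chat_model(settings, is_subscribed=True, billing_enabled=True)
```

With billing disabled, a user treated as subscribed still gets the default model for the chat response.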

Validation

I reproduced the issue locally in a self-hosted Docker setup using Ollama and OpenAI-compatible routing.

Before the fix:

  • changing chat_advanced controlled the actual chat response path

After the fix:

  • helper/background steps used the advanced model
  • the actual chat response used the default model

Related issue

Fixes #1282

debanjum and others added 30 commits July 25, 2025 13:28
A previous regression resulted in the start llm response event being
sent with every (non-thought) message chunk. It should only be sent
once after thoughts and before first normal message chunk is streamed.

Regression probably introduced with changes to stream thoughts.

This should fix the chat streaming latency logs.
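
A rough sketch of the intended event ordering, assuming chunks arrive as `(kind, text)` pairs. The event names and the `stream_events` helper are hypothetical and not the actual Khoj streaming code.

```python
from typing import Iterable, Iterator, Tuple


def stream_events(chunks: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Emit the start event once, before the first non-thought chunk."""
    started = False
    for kind, text in chunks:
        if kind == "thought":
            yield ("thought", text)
            continue
        if not started:
            # Sent exactly once, after thoughts and before the first message chunk.
            yield ("start_llm_response", "")
            started = True
        yield ("message", text)


events = list(stream_events([("thought", "hmm"), ("message", "Hel"), ("message", "lo")]))
```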
This is required by llama.cpp server and is recommended in general for
openai compatible models
- Extract llm thoughts from more openai compatible ai api providers
  like llama.cpp server vllm and litellm.
  - Try structured thought extraction by default
  - Try in-stream thought extraction for specific model families like
    qwen and deepseek.
- Show thoughts with tool use. For intermediate steps like research
  mode from openai compatible models

Some consensus on thought in model response is being reached with
using deepseek style thoughts in structured response (via
"reasoning_content" field)  or qwen style thoughts in main
response (i.e <think></think> tags).

Default to try deepseek style structured thought extraction. So the
previous default stream processor isn't required.
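
The two thought styles can be sketched roughly as below for a complete (non-streamed) message. `split_thought` is a hypothetical helper, not the stream processor used in Khoj; only the `reasoning_content` field and the `<think></think>` tags come from the description above.

```python
import re
from typing import Optional, Tuple

THINK_TAG = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thought(content: str, reasoning_content: Optional[str] = None) -> Tuple[str, str]:
    """Return (thought, response) for a complete message."""
    # Deepseek style: the thought arrives in a separate structured field.
    if reasoning_content:
        return reasoning_content.strip(), content.strip()
    # Qwen style: the thought is wrapped in <think></think> in the main response.
    match = THINK_TAG.search(content)
    if match:
        thought = match.group(1).strip()
        response = THINK_TAG.sub("", content, count=1).strip()
        return thought, response
    # No thought found: the whole content is the response.
    return "", content.strip()
```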
Save to conversation in normal flow should only be done if
interrupt wasn't triggered.

Saving conversations on interrupt is handled completely by the
disconnect monitor since the improvements to interrupt.

This abort is handled correctly for steps before final response. But
not if interrupt occurs while final response is being sent. This
change checks for cancellation after the final response send attempt
and avoids a duplicate chat turn save.
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
## PR Summary
This PR resolves the deprecation warnings of the Pydantic library, which
you can find in the [CI
logs](https://github.com/khoj-ai/khoj/actions/runs/16528997676/job/46749452047#step:9:142):
```python
PydanticDeprecatedSince20: The `copy` method is deprecated; use `model_copy` instead. See the docstring of `BaseModel.copy` for details about how to handle `include` and `exclude`. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
```
This should avoid the sync_to_async errors thrown by django when
calling the /api/agent/conversation API endpoint
- Ask both manager and code gen AI to not run or write
  unsafe code for some safety improvement (over code exec in sandbox).
- Disallow custom agent prompts instructing unsafe code gen
Grok 3 mini at least sends thoughts in reasoning_content field of
streamed chunk delta. Extract model thoughts from that when available.
Send larger thought chunks to improve streaming efficiency and
reduce rendering load on web client.

This rendering load was most evident when using high throughput
models or low compute clients.

The server side message buffering should result in fewer re-renders,
faster streaming and lower compute load on client.

Related commit to buffer message content in fc99f8b
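
A minimal sketch of size-based chunk buffering, assuming a simple character threshold. `buffer_chunks` and `min_size` are hypothetical; the actual server side buffering may use a different trigger (e.g. time based).

```python
from typing import Iterable, Iterator


def buffer_chunks(chunks: Iterable[str], min_size: int = 16) -> Iterator[str]:
    """Coalesce small chunks so the client receives fewer, larger updates."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if len(buffer) >= min_size:
            yield buffer
            buffer = ""
    # Flush whatever is left once the stream ends.
    if buffer:
        yield buffer


merged = list(buffer_chunks(["ab", "cd", "ef", "g"], min_size=4))
```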
Clarify that the tool AI will perform a maximum of X sub-queries for
each query passed to it by the manager AI.

Prevents the manager AI from trying to directly pass a list of queries
to the search tool AI. It should pass just a single query.
These were used when khoj was configured using khoj.yml file
It is recommended to chat with open-source models by running an
open-source server like Ollama, Llama.cpp on your GPU powered machine
or use a commercial provider of open-source models like DeepInfra or
OpenRouter.

These chat model serving options provide a mature Openai compatible
API that already works with Khoj.

Directly using offline chat models only worked reasonably with pip
install on a machine with GPU. Docker setup of khoj had trouble with
accessing GPU. And without GPU access offline chat is too slow.

Deprecating support for an offline chat provider directly from within
Khoj will reduce code complexity and increase development velocity.
Offline models are subsumed to use existing Openai ai model provider.
This stale code was originally used to index files on server file
system directly by server. We currently push files to sync via API.

Server side syncing of remote content like Github and Notion is still
supported. But old, unused code for server side sync of files on
server fs is being cleaned out.

New --log-file cli arg allows specifying where khoj server should
store logs on fs. This replaces the --config-file cli arg that was
only being used as a proxy for deciding where to store the log file.

- TODO
  - Tests are broken. They were relying on the server side content
    syncing for test setup
- Delete tests testing deprecated server side indexing flows
- Delete `Local(Plaintext|Org|Markdown|Pdf)Config' methods, files and
  references in tests
- Index test data via new helper method, `get_index_files'
  - It is modelled after the old `get_org_files' variants in main app
  - It passes the test data in required format to `configure_content'
    Allows maintaining the more realistic tests from before while
    using new indexing mechanism (rather than the deprecated server
    side indexing mechanism)
…hoj-ai#1212)

### Overview
Make server leaner to increase development speed. 
Remove old indexing code and the native offline chat which was hard to
maintain.

- The native offline chat module was written when the local ai model api
ecosystem wasn't mature. Now it is. Reuse that.
- Offline chat requires GPU for usable speeds. Decoupling offline chat
from Khoj server is the recommended way to go for practical inference
speeds (e.g Ollama on machine, Khoj in docker etc.)

### Details
- Drop old code to index files on server filesystem. Clean cli, init
paths.
- Drop native offline chat support with llama-cpp-python. 
  Use established local ai APIs like Llama.cpp Server, Ollama, vLLM etc.
- Drop old pre 1.0 khoj config migration scripts
- Update test setup to index test data after old indexing code removed.
- Use khoj username on khoj's computer
- Uv is much faster for builds
…lows

It's much faster than pip, includes dependency locks via uv.lock and
comes with standard convenience utilities (e.g pipx, venv replacement)
It's faster than yarn and comes with standard convenience utilities
samhoooo and others added 30 commits February 23, 2026 00:33
…hoj-ai#1128)

- When you type in the search modal and the input matches the pattern
`file:`, you should see a list of all files, in vault and non-vault
- This list is filtered down as you type more letters 


### Technical Details

- Added file filter mode (`isFileFilterMode` state) to filter search
results by specific files
- Updated `getSuggestions()` function to search files from vault and
non-vault via khoj backend.
- Updated the selection behavior to handle both file selection and
search result selection

Closes khoj-ai#1025

---------

Co-authored-by: Debanjum <debanjum@gmail.com>
Add a "Copy References" button to the references pane in the web app.

In ReferencePanel Component
- Add a "Copy References" button to the `ReferencePanel` component.
- Implement functionality to copy all references (notes, online, and
code) as a markdown bullet list.
- Update the `TeaserReferencesSection` component to include the "Copy
References" button.
- Show copied to clipboard indicator when references copied on button click

Closes khoj-ai#1021

---------

Co-authored-by: Debanjum <debanjum@gmail.com>
## Summary
- Fixes AttributeError: 'str' object has no attribute 'iter_content' in
text_to_speech endpoint
- When `ELEVEN_LABS_API_KEY` is not configured, the function was
returning a string instead of a Response object

## Changes
- Introduced `TextToSpeechError` exception class in `text_to_speech.py`
- Changed `generate_text_to_speech` to raise exception instead of
returning error string
- Updated API endpoint to catch the exception and return HTTP 501 (Not
Implemented)
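
A simplified sketch of the raise-and-catch flow, without the web framework. The function signatures here are assumptions for illustration and the endpoint returns a plain `(status, body)` tuple instead of a real response object.

```python
from typing import Optional, Tuple


class TextToSpeechError(Exception):
    """Raised when speech cannot be generated, e.g. missing API key."""


def generate_text_to_speech(text: str, api_key: Optional[str]) -> bytes:
    if not api_key:
        # Raise instead of returning an error string the caller cannot stream.
        raise TextToSpeechError("Eleven Labs API key is not configured")
    # Placeholder for the real API call (assumption for this sketch).
    return text.encode("utf-8")


def text_to_speech_endpoint(text: str, api_key: Optional[str]) -> Tuple[int, bytes]:
    try:
        return 200, generate_text_to_speech(text, api_key)
    except TextToSpeechError as e:
        # 501 Not Implemented: the feature is not configured on this server.
        return 501, str(e).encode("utf-8")
```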

## Test plan
- [x] Code passes ruff lint check
- [ ] Manual testing with and without Eleven Labs API key configured

Fixes khoj-ai#1049

---------

Signed-off-by: majiayu000 <1835304752@qq.com>
Co-authored-by: Debanjum <debanjum@gmail.com>
Trailing slash in api calls to server doesn't work in production
behind proxy, only in local next.js dev server.
Fix spelling typos in telemetry.py. Corrects 'recieved' to 'received'
and 'equest' to 'request' in comments and error messages.
…khoj-ai#1263)

Remove redundant SDK version check in LauncherActivity since both
branches set the same orientation value. This simplifies the code
without changing behavior

Signed-off-by: Olexandr88 <radole1203@gmail.com>
## Summary

Fix a Python operator precedence bug in the `research()` function that
causes `current_iteration` to be set to a boolean instead of the actual
count of previous iterations.

## Bug

```python
if current_iteration := len(previous_iterations) > 0:
```

Python evaluates this as:
```python
if current_iteration := (len(previous_iterations) > 0):  # assigns True or False
```

So `current_iteration` becomes `True` (1) or `False` (0) regardless of
how many previous iterations exist.

## Fix

```python
if (current_iteration := len(previous_iterations)) > 0:
```

With parentheses, `current_iteration` is correctly set to the count
(e.g. 4), and then compared to 0.

## Impact

When resuming research with previous iterations, the loop counter was
effectively reset to 1 instead of the true count. This allowed the
research loop to run significantly more iterations than `MAX_ITERATIONS`
intended, wasting compute and API calls.
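
To make the impact concrete, a small sketch with an assumed `MAX_ITERATIONS` of 5. The two helper functions are hypothetical and only compute how many iterations remain under each form of the condition.

```python
MAX_ITERATIONS = 5  # assumed value for illustration


def remaining_before_fix(previous_iterations: list) -> int:
    # The comparison binds tighter than :=, so the name holds a bool.
    if current_iteration := len(previous_iterations) > 0:
        pass
    return MAX_ITERATIONS - int(current_iteration)


def remaining_after_fix(previous_iterations: list) -> int:
    # The parentheses make the name hold the count of previous iterations.
    if (current_iteration := len(previous_iterations)) > 0:
        pass
    return MAX_ITERATIONS - current_iteration


previous = ["iteration"] * 4
before = remaining_before_fix(previous)
after = remaining_after_fix(previous)
```

With 4 previous iterations, the buggy form leaves 4 more iterations while the fixed form leaves 1.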

Signed-off-by: JiangNan <1394485448@qq.com>
## Summary

In `extract_from_webpage()`, the `content` parameter is unconditionally
overwritten to `None` on the line before the `is_none_or_empty(content)`
check. This means any pre-fetched content (e.g. text content already
retrieved by the Exa search engine) is always discarded, forcing an
unnecessary re-scrape of the webpage.

## Bug

```python
async def extract_from_webpage(
    url: str,
    subqueries: set[str] = None,
    content: str = None,     # <-- caller passes pre-fetched content
    ...
) -> Tuple[set[str], str, Union[None, str]]:
    content = None            # <-- BUG: immediately overwrites it
    if is_none_or_empty(content):  # always True
        content = await scrape_webpage_with_fallback(url)
```

## Fix

Remove the `content = None` assignment so the passed-in content is used
when available, falling back to scraping only when needed.

This bug was introduced in a refactor and causes:
- Wasted API calls to web scrapers for pages whose content is already
available
- Increased latency for search results that include inline content (e.g.
Exa)
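
A self-contained sketch of the fixed behavior. `scrape_webpage_stub` and `extract_content` are hypothetical stand-ins for the real scraper and `extract_from_webpage()`; the stub just counts how often it is called.

```python
import asyncio
from typing import Optional

scrape_calls = 0


async def scrape_webpage_stub(url: str) -> str:
    """Stand-in for the real scraper; counts invocations."""
    global scrape_calls
    scrape_calls += 1
    return f"scraped:{url}"


async def extract_content(url: str, content: Optional[str] = None) -> str:
    # Only scrape when the caller did not pass usable pre-fetched content.
    if content is None or not content.strip():
        content = await scrape_webpage_stub(url)
    return content


prefetched = asyncio.run(extract_content("https://example.com", content="inline text"))
scraped = asyncio.run(extract_content("https://example.com"))
```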

Signed-off-by: JiangNan <1394485448@qq.com>
…ai#1277)

## Problem
When `ChatModel.friendly_name` is `None`, the `__str__` method returns
`None`, causing:
```
TypeError: __str__ returned non-string (type NoneType)
```

## Solution
Fall back to `name` field when `friendly_name` is `None`.
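
A minimal sketch of the fallback using a plain dataclass instead of the Django model. `ChatModelSketch` is hypothetical, and `or` also falls back on an empty `friendly_name`, which may differ from the actual change.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChatModelSketch:
    name: str
    friendly_name: Optional[str] = None

    def __str__(self) -> str:
        # Always return a string, even when friendly_name is unset.
        return self.friendly_name or self.name
```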

Related issue: khoj-ai#1251

Co-authored-by: 阳虎 <yanghu@yanghudeMacBook-Pro.local>
…g fails (khoj-ai#1292)

When PyMuPDFLoader fails to process an invalid PDF file, the exception
is caught but pdf_entry_by_pages is referenced before assignment, 
causing an UnboundLocalError.

Initialized pdf_entry_by_pages to an empty list before the try block so 
the return statement always has a valid value, even when an exception
occurs.

Verified with both invalid input (returns []) and valid PDFs (returns
extracted text).
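
The pattern can be sketched as below. `extract_pdf_entries` and the injected `load_pages` callable are hypothetical; the real code calls PyMuPDFLoader directly.

```python
from typing import Callable, List


def extract_pdf_entries(load_pages: Callable[[], List[str]]) -> List[str]:
    # Bind the result before the try block so it exists on every path.
    pdf_entry_by_pages: List[str] = []
    try:
        pdf_entry_by_pages = load_pages()
    except Exception as e:
        print(f"Unable to process PDF: {e}")
    return pdf_entry_by_pages


def broken_loader() -> List[str]:
    raise ValueError("invalid PDF")
```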

Fixes khoj-ai#1289

Co-authored-by: BillionClaw <267901332+BillionClaw@users.noreply.github.com>
## Summary

`src/khoj/processor/content/org_mode/orgnode.py:57` opens a file with
`open(filename, "r")` but never closes it. The file handle leaks for the
lifetime of the returned `Orgnode` list.

## Fix

Replaced bare `open()` with a `with` statement to ensure the file is
closed after `makelist()` finishes reading.

```python
# Before
def makelist_with_filepath(filename):
    f = open(filename, "r")
    return makelist(f, filename)

# After
def makelist_with_filepath(filename):
    with open(filename, "r") as f:
        return makelist(f, filename)
```

This is safe because `makelist()` fully consumes the file during the
call (building the Orgnode list from file contents), so the file handle
is no longer needed after it returns.
Changes (4 files):
- pyproject.toml: authlib 1.6.6 → 1.6.9
- src/interface/web/package.json: dompurify ^3.2.6 → ^3.3.2, eslint-config-next 14.2.3 → 14.2.35
- documentation/package.json: @docusaurus/* → ^3.9.2, added serialize-javascript resolution

And regenerated lock files.

The only resolution override is serialize-javascript in documentation,
which is unavoidable since Docusaurus still pins old
copy-webpack-plugin and css-minimizer-webpack-plugin that depend on
serialize-javascript ^6.x.
- Add missing skipif decorator to test_create_automation
- Change skip condition from 'is None' to 'not' (falsy check) to
  also handle empty string, which happens when GitHub secrets are
  unavailable in fork PRs
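
The difference between the two skip conditions, sketched with plain functions rather than the pytest decorator. Both helper names are hypothetical.

```python
from typing import Optional


def skip_if_none(api_key: Optional[str]) -> bool:
    # Old condition: only a missing value triggers the skip.
    return api_key is None


def skip_if_falsy(api_key: Optional[str]) -> bool:
    # New condition: a missing or empty value triggers the skip.
    return not api_key
```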
Add banner to home, chat, shared chat and settings pages for coverage.
Link to settings account section to export data and mention Khoj
self-host option in banner
Starlette 1.0.0 removed the deprecated TemplateResponse signature
where `name` was the first positional arg and `request` was passed
inside `context`. The new signature requires `request` as the first
positional argument: TemplateResponse(request, name=...).

This caused a 500 error in production on web client endpoints with:
"Jinja2Templates.TemplateResponse() missing 1 required positional
argument: 'name'" (with older Starlette) or "'request'" (with 1.0.0).

Update all TemplateResponse calls in web_client.py to use the new
Starlette 1.0.0 signature: pass `request` as the first positional
arg and `name` as an explicit keyword argument.

Issue didn't trigger locally as uv is used locally and pip in docker
builds. These resolve dependencies, including the starlette version to
install, differently. Locally 0.52.0 was installed while on production
starlette 1.0.0 was used. This mismatch is what caused the issue.
…i#1296)

## Summary
- Add null checks for `config.setting` in `get_chat_model()` and
`aget_chat_model()` to prevent `AttributeError` when memories are
disabled
- When the memory toggle creates a `UserConversationConfig` via
`get_or_create` with `setting=None`, accessing
`config.setting.price_tier` crashes — now falls through to the default
chat model instead

## Root Cause
The "Enable Memories" toggle PATCH endpoint uses `get_or_create` on
`UserConversationConfig`, which can create a config with `setting=None`.
Both `get_chat_model()` and `aget_chat_model()` then crash:
- For subscribed users: `if config:` passes but `return config.setting`
returns `None`, causing downstream crashes
- For non-subscribed users: `config.setting.price_tier` raises
`AttributeError` on `None`

## Fix
Change `if config:` → `if config and config.setting:` (subscribed path)
and add `and config.setting` guard before `.price_tier` access
(non-subscribed path), in both sync and async variants.
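
A rough sketch of the guarded lookup using dataclasses in place of the Django models. `Setting`, `UserConfig`, the simplified `get_chat_model` signature and the `"free"` tier check are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Setting:
    name: str
    price_tier: str = "free"


@dataclass
class UserConfig:
    # May be None when the config row was created by get_or_create.
    setting: Optional[Setting] = None


def get_chat_model(config: Optional[UserConfig], default: Setting, subscribed: bool) -> Setting:
    if subscribed:
        # Guard against a config that exists but has no setting.
        if config and config.setting:
            return config.setting
        return default
    # Check config.setting before touching .price_tier.
    if config and config.setting and config.setting.price_tier == "free":
        return config.setting
    return default
```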

## Test plan
- [ ] Toggle memories off with no prior chat model configured — settings
page should still load
- [ ] Chat responses should use default model when setting is None
- [ ] Existing users with configured chat models should be unaffected

Fixes khoj-ai#1287

Signed-off-by: majiayu000 <1835304752@qq.com>