Fix issue 1282 self-hosted model selection when billing is disabled #1318
Open
MujtabaHadi wants to merge 330 commits into
A previous regression resulted in the start llm response event being sent with every (non-thought) message chunk. It should only be sent once after thoughts and before first normal message chunk is streamed. Regression probably introduced with changes to stream thoughts. This should fix the chat streaming latency logs.
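A minimal sketch of the intended behaviour, assuming chunks arrive as `(kind, text)` pairs. The event name `start_llm_response` and the function `stream_events` are hypothetical and only illustrate emitting the start event exactly once, after thoughts and before the first normal message chunk:

```python
from typing import Iterable, Iterator, Tuple

START_LLM_RESPONSE = "start_llm_response"  # hypothetical event name


def stream_events(chunks: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (event, payload) pairs, emitting the start event exactly once.

    Each input chunk is a (kind, text) pair where kind is "thought" or "message".
    """
    started = False
    for kind, text in chunks:
        if kind == "thought":
            # Thought chunks never trigger the start event.
            yield ("thought", text)
            continue
        if not started:
            # First normal message chunk: announce the response start once.
            yield (START_LLM_RESPONSE, "")
            started = True
        yield ("message", text)
```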
This is required by llama.cpp server and is recommended in general for openai compatible models
- Extract llm thoughts from more openai compatible ai api providers
like llama.cpp server vllm and litellm.
- Try structured thought extraction by default
- Try in-stream thought extraction for specific model families like
qwen and deepseek.
- Show thoughts with tool use. For intermediate steps like research
mode from openai compatible models
Some consensus on how to represent thoughts in a model response is being
reached: either deepseek style thoughts in the structured response (via the
"reasoning_content" field) or qwen style thoughts in the main
response (i.e. <think></think> tags).
Default to trying deepseek style structured thought extraction, so the
previous default stream processor isn't required.
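The two thought styles described above can be sketched as follows. This is an illustration on a complete (non-streamed) response represented as a plain dict, not the actual stream processor; the function name `extract_thought` is hypothetical:

```python
import re
from typing import Optional, Tuple

THINK_TAG = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_thought(delta: dict) -> Tuple[Optional[str], str]:
    """Return (thought, message) for one response given as a plain dict."""
    content = delta.get("content") or ""
    # Deepseek style: thoughts arrive in a structured "reasoning_content" field.
    reasoning = delta.get("reasoning_content")
    if reasoning:
        return reasoning, content
    # Qwen style: thoughts are wrapped in <think></think> tags in the main response.
    match = THINK_TAG.search(content)
    if match:
        thought = match.group(1).strip()
        message = THINK_TAG.sub("", content, count=1).strip()
        return thought, message
    return None, content
```

Trying the structured field first mirrors the stated default; the tag-based path only runs when no structured thought is present.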
Saving to the conversation in the normal flow should only be done if an interrupt wasn't triggered. Since the improvements to interrupt, saving conversations on interrupt is handled completely by the disconnect monitor. The abort is handled correctly for steps before the final response, but not if the interrupt occurs while the final response is being sent. This change checks for cancellation after the final response send attempt and avoids a duplicate chat turn save.
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
## PR Summary
This PR resolves the deprecation warnings of the Pydantic library, which you can find in the [CI logs](https://github.com/khoj-ai/khoj/actions/runs/16528997676/job/46749452047#step:9:142):
```python
PydanticDeprecatedSince20: The `copy` method is deprecated; use `model_copy` instead. See the docstring of `BaseModel.copy` for details about how to handle `include` and `exclude`. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
```
This should avoid the sync_to_async errors thrown by django when calling the /api/agent/conversation API endpoint
- Ask both manager and code gen AI to not run or write unsafe code for some safety improvement (over code exec in sandbox).
- Disallow custom agent prompts instructing unsafe code gen
Grok 3 mini at least sends thoughts in reasoning_content field of streamed chunk delta. Extract model thoughts from that when available.
Send larger thought chunks to improve streaming efficiency and reduce rendering load on web client. This rendering load was most evident when using high throughput models or low compute clients. The server side message buffering should result in fewer re-renders, faster streaming and lower compute load on client. Related commit to buffer message content in fc99f8b
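A rough sketch of server side buffering, assuming a simple size threshold. The commit does not say whether buffering is size or time based, so `min_chars` and the function name `buffer_chunks` are assumptions for illustration only:

```python
from typing import Iterable, Iterator


def buffer_chunks(chunks: Iterable[str], min_chars: int = 40) -> Iterator[str]:
    """Coalesce small chunks into larger ones before sending them to the client."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if len(buffer) >= min_chars:
            # Enough text accumulated: flush one larger chunk.
            yield buffer
            buffer = ""
    if buffer:
        # Flush whatever remains once the stream ends.
        yield buffer
```

Fewer, larger chunks mean fewer re-renders on the client while the concatenated text stays unchanged.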
Clarify that the tool AI will perform a maximum of X sub-queries for each query passed to it by the manager AI. This prevents the manager AI from trying to pass a list of queries directly to the search tool AI. It should pass just a single query.
These were used when khoj was configured using khoj.yml file
It is recommended to chat with open-source models by running an open-source server like Ollama or Llama.cpp on your GPU powered machine, or to use a commercial provider of open-source models like DeepInfra or OpenRouter. These chat model serving options provide a mature OpenAI compatible API that already works with Khoj. Directly using offline chat models only worked reasonably with pip install on a machine with GPU. Docker setup of khoj had trouble accessing the GPU, and without GPU access offline chat is too slow. Deprecating support for an offline chat provider directly from within Khoj will reduce code complexity and increase development velocity. Offline models are subsumed under the existing OpenAI ai model provider.
This stale code was originally used to index files on server file
system directly by server. We currently push files to sync via API.
Server side syncing of remote content like Github and Notion is still
supported. But old, unused code for server side sync of files on
server fs is being cleaned out.
The new --log-file cli arg allows specifying where the khoj server should
store logs on fs. This replaces the --config-file cli arg that was
only being used as a proxy for deciding where to store the log file.
- TODO
- Tests are broken. They were relying on the server side content
syncing for test setup
- Delete tests testing deprecated server side indexing flows
- Delete `Local(Plaintext|Org|Markdown|Pdf)Config' methods, files and
references in tests
- Index test data via new helper method, `get_index_files'
- It is modelled after the old `get_org_files' variants in main app
- It passes the test data in required format to `configure_content'
This allows maintaining the more realistic tests from before while
using the new indexing mechanism (rather than the deprecated server
side indexing mechanism).
…hoj-ai#1212)
### Overview
Make server leaner to increase development speed. Remove old indexing code and the native offline chat which was hard to maintain.
- The native offline chat module was written when the local ai model api ecosystem wasn't mature. Now it is. Reuse that.
- Offline chat requires GPU for usable speeds. Decoupling offline chat from Khoj server is the recommended way to go for practical inference speeds (e.g Ollama on machine, Khoj in docker etc.)
### Details
- Drop old code to index files on server filesystem. Clean cli, init paths.
- Drop native offline chat support with llama-cpp-python. Use established local ai APIs like Llama.cpp Server, Ollama, vLLM etc.
- Drop old pre 1.0 khoj config migration scripts
- Update test setup to index test data after old indexing code removed.
- Use khoj username on khoj's computer
- Uv is much faster for builds
…lows
It's much faster than pip, includes dependency locks via uv.lock and comes with standard convenience utilities (e.g pipx, venv replacement)
It's faster than yarn and comes with standard convenience utilities
…hoj-ai#1128)
- When you type in the search modal and the input matches the pattern `file:`, you should see a list of all files, both vault and non-vault
- This list is filtered down as you type more letters
### Technical Details
- Added file filter mode (`isFileFilterMode` state) to filter search results by specific files
- Updated `getSuggestions()` function to search files from vault and non-vault via khoj backend.
- Updated the selection behavior to handle both file selection and search result selection

Closes khoj-ai#1025

---------
Co-authored-by: Debanjum <debanjum@gmail.com>
Add a "Copy References" button to the references pane in the web app.

In ReferencePanel Component
- Add a "Copy References" button to the `ReferencePanel` component.
- Implement functionality to copy all references (notes, online, and code) as a markdown bullet list.
- Update the `TeaserReferencesSection` component to include the "Copy References" button.
- Show copied to clipboard indicator when references copied on button click

Closes khoj-ai#1021

---------
Co-authored-by: Debanjum <debanjum@gmail.com>
## Summary
- Fixes AttributeError: 'str' object has no attribute 'iter_content' in text_to_speech endpoint
- When `ELEVEN_LABS_API_KEY` is not configured, the function was returning a string instead of a Response object
## Changes
- Introduced `TextToSpeechError` exception class in `text_to_speech.py`
- Changed `generate_text_to_speech` to raise exception instead of returning error string
- Updated API endpoint to catch the exception and return HTTP 501 (Not Implemented)
## Test plan
- [x] Code passes ruff lint check
- [ ] Manual testing with and without Eleven Labs API key configured

Fixes khoj-ai#1049

---------
Signed-off-by: majiayu000 <1835304752@qq.com>
Co-authored-by: Debanjum <debanjum@gmail.com>
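The raise-and-catch pattern described in the changes can be sketched without a web framework. Only the names `TextToSpeechError` and `generate_text_to_speech` come from the PR; the signatures, the `api_key` parameter, the placeholder audio, and the `text_to_speech_endpoint` helper returning a `(status, body)` pair are assumptions for illustration:

```python
from typing import Iterator, Optional, Tuple


class TextToSpeechError(Exception):
    """Raised when speech cannot be generated."""


def generate_text_to_speech(text: str, api_key: Optional[str]) -> Iterator[bytes]:
    # Raise instead of returning an error string, so callers never receive a str
    # where they expect a streamable response.
    if not api_key:
        raise TextToSpeechError("Eleven Labs API key is not configured")
    # Placeholder for the real provider call (assumption): echo the text as bytes.
    return iter([text.encode("utf-8")])


def text_to_speech_endpoint(text: str, api_key: Optional[str]) -> Tuple[int, bytes]:
    """Hypothetical endpoint returning (status_code, body)."""
    try:
        audio = b"".join(generate_text_to_speech(text, api_key))
    except TextToSpeechError as e:
        # 501 Not Implemented: the feature is not configured on this server.
        return 501, str(e).encode("utf-8")
    return 200, audio
```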
Trailing slash in api calls to server doesn't work in production behind proxy, only in local next.js dev server.
Fix spelling typos in telemetry.py. Corrects 'recieved' to 'received' and 'equest' to 'request' in comments and error messages.
…khoj-ai#1263)
Remove redundant SDK version check in LauncherActivity since both branches set the same orientation value. This simplifies the code without changing behavior.

Signed-off-by: Olexandr88 <radole1203@gmail.com>
## Summary
Fix a Python operator precedence bug in the `research()` function that causes `current_iteration` to be set to a boolean instead of the actual count of previous iterations.
## Bug
```python
if current_iteration := len(previous_iterations) > 0:
```
Python evaluates this as:
```python
if current_iteration := (len(previous_iterations) > 0):  # assigns True or False
```
So `current_iteration` becomes `True` (1) or `False` (0) regardless of how many previous iterations exist.
## Fix
```python
if (current_iteration := len(previous_iterations)) > 0:
```
With parentheses, `current_iteration` is correctly set to the count (e.g. 4), and then compared to 0.
## Impact
When resuming research with previous iterations, the loop counter was effectively reset to 1 instead of the true count. This allowed the research loop to run significantly more iterations than `MAX_ITERATIONS` intended, wasting compute and API calls.

Signed-off-by: JiangNan <1394485448@qq.com>
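The precedence difference can be checked in isolation. The variable names `buggy` and `fixed` and the sample list are illustrative, not taken from the codebase:

```python
previous_iterations = ["a", "b", "c", "d"]

# Without parentheses the comparison binds first, so the name receives a bool.
if buggy := len(previous_iterations) > 0:
    pass

# With parentheses the name receives the count, which is then compared to 0.
if (fixed := len(previous_iterations)) > 0:
    pass

print(buggy, fixed)
```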
## Summary
In `extract_from_webpage()`, the `content` parameter is unconditionally
overwritten to `None` on the line before the `is_none_or_empty(content)`
check. This means any pre-fetched content (e.g. text content already
retrieved by the Exa search engine) is always discarded, forcing an
unnecessary re-scrape of the webpage.
## Bug
```python
async def extract_from_webpage(
    url: str,
    subqueries: set[str] = None,
    content: str = None,  # <-- caller passes pre-fetched content
    ...
) -> Tuple[set[str], str, Union[None, str]]:
    content = None  # <-- BUG: immediately overwrites it
    if is_none_or_empty(content):  # always True
        content = await scrape_webpage_with_fallback(url)
```
## Fix
Remove the `content = None` assignment so the passed-in content is used
when available, falling back to scraping only when needed.
This bug was introduced in a refactor and causes:
- Wasted API calls to web scrapers for pages whose content is already
available
- Increased latency for search results that include inline content (e.g.
Exa)
Signed-off-by: JiangNan <1394485448@qq.com>
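A self-contained sketch of the corrected control flow. The scraper here is a stand-in that records calls, and `get_page_content` is a hypothetical reduction of `extract_from_webpage` to just the content-selection step:

```python
import asyncio
from typing import List, Optional

scrape_calls: List[str] = []


async def scrape_webpage_with_fallback(url: str) -> str:
    """Stand-in scraper (assumption) that records each call."""
    scrape_calls.append(url)
    return f"scraped:{url}"


async def get_page_content(url: str, content: Optional[str] = None) -> str:
    # Use the caller's pre-fetched content when present; scrape only as a fallback.
    if content is None or not content.strip():
        content = await scrape_webpage_with_fallback(url)
    return content
```

With pre-fetched content the scraper is never invoked, which is the saving the fix is after.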
…ai#1277)
## Problem
When `ChatModel.friendly_name` is `None`, the `__str__` method returns `None`, causing:
```
TypeError: __str__ returned non-string (type NoneType)
```
## Solution
Fall back to `name` field when `friendly_name` is `None`.

Related issue: khoj-ai#1251

Co-authored-by: 阳虎 <yanghu@yanghudeMacBook-Pro.local>
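The fallback can be illustrated with a plain dataclass standing in for the Django model. `ChatModelSketch` is hypothetical; only the field names `name` and `friendly_name` come from the description above:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChatModelSketch:
    name: str
    friendly_name: Optional[str] = None

    def __str__(self) -> str:
        # Fall back to `name` so __str__ always returns a string.
        # Note: `or` also falls back when friendly_name is an empty string.
        return self.friendly_name or self.name
```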
…g fails (khoj-ai#1292)
When PyMuPDFLoader fails to process an invalid PDF file, the exception is caught but pdf_entry_by_pages is referenced before assignment, causing an UnboundLocalError.

Initialized pdf_entry_by_pages to an empty list before the try block so the return statement always has a valid value, even when an exception occurs.

Verified with both invalid input (returns []) and valid PDFs (returns extracted text).

Fixes khoj-ai#1289

Co-authored-by: BillionClaw <267901332+BillionClaw@users.noreply.github.com>
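The initialize-before-try pattern, sketched with a hypothetical loader callable in place of PyMuPDFLoader. `extract_pages` and `failing_loader` are illustrative names:

```python
from typing import Callable, List


def extract_pages(load_pages: Callable[[], List[str]]) -> List[str]:
    # Initialize before the try block so the return below always has a value.
    pdf_entry_by_pages: List[str] = []
    try:
        pdf_entry_by_pages = load_pages()
    except Exception as e:
        print(f"Unable to process PDF: {e}")
    return pdf_entry_by_pages


def failing_loader() -> List[str]:
    """Simulates a loader that rejects an invalid PDF."""
    raise ValueError("invalid PDF")
```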
## Summary
`src/khoj/processor/content/org_mode/orgnode.py:57` opens a file with
`open(filename, "r")` but never closes it. The file handle leaks for the
lifetime of the returned `Orgnode` list.
## Fix
Replaced bare `open()` with a `with` statement to ensure the file is
closed after `makelist()` finishes reading.
```python
# Before
def makelist_with_filepath(filename):
    f = open(filename, "r")
    return makelist(f, filename)

# After
def makelist_with_filepath(filename):
    with open(filename, "r") as f:
        return makelist(f, filename)
```
This is safe because `makelist()` fully consumes the file during the
call (building the Orgnode list from file contents), so the file handle
is no longer needed after it returns.
Changes (4 files):
- pyproject.toml: authlib 1.6.6 → 1.6.9
- src/interface/web/package.json: dompurify ^3.2.6 → ^3.3.2, eslint-config-next 14.2.3 → 14.2.35
- documentation/package.json: @docusaurus/* → ^3.9.2, added serialize-javascript resolution

And regenerated lock files. The only resolution override is serialize-javascript in documentation, which is unavoidable since Docusaurus still pins old copy-webpack-plugin and css-minimizer-webpack-plugin that depend on serialize-javascript ^6.x.
- Add missing skipif decorator to test_create_automation
- Change skip condition from 'is None' to 'not' (falsy check) to also handle empty string, which happens when GitHub secrets are unavailable in fork PRs
Add banner to home, chat, shared chat and settings pages for coverage. Link to settings account section to export data and mention Khoj self-host option in banner
Starlette 1.0.0 removed the deprecated TemplateResponse signature where `name` was the first positional arg and `request` was passed inside `context`. The new signature requires `request` as the first positional argument: TemplateResponse(request, name=...).

This caused a 500 error in production on web client endpoints with: "Jinja2Templates.TemplateResponse() missing 1 required positional argument: 'name'" (with older Starlette) or "'request'" (with 1.0.0).

Update all TemplateResponse calls in web_client.py to use the new Starlette 1.0.0 signature: pass `request` as the first positional arg and `name` as an explicit keyword argument.

The issue didn't trigger locally because uv is used locally and pip in docker builds. They resolve dependencies, including the starlette version to install, differently. Locally 0.52.0 was installed while production used starlette 1.0.0. This mismatch is what caused the issue.
…i#1296)
## Summary
- Add null checks for `config.setting` in `get_chat_model()` and `aget_chat_model()` to prevent `AttributeError` when memories are disabled
- When the memory toggle creates a `UserConversationConfig` via `get_or_create` with `setting=None`, accessing `config.setting.price_tier` crashes — now falls through to the default chat model instead
## Root Cause
The "Enable Memories" toggle PATCH endpoint uses `get_or_create` on `UserConversationConfig`, which can create a config with `setting=None`. Both `get_chat_model()` and `aget_chat_model()` then crash:
- For subscribed users: `if config:` passes but `return config.setting` returns `None`, causing downstream crashes
- For non-subscribed users: `config.setting.price_tier` raises `AttributeError` on `None`
## Fix
Change `if config:` → `if config and config.setting:` (subscribed path) and add `and config.setting` guard before `.price_tier` access (non-subscribed path), in both sync and async variants.
## Test plan
- [ ] Toggle memories off with no prior chat model configured — settings page should still load
- [ ] Chat responses should use default model when setting is None
- [ ] Existing users with configured chat models should be unaffected

Fixes khoj-ai#1287

Signed-off-by: majiayu000 <1835304752@qq.com>
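A sketch of the guarded selection logic using plain dataclasses in place of the Django models. `Setting`, `UserConfig`, `pick_chat_model`, and the `"free"` tier comparison are assumptions; the PR only states that `config.setting` must be checked before `.price_tier` is read:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Setting:
    name: str
    price_tier: str = "free"  # hypothetical tier value


@dataclass
class UserConfig:
    setting: Optional[Setting] = None


def pick_chat_model(config: Optional[UserConfig], is_subscribed: bool, default: Setting) -> Setting:
    if is_subscribed:
        # Subscribed path: require both the config and its setting.
        if config and config.setting:
            return config.setting
        return default
    # Non-subscribed path: guard before touching .price_tier.
    if config and config.setting and config.setting.price_tier == "free":
        return config.setting
    return default
```

A config created with `setting=None` now falls through to the default model on both paths instead of raising.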
## Summary
Fix self-hosted model selection so `chat_advanced` does not override the default chat model path when billing is disabled.
## Problem
On self-hosted Khoj, users could be treated as subscribed in the default-model selection path. As a result, `chat_advanced` could override normal model selection behavior, including the default chat model and user-selected model behavior.
## Fix
Update:
- `ConversationAdapters.get_default_chat_model`
- `ConversationAdapters.aget_default_chat_model`

so the subscribed/advanced branch only applies when `state.billing_enabled` is true.
## Validation
I reproduced the issue locally in a self-hosted Docker setup using Ollama and OpenAI-compatible routing.

Before the fix:
- `chat_advanced` controlled the actual chat response path

After the fix:
## Related issue
Fixes #1282
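The guard can be sketched as a pure function. This is not the body of `ConversationAdapters.get_default_chat_model`; the function name, parameters, and string model names are assumptions that only illustrate gating the advanced branch on billing being enabled:

```python
from typing import Optional


def pick_default_chat_model(
    billing_enabled: bool,
    is_subscribed: bool,
    chat_default: str,
    chat_advanced: Optional[str],
) -> str:
    # The subscribed/advanced branch applies only when billing is enabled.
    if billing_enabled and is_subscribed and chat_advanced:
        return chat_advanced
    return chat_default
```

On a self-hosted server with billing disabled, the default model is returned even if the user would otherwise count as subscribed.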