
feat: add Ollama native API support for vision requests #49

Open

yyfl2zju wants to merge 3 commits into zai-org:main from yyfl2zju:feat/ollama-native-api-support

Conversation


yyfl2zju commented Feb 4, 2026

  • Add api_mode config option to support both OpenAI and Ollama formats
  • Implement _convert_to_ollama_generate() to transform requests
  • Update connect() method to test appropriate endpoint based on mode
  • Add model field injection for Ollama/MLX compatibility
  • Update README.md and README_zh.md with Ollama configuration guide
  • Include troubleshooting tips for 502 errors and layout dependencies

This change enables GLM-OCR to work with Ollama's /api/generate endpoint, which provides more stable vision support than the OpenAI-compatible API in some Ollama versions.
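For context, the native endpoint this PR targets looks like the following — a minimal standalone sketch (model name, image path, and option values are illustrative placeholders, not values from this PR):

import base64

import requests

# Build a native Ollama /api/generate vision request
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "glm-ocr:latest",  # placeholder model tag
        "prompt": "Extract all text from this image.",
        "images": [image_b64],  # raw base64, no data: URI prefix
        "stream": False,
        "options": {"num_predict": 1024},
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json().get("response", ""))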

Copilot AI (Contributor) left a comment


Pull request overview

This PR adds support for Ollama's native /api/generate endpoint to enable more stable vision support compared to Ollama's OpenAI-compatible API in some versions. The implementation introduces a new api_mode configuration option that allows users to choose between OpenAI-compatible format and Ollama's native format, with automatic request/response conversion.

Changes:

  • Added api_mode configuration field to support "openai" (default) and "ollama_generate" modes
  • Implemented _convert_to_ollama_generate() method to transform OpenAI format requests to Ollama format
  • Updated connect() method to test the appropriate endpoint based on the selected API mode
  • Added comprehensive documentation in both English and Chinese READMEs with configuration examples and troubleshooting tips

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

File                  Description
glmocr/config.py      Added api_mode and model fields to the OCRApiConfig class
glmocr/config.yaml    Added the api_mode configuration option with documentation
glmocr/ocr_client.py  Implemented Ollama format conversion; updated request/response handling for both API modes
README.md             Added an Ollama deployment guide with configuration instructions and troubleshooting
README_zh.md          Added the Chinese version of the Ollama deployment guide
Comments suppressed due to low confidence (2)

glmocr/ocr_client.py:273

  • The model field is injected into the request_data twice. Lines 264-265 inject the model field if configured, and then lines 272-273 inject it again with a check for whether it already exists. This duplicate logic should be consolidated into a single injection point to avoid confusion and potential bugs.
            if self.model:
                request_data["model"] = self.model

        headers = {"Content-Type": "application/json", **self.extra_headers}
        if self.api_key:
            headers["Authorization"] = f"Bearer {self.api_key}"

        # Inject model if configured
        if self.model and "model" not in request_data:
            request_data["model"] = self.model

glmocr/config.py:39

  • The field 'model' is defined twice in this class. Line 35 defines it as 'Optional model name (required by Ollama/MLX)' and line 39 defines it again with a comment 'Model name included in API requests.' This creates ambiguity and the second definition will override the first. Remove one of the duplicate definitions.
    model: Optional[str] = None  # Optional model name (required by Ollama/MLX)
    api_key: Optional[str] = None

    # Model name included in API requests.
    model: Optional[str] = None
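
A deduplicated version keeping one definition and merging both comments (a sketch):

# Model name included in API requests (required by Ollama/MLX).
model: Optional[str] = None
api_key: Optional[str] = None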


Comment on lines 385 to 411
# Extract prompt and images from the last user message
prompt = ""
images = []

for msg in messages:
    if msg.get("role") == "user":
        content = msg.get("content", "")

        if isinstance(content, str):
            prompt = content
        elif isinstance(content, list):
            for item in content:
                if item.get("type") == "text":
                    prompt = item.get("text", "")
                elif item.get("type") == "image_url":
                    # Extract base64 from data URI
                    image_url = item.get("image_url", "")
                    if isinstance(image_url, dict):
                        image_url = image_url.get("url", "")

                    # Parse data:image/...;base64,<data>
                    if image_url.startswith("data:"):
                        parts = image_url.split(",", 1)
                        if len(parts) == 2:
                            images.append(parts[1])
                    else:
                        images.append(image_url)

Copilot AI Feb 5, 2026


There's an inconsistency in how messages are processed. The comment on line 385 states "Extract prompt and images from the last user message", but the code accumulates images from all user messages (line 409, 411) while only keeping the text from the last user message (line 394, 398). Consider either: 1) Processing only the last user message for both prompt and images, or 2) Updating the comment to reflect that images are accumulated from all user messages.
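
A sketch of the first option, reusing the surrounding variable names — walk the list in reverse and extract both prompt and images from only the last user message:

# Process only the last user message for both prompt and images
prompt = ""
images = []
last_user = next(
    (m for m in reversed(messages) if m.get("role") == "user"), None
)
if last_user is not None:
    content = last_user.get("content", "")
    if isinstance(content, str):
        prompt = content
    elif isinstance(content, list):
        for item in content:
            if item.get("type") == "text":
                prompt = item.get("text", "")
            elif item.get("type") == "image_url":
                url = item.get("image_url", "")
                if isinstance(url, dict):
                    url = url.get("url", "")
                if url.startswith("data:"):
                    parts = url.split(",", 1)
                    if len(parts) == 2:
                        images.append(parts[1])
                else:
                    images.append(url)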


# Parse response based on API mode
if self.api_mode == "ollama_generate":
    output = result.get("response", "")

Copilot AI Feb 5, 2026


When api_mode is 'ollama_generate', the code expects the response to have a 'response' field. However, if the API returns an error or unexpected response format, result.get('response', '') will return an empty string, which might mask the actual error. Consider checking for error conditions in the response before extracting the 'response' field, similar to how the OpenAI format path handles errors.

Suggested change
-output = result.get("response", "")
+# Handle possible error or unexpected response format
+if isinstance(result, dict) and "error" in result:
+    logger.error(
+        "OCR API returned error in ollama_generate mode: %s",
+        result.get("error"),
+    )
+    return {
+        "error": f"OCR API error: {result.get('error')}"
+    }, 500
+if not isinstance(result, dict) or "response" not in result:
+    logger.error(
+        "OCR API returned unexpected response format in "
+        "ollama_generate mode: %s",
+        result,
+    )
+    return {
+        "error": "Unexpected OCR API response format",
+        "response": result,
+    }, 500
+output = result["response"] or ""

Comment on lines 353 to 472
def _convert_to_ollama_generate(self, request_data: Dict) -> Dict:
    """Convert OpenAI chat format to Ollama generate format.

    OpenAI format:
    {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "..."},
                    {"type": "image_url", "image_url": "data:image/...;base64,..."}
                ]
            }
        ],
        "max_tokens": 100,
        ...
    }

    Ollama generate format:
    {
        "model": "glm-ocr:latest",
        "prompt": "...",
        "images": ["base64_string"],
        "stream": false,
        "options": {
            "num_predict": 100,
            ...
        }
    }
    """
    messages = request_data.get("messages", [])

    # Extract prompt and images from the last user message
    prompt = ""
    images = []

    for msg in messages:
        if msg.get("role") == "user":
            content = msg.get("content", "")

            if isinstance(content, str):
                prompt = content
            elif isinstance(content, list):
                for item in content:
                    if item.get("type") == "text":
                        prompt = item.get("text", "")
                    elif item.get("type") == "image_url":
                        # Extract base64 from data URI
                        image_url = item.get("image_url", "")
                        if isinstance(image_url, dict):
                            image_url = image_url.get("url", "")

                        # Parse data:image/...;base64,<data>
                        if image_url.startswith("data:"):
                            parts = image_url.split(",", 1)
                            if len(parts) == 2:
                                images.append(parts[1])
                        else:
                            images.append(image_url)

    # Build Ollama generate request
    ollama_request = {
        "model": self.model or "glm-ocr:latest",
        "prompt": prompt,
        "stream": False,
    }

    if images:
        ollama_request["images"] = images

    # Map parameters to Ollama options
    options = {}
    if "max_tokens" in request_data:
        options["num_predict"] = request_data["max_tokens"]
    if "temperature" in request_data:
        options["temperature"] = request_data["temperature"]
    if "top_p" in request_data:
        options["top_p"] = request_data["top_p"]
    if "top_k" in request_data:
        options["top_k"] = request_data["top_k"]
    if "repetition_penalty" in request_data:
        options["repeat_penalty"] = request_data["repetition_penalty"]

    if options:
        ollama_request["options"] = options

    return ollama_request

Copilot AI Feb 5, 2026


The new Ollama native API support functionality (api_mode, _convert_to_ollama_generate method, and related request/response handling) lacks test coverage. Given the complexity of the format conversion logic and multiple code paths, consider adding unit tests to verify: 1) Correct conversion from OpenAI to Ollama format with various input structures, 2) Proper response parsing for both modes, 3) Model field injection behavior, 4) Edge cases like missing fields or malformed data.
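
A pytest-style sketch of such tests. The class name OCRClient is assumed (it is not shown in this PR); the construction bypasses __init__ on the assumption that the converter only reads self.model — adjust to the real initializer:

from glmocr.ocr_client import OCRClient  # class name assumed


def make_client(model="glm-ocr:latest"):
    client = OCRClient.__new__(OCRClient)  # skip __init__ for isolation
    client.model = model
    return client


def test_convert_text_and_image():
    request = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Read this page."},
                    {
                        "type": "image_url",
                        "image_url": {"url": "data:image/png;base64,QUJD"},
                    },
                ],
            }
        ],
        "max_tokens": 64,
    }
    result = make_client()._convert_to_ollama_generate(request)
    assert result["prompt"] == "Read this page."
    assert result["images"] == ["QUJD"]  # base64 payload stripped of data: prefix
    assert result["options"]["num_predict"] == 64
    assert result["stream"] is False


def test_convert_plain_string_content():
    request = {"messages": [{"role": "user", "content": "Hello"}]}
    result = make_client()._convert_to_ollama_generate(request)
    assert result["prompt"] == "Hello"
    assert "images" not in result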

Comment on lines 389 to 394
for msg in messages:
    if msg.get("role") == "user":
        content = msg.get("content", "")

        if isinstance(content, str):
            prompt = content

Copilot AI Feb 5, 2026


The _convert_to_ollama_generate method only processes messages with role='user' and ignores system or assistant messages. If the OpenAI-format request contains system messages or conversation history, this information will be lost in the conversion. Consider documenting this limitation or adding a warning log if non-user messages are present.
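
For the warning-log option, a minimal sketch (assumes a module-level logger exists in ocr_client.py):

# Warn when non-user messages would be silently dropped by the conversion
dropped = [m.get("role") for m in messages if m.get("role") != "user"]
if dropped:
    logger.warning(
        "Ollama generate conversion only uses user messages; "
        "dropping %d message(s) with roles: %s",
        len(dropped),
        dropped,
    )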

yyfl2zju force-pushed the feat/ollama-native-api-support branch from f7e4e4c to 7ac7751 on February 8, 2026 at 02:08
yyfl2zju force-pushed the feat/ollama-native-api-support branch from 7ac7751 to c35c0d9 on February 8, 2026 at 02:26