feat: add Ollama native API support for vision requests #49

yyfl2zju wants to merge 3 commits into zai-org:main
Conversation
Force-pushed from 5650185 to c956842
Pull request overview
This PR adds support for Ollama's native `/api/generate` endpoint, which provides more stable vision support than Ollama's OpenAI-compatible API in some versions. The implementation introduces a new `api_mode` configuration option that lets users choose between the OpenAI-compatible format and Ollama's native format, with automatic request/response conversion.
Changes:
- Added `api_mode` configuration field to support "openai" (default) and "ollama_generate" modes (a minimal dispatch sketch follows this list)
- Implemented `_convert_to_ollama_generate()` method to transform OpenAI-format requests to Ollama format
- Updated `connect()` method to test the appropriate endpoint based on the selected API mode
- Added comprehensive documentation in both English and Chinese READMEs with configuration examples and troubleshooting tips
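For orientation, the mode switch itself is small. The sketch below is illustrative only: `OcrClient`, `build_request`, and the endpoint paths are assumptions about the PR's shape rather than its exact code; only `api_mode` and `_convert_to_ollama_generate()` come from the changes listed above.

```python
from typing import Dict, Optional, Tuple


class OcrClient:
    """Illustrative sketch only -- class and attribute names are assumptions, not the PR's code."""

    def __init__(self, base_url: str, api_mode: str = "openai", model: Optional[str] = None):
        self.base_url = base_url.rstrip("/")
        self.api_mode = api_mode  # "openai" (default) or "ollama_generate"
        self.model = model

    def build_request(self, request_data: Dict) -> Tuple[str, Dict]:
        """Return (url, payload) for the configured API mode."""
        if self.api_mode == "ollama_generate":
            # Ollama's native endpoint expects {"model", "prompt", "images", "stream", ...}
            return f"{self.base_url}/api/generate", self._convert_to_ollama_generate(request_data)
        # Default path: forward the OpenAI-style chat payload unchanged.
        return f"{self.base_url}/v1/chat/completions", request_data

    def _convert_to_ollama_generate(self, request_data: Dict) -> Dict:
        # The real conversion appears further down in this review.
        raise NotImplementedError
```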
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| glmocr/config.py | Added api_mode and model fields to OCRApiConfig class |
| glmocr/config.yaml | Added api_mode configuration option with documentation |
| glmocr/ocr_client.py | Implemented Ollama format conversion, updated request/response handling for both API modes |
| README.md | Added Ollama deployment guide with configuration instructions and troubleshooting |
| README_zh.md | Added Chinese version of Ollama deployment guide |
Comments suppressed due to low confidence (2)
glmocr/ocr_client.py:273
- The model field is injected into the request_data twice. Lines 264-265 inject the model field if configured, and then lines 272-273 inject it again with a check for whether it already exists. This duplicate logic should be consolidated into a single injection point to avoid confusion and potential bugs.
```python
if self.model:
    request_data["model"] = self.model
headers = {"Content-Type": "application/json", **self.extra_headers}
if self.api_key:
    headers["Authorization"] = f"Bearer {self.api_key}"
# Inject model if configured
if self.model and "model" not in request_data:
    request_data["model"] = self.model
```
glmocr/config.py:39
- The field 'model' is defined twice in this class. Line 35 defines it as 'Optional model name (required by Ollama/MLX)' and line 39 defines it again with a comment 'Model name included in API requests.' This creates ambiguity and the second definition will override the first. Remove one of the duplicate definitions.
```python
model: Optional[str] = None  # Optional model name (required by Ollama/MLX)
api_key: Optional[str] = None
# Model name included in API requests.
model: Optional[str] = None
```
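A deduplicated version of the affected fields might look like the following; this is a sketch that assumes `OCRApiConfig` is a dataclass, which may not match the actual class definition:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OCRApiConfig:
    api_key: Optional[str] = None
    # Model name included in API requests (required by Ollama/MLX backends).
    model: Optional[str] = None
    # "openai" (default) or "ollama_generate".
    api_mode: str = "openai"
```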
glmocr/ocr_client.py (Outdated)
```python
# Extract prompt and images from the last user message
prompt = ""
images = []

for msg in messages:
    if msg.get("role") == "user":
        content = msg.get("content", "")

        if isinstance(content, str):
            prompt = content
        elif isinstance(content, list):
            for item in content:
                if item.get("type") == "text":
                    prompt = item.get("text", "")
                elif item.get("type") == "image_url":
                    # Extract base64 from data URI
                    image_url = item.get("image_url", "")
                    if isinstance(image_url, dict):
                        image_url = image_url.get("url", "")

                    # Parse data:image/...;base64,<data>
                    if image_url.startswith("data:"):
                        parts = image_url.split(",", 1)
                        if len(parts) == 2:
                            images.append(parts[1])
                    else:
                        images.append(image_url)
```
There's an inconsistency in how messages are processed. The comment on line 385 states "Extract prompt and images from the last user message", but the code accumulates images from all user messages (line 409, 411) while only keeping the text from the last user message (line 394, 398). Consider either: 1) Processing only the last user message for both prompt and images, or 2) Updating the comment to reflect that images are accumulated from all user messages.
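If the first option is taken, a minimal sketch (reusing the structure from the hunk above) would look only at the final user message:

```python
# Process only the last user message for both prompt and images.
user_messages = [m for m in messages if m.get("role") == "user"]
prompt, images = "", []

if user_messages:
    content = user_messages[-1].get("content", "")
    if isinstance(content, str):
        prompt = content
    elif isinstance(content, list):
        for item in content:
            if item.get("type") == "text":
                prompt = item.get("text", "")
            elif item.get("type") == "image_url":
                image_url = item.get("image_url", "")
                if isinstance(image_url, dict):
                    image_url = image_url.get("url", "")
                # Keep only the base64 payload from a data URI.
                if image_url.startswith("data:"):
                    images.append(image_url.split(",", 1)[-1])
                else:
                    images.append(image_url)
```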
glmocr/ocr_client.py (Outdated)
```python
# Parse response based on API mode
if self.api_mode == "ollama_generate":
    output = result.get("response", "")
```
When api_mode is 'ollama_generate', the code expects the response to have a 'response' field. However, if the API returns an error or unexpected response format, result.get('response', '') will return an empty string, which might mask the actual error. Consider checking for error conditions in the response before extracting the 'response' field, similar to how the OpenAI format path handles errors.
Suggested change (replacing `output = result.get("response", "")`):

```python
# Handle possible error or unexpected response format
if isinstance(result, dict) and "error" in result:
    logger.error(
        "OCR API returned error in ollama_generate mode: %s",
        result.get("error"),
    )
    return {
        "error": f"OCR API error: {result.get('error')}"
    }, 500
if not isinstance(result, dict) or "response" not in result:
    logger.error(
        "OCR API returned unexpected response format in "
        "ollama_generate mode: %s",
        result,
    )
    return {
        "error": "Unexpected OCR API response format",
        "response": result,
    }, 500
output = result["response"] or ""
```
```python
def _convert_to_ollama_generate(self, request_data: Dict) -> Dict:
    """Convert OpenAI chat format to Ollama generate format.

    OpenAI format:
    {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "..."},
                    {"type": "image_url", "image_url": "data:image/...;base64,..."}
                ]
            }
        ],
        "max_tokens": 100,
        ...
    }

    Ollama generate format:
    {
        "model": "glm-ocr:latest",
        "prompt": "...",
        "images": ["base64_string"],
        "stream": false,
        "options": {
            "num_predict": 100,
            ...
        }
    }
    """
    messages = request_data.get("messages", [])

    # Extract prompt and images from the last user message
    prompt = ""
    images = []

    for msg in messages:
        if msg.get("role") == "user":
            content = msg.get("content", "")

            if isinstance(content, str):
                prompt = content
            elif isinstance(content, list):
                for item in content:
                    if item.get("type") == "text":
                        prompt = item.get("text", "")
                    elif item.get("type") == "image_url":
                        # Extract base64 from data URI
                        image_url = item.get("image_url", "")
                        if isinstance(image_url, dict):
                            image_url = image_url.get("url", "")

                        # Parse data:image/...;base64,<data>
                        if image_url.startswith("data:"):
                            parts = image_url.split(",", 1)
                            if len(parts) == 2:
                                images.append(parts[1])
                        else:
                            images.append(image_url)

    # Build Ollama generate request
    ollama_request = {
        "model": self.model or "glm-ocr:latest",
        "prompt": prompt,
        "stream": False,
    }

    if images:
        ollama_request["images"] = images

    # Map parameters to Ollama options
    options = {}
    if "max_tokens" in request_data:
        options["num_predict"] = request_data["max_tokens"]
    if "temperature" in request_data:
        options["temperature"] = request_data["temperature"]
    if "top_p" in request_data:
        options["top_p"] = request_data["top_p"]
    if "top_k" in request_data:
        options["top_k"] = request_data["top_k"]
    if "repetition_penalty" in request_data:
        options["repeat_penalty"] = request_data["repetition_penalty"]

    if options:
        ollama_request["options"] = options

    return ollama_request
```
The new Ollama native API support functionality (api_mode, _convert_to_ollama_generate method, and related request/response handling) lacks test coverage. Given the complexity of the format conversion logic and multiple code paths, consider adding unit tests to verify: 1) Correct conversion from OpenAI to Ollama format with various input structures, 2) Proper response parsing for both modes, 3) Model field injection behavior, 4) Edge cases like missing fields or malformed data.
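As an illustration of what such coverage might look like (a sketch only; the import path and class name `OcrClient` are assumptions, and the test bypasses `__init__` rather than relying on the real constructor):

```python
import base64

from glmocr.ocr_client import OcrClient  # hypothetical import path / class name


def test_convert_to_ollama_generate_basic():
    client = OcrClient.__new__(OcrClient)  # bypass __init__ for the sketch
    client.model = "glm-ocr:latest"

    png_b64 = base64.b64encode(b"\x89PNG fake bytes").decode()
    request = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Read this page"},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{png_b64}"}},
                ],
            }
        ],
        "max_tokens": 256,
        "temperature": 0.0,
    }

    converted = client._convert_to_ollama_generate(request)

    # Prompt and image should be lifted out of the OpenAI-style message,
    # and sampling parameters mapped into Ollama's "options" block.
    assert converted["prompt"] == "Read this page"
    assert converted["images"] == [png_b64]
    assert converted["stream"] is False
    assert converted["options"] == {"num_predict": 256, "temperature": 0.0}
```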
glmocr/ocr_client.py (Outdated)
```python
for msg in messages:
    if msg.get("role") == "user":
        content = msg.get("content", "")

        if isinstance(content, str):
            prompt = content
```
The _convert_to_ollama_generate method only processes messages with role='user' and ignores system or assistant messages. If the OpenAI-format request contains system messages or conversation history, this information will be lost in the conversion. Consider documenting this limitation or adding a warning log if non-user messages are present.
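If the warning-log route is preferred, a small addition at the top of the conversion (a sketch; it reuses the `messages` and `logger` names visible elsewhere in this review) would make the dropped data visible:

```python
# Warn when the OpenAI-format request carries messages that the Ollama
# generate conversion will drop (system prompts, assistant turns, etc.).
dropped = [m.get("role") for m in messages if m.get("role") != "user"]
if dropped:
    logger.warning(
        "Ollama generate conversion ignores non-user messages: %s", dropped
    )
```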
Force-pushed from f7e4e4c to 7ac7751
- Add api_mode config option to support both OpenAI and Ollama formats
- Implement _convert_to_ollama_generate() to transform requests
- Update connect() method to test appropriate endpoint based on mode
- Add model field injection for Ollama/MLX compatibility
- Update README.md and README_zh.md with Ollama configuration guide
- Include troubleshooting tips for 502 errors and layout dependencies

This change enables GLM-OCR to work with Ollama's /api/generate endpoint, which provides more stable vision support than the OpenAI-compatible API in some Ollama versions.
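The `connect()` change is easy to picture: the health check has to hit a different path depending on the mode. A rough sketch only, assuming `requests` is used and that the probe endpoints are Ollama's `/api/tags` and the OpenAI-compatible `/v1/models` (the PR's actual probe targets are not shown in this review):

```python
import requests


def connect(self) -> bool:
    """Sketch of a mode-aware health check (method body only; `self` is the OCR client)."""
    if self.api_mode == "ollama_generate":
        # Ollama's native API: listing local models is a cheap liveness probe.
        probe_url = f"{self.base_url}/api/tags"
    else:
        # OpenAI-compatible servers expose a models listing.
        probe_url = f"{self.base_url}/v1/models"

    try:
        response = requests.get(probe_url, headers=self.extra_headers, timeout=5)
        return response.ok
    except requests.RequestException:
        return False
```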
Force-pushed from 7ac7751 to c35c0d9