
Commit c956842

Authored and committed by Admin
feat: add Ollama native API support for vision requests
- Add api_mode config option to support both OpenAI and Ollama formats
- Implement _convert_to_ollama_generate() to transform requests
- Update connect() method to test the appropriate endpoint based on mode
- Add model field injection for Ollama/MLX compatibility
- Update README.md and README_zh.md with an Ollama configuration guide
- Include troubleshooting tips for 502 errors and layout dependencies

This change enables GLM-OCR to work with Ollama's /api/generate endpoint, which provides more stable vision support than the OpenAI-compatible API in some Ollama versions.
1 parent 8e6fd70 commit c956842

5 files changed (+261 lines, -18 lines)

README.md

Lines changed: 54 additions & 0 deletions
@@ -134,6 +134,60 @@ pipeline:
 
 See the **[MLX Detailed Deployment Guide](examples/mlx-deploy/README.md)** for full setup instructions, including environment isolation and troubleshooting.
 
+#### Option 4: Self-hosting with Ollama
+
+Ollama provides a simple local deployment option, but requires special configuration for vision support.
+
+##### Install and Set Up Ollama
+
+1. Install Ollama from https://ollama.ai
+
+2. Pull the GLM-OCR model:
+
+```bash
+ollama pull glm-ocr:latest
+```
+
+3. Start the Ollama service:
+
+```bash
+ollama serve
+```
+
+##### Configure the SDK for Ollama
+
+**Important**: Ollama's OpenAI-compatible API (`/v1/chat/completions`) has unstable vision support in some versions and may return 502 errors. We recommend using Ollama's native API.
+
+Configure `config.yaml`:
+
+```yaml
+pipeline:
+  maas:
+    enabled: false
+  ocr_api:
+    api_host: localhost
+    api_port: 11434            # Ollama default port
+    api_path: /api/generate    # Use Ollama's native endpoint
+    model: glm-ocr:latest      # Required: specify the model name
+    api_mode: ollama_generate  # Required: use Ollama's native format
+  enable_layout: false         # Recommended: disable layout mode for testing
+```
+
+**Configuration Notes**:
+- `api_mode: ollama_generate` - use Ollama's native `/api/generate` endpoint
+- `model: glm-ocr:latest` - Ollama requires the model name to be specified
+- `enable_layout: false` - disable if the layout dependencies are not installed
+
+**Troubleshooting**:
+- If you encounter 502 errors, make sure `api_mode: ollama_generate` is set
+- If you see "Layout detection dependencies" errors, set `enable_layout: false`
+- Use `ollama ps` to check whether the model is loaded
+- Use `ollama show glm-ocr:latest` to view model information
+
+**Performance Recommendations**:
+- vLLM/SGLang provide better performance and stability for production use
+- Ollama is suitable for quick testing and personal use
+
 ### SDK Usage Guide
 
 #### CLI

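As a quick sanity check outside the SDK, the native endpoint described above can be exercised directly. The following is a minimal sketch (not part of the shipped examples), assuming Ollama is serving `glm-ocr:latest` on the default port; `sample.png` is a placeholder path:

```python
import base64
import json
from urllib import request

# Read a local test image and base64-encode it. Ollama's native API expects
# raw base64 strings, without the "data:image/...;base64," prefix.
with open("sample.png", "rb") as f:  # placeholder path
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "glm-ocr:latest",   # must match the model pulled via `ollama pull`
    "prompt": "Extract the text from this image.",
    "images": [image_b64],
    "stream": False,
}

req = request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with request.urlopen(req, timeout=300) as resp:
    result = json.loads(resp.read())

# The native endpoint returns the generated text in the "response" field.
print(result.get("response", ""))
```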
README_zh.md

Lines changed: 55 additions & 0 deletions
@@ -133,6 +133,61 @@ pipeline:
     api_port: 8080
 ```
 
+#### Option 3: Apple Silicon with mlx-vlm
+
+See the **[MLX Detailed Deployment Guide](examples/mlx-deploy/README.md)** for full setup instructions, including environment isolation and troubleshooting.
+
+#### Option 4: Self-hosting with Ollama
+
+Ollama provides a simple local deployment option, but requires special configuration for vision support.
+
+##### Install and Set Up Ollama
+
+1. Install Ollama from https://ollama.ai
+
+2. Pull the GLM-OCR model:
+
+```bash
+ollama pull glm-ocr:latest
+```
+
+3. Start the Ollama service:
+
+```bash
+ollama serve
+```
+
+##### Configure the SDK for Ollama
+
+**Important**: Ollama's OpenAI-compatible API (`/v1/chat/completions`) has unstable support for vision requests in some versions and may return 502 errors. Using Ollama's native API is recommended.
+
+Configure `config.yaml`:
+
+```yaml
+pipeline:
+  maas:
+    enabled: false
+  ocr_api:
+    api_host: localhost
+    api_port: 11434            # Ollama default port
+    api_path: /api/generate    # Use Ollama's native endpoint
+    model: glm-ocr:latest      # Required: specify the model name
+    api_mode: ollama_generate  # Required: use Ollama's native format
+  enable_layout: false         # Recommended: disable layout mode first when testing
+```
+
+**Configuration Notes**:
+- `api_mode: ollama_generate` - use Ollama's native `/api/generate` endpoint
+- `model: glm-ocr:latest` - Ollama requires the model name to be specified
+- `enable_layout: false` - disable if the layout dependencies are not installed
+
+**Troubleshooting**:
+- If you encounter 502 errors, confirm that `api_mode: ollama_generate` is set
+- If you see "Layout detection dependencies" errors, set `enable_layout: false`
+- Use `ollama ps` to check whether the model is loaded
+- Use `ollama show glm-ocr:latest` to view model information
+
 ### SDK Usage Guide
 
 #### CLI

glmocr/config.py

Lines changed: 5 additions & 0 deletions
@@ -32,13 +32,18 @@ class OCRApiConfig(_BaseConfig):
     api_scheme: Optional[str] = None
     api_path: str = "/v1/chat/completions"
     api_url: Optional[str] = None
+    model: Optional[str] = None  # Optional model name (required by Ollama/MLX)
     api_key: Optional[str] = None
 
     # Model name included in API requests.
     model: Optional[str] = None
     headers: Dict[str, str] = Field(default_factory=dict)
    verify_ssl: bool = False
 
+    # API mode: "openai" (default) or "ollama_generate"
+    # Use "ollama_generate" for Ollama's native /api/generate endpoint
+    api_mode: str = "openai"
+
     connect_timeout: int = 300
     request_timeout: int = 300

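For programmatic configuration, the same fields can be set on `OCRApiConfig` directly. A hedged sketch, assuming the config class accepts keyword arguments in the usual pydantic style (the exact construction path in the SDK may differ):

```python
from glmocr.config import OCRApiConfig

# Hypothetical direct construction; the field names mirror the diff above.
ollama_cfg = OCRApiConfig(
    api_host="localhost",
    api_port=11434,
    api_path="/api/generate",    # Ollama's native endpoint
    model="glm-ocr:latest",      # Ollama requires an explicit model name
    api_mode="ollama_generate",  # switch the client to Ollama's request format
)

# Leaving api_mode at its default keeps the OpenAI-compatible behaviour,
# so existing vLLM/SGLang configurations are unaffected.
vllm_cfg = OCRApiConfig(api_host="127.0.0.1", api_port=8080)
assert vllm_cfg.api_mode == "openai"
```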
glmocr/config.yaml

Lines changed: 14 additions & 6 deletions
@@ -57,23 +57,31 @@ pipeline:
   # - You need offline/air-gapped operation
   # - You want to customize the pipeline (layout detection, prompts, etc.)
 
-  # OCR API client configuration (for self-hosted vLLM/SGLang)
+  # OCR API client configuration (for self-hosted vLLM/SGLang/Ollama)
   ocr_api:
     # Basic connection
     api_host: 127.0.0.1
     api_port: 8080
 
-    # Model name included in API requests.
-    # Required for mlx_vlm.server (e.g. "mlx-community/GLM-OCR-bf16").
-    # Set to null when using vLLM/SGLang (model is selected at server startup).
-    model: null
-
     # URL construction: {api_scheme}://{api_host}:{api_port}{api_path}
     # Or set api_url directly to override
     api_scheme: null  # null = auto (https if port 443, else http)
     api_path: /v1/chat/completions
+    # Note: If using Ollama and encountering 502 errors with vision requests,
+    # try switching to "ollama_generate" mode with api_path: /api/generate
    api_url: null  # full URL override (optional)
 
+    # Model name (optional)
+    # Required by some runtimes (e.g., Ollama/MLX). Not required for vLLM/SGLang.
+    # For mlx_vlm.server, use e.g. "mlx-community/GLM-OCR-bf16".
+    # If not set, the "model" field will not be included in the request payload.
+    model: null
+
+    # API mode: "openai" (default) or "ollama_generate"
+    # - "openai": Use OpenAI-compatible /v1/chat/completions endpoint (vLLM/SGLang/Ollama)
+    # - "ollama_generate": Use Ollama's native /api/generate endpoint
+    api_mode: openai
+
     # Authentication (for MaaS providers like Zhipu, OpenAI, etc.)
     api_key: null  # or set GLMOCR_API_KEY env var
     headers: {}  # additional HTTP headers

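The URL-construction rule restated in the comments above is easy to illustrate. A minimal sketch of the documented behaviour (an illustrative helper, not the SDK's actual code):

```python
from typing import Optional

def build_api_url(
    api_host: str,
    api_port: int,
    api_path: str,
    api_scheme: Optional[str] = None,
    api_url: Optional[str] = None,
) -> str:
    """Illustrative restatement of the config.yaml comments above."""
    if api_url:                    # explicit full-URL override wins
        return api_url
    if api_scheme is None:         # null = auto: https on port 443, else http
        api_scheme = "https" if api_port == 443 else "http"
    return f"{api_scheme}://{api_host}:{api_port}{api_path}"

# Ollama native mode from the README example above:
print(build_api_url("localhost", 11434, "/api/generate"))
# -> http://localhost:11434/api/generate
```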
glmocr/ocr_client.py

Lines changed: 133 additions & 12 deletions
@@ -54,6 +54,10 @@ def __init__(self, config: "OCRApiConfig"):
 
         self.api_key = config.api_key or os.getenv("GLMOCR_API_KEY")
         self.extra_headers = config.headers or {}
+        self.model = config.model  # Optional model name
+
+        # API mode: "openai" or "ollama_generate"
+        self.api_mode = getattr(config, "api_mode", "openai")
 
         # SSL verification
         self.verify_ssl = config.verify_ssl
@@ -168,18 +172,33 @@ def connect(self):
         sock.settimeout(10)
         result = sock.connect_ex((self.api_host, self.api_port))
         if result == 0:
-            # Send a test request to the chat/completions endpoint
+            # Send a test request
             try:
-                test_payload = {
-                    "messages": [
-                        {
-                            "role": "user",
-                            "content": [{"type": "text", "text": "hello"}],
-                        }
-                    ],
-                    "max_tokens": 10,
-                    "temperature": 0.1,
-                }
+                # Build test payload based on API mode
+                if self.api_mode == "ollama_generate":
+                    test_payload = {
+                        "model": self.model or "glm-ocr:latest",
+                        "prompt": "hello",
+                        "stream": False,
+                        "options": {"num_predict": 10},
+                    }
+                else:
+                    test_payload = {
+                        "messages": [
+                            {
+                                "role": "user",
+                                "content": [
+                                    {"type": "text", "text": "hello"}
+                                ],
+                            }
+                        ],
+                        "max_tokens": 10,
+                        "temperature": 0.1,
+                    }
+                    # Inject model field if configured (required by Ollama/MLX)
+                    if self.model:
+                        test_payload["model"] = self.model
+
                 headers = {
                     "Content-Type": "application/json",
                     **self.extra_headers,
@@ -237,6 +256,14 @@ def process(self, request_data: Dict) -> Tuple[Dict, int]:
         if self._session is None:
             self._session = self._make_session()
 
+        # Convert request format based on API mode
+        if self.api_mode == "ollama_generate":
+            request_data = self._convert_to_ollama_generate(request_data)
+        else:
+            # Inject model field if configured (required by Ollama/MLX)
+            if self.model:
+                request_data["model"] = self.model
+
         headers = {"Content-Type": "application/json", **self.extra_headers}
         if self.api_key:
             headers["Authorization"] = f"Bearer {self.api_key}"
@@ -261,7 +288,13 @@ def process(self, request_data: Dict) -> Tuple[Dict, int]:
 
             if response.status_code == 200:
                 result = response.json()
-                output = result["choices"][0]["message"]["content"]
+
+                # Parse response based on API mode
+                if self.api_mode == "ollama_generate":
+                    output = result.get("response", "")
+                else:
+                    output = result["choices"][0]["message"]["content"]
+
                 return {"choices": [{"message": {"content": output.strip()}}]}, 200
 
             status = int(response.status_code)
@@ -316,3 +349,91 @@ def process(self, request_data: Dict) -> Tuple[Dict, int]:
             "error": f"API request failed after {total_attempts} attempts",
             "detail": last_error,
         }, 500
+
+    def _convert_to_ollama_generate(self, request_data: Dict) -> Dict:
+        """Convert OpenAI chat format to Ollama generate format.
+
+        OpenAI format:
+            {
+                "messages": [
+                    {
+                        "role": "user",
+                        "content": [
+                            {"type": "text", "text": "..."},
+                            {"type": "image_url", "image_url": "data:image/...;base64,..."}
+                        ]
+                    }
+                ],
+                "max_tokens": 100,
+                ...
+            }
+
+        Ollama generate format:
+            {
+                "model": "glm-ocr:latest",
+                "prompt": "...",
+                "images": ["base64_string"],
+                "stream": false,
+                "options": {
+                    "num_predict": 100,
+                    ...
+                }
+            }
+        """
+        messages = request_data.get("messages", [])
+
+        # Extract prompt and images from the last user message
+        prompt = ""
+        images = []
+
+        for msg in messages:
+            if msg.get("role") == "user":
+                content = msg.get("content", "")
+
+                if isinstance(content, str):
+                    prompt = content
+                elif isinstance(content, list):
+                    for item in content:
+                        if item.get("type") == "text":
+                            prompt = item.get("text", "")
+                        elif item.get("type") == "image_url":
+                            # Extract base64 from data URI
+                            image_url = item.get("image_url", "")
+                            if isinstance(image_url, dict):
+                                image_url = image_url.get("url", "")
+
+                            # Parse data:image/...;base64,<data>
+                            if image_url.startswith("data:"):
+                                parts = image_url.split(",", 1)
+                                if len(parts) == 2:
+                                    images.append(parts[1])
+                            else:
+                                images.append(image_url)
+
+        # Build Ollama generate request
+        ollama_request = {
+            "model": self.model or "glm-ocr:latest",
+            "prompt": prompt,
+            "stream": False,
+        }
+
+        if images:
+            ollama_request["images"] = images
+
+        # Map parameters to Ollama options
+        options = {}
+        if "max_tokens" in request_data:
+            options["num_predict"] = request_data["max_tokens"]
+        if "temperature" in request_data:
+            options["temperature"] = request_data["temperature"]
+        if "top_p" in request_data:
+            options["top_p"] = request_data["top_p"]
+        if "top_k" in request_data:
+            options["top_k"] = request_data["top_k"]
+        if "repetition_penalty" in request_data:
+            options["repeat_penalty"] = request_data["repetition_penalty"]
+
+        if options:
+            ollama_request["options"] = options
+
+        return ollama_request

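To make the OpenAI-to-Ollama mapping concrete, here is a standalone sketch of the same transformation applied to a small request. It is a simplified restatement for illustration, not the SDK method itself; the base64 string is truncated placeholder data:

```python
import json

# An OpenAI-style chat request as the pipeline would send it.
openai_request = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the text from this image."},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}},
            ],
        }
    ],
    "max_tokens": 512,
    "temperature": 0.1,
}

# Simplified version of the mapping performed above.
prompt, images = "", []
for item in openai_request["messages"][-1]["content"]:
    if item["type"] == "text":
        prompt = item["text"]
    elif item["type"] == "image_url":
        url = item["image_url"]
        if isinstance(url, dict):
            url = url["url"]
        # Strip the data-URI header; Ollama expects bare base64 strings.
        images.append(url.split(",", 1)[1] if url.startswith("data:") else url)

ollama_request = {
    "model": "glm-ocr:latest",
    "prompt": prompt,
    "images": images,
    "stream": False,
    "options": {
        "num_predict": openai_request["max_tokens"],
        "temperature": openai_request["temperature"],
    },
}
print(json.dumps(ollama_request, indent=2))
```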