Add Docker containerization and webpage analysis #140

Status: Open · wants to merge 1 commit into base: main
32 changes: 32 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,32 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Environment
- Language: Python
- Task: UI-TARS is a multimodal agent for GUI interaction

## Commands
- No explicit build/lint/test commands found in the codebase
- For coordinate processing: `python coordinate_processing_script.py`
- For visualization: Use matplotlib to display coordinate outputs

## Code Style
- Indent: 4 spaces
- Quotes: Double quotes for strings
- Imports: Standard library first, then third-party, then local imports
- Error handling: Use specific exceptions with descriptive messages
- Naming: snake_case for functions/variables, UPPER_CASE for constants
- Documentation: Docstrings for functions (as seen in smart_resize)
- Comments: Descriptive comments for complex operations

## Dependencies
- PIL/Pillow for image processing
- matplotlib for visualization
- re for parsing model outputs
- Other common imports: json, math, io

## Model-Specific Notes
- Coordinates are processed with IMAGE_FACTOR=28
- Model outputs need to be rescaled to original dimensions
- Parse model action outputs carefully for accurate coordinate extraction
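
The notes above can be made concrete with a small sketch (the function names, image sizes, and the exact rounding are illustrative assumptions, not code from this repository): parse the action string emitted by the model, then rescale the point from the model's resized image back to the original screenshot.

```python
import re

def parse_click(action_text):
    """Extract (x, y) from an action string such as: click(start_box='(197,525)')."""
    match = re.search(r"\((\d+),\s*(\d+)\)", action_text)
    if match is None:
        raise ValueError(f"No coordinates found in: {action_text}")
    return int(match.group(1)), int(match.group(2))

def rescale_to_original(x, y, model_size, original_size):
    """Map a point from the model's resized image back onto the original screenshot."""
    model_w, model_h = model_size
    orig_w, orig_h = original_size
    return round(x * orig_w / model_w), round(y * orig_h / model_h)

# Hypothetical usage: the model-side dimensions are multiples of IMAGE_FACTOR=28
x, y = parse_click("Action: click(start_box='(197,525)')")
print(rescale_to_original(x, y, model_size=(1092, 728), original_size=(1920, 1280)))
```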
189 changes: 189 additions & 0 deletions Dockerfile
@@ -0,0 +1,189 @@
FROM python:3.10-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
git \
wget \
curl \
libgl1-mesa-glx \
libglib2.0-0 \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

# Set up virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install PyTorch with CUDA support
RUN pip install --upgrade pip && \
pip install --no-cache-dir torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu118

# Install UI-TARS dependencies
RUN pip install --no-cache-dir \
transformers==4.35.0 \
accelerate==0.23.0 \
bitsandbytes==0.41.1 \
pillow==10.0.1 \
matplotlib==3.7.3 \
numpy==1.24.3 \
sentencepiece==0.1.99 \
openai==1.0.0 \
requests==2.31.0 \
pydantic==2.5.1 \
safetensors==0.4.0 \
scipy==1.11.3 \
vllm==0.6.1

# Copy project files
COPY . /app/

# Create directories for model and data
RUN mkdir -p /app/model /app/data

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV HF_MODEL_ID="ByteDance-Seed/UI-TARS-1.5-7B"
ENV HF_HOME="/app/model"
ENV TRANSFORMERS_CACHE="/app/model"

# Download the UI-TARS tokenizer from Hugging Face at build time; the full model
# weights are fetched at runtime (extend the python -c call below if you want to
# bake the weights into the image instead)
RUN echo "Starting model download..." && \
    python -c "from transformers import AutoTokenizer; \
print('Downloading tokenizer...'); \
AutoTokenizer.from_pretrained('${HF_MODEL_ID}', trust_remote_code=True); \
print('Tokenizer downloaded successfully')" \
    || echo "Model will be downloaded at runtime"

# Create model server script
RUN echo '#!/usr/bin/env python3\n\
import os\n\
import subprocess\n\
import sys\n\
\n\
def main():\n\
    model_id = os.environ.get("HF_MODEL_ID", "ByteDance-Seed/UI-TARS-1.5-7B")\n\
    print(f"Starting vLLM OpenAI-compatible server with model: {model_id}")\n\
    cmd = [\n\
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",\n\
        "--model", model_id,\n\
        "--served-model-name", "UI-TARS",\n\
        "--tensor-parallel-size", "1",  # Change based on available GPUs\n\
        "--gpu-memory-utilization", "0.9",\n\
        "--trust-remote-code",\n\
        "--dtype", "bfloat16",  # Use float16 if bfloat16 is not supported\n\
        "--host", "0.0.0.0",\n\
        "--port", "8000",\n\
    ]\n\
    subprocess.run(cmd, check=True)\n\
\n\
if __name__ == "__main__":\n\
    main()\n\
' > /app/server.py && chmod +x /app/server.py

# Create inference script
RUN echo '#!/usr/bin/env python3\n\
import os\n\
import sys\n\
import json\n\
import base64\n\
import argparse\n\
import requests\n\
\n\
def encode_image(image_path):\n\
    with open(image_path, "rb") as image_file:\n\
        return base64.b64encode(image_file.read()).decode("utf-8")\n\
\n\
def build_system_prompt(instruction):\n\
    sys.path.insert(0, "/app")\n\
    try:\n\
        from prompts import COMPUTER_USE  # prompt template expected at /app/prompts.py\n\
        return COMPUTER_USE.replace("{language}", "English").replace("{instruction}", instruction)\n\
    except Exception:\n\
        return "You are a GUI agent. You are given a task and your action history, with screenshots."\n\
\n\
def query_model(image_path, instruction, server_url="http://localhost:8000/v1/chat/completions"):\n\
    base64_image = encode_image(image_path)  # encode the screenshot for the API request\n\
    system_prompt = build_system_prompt(instruction)\n\
    headers = {"Content-Type": "application/json"}\n\
    payload = {\n\
        "model": "UI-TARS",\n\
        "messages": [\n\
            {"role": "system", "content": system_prompt},\n\
            {"role": "user", "content": [\n\
                {"type": "text", "text": instruction},\n\
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}\n\
            ]}\n\
        ],\n\
        "temperature": 0.01,\n\
        "max_tokens": 512\n\
    }\n\
    try:\n\
        response = requests.post(server_url, headers=headers, json=payload)\n\
        response.raise_for_status()\n\
        result = response.json()\n\
        return result["choices"][0]["message"]["content"]\n\
    except Exception as e:\n\
        return f"Error: {str(e)}"\n\
\n\
def main():\n\
    parser = argparse.ArgumentParser(description="UI-TARS Model Inference")\n\
    parser.add_argument("--image", required=True, help="Path to screenshot image")\n\
    parser.add_argument("--instruction", required=True, help="Task instruction")\n\
    parser.add_argument("--server", default="http://localhost:8000/v1/chat/completions", help="Server URL")\n\
    args = parser.parse_args()\n\
    result = query_model(args.image, args.instruction, args.server)\n\
    print(result)\n\
\n\
if __name__ == "__main__":\n\
    main()\n\
' > /app/inference.py && chmod +x /app/inference.py

# Create entrypoint script
RUN echo '#!/bin/bash\n\
if [ "$1" = "serve" ]; then\n\
    echo "Starting UI-TARS server..."\n\
    python /app/server.py\n\
elif [ "$1" = "infer" ]; then\n\
    echo "Running inference..."\n\
    python /app/inference.py --image "$2" --instruction "$3"\n\
elif [ "$1" = "process-coordinates" ]; then\n\
    echo "Processing coordinates..."\n\
    python /app/coordinate_processing_script.py --image "$2" --model-output "$3" --output "$4"\n\
elif [ "$1" = "analyze-webpage" ]; then\n\
    echo "Analyzing webpage..."\n\
    python /app/webpage_analyzer.py --image "$2" ${3:+--output "$3"}\n\
else\n\
    echo "UI-TARS Docker container"\n\
    echo "Usage:"\n\
    echo " serve - Start the model server"\n\
    echo " infer IMAGE INSTRUCTION - Run inference on an image"\n\
    echo " process-coordinates IMAGE MODEL_OUTPUT OUTPUT - Process and visualize coordinates"\n\
    echo " analyze-webpage IMAGE [OUTPUT_FILE] - Analyze a webpage screenshot and output description"\n\
    echo "Environment:"\n\
    echo " HF_MODEL_ID - HuggingFace model ID (default: ByteDance-Seed/UI-TARS-1.5-7B)"\n\
fi\n\
' > /app/entrypoint.sh && chmod +x /app/entrypoint.sh

ENTRYPOINT ["/app/entrypoint.sh"]
CMD ["help"]
61 changes: 56 additions & 5 deletions README.md
@@ -13,6 +13,7 @@
We also offer a **UI-TARS-desktop** version, which can operate on your **local personal device**. To use it, please visit [https://github.com/bytedance/UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop). To use UI-TARS in web automation, you may refer to the open-source project [Midscene.js](https://github.com/web-infra-dev/Midscene).

## Updates
- 🐳 2025.04.24: Added Docker containerization, coordinate processing tools, and webpage analysis features. See [Docker Deployment Guide](README_docker.md).
- 🌟 2025.04.16: We shared the latest progress of the UI-TARS-1.5 model in our [blog](https://seed-tars.com/1.5), which excels in playing games and performing GUI tasks, and we open-sourced the [UI-TARS-1.5-7B](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B).
- ✨ 2025.03.23: We updated the OSWorld inference scripts from the original official [OSWorld repository](https://github.com/xlang-ai/OSWorld/blob/main/run_uitars.py). Now, you can use the OSWorld official inference scripts to reproduce our results.

@@ -34,13 +35,63 @@ Leveraging the foundational architecture introduced in [our recent paper](https:
</video>
<p>

## Deployment
- See the deploy guide <a href="README_deploy.md">here</a>.
- For coordinates processing, refer to <a href="README_coordinates.md">here</a>.
- For full action space parsing, refer to [OSWorld uitars_agent.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/uitars_agent.py)
## Core Features

UI-TARS provides several key capabilities:

1. **GUI Interaction**: Interacts with graphical user interfaces automatically through a vision-language model
2. **Multi-Platform Support**: Works with desktop, mobile, and web interfaces
3. **Action Generation**: Produces precise interface actions (clicks, typing, scrolling) with coordinate mapping
4. **Visual Understanding**: Comprehends interface elements, their relationships, and functions
5. **Webpage Analysis**: Converts UI screenshots to structured plaintext descriptions
6. **Coordinate Processing**: Maps model output coordinates to actual screen positions

## System Architecture

UI-TARS consists of the following components:

1. **Vision-Language Model**: Processes screenshots to understand interface elements
2. **Action Space**: Defines possible interactions (click, drag, type, etc.)
3. **Coordinate System**: Maps model outputs to actual screen positions
4. **Prompt System**: Configures model behavior for different platforms and tasks
5. **API Interface**: Provides OpenAI-compatible endpoints for integration

## Deployment Options

### 1. Docker Container (Recommended)
- Comprehensive Docker setup with GPU support
- See the [Docker Deployment Guide](README_docker.md)
- Includes web analysis and coordinate processing tools

### 2. HuggingFace Inference Endpoints
- Cloud-based deployment on HuggingFace infrastructure
- See the [HuggingFace deploy guide](README_deploy.md)

### 3. Local Development
- For coordinates processing, refer to [Coordinates Guide](README_coordinates.md)
- For action space parsing, refer to [OSWorld uitars_agent.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/uitars_agent.py)

## Usage Examples

### GUI Interaction
```python
# Example of using UI-TARS for GUI interaction
response = ui_tars.process_screenshot(
image_path="screenshot.png",
instruction="Click on the search button"
)
# Response: Action: click(start_box='(197,525)')
```
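
The `ui_tars.process_screenshot` helper above is illustrative. Against the Docker deployment, an equivalent request can be sent directly to the OpenAI-compatible endpoint; the sketch below assumes the server from the Dockerfile is listening on `localhost:8000` and serves the model under the name `UI-TARS`.

```python
import base64
import requests

def request_action(image_path: str, instruction: str) -> str:
    """Send a screenshot and instruction to the server; return the raw action text."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "model": "UI-TARS",
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ]},
        ],
        "temperature": 0.01,
        "max_tokens": 512,
    }
    resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(request_action("screenshot.png", "Click on the search button"))
# Expected shape of the reply (see above): Action: click(start_box='(197,525)')
```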

### Webpage Analysis
```bash
# Docker container command for webpage analysis
docker-compose exec ui-tars /app/entrypoint.sh analyze-webpage \
/app/data/webpage_screenshot.png /app/data/analysis.txt
```
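
The same entrypoint also exposes the coordinate-processing path; an analogous invocation could look like this (file names and the sample model output are placeholders):

```bash
# Docker container command for coordinate processing and visualization
docker-compose exec ui-tars /app/entrypoint.sh process-coordinates \
  /app/data/webpage_screenshot.png "click(start_box='(197,525)')" /app/data/annotated.png
```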

## System Prompts
- Refer to <a href="./prompts.py">prompts.py</a>
- Refer to <a href="./prompts.py">prompts.py</a> for system prompt templates


## Performance