Add Docker containerization and webpage analysis #140

Status: Open · wants to merge 1 commit into base: main
32 changes: 32 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,32 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Environment
- Language: Python
- Task: UI-TARS is a multimodal agent for GUI interaction

## Commands
- No explicit build/lint/test commands found in the codebase
- For coordinate processing: `python coordinate_processing_script.py`
- For visualization: Use matplotlib to display coordinate outputs

## Code Style
- Indent: 4 spaces
- Quotes: Double quotes for strings
- Imports: Standard library first, then third-party, then local imports
- Error handling: Use specific exceptions with descriptive messages
- Naming: snake_case for functions/variables, UPPER_CASE for constants
- Documentation: Docstrings for functions (as seen in smart_resize)
- Comments: Descriptive comments for complex operations

## Dependencies
- PIL/Pillow for image processing
- matplotlib for visualization
- re for parsing model outputs
- Other common imports: json, math, io

## Model-Specific Notes
- Coordinates are processed with IMAGE_FACTOR=28
- Model outputs need to be rescaled to original dimensions
- Parse model action outputs carefully for accurate coordinate extraction
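
The notes above can be made concrete with a small sketch (the function names, image sizes, and the exact rounding are illustrative assumptions, not code from this repository): parse the action string emitted by the model, then rescale the point from the model's resized image back to the original screenshot.

```python
import re

def parse_click(action_text):
    """Extract (x, y) from an action string such as: click(start_box='(197,525)')."""
    match = re.search(r"\((\d+),\s*(\d+)\)", action_text)
    if match is None:
        raise ValueError(f"No coordinates found in: {action_text}")
    return int(match.group(1)), int(match.group(2))

def rescale_to_original(x, y, model_size, original_size):
    """Map a point from the model's resized image back onto the original screenshot."""
    model_w, model_h = model_size
    orig_w, orig_h = original_size
    return round(x * orig_w / model_w), round(y * orig_h / model_h)

# Hypothetical usage: the model-side dimensions are multiples of IMAGE_FACTOR=28
x, y = parse_click("Action: click(start_box='(197,525)')")
print(rescale_to_original(x, y, model_size=(1092, 728), original_size=(1920, 1280)))
```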
189 changes: 189 additions & 0 deletions Dockerfile
@@ -0,0 +1,189 @@
FROM python:3.10-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
git \
wget \
curl \
libgl1-mesa-glx \
libglib2.0-0 \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

# Set up virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install PyTorch with CUDA support
RUN pip install --upgrade pip && \
pip install --no-cache-dir torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu118

# Install UI-TARS dependencies
RUN pip install --no-cache-dir \
transformers==4.35.0 \
accelerate==0.23.0 \
bitsandbytes==0.41.1 \
pillow==10.0.1 \
matplotlib==3.7.3 \
numpy==1.24.3 \
sentencepiece==0.1.99 \
openai==1.0.0 \
requests==2.31.0 \
pydantic==2.5.1 \
safetensors==0.4.0 \
scipy==1.11.3 \
vllm==0.6.1

# Copy project files
COPY . /app/

# Create directories for model and data
RUN mkdir -p /app/model /app/data

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV HF_MODEL_ID="ByteDance-Seed/UI-TARS-1.5-7B"
ENV HF_HOME="/app/model"
ENV TRANSFORMERS_CACHE="/app/model"

# Download the UI-TARS tokenizer from Hugging Face at build time; the full model
# weights are fetched at runtime (extend the python -c call below if you want to
# bake the weights into the image instead)
RUN echo "Starting model download..." && \
    python -c "from transformers import AutoTokenizer; \
print('Downloading tokenizer...'); \
AutoTokenizer.from_pretrained('${HF_MODEL_ID}', trust_remote_code=True); \
print('Tokenizer downloaded successfully')" \
    || echo "Model will be downloaded at runtime"

# Create model server script
RUN echo '#!/usr/bin/env python3\n\
import os\n\
import subprocess\n\
import sys\n\
\n\
def main():\n\
    model_id = os.environ.get("HF_MODEL_ID", "ByteDance-Seed/UI-TARS-1.5-7B")\n\
    print(f"Starting vLLM OpenAI-compatible server with model: {model_id}")\n\
    cmd = [\n\
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",\n\
        "--model", model_id,\n\
        "--served-model-name", "UI-TARS",\n\
        "--tensor-parallel-size", "1",  # Change based on available GPUs\n\
        "--gpu-memory-utilization", "0.9",\n\
        "--trust-remote-code",\n\
        "--dtype", "bfloat16",  # Use float16 if bfloat16 is not supported\n\
        "--host", "0.0.0.0",\n\
        "--port", "8000",\n\
    ]\n\
    subprocess.run(cmd, check=True)\n\
\n\
if __name__ == "__main__":\n\
    main()\n\
' > /app/server.py && chmod +x /app/server.py

# Create inference script
RUN echo '#!/usr/bin/env python3\n\
import os\n\
import sys\n\
import json\n\
import base64\n\
import argparse\n\
import requests\n\
\n\
def encode_image(image_path):\n\
    with open(image_path, "rb") as image_file:\n\
        return base64.b64encode(image_file.read()).decode("utf-8")\n\
\n\
def build_system_prompt(instruction):\n\
    sys.path.insert(0, "/app")\n\
    try:\n\
        from prompts import COMPUTER_USE  # prompt template expected at /app/prompts.py\n\
        return COMPUTER_USE.replace("{language}", "English").replace("{instruction}", instruction)\n\
    except Exception:\n\
        return "You are a GUI agent. You are given a task and your action history, with screenshots."\n\
\n\
def query_model(image_path, instruction, server_url="http://localhost:8000/v1/chat/completions"):\n\
    base64_image = encode_image(image_path)  # encode the screenshot for the API request\n\
    system_prompt = build_system_prompt(instruction)\n\
    headers = {"Content-Type": "application/json"}\n\
    payload = {\n\
        "model": "UI-TARS",\n\
        "messages": [\n\
            {"role": "system", "content": system_prompt},\n\
            {"role": "user", "content": [\n\
                {"type": "text", "text": instruction},\n\
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}\n\
            ]}\n\
        ],\n\
        "temperature": 0.01,\n\
        "max_tokens": 512\n\
    }\n\
    try:\n\
        response = requests.post(server_url, headers=headers, json=payload)\n\
        response.raise_for_status()\n\
        result = response.json()\n\
        return result["choices"][0]["message"]["content"]\n\
    except Exception as e:\n\
        return f"Error: {str(e)}"\n\
\n\
def main():\n\
    parser = argparse.ArgumentParser(description="UI-TARS Model Inference")\n\
    parser.add_argument("--image", required=True, help="Path to screenshot image")\n\
    parser.add_argument("--instruction", required=True, help="Task instruction")\n\
    parser.add_argument("--server", default="http://localhost:8000/v1/chat/completions", help="Server URL")\n\
    args = parser.parse_args()\n\
    result = query_model(args.image, args.instruction, args.server)\n\
    print(result)\n\
\n\
if __name__ == "__main__":\n\
    main()\n\
' > /app/inference.py && chmod +x /app/inference.py

# Create entrypoint script
RUN echo '#!/bin/bash\n\
if [ "$1" = "serve" ]; then\n\
    echo "Starting UI-TARS server..."\n\
    python /app/server.py\n\
elif [ "$1" = "infer" ]; then\n\
    echo "Running inference..."\n\
    python /app/inference.py --image "$2" --instruction "$3"\n\
elif [ "$1" = "process-coordinates" ]; then\n\
    echo "Processing coordinates..."\n\
    python /app/coordinate_processing_script.py --image "$2" --model-output "$3" --output "$4"\n\
elif [ "$1" = "analyze-webpage" ]; then\n\
    echo "Analyzing webpage..."\n\
    python /app/webpage_analyzer.py --image "$2" ${3:+--output "$3"}\n\
else\n\
    echo "UI-TARS Docker container"\n\
    echo "Usage:"\n\
    echo " serve - Start the model server"\n\
    echo " infer IMAGE INSTRUCTION - Run inference on an image"\n\
    echo " process-coordinates IMAGE MODEL_OUTPUT OUTPUT - Process and visualize coordinates"\n\
    echo " analyze-webpage IMAGE [OUTPUT_FILE] - Analyze a webpage screenshot and output description"\n\
    echo "Environment:"\n\
    echo " HF_MODEL_ID - HuggingFace model ID (default: ByteDance-Seed/UI-TARS-1.5-7B)"\n\
fi\n\
' > /app/entrypoint.sh && chmod +x /app/entrypoint.sh

ENTRYPOINT ["/app/entrypoint.sh"]
CMD ["help"]
61 changes: 56 additions & 5 deletions README.md
@@ -13,6 +13,7 @@
We also offer a **UI-TARS-desktop** version, which can operate on your **local personal device**. To use it, please visit [https://github.com/bytedance/UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop). To use UI-TARS in web automation, you may refer to the open-source project [Midscene.js](https://github.com/web-infra-dev/Midscene).

## Updates
- 🐳 2025.04.24: Added Docker containerization, coordinate processing tools, and webpage analysis features. See [Docker Deployment Guide](README_docker.md).
- 🌟 2025.04.16: We shared the latest progress of the UI-TARS-1.5 model in our [blog](https://seed-tars.com/1.5), which excels in playing games and performing GUI tasks, and we open-sourced the [UI-TARS-1.5-7B](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B).
- ✨ 2025.03.23: We updated the OSWorld inference scripts from the original official [OSWorld repository](https://github.com/xlang-ai/OSWorld/blob/main/run_uitars.py). Now, you can use the OSWorld official inference scripts to reproduce our results.

@@ -34,13 +35,63 @@ Leveraging the foundational architecture introduced in [our recent paper](https:
</video>
<p>

## Deployment
- See the deploy guide <a href="README_deploy.md">here</a>.
- For coordinates processing, refer to <a href="README_coordinates.md">here</a>.
- For full action space parsing, refer to [OSWorld uitars_agent.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/uitars_agent.py)
## Core Features

UI-TARS provides several key capabilities:

1. **GUI Interaction**: Interacts with graphical user interfaces automatically through a vision-language model
2. **Multi-Platform Support**: Works with desktop, mobile, and web interfaces
3. **Action Generation**: Produces precise interface actions (clicks, typing, scrolling) with coordinate mapping
4. **Visual Understanding**: Comprehends interface elements, their relationships, and functions
5. **Webpage Analysis**: Converts UI screenshots to structured plaintext descriptions
6. **Coordinate Processing**: Maps model output coordinates to actual screen positions

## System Architecture

UI-TARS consists of the following components:

1. **Vision-Language Model**: Processes screenshots to understand interface elements
2. **Action Space**: Defines possible interactions (click, drag, type, etc.)
3. **Coordinate System**: Maps model outputs to actual screen positions
4. **Prompt System**: Configures model behavior for different platforms and tasks
5. **API Interface**: Provides OpenAI-compatible endpoints for integration

## Deployment Options

### 1. Docker Container (Recommended)
- Comprehensive Docker setup with GPU support
- See the [Docker Deployment Guide](README_docker.md)
- Includes web analysis and coordinate processing tools

### 2. HuggingFace Inference Endpoints
- Cloud-based deployment on HuggingFace infrastructure
- See the [HuggingFace deploy guide](README_deploy.md)

### 3. Local Development
- For coordinates processing, refer to [Coordinates Guide](README_coordinates.md)
- For action space parsing, refer to [OSWorld uitars_agent.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/uitars_agent.py)

## Usage Examples

### GUI Interaction
```python
# Example of using UI-TARS for GUI interaction
response = ui_tars.process_screenshot(
image_path="screenshot.png",
instruction="Click on the search button"
)
# Response: Action: click(start_box='(197,525)')
```
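
The `ui_tars.process_screenshot` helper above is illustrative. Against the Docker deployment, an equivalent request can be sent directly to the OpenAI-compatible endpoint; the sketch below assumes the server from the Dockerfile is listening on `localhost:8000` and serves the model under the name `UI-TARS`.

```python
import base64
import requests

def request_action(image_path: str, instruction: str) -> str:
    """Send a screenshot and instruction to the server; return the raw action text."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "model": "UI-TARS",
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ]},
        ],
        "temperature": 0.01,
        "max_tokens": 512,
    }
    resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(request_action("screenshot.png", "Click on the search button"))
# Expected shape of the reply (see above): Action: click(start_box='(197,525)')
```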

### Webpage Analysis
```bash
# Docker container command for webpage analysis
docker-compose exec ui-tars /app/entrypoint.sh analyze-webpage \
/app/data/webpage_screenshot.png /app/data/analysis.txt
```
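
The same entrypoint also exposes the coordinate-processing path; an analogous invocation could look like this (file names and the sample model output are placeholders):

```bash
# Docker container command for coordinate processing and visualization
docker-compose exec ui-tars /app/entrypoint.sh process-coordinates \
  /app/data/webpage_screenshot.png "click(start_box='(197,525)')" /app/data/annotated.png
```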

## System Prompts
- Refer to <a href="./prompts.py">prompts.py</a>
- Refer to <a href="./prompts.py">prompts.py</a> for system prompt templates


## Performance