
Commit b08aa85

Switch AI service from Gemini to OpenRouter (#4)
* feat: switch AI service from Gemini to OpenRouter, updating configurations and UI.
* docs: Update README to reflect OpenRouter integration for LLMs and vision, refine project status, and simplify technical architecture details.
* bump version
1 parent 3d86819 commit b08aa85

19 files changed (+954 additions, -1351 deletions)

README.md

Lines changed: 47 additions & 126 deletions
@@ -1,11 +1,11 @@
 # NextDesk
 
-**NextDesk** is an intelligent desktop automation application powered by Google's Gemini AI that uses the **ReAct (Reasoning + Acting)** framework to understand and execute complex computer tasks through natural language commands.
+**NextDesk** is an intelligent desktop automation application powered by **LLMs via OpenRouter** (using advanced models like Google's Gemini 3.0) that uses the **ReAct (Reasoning + Acting)** framework to understand and execute complex computer tasks through natural language commands.
 
-> ⚠️ **UNDER ACTIVE DEVELOPMENT**
-> This project is currently in active development and **not ready for production use**.
-> The vision-based element detection tool (`detectElementPosition`) is particularly **unreliable and not recommended** for use at this time.
-> We recommend using keyboard shortcuts (`pressKeys`) and the `getShortcuts` tool instead for more reliable automation.
+> ⚠️ **UNDER DEVELOPMENT**
+> This project is currently in development and **not ready for production use**.
+> The vision-based element detection tool (`detectElementPosition`) is experimental.
+> We recommend using keyboard shortcuts (`pressKeys`) and the `getShortcuts` tool for more reliable automation.
 
 ## 🌟 Overview
 
@@ -24,7 +24,7 @@ This Flutter desktop application combines AI reasoning with keyboard automation
 | **User Interaction** | ✅ Working | Agent can ask user questions via dialog |
 | **Task Persistence** | ✅ Working | Isar database for task history |
 
-**Current Focus:** Improving vision detection accuracy and reliability.
+**Current Focus:** Improving vision detection accuracy and reliability using newer vision models.
 
 ## 🖥️ Platform Support
 
@@ -64,7 +64,7 @@ nextdesk/
 │ │ ├── detection_result.dart # UI element detection results
 │ │ └── react_agent_state.dart # ReAct agent state
 │ ├── services/
-│ │ ├── gemini_service.dart # Gemini AI model initialization
+│ │ ├── openrouter_service.dart # OpenRouter AI integration
 │ │ ├── vision_service.dart # AI-powered UI element detection
 │ │ ├── automation_service.dart # All automation functions
 │ │ └── shortcuts_service.dart # AI-powered keyboard shortcuts
@@ -93,8 +93,8 @@ The application follows **separation of concerns** with a clean modular architec
 - `ReActAgentState`: State management for the ReAct reasoning cycle
 
 #### 2. **Services** (`lib/services/`)
-- `GeminiService`: Initializes and configures Gemini AI model with function calling
-- `VisionService`: AI-powered UI element detection using Gemini or Qwen Vision API
+- `OpenRouterService`: Initializes and configures AI models via OpenRouter API with function calling support
+- `VisionService`: AI-powered UI element detection using OpenRouter Vision API
 - `AutomationService`: Wrapper for all automation capabilities (mouse, keyboard, screen)
 
 #### 3. **Providers** (`lib/providers/`)
@@ -147,6 +147,7 @@ The agent executes one of the available automation functions:
 - `typeText(text)`: Types text via keyboard
 - `pressKeys(keys)`: Presses keyboard shortcuts
 - `wait(seconds)`: Waits for a specified duration
+- `getShortcuts(query)`: Dynamically fetches app shortcuts
 
 #### 3. **OBSERVATION** (Feedback Phase)
 The agent receives feedback from the action:
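As a language-agnostic illustration, the Thought → Action → Observation cycle with its 20-iteration cap might be sketched as follows. The app itself is Dart; these function names are hypothetical stand-ins, not NextDesk's actual API.

```python
# Illustrative Python sketch of the ReAct cycle; names are hypothetical.
MAX_ITERATIONS = 20  # mirrors AppConfig.maxIterations

def run_react_agent(task, think, act, observe):
    """Repeat THOUGHT -> ACTION -> OBSERVATION until done or the cap is hit."""
    history = []
    for step in range(MAX_ITERATIONS):
        thought = think(task, history)       # THOUGHT: decide the next step
        if thought.get("done"):
            return {"status": "complete", "steps": step}
        action = thought["action"]           # ACTION: e.g. pressKeys, typeText, wait
        observation = observe(act(action))   # OBSERVATION: feedback from the action
        history.append((thought, action, observation))
    return {"status": "max_iterations_reached", "steps": MAX_ITERATIONS}
```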
@@ -159,71 +160,36 @@ This cycle repeats until the task is complete or max iterations (20) is reached.
 
 ## 🔧 Technical Architecture
 
-### 1. AI Integration (Gemini 2.5 Flash)
+### 1. AI Integration (OpenRouter)
 
-The application uses Google's Gemini AI with **function calling** capabilities:
+The application uses **OpenRouter** to access powerful LLMs (like Google Gemini 3.0 Flash/Pro) with **function calling** capabilities.
 
-```dart
-GenerativeModel(
-  model: 'gemini-2.5-flash',
-  apiKey: apiKey,
-  tools: [
-    captureScreenshotTool,
-    detectElementTool,
-    moveMouseTool,
-    clickMouseTool,
-    typeTextTool,
-    pressKeysTool,
-    waitTool,
-  ],
-)
-```
-
-The AI can:
-- Understand natural language instructions
-- Reason about multi-step tasks
-- Call automation functions with appropriate parameters
-- Process visual information from screenshots
+The service handles:
+- Chat session management
+- System prompts for ReAct behavior
+- Tool/Function definition and execution signatures
+- Response parsing and JSON handling
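For illustration, here is a hypothetical Python sketch of the kind of request such a service assembles. OpenRouter exposes an OpenAI-compatible chat completions endpoint; the model slug and the single tool definition below are illustrative assumptions, not the app's actual Dart code.

```python
import json

# Hypothetical sketch of an OpenRouter chat completions request with one tool.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(api_key, user_task, model="google/gemini-2.5-flash"):
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,  # placeholder slug; any function-calling model works
        "messages": [
            {"role": "system", "content": "You are a ReAct desktop automation agent."},
            {"role": "user", "content": user_task},
        ],
        "tools": [{  # OpenAI-style function/tool definition
            "type": "function",
            "function": {
                "name": "pressKeys",
                "description": "Press a keyboard shortcut",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "keys": {"type": "array", "items": {"type": "string"}},
                    },
                    "required": ["keys"],
                },
            },
        }],
    }
    return headers, json.dumps(payload)
```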
 
 ### 2. Computer Vision (UI Element Detection)
 
-The `VisionService` supports **two vision providers** for UI element detection:
-
-#### **Gemini Vision API** (Default)
-- Uses Google's Gemini 2.5 Flash model
-- Integrated with Google AI Studio
-- Fast and reliable for most use cases
-
-#### **Qwen Vision API** (Alternative)
-- Uses Alibaba Cloud's Qwen 2.5 VL 72B Instruct model
-- OpenAI-compatible API format
-- Provides image size detection and confidence scores
-- Configurable resolution parameters
+The `VisionService` leverages the **OpenRouter Vision API** for UI element detection. It sends screenshots to a vision-capable model (e.g., Gemini 3.0 Flash) to identify pixel coordinates of described elements.
 
 **How it works:**
 1. Takes a screenshot of the current screen
-2. Sends the image + element description to the selected vision API
-3. AI analyzes the image and returns pixel coordinates
+2. Sends the image + element description to the OpenRouter API
+3. AI analyzes the image and returns pixel coordinates via JSON
 4. Returns a `DetectionResult` with x, y coordinates and confidence score
 
 Example:
 ```dart
 final result = await VisionService.detectElementPosition(
   imageBytes,
   "blue Submit button",
+  config,
 );
 // Returns: {x: 450, y: 320, confidence: 0.95}
 ```
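Step 3 above (coordinates returned via JSON) can be illustrated with a small parsing sketch. This is not the actual Dart `VisionService`; the reply shape is an assumption based on the `DetectionResult` fields (x, y, confidence).

```python
import json

# Illustrative sketch: extract coordinates from the model's JSON reply.
# The reply shape is assumed from the DetectionResult fields in the README.
def parse_detection(reply_text):
    data = json.loads(reply_text)
    return {
        "x": int(data["x"]),
        "y": int(data["y"]),
        "confidence": float(data.get("confidence", 0.0)),
    }
```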
 
-**Switching Providers:**
-Edit `lib/config/app_config.dart`:
-```dart
-static const String visionProvider = 'qwen'; // or 'gemini'
-static const String qwenApiKey = 'sk-your-qwen-api-key';
-```
-
-See [QWEN_INTEGRATION.md](QWEN_INTEGRATION.md) for detailed setup instructions.
-
 ### 3. Input Automation
 
 Uses the `bixat_key_mouse` package (custom Rust-based FFI) for:
@@ -258,7 +224,7 @@ class Task {
 ## 📦 Dependencies
 
 ### Core AI & Automation
-- **google_generative_ai** (^0.4.3): Gemini AI integration with function calling
+- **http** (^1.2.0): For making API requests to OpenRouter
 - **bixat_key_mouse**: Custom Rust-based FFI package for mouse/keyboard control
 - **screen_capturer** (^0.2.1): Cross-platform screen capture functionality
 
@@ -285,7 +251,7 @@ class Task {
 
 ### Prerequisites
 - Flutter SDK (>=3.0.0)
-- Gemini API key from [Google AI Studio](https://makersuite.google.com/app/apikey)
+- OpenRouter API key from [OpenRouter](https://openrouter.ai/keys)
 - **macOS desktop environment** (Windows and Linux not yet supported - see [Platform Support](#️-platform-support))
 
 ### Installation
@@ -310,14 +276,16 @@ class Task {
 
 4. **Configure API key**
 
-   Copy the example config file and add your API key:
+   You can configure the API key directly in the app settings, or set it via environment variable.
+
+   Copy the example config file:
    ```bash
    cp lib/config/app_config.dart.example lib/config/app_config.dart
    ```
 
-   Then open `lib/config/app_config.dart` and replace the API key:
+   Then open `lib/config/app_config.dart` and replace the API key (optional if using Settings UI):
    ```dart
-   static const String geminiApiKey = 'YOUR_GEMINI_API_KEY_HERE';
+   static const String openRouterApiKey = 'YOUR_OPENROUTER_API_KEY_HERE';
    ```
 
 5. **Generate Isar database code**
@@ -330,8 +298,6 @@ class Task {
    flutter run -d macos # or windows/linux
    ```
 
-
-
 ## 💡 Usage Examples
 
 ### Example 1: Simple Web Search
@@ -374,23 +340,13 @@ ACTION: pressKeys(['enter'])
 OBSERVATION: Task complete
 ```
 
-### Example 2: File Operations
-```
-Input: "Create a new text file named 'notes.txt' on the desktop"
-```
-
-### Example 3: Application Control
-```
-Input: "Take a screenshot and save it"
-```
-
 ## 🎯 Key Features
 
 ### ✅ Implemented
 
 - ✅ Natural language task understanding
 - ✅ ReAct reasoning framework (Thought → Action → Observation)
-- ✅ AI-powered UI element detection using computer vision
+- ✅ AI-powered UI element detection using computer vision (OpenRouter)
 - ✅ Mouse and keyboard automation
 - ✅ Screenshot capture and analysis
 - ✅ Task history and persistence (Isar database)
@@ -404,76 +360,40 @@ Input: "Take a screenshot and save it"
 - [ ] Voice command input
 - [ ] Task scheduling and automation
 - [ ] Error recovery and retry logic
-- [ ] Performance optimization
 - [ ] Plugin system for custom actions
-- [ ] Cloud sync for task history
-- [ ] Dark/Light theme toggle
 - [ ] Export task history to JSON/CSV
 
-## 🏛️ Code Organization
-
-The project follows a clean, modular architecture with clear separation of concerns:
-
-- **Models**: Data structures for tasks, detection results, and agent state
-- **Services**: AI integration, vision processing, and automation functions
-- **Providers**: State management using Provider pattern
-- **Screens**: Main UI with responsive layout
-- **Widgets**: Reusable UI components
-- **Config**: Centralized theme and design system
-
-## 🔒 Security & Privacy
-
-- **API Key**: Store your Gemini API key securely (use environment variables in production)
-- **Local Processing**: All automation runs locally on your machine
-- **Data Storage**: Task history is stored locally using Isar database
-- **Screenshots**: Temporary screenshots are kept in memory and not persisted
-- **No Telemetry**: No data is sent to external servers except Gemini API calls
-- **Permissions**: Requires accessibility permissions for automation (user-controlled)
-
-
 ## ⚠️ Known Limitations
 
-### ⚠️ Vision-Based Element Detection (NOT READY)
-The `detectElementPosition` function uses AI vision to locate UI elements, but it is **currently unreliable and NOT recommended for use**:
+### ⚠️ Vision-Based Element Detection (Experimental)
+The `detectElementPosition` function uses AI vision to locate UI elements. While modern models like Gemini 3.0 are powerful, detection may still be imprecise in some contexts:
 
-- **❌ Not Production Ready**: This feature is experimental and under active development
-- **❌ Accuracy Issues**: Detection may be off by several pixels or fail entirely
-- **❌ Inconsistent Results**: Same element may be detected differently across runs
-- **❌ Complex UIs**: Elements in dense or overlapping layouts are very difficult to detect
-- **❌ Similar Elements**: May confuse similar-looking buttons or icons
-- **❌ Performance**: Vision API calls are slow and may timeout
+- **Accuracy**: Detection may be off by several pixels depending on the model's interpretation.
+- **Performance**: Vision API calls can have latency.
+- **Complex UIs**: Very dense UIs can still challenge current vision models.
 
 **✅ RECOMMENDED APPROACH:**
-- **Use keyboard shortcuts** (`pressKeys`) whenever possible - much more reliable
-- **Use `getShortcuts` tool** to dynamically fetch keyboard shortcuts for applications
-- **Avoid vision-based detection** until this feature is stabilized in future releases
-
-This is a known limitation of the current implementation and AI vision models. We are actively working on improving this feature.
+- **Use keyboard shortcuts** (`pressKeys`) whenever possible - much more reliable.
+- **Use `getShortcuts` tool** to dynamically fetch keyboard shortcuts for applications.
+- Use vision detection as a fallback when no keyboard shortcut is available.
 
 ## 🐛 Troubleshooting
 
 ### Common Issues
 
 1. **"Failed to detect element"**
-   - **Note**: Element detection is not always precise and may fail
-   - Use keyboard shortcuts instead of mouse clicks when possible
-   - Ensure the element description is very clear and specific
-   - Try taking a screenshot first to verify the UI state
-   - Check that the element is visible on screen
-   - Improve description with more details (e.g., "blue Submit button in bottom right corner with white text")
+   - Ensure the element description is very clear and specific.
+   - Try taking a screenshot first to verify visibility.
+   - Use keyboard shortcuts instead of mouse clicks when possible.
 
 2. **"API key error"**
-   - Verify your Gemini API key is valid
-   - Check your internet connection
-   - Ensure you haven't exceeded API quotas
-   - Update the API key in `lib/services/gemini_service.dart`
+   - Verify your OpenRouter API key is valid.
+   - Update the API key in the app Settings or `lib/config/app_config.dart`.
 
 3. **Mouse/keyboard not working**
-   - Grant accessibility permissions to the app (System Preferences → Security & Privacy)
-   - Check that `bixat_key_mouse` package is properly installed
-   - Verify platform-specific permissions
-   - Restart the application after granting permissions
+   - Grant accessibility permissions to the app (System Preferences → Security & Privacy).
+   - Check that `bixat_key_mouse` package is properly installed.
+   - Restart the application after granting permissions.
 
 ## 🤝 Contributing
@@ -486,4 +406,5 @@ https://bixat.dev
 
 ---
 
-**Built with ❤️ using Flutter and Google Gemini AI**
+
+**Built with ❤️ using Flutter and OpenRouter**

devtools_options.yaml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+description: This file stores settings for Dart & Flutter DevTools.
+documentation: https://docs.flutter.dev/tools/devtools/extensions#configure-extension-enablement-states
+extensions:

lib/config/app_config.dart

Lines changed: 4 additions & 14 deletions
@@ -3,20 +3,10 @@
 /// Store your API keys and configuration here.
 /// For production, use environment variables or secure storage.
 class AppConfig {
-  /// Gemini API Key
-  /// Get your API key from: https://makersuite.google.com/app/apikey
-  static const String geminiApiKey = String.fromEnvironment("GEMINI_API_KEY");
-
-  /// Qwen API Key (Dashscope)
-  /// Get your API key from: https://dashscope.console.aliyun.com/
-  static const String qwenApiKey = String.fromEnvironment("QWEN_API_KEY");
-
-  /// Vision provider: 'gemini' or 'qwen'
-  static const String visionProvider = 'gemini';
-
-  /// Shortcuts provider: 'gemini' or 'qwen'
-  /// Used by getShortcuts tool to fetch keyboard shortcuts
-  static const String shortcutsProvider = 'gemini';
+  /// OpenRouter API Key
+  /// Get your API key from: https://openrouter.ai/keys
+  static const String openRouterApiKey =
+      String.fromEnvironment("OPENROUTER_API_KEY");
 
   /// Maximum iterations for ReAct agent
   static const int maxIterations = 20;
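Since `String.fromEnvironment` reads compile-time values, the new key can be supplied via Flutter's standard `--dart-define` mechanism at run/build time. The key value below is a placeholder, not a real credential.

```shell
# Supply the OpenRouter key as a compile-time define (placeholder value shown).
flutter run -d macos --dart-define=OPENROUTER_API_KEY=sk-or-your-key-here
```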

lib/config/app_theme.dart

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ class AppTheme {
   // Accent colors
   static const Color accentGreen = Color(0xFF06FFA5);
   static const Color accentGreenDark = Color(0xFF00D97E);
+  static const Color successGreen = Color(0xFF00D97E);
 
   static const Color errorRed = Color(0xFFFF006E);
   static const Color warningOrange = Color(0xFFFFBE0B);
