**NextDesk** is an intelligent desktop automation application powered by **LLMs via OpenRouter** (using advanced models like Google's Gemini 3.0) that uses the **ReAct (Reasoning + Acting)** framework to understand and execute complex computer tasks through natural language commands.
> ⚠️ **UNDER DEVELOPMENT**
> This project is currently in development and **not ready for production use**.
> The vision-based element detection tool (`detectElementPosition`) is experimental.
> We recommend using keyboard shortcuts (`pressKeys`) and the `getShortcuts` tool for more reliable automation.

## 🌟 Overview

This Flutter desktop application combines AI reasoning with keyboard automation.

| **User Interaction** | ✅ Working | Agent can ask user questions via dialog |
| **Task Persistence** | ✅ Working | Isar database for task history |

**Current Focus:** Improving vision detection accuracy and reliability using newer vision models.

## 🖥️ Platform Support

```
│   │   ├── detection_result.dart    # UI element detection results
│   │   └── react_agent_state.dart   # ReAct agent state
│   ├── services/
│   │   ├── openrouter_service.dart  # OpenRouter AI integration
│   │   ├── vision_service.dart      # AI-powered UI element detection
│   │   ├── automation_service.dart  # All automation functions
```

This cycle repeats until the task is complete or the maximum of 20 iterations is reached.

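In sketch form, the loop looks roughly like this (stubbed, hypothetical helper names for illustration only — the app's real agent wires these stages to the LLM and the automation tools):

```dart
// Hypothetical sketch of the ReAct loop: reason -> act -> observe,
// repeated until the task completes or 20 iterations are reached.
const maxIterations = 20;

// Stubbed stages, for illustration only.
String reason(String task, List<String> history) =>
    history.length < 2 ? 'pressKeys: cmd+space' : 'done';

String act(String thought) => 'ok: $thought';

void main() {
  final history = <String>[];
  for (var i = 0; i < maxIterations; i++) {
    final thought = reason('open Spotlight', history); // reasoning step
    if (thought == 'done') break;                      // task complete
    history.add(act(thought));                         // acting + observation
  }
  print('finished after ${history.length} actions');
}
```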
## 🔧 Technical Architecture

### 1. AI Integration (OpenRouter)

The application uses **OpenRouter** to access powerful LLMs (like Google Gemini 3.0 Flash/Pro) with **function calling** capabilities.

The service handles:

- Chat session management
- System prompts for ReAct behavior
- Tool/function definition and execution signatures
- Response parsing and JSON handling

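As a rough illustration of the request such a service sends (a hedged sketch, not the repo's actual `openrouter_service.dart`; OpenRouter exposes an OpenAI-compatible chat-completions endpoint, and the tool definition below is a made-up example):

```dart
import 'dart:convert';
import 'dart:io';

// Minimal sketch of an OpenRouter chat-completions request with one tool
// definition, using the OpenAI-compatible schema. Illustration only; the
// app's real service layer is more elaborate.
Future<String> askModel(String apiKey, String userMessage) async {
  final client = HttpClient();
  final request = await client
      .postUrl(Uri.parse('https://openrouter.ai/api/v1/chat/completions'));
  request.headers.set('Authorization', 'Bearer $apiKey');
  request.headers.contentType = ContentType.json;
  request.write(jsonEncode({
    'model': 'openrouter/auto', // example slug; the app picks its own model
    'messages': [
      {'role': 'system', 'content': 'You are a desktop automation agent.'},
      {'role': 'user', 'content': userMessage},
    ],
    'tools': [
      {
        'type': 'function',
        'function': {
          'name': 'pressKeys', // hypothetical tool name for this sketch
          'description': 'Press a keyboard shortcut',
          'parameters': {
            'type': 'object',
            'properties': {
              'keys': {'type': 'string'}
            },
            'required': ['keys'],
          },
        },
      },
    ],
  }));
  final response = await request.close();
  return response.transform(utf8.decoder).join();
}
```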
### 2. Computer Vision (UI Element Detection)

The `VisionService` leverages the **OpenRouter Vision API** for UI element detection. It sends screenshots to a vision-capable model (e.g., Gemini 3.0 Flash) to identify the pixel coordinates of described elements.

**How it works:**
1. Takes a screenshot of the current screen
2. Sends the image + element description to the OpenRouter API
3. AI analyzes the image and returns pixel coordinates via JSON
4. Returns a `DetectionResult` with x, y coordinates and a confidence score

Example:

```dart
final result = await VisionService.detectElementPosition(
  imageBytes,
  "blue Submit button",
  config,
);
// Returns: {x: 450, y: 320, confidence: 0.95}
```

- ✅ AI-powered UI element detection using computer vision (OpenRouter)
- ✅ Mouse and keyboard automation
- ✅ Screenshot capture and analysis
- ✅ Task history and persistence (Isar database)

- [ ] Voice command input
- [ ] Task scheduling and automation
- [ ] Error recovery and retry logic
- [ ] Plugin system for custom actions
- [ ] Export task history to JSON/CSV

## ⚠️ Known Limitations

### ⚠️ Vision-Based Element Detection (Experimental)

The `detectElementPosition` function uses AI vision to locate UI elements. While modern models like Gemini 3.0 are powerful, detection may still be imprecise in some contexts:

- **Accuracy**: Detection may be off by several pixels depending on the model's interpretation.
- **Performance**: Vision API calls can have latency.
- **Complex UIs**: Very dense UIs can still challenge current vision models.

**✅ RECOMMENDED APPROACH:**
- **Use keyboard shortcuts** (`pressKeys`) whenever possible - much more reliable.
- **Use the `getShortcuts` tool** to dynamically fetch keyboard shortcuts for applications.
- Use vision detection as a fallback when no keyboard shortcut is available.

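The shortcuts-first strategy above can be sketched as follows (stubbed, hypothetical helper names — the app's actual automation functions may differ):

```dart
// Illustrative sketch: prefer a known keyboard shortcut, and use vision
// detection only as a fallback. Helpers below are stubs for this example.
Future<void> pressKeys(String keys) async => print('pressKeys: $keys');
Future<({int x, int y})> detectElement(String desc) async => (x: 450, y: 320);
Future<void> clickAt(int x, int y) async => print('click at ($x, $y)');

Future<void> activate(String element, Map<String, String> shortcuts) async {
  final shortcut = shortcuts[element]; // e.g. fetched via a getShortcuts-style lookup
  if (shortcut != null) {
    await pressKeys(shortcut);                // reliable path
  } else {
    final pos = await detectElement(element); // experimental vision fallback
    await clickAt(pos.x, pos.y);
  }
}

void main() async {
  await activate('Save', {'Save': 'cmd+s'});  // uses the shortcut path
  await activate('blue Submit button', {});   // falls back to vision
}
```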
## 🐛 Troubleshooting
### Common Issues

1. **"Failed to detect element"**
   - Ensure the element description is very clear and specific.
   - Try taking a screenshot first to verify visibility.
   - Use keyboard shortcuts instead of mouse clicks when possible.

2. **"API key error"**
   - Verify your OpenRouter API key is valid.
   - Update the API key in the app Settings or `lib/config/app_config.dart`.

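   A quick way to sanity-check a key from the command line (a sketch; `openrouter/auto` is just an example model slug, any model the app is configured for works):

   ```shell
   # Assumes OPENROUTER_API_KEY is exported in your shell.
   # A valid key returns a JSON completion; an invalid key returns a 401 error.
   curl -s https://openrouter.ai/api/v1/chat/completions \
     -H "Authorization: Bearer $OPENROUTER_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model": "openrouter/auto", "messages": [{"role": "user", "content": "ping"}]}'
   ```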
3. **Mouse/keyboard not working**
   - Grant accessibility permissions to the app (System Preferences → Security & Privacy).
   - Check that the `bixat_key_mouse` package is properly installed.
   - Restart the application after granting permissions.

## 🤝 Contributing
https://bixat.dev

---

**Built with ❤️ using Flutter and OpenRouter**