Project 1: Build a GUI Agent with local LLM/VLM and OpenVINO #34948
Replies: 6 comments 4 replies
-
Hey @openvino-dev-samples and @zhuo-yoyowz, many thanks for your time. I am very excited, as this is the project I have been working on for the past year, so it is a dream come true for me. If you could guide me, I would be really grateful.
-
Hi @Rishabh-Sahni-0809, thanks for your engagement.
3. I would recommend you focus on some specific scenarios in this project and optimize the user experience for them with a local model. Do not expect a local model to do everything.
-
Hey @openvino-dev-samples and @zhuo-yoyowz, now kindly tell me which other parts need fine-tuning.
-
Hello @openvino-dev-samples and @zhuo-yoyowz,
-
Hello @openvino-dev-samples and @zhuo-yoyowz,
1. Are we looking for a completely local agent, or can it search online for large tasks? I have fed many scenarios into the agent I have already built, such as opening Notepad and typing, opening Spotify and playing music, and also booking a reservation. But there might be some new tasks, and a local agent, even a small one like Phi running under Ollama, might take excessive time.
2. Are we building this agent to be available to everyone, or might it require specific Intel hardware or models? A local agent on a normal old PC might take a lot of time to answer even a simple question.
Also, I am just polishing my approach and have already built a working MVP, so I will start working on it more after the examination period, as suggested by the mentors. Is this fine, or should I keep working on the MVP in the meantime? I have also created and solved 2 PRs, mentioned below. Should I be doing more issues? @adrianboguszewski openvinotoolkit/openvino_build_deploy#501
-
Hello @openvino-dev-samples, @adrianboguszewski, and @zhuo-yoyowz, I am still working on some PRs and issues in the OpenVINO codebase. Is that okay?
-
Hi @openvino-dev-samples and @zhuo-yoyowz,
I've been following this thread and wanted to share both my understanding of the project and what I've already built toward it.
My understanding of the project:
At its core, this is about building a desktop agent that perceives the screen as a human does — not through accessibility trees or hardcoded selectors, but by visually understanding the UI through a VLM. The LLM handles reasoning and goal decomposition while the VLM provides grounded visual perception. Together they enable an iterative Plan → Execute → Observe → Update loop that can handle dynamic, real-world interfaces. The OpenVINO requirement means all inference must run locally on CPU — no cloud APIs, reproducible and low-latency.
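As a minimal sketch of the iterative loop described above, the control flow could look like the following. All names here (`run_agent`, `perceive`, `plan`, `execute`, the `"replan"` flag) are illustrative assumptions, not part of any existing codebase; in practice `perceive` would wrap the VLM screen capture and `execute` would drive the mouse/keyboard.

```python
def run_agent(goal, perceive, plan, execute, max_steps=10):
    """Plan -> Execute -> Observe -> Update loop (hypothetical sketch).

    perceive(): returns the current screen observation as a dict.
    plan(goal, state): LLM call decomposing the goal into a subtask list.
    execute(action): performs one subtask (e.g. click/type on the desktop).
    """
    state = perceive()                  # initial observation of the screen
    steps = plan(goal, state)           # decompose the goal into subtasks
    for _ in range(max_steps):
        if not steps:
            return True                 # no subtasks left: goal reached
        action = steps.pop(0)
        execute(action)                 # act on the UI
        state = perceive()              # re-capture the screen to verify
        if state.get("replan"):         # observed state diverged from plan
            steps = plan(goal, state)   # re-plan from the new state
    return False                        # step budget exhausted

# Minimal stub demo: two actions, no re-planning needed.
log = []
done = run_agent(
    "open notepad and type hello",
    perceive=lambda: {},
    plan=lambda goal, state: [("open", "notepad"), ("type", "hello")],
    execute=log.append,
)
```

The `max_steps` bound keeps a confused local model from looping forever, which matters more on CPU-only inference where each re-plan is expensive.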
What I have already built (Digital Dave):
I have a working Python desktop agent I've been developing for the past year, evolving from a basic voice assistant into a full agentic system. It currently includes:
I've attached an architecture diagram showing the current pipeline and exactly where the OpenVINO VLM layer and new modules slot in.
Core proposal (GSoC deliverables):
The OCR-based perception layer is the key bottleneck I want to replace. My plan is to swap Tesseract for an OpenVINO-accelerated VLM (SmolVLM or nanoLLaVA, quantized INT8, CPU-only), outputting structured JSON with element labels, roles, and grounded (x, y) coordinates. On top of this I will build a proper agentic planner that decomposes a natural language goal into subtasks, executes them, verifies success by re-capturing the screen, and re-plans if the expected state was not reached.
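The structured JSON output mentioned above might take a shape like the following; this schema (an `"elements"` array with `label`, `role`, `x`, `y` fields) is one reasonable assumption on my part, not a fixed format from the proposal or from OpenVINO.

```python
import json
from dataclasses import dataclass

@dataclass
class UIElement:
    label: str  # visible text or description, e.g. "Save"
    role: str   # e.g. "button", "textbox", "menu"
    x: int      # grounded click coordinate (screen pixels)
    y: int

def parse_perception(raw: str) -> list[UIElement]:
    """Parse a VLM JSON response into typed, clickable UI elements."""
    data = json.loads(raw)
    return [UIElement(e["label"], e["role"], e["x"], e["y"])
            for e in data["elements"]]

# Hypothetical VLM response for a single detected button:
raw = '{"elements": [{"label": "Save", "role": "button", "x": 120, "y": 44}]}'
elements = parse_perception(raw)
```

Validating the model output into typed records like this gives the planner a stable interface even if the underlying VLM (SmolVLM, nanoLLaVA, or another) is swapped later.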
Deliverables:
Stretch goals (if time permits):
I believe starting from a working codebase rather than from scratch puts me in a position to deliver meaningfully more during the coding period. I am happy to share the full repository and walk through any part of the design.
Two questions for the mentors:
Below are screenshots of the current UI and app that I am trying to improve, plus a YouTube video from the same project at an earlier phase, so you can also view the features. Back then it was a normal voice assistant, but I have since made it run locally and added many improvements, as shown in the GitHub repo below.
Please do correct me if I've misunderstood anything or if you'd prefer a different direction.
GitHub: [GitHub repository link]
YouTube: https://youtu.be/9NFLYU2uIxU?si=lm5ZJdIOi6vDRVVs