vlm-caption is a Go application that uses a Vision Language Model (VLM) to show live captions from your webcam in your browser, all running entirely on your local machine!
It uses yzma to perform local inference with llama.cpp, and GoCV for the video processing.
You must install yzma and llama.cpp to run this program.
See https://github.com/hybridgroup/yzma/blob/main/INSTALL.md
You must also install OpenCV and GoCV, which, unlike yzma, require CGo.
See https://gocv.io/getting-started/
Although yzma itself does not use CGo, it can still be used in Go applications that do.
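To illustrate the GoCV side of the pipeline, here is a minimal sketch that grabs a single frame from the default webcam and encodes it as JPEG. This is only the capture half; the yzma inference step is omitted, and the choice of device 0 and the JPEG hand-off are assumptions for illustration, not the program's actual implementation.

```go
package main

import (
	"fmt"
	"log"

	"gocv.io/x/gocv"
)

func main() {
	// Open the default webcam (device 0 is an assumption for this sketch).
	webcam, err := gocv.OpenVideoCapture(0)
	if err != nil {
		log.Fatalf("error opening webcam: %v", err)
	}
	defer webcam.Close()

	img := gocv.NewMat()
	defer img.Close()

	// Grab a single frame from the camera.
	if ok := webcam.Read(&img); !ok || img.Empty() {
		log.Fatal("cannot read frame from webcam")
	}

	// Encode the frame as JPEG, the kind of byte buffer an image
	// is typically handed to a VLM in.
	buf, err := gocv.IMEncode(gocv.JPEGFileExt, img)
	if err != nil {
		log.Fatalf("error encoding frame: %v", err)
	}
	defer buf.Close()

	fmt.Printf("captured %dx%d frame, %d JPEG bytes\n",
		img.Cols(), img.Rows(), len(buf.GetBytes()))
}
```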
You will also need a Vision Language Model. Download the model and projector files from Hugging Face in .gguf format.
For example, you can use the Qwen3-VL-2B-Instruct model:
https://huggingface.co/ggml-org/Qwen3-VL-2B-Instruct-GGUF
Build and run the program, passing the webcam device ID, the address to listen on, the model file, the projector file, and the caption prompt:

```shell
go build .
./vlm-caption 0 localhost:8080 ~/models/Qwen3-VL-2B-Instruct-Q8_0.gguf ~/models/mmproj-Qwen3-VL-2B-Instruct-Q8_0.gguf "Give a very brief description of what is going on."
```

Now open your web browser pointed to http://localhost:8080/
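For a rough idea of how live captions can reach the browser, here is a minimal standard-library sketch: a background goroutine stands in for the VLM and updates the latest caption, while a simple polling page displays it. The /caption route, the one-second poll, and the placeholder producer are assumptions for illustration, not the program's actual implementation.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync"
	"time"
)

var (
	mu      sync.RWMutex
	caption = "warming up..."
)

// setCaption would be called with each new VLM result.
func setCaption(s string) {
	mu.Lock()
	caption = s
	mu.Unlock()
}

func main() {
	// Placeholder producer: a real version would caption webcam frames.
	go func() {
		for i := 0; ; i++ {
			setCaption(fmt.Sprintf("caption #%d", i))
			time.Sleep(time.Second)
		}
	}()

	// Serve a page that polls for the latest caption once a second.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `<html><body><h1 id="c"></h1><script>
setInterval(async () => {
  document.getElementById("c").textContent =
    await (await fetch("/caption")).text();
}, 1000);
</script></body></html>`)
	})

	// Return the most recent caption as plain text.
	http.HandleFunc("/caption", func(w http.ResponseWriter, r *http.Request) {
		mu.RLock()
		defer mu.RUnlock()
		fmt.Fprint(w, caption)
	})

	log.Fatal(http.ListenAndServe("localhost:8080", nil))
}
```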
