Description
This issue reports observations from experiments I ran while working on MVP 2 to better understand how well Qwen3-VL can one-shot elaborate tasks on longer video files. The findings are also very relevant to MVP 1.
I tried Qwen3-VL 4B, 8B and 32B on sample videos, both via a standalone testing script and as part of FrameSense.
Main take-aways
- Best scenario so far: after a bit of prompt refinement, Qwen3-VL-4B can perfectly detect the separators in a 16-minute video and in another 10-minute video, producing JSON output that includes the timecode, the title shown on the separator, the year and the producer.
- Experiments on other or longer videos produce repetitive output loops and hallucinated separators.
- TODO: add more
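To make the "perfect detection" claim easy to re-check across seeds and models, a small validator for the expected JSON can help. The field names below are illustrative (taken from the description above); the actual schema is whatever the refined prompt specifies:

```python
import json

# Illustrative schema: one entry per detected separator card.
REQUIRED = {"timecode", "title", "year", "producer"}

def validate_separators(raw: str):
    """Parse model output and check each separator entry has the expected fields."""
    entries = json.loads(raw)
    bad = [e for e in entries if not REQUIRED <= e.keys()]
    if bad:
        raise ValueError(f"{len(bad)} entries missing fields from {sorted(REQUIRED)}")
    return entries

# Hypothetical example of a valid model response.
sample = '[{"timecode": "00:03:12", "title": "Opening", "year": 1987, "producer": "ACME"}]'
```

Running this on each model response would turn "looks right" into a pass/fail signal when redoing the tests with different seeds.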
TODO
- Redo the best tests with different seeds to check the robustness of the setup; ideally we want something that works well on most seeds.
- Test the best prompt with the 32B model on a variety of long videos.
- Test the 4B/8B models with FrameSense on all sample videos and analyse the results.
Sub-Problem 1: difficulty running larger models with FrameSense
FrameSense is very portable, scales to any number of videos and only processes when needed. It runs models inside Singularity containers. Through FrameSense, I could not process a 30-minute video with Qwen3-VL-32B, yet the same task succeeds with the test script outside Singularity. Why?
Another oddity: the 4B model can process some videos but not others, even though they are approximately the same length.
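One hypothesis worth testing: the containerised run may see less shared memory or a different GPU visibility than the bare-metal run. A quick stdlib probe to execute both inside and outside the Singularity container (the paths and environment variable are the usual Linux/NVIDIA conventions, not verified for this cluster):

```python
import os
import shutil

def shm_free_bytes(path="/dev/shm"):
    """Free shared memory at the given mount, or None if absent (e.g. non-Linux)."""
    return shutil.disk_usage(path).free if os.path.isdir(path) else None

def visible_gpus():
    """Value of CUDA_VISIBLE_DEVICES seen by this process, or None if unset."""
    return os.environ.get("CUDA_VISIBLE_DEVICES")
```

If the two environments report different numbers, that would narrow down why the 32B model only runs outside the container.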
Sub-Problem 2: difficulty using quantised models
Quantised models should be very helpful, as they run faster with less VRAM. Unfortunately, FP8 and AWQ variants of Qwen3-VL are not yet supported by the Transformers Python library. Quantised models are available on the Ollama platform, but the Ollama API doesn't support video input. I also had no luck with the vLLM or llama.cpp engines.
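Since the Ollama API accepts base64-encoded images but not video, one workaround worth trying (untested here) is to sample frames with ffmpeg and send them as images to `/api/generate`. A sketch, assuming a local Ollama at the default port; the model tag and frame rate are placeholders:

```python
import base64
import json
import urllib.request
from pathlib import Path

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODEL = "qwen3-vl:4b"  # hypothetical tag; check `ollama list` for the real one

def ffmpeg_cmd(video, out_dir, fps=0.2):
    """ffmpeg command sampling one frame every 5 s (fps=0.2) as JPEGs."""
    return ["ffmpeg", "-i", str(video), "-vf", f"fps={fps}",
            str(Path(out_dir) / "frame_%05d.jpg")]

def build_payload(prompt, frame_paths):
    """Ollama /api/generate payload with base64-encoded frames as images."""
    images = [base64.b64encode(Path(p).read_bytes()).decode() for p in frame_paths]
    return {"model": MODEL, "prompt": prompt, "images": images, "stream": False}

def ask_ollama(payload):
    """POST the payload and return the model's text response."""
    req = urllib.request.Request(
        OLLAMA_URL, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

This loses the temporal grounding the video pipeline provides (the model only sees discrete frames), so timecode accuracy would need to be reconstructed from the frame sampling rate.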