Description
This issue reports observations from experiments I ran while working on MVP 2 to better understand how well Qwen3-VL can one-shot elaborate tasks on longer video files. The findings are also very relevant to MVP 1.
I tried Qwen3-VL 4B, 8B and 32B on sample videos, both via a standalone testing script and as part of FrameSense.
Main take-aways
- Best scenario so far: after a bit of prompt refinement, Qwen3-VL-4B can perfectly detect the separators in a 16-minute video and in another 10-minute video, producing JSON output that includes the timecode, the title shown on the separator, the year and the producer.
- Experiments on other or longer videos produce repetitive output loops and hallucinated separators.
- TODO: add more
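To make the "perfect detection" claim easy to re-check across seeds and models, a small validator for the expected JSON can help. The field names below are illustrative (taken from the description above); the actual schema is whatever the refined prompt specifies:

```python
import json

# Illustrative schema: one entry per detected separator card.
REQUIRED = {"timecode", "title", "year", "producer"}

def validate_separators(raw: str):
    """Parse model output and check each separator entry has the expected fields."""
    entries = json.loads(raw)
    bad = [e for e in entries if not REQUIRED <= e.keys()]
    if bad:
        raise ValueError(f"{len(bad)} entries missing fields from {sorted(REQUIRED)}")
    return entries

# Hypothetical example of a valid model response.
sample = '[{"timecode": "00:03:12", "title": "Opening", "year": 1987, "producer": "ACME"}]'
```

Running this on each model response would turn "looks right" into a pass/fail signal when redoing the tests with different seeds.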
TODO
- Redo the best tests with different seeds to check the robustness of the setup; ideally we want something that works well on most seeds.
- Test the best prompt with the 32B model on a variety of long videos.
- Test the 4B/8B models with FrameSense on all sample videos and analyse the results.
Sub-Problem 1: difficulty running larger models with FrameSense
FrameSense is very portable, scales to any number of videos and only processes when needed. It runs models inside Singularity containers. Through FrameSense, I could not process a 30-minute video with Qwen3-VL-32B, yet the same task succeeds with the test script outside Singularity. Why?
Another oddity: the 4B model can process some videos but not others, even though they are approximately the same length.
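One hypothesis worth testing: the containerised run may see less shared memory or a different GPU visibility than the bare-metal run. A quick stdlib probe to execute both inside and outside the Singularity container (the paths and environment variable are the usual Linux/NVIDIA conventions, not verified for this cluster):

```python
import os
import shutil

def shm_free_bytes(path="/dev/shm"):
    """Free shared memory at the given mount, or None if absent (e.g. non-Linux)."""
    return shutil.disk_usage(path).free if os.path.isdir(path) else None

def visible_gpus():
    """Value of CUDA_VISIBLE_DEVICES seen by this process, or None if unset."""
    return os.environ.get("CUDA_VISIBLE_DEVICES")
```

If the two environments report different numbers, that would narrow down why the 32B model only runs outside the container.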
Sub-Problem 2: difficulty using quantised models
Quantised models should be very helpful, as they run faster with less VRAM. Unfortunately, FP8 and AWQ variants of Qwen3-VL are not yet supported by the Transformers Python library. Quantised models are available on the Ollama platform, but the Ollama API doesn't support video input. I also had no luck with the vLLM or llama.cpp engines.
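Since the Ollama API accepts base64-encoded images but not video, one workaround worth trying (untested here) is to sample frames with ffmpeg and send them as images to `/api/generate`. A sketch, assuming a local Ollama at the default port; the model tag and frame rate are placeholders:

```python
import base64
import json
import urllib.request
from pathlib import Path

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODEL = "qwen3-vl:4b"  # hypothetical tag; check `ollama list` for the real one

def ffmpeg_cmd(video, out_dir, fps=0.2):
    """ffmpeg command sampling one frame every 5 s (fps=0.2) as JPEGs."""
    return ["ffmpeg", "-i", str(video), "-vf", f"fps={fps}",
            str(Path(out_dir) / "frame_%05d.jpg")]

def build_payload(prompt, frame_paths):
    """Ollama /api/generate payload with base64-encoded frames as images."""
    images = [base64.b64encode(Path(p).read_bytes()).decode() for p in frame_paths]
    return {"model": MODEL, "prompt": prompt, "images": images, "stream": False}

def ask_ollama(payload):
    """POST the payload and return the model's text response."""
    req = urllib.request.Request(
        OLLAMA_URL, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

This loses the temporal grounding the video pipeline provides (the model only sees discrete frames), so timecode accuracy would need to be reconstructed from the frame sampling rate.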