Hi,
Congrats on your amazing work & thank you for releasing the code!
I have a question about the ControlNet conditioning.
In the paper, it says:
“2D trajectories Gaussian heatmap and concatenate the trajectories, instance points, and depth points to serve as control signal, which is injected into the Stable Video Diffusion (SVD) [5] using ControlNet.”
and also, in the Gradio demo code (gradio_run.py), I can see depth maps (DepthAnythingV2) and instance masks (SAM) being extracted.
However, in the main pipeline (pipeline_stable_video_diffusion_mask_control.py L452–457), only the 2D trajectory Gaussian heatmap appears to be passed as controlnet_cond.
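To make sure I'm reading the paper correctly, here is roughly what I expected the control signal to look like. This is just my own sketch based on the paper's wording, not your code, and the tensor shapes/variable names are made up:

```python
import torch

# My reading of the paper (a sketch, not the repo's actual code):
# the per-frame control maps are concatenated along the channel axis,
# and the result is what gets passed as controlnet_cond.

# hypothetical tensors, shape (batch, frames, 1, H, W) each
traj_heatmap = torch.randn(1, 14, 1, 64, 64)   # 2D trajectory Gaussian heatmap
instance_map = torch.randn(1, 14, 1, 64, 64)   # instance points (from SAM masks)
depth_map    = torch.randn(1, 14, 1, 64, 64)   # depth points (from DepthAnythingV2)

controlnet_cond = torch.cat([traj_heatmap, instance_map, depth_map], dim=2)
# -> (1, 14, 3, 64, 64), i.e. a 3-channel control signal per frame
```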
Could you clarify how the depth and instance signals are injected into ControlNet?
Thanks in advance.