
Commit df19906

[wwb] Update model and add metrics for video generation similarity (openvinotoolkit#3286)
## Description

- replace the video embedding model: llava-hf/LLaVA-NeXT-Video-7B-hf -> google/vivit-b-16x2
- add SSIM, LPIPS, and temporal LPIPS as additional metrics
- change the default num_frames from 25 to 33
- add usage info to the README
- restore the try/except around the subprocess call; without it, only the error code is printed, not the error message (part of the discussion in the [PR with test updates](openvinotoolkit#3139 (comment)))

Results:

- HF vs optimum-intel FP32: 0.984067
- HF vs optimum-intel FP16: 0.902521
- HF vs optimum-intel INT8: 0.884154

```
INFO:__main__:TOP WORST RESULTS
INFO:__main__:=======================================================================================================
INFO:__main__:Top-1 example:
INFO:__main__:
similarity                               0.9603809714317322
SSIM (higher is better, 1.0 best)        0.9669264730508552
LPIPS (lower is better, 0 best)          0.017310822353465483
tLPIPS (lower is better, 0 best)         0.044467662431059346
prompts                                  A man walks towards a window, looks out, and then turns around. He has short, dark hair, dark skin, and is wearing a brown coat over a red and gray scarf. He walks from left to right towards a window, his gaze fixed on something outside. The camera follows him from behind at a medium distance. The room is brightly lit, with white walls and a large window covered by a white curtain. As he approaches the window, he turns his head slightly to the left, then back to the right. He then turns his entire body to the right, facing the window. The camera remains stationary as he stands in front of the window. The scene is captured in real-life footage.
source_model                             ltx_video_hf_33/target/video_7.mp4
optimized_model                          ltx_video_opt_fp32_33/target/video_7.mp4
INFO:__main__:=======================================================================================================
INFO:__main__:Top-2 example:
INFO:__main__:
similarity                               0.9789813756942749
SSIM (higher is better, 1.0 best)        0.975887835252474
LPIPS (lower is better, 0 best)          0.012982440443011
tLPIPS (lower is better, 0 best)         0.09715526142427998
prompts                                  A woman with light skin, wearing a blue jacket and a black hat with a veil, looks down and to her right, then back up as she speaks; she has brown hair styled in an updo, light brown eyebrows, and is wearing a white collared shirt under her jacket; the camera remains stationary on her face as she speaks; the background is out of focus, but shows trees and people in period clothing; the scene is captured in real-life footage.
source_model                             ltx_video_hf_33/target/video_0.mp4
optimized_model                          ltx_video_opt_fp32_33/target/video_0.mp4
INFO:__main__:=======================================================================================================
INFO:__main__:Top-3 example:
INFO:__main__:
similarity                               0.980428159236908
SSIM (higher is better, 1.0 best)        0.987034910704742
LPIPS (lower is better, 0 best)          0.009701590897748247
tLPIPS (lower is better, 0 best)         0.2595951638875469
prompts                                  A woman with blonde hair styled up, wearing a black dress with sequins and pearl earrings, looks down with a sad expression on her face. The camera remains stationary, focused on the woman's face. The lighting is dim, casting soft shadows on her face. The scene appears to be from a movie or TV show.
source_model                             ltx_video_hf_33/target/video_1.mp4
optimized_model                          ltx_video_opt_fp32_33/target/video_1.mp4
```

CVS-###

Fixes #(issue)

## Checklist:

- [ ] This PR follows GenAI Contributing guidelines.
- [ ] Tests have been updated or added to cover the new code.
- [ ] This PR fully addresses the ticket.
- [ ] I have made corresponding changes to the documentation.
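The similarity numbers above are mean cosine similarities between per-video embeddings (one pooled ViViT embedding per video, compared pairwise between the reference and optimized runs). A minimal numpy-only sketch of that reduction; the helper name is illustrative, not part of wwb:

```python
import numpy as np


def mean_pairwise_cosine(gold_emb: np.ndarray, pred_emb: np.ndarray) -> float:
    """Average cosine similarity between matching rows (video i vs video i).

    Equivalent to averaging the diagonal of the full cosine-similarity
    matrix, without computing the off-diagonal entries.
    """
    gold_n = gold_emb / np.linalg.norm(gold_emb, axis=1, keepdims=True)
    pred_n = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    # Row-wise dot products of unit vectors = per-pair cosine similarity
    return float(np.mean(np.sum(gold_n * pred_n, axis=1)))


gold = np.array([[1.0, 0.0], [0.0, 2.0]])
print(mean_pairwise_cosine(gold, gold))  # identical embeddings -> 1.0
```

A score of 1.0 means the optimized model's videos embed identically to the reference; the FP32/FP16/INT8 drop above reflects growing embedding drift.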
1 parent 539d108 commit df19906

File tree

6 files changed

+183
-51
lines changed


tools/who_what_benchmark/README.md

Lines changed: 16 additions & 0 deletions
````diff
@@ -128,6 +128,22 @@ wwb --base-model BAAI/bge-small-en-v1.5 --gt-data embed_test/gt.csv --model-type
 wwb --target-model ./bge-small-en-v1.5 --gt-data embed_test/gt.csv --model-type text-embedding --embeds_pooling mean --embeds_normalize --embeds_padding_side "left" --genai
 ```
 
+### Compare Text-to-video models
+```sh
+# Export the model to OpenVINO; pick the weight format with --weight-format, e.g. fp32/fp16/int8
+optimum-cli export openvino -m Lightricks/LTX-Video --weight-format fp32 ltx-video-model
+# Collect the references and save the mapping in the .csv file.
+# Reference videos are stored in the "reference" subfolder next to the .csv file.
+wwb --base-model Lightricks/LTX-Video --gt-data video_gen_test/gt.csv --model-type text-to-video --hf
+# Compute the metrics.
+# Target videos are stored in the "target" subfolder next to the .csv file.
+# You can also pass --output [custom folder]; the target videos and the corresponding CSV files with metrics are then saved to that folder.
+# Compute metrics with optimum-intel
+wwb --target-model ltx-video-model --gt-data video_gen_test/gt.csv --model-type text-to-video --output ltx_video_optimum
+# Compute metrics with GenAI
+wwb --target-model ltx-video-model --gt-data video_gen_test/gt.csv --model-type text-to-video --genai --output ltx_video_genai
+```
+
 ### API
 The API provides a way to investigate the worst generated text examples.
````

tools/who_what_benchmark/requirements.txt

Lines changed: 2 additions & 0 deletions
```diff
@@ -20,3 +20,5 @@ soundfile
 librosa
 vocos
 vector_quantize_pytorch
+lpips
+scikit-image
```

tools/who_what_benchmark/tests/conftest.py

Lines changed: 16 additions & 7 deletions
```diff
@@ -96,7 +96,12 @@ def convert(temp_path: Path) -> None:
         logger.info(f"Conversion command: {' '.join(command)}")
         retry_request(lambda: subprocess.run(command, check=True, text=True, capture_output=True))
 
-    manager.execute(convert)
+    try:
+        manager.execute(convert)
+    except subprocess.CalledProcessError as error:
+        logger.exception(f"optimum-cli returned {error.returncode}. Output:\n{error.stderr}")
+        raise
+
     return str(model_path)
 
 
@@ -124,9 +129,13 @@ def run_wwb(args: list[str], env=None):
     if env:
         base_env.update(env)
 
-    return subprocess.check_output(
-        command,
-        stderr=subprocess.STDOUT,
-        encoding="utf-8",
-        env=base_env,
-    )
+    try:
+        return subprocess.check_output(
+            command,
+            stderr=subprocess.STDOUT,
+            encoding="utf-8",
+            env=base_env,
+        )
+    except subprocess.CalledProcessError as error:
+        logger.error(f"'{' '.join(map(str, command))}' returned {error.returncode}. Output:\n{error.output}")
+        raise
```
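The point of the restored try/except: `subprocess.CalledProcessError`'s own message only names the command and exit code, so the output captured via `stderr=subprocess.STDOUT` is silently dropped unless it is logged before re-raising. A standalone sketch of the pattern (the helper name is illustrative):

```python
import subprocess
import sys


def run_logged(command: list[str]) -> str:
    """Run a command; on failure, print the captured output before re-raising.

    Without the except branch, CalledProcessError reports only the return
    code, and the captured stdout/stderr text would be lost.
    """
    try:
        return subprocess.check_output(command, stderr=subprocess.STDOUT, encoding="utf-8")
    except subprocess.CalledProcessError as error:
        # error.output holds the combined stdout+stderr text of the failed run
        print(f"'{' '.join(command)}' returned {error.returncode}. Output:\n{error.output}")
        raise


print(run_logged([sys.executable, "-c", "print('ok')"]).strip())  # ok
```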

tools/who_what_benchmark/whowhatbench/text2video_evaluator.py

Lines changed: 6 additions & 4 deletions
```diff
@@ -16,7 +16,7 @@
 
 @register_evaluator("text-to-video")
 class Text2VideoEvaluator(BaseEvaluator):
-    DEF_NUM_FRAMES = 25
+    DEF_NUM_FRAMES = 33
     DEF_NUM_INF_STEPS = 25
     DEF_FRAME_RATE = 25
     DEF_WIDTH = 704
@@ -31,7 +31,7 @@ def __init__(
         test_data: Union[str, list] = None,
         metrics="similarity",
         num_inference_steps=25,
-        num_frames=25,
+        num_frames=33,
         crop_prompts=True,
         num_samples=None,
         gen_video_fn=None,
@@ -96,9 +96,11 @@ def worst_examples(self, top_k: int = 5, metric="similarity"):
         assert self.last_cmp is not None
 
         res = self.last_cmp.nsmallest(top_k, metric)
-        res = list(row for idx, row in res.iterrows())
+        formatted_res = []
+        for _, row in res.iterrows():
+            formatted_res.append("\n".join(f"{col:<40} {val}" for col, val in row.items()))
 
-        return res
+        return formatted_res
 
     def collect_default_data(self):
         from importlib.resources import files
```
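The new `worst_examples` formatting relies on Python's format alignment: `{col:<40}` left-aligns each column name in a 40-character field, so all values start in the same column, which is exactly the layout visible in the log excerpt in the PR description. A quick illustration with sample data only:

```python
def format_row(row: dict) -> str:
    # `{col:<40}` pads each key to 40 characters, aligning the values.
    return "\n".join(f"{col:<40} {val}" for col, val in row.items())


print(format_row({"similarity": 0.9604, "SSIM (higher is better, 1.0 best)": 0.9669}))
# similarity                               0.9604
# SSIM (higher is better, 1.0 best)        0.9669
```

Keys longer than 40 characters would push their value out of alignment, but the metric names used here all fit.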

tools/who_what_benchmark/whowhatbench/whowhat_metrics.py

Lines changed: 139 additions & 39 deletions
```diff
@@ -8,10 +8,12 @@
 import torch
 import torch.nn.functional as F
 
+import cv2
 import numpy as np
 from sentence_transformers import SentenceTransformer, util
 from transformers import CLIPImageProcessor, CLIPModel
 from tqdm import tqdm
+from skimage.metrics import structural_similarity
 from sklearn.metrics.pairwise import cosine_similarity
 
 
@@ -236,55 +238,153 @@ def evaluate(self, data_gold, data_prediction):
 
 class VideoSimilarity:
     def __init__(self) -> None:
-        from transformers import LlavaNextVideoProcessor, LlavaNextVideoModel
+        from transformers import VivitImageProcessor, VivitModel
+
+        self.embeds_model = "google/vivit-b-16x2"
+        self.embeds_model_frame_num = 32
+        self.processor = VivitImageProcessor.from_pretrained(self.embeds_model)
+        self.model = VivitModel.from_pretrained(self.embeds_model).eval()
+
+        import lpips
+
+        # alex - faster; vgg - more rigorous assessments; to check when collecting statistics
+        self.lpips_model = lpips.LPIPS(net="alex").to("cpu")
+
+    def load_video_frames(self, video_path: str, num_frames: int | None = None):
+        cap = cv2.VideoCapture(video_path)
+        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+
+        # Adjust the frame count to match the required num_frames:
+        # interpolate if the video has fewer frames, truncate if it has more
+        frame_idxs = np.arange(total_frames)
+        if num_frames and num_frames > total_frames:
+            frame_idxs = np.linspace(0, total_frames - 1, num_frames).astype(int)
+        elif num_frames and num_frames < total_frames:
+            frame_idxs = np.arange(num_frames)
+
+        frames = []
+        for i in range(total_frames):
+            ret, frame = cap.read()
+            if not ret:
+                break
+
+            frame_count = np.count_nonzero(frame_idxs == i)
+            if frame_count == 0:
+                continue
+            # if total_frames is less than the required num_frames, duplicate some of them
+            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
+            for _ in range(frame_count):
+                frames.append(frame)
+
+        cap.release()
+        return np.stack(frames)
+
+    def get_embedding(self, gold_video: np.ndarray, predicted_video: np.ndarray):
+        gold_inputs = self.processor(list(gold_video), return_tensors="pt")
+        with torch.no_grad():
+            gold_emb = self.model(**gold_inputs).last_hidden_state[:, 0, :]
 
-        self.processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
-        self.model = LlavaNextVideoModel.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf").eval()
+            predicted_inputs = self.processor(list(predicted_video), return_tensors="pt")
+            predicted_emb = self.model(**predicted_inputs).last_hidden_state[:, 0, :]
 
-    def get_pixel_values_videos(self, video):
-        # according to pre processing of inputs in get_video_features of LlavaNextVideoModel
-        # https://github.com/huggingface/transformers/blob/v4.53.2/src/transformers/models/llava_next_video/modular_llava_next_video.py#L381
-        inputs = self.processor.video_processor(videos=video, return_tensors="pt")["pixel_values_videos"]
-        batch_size, frames, channels, height, width = inputs.shape
-        pixel_values_videos = inputs.reshape(batch_size * frames, channels, height, width)
-        return pixel_values_videos
+        cos_sim_all = cosine_similarity(gold_emb.detach().numpy(), predicted_emb.detach().numpy())
+        return np.mean(np.diag(cos_sim_all))
 
-    def get_video_features(self, pixel_values_videos):
-        layer_idx = self.model.config.vision_feature_layer
-        with torch.no_grad():
-            # output shape (batch, patches, hidden_dim)
-            outputs = self.model.vision_tower(pixel_values_videos, output_hidden_states=True)
-            # according to post processing of outputs in get_video_features of LlavaNextVideoModel
-            # https://github.com/huggingface/transformers/blob/v4.53.2/src/transformers/models/llava_next_video/modular_llava_next_video.py#L387
-            outputs = outputs.hidden_states[layer_idx][:, 1:]
-        return outputs.mean(dim=2)
+    def convert_frame_to_lpips_tensor(self, frame: np.ndarray) -> torch.Tensor:
+        tensor = torch.from_numpy(frame).float().permute(2, 0, 1) / 255.0
+        # [0, 1] -> [-1, 1]
+        tensor = tensor * 2 - 1
+        return tensor.unsqueeze(0)
 
-    def load_video_frames(self, video_path):
-        import imageio.v3 as iio
+    def get_lpips(self, gold_video: np.ndarray, pred_video: np.ndarray):
+        lpips_scores = []
+        for i, gold_frame in enumerate(gold_video):
+            pred_frame = pred_video[i]
+            pred_lpips_frame = self.convert_frame_to_lpips_tensor(pred_frame)
+            gold_lpips_frame = self.convert_frame_to_lpips_tensor(gold_frame)
 
-        frames = iio.imread(video_path, plugin="pyav")
-        return [Image.fromarray(frame).convert("RGB") for frame in frames]
+            with torch.no_grad():
+                score = self.lpips_model(gold_lpips_frame, pred_lpips_frame)
 
-    def evaluate(self, gt, prediction):
-        videos_gold = gt["videos"].values
-        videos_prediction = prediction["videos"].values
+            lpips_scores.append(score.item())
+
+        return np.mean(lpips_scores)
+
+    def get_frame_differences_for_tlpips(self, video: np.ndarray) -> np.ndarray:
+        differences = []
+        for i in range(len(video) - 1):
+            diff = video[i + 1].astype(np.float32) - video[i].astype(np.float32)
+            differences.append(diff)
+
+        # Convert to a tensor in the [-1, 1] range, as LPIPS expects
+        differences = torch.Tensor(np.array(differences))
+        max_diff = differences.abs().max()
+        # if no changes occurred, max_diff will be 0; no normalization is required
+        if max_diff.item() != 0.0:
+            differences = differences / max_diff
+        return differences.permute(0, 3, 1, 2)
+
+    def get_temporal_lpips(self, gold_video: np.ndarray, pred_video: np.ndarray):
+        """
+        Temporal LPIPS: compares MOTION (frame differences) between videos
+        """
+        gold_video_diff = self.get_frame_differences_for_tlpips(gold_video)
+        pred_video_diff = self.get_frame_differences_for_tlpips(pred_video)
+
+        temporal_lpips_scores = []
+        for i in range(len(gold_video_diff)):
+            gold_tensor = gold_video_diff[i].unsqueeze(0)
+            pred_tensor = pred_video_diff[i].unsqueeze(0)
+
+            with torch.no_grad():
+                score = self.lpips_model(gold_tensor, pred_tensor)
+
+            temporal_lpips_scores.append(score.item())
+
+        return np.mean(temporal_lpips_scores)
+
+    def get_ssim(self, gold_video: np.ndarray, pred_video: np.ndarray):
+        ssim_vals = []
+        for i, gold_frame in enumerate(gold_video):
+            gf_gray = cv2.cvtColor(gold_frame, cv2.COLOR_RGB2GRAY)
+            pred_frame = pred_video[i]
+            pf_gray = cv2.cvtColor(pred_frame, cv2.COLOR_RGB2GRAY)
+
+            ssim_vals.append(structural_similarity(gf_gray, pf_gray))
+
+        return ssim_vals
+
+    def evaluate(self, data_gold, data_prediction):
+        videos_gold = data_gold["videos"].values
+        videos_prediction = data_prediction["videos"].values
 
         metric_per_video = []
-        metric_per_frames_per_video = []
-        for gold, pred in tqdm(zip(videos_gold, videos_prediction), desc="Video Similarity evaluation"):
-            gold_video = self.load_video_frames(gold)
-            prediction_video = self.load_video_frames(pred)
+        ssim_per_video = []
+        lpips_per_video = []
+        tlpips_per_video = []
+        for gold, prediction in tqdm(zip(videos_gold, videos_prediction), desc="Video Similarity evaluation"):
+            # vivit requires 32 frames
+            gold_video = self.load_video_frames(str(gold), num_frames=self.embeds_model_frame_num)
+            predicted_video = self.load_video_frames(str(prediction), num_frames=self.embeds_model_frame_num)
+
+            cos_sim_mean = self.get_embedding(gold_video, predicted_video)
 
-            gold_inputs_pixel_values = self.get_pixel_values_videos(gold_video)
-            prediction_inputs_pixel_values = self.get_pixel_values_videos(prediction_video)
+            ssim_vals = self.get_ssim(gold_video, predicted_video)
+            ssim_avg = sum(ssim_vals) / len(ssim_vals)
 
-            gold_outputs = self.get_video_features(gold_inputs_pixel_values)
-            prediction_outputs = self.get_video_features(prediction_inputs_pixel_values)
+            lpips = self.get_lpips(gold_video, predicted_video)
+            tlpips = self.get_temporal_lpips(gold_video, predicted_video)
 
-            cos_sim_all = cosine_similarity(prediction_outputs, gold_outputs)
-            cos_sim_frames = np.array([cos_sim_all[i, i] for i in range(len(gold_video))])
-            metric_per_video.append(np.mean(cos_sim_frames))
-            metric_per_frames_per_video.append(cos_sim_frames)
+            metric_per_video.append(cos_sim_mean)
+            ssim_per_video.append(ssim_avg)
+            lpips_per_video.append(lpips)
+            tlpips_per_video.append(tlpips)
 
         metric_dict = {"similarity": np.mean(metric_per_video)}
-        return metric_dict, {"similarity": metric_per_video, "per_frame": metric_per_frames_per_video}
+        return metric_dict, {
+            "similarity": metric_per_video,
+            "SSIM (higher is better, 1.0 best)": ssim_per_video,
+            "LPIPS (lower is better, 0 best)": lpips_per_video,
+            "tLPIPS (lower is better, 0 best)": tlpips_per_video,
+        }
```

tools/who_what_benchmark/whowhatbench/wwb.py

Lines changed: 4 additions & 1 deletion
```diff
@@ -796,18 +796,20 @@ def print_image_results(evaluator):
     pd.set_option('display.max_colwidth', None)
     worst_examples = evaluator.worst_examples(
         top_k=5, metric=metric_of_interest)
+    logger.info("TOP WORST RESULTS")
     for i, e in enumerate(worst_examples):
         logger.info(
             "======================================================================================================="
         )
         logger.info(f"Top-{i+1} example:")
-        logger.info(e)
+        logger.info(f"\n{e}")
 
 
 def print_embeds_results(evaluator):
     metric_of_interest = "similarity"
     worst_examples = evaluator.worst_examples(
         top_k=5, metric=metric_of_interest)
+    logger.info("TOP WORST RESULTS")
     for i, e in enumerate(worst_examples):
         logger.info(
             "======================================================================================================="
@@ -821,6 +823,7 @@ def print_rag_results(evaluator):
     metric_of_interest = "similarity"
     worst_examples = evaluator.worst_examples(
         top_k=5, metric=metric_of_interest)
+    logger.info("TOP WORST RESULTS")
     for i, e in enumerate(worst_examples):
         logger.info(
             "======================================================================================================="
```
