
Commit df19906

[wwb] Update model and add metrics for video generation similarity (openvinotoolkit#3286)
## Description

- replace the video embedding model: llava-hf/LLaVA-NeXT-Video-7B-hf -> google/vivit-b-16x2
- add SSIM, LPIPS, and temporal LPIPS as additional metrics
- change the default num_frames from 25 to 33
- add usage info to the README
- restore the try/except around the subprocess call; without it, only the error code is printed, not the error message (part of the discussion in the [PR with test updates](openvinotoolkit#3139 (comment)))

Results:

- HF vs optimum-intel FP32: 0.984067
- HF vs optimum-intel FP16: 0.902521
- HF vs optimum-intel INT8: 0.884154

```
INFO:__main__:TOP WORST RESULTS
INFO:__main__:=======================================================================================================
INFO:__main__:Top-1 example:
INFO:__main__:
similarity                               0.9603809714317322
SSIM (higher is better, 1.0 best)        0.9669264730508552
LPIPS (lower is better, 0 best)          0.017310822353465483
tLPIPS (lower is better, 0 best)         0.044467662431059346
prompts                                  A man walks towards a window, looks out, and then turns around. He has short, dark hair, dark skin, and is wearing a brown coat over a red and gray scarf. He walks from left to right towards a window, his gaze fixed on something outside. The camera follows him from behind at a medium distance. The room is brightly lit, with white walls and a large window covered by a white curtain. As he approaches the window, he turns his head slightly to the left, then back to the right. He then turns his entire body to the right, facing the window. The camera remains stationary as he stands in front of the window. The scene is captured in real-life footage.
source_model                             ltx_video_hf_33/target/video_7.mp4
optimized_model                          ltx_video_opt_fp32_33/target/video_7.mp4
INFO:__main__:=======================================================================================================
INFO:__main__:Top-2 example:
INFO:__main__:
similarity                               0.9789813756942749
SSIM (higher is better, 1.0 best)        0.975887835252474
LPIPS (lower is better, 0 best)          0.012982440443011
tLPIPS (lower is better, 0 best)         0.09715526142427998
prompts                                  A woman with light skin, wearing a blue jacket and a black hat with a veil, looks down and to her right, then back up as she speaks; she has brown hair styled in an updo, light brown eyebrows, and is wearing a white collared shirt under her jacket; the camera remains stationary on her face as she speaks; the background is out of focus, but shows trees and people in period clothing; the scene is captured in real-life footage.
source_model                             ltx_video_hf_33/target/video_0.mp4
optimized_model                          ltx_video_opt_fp32_33/target/video_0.mp4
INFO:__main__:=======================================================================================================
INFO:__main__:Top-3 example:
INFO:__main__:
similarity                               0.980428159236908
SSIM (higher is better, 1.0 best)        0.987034910704742
LPIPS (lower is better, 0 best)          0.009701590897748247
tLPIPS (lower is better, 0 best)         0.2595951638875469
prompts                                  A woman with blonde hair styled up, wearing a black dress with sequins and pearl earrings, looks down with a sad expression on her face. The camera remains stationary, focused on the woman's face. The lighting is dim, casting soft shadows on her face. The scene appears to be from a movie or TV show.
source_model                             ltx_video_hf_33/target/video_1.mp4
optimized_model                          ltx_video_opt_fp32_33/target/video_1.mp4
```

CVS-###

Fixes #(issue)

## Checklist:

- [ ] This PR follows GenAI Contributing guidelines.
- [ ] Tests have been updated or added to cover the new code.
- [ ] This PR fully addresses the ticket.
- [ ] I have made corresponding changes to the documentation.
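The similarity numbers above are mean cosine similarities between per-video embeddings (one pooled ViViT embedding per video, compared pairwise between the reference and optimized runs). A minimal numpy-only sketch of that reduction; the helper name is illustrative, not part of wwb:

```python
import numpy as np


def mean_pairwise_cosine(gold_emb: np.ndarray, pred_emb: np.ndarray) -> float:
    """Average cosine similarity between matching rows (video i vs video i).

    Equivalent to averaging the diagonal of the full cosine-similarity
    matrix, without computing the off-diagonal entries.
    """
    gold_n = gold_emb / np.linalg.norm(gold_emb, axis=1, keepdims=True)
    pred_n = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    # Row-wise dot products of unit vectors = per-pair cosine similarity
    return float(np.mean(np.sum(gold_n * pred_n, axis=1)))


gold = np.array([[1.0, 0.0], [0.0, 2.0]])
print(mean_pairwise_cosine(gold, gold))  # identical embeddings -> 1.0
```

A score of 1.0 means the optimized model's videos embed identically to the reference; the FP32/FP16/INT8 drop above reflects growing embedding drift.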
1 parent 539d108 commit df19906

File tree

6 files changed

+183
-51
lines changed


tools/who_what_benchmark/README.md

Lines changed: 16 additions & 0 deletions
````diff
@@ -128,6 +128,22 @@ wwb --base-model BAAI/bge-small-en-v1.5 --gt-data embed_test/gt.csv --model-type
 wwb --target-model ./bge-small-en-v1.5 --gt-data embed_test/gt.csv --model-type text-embedding --embeds_pooling mean --embeds_normalize --embeds_padding_side "left" --genai
 ```
 
+### Compare Text-to-video models
+```sh
+# Export the model to OpenVINO; pick the weight format with --weight-format, e.g. fp32/fp16/int8
+optimum-cli export openvino -m Lightricks/LTX-Video --weight-format fp32 ltx-video-model
+# Collect the references and save the mapping in the .csv file.
+# Reference videos are stored in the "reference" subfolder next to the .csv file.
+wwb --base-model Lightricks/LTX-Video --gt-data video_gen_test/gt.csv --model-type text-to-video --hf
+# Compute the metrics.
+# Target videos are stored in the "target" subfolder next to the .csv file.
+# You can also pass --output [custom folder]; the target videos and the corresponding CSV files with metrics are then saved to that folder.
+# Compute metrics with optimum-intel
+wwb --target-model ltx-video-model --gt-data video_gen_test/gt.csv --model-type text-to-video --output ltx_video_optimum
+# Compute metrics with GenAI
+wwb --target-model ltx-video-model --gt-data video_gen_test/gt.csv --model-type text-to-video --genai --output ltx_video_genai
+```
+
 ### API
 The API provides a way to investigate the worst generated text examples.
````

tools/who_what_benchmark/requirements.txt

Lines changed: 2 additions & 0 deletions
```diff
@@ -20,3 +20,5 @@ soundfile
 librosa
 vocos
 vector_quantize_pytorch
+lpips
+scikit-image
```

tools/who_what_benchmark/tests/conftest.py

Lines changed: 16 additions & 7 deletions
```diff
@@ -96,7 +96,12 @@ def convert(temp_path: Path) -> None:
         logger.info(f"Conversion command: {' '.join(command)}")
         retry_request(lambda: subprocess.run(command, check=True, text=True, capture_output=True))
 
-    manager.execute(convert)
+    try:
+        manager.execute(convert)
+    except subprocess.CalledProcessError as error:
+        logger.exception(f"optimum-cli returned {error.returncode}. Output:\n{error.stderr}")
+        raise
+
     return str(model_path)
 
 
@@ -124,9 +129,13 @@ def run_wwb(args: list[str], env=None):
     if env:
         base_env.update(env)
 
-    return subprocess.check_output(
-        command,
-        stderr=subprocess.STDOUT,
-        encoding="utf-8",
-        env=base_env,
-    )
+    try:
+        return subprocess.check_output(
+            command,
+            stderr=subprocess.STDOUT,
+            encoding="utf-8",
+            env=base_env,
+        )
+    except subprocess.CalledProcessError as error:
+        logger.error(f"'{' '.join(map(str, command))}' returned {error.returncode}. Output:\n{error.output}")
+        raise
```
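The point of the restored try/except: `subprocess.CalledProcessError`'s own message only names the command and exit code, so the output captured via `stderr=subprocess.STDOUT` is silently dropped unless it is logged before re-raising. A standalone sketch of the pattern (the helper name is illustrative):

```python
import subprocess
import sys


def run_logged(command: list[str]) -> str:
    """Run a command; on failure, print the captured output before re-raising.

    Without the except branch, CalledProcessError reports only the return
    code, and the captured stdout/stderr text would be lost.
    """
    try:
        return subprocess.check_output(command, stderr=subprocess.STDOUT, encoding="utf-8")
    except subprocess.CalledProcessError as error:
        # error.output holds the combined stdout+stderr text of the failed run
        print(f"'{' '.join(command)}' returned {error.returncode}. Output:\n{error.output}")
        raise


print(run_logged([sys.executable, "-c", "print('ok')"]).strip())  # ok
```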

tools/who_what_benchmark/whowhatbench/text2video_evaluator.py

Lines changed: 6 additions & 4 deletions
```diff
@@ -16,7 +16,7 @@
 
 @register_evaluator("text-to-video")
 class Text2VideoEvaluator(BaseEvaluator):
-    DEF_NUM_FRAMES = 25
+    DEF_NUM_FRAMES = 33
     DEF_NUM_INF_STEPS = 25
     DEF_FRAME_RATE = 25
     DEF_WIDTH = 704
@@ -31,7 +31,7 @@ def __init__(
         test_data: Union[str, list] = None,
         metrics="similarity",
         num_inference_steps=25,
-        num_frames=25,
+        num_frames=33,
         crop_prompts=True,
         num_samples=None,
         gen_video_fn=None,
@@ -96,9 +96,11 @@ def worst_examples(self, top_k: int = 5, metric="similarity"):
         assert self.last_cmp is not None
 
         res = self.last_cmp.nsmallest(top_k, metric)
-        res = list(row for idx, row in res.iterrows())
+        formatted_res = []
+        for _, row in res.iterrows():
+            formatted_res.append("\n".join(f"{col:<40} {val}" for col, val in row.items()))
 
-        return res
+        return formatted_res
 
     def collect_default_data(self):
         from importlib.resources import files
```
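The new `worst_examples` formatting relies on Python's format alignment: `{col:<40}` left-aligns each column name in a 40-character field, so all values start in the same column, which is exactly the layout visible in the log excerpt in the PR description. A quick illustration with sample data only:

```python
def format_row(row: dict) -> str:
    # `{col:<40}` pads each key to 40 characters, aligning the values.
    return "\n".join(f"{col:<40} {val}" for col, val in row.items())


print(format_row({"similarity": 0.9604, "SSIM (higher is better, 1.0 best)": 0.9669}))
# similarity                               0.9604
# SSIM (higher is better, 1.0 best)        0.9669
```

Keys longer than 40 characters would push their value out of alignment, but the metric names used here all fit.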

tools/who_what_benchmark/whowhatbench/whowhat_metrics.py

Lines changed: 139 additions & 39 deletions
```diff
@@ -8,10 +8,12 @@
 import torch
 import torch.nn.functional as F
 
+import cv2
 import numpy as np
 from sentence_transformers import SentenceTransformer, util
 from transformers import CLIPImageProcessor, CLIPModel
 from tqdm import tqdm
+from skimage.metrics import structural_similarity
 from sklearn.metrics.pairwise import cosine_similarity
 
 
@@ -236,55 +238,153 @@ def evaluate(self, data_gold, data_prediction):
 
 class VideoSimilarity:
     def __init__(self) -> None:
-        from transformers import LlavaNextVideoProcessor, LlavaNextVideoModel
+        from transformers import VivitImageProcessor, VivitModel
+
+        self.embeds_model = "google/vivit-b-16x2"
+        self.embeds_model_frame_num = 32
+        self.processor = VivitImageProcessor.from_pretrained(self.embeds_model)
+        self.model = VivitModel.from_pretrained(self.embeds_model).eval()
+
+        import lpips
+
+        # alex - faster; vgg - more rigorous assessments; to check when collecting statistics
+        self.lpips_model = lpips.LPIPS(net="alex").to("cpu")
+
+    def load_video_frames(self, video_path: str, num_frames: int | None = None):
+        cap = cv2.VideoCapture(video_path)
+        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+
+        # Adjust the frame count to match the required num_frames:
+        # interpolate if the video has fewer frames, truncate if it has more
+        frame_idxs = np.arange(total_frames)
+        if num_frames and num_frames > total_frames:
+            frame_idxs = np.linspace(0, total_frames - 1, num_frames).astype(int)
+        elif num_frames and num_frames < total_frames:
+            frame_idxs = np.arange(num_frames)
+
+        frames = []
+        for i in range(total_frames):
+            ret, frame = cap.read()
+            if not ret:
+                break
+
+            frame_count = np.count_nonzero(frame_idxs == i)
+            if frame_count == 0:
+                continue
+            # if total_frames is less than the required num_frames, duplicate some of them
+            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
+            for _ in range(frame_count):
+                frames.append(frame)
+
+        cap.release()
+        return np.stack(frames)
+
+    def get_embedding(self, gold_video: np.ndarray, predicted_video: np.ndarray):
+        gold_inputs = self.processor(list(gold_video), return_tensors="pt")
+        with torch.no_grad():
+            gold_emb = self.model(**gold_inputs).last_hidden_state[:, 0, :]
 
-        self.processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
-        self.model = LlavaNextVideoModel.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf").eval()
+            predicted_inputs = self.processor(list(predicted_video), return_tensors="pt")
+            predicted_emb = self.model(**predicted_inputs).last_hidden_state[:, 0, :]
 
-    def get_pixel_values_videos(self, video):
-        # according to pre processing of inputs in get_video_features of LlavaNextVideoModel
-        # https://github.com/huggingface/transformers/blob/v4.53.2/src/transformers/models/llava_next_video/modular_llava_next_video.py#L381
-        inputs = self.processor.video_processor(videos=video, return_tensors="pt")["pixel_values_videos"]
-        batch_size, frames, channels, height, width = inputs.shape
-        pixel_values_videos = inputs.reshape(batch_size * frames, channels, height, width)
-        return pixel_values_videos
+        cos_sim_all = cosine_similarity(gold_emb.detach().numpy(), predicted_emb.detach().numpy())
+        return np.mean(np.diag(cos_sim_all))
 
-    def get_video_features(self, pixel_values_videos):
-        layer_idx = self.model.config.vision_feature_layer
-        with torch.no_grad():
-            # output shape (batch, patches, hidden_dim)
-            outputs = self.model.vision_tower(pixel_values_videos, output_hidden_states=True)
-            # according to post processing of outputs in get_video_features of LlavaNextVideoModel
-            # https://github.com/huggingface/transformers/blob/v4.53.2/src/transformers/models/llava_next_video/modular_llava_next_video.py#L387
-            outputs = outputs.hidden_states[layer_idx][:, 1:]
-        return outputs.mean(dim=2)
+    def convert_frame_to_lpips_tensor(self, frame: np.ndarray) -> torch.Tensor:
+        tensor = torch.from_numpy(frame).float().permute(2, 0, 1) / 255.0
+        # [0, 1] -> [-1, 1]
+        tensor = tensor * 2 - 1
+        return tensor.unsqueeze(0)
 
-    def load_video_frames(self, video_path):
-        import imageio.v3 as iio
+    def get_lpips(self, gold_video: np.ndarray, pred_video: np.ndarray):
+        lpips_scores = []
+        for i, gold_frame in enumerate(gold_video):
+            pred_frame = pred_video[i]
+            pred_lpips_frame = self.convert_frame_to_lpips_tensor(pred_frame)
+            gold_lpips_frame = self.convert_frame_to_lpips_tensor(gold_frame)
 
-        frames = iio.imread(video_path, plugin="pyav")
-        return [Image.fromarray(frame).convert("RGB") for frame in frames]
+            with torch.no_grad():
+                score = self.lpips_model(gold_lpips_frame, pred_lpips_frame)
 
-    def evaluate(self, gt, prediction):
-        videos_gold = gt["videos"].values
-        videos_prediction = prediction["videos"].values
+            lpips_scores.append(score.item())
+
+        return np.mean(lpips_scores)
+
+    def get_frame_differences_for_tlpips(self, video: np.ndarray) -> np.ndarray:
+        differences = []
+        for i in range(len(video) - 1):
+            diff = video[i + 1].astype(np.float32) - video[i].astype(np.float32)
+            differences.append(diff)
+
+        # Convert to a tensor in the [-1, 1] range, as LPIPS expects
+        differences = torch.Tensor(np.array(differences))
+        max_diff = differences.abs().max()
+        # if no changes occurred, max_diff will be 0; no normalization is required
+        if max_diff.item() != 0.0:
+            differences = differences / max_diff
+        return differences.permute(0, 3, 1, 2)
+
+    def get_temporal_lpips(self, gold_video: np.ndarray, pred_video: np.ndarray):
+        """
+        Temporal LPIPS: compares MOTION (frame differences) between videos
+        """
+        gold_video_diff = self.get_frame_differences_for_tlpips(gold_video)
+        pred_video_diff = self.get_frame_differences_for_tlpips(pred_video)
+
+        temporal_lpips_scores = []
+        for i in range(len(gold_video_diff)):
+            gold_tensor = gold_video_diff[i].unsqueeze(0)
+            pred_tensor = pred_video_diff[i].unsqueeze(0)
+
+            with torch.no_grad():
+                score = self.lpips_model(gold_tensor, pred_tensor)
+
+            temporal_lpips_scores.append(score.item())
+
+        return np.mean(temporal_lpips_scores)
+
+    def get_ssim(self, gold_video: np.ndarray, pred_video: np.ndarray):
+        ssim_vals = []
+        for i, gold_frame in enumerate(gold_video):
+            gf_gray = cv2.cvtColor(gold_frame, cv2.COLOR_RGB2GRAY)
+            pred_frame = pred_video[i]
+            pf_gray = cv2.cvtColor(pred_frame, cv2.COLOR_RGB2GRAY)
+
+            ssim_vals.append(structural_similarity(gf_gray, pf_gray))
+
+        return ssim_vals
+
+    def evaluate(self, data_gold, data_prediction):
+        videos_gold = data_gold["videos"].values
+        videos_prediction = data_prediction["videos"].values
 
         metric_per_video = []
-        metric_per_frames_per_video = []
-        for gold, pred in tqdm(zip(videos_gold, videos_prediction), desc="Video Similarity evaluation"):
-            gold_video = self.load_video_frames(gold)
-            prediction_video = self.load_video_frames(pred)
+        ssim_per_video = []
+        lpips_per_video = []
+        tlpips_per_video = []
+        for gold, prediction in tqdm(zip(videos_gold, videos_prediction), desc="Video Similarity evaluation"):
+            # vivit requires 32 frames
+            gold_video = self.load_video_frames(str(gold), num_frames=self.embeds_model_frame_num)
+            predicted_video = self.load_video_frames(str(prediction), num_frames=self.embeds_model_frame_num)
+
+            cos_sim_mean = self.get_embedding(gold_video, predicted_video)
 
-            gold_inputs_pixel_values = self.get_pixel_values_videos(gold_video)
-            prediction_inputs_pixel_values = self.get_pixel_values_videos(prediction_video)
+            ssim_vals = self.get_ssim(gold_video, predicted_video)
+            ssim_avg = sum(ssim_vals) / len(ssim_vals)
 
-            gold_outputs = self.get_video_features(gold_inputs_pixel_values)
-            prediction_outputs = self.get_video_features(prediction_inputs_pixel_values)
+            lpips = self.get_lpips(gold_video, predicted_video)
+            tlpips = self.get_temporal_lpips(gold_video, predicted_video)
 
-            cos_sim_all = cosine_similarity(prediction_outputs, gold_outputs)
-            cos_sim_frames = np.array([cos_sim_all[i, i] for i in range(len(gold_video))])
-            metric_per_video.append(np.mean(cos_sim_frames))
-            metric_per_frames_per_video.append(cos_sim_frames)
+            metric_per_video.append(cos_sim_mean)
+            ssim_per_video.append(ssim_avg)
+            lpips_per_video.append(lpips)
+            tlpips_per_video.append(tlpips)
 
         metric_dict = {"similarity": np.mean(metric_per_video)}
-        return metric_dict, {"similarity": metric_per_video, "per_frame": metric_per_frames_per_video}
+        return metric_dict, {
+            "similarity": metric_per_video,
+            "SSIM (higher is better, 1.0 best)": ssim_per_video,
+            "LPIPS (lower is better, 0 best)": lpips_per_video,
+            "tLPIPS (lower is better, 0 best)": tlpips_per_video,
+        }
```

tools/who_what_benchmark/whowhatbench/wwb.py

Lines changed: 4 additions & 1 deletion
```diff
@@ -796,18 +796,20 @@ def print_image_results(evaluator):
     pd.set_option('display.max_colwidth', None)
     worst_examples = evaluator.worst_examples(
         top_k=5, metric=metric_of_interest)
+    logger.info("TOP WORST RESULTS")
     for i, e in enumerate(worst_examples):
         logger.info(
             "======================================================================================================="
         )
         logger.info(f"Top-{i+1} example:")
-        logger.info(e)
+        logger.info(f"\n{e}")
 
 
 def print_embeds_results(evaluator):
     metric_of_interest = "similarity"
     worst_examples = evaluator.worst_examples(
         top_k=5, metric=metric_of_interest)
+    logger.info("TOP WORST RESULTS")
     for i, e in enumerate(worst_examples):
         logger.info(
             "======================================================================================================="
@@ -821,6 +823,7 @@ def print_rag_results(evaluator):
     metric_of_interest = "similarity"
     worst_examples = evaluator.worst_examples(
         top_k=5, metric=metric_of_interest)
+    logger.info("TOP WORST RESULTS")
     for i, e in enumerate(worst_examples):
         logger.info(
             "======================================================================================================="
```
