In get_nexusscore.py, lines 379-382, the final Nexus Score is computed as:
    if len(retrieval_score_list) != 0:
        nexus_score = (
            torch.mean(torch.tensor(retrieval_score_list)).item() / frame_obj
        )
I am wondering why the mean retrieval score is further divided by "frame_obj". The "torch.mean" operation already averages over both the number of prompt images (N) and the number of frames (T), so this division seems redundant.
Moreover, this division means that if an object appears consistently throughout the video and the YOLO detector detects it in every frame, its Nexus Score will be lower than if the object appeared in only a few frames. In other words, flickering could lead to a higher Nexus Score. This seems unreasonable. Or am I misunderstanding something? I would appreciate any help with these questions. Many thanks!
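To make the concern concrete, here is a minimal sketch of the quoted logic. It assumes that "frame_obj" is the number of frames in which the object was detected and that "retrieval_score_list" holds one score per (prompt image, detected frame) pair; neither is confirmed by the snippet above. The function name, the value of N, and the 0.8 scores are hypothetical, and "statistics.fmean" stands in for "torch.mean(...).item()" so the example runs without PyTorch.

```python
from statistics import fmean


def nexus_score_as_written(retrieval_score_list, frame_obj):
    # Mirrors the quoted lines: mean of all retrieval scores, then divided by
    # frame_obj (ASSUMED here to be the number of frames with a detection).
    if len(retrieval_score_list) != 0:
        return fmean(retrieval_score_list) / frame_obj
    return 0.0  # hypothetical fallback; the quoted snippet does not show this case


N = 2  # number of prompt images (hypothetical)

# Object detected in all 8 frames: N * 8 scores, each 0.8 (made-up values).
steady = nexus_score_as_written([0.8] * (N * 8), frame_obj=8)
# Object detected in only 2 frames (flickering), same per-frame similarity.
flicker = nexus_score_as_written([0.8] * (N * 2), frame_obj=2)

print(steady, flicker)  # the flickering case scores 4x higher
```

With identical per-frame similarity, the mean alone is 0.8 in both cases; the extra division by "frame_obj" is what makes the consistently detected object score lower.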