feat(annotators): enhance label annotators with frame boundary adjust… #1820

Open · hidara2000 wants to merge 14 commits into base: develop

Conversation

@hidara2000 commented Apr 15, 2025

🚀 Enhance label annotators with frame boundary adjustments and new base class

Description

This PR adds the ability to ensure labels stay within frame boundaries through a new ensure_in_frame parameter. When enabled, this functionality guarantees that text labels for bounding boxes near image edges remain visible by adjusting their position to fit within the frame.

The key improvements include:

  • ✅ Text labels near edges now properly positioned within frame boundaries
  • ✅ Implemented as an optional parameter (default: False to maintain backward compatibility)
  • ✅ Works alongside existing smart_position functionality with complementary behavior

While there may be occasional label overlaps in very busy frames when both smart_position and ensure_in_frame are enabled, running the smart positioning algorithm first typically yields better results overall.
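
For quick reference, a minimal usage sketch; ensure_in_frame is the parameter proposed in this PR, the rest is existing supervision API, and the values are purely illustrative:

import numpy as np
import supervision as sv

# dummy frame and a detection hanging past the bottom-right corner
image = np.zeros((480, 640, 3), dtype=np.uint8)
detections = sv.Detections(
    xyxy=np.array([[600.0, 430.0, 700.0, 500.0]]),
    class_id=np.array([0]),
)

label_annotator = sv.LabelAnnotator(
    text_position=sv.Position.BOTTOM_RIGHT,
    smart_position=True,     # existing behaviour: spread overlapping labels
    ensure_in_frame=True,    # proposed here: shift labels back inside the frame
)
annotated_image = label_annotator.annotate(
    scene=image, detections=detections, labels=["car 0.92"]
)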

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How has this change been tested?

I tested this change with various image scenarios that have bounding boxes positioned near frame edges. The implementation was verified by:

  1. Comparing output images with and without the ensure_in_frame parameter enabled
  2. Testing cases with multiple objects near edges to ensure proper positioning
  3. Validating behavior when used in combination with smart_position

Example test code:

import cv2
import numpy as np
import supervision as sv
from supervision.annotators.core import LabelAnnotator, BoxAnnotator
from PIL import Image, ImageDraw
import os
from typing import Optional, Tuple, List


def generate_mock_yolo_output(image_shape: Tuple[int, int, int]) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Generates mock bounding box detections, confidence scores, and class predictions
    for a given image shape.  The function creates a set of detections, including
    one that covers the whole image.

    Args:
        image_shape (Tuple[int, int, int]): The shape of the image (height, width, channels).
            This is used to determine the boundaries for the generated bounding boxes.

    Returns:
        Tuple[np.ndarray, np.ndarray, np.ndarray]: A tuple containing:
            - A NumPy array of bounding boxes (N, 4), where N is the number of detections.
              Each box is defined as [xmin, ymin, xmax, ymax].
            - A NumPy array of confidence scores (N,).
            - A NumPy array of class labels (N,).
    """
    image_height, image_width, _ = image_shape
    num_detections = 100
    
    # Generate random bounding boxes
    xmin = np.random.randint(0, image_width, num_detections)
    ymin = np.random.randint(0, image_height, num_detections)
    xmax = np.random.randint(xmin + 20, image_width + 50, num_detections)
    ymax = np.random.randint(ymin + 20, image_height + 50, num_detections)
    bounding_boxes = np.stack([xmin, ymin, xmax, ymax], axis=1).astype(np.float32)

    # Add a box that covers the whole image
    full_image_box = np.array([0, 0, image_width, image_height], dtype=np.float32).reshape(1, 4)
    bounding_boxes = np.concatenate([bounding_boxes, full_image_box], axis=0)
    num_detections += 1

    # Generate random confidence scores
    confidence_scores = np.random.uniform(0.5, 0.95, num_detections).astype(np.float32)
    confidence_scores[-1] = 0.99  # High confidence for the full image box

    # Generate random class labels
    class_labels = np.random.randint(0, 2, num_detections).astype(np.int32)
    class_labels[-1] = 0  # Assign a class to the full image box

    return bounding_boxes, confidence_scores, class_labels



def process_image_with_supervision(
    image: np.ndarray,
    display_image: bool = True,
    text_position: sv.Position = sv.Position.TOP_LEFT,
    smart_position: bool = False,
    detections: Optional[sv.Detections] = None,
) -> None:
    """
    Processes an image by simulating YOLO detection and using Supervision to annotate it.
    The function generates two annotated images (with and without `ensure_in_frame`)
    and stacks them vertically, adding headers and white boundaries for clarity.

    Args:
        image (np.ndarray): The input image as a NumPy array in BGR format.
        display_image (bool, optional): Flag to control whether to display the image.
            If True, it attempts to display the image. If False, it saves the
            image to a file. Defaults to True.
        text_position (sv.Position, optional): The position of the text label
            relative to the bounding box.  Defaults to sv.Position.TOP_LEFT.
        smart_position (bool, optional): Flag to enable smart position adjustment of labels
            to keep them within the image frame. Defaults to False.
        detections (sv.Detections, optional): Pre-calculated detections.
            If provided, the function uses these detections instead of generating new ones.
            Defaults to None.

    Returns:
        None (displays or saves the stacked annotated image).
    """
    # 1. Simulate YOLO model output or use provided
    if detections is None:
        bounding_boxes, confidence_scores, class_labels = generate_mock_yolo_output(image.shape)
        detections = sv.Detections(
            xyxy=bounding_boxes,
            confidence=confidence_scores,
            class_id=class_labels,
        )

    # 2. Create annotators
    box_annotator = BoxAnnotator(thickness=2)
    class_names = ["car", "person"]

    label_annotator_in_frame = LabelAnnotator(
        text_scale=0.5,
        text_thickness=1,
        text_padding=5,
        ensure_in_frame=True,
        text_position=text_position,
        smart_position=smart_position,
    )
    label_annotator_out_of_frame = LabelAnnotator(
        text_scale=0.5,
        text_thickness=1,
        text_padding=5,
        ensure_in_frame=False,
        text_position=text_position,
        smart_position=smart_position,
    )

    # 3. Annotate the image with the detections using both annotators.
    annotated_image_in_frame = box_annotator.annotate(image.copy(), detections=detections)
    labels_in_frame = [
        f"{class_names[int(class_id)]} {confidence:.2f}"  # Corrected f-string
        for _, _, confidence, class_id, *_ in detections
    ]
    annotated_image_in_frame = label_annotator_in_frame.annotate(
        annotated_image_in_frame, detections=detections, labels=labels_in_frame
    )

    annotated_image_out_of_frame = box_annotator.annotate(image.copy(), detections=detections)
    labels_out_of_frame = [
        f"{class_names[int(class_id)]} {confidence:.2f}"  # Corrected f-string
        for _, _, confidence, class_id, *_ in detections
    ]
    annotated_image_out_of_frame = label_annotator_out_of_frame.annotate(
        annotated_image_out_of_frame, detections=detections, labels=labels_out_of_frame
    )

    # 4. Add white boundaries around the images
    border_width = 3
    annotated_image_in_frame = cv2.copyMakeBorder(
        annotated_image_in_frame,
        border_width,
        border_width,
        border_width,
        border_width,
        cv2.BORDER_CONSTANT,
        value=(255, 255, 255),
    )
    annotated_image_out_of_frame = cv2.copyMakeBorder(
        annotated_image_out_of_frame,
        border_width,
        border_width,
        border_width,
        border_width,
        cv2.BORDER_CONSTANT,
        value=(255, 255, 255),
    )

    # 5. Add headers to each image
    header_height = 30
    header_color = (255, 255, 255)
    text_color = (0, 0, 0)
    font = cv2.FONT_HERSHEY_SIMPLEX
    font_scale = 0.7
    font_thickness = 2

    # Create header images for each annotated image
    header_image_in_frame = np.zeros(
        (header_height, annotated_image_in_frame.shape[1], 3), dtype=np.uint8
    )
    header_image_in_frame[:] = header_color
    text_size_in_frame = cv2.getTextSize("Enabled", font, font_scale, font_thickness)[0]
    text_x_in_frame = annotated_image_in_frame.shape[1] - text_size_in_frame[0] - 10
    text_y_in_frame = (header_height + text_size_in_frame[1]) // 2
    cv2.putText(
        header_image_in_frame,
        "Enabled",
        (text_x_in_frame, text_y_in_frame),
        font,
        font_scale,
        text_color,
        font_thickness,
        cv2.LINE_AA,
    )

    header_image_out_of_frame = np.zeros(
        (header_height, annotated_image_out_of_frame.shape[1], 3), dtype=np.uint8
    )
    header_image_out_of_frame[:] = header_color
    text_size_out_of_frame = cv2.getTextSize("Not Enabled", font, font_scale, font_thickness)[0]
    text_x_out_of_frame = (header_image_out_of_frame.shape[1] - text_size_out_of_frame[0]) // 2
    text_y_out_of_frame = (header_height + text_size_out_of_frame[1]) // 2
    cv2.putText(
        header_image_out_of_frame,
        "Not Enabled",
        (text_x_out_of_frame, text_y_out_of_frame),
        font,
        font_scale,
        text_color,
        font_thickness,
        cv2.LINE_AA,
    )

    # Stack the headers and the images
    annotated_image_in_frame_with_header = np.vstack(
        (header_image_in_frame, annotated_image_in_frame)
    )
    annotated_image_out_of_frame_with_header = np.vstack(
        (header_image_out_of_frame, annotated_image_out_of_frame)
    )

    # 6. Stack the two images vertically
    stacked_image = np.vstack(
        (annotated_image_in_frame_with_header, annotated_image_out_of_frame_with_header)
    )

    # Add position text to the top-left corner
    cv2.putText(
        stacked_image,
        str(text_position) + f", smart_pos={smart_position}",
        (10, 20),
        cv2.FONT_HERSHEY_SIMPLEX,
        0.7,
        (0, 0, 0),
        2,
        cv2.LINE_AA,
    )

    # 7. Display the annotated image.
    if display_image:
        try:
            pil_image = Image.fromarray(cv2.cvtColor(stacked_image, cv2.COLOR_BGR2RGB))
            pil_image.show()
            pil_image.close()
        except OSError as e:
            print(f"Error displaying image: {e}. Saving image instead.")
            cv2.imwrite(f"annotated_image_{text_position}_smart_{smart_position}.jpg", stacked_image)
    else:
        cv2.imwrite(f"annotated_image_{text_position}_smart_{smart_position}.jpg", stacked_image)
        print(f"Annotated image saved to annotated_image_{text_position}_smart_{smart_position}.jpg")



def main(image_path: str = "example.jpg") -> None:
    """
    Main function to run the image processing and annotation with different label positions
    and smart position settings.

    Args:
        image_path (str, optional): Path to the image file. Defaults to "example.jpg".
    """
    # Create a dummy image
    image = np.zeros((600, 800, 3), dtype=np.uint8)
    cv2.imwrite(image_path, image)

    # 1. Generate Detections once - Moved inside process_image_with_supervision
    # mock_bounding_boxes, mock_confidence_scores, mock_class_labels = generate_mock_yolo_output(image.shape)
    # detections = sv.Detections(
    #      xyxy=mock_bounding_boxes,
    #      confidence=confidence_scores,
    #      class_id=mock_class_labels,
    # )

    # 2. Loop through positions with smart_position=False
    positions = [
        sv.Position.TOP_LEFT,
        sv.Position.CENTER_LEFT,
        sv.Position.BOTTOM_RIGHT,
        sv.Position.CENTER_RIGHT,
    ]
    for position in positions:
        print(f"Processing image with text position: {position}, smart_position=False")
        process_image_with_supervision(image, display_image=False, text_position=position, smart_position=False)  # Removed detections

    # 3. Loop through positions with smart_position=True, using the same detections
    for position in positions:
        print(f"Processing image with text position: {position}, smart_position=True")
        process_image_with_supervision(image, display_image=False, text_position=position, smart_position=True)  # Removed detections

    os.remove(image_path)



if __name__ == "__main__":
    main()

[Example output images attached for each tested text position]

Any specific deployment considerations

No special deployment considerations are needed. This feature is implemented as an optional parameter that defaults to False, ensuring backward compatibility with existing code.

Docs

  • Docs updated? What were the changes:
    No changes to docs, as the functionality is similar to smart_position and the only entry for this in the docs was in the changelog. I can update the documentation to include this new parameter in the appropriate class references if desired; just let me know where and in what format.

…ments and new base class

- Ensures labels stay within the frame
- May have a few overlaps at edges in very busy frames when smart_pos is enabled, but running smart_pos first yields better results
@CLAassistant commented Apr 15, 2025

CLA assistant check
All committers have signed the CLA.

@onuralpszr (Collaborator) left a comment

Hello @hidara2000, thank you for this awesome PR.

I made my initial quick comments about certain changes. Let me also test as well.

@hidara2000 (Author)

> Hello @hidara2000, thank you for this awesome PR.
>
> I made my initial quick comments about certain changes. Let me also test as well.

Makes sense. Changes ticked off. Cheers for a great tool!

@hidara2000 mentioned this pull request Apr 15, 2025

@SkalskiP (Collaborator)

Hi @hidara2000 👋🏻 Huge thanks for deciding to submit a PR to introduce this change! I have a couple of points I'd like to discuss before I dive deeper into the PR review:

Wouldn't it be a better approach to keep the smart_position flag and simply add this extra behavior when smart_position=True? I understand that these two features could be seen as separate operations, but I'm still leaning towards maintaining a simple API:

  • smart_position=False - raw, unprocessed label positions
  • smart_position=True - we do everything we can to make them as visible as possible

For some time now, I've wanted to add support for multiline labels / label wrapping. Considering you're completely rewriting both label annotators, would you be willing to add support for multiline labels / label wrapping as part of this PR?

[Screenshot attached]

@hidara2000 (Author)

📝 Add Multiline Text Support to Label Annotators

🔄 Updates to Previous PR

This extends my previous PR that added frame boundary adjustments by incorporating support for multiline text in label annotators. The implementation now properly handles both newlines in text and automatic text wrapping.

✨ New Features

  • 🔤 Multiline Text Support: Labels now properly render text with newlines (\n)
  • 📏 Auto Text Wrapping: New max_line_length parameter controls automatic text wrapping
  • 🧠 Enhanced Smart Positioning: Improved algorithm to prevent overlapping multiline labels
  • 🔄 Two-Phase Spreading: More effective label distribution with size-aware positioning

🛠️ Implementation Details

  • Added max_line_length parameter to existing annotator classes
  • Used Python's textwrap library for robust text wrapping functionality (see the sketch after this list)
  • Enhanced smart positioning to better handle varying text box sizes
  • Properly calculated dimensions for multiline text boxes
  • Implemented size-aware box spreading to reduce overlaps
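
For illustration, a standalone sketch of the wrapping behaviour described above, using only Python's standard-library textwrap; the helper name wrap_label is invented for this example and is not the function added by the PR:

import textwrap
from typing import List, Optional


def wrap_label(text: str, max_line_length: Optional[int] = None) -> List[str]:
    # honour explicit newlines first, then wrap each paragraph to the limit
    if max_line_length is None:
        return text.splitlines() or [""]
    lines: List[str] = []
    for paragraph in text.split("\n"):
        wrapped = textwrap.wrap(paragraph, width=max_line_length, break_long_words=True)
        lines.extend(wrapped or [""])  # keep empty lines
    return lines


print(wrap_label("Car\nLicense: ABC-123", max_line_length=12))
# ['Car', 'License:', 'ABC-123']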

📊 Before/After Comparison

[Before/after comparison images attached]

📚 Usage Example

# Create a label annotator with multiline text support
label_annotator = sv.LabelAnnotator(
    text_padding=10,
    smart_position=True,  # Works with existing smart positioning
    max_line_length=20  # Enable text wrapping at 20 characters
)

# Labels can have manual newlines or will auto-wrap
labels = [
    "Car\nLicense: ABC-123",  # Manual newlines
    "This is a very long label that will be wrapped automatically"  # Auto-wrapped
]

# Use as normal
annotated_image = label_annotator.annotate(
    scene=image,
    detections=detections,
    labels=labels
)

🧪 Test Code

Here's the code I used to test the multiline text support:

def process_image_with_supervision(
    image: np.ndarray,
    display_image: bool = True,
    text_position: sv.Position = sv.Position.TOP_LEFT,
    smart_position: bool = False,
    detections: Optional[sv.Detections] = None,
) -> None:
    # 1. Simulate YOLO model output or use provided
    if detections is None:
        bounding_boxes, confidence_scores, class_labels = generate_mock_yolo_output(
            image.shape
        )
        detections = sv.Detections(
            xyxy=bounding_boxes,
            confidence=confidence_scores,
            class_id=class_labels,
        )

    # 2. Create annotators
    box_annotator = BoxAnnotator(thickness=2)
    class_names = ["This is\na\ncar", "This is a really really really long label"]

    label_annotator_smart = LabelAnnotator(
        text_scale=0.5,
        text_thickness=1,
        text_padding=5,
        text_position=text_position,
        smart_position=True,
        max_line_length=12,  # Enable text wrapping at 12 characters
    )
    label_annotator_not_smart = LabelAnnotator(
        text_scale=0.5,
        text_thickness=1,
        text_padding=5,
        text_position=text_position,
        smart_position=False,
    )

    # 3. Annotate the image with both configurations
    annotated_image_smart = box_annotator.annotate(image.copy(), detections=detections)
    labels_smart = [
        f"{class_names[int(class_id)]} {confidence:.2f}"
        for _, _, confidence, class_id, *_ in detections
    ]
    annotated_image_smart = label_annotator_smart.annotate(
        annotated_image_smart, detections=detections, labels=labels_smart
    )

    annotated_image_not_smart = box_annotator.annotate(
        image.copy(), detections=detections
    )
    labels_not_smart = [
        f"{class_names[int(class_id)]} {confidence:.2f}"
        for _, _, confidence, class_id, *_ in detections
    ]
    annotated_image_not_smart = label_annotator_not_smart.annotate(
        annotated_image_not_smart, detections=detections, labels=labels_not_smart
    )

    # 4. Create comparison image and save
    # ... (display and saving code omitted for brevity)

I tested with various text positions:

positions = [
    sv.Position.TOP_LEFT,
    sv.Position.CENTER_LEFT,
    sv.Position.BOTTOM_RIGHT,
    sv.Position.CENTER_RIGHT,
]
for position in positions:
    process_image_with_supervision(
        image, display_image=False, text_position=position, smart_position=True
    )

🔍 Performance Note

The enhanced smart positioning uses a two-phase approach that maintains good performance in most real-world scenarios. For scenes with many labels, the visual improvement in label placement is well worth the minimal additional processing time.

🔄 Compatibility

This change is backward compatible. The max_line_length parameter is optional (default: None), so existing code will continue to work without modification.

Comment on the following lines:
        Returns:
            List[str]: A list of text lines after wrapping.
        """
        import textwrap
Collaborator:

Let’s move this import to the top of the file instead of placing it here.

@hidara2000 (Author) commented Apr 24, 2025:

🤦 oops

Comment on the following lines:
        else:  # CENTER, CENTER_LEFT, CENTER_RIGHT
            return (y1 + y2) / 2

    def _wrap_text(self, text: str) -> List[str]:
Collaborator:

I’d prefer this not to be a private class method—let’s move it to supervision/annotators/utils.py instead.

Author:

Done

Comment on lines 307 to 341
        import textwrap

        if not text:
            return [""]

        if self.max_line_length is None:
            return text.splitlines() or [""]

        # Split the text by existing newlines first
        paragraphs = text.split("\n")
        all_lines = []

        for paragraph in paragraphs:
            if not paragraph:
                # Keep empty lines
                all_lines.append("")
                continue

            # Wrap each paragraph separately
            wrapped = textwrap.wrap(
                paragraph,
                width=self.max_line_length,
                break_long_words=True,
                replace_whitespace=False,
                drop_whitespace=True,
            )

            # Add the wrapped lines for this paragraph
            if wrapped:
                all_lines.extend(wrapped)
            else:
                # If wrap returns an empty list (e.g., for whitespace-only input)
                all_lines.append("")

        return all_lines if all_lines else [""]
Collaborator:

The logic here seems pretty easy to follow. Let's remove python comments here.

Author:

Done

Comment on lines 158 to 159
frame_width: int,
frame_height: int,
Collaborator:

in the supervision codebase, we usually pass a resolution_wh tuple instead of separate frame width and height values.

Collaborator:

we have two other functions clip_boxes and pad_boxes. I recommend:

  • renaming this function to snap_boxes
  • drop part of the logic that flips (we can add it in the future, but I want to keep it out of this PR)
  • make it vectorized to process all boxes at once without looping
  • wrap frame_width and frame_height into single resolution_wh argument.

here's clip_boxes for reference

def clip_boxes(xyxy: np.ndarray, resolution_wh: Tuple[int, int]) -> np.ndarray:
    """
    Clips bounding boxes coordinates to fit within the frame resolution.

    Args:
        xyxy (np.ndarray): A numpy array of shape `(N, 4)` where each
            row corresponds to a bounding box in
            the format `(x_min, y_min, x_max, y_max)`.
        resolution_wh (Tuple[int, int]): A tuple of the form `(width, height)`
            representing the resolution of the frame.

    Returns:
        np.ndarray: A numpy array of shape `(N, 4)` where each row
            corresponds to a bounding box with coordinates clipped to fit
            within the frame resolution.

    Examples:
        ```python
        import numpy as np
        import supervision as sv

        xyxy = np.array([
            [10, 20, 300, 200],
            [15, 25, 350, 450],
            [-10, -20, 30, 40]
        ])

        sv.clip_boxes(xyxy=xyxy, resolution_wh=(320, 240))
        # array([
        #     [ 10,  20, 300, 200],
        #     [ 15,  25, 320, 240],
        #     [  0,   0,  30,  40]
        # ])
        ```
    """
    result = np.copy(xyxy)
    width, height = resolution_wh
    result[:, [0, 2]] = result[:, [0, 2]].clip(0, width)
    result[:, [1, 3]] = result[:, [1, 3]].clip(0, height)
    return result

I generated this. We would need to make sure it works:

def snap_boxes(xyxy: np.ndarray, resolution_wh: Tuple[int, int]) -> np.ndarray:
    """
    Shifts bounding boxes into the frame so that they are fully contained
    within the given resolution. Unlike `clip_boxes`, this function does not crop boxes.
    It moves them entirely if they exceed the frame boundaries.

    Args:
        xyxy (np.ndarray): A numpy array of shape `(N, 4)` where each
            row corresponds to a bounding box in the format
            `(x_min, y_min, x_max, y_max)`.
        resolution_wh (Tuple[int, int]): A tuple `(width, height)`
            representing the resolution of the frame.

    Returns:
        np.ndarray: A numpy array of shape `(N, 4)` with boxes shifted into frame.

    Examples:
        ```python
        import numpy as np
        import supervision as sv

        xyxy = np.array([
            [-10, 10, 30, 50],
            [310, 200, 350, 250],
            [100, -20, 150, 30],
            [200, 220, 250, 270]
        ])

        sv.snap_boxes(xyxy=xyxy, resolution_wh=(320, 240))
        # array([
        #     [  0,  10,  40,  50],
        #     [280, 190, 320, 240],
        #     [100,   0, 150,  50],
        #     [200, 190, 250, 240]
        # ])
        ```
    """
    result = np.copy(xyxy)
    width, height = resolution_wh

    box_w = result[:, 2] - result[:, 0]
    box_h = result[:, 3] - result[:, 1]

    shift_x1 = np.where(result[:, 0] < 0, -result[:, 0], 0)
    shift_x2 = np.where(result[:, 2] > width, width - result[:, 2], 0)
    shift_x = shift_x1 + shift_x2

    result[:, 0] += shift_x
    result[:, 2] += shift_x

    shift_y1 = np.where(result[:, 1] < 0, -result[:, 1], 0)
    shift_y2 = np.where(result[:, 3] > height, height - result[:, 3], 0)
    shift_y = shift_y1 + shift_y2

    result[:, 1] += shift_y
    result[:, 3] += shift_y

    return result

Author:

Done. Might be worth double-checking that I understood you properly here.
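
A quick sanity check of the snippet above against its docstring example (this assumes snap_boxes is defined exactly as sketched; the expected values follow from its shifting logic):

import numpy as np

xyxy = np.array([
    [-10, 10, 30, 50],
    [310, 200, 350, 250],
    [100, -20, 150, 30],
    [200, 220, 250, 270],
])
expected = np.array([
    [0, 10, 40, 50],
    [280, 190, 320, 240],
    [100, 0, 150, 50],
    [200, 190, 250, 240],
])
# every box keeps its size and ends up fully inside the 320x240 frame
np.testing.assert_array_equal(snap_boxes(xyxy, resolution_wh=(320, 240)), expected)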

Comment on lines 179 to 196
        if x1 < 0:
            shift = -x1
            x1 += shift
            x2 += shift
        elif x2 > frame_width:
            shift = frame_width - x2
            x1 += shift
            x2 += shift

        # Adjust y-coordinate to stay within frame
        if y1 < 0:
            shift = -y1
            y1 += shift
            y2 += shift
        elif y2 > frame_height:
            shift = frame_height - y2
            y1 += shift
            y2 += shift
Collaborator:

it should be possible to vectorize this and run on all boxes at once. Without looping.

Author:

done

Comment on lines 198 to 212
        # Check if label should be flipped to above the box
        if check_flip_label and text_anchor is not None:
            box_height = y2 - y1

            # Check anchor position to see if we can flip it
            anchor_y = self._get_anchor_y_for_adjustment(
                np.array([y1, y2]), text_anchor
            )

            # If we're at the bottom, try moving to the top
            if anchor_y >= y2 - 5:  # Near bottom edge
                # Check if there's room at the top
                if y1 - box_height >= 0:
                    y2 = y1
                    y1 = y2 - box_height
Collaborator:

can we remove that logic from scope of this PR? I'm not sure I want to add it

Author:

done

Comment on lines 270 to 294
    @staticmethod
    def _get_anchor_y_for_adjustment(bbox_y: np.ndarray, anchor: Position) -> float:
        """
        Calculates the anchor y-coordinate for label adjustment based on the text
        anchor position.

        Args:
            bbox_y (np.ndarray): An array containing the y1 and y2 coordinates of the
                bounding box.
            anchor (Position): The desired text anchor position.

        Returns:
            float: The anchor y-coordinate.
        """
        y1, y2 = bbox_y
        if anchor in [Position.TOP_LEFT, Position.TOP_CENTER, Position.TOP_RIGHT]:
            return y1
        elif anchor in [
            Position.BOTTOM_LEFT,
            Position.BOTTOM_CENTER,
            Position.BOTTOM_RIGHT,
        ]:
            return y2
        else:  # CENTER, CENTER_LEFT, CENTER_RIGHT
            return (y1 + y2) / 2
Collaborator:

As mentioned, I'd like to keep this part of the logic out of scope for this PR. We can go ahead and remove this method.

Author:

done

        self.smart_position = smart_position
        self.max_line_length: Optional[int] = max_line_length

    def _validate_labels(self, labels: Optional[List[str]], detections: Detections):
Collaborator:

I'd move this method to supervision/annotators/utils.

Author:

done

        )

    @staticmethod
    def _get_labels_text(
Collaborator:

I'd move this method to supervision/annotators/utils.

Author:

done

        # First, make sure the boxes don't go outside the frame
        for i in range(len(labels)):
            # Adjust box to stay within frame
            adjusted_properties[i, :4] = self._ensure_box_in_frame(
Collaborator:

Given that we are getting rid of flipping for now, do we need to call _ensure_box_in_frame (snap_boxes) twice?

Author:

fixed

Comment on lines 1280 to 1281
force_scale: float = 10.0,
consider_size: bool = True,
Collaborator:

@hidara2000 I'm curious—what was the reason for introducing those two arguments? I'm a bit concerned they might lead to unstable label positions during video processing, where small changes in initial position could cause disproportionately large shifts in the final output.

@hidara2000 (Author) commented Apr 24, 2025:

@SkalskiP
The reason for introducing force_scale and consider_size was primarily to offer more granular control over how the spreading algorithm resolves overlaps in static images (during testing). force_scale allows tuning the overall repulsion strength, and consider_size was an attempt to see if factoring in the label box dimensions could lead to a more visually pleasing distribution, especially in complex overlap scenarios. force_vectors *= 10 was already in the original code, and I was toying with the idea of letting a user set these values to suit their scenario, i.e. less force for videos and more for static busy scenes.

You're absolutely right to be concerned about video stability. Iterative algorithms like spread_out_boxes can be sensitive to small frame-to-frame variations in detection positions. Parameters like force_scale (especially if set high) and consider_size can amplify these small variations into larger, potentially noticeable jumps or jitter in the label positions across consecutive video frames.

Given this valid concern and the potential for these parameters to introduce instability, I've reverted the spread_out_boxes function in the PR back to the original version that doesn't have these parameters.

I'm still interested to know your thoughts though – do you think there's a viable way to use the version of the function (below) with force_scale and consider_size without causing instability (perhaps with very conservative default values)? Or do the added parameters introduce unnecessary complexity that would require users to tune them during class instantiation, which might not be ideal for a general-purpose annotator?

Looking forward to your feedback!

import numpy as np
from supervision import box_iou_batch, pad_boxes  # helpers used below, assumed importable from supervision's top-level API


def alternative_spread_out_boxes(
    xyxy: np.ndarray,
    max_iterations: int = 50,  # Moderate default iterations
    force_scale: float = 5.0,  # Moderate default force scale
    consider_size: bool = False, # Default to False for better video stability
    min_force_magnitude: float = 2.0 # Make minimum force tunable
) -> np.ndarray:
    """
    Spread out boxes that overlap with each other, optimized for a balance
    between overlap resolution and video stability.

    Args:
        xyxy: Numpy array of shape (N, 4) where N is the number of boxes.
        max_iterations: Maximum number of iterations to run the algorithm for.
                        Lower values may improve performance and stability
                        but could leave some overlaps unresolved.
        force_scale: Scale factor for the repulsion forces. Lower values result
                     in less aggressive spreading, which can improve video stability.
        consider_size: Whether to consider box size when calculating forces.
                       Setting to True might yield better static layouts but can
                       increase jitter in video due to fluctuating box sizes.
                       Defaults to False for better video stability.
        min_force_magnitude: Minimum magnitude for calculated force vectors.
                             Ensures slight overlaps still result in movement.

    Returns:
        np.ndarray: A numpy array of shape (N, 4) with adjusted box positions.
    """
    if len(xyxy) == 0:
        return xyxy

    # Add a small padding to ensure boxes that are just touching are considered for overlap
    xyxy_padded = pad_boxes(xyxy, px=1)

    # Calculate box areas if we're considering size (only done once)
    size_factors = np.ones(len(xyxy_padded))
    if consider_size:
        box_areas = (xyxy_padded[:, 2] - xyxy_padded[:, 0]) * (
            xyxy_padded[:, 3] - xyxy_padded[:, 1]
        )
        # Calculate the size factors (normalize by mean size), handle empty box_areas
        if len(box_areas) > 0 and np.mean(np.sqrt(box_areas)) != 0:
             size_factors = np.sqrt(box_areas) / np.mean(np.sqrt(box_areas))
             # Clip to avoid extreme values influencing forces too much
             size_factors = np.clip(size_factors, 0.5, 2.0)


    for _ in range(max_iterations):
        # Calculate IoU between all pairs of boxes (NxN matrix)
        iou = box_iou_batch(xyxy_padded, xyxy_padded)
        np.fill_diagonal(iou, 0)  # Eliminate self-interactions (a box doesn't overlap with itself)

        # If there are no overlaps, we are done
        if np.all(iou == 0):
            break

        overlap_mask = iou > 0

        # Calculate centers of the boxes (Nx2)
        centers = (xyxy_padded[:, :2] + xyxy_padded[:, 2:]) / 2

        # Calculate vectors pointing from each box center to every other box center (NxNx2)
        delta_centers = centers[:, np.newaxis, :] - centers[np.newaxis, :, :]
        # Only consider deltas for overlapping boxes
        delta_centers *= overlap_mask[:, :, np.newaxis]

        # Sum the delta vectors for each box to get the total push direction (Nx2)
        delta_sum = np.sum(delta_centers, axis=1)

        # Normalize the sum of deltas to get direction vectors (unit vectors)
        delta_magnitude = np.linalg.norm(delta_sum, axis=1, keepdims=True)
        direction_vectors = np.divide(
            delta_sum,
            delta_magnitude,
            out=np.zeros_like(delta_sum), # Use zeros where magnitude is zero to avoid NaNs
            where=delta_magnitude != 0,
        )

        # Calculate the base force magnitude based on total overlap (sum of IoUs)
        base_force_magnitude = np.sum(iou, axis=1)
        force_vectors = base_force_magnitude[:, np.newaxis] * direction_vectors

        # Apply size-based scaling if enabled
        if consider_size:
             force_vectors *= size_factors[:, np.newaxis]

        # Apply the general force scale
        force_vectors *= force_scale

        # Ensure minimum force for small overlaps to guarantee separation
        current_force_magnitudes = np.linalg.norm(force_vectors, axis=1, keepdims=True)
        small_force_mask = (current_force_magnitudes > 0) & (current_force_magnitudes < min_force_magnitude)

        if np.any(small_force_mask):
             # Rescale small force vectors to have the minimum magnitude
             force_directions_for_small = force_vectors / np.where(
                 current_force_magnitudes > 0, current_force_magnitudes, 1
             )
             force_vectors = np.where(
                 small_force_mask, force_directions_for_small * min_force_magnitude, force_vectors
             )

        # Convert displacement vectors to integers for pixel-based movement
        force_vectors = force_vectors.astype(int)

        # Apply forces to update box positions (shift both corners by the same vector)
        xyxy_padded[:, [0, 1]] += force_vectors
        xyxy_padded[:, [2, 3]] += force_vectors

    # Remove the padding before returning
    return pad_boxes(xyxy_padded, px=-1)

@SkalskiP (Collaborator)

Hi @hidara2000, sorry it took me a while to get back to you. I'm currently juggling work across 3–4 repositories, so my time is a bit stretched. I’ve now gone through your PR carefully and you’ve done an excellent job—really impressive work! Don’t be discouraged by the number of comments I left—they’re all meant to help polish things up. Once we merge this PR, it’ll take Supervision’s text annotators to the next level!

@hidara2000 (Author)

> Hi @hidara2000, sorry it took me a while to get back to you. I'm currently juggling work across 3–4 repositories, so my time is a bit stretched. I’ve now gone through your PR carefully and you’ve done an excellent job—really impressive work! Don’t be discouraged by the number of comments I left—they’re all meant to help polish things up. Once we merge this PR, it’ll take Supervision’s text annotators to the next level!

I appreciate you going through it, and I agree with all the comments. Changes made as per advice and results from test below.

[Updated test result images attached]

@hidara2000 requested a review from SkalskiP April 25, 2025 09:27