
Pose detect_for_video with GPU delegate crashes when segmentation enabled on Mac #5788

@demirhere

Description


Have I written custom code (as opposed to using a stock example script provided in MediaPipe)

Yes

OS Platform and Distribution

macOS Sequoia 15.1.1 (Apple M3)

MediaPipe Tasks SDK version

3.10.0

Task name (e.g. Image classification, Gesture recognition etc.)

Pose landmarker

Programming Language and version (e.g. C++, Python, Java)

Python 3.10

Describe the actual behavior

Detection crashes the application when segmentation is enabled. If I turn off the segmentation mask, detect_for_video runs fine. The CPU delegate runs fine in all cases.

Describe the expected behaviour

It should not crash.

Standalone code/steps you may have used to try to get what you need

Basic pose detection code using the latest API, with GPU as the delegate on a Mac M3; a minimal repro sketch follows.
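
A minimal sketch of the repro (the model and frame paths are placeholders, not from my actual setup):

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

base_options = python.BaseOptions(
    model_asset_path="pose_landmarker.task",  # placeholder model path
    delegate=python.BaseOptions.Delegate.GPU,
)
options = vision.PoseLandmarkerOptions(
    base_options=base_options,
    output_segmentation_masks=True,  # the crash only occurs when this is True
    running_mode=vision.RunningMode.VIDEO,
)
landmarker = vision.PoseLandmarker.create_from_options(options)

mp_image = mp.Image.create_from_file("frame.png")  # placeholder frame
landmarker.detect_for_video(mp_image, 0)  # crashes here with the log below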

Other info / Complete Logs

WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1734586092.124030 9722600 gl_context.cc:369] GL version: 2.1 (2.1 Metal - 89.3), renderer: Apple M3
INFO: Created TensorFlow Lite delegate for Metal.
2024-12-18 21:28:12 - root - INFO - Model loaded successfully
W0000 00:00:1734586092.236510 9722705 landmark_projection_calculator.cc:186] Using NORM_RECT without IMAGE_DIMENSIONS is only supported for the square ROI. Provide IMAGE_DIMENSIONS or use PROJECTION_MATRIX.
E0000 00:00:1734586092.239868 9722700 shader_util.cc:99] Failed to compile shader:
 1 #version 330 
 2 #ifdef GL_ES 
 3 #define DEFAULT_PRECISION(p, t) precision p t; 
 4 #else 
 5 #define DEFAULT_PRECISION(p, t) 
 6 #define lowp 
 7 #define mediump 
 8 #define highp 
 9 #endif  // defined(GL_ES) 
10 #if __VERSION__ < 130
11 #define in attribute
12 #define out varying
13 #endif  // __VERSION__ < 130
14 in vec4 position; in mediump vec4 texture_coordinate; out mediump vec2 sample_coordinate; void main() { gl_Position = position; sample_coordinate = texture_coordinate.xy; }
E0000 00:00:1734586092.239882 9722700 shader_util.cc:106] Error message: ERROR: 0:1: '' :  version '330' is not supported

E0000 00:00:1734586092.239922 9722600 calculator_graph.cc:928] INTERNAL: CalculatorGraph::Run() failed: 
Calculator::Process() for node "mediapipe_tasks_vision_pose_landmarker_poselandmarkergraph__mediapipe_tasks_vision_pose_landmarker_multipleposelandmarksdetectorgraph__mediapipe_tasks_vision_pose_landmarker_singleposelandmarksdetectorgraph__TensorsToSegmentationCalculator" failed: ; RET_CHECK failure (mediapipe/calculators/tensor/tensors_to_segmentation_converter_metal.cc:217) upsample_program_Problem initializing the program.

This happens when detect_for_video is called.
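
Per the log, the segmentation path's Metal converter (tensors_to_segmentation_converter_metal.cc) tries to compile a #version 330 upsample shader inside a GL context that only reports version 2.1, so the program fails to initialize. Until that is fixed, a stopgap (my assumption, not a confirmed fix) is to re-enable the commented-out check in load_model below, so the GPU delegate is only used on macOS when segmentation is off:

# Stopgap sketch: prefer CPU whenever segmentation masks are requested on
# macOS; use the GPU delegate only when they are not (mirrors the
# commented-out condition in load_model below).
delegate = python.BaseOptions.Delegate.CPU
if platform.system() == "Darwin" and not self.enable_segmentation:
    delegate = python.BaseOptions.Delegate.GPU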

    # Imports assumed by this excerpt (the enclosing detector class and its
    # __init__ are omitted from the report): logging, platform, cv2,
    # numpy as np, mediapipe as mp, `from mediapipe.tasks import python`,
    # and `from mediapipe.tasks.python import vision`.
    def load_model(self):
        logging.info(f"Loading MediaPipe Pose model from {self.model_path}")

        # No GPU support on Windows yet.
        # Why no GPU with segmentation on Mac?
        delegate = python.BaseOptions.Delegate.CPU

        # if platform.system() == "Darwin" and not self.enable_segmentation:
        if platform.system() == "Darwin":
            delegate = python.BaseOptions.Delegate.GPU

        base_options = python.BaseOptions(
            model_asset_path=self.model_path,
            delegate=delegate
        )

        options = vision.PoseLandmarkerOptions(
            base_options=base_options,
            num_poses=self.num_poses,
            min_pose_detection_confidence=self.min_pose_detection_confidence,
            min_pose_presence_confidence=self.min_pose_presence_confidence,
            min_tracking_confidence=self.min_tracking_confidence,
            output_segmentation_masks=self.enable_segmentation,
            running_mode=vision.RunningMode.VIDEO
        )
        self.pose_detector = vision.PoseLandmarker.create_from_options(options)
        logging.info("Model loaded successfully")

    def unload_model(self):
        """Unload the pose detection model and free resources."""
        logging.info("Unloading MediaPipe Pose model")
        if self.pose_detector is not None:
            self.pose_detector.close()
            self.pose_detector = None
        logging.info("Model unloaded successfully")


    def pose(self, image, timestamp_ms):
        if self.pose_detector is None:
            self.load_model()

        mp_image = mp.Image(image_format=mp.ImageFormat.SRGBA, data=cv2.cvtColor(image, cv2.COLOR_BGR2RGBA))
        detection_result = self.pose_detector.detect_for_video(mp_image, int(round(timestamp_ms)))
        result = {
            'poses': [],
            'world_poses': [],
            'segmentation_mask': None
        }
        
        if detection_result.pose_landmarks:
            for idx, (pose_landmarks, world_landmarks) in enumerate(zip(
                detection_result.pose_landmarks, 
                detection_result.pose_world_landmarks)):
                
                keypoint_dict = {}
                world_keypoint_dict = {}
                
                for i, landmark_name in enumerate(self.KEYPOINT_NAMES):
                    # Normalized coordinates (x and y are mirrored here)
                    point = pose_landmarks[i]
                    keypoint_dict[landmark_name] = {
                        'x': 1.0 - point.x,
                        'y': 1.0 - point.y,
                        'z': point.z,
                        'confidence': point.visibility
                    }
                    
                    # World coordinates
                    world_point = world_landmarks[i]
                    world_keypoint_dict[landmark_name] = {
                        'x': world_point.x,
                        'y': world_point.y,
                        'z': world_point.z,
                        'confidence': world_point.visibility
                    }
                
                result['poses'].append(keypoint_dict)
                result['world_poses'].append(world_keypoint_dict)
            
            if self.enable_segmentation and detection_result.segmentation_masks:
                # Initialize a combined mask with zeros
                combined_mask = np.zeros_like(detection_result.segmentation_masks[0].numpy_view(), dtype=np.float32)

                for mask in detection_result.segmentation_masks:
                    combined_mask += mask.numpy_view()

                # Normalize the combined mask to [0, 65535], guarding
                # against division by zero when no pixels are segmented
                peak = combined_mask.max()
                if peak > 0:
                    combined_mask = combined_mask / peak * 65535
                result['segmentation_mask'] = combined_mask.astype(np.uint16)
        
        return result
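
For reference, a sketch of how these methods are driven over a video (the class name and file path here are assumptions for illustration):

import cv2

detector = PoseDetector()  # hypothetical name for the enclosing class
cap = cv2.VideoCapture("input.mp4")  # placeholder video

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # detect_for_video expects monotonically increasing timestamps
    result = detector.pose(frame, cap.get(cv2.CAP_PROP_POS_MSEC))

cap.release()
detector.unload_model()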

Labels

gpu (MediaPipe GPU related issues), os:macOS (Issues on MacOS), platform:python (MediaPipe Python issues), task:pose landmarker (Issues related to Pose Landmarker)
