MemryX MX3 detector integration #17723

Open · wants to merge 85 commits into base: dev

Conversation

@tim-memryx commented Apr 15, 2025

Proposed change

This PR adds support for the MemryX MX3 AI accelerator as an option for object detectors (SDK link).

We've included options for the following models:

  • YOLOv9s (640)
  • YOLOX (640)
  • SSDLite-MobileNetV2 (320)
  • YOLO-NAS (320)

We've added a section to the main/Dockerfile that installs the necessary dependencies inside the container and downloads the "DFP" files for each model ("DFP" is our name for a compiled model). Note that these DFPs are compiled from their upstream sources (OpenVINO and Ultralytics**), not retrained or quantized, since the MX3 runs models with floating-point math.

**Ultralytics note: we do not use any Python code, PyTorch source, or even ONNX files from Ultralytics in this Frigate integration; only the weights exist, in the form of our compiled DFPs. If that becomes an issue, we can remove the DFP download commands, or make whatever other changes you would advise.

We've tested the containers across x86 systems, Raspberry Pi, and Orange Pi.


Now to summarize the changed/added files:

Setup Scripts & Dockerfile

New File: memryx/user_installation.sh

  • Purpose:
    Installs required MX3 drivers and core libraries on the container host machine.

  • Note:
    Assumes a Debian/Ubuntu host (or their derivative distros such as Raspberry Pi OS and Armbian).

Dockerfile Modifications: main/Dockerfile

  • Updates:
    Added section to install dependencies (core libraries + Python pip packages) and download DFPs.

Frigate Additions/Modifications

New File: frigate/detectors/plugins/memryx.py

  • Purpose:
    Implements the detector using MX3 for the currently supported models. Called by async_run_detector.

Modified File: frigate/object_detection/base.py

  • Asynchronous async_run_detector() function:
    This function replaces the blocking call to detect_raw with two asynchronous Python threads: one that calls the detector's send_input and another that calls receive_output. This allows a purely async architecture such as the MX3 to reach maximum FPS.

    The async_run_detector() was made to be agnostic to the MemryX detector and potentially useful for others, while MemryX-specific functionality is kept separate in the detector plugin file.
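
To make the threading structure concrete, here is a minimal sketch of the two-thread pattern described above. It is illustrative only: the function name run_async_detector and the detector/frame_source/publish_results parameters are placeholders, not Frigate's actual async_run_detector signature.

import threading


def run_async_detector(detector, frame_source, publish_results, stop_event):
    """Feed frames and collect results on separate threads so the accelerator
    pipeline is never starved waiting for outputs."""

    def sender():
        # Keep pushing frames without waiting for results in between.
        while not stop_event.is_set():
            connection_id, frame = frame_source()
            if frame is not None:
                detector.send_input(connection_id, frame)

    def receiver():
        # Results come back in the same order the inputs were sent.
        while not stop_event.is_set():
            connection_id, detections = detector.receive_output()
            publish_results(connection_id, detections)

    threads = [
        threading.Thread(target=sender, daemon=True),
        threading.Thread(target=receiver, daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()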


Summary

This adds support for the MemryX MX3 detector along with multiple object detection models.

Please let us know if there are any changes you would like to see, as we're very excited to be added to Frigate!


Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code
  • Documentation Update

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • The code has been formatted using Ruff (ruff format frigate)
    • We ran ruff on just the new/modified files


@NickM-27 (Collaborator) left a comment:

Hi, thank you for the PR!

I have left a handful of notes; please let me know if you have any questions or concerns. We will also need the documentation updated and added here as well.

self.detector_config,
),
)
if self.detector_config.type == "memryx":

@NickM-27 (Collaborator) commented:

I think it would be good to refactor this to not have a separate async_run_detector target and instead move the async logic to the detector itself. See the Hailo detector for an example of how to use the ResponseModelStore and RequestModelStore

@tim-memryx (Author) replied:

Hi, we've looked this over but are having trouble understanding how Response/RequestModelStore with run_detector would fit our use case. For good performance, our chip needs to have fully separate send threads and receive threads: i.e., keep sending inputs without waiting for any outputs in between.

In run_detector, the main loop consists of:

while not stop_event.is_set():
    ...
    input_frame = frame_manager.get(connection_id, 1 ... )
    ...
    detections = object_detector.detect_raw(input_frame)

So in this loop for fully separate send/receive, we would need multiple input_frames to be pushed to detect_raw (thus multiple loop iterations) before getting any detection results back.

Would Response/RequestModelStore support this, and if so do you have any quick info on them? No need for anything extensive but some pointers would be much appreciated.

It appeared to us in hailo8l.py that detect_raw was still sending 1 frame -> waiting for inference result -> returning output, thus blocking the run_detector's loop from advancing to the next iteration?

Apologies again that we're having trouble with this, and thanks for your help!

@NickM-27 (Collaborator) commented Apr 23, 2025:

Yes, it is correct that only one input image will be sent at a time. The problem is that it is an absolute requirement that the results returned exactly match the input image they came from. From what I understand of this code, there is currently no guarantee that this is the case, so with many cameras it is possible that two frames get sent at the same time and the results get mixed up. This would be the primary concern.

If the separate async function is needed then I think that is reasonable, though in that case it would be good to move the duplicated logic of the run_detector* functions into a separate function like prepare_detector.

@NickM-27 (Collaborator) commented:

The Result / Response store is a good way to ensure that the matching requests and results are correctly paired together

@tim-memryx (Author) commented Apr 23, 2025:

Thanks, we do guarantee that the order pushed is the order received: the MemryXDetector.send_input function pushes to our internal capture_queue, then the chip triggers the process_input callback, which pops from the queue. The chip then triggers the output callback (receive_output) for frames in the same order in which they went in. Alongside the frames, we keep a similar queue for the connection_ids.

So as long as there is only one run_detector instance running (which is our understanding -- please correct if wrong), there should be a guaranteed ordering.

If you prefer, we could move the connection_id queue out of the MemryX detector into the base detector? Then it would be clearer where synchronization between detector_worker and result_worker happens.

(Edit: or try replacing the connection_id queue with Result/Response if you'd prefer that too)
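
To illustrate the ordering argument above, here is a hedged sketch of the FIFO pairing between frames and connection_ids. The queue names mirror the wording in this thread (capture_queue, a parallel connection_id queue) but are not necessarily the exact identifiers used in memryx.py.

import queue

capture_queue = queue.Queue()        # frames waiting for the accelerator
connection_id_queue = queue.Queue()  # ids enqueued in the same order as frames


def send_input(connection_id, frame):
    # Called by the send thread: remember whose frame this is, then queue it.
    connection_id_queue.put(connection_id)
    capture_queue.put(frame)


def process_input():
    # Invoked by the chip when it is ready for the next frame.
    return capture_queue.get()


def receive_output(detections):
    # Invoked by the chip in the same order frames were consumed, so pairing
    # with the oldest queued id preserves the input ordering.
    connection_id = connection_id_queue.get()
    return connection_id, detections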

@NickM-27 (Collaborator) commented:

So as long as there is only one run_detector instance running (which is our understanding -- please correct if wrong), there should be a guaranteed ordering.

there is only one per detector. So, a user could try to do:

detectors:
  memx0:
    type: memryx
  memx1:
    type: memryx

but I don't know if that would be supported on y'all's hardware anyway. Neither Coral nor Hailo supports more than one process communicating with a single device at a time.

(Edit: or try replacing the connection_id queue with Result/Response if you'd prefer that too)

I think this would be the most preferred, since it would be more consistent and we could also look at expanding other detectors to use that in the future if they support async architecture like this.

@tim-memryx (Author) commented Apr 23, 2025:

Hmmmm, that multiple detectors idea is very interesting.

There are two possible scenarios for such a config:

  • We can support multiple detectors if there are multiple accelerators connected, by changing the group_id parameter to the AsyncAccl constructor in the detector object. So the memx0 detector would use accelerator 0, memx1 uses accelerator 1, and so on.
    We hadn't considered that in the code in this PR, but it should be fairly quick to implement -- we'll look into adding this! (A rough sketch follows this list.)
  • We also support multiple processes using the same accelerator (we call this 'Shared' mode), but this would actually result in worse performance than a single detector process per accelerator (and it's C++ only, until SDK 1.3), so we'd avoid adding that as an option and instead try to catch it as invalid detector config.
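
A rough sketch of the idea in the first bullet above, assuming the AsyncAccl constructor accepts the group_id parameter as described; the device_index argument and open_accelerator helper are hypothetical, not part of this PR:

from memryx import AsyncAccl


def open_accelerator(dfp_path: str, device_index: int = 0) -> AsyncAccl:
    # memx0 -> group_id 0, memx1 -> group_id 1, and so on, analogous to the
    # Coral's PCIe:0 / PCIe:1 device strings.
    return AsyncAccl(dfp_path, group_id=device_index)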

So to clarify the plan around async_run_detector:

  • We can keep an async_run_detector function
  • We'll split common code to prepare_detector
  • We'll use Request/ResponseModelStore to track IDs (part of async_run_detector), instead of a connection_id queue inside the MemryXDetector class

Is this understanding correct?

@NickM-27 (Collaborator) commented:

There are two possible scenarios for such a config:

Possibility 1 sounds like a good idea; you could use PCIe:0 / PCIe:1 like the Coral does.

for possibility 2, I don't think this is something that we would want either, so let's definitely hold off on that for the foreseeable future

So to clarify the plan around async_run_detector:

yes, that all sounds good to me. Thank you for being responsive on this front.

else:
    processed_input = tensor_input.astype(np.float32) / 255.0  # Normalize
    # Assuming original input is always NHWC and MemryX wants HWNC:
    processed_input = processed_input.transpose(1, 2, 0, 3)  # NHWC -> HWNC

@NickM-27 (Collaborator) commented Apr 15, 2025:

This already exists as the model configuration parameter input_tensor; I think we should use that instead of hardcoding the logic here.
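
As a rough illustration of that suggestion (not the actual memryx.py code): branch on the configured layout instead of assuming NHWC. InputTensorEnum is assumed here to be the layout enum from frigate/detectors/detector_config.py, and to_hwnc is a hypothetical helper.

import numpy as np

from frigate.detectors.detector_config import InputTensorEnum


def to_hwnc(tensor_input: np.ndarray, input_tensor: InputTensorEnum) -> np.ndarray:
    # Convert the layout Frigate hands the detector into the HWNC layout the
    # MX3 expects, driven by the model config rather than a hard-coded NHWC
    # assumption.
    if input_tensor == InputTensorEnum.nhwc:
        return tensor_input.transpose(1, 2, 0, 3)  # NHWC -> HWNC
    return tensor_input.transpose(2, 3, 0, 1)  # NCHW -> HWNC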

    concatenated_img = np.concatenate([x0, x1, x2, x3], axis=2)
    processed_input = concatenated_img.astype(np.float32)
else:
    processed_input = tensor_input.astype(np.float32) / 255.0  # Normalize

@NickM-27 (Collaborator) commented:

Similar to below, this is already supported as the model config input_dtype; we should use that, potentially adding a new float32_norm type. That way this else branch won't be needed at all.

@NickM-27 (Collaborator) commented:

The main float input_dtype is actually already normalized
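
For context, a hedged sketch of what relying on the existing input_dtype config could look like (assuming Frigate's InputDTypeEnum with int/float values; a float32_norm type as suggested above would be a new addition). The prepare_dtype helper is hypothetical:

import numpy as np

from frigate.detectors.detector_config import InputDTypeEnum


def prepare_dtype(tensor_input: np.ndarray, input_dtype: InputDTypeEnum) -> np.ndarray:
    # With input_dtype set to float, Frigate already hands the detector a
    # normalized float tensor, so no extra division by 255 is needed.
    if input_dtype == InputDTypeEnum.float:
        return tensor_input.astype(np.float32)
    # int input: cast and normalize in the detector instead.
    return tensor_input.astype(np.float32) / 255.0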

@tim-memryx (Author) commented:

Thanks, Nick! We will work to address these points and update the PR.
