MemryX MX3 detector integration #17723
base: dev

Conversation
memryx-support-added
Hi, thank you for the PR!
I have left a handful of notes, please let me know if you have any questions or concerns. We will also need to get the documentation updated and added here as well.
```python
self.detector_config,
),
)
if self.detector_config.type == "memryx":
```
I think it would be good to refactor this to not have a separate async_run_detector target and instead move the async logic to the detector itself. See the Hailo detector for an example of how to use the ResponseModelStore and RequestModelStore
Hi, we've looked this over but are having trouble understanding how the Response/RequestModelStore would fit our use case with run_detector. For good performance, our chip needs fully separate send and receive threads: i.e., it keeps sending inputs without waiting for any outputs in between. In run_detector, the main loop consists of:
```python
while not stop_event.is_set():
    ...
    input_frame = frame_manager.get(connection_id, 1 ... )
    ...
    detections = object_detector.detect_raw(input_frame)
```
So in this loop, for fully separate send/receive, we would need multiple input_frames to be pushed to detect_raw (thus multiple loop iterations) before getting any detection results back.
Would Response/RequestModelStore support this, and if so do you have any quick info on them? No need for anything extensive but some pointers would be much appreciated.
It appeared to us in hailo8l.py that detect_raw was still sending one frame, waiting for the inference result, and then returning the output, thus blocking run_detector's loop from advancing to the next iteration?
Apologies again that we're having trouble with this, and thanks for your help!
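The fully separate send/receive pattern described above can be sketched with plain Python threads and a FIFO queue. This is a minimal illustration with hypothetical names (FakeAsyncDetector is a stand-in, not the real MemryX SDK or Frigate API):

```python
import queue
import threading

# Hypothetical stand-in for an async accelerator pipeline: inputs are
# accepted immediately (send never blocks on inference) and results come
# back later, in the same FIFO order the inputs were sent.
class FakeAsyncDetector:
    def __init__(self):
        self._in_flight = queue.Queue()

    def send_input(self, frame):
        self._in_flight.put(frame)  # returns immediately

    def receive_output(self):
        frame = self._in_flight.get()  # blocks until the oldest frame is "done"
        return ("detections", frame)

detector = FakeAsyncDetector()
results = []

def sender():
    # Keep pushing inputs without waiting for any outputs in between.
    for frame_id in range(5):
        detector.send_input(frame_id)

def receiver():
    for _ in range(5):
        results.append(detector.receive_output())

threads = [threading.Thread(target=sender), threading.Thread(target=receiver)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # FIFO order is preserved: frame 0's result first, then 1, ...
```

The point of the sketch is that the sender can run several loop iterations ahead of the receiver, keeping multiple frames in flight at once.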
Yes, that is correct: only one input image will be sent at a time. The problem is that it is an absolute requirement that the results returned exactly match the input image they came from. From what I understand of this code, there is currently no guarantee that this is the case. So with many cameras it is possible for a race condition where two frames get sent at the same time and then the results get mixed up. This would be the primary concern.
If the separate async function is needed then I think that is reasonable, though in that case it would be good to move the duplicated logic of the run_detector* functions into a separate function like prepare_detector.
The Result / Response store is a good way to ensure that matching requests and results are correctly paired together.
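The pairing idea can be sketched as follows. This is a hypothetical illustration of the mechanism; the actual RequestModelStore/ResponseModelStore API in Frigate may differ:

```python
import threading

# Hypothetical result store: each in-flight request gets its own event and
# slot, so a result can only ever be delivered to the request id it belongs
# to, even when many cameras have frames in flight at once.
class ResultStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._slots = {}

    def register(self, request_id):
        with self._lock:
            self._slots[request_id] = {"event": threading.Event(), "result": None}

    def put(self, request_id, result):
        with self._lock:
            slot = self._slots[request_id]
        slot["result"] = result
        slot["event"].set()

    def get(self, request_id, timeout=None):
        with self._lock:
            slot = self._slots[request_id]
        slot["event"].wait(timeout)
        with self._lock:
            del self._slots[request_id]
        return slot["result"]

store = ResultStore()
store.register("cam-front")
store.register("cam-back")
# Results may arrive in any order; pairing by id is still correct.
store.put("cam-back", [0.9])
store.put("cam-front", [0.5])
print(store.get("cam-front"))  # [0.5]
print(store.get("cam-back"))   # [0.9]
```

Unlike a single FIFO queue, this keys results by request id, so correctness does not depend on the hardware preserving submission order.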
Thanks, we do guarantee that the order pushed is the order received: the MemryXDetector->send_input function pushes to our internal capture_queue, then the chip triggers execution of the process_input callback and pops from the queue. The chip will then trigger the output callback (receive_output) for frames in the same order in which they went in. Alongside the frames, we follow a similar system of queues for the connection_ids.
So as long as there is only one run_detector instance running (which is our understanding -- please correct us if wrong), there should be a guaranteed ordering.
If you prefer, we could move the connection_id queue out of the MemryX detector into the base detector? Then it would be clearer where synchronization between detector_worker and result_worker happens.
(Edit: or try replacing the connection_id queue with Result/Response if you'd prefer that too)
> So as long as there is only one run_detector instance running (which is our understanding -- please correct if wrong), there should be a guaranteed ordering.

There is only one per detector. So, a user could try to do:
```yaml
detectors:
  memx0:
    type: memryx
  memx1:
    type: memryx
```
But I don't know if that would be supported on y'all's hardware anyway. Neither Coral nor Hailo supports more than one process communicating with a single device at a time.
> (Edit: or try replace the connection_id queue with Result/Response if you'd prefer that too)
I think this would be the most preferred, since it would be more consistent and we could also look at expanding other detectors to use that in the future if they support async architecture like this.
Hmmmm, that multiple-detectors idea is very interesting.
There are two possible scenarios for such a config:

1. We can support multiple detectors if there are multiple accelerators connected, by changing the group_id parameter to the AsyncAccl constructor in the detector object. So the memx0 detector would use accelerator 0, memx1 uses accelerator 1, and so on. We hadn't considered that in the code in this PR, but it should be fairly quick to implement -- we'll look into adding this!
2. We also support multiple processes using the same accelerator (we call this 'Shared' mode), but this would actually result in worse performance than a single detector process per accelerator (and it's C++ only, until SDK 1.3), so we'd avoid adding that as an option and instead try to catch it as an invalid detector config.
So to clarify the plan around async_run_detector:

- We can keep an async_run_detector function
- We'll split common code into prepare_detector
- We'll use Request/ResponseModelStore to track IDs (as part of async_run_detector), instead of a connection_id queue inside the MemryXDetector class

Is this understanding correct?
> There's two possible scenarios for such a config:

Possibility 1 sounds like a good idea; you could use PCIe:0 / PCIe:1 like the Coral does. For possibility 2, I don't think this is something that we would want either, so let's definitely hold off on that for the foreseeable future.

> So to clarify the plan around async_run_detector:

Yes, that all sounds good to me. Thank you for being responsive on this front.
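Under that scheme, possibility 1 might be expressed in config along these lines. This is a hypothetical sketch: the device key and PCIe:N strings mirror the Coral-style suggestion above and are not part of this PR:

```yaml
detectors:
  memx0:
    type: memryx
    device: PCIe:0   # hypothetical: first MX3 accelerator
  memx1:
    type: memryx
    device: PCIe:1   # hypothetical: second MX3 accelerator
```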
```python
else:
    processed_input = tensor_input.astype(np.float32) / 255.0  # Normalize
    # Assuming original input is always NHWC and MemryX wants HWNC:
    processed_input = processed_input.transpose(1, 2, 0, 3)  # NHWC -> HWNC
```
This already exists as the model configuration parameter input_tensor; I think we should use that instead of hardcoding this logic here.
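For illustration, the layout handling could be driven by the configured input_tensor value instead of being hardcoded. This is a sketch with assumed layout names (nhwc, nchw, hwnc); the exact values Frigate's model config accepts may differ:

```python
import numpy as np

# Map a configured input layout to the axis permutation that converts
# Frigate's NHWC tensor into the layout the model expects.
# The layout names here are illustrative, not Frigate's actual enum.
LAYOUT_PERMUTATIONS = {
    "nhwc": (0, 1, 2, 3),  # no change
    "nchw": (0, 3, 1, 2),  # NHWC -> NCHW
    "hwnc": (1, 2, 0, 3),  # NHWC -> HWNC (what the MX3 path above wants)
}

def to_model_layout(tensor_input: np.ndarray, input_tensor: str) -> np.ndarray:
    """Transpose an NHWC tensor into the configured model layout."""
    return tensor_input.transpose(LAYOUT_PERMUTATIONS[input_tensor])

frame = np.zeros((1, 320, 320, 3), dtype=np.uint8)  # NHWC
print(to_model_layout(frame, "hwnc").shape)  # (320, 320, 1, 3)
```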
```python
    concatenated_img = np.concatenate([x0, x1, x2, x3], axis=2)
    processed_input = concatenated_img.astype(np.float32)
else:
    processed_input = tensor_input.astype(np.float32) / 255.0  # Normalize
```
Similar to below, this is already supported as the model config input_dtype; we should use that, potentially adding a new float32_norm type. That way this else branch won't be needed at all.
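As a sketch of what dtype-driven conversion could look like (a hypothetical helper, not Frigate's actual implementation; note that per the follow-up comment, Frigate's existing float input_dtype already implies normalization, so a separate float32_norm type may not be needed):

```python
import numpy as np

# Hypothetical helper: drive the conversion from the configured input_dtype
# instead of hardcoding it inside the MemryX plugin's preprocessing branch.
def convert_input(tensor_input: np.ndarray, input_dtype: str) -> np.ndarray:
    if input_dtype == "float":
        # uint8 pixels scaled into [0.0, 1.0]
        return tensor_input.astype(np.float32) / 255.0
    # pass through untouched (e.g. integer-input models)
    return tensor_input

pixels = np.array([[0, 128, 255]], dtype=np.uint8)
out = convert_input(pixels, "float")
print(out.dtype, out.max())  # float32 1.0
```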
The main float input_dtype is actually already normalized.

Thanks Nick! We will work to address these points and update the PR.
Proposed change
This PR adds support for the MemryX MX3 AI accelerator as an option for object detectors (SDK link).
We've included options for the following models:
We've added a section to main/Dockerfile that installs necessary dependencies inside the container and downloads "DFP" files for each model (this is what we call compiled models). Note that these are compiled from their upstream sources (OpenVINO and Ultralytics), not retrained or quantized, as the MX3 runs models with floating-point math. We've tested the containers across x86 systems, Raspberry Pi, and Orange Pi.
Now to summarize the changed/added files:
Setup Scripts & Dockerfile
New File: memryx/user_installation.sh
Purpose: Installs required MX3 drivers and core libraries on the container host machine.
Note: Assumes a Debian/Ubuntu host (or derivative distros such as Raspberry Pi OS and Armbian).
Dockerfile Modifications: main/Dockerfile
Added a section to install dependencies (core libraries + Python pip packages) and download DFPs.
Frigate Additions/Modifications
New File: frigate/detector/plugins/memryx.py
Implements the detector using the MX3 for the currently supported models. Called by async_run_detector.

Modified File: frigate/object_detection/base.py
Adds an asynchronous async_run_detector() function. This function replaces the blocking call to detect_raw with two asynchronous Python threads: one that calls the detector's send_input and another that calls receive_output. This allows a purely async architecture such as the MX3 to reach maximum FPS. async_run_detector() was made to be agnostic to the MemryX detector and potentially useful for others, while MemryX-specific functionality is kept separate in the detector plugin file.

Summary
This adds support for the MemryX MX3 and multiple object detection models.
Please let us know if there are any changes you would like to see, as we're very excited to be added to Frigate!
Type of change
Checklist
- The code has been formatted using Ruff (ruff format frigate)