This is the best standalone reference for class-free interactive tracking based on segmentation plus DINO similarity. Use it when you need click-to-select tracking without predefined classes and want the backend to own the tracking state while the frontend controls visualization.
- You need interactive object selection from the live stream.
- You want similarity heatmaps or bbox outputs driven by DINO embeddings.
- You need a strong reference for backend/frontend state synchronization.
- You want a custom standalone app where the only published stream is a rendered video output.
Do not use it when:
- You need predefined-class object detection.
- You need dataset snapping or measurement rather than tracking.
- You need host/peripheral support instead of standalone-only RVC4 packaging.
- You need a minimal tracker without custom services and custom host-side logic.
Category: apps/dino-tracking
Shape: frontend
Primary task: interactive similarity-based tracking from FastSAM masks and DINO embeddings
Entrypoint: backend/src/main.py
Standalone path: oakapp.toml
Frontend: frontend/src/App.tsx
Runs on: RVC4 standalone only
Requires: RVC4 device; static frontend build; model downloads for FastSAM and DINO
Input: live RGB camera plus click prompts from the frontend
Output: one encoded Video stream with overlays
Models: luxonis/fastsam-s:512x288 and luxonis/dinov3-backbone:convnext-small-640x480, from backend/src/constants/nn.yaml
Visualizer / UI: custom static frontend
- backend/src/main.py: full pipeline and service registration
- backend/src/constants/camera.yaml: camera resolution and FPS defaults
- backend/src/constants/nn.yaml: model names and input sizes
- backend/src/constants/reference_adaptation.yaml: adaptive-reference tuning
- backend/src/object_selection/mask_selection_node.py: mask selection from click prompts
- backend/src/dino_similarity/reference_vectors/reference_vector_from_selection_node.py: initial reference extraction
- backend/src/dino_similarity/reference_vectors/adaptive_reference_vector_node.py: online reference adaptation
- backend/src/detections_tracking/heatmap_to_detections_node.py: bbox extraction from similarity heatmaps
- backend/src/annotations/detections_annotation_overlay_node.py: final overlay rendering
- backend/src/FE_state_synchronization_service.py: backend-to-frontend state export
- frontend/src/App.tsx: click handling and UI state restore
- frontend/src/AnnotationModeSelector.tsx: heatmap versus bbox mode
- frontend/src/OutlinesToggle.tsx: FastSAM outline toggle
- oakapp.toml: static frontend build and backend packaging
- One RGB camera feeds three branches: display, segmentation, and DINO embedding extraction.
- FastSAM segmentation produces masks.
- A frontend click chooses the target mask.
- DINO grid features plus the selected mask produce a reference vector.
- An adaptive-reference node updates that target representation over time.
- Similarity heatmaps are converted into detections, tracked, and rendered as overlays.
- The frontend controls annotation mode, threshold, and outline visibility through services.
RGB -> FastSAM parsing NN -> mask selection
RGB -> DINO NN -> grid features
mask selection + grid features -> reference vector -> adaptive reference
adaptive reference + grid features + RGB -> similarity heatmap
similarity heatmap -> detections -> tracker -> overlay -> encoded Video
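The similarity path in the flow above (mask plus grid features to reference vector, then heatmap, then bbox) can be sketched as follows. This is a minimal, dependency-free illustration and not the code in the backend nodes. It assumes that the selected mask has already been resampled to the DINO grid resolution, that the reference vector is a masked average of grid features, that similarity is cosine similarity, and that a single box around all above-threshold cells is enough. All function names here are hypothetical.

```python
import math


def reference_vector(grid, mask):
    """Average the feature vectors of all grid cells covered by the mask."""
    dim = len(grid[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(grid, mask):
        for feat, inside in zip(feat_row, mask_row):
            if inside:
                total = [t + f for t, f in zip(total, feat)]
                count += 1
    if count == 0:
        raise ValueError("mask covers no grid cells")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def similarity_heatmap(grid, ref):
    """Compare every grid cell against the reference vector."""
    return [[cosine(feat, ref) for feat in row] for row in grid]


def bbox_from_heatmap(heatmap, threshold):
    """Return (x_min, y_min, x_max, y_max) in grid cells, or None if nothing passes."""
    hits = [
        (x, y)
        for y, row in enumerate(heatmap)
        for x, value in enumerate(row)
        if value >= threshold
    ]
    if not hits:
        return None
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    return (min(xs), min(ys), max(xs), max(ys))
```

In this sketch the threshold plays the role of the frontend-controlled threshold: raising it shrinks the set of cells that contribute to the box, and a value above every similarity score yields no detection at all.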
Safe to change: frontend labels, threshold defaults, annotation mode defaults, outline toggle behavior, YAML constants
Requires care: click-coordinate handling, mask-to-reference-vector logic, adaptive-reference tuning, tracker input/output assumptions
Likely to break if changed blindly: backend/frontend state sync, the one-stream rendered-output design, or the relationship between heatmap and bbox modes
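Click-coordinate handling is listed as requiring care because a click arrives in frontend coordinates while masks live at the segmentation resolution. The sketch below shows one way that mapping could work. It assumes the frontend sends clicks normalized to the 0..1 range and that overlapping masks are resolved by picking the smallest one. Both are illustrative assumptions, and the function names are hypothetical rather than taken from mask_selection_node.py.

```python
def click_to_pixel(nx, ny, width, height):
    """Map a normalized click (0..1) to integer pixel coordinates, clamped to the frame."""
    x = min(max(int(nx * width), 0), width - 1)
    y = min(max(int(ny * height), 0), height - 1)
    return x, y


def select_mask(masks, nx, ny):
    """Return the index of the smallest mask containing the click, or None."""
    if not masks:
        return None
    height, width = len(masks[0]), len(masks[0][0])
    x, y = click_to_pixel(nx, ny, width, height)
    best_index = None
    best_area = None
    for index, mask in enumerate(masks):
        if not mask[y][x]:
            continue  # the click falls outside this mask
        area = sum(1 for row in mask for value in row if value)
        if best_area is None or area < best_area:
            best_index, best_area = index, area
    return best_index
```

Clamping matters at the frame edge: a click at exactly 1.0 would otherwise index one pixel past the last column or row.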
To change camera or model sizes: start in backend/src/constants/
To disable adaptation: edit backend/src/constants/reference_adaptation.yaml and the adaptive-reference wiring in backend/src/main.py
To replace the tracker: keep the selection and similarity pipeline, then swap backend/src/detections_tracking/tracker.py
To build a different frontend: keep the registered service names stable or update both backend and frontend together
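To make the effect of disabling adaptation concrete, here is a minimal sketch of one common form of online reference adaptation: an exponential moving average that is gated by match confidence. The update rule and the parameter names alpha and min_similarity are assumptions made for illustration. They are not the keys in reference_adaptation.yaml or the logic in adaptive_reference_vector_node.py.

```python
import math


def normalize(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def update_reference(ref, observed, similarity, alpha=0.1, min_similarity=0.6):
    """Blend the observed vector into the reference when the match is confident.

    alpha is the (hypothetical) adaptation rate; alpha == 0 disables adaptation.
    min_similarity is a (hypothetical) gate that skips updates on weak matches.
    """
    if alpha <= 0.0 or similarity < min_similarity:
        return list(ref)
    blended = [(1.0 - alpha) * r + alpha * o for r, o in zip(ref, observed)]
    return normalize(blended)
```

Under these assumptions, setting the rate to zero freezes the reference at the vector extracted from the initial click, which is the behavior one would expect from a disabled adaptive-reference stage.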
- This example is RVC4 standalone only.
- Initial startup may take longer because model assets must become available before the Video topic appears.
- The frontend intentionally waits for the Video topic before enabling stream interaction.
- Backend state is authoritative; the frontend restores state from BE State Service rather than owning it locally.
- Only one encoded Video topic is published; intermediate masks, heatmaps, and detections are not exposed as separate topics.
- The custom frontend assumes service names like Click Prompt Service, Clear Selection Service, Threshold Update Service, Outlines Trigger Service, Annotation Mode Service, and BE State Service.
- The example keeps its tuning in YAML files instead of burying everything in main.py, so constants belong under backend/src/constants/.
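Because the frontend depends on the exact service names listed above, a small drift check can catch a renamed service before it shows up as a silent UI failure. The sketch below uses the service names from this page. The helper itself and the way the list of registered names would be obtained are hypothetical and not part of the example.

```python
# Service names the custom frontend expects, as listed in this reference.
EXPECTED_SERVICES = frozenset(
    {
        "Click Prompt Service",
        "Clear Selection Service",
        "Threshold Update Service",
        "Outlines Trigger Service",
        "Annotation Mode Service",
        "BE State Service",
    }
)


def missing_services(registered):
    """Return the expected service names that are not registered, sorted by name."""
    return sorted(EXPECTED_SERVICES - set(registered))
```

An empty result means every expected name is present; any returned name points at a service that was renamed or never registered on the backend side.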
- apps/data-collection: use this when you need interactive prompting and backend state sync but not similarity tracking
- apps/people-demographics-and-sentiment-analysis: use this when you need another standalone frontend app with richer backend state
- neural-networks/object-tracking/deepsort-tracking: use this when you need a class-based tracking reference
- apps/focused-vision: use this when your main problem is preserving detail rather than selecting and tracking an object
Run: oakctl app run .
Success looks like: the frontend shows the video stream, clicks select a target, and switching between heatmap and bbox views changes the backend-rendered overlay
Common failure meaning: the Video topic never becomes available, model assets are unavailable, or service names drifted between backend and frontend