Developers in Thailand are building applications for a wide range of industries, including manufacturing, agriculture, healthcare, and security. A common challenge across these applications is the need for flexible and accurate object detection. Traditional object detection models require extensive retraining and large labeled datasets for each new object or scenario, which is time-consuming and costly.
Many Thai industries need to detect new or rare objects that may not be present in standard datasets. This limitation slows down innovation and makes it difficult to adapt AI solutions to rapidly changing environments or specific local needs.
DAMZ (Detect-anything-model with Zero-shot object detection) addresses this challenge by enabling detection of arbitrary objects without the need for retraining. Leveraging advanced vision-language models, DAMZ can understand textual queries and detect objects in images based on descriptions, even if those objects were never seen during training.
- Zero-shot detection: Detect objects using natural language queries without retraining.
- Flexible API: Easily integrate with existing applications via RESTful endpoints.
- Queue-based processing: Supports asynchronous task submission and scalable processing using RabbitMQ.
- GPU acceleration: Optimized for H100 GPUs for fast inference.
- Industry-ready: Designed for real-world deployment in Thai industrial environments.
- Detecting new machinery or equipment in factory images.
- Locating specific medical instruments in hospital scenes.
- Security applications for identifying suspicious objects.
- Submit an image and text query (e.g., "Find all forklifts in this warehouse photo") via the API.
- DAMZ processes the request using zero-shot object detection.
- Results are returned with bounding boxes and confidence scores for detected objects.
Health check: verify that the service is up and responding.
Detect objects in an image using zero-shot prompting.
Method: POST
Input:
- image (form-data): The image file to analyze (JPEG, PNG, etc.)
- text_queries (form-data or JSON): Comma-separated list or JSON array of text queries for object detection
- box_threshold (form-data or JSON): Confidence threshold for bounding boxes (default: 0.4)
- text_threshold (form-data or JSON): Confidence threshold for text matching (default: 0.4)
- return_visualization (form-data or JSON): Whether to return a visualization image (default: true)
- async_processing (form-data or JSON): Whether to process asynchronously via the queue (default: false)
- priority (form-data or JSON): Task priority (default: 5)
Example JSON:
{
"image_url": "https://images.unsplash.com/photo-1542909168-82c3e7fdca5c?fm=jpg&q=60&w=3000&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxzZWFyY2h8Mnx8aHVtYW4lMjBmYWNlfGVufDB8fDB8fHww",
"text_queries": ["a cat", "a remote control", "a person"],
"box_threshold": 0.4,
"text_threshold": 0.4,
"return_visualization": true,
"async_processing": false,
"priority": 5
}

Response:
- Bounding boxes, confidence scores, and (optionally) visualization image.
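As a sketch, the /detect/ endpoint can be called from Python like this. The host, port, and exact path are assumptions; the request fields mirror the JSON example above:

```python
# Minimal client sketch for the /detect/ endpoint.
# The base URL is an assumption; the field names follow the docs above.
import requests

API_URL = "http://localhost:8000/detect/"  # assumed host and port

payload = {
    "image_url": "https://example.com/warehouse.jpg",  # placeholder image
    "text_queries": ["a forklift", "a person"],
    "box_threshold": 0.4,
    "text_threshold": 0.4,
    "return_visualization": False,
    "async_processing": False,
    "priority": 5,
}

def detect(body=payload, url=API_URL):
    """POST the JSON body and return the parsed detection result."""
    resp = requests.post(url, json=body, timeout=60)
    resp.raise_for_status()
    return resp.json()  # bounding boxes and confidence scores
```

Calling `detect()` requires a running DAMZ server.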
Detect objects from an uploaded image file. Accepts the same parameters as /detect/ but uses file upload.
Method: POST
Input:
- image (form-data): The image file to analyze
- Other parameters: same as /detect/
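A hedged sketch of the same call using file upload. The endpoint path and the `build_form` helper are illustrative, not part of an official client; only the field names come from the parameter list above:

```python
# Multipart upload sketch for the file-upload variant of /detect/.
import requests

def build_form(text_queries, box_threshold=0.4, text_threshold=0.4):
    """Assemble the form fields; text_queries is sent comma-separated."""
    return {
        "text_queries": ",".join(text_queries),
        "box_threshold": box_threshold,
        "text_threshold": text_threshold,
    }

def detect_upload(image_path, text_queries,
                  api_url="http://localhost:8000/detect/upload/"):  # assumed URL
    with open(image_path, "rb") as f:
        resp = requests.post(api_url, files={"image": f},
                             data=build_form(text_queries), timeout=60)
    resp.raise_for_status()
    return resp.json()
```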
Detect objects in video using zero-shot prompting with contextual understanding.
Method: POST
Input:
- file (form-data): The video file to analyze (MP4, AVI, etc.)
- prompt (form-data): Text query for object detection (e.g., "a person")
- person_weight (form-data): Weight for person detection confidence (default: 0.3)
- action_weight (form-data): Weight for action recognition (default: 0.6)
- context_weight (form-data): Weight for contextual understanding (default: 0.1)
- similarity_threshold (form-data): Minimum similarity score for detection (default: 0.5)
- action_threshold (form-data): Minimum confidence for action detection (default: 0.4)
- return_timeline (form-data): Whether to return a frame-by-frame timeline (default: true)
Response:
- Frame-by-frame object detections with bounding boxes
- Confidence scores for each detection
- Optional timeline of detected objects throughout video
- Optional visualization video with annotations
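A client sketch for the video endpoint. The defaults come from the parameter list above; the URL and helper name are assumptions. Note that the three default weights sum to 1.0, so the combined score reads as a weighted average of the person, action, and context signals:

```python
# Video detection client sketch (endpoint path assumed).
import requests

DEFAULT_WEIGHTS = {
    "person_weight": 0.3,   # person detection confidence
    "action_weight": 0.6,   # action recognition
    "context_weight": 0.1,  # contextual understanding
}  # the three defaults sum to 1.0, i.e. a convex combination

def detect_in_video(video_path, prompt,
                    api_url="http://localhost:8000/detect/video/"):  # assumed URL
    data = dict(DEFAULT_WEIGHTS,
                prompt=prompt,
                similarity_threshold=0.5,
                action_threshold=0.4,
                return_timeline=True)
    with open(video_path, "rb") as f:
        resp = requests.post(api_url, files={"file": f}, data=data, timeout=300)
    resp.raise_for_status()
    return resp.json()  # per-frame detections, scores, optional timeline
```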
- Base-model accuracy: accuracy is bounded by the base model, since zero-shot detection involves no additional task-specific training.
- Sentence object detection in video: for single-word queries ("person", "cat"), conventional detectors such as YOLO already handle the task. We therefore implemented sentence object detection, which matches a full descriptive sentence against candidate bounding boxes so that the combined cues identify the specific object intended.
- English-Thai translation: supports prompting in Thai.
- Text cleansing (NLP): cleans Thai text before translation.
- Text summarization (NLP): checks and enforces the maximum word count.
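The Thai prompting chain above (cleansing, translation, length check) can be sketched as follows. The regex cleanser and the word cap are simplified stand-ins for the NLP cleansing and summarization components, and the translation step is stubbed out rather than implemented:

```python
# Simplified sketch of the Thai prompt pre-processing pipeline.
# clean_text and cap_words are stand-ins for the NLP models;
# the English-Thai translation step is not shown.
import re

MAX_WORDS = 20  # assumed prompt length limit

def clean_text(text: str) -> str:
    """Keep Thai characters, basic Latin, digits, and spaces."""
    return re.sub(r"[^\u0E00-\u0E7Fa-zA-Z0-9 ]", "", text).strip()

def cap_words(text: str, max_words: int = MAX_WORDS) -> str:
    """Summarization stand-in: truncate to the word limit."""
    return " ".join(text.split()[:max_words])

def preprocess_prompt(text: str) -> str:
    cleaned = cap_words(clean_text(text))
    # a translate_th_to_en(cleaned) call would run here in the real pipeline
    return cleaned
```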