Espressif Multimedia Capture (esp_capture) is a lightweight multimedia capture component developed by Espressif and built on the ESP-GMF architecture. It has a low memory footprint, high flexibility, and a modular design, and it integrates audio/video encoding, image rotation and scaling, echo cancellation, and text overlay. Typical applications include audio/video recording, input for large AI models, WebRTC, RTMP/RTSP streaming, local storage, and remote monitoring.
- 📦 Low memory overhead with modular pipeline structure
- 🎚️ Tight integration with ESP-GMF for advanced audio/video processing
- 🎥 Support for multiple input devices: V4L2, DVP cameras, audio codecs
- 🔁 Parallel streaming and storage options
- ⚙️ Automatic source-sink negotiation for simplified configuration
- ✨ Customizable processing pipelines for professional use cases
A capture system connects sources (input devices) to sinks (output targets) through an intermediate processing path.
graph LR
Capture_Source --> Capture_Path --> Capture_Sink
| Component | Description |
|---|---|
| Capture Source | Interfaces for physical input devices (camera, mic, etc.) |
| Capture Path | Processing pipeline (audio/video filters, encoders, overlays) |
| Capture Sink | Output targets (e.g., streaming, storage, muxers) |
To enable synchronized audio-video muxing, a dedicated sync module aligns timestamps across streams.
graph LR
capture_audio_src --> capture_audio_path --> capture_audio_sink
capture_audio_src --> capture_sync
capture_video_src --> capture_sync
capture_video_src --> capture_video_path --> capture_video_sink
capture_audio_sink --> capture_muxer
capture_video_sink --> capture_muxer
capture_muxer --> capture_muxer_sink
Audio sources are used to acquire audio data from audio input devices connected via various buses (like I2S, USB, etc.).
Interface: esp_capture_audio_src_if_t
Built-in sources:
- esp_capture_new_audio_dev_src: Codec-based audio capture
- esp_capture_new_audio_aec_src: Codec-based audio capture with Acoustic Echo Cancellation (AEC)
Video sources are used to capture video data from video input devices connected via various buses (like SPI, MIPI, USB, etc.).
Interface: esp_capture_video_src_if_t
Built-in sources:
- esp_capture_new_video_v4l2_src: V4L2 camera input (via esp_video)
- esp_capture_new_video_dvp_src: DVP camera input
Stream synchronization is achieved by the capture_sync module. capture_sync aligns audio and video frame timestamps for synchronized playback or muxing. It is automatically configured through esp_capture_open.
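To make the idea of timestamp alignment concrete, here is a minimal sketch of how audio and video presentation timestamps can be derived on a common millisecond timeline and compared before muxing. The helper names (`audio_pts_ms`, `video_pts_ms`, `mux_audio_first`) are hypothetical and are not part of the esp_capture API; the actual capture_sync implementation may work differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of esp_capture. */

/* Timestamp (ms) of an audio frame that starts after `samples`
 * per-channel samples have already been captured. */
uint32_t audio_pts_ms(uint64_t samples, uint32_t sample_rate)
{
    assert(sample_rate > 0);
    return (uint32_t)(samples * 1000u / sample_rate);
}

/* Timestamp (ms) of the video frame with the given zero-based index,
 * assuming a constant frame rate. */
uint32_t video_pts_ms(uint64_t frame_index, uint32_t fps)
{
    assert(fps > 0);
    return (uint32_t)(frame_index * 1000u / fps);
}

/* True when the pending audio frame should be written to the muxer
 * before the pending video frame (earlier timestamp goes first). */
bool mux_audio_first(uint32_t audio_pts, uint32_t video_pts)
{
    return audio_pts <= video_pts;
}
```

With both streams expressed on the same timeline, a muxer can interleave frames in timestamp order regardless of which source delivered its data first.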
Interface: esp_capture_path_mngr_if_t
Built-in:
- esp_capture_new_gmf_audio_mngr: Creates an audio processing path using ESP-GMF with elements like:
  - aud_rate_cvt – Sample rate conversion
  - aud_ch_cvt – Channel conversion (mono ↔ stereo)
  - aud_bit_cvt – Bit depth conversion
  - aud_enc – Audio encoder
Pipeline Builders (esp_capture_pipeline_builder_if_t):
- esp_capture_create_auto_audio_pipeline: Auto-generated audio pipeline based on negotiation
- esp_capture_create_audio_pipeline: Prebuilt audio template pipeline
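As a rough illustration of what a channel-conversion stage such as aud_ch_cvt does conceptually, the sketch below converts 16-bit PCM between mono and stereo. The function names are hypothetical and this is not the ESP-GMF implementation; it assumes interleaved signed 16-bit samples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers; not the ESP-GMF aud_ch_cvt implementation. */

/* Mono -> stereo: duplicate every sample into left and right.
 * `dst` must have room for 2 * n samples. Returns samples written. */
size_t mono_to_stereo_s16(const int16_t *src, size_t n, int16_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[2 * i]     = src[i];
        dst[2 * i + 1] = src[i];
    }
    return 2 * n;
}

/* Stereo -> mono: average left and right of each interleaved frame.
 * `dst` must have room for n_frames samples. Returns samples written. */
size_t stereo_to_mono_s16(const int16_t *src, size_t n_frames, int16_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n_frames; i++) {
        int32_t sum = (int32_t)src[2 * i] + (int32_t)src[2 * i + 1];
        dst[i] = (int16_t)(sum / 2);
    }
    return n_frames;
}
```

The sum is widened to 32 bits before halving so that two large samples cannot overflow the 16-bit range.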
Built-in:
- esp_capture_new_gmf_video_mngr: Creates a video processing path using ESP-GMF with elements like:
  - vid_ppa – Resize, crop, color conversion
  - vid_overlay – Text/graphic overlays
  - vid_fps_cvt – Framerate conversion
  - vid_enc – Video encoder
Pipeline Builders:
- esp_capture_create_auto_video_pipeline: Auto-generated video pipeline based on negotiation
- esp_capture_create_video_pipeline: Prebuilt video template pipeline
Mux audio/video into containers for storage or streaming:
- MP4: File-based only
- TS: Supports streaming and file-based
The module provides flexible data flow control options for muxers:
- Muxer-only mode: All data is consumed by the muxer, preventing access to raw audio/video streams
- Streaming while storage: Simultaneous storage and streaming when supported by the muxer
- Unified API: Use esp_capture_sink_acquire_frame for both muxer output and direct stream access
Overlays blend text or images into the original video frames. Typical use cases include adding real-time timestamps or statistics to video frames.
Interface: esp_capture_overlay_if_t
- Built-in: esp_capture_new_text_overlay
- Automatically handled if overlay is present in the video path
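Conceptually, applying an overlay means blending a small image region (for example rendered text) into each frame at a given position. The following sketch does this for RGB565 pixels with a global alpha value and clipping at the frame edges. The function names and the choice of RGB565 are assumptions made for illustration; they do not describe the internals of esp_capture_overlay_if_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers; not part of the esp_capture overlay interface. */

/* Blend one RGB565 pixel: alpha 255 yields src, alpha 0 yields dst. */
uint16_t blend_rgb565(uint16_t dst, uint16_t src, uint8_t alpha)
{
    uint32_t inv = 255u - alpha;
    uint32_t r = (((src >> 11) & 0x1Fu) * alpha + ((dst >> 11) & 0x1Fu) * inv) / 255u;
    uint32_t g = (((src >> 5) & 0x3Fu) * alpha + ((dst >> 5) & 0x3Fu) * inv) / 255u;
    uint32_t b = ((src & 0x1Fu) * alpha + (dst & 0x1Fu) * inv) / 255u;
    return (uint16_t)((r << 11) | (g << 5) | b);
}

/* Blend an ow x oh overlay into an fw x fh frame with its top-left
 * corner at (x, y). Pixels falling outside the frame are skipped. */
void overlay_blit(uint16_t *frame, int fw, int fh,
                  const uint16_t *ovl, int ow, int oh,
                  int x, int y, uint8_t alpha)
{
    assert(frame != NULL && ovl != NULL);
    for (int oy = 0; oy < oh; oy++) {
        int fy = y + oy;
        if (fy < 0 || fy >= fh) {
            continue;
        }
        for (int ox = 0; ox < ow; ox++) {
            int fx = x + ox;
            if (fx < 0 || fx >= fw) {
                continue;
            }
            uint16_t *px = &frame[fy * fw + fx];
            *px = blend_rgb565(*px, ovl[oy * ow + ox], alpha);
        }
    }
}
```

A timestamp overlay would re-render its small text region whenever the displayed value changes and then blend that region into every frame before encoding.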
Auto capture simplifies configuration by automatically connecting sources, paths, and sinks. The typical call sequence is shown below, using audio capture as an example:
sequenceDiagram
participant App as Application
participant AudioSrc as Audio Source
participant Capture as ESP Capture
participant Sink as Capture Sink
App->>AudioSrc: esp_capture_new_audio_dev_src(...)
AudioSrc-->>App: audio_src handle
App->>Capture: esp_capture_open(&cfg, &capture)
Note over App,Capture: cfg.audio_src = audio_src
App->>Capture: esp_capture_sink_setup(capture, 0, &sink_cfg, &sink)
App->>Sink: esp_capture_sink_enable(sink, ESP_CAPTURE_RUN_MODE_ALWAYS)
App->>Capture: esp_capture_start(capture)
loop Frame Processing
App->>Sink: esp_capture_sink_acquire_frame(sink, &frame, false)
App->>Sink: esp_capture_sink_release_frame(sink, &frame)
end
App->>Capture: esp_capture_stop(capture)
For detailed examples, see audio_capture and video_capture.
- Register Custom Elements
    esp_capture_register_element(capture, ESP_CAPTURE_STREAM_TYPE_AUDIO, proc_element);
- Customize Pipeline Before Start
    const char *elems[] = { "aud_ch_cvt", "aud_rate_cvt", "aud_enc" };
    esp_capture_sink_build_pipeline(sink, ESP_CAPTURE_STREAM_TYPE_AUDIO, elems, 3);

Audio auto-negotiation:
- Automatically inserts elements like aud_rate_cvt, aud_ch_cvt on demand
- Negotiates format based on encoder requirements
- Elements are configured based on negotiation results
Built-in:
- esp_capture_audio_pipeline_auto_negotiate – Auto-negotiates from an audio source to multiple audio sinks
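The on-demand insertion of converters can be pictured as a comparison between the source format and the format the encoder expects. The sketch below is a simplified model of that decision under assumed types (`audio_fmt_t`, `plan_audio_elements` are hypothetical); it reuses the element names listed in this document but does not reproduce the logic of esp_capture_audio_pipeline_auto_negotiate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical format description; not an esp_capture type. */
typedef struct {
    uint32_t sample_rate;  /* Hz                 */
    uint8_t  channels;     /* 1 = mono, 2 = stereo */
    uint8_t  bits;         /* bits per sample    */
} audio_fmt_t;

#define MAX_AUDIO_ELEMS 4

/* Fill `elems` with the element names needed to get from the source
 * format to the encoder input format. A converter is added only when
 * the corresponding property differs; the encoder is always last.
 * Returns the number of entries written (1..MAX_AUDIO_ELEMS). */
size_t plan_audio_elements(audio_fmt_t src, audio_fmt_t enc_in,
                           const char *elems[MAX_AUDIO_ELEMS])
{
    assert(elems != NULL);
    size_t n = 0;
    if (src.channels != enc_in.channels) {
        elems[n++] = "aud_ch_cvt";
    }
    if (src.sample_rate != enc_in.sample_rate) {
        elems[n++] = "aud_rate_cvt";
    }
    if (src.bits != enc_in.bits) {
        elems[n++] = "aud_bit_cvt";
    }
    elems[n++] = "aud_enc";
    return n;
}
```

For a 48 kHz stereo source feeding an encoder that expects 16 kHz mono, this model yields the same three-element list used in the manual pipeline example above; when the formats already match, only the encoder remains.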
Video auto-negotiation:
- Automatically inserts vid_ppa, vid_fps_cvt on demand
- Prioritizes high-quality format
- Negotiates source format based on encoder capabilities
Built-in:
- esp_capture_video_pipeline_auto_negotiate – Auto-negotiates from a video source to multiple video sinks
In some cases, the source format chosen by auto-negotiation may not meet requirements.
Both audio and video sources support set_fixed_caps, which fixes the source format settings and avoids negotiation failures.
In complex pipelines, auto-negotiation may fail (e.g., when one pipeline contains redundant sample rate converters). In such cases, manual configuration is recommended.
Unused elements are excluded unless registered.
Enable features only when needed:
- CONFIG_ESP_CAPTURE_ENABLE_AUDIO: Enable audio support
- CONFIG_ESP_CAPTURE_ENABLE_VIDEO: Enable video support
- mp4_muxer_register() / ts_muxer_register() – on-demand muxers
- esp_audio_enc_register_default() / esp_video_enc_register_default() – customize encoder usage via menuconfig
You can extend esp_capture by:
- Adding a custom capture source
- Implementing a new muxer using esp_muxer
- Creating new encoders via esp_audio_codec / esp_video_codec