Commit cebfb88

Merge branch 'feat/dl_audio' into 'master'
Feat/dl audio

See merge request ai/esp-dl!200
2 parents 431b274 + 170b818 commit cebfb88

37 files changed: +2338 -6 lines changed

.gitlab/ci/build.yml

Lines changed: 12 additions & 1 deletion
@@ -280,4 +280,15 @@ build_dl_fft:
280280
- IMAGE: [espressif/idf:release-v5.3, espressif/idf:release-v5.5]
281281
TARGET: [esp32p4, esp32s3, esp32c3, esp32]
282282
variables:
283-
EXAMPLE_DIR: test_apps/dl_fft
283+
EXAMPLE_DIR: test_apps/dl_fft
284+
285+
build_dl_audio:
286+
extends:
287+
- .build_examples_template
288+
- .rules:build:test_dl_audio
289+
parallel:
290+
matrix:
291+
- IMAGE: [espressif/idf:release-v5.3, espressif/idf:release-v5.5]
292+
TARGET: [esp32p4, esp32s3, esp32]
293+
variables:
294+
EXAMPLE_DIR: test_apps/dl_audio

.gitlab/ci/rules.yml

Lines changed: 13 additions & 0 deletions
@@ -101,6 +101,8 @@
101101
.patterns-test_dl_fft: &patterns-test_dl_fft
102102
- "tools/dl_fft/**/*"
103103

104+
.patterns-test_dl_audio: &patterns-test_dl_audio
105+
- "esp-dl/dl_audio/**/*"
104106

105107
##############
106108
# if anchors #
@@ -302,6 +304,17 @@
302304
- <<: *if-dev-push
303305
changes: *patterns-gitlab-ci
304306

307+
.rules:build:test_dl_audio:
308+
rules:
309+
- <<: *if-protected
310+
- <<: *if-label-build
311+
- <<: *if-dev-push
312+
changes: *patterns-test_dl_fft
313+
- <<: *if-dev-push
314+
changes: *patterns-test_dl_audio
315+
- <<: *if-dev-push
316+
changes: *patterns-gitlab-ci
317+
305318
.rules:pre_check:readme:
306319
rules:
307320
- <<: *if-protected

.gitlab/ci/target_test.yml

Lines changed: 20 additions & 0 deletions
@@ -151,3 +151,23 @@ test_dl_image:
151151
TEST_FOLDER: 'test_apps/dl_image'
152152
TEST_TARGET: ${IDF_TARGET}
153153
TEST_ENV: ${IDF_TARGET}
154+
155+
test_dl_audio:
156+
extends:
157+
- .pytest_api_template
158+
- .rules:build:test_dl_audio
159+
needs:
160+
- job: "build_dl_audio"
161+
artifacts: true
162+
optional: true
163+
parallel:
164+
matrix:
165+
- IDF_TARGET: [esp32p4, esp32s3, esp32]
166+
IDF_VERSION: "5.3"
167+
tags:
168+
- ${IDF_TARGET}
169+
image: $DOCKER_TARGET_TEST_v5_3_ENV_IMAGE
170+
variables:
171+
TEST_FOLDER: 'test_apps/dl_audio'
172+
TEST_TARGET: ${IDF_TARGET}
173+
TEST_ENV: ${IDF_TARGET}

esp-dl/CMakeLists.txt

Lines changed: 6 additions & 1 deletion
@@ -11,6 +11,8 @@ set(src_dirs ./dl/tool/src
1111
./vision/image
1212
./vision/recognition
1313
./vision/classification
14+
./audio/common
15+
./audio/speech_features
1416
)
1517

1618
set(include_dirs ./dl
@@ -26,6 +28,8 @@ set(include_dirs ./dl
2628
./vision/image
2729
./vision/recognition
2830
./vision/classification
31+
./audio/common
32+
./audio/speech_features
2933
)
3034

3135
if(CONFIG_IDF_TARGET_ESP32)
@@ -47,7 +51,8 @@ elseif(CONFIG_IDF_TARGET_ESP32P4)
4751
list(APPEND src_dirs dl/base/isa/esp32p4)
4852
endif()
4953

50-
set(requires esp_mm
54+
set(requires dl_fft
55+
esp_mm
5156
esp_new_jpeg
5257
esp_driver_jpeg
5358
esp_driver_ppa

esp-dl/README.md

Lines changed: 65 additions & 1 deletion
@@ -1,6 +1,6 @@
11
# ESP-DL
22

3-
ESP-DL is designed to maintain optimal performance while significantly reducing the workload in model deployment. Our project has achieved the following key features:
3+
ESP-DL is a lightweight and efficient neural network inference framework specifically designed for ESP series chips (ESP32, ESP32-S3, ESP32-P4). It is built to maintain optimal performance while significantly reducing the workload in model deployment. Our project has achieved the following key features:
44

55
### ESP-DL Standard Model Format
66

@@ -29,4 +29,68 @@ The automatic dual-core scheduling enables computationally intensive operators t
2929

3030
---
3131

32+
## Project Structure
33+
34+
The ESP-DL project is organized to provide a clear separation of concerns for different functionalities. Here's a breakdown of the main directories and their purposes to help beginners get started quickly:
35+
36+
```
37+
esp-dl/
38+
├── dl/ # Core deep learning library
39+
│ ├── base/ # Fundamental data types and utilities
40+
│ ├── tensor/ # TensorBase class for data handling
41+
│ ├── model/ # Model class for loading, building, and running neural networks
42+
│ ├── module/ # Base Module class for operators/layers
43+
│ ├── math/ # Mathematical functions and operations
44+
│ ├── tool/ # Utility tools for the framework
45+
│ ├── dl_define.hpp # Global definitions, quantization and activation types
46+
│ └── dl_define_private.hpp # Private definitions
47+
├── fbs_loader/ # FlatBuffers model loading functionality
48+
│ ├── include/ # Header files for the loader
49+
│ ├── src/ # Source files for the loader
50+
│ ├── lib/ # Pre-compiled FlatBuffers model library
51+
│ └── pack_espdl_models.py # Script to pack multiple models
52+
├── audio/ # Audio processing module
53+
│ ├── common/ # Common audio processing utilities (WAV decoding, etc.)
54+
│ ├── speech_features/ # Speech feature extraction (Fbank, MFCC, Spectrogram)
55+
│ └── README.md # Detailed documentation for audio processing
56+
├── vision/ # Vision processing module
57+
│ ├── image/ # Image processing utilities (JPEG, BMP, drawing, preprocessing)
58+
│ ├── detect/ # Object detection post-processors (YOLO, etc.)
59+
│ ├── classification/ # Image classification post-processors (ImageNet, etc.)
60+
│ └── recognition/ # Face recognition components
61+
├── CMakeLists.txt # CMake build configuration for the ESP-IDF component
62+
├── idf_component.yml # ESP-IDF component manifest
63+
├── LICENSE # Project license information
64+
└── README.md # This file
65+
```
66+
67+
### Core Components (`dl/`)
68+
69+
This is the heart of the ESP-DL framework. It contains the fundamental classes and functions required for neural network inference.
70+
71+
- `base/`: Contains basic utilities and low-level operations.
72+
- `tensor/`: Defines the `TensorBase` class, which is used throughout the framework to represent data.
73+
- `model/`: Contains the `Model` class, which handles loading `.espdl` files, building an execution plan, and running inference.
74+
- `module/`: Defines the `Module` base class, from which all neural network operators (like Conv2D, Pool2D) are derived.
75+
- `math/`: Provides optimized mathematical functions used by operators.
76+
- `tool/`: Offers various utility functions for the framework.
77+
- `dl_define.hpp`: Central place for global definitions like quantization and activation types.
78+
79+
### FlatBuffers Loader (`fbs_loader/`)
80+
81+
This component is responsible for loading models stored in the `.espdl` format, which is based on FlatBuffers.
82+
83+
### Audio Processing (`audio/`)
84+
85+
This module provides functionalities for audio signal processing, particularly focused on speech feature extraction. It includes utilities for WAV decoding and extracting features like Fbank, MFCC, and Spectrogram, optimized for ESP platforms.
86+
87+
### Vision Processing (`vision/`)
88+
89+
This module provides functionalities for computer vision tasks.
90+
91+
- `image/`: Utilities for image loading (JPEG, BMP), preprocessing, color space conversion, and drawing.
92+
- `detect/`: Post-processors for object detection models (e.g., YOLO variants).
93+
- `classification/`: Post-processors for image classification models (e.g., ImageNet classifiers).
94+
- `recognition/`: Components for face recognition tasks.
95+
3296
Explore ESP-DL to streamline your AI model deployment and achieve optimal performance with minimal resource usage.

esp-dl/audio/README.md

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
1+
# ESP-DL Audio Processing Module
2+
3+
The ESP-DL Audio Processing Module is a C++ library designed for audio signal processing, particularly focused on speech feature extraction. It provides implementations of common audio processing algorithms optimized for ESP platforms.
4+
5+
## Features
6+
7+
- WAV file decoding
8+
- Speech feature extraction, including:
9+
- Fbank (Filter Bank) features
10+
- MFCC (Mel-Frequency Cepstral Coefficients)
11+
- Spectrogram
12+
- Support for various window functions (Hanning, Hamming, Povey, etc.)
13+
- Configurable parameters for feature extraction
14+
- Optimized for ESP platforms with memory allocation capabilities
15+
- Aligned with Kaldi's implementation [(torchaudio.compliance.kaldi)](https://docs.pytorch.org/audio/stable/compliance.kaldi.html).
16+
17+
## Directory Structure
18+
19+
```
20+
audio/
21+
├── common/ # Common audio processing utilities
22+
│ ├── dl_audio_common.cpp/hpp # Common audio functions and definitions
23+
│ └── dl_audio_wav.cpp/hpp # WAV file decoding utilities
24+
└── speech_features/ # Speech feature extraction algorithms
25+
├── dl_fbank.cpp/hpp # Fbank (Filter Bank) feature extraction
26+
├── dl_mfcc.cpp/hpp # MFCC (Mel-Frequency Cepstral Coefficients)
27+
├── dl_spectrogram.cpp/hpp # Spectrogram feature extraction
28+
└── dl_speech_features.cpp/hpp # Base class for speech features
29+
```
30+
31+
## Common Audio Utilities
32+
33+
### WAV Decoding
34+
The module provides functionality to decode WAV audio files into raw PCM data.
35+
36+
```cpp
37+
#include "dl_audio_wav.hpp"
38+
39+
dl::audio::dl_audio_t *audio = dl::audio::decode_wav(wav_data, data_len);
40+
```
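
For readers unfamiliar with the container format, the sketch below shows roughly what decoding a WAV buffer involves. It is not the `decode_wav` implementation from this commit: `WavInfo` and `parse_wav_header` are hypothetical names, and the code assumes the canonical 44-byte PCM header with no extra chunks before `data`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical result type for illustration; not part of ESP-DL.
struct WavInfo {
    uint16_t channels;
    uint32_t sample_rate;
    uint16_t bits_per_sample;
    const uint8_t *pcm;  // points into the caller's buffer, not a copy
    uint32_t pcm_bytes;
};

// WAV fields are stored little-endian.
static uint16_t read_u16_le(const uint8_t *p)
{
    return static_cast<uint16_t>(p[0] | (p[1] << 8));
}

static uint32_t read_u32_le(const uint8_t *p)
{
    return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
        (static_cast<uint32_t>(p[2]) << 16) | (static_cast<uint32_t>(p[3]) << 24);
}

// Returns true if `data` starts with a canonical 44-byte PCM WAV header.
// Assumption: "fmt " sits at offset 12 and "data" at offset 36.
bool parse_wav_header(const uint8_t *data, size_t len, WavInfo *out)
{
    assert(data != nullptr && out != nullptr);
    if (len < 44) {
        return false;
    }
    if (std::memcmp(data, "RIFF", 4) != 0 || std::memcmp(data + 8, "WAVE", 4) != 0) {
        return false;
    }
    if (std::memcmp(data + 12, "fmt ", 4) != 0 || std::memcmp(data + 36, "data", 4) != 0) {
        return false;
    }
    if (read_u16_le(data + 20) != 1) {  // audio format 1 = uncompressed PCM
        return false;
    }
    out->channels = read_u16_le(data + 22);
    out->sample_rate = read_u32_le(data + 24);
    out->bits_per_sample = read_u16_le(data + 34);
    out->pcm_bytes = read_u32_le(data + 40);
    if (out->pcm_bytes > len - 44) {  // declared payload larger than the buffer
        return false;
    }
    out->pcm = data + 44;
    return true;
}
```

Real files may carry additional chunks (for example `LIST`) before `data`, which a production decoder has to skip; this sketch rejects such files.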
41+
42+
### Audio Common Functions
43+
Provides common audio processing functions such as:
44+
- Window function generation (Hanning, Hamming, Blackman, etc.)
45+
- Mel filterbank initialization
46+
- Pre-emphasis filtering
47+
- FFT-related operations
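
As a rough illustration of the first and third items above, the following standalone sketch uses the textbook window formulas and a first-order pre-emphasis filter. The function names are hypothetical and are not the `dl_audio_common` API; the Povey exponent (0.85) and the handling of the first sample follow Kaldi's conventions as commonly documented, which may differ in detail from this module.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

static const double kTwoPi = 6.283185307179586;

// Symmetric Hann window: w[i] = 0.5 - 0.5 * cos(2*pi*i / (n - 1)).
std::vector<float> hann_window(size_t n)
{
    assert(n > 1);
    std::vector<float> w(n);
    for (size_t i = 0; i < n; ++i) {
        w[i] = static_cast<float>(0.5 - 0.5 * std::cos(kTwoPi * i / (n - 1)));
    }
    return w;
}

// Symmetric Hamming window: w[i] = 0.54 - 0.46 * cos(2*pi*i / (n - 1)).
std::vector<float> hamming_window(size_t n)
{
    assert(n > 1);
    std::vector<float> w(n);
    for (size_t i = 0; i < n; ++i) {
        w[i] = static_cast<float>(0.54 - 0.46 * std::cos(kTwoPi * i / (n - 1)));
    }
    return w;
}

// Povey window: a Hann window raised to the power 0.85.
std::vector<float> povey_window(size_t n)
{
    std::vector<float> w = hann_window(n);
    for (float &v : w) {
        v = std::pow(v, 0.85f);
    }
    return w;
}

// In-place pre-emphasis: y[i] = x[i] - coeff * x[i - 1].
// The first sample uses itself as its predecessor.
void pre_emphasis(std::vector<float> &x, float coeff)
{
    if (x.empty()) {
        return;
    }
    // Walk backwards so x[i - 1] is still the unfiltered value.
    for (size_t i = x.size() - 1; i > 0; --i) {
        x[i] -= coeff * x[i - 1];
    }
    x[0] -= coeff * x[0];
}
```

A typical coefficient for speech is 0.97; the value actually used by this module is not stated in the README.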
48+
49+
## Speech Feature Extraction
50+
51+
All speech feature extraction classes inherit from the `SpeechFeatureBase` class, which provides a common interface.
52+
53+
### Configuration
54+
Speech feature extraction can be configured using the `SpeechFeatureConfig` structure:
55+
56+
```cpp
57+
dl::audio::SpeechFeatureConfig config;
58+
config.sample_rate = 16000;
59+
config.frame_length = 25; // ms
60+
config.frame_shift = 10; // ms
61+
config.num_mel_bins = 26;
62+
config.window_type = dl::audio::WinType::HANNING;
63+
```
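
To make the millisecond settings concrete, the sketch below converts them to sample counts and estimates how many frames a clip produces. The helper names are hypothetical, and the frame-count rule assumes only whole windows are kept (Kaldi's `snip_edges=true` behaviour); whether `get_output_shape` follows the same rule is not confirmed by this README.

```cpp
#include <cassert>
#include <cstdint>

// Convert a duration in milliseconds to a sample count.
int32_t ms_to_samples(int32_t sample_rate, int32_t ms)
{
    assert(sample_rate > 0 && ms >= 0);
    return static_cast<int32_t>(static_cast<int64_t>(sample_rate) * ms / 1000);
}

// Number of complete frames that fit in `num_samples`
// (assumption: partial trailing frames are dropped).
int32_t num_frames(int32_t num_samples, int32_t frame_len, int32_t frame_shift)
{
    assert(frame_len > 0 && frame_shift > 0);
    if (num_samples < frame_len) {
        return 0;
    }
    return 1 + (num_samples - frame_len) / frame_shift;
}
```

With the configuration above, a 25 ms frame is 400 samples and a 10 ms shift is 160 samples, so one second of 16 kHz audio gives `num_frames(16000, 400, 160) == 98` under these assumptions.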
64+
65+
### Fbank (Filter Bank)
66+
Extracts filter bank features from audio signals.
67+
68+
```cpp
69+
#include "dl_fbank.hpp"
70+
71+
dl::audio::Fbank fbank(config);
72+
// Process audio data
73+
std::vector<int> shape = fbank.get_output_shape(audio_length);
74+
float *output_features = (float*) malloc(shape[0] * shape[1] * sizeof(float));
75+
fbank.process(audio_data, audio_length, output_features);
76+
```
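
The filter bank behind Fbank features is spaced on the mel scale. The sketch below shows the Kaldi-style conversion, mel = 1127 ln(1 + f / 700), and how filter centers can be placed evenly in mel. The function names are hypothetical, and it is an assumption that this module uses exactly this formula and spacing.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Kaldi-style mel scale conversion.
float hz_to_mel(float hz)
{
    return 1127.0f * std::log(1.0f + hz / 700.0f);
}

float mel_to_hz(float mel)
{
    return 700.0f * (std::exp(mel / 1127.0f) - 1.0f);
}

// Center frequencies (Hz) of `num_bins` triangular filters spaced evenly
// on the mel scale between low_hz and high_hz (both exclusive).
std::vector<float> mel_center_frequencies(int num_bins, float low_hz, float high_hz)
{
    assert(num_bins > 0 && low_hz >= 0.0f && high_hz > low_hz);
    const float mel_low = hz_to_mel(low_hz);
    const float mel_high = hz_to_mel(high_hz);
    const float step = (mel_high - mel_low) / (num_bins + 1);
    std::vector<float> centers(num_bins);
    for (int i = 0; i < num_bins; ++i) {
        centers[i] = mel_to_hz(mel_low + (i + 1) * step);
    }
    return centers;
}
```

For example, `mel_center_frequencies(26, 20.0f, 8000.0f)` returns 26 increasing frequencies that are dense at the low end and sparse at the high end, which is the point of the mel scale.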
77+
78+
### MFCC (Mel-Frequency Cepstral Coefficients)
79+
Extracts MFCC features, which are commonly used in speech recognition.
80+
81+
```cpp
82+
#include "dl_mfcc.hpp"
83+
84+
dl::audio::MFCC mfcc(config);
85+
// Process audio data
86+
std::vector<int> shape = mfcc.get_output_shape(audio_length);
87+
float *output_features = (float*) malloc(shape[0] * shape[1] * sizeof(float));
88+
mfcc.process(audio_data, audio_length, output_features);
89+
```
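
MFCCs are conventionally obtained by applying a discrete cosine transform to the log mel energies of each frame. The sketch below is a plain, unnormalized DCT-II, written for clarity rather than speed; `dct2` is a hypothetical name, and the scaling and cepstral liftering that a Kaldi-compatible implementation applies are omitted here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unnormalized DCT-II: c[k] = sum_i x[i] * cos(pi * k * (i + 0.5) / n).
// Only the first `num_coeffs` coefficients are computed.
std::vector<float> dct2(const std::vector<float> &x, size_t num_coeffs)
{
    assert(!x.empty() && num_coeffs <= x.size());
    const double pi = 3.141592653589793;
    const size_t n = x.size();
    std::vector<float> c(num_coeffs, 0.0f);
    for (size_t k = 0; k < num_coeffs; ++k) {
        double acc = 0.0;
        for (size_t i = 0; i < n; ++i) {
            acc += x[i] * std::cos(pi * k * (i + 0.5) / n);
        }
        c[k] = static_cast<float>(acc);
    }
    return c;
}
```

A constant input has all its energy in coefficient 0, so `dct2({1, 1, 1, 1}, 3)` yields approximately `{4, 0, 0}`.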
90+
91+
### Spectrogram
92+
Computes spectrogram features aligned with torchaudio.compliance.kaldi.spectrogram.
93+
94+
```cpp
95+
#include "dl_spectrogram.hpp"
96+
97+
dl::audio::Spectrogram spectrogram(config);
98+
// Process audio data
99+
std::vector<int> shape = spectrogram.get_output_shape(audio_length);
100+
float *output_features = (float*) malloc(shape[0] * shape[1] * sizeof(float));
101+
spectrogram.process(audio_data, audio_length, output_features);
102+
```
103+
104+
105+
## License
106+
107+
MIT License
