kokoro TTS Supported axera backend#3281
Conversation
📝 WalkthroughWalkthroughAdds AXERA and AXCL Kokoro offline TTS backends: new model implementations and PIMPL APIs, AXCL model API overload, build/CMake entries, frontend/normalizer integration, and factory wiring to instantiate the new backends when configured. Changes
Sequence Diagram(s)sequenceDiagram
rect rgba(200,220,255,0.5)
participant App as Application
participant Factory as OfflineTtsImpl::Create
participant TtsImpl as OfflineTtsKokoroAxcl/AxeraImpl
participant Normalizer as TextNormalizer Chain
end
rect rgba(200,255,200,0.5)
participant Frontend as OfflineTtsFrontend
participant Model as OfflineTtsKokoroModel
participant Runtime as AXCL/AXERA/ORT Runtime
end
App->>Factory: Create(config)
Factory->>TtsImpl: instantiate KokoroAxcl/AxeraImpl
App->>TtsImpl: Generate(text, sid, speed, callback)
TtsImpl->>Normalizer: apply normalization chain
Normalizer-->>TtsImpl: normalized_text
TtsImpl->>Frontend: convert text -> token_ids
Frontend-->>TtsImpl: token_ids
loop per batch
TtsImpl->>Model: Process(tokens, sid, speed)
Model->>Runtime: SetInputTensorData (uint8_t/float/int)
Model->>Runtime: Run inference (multi-stage)
Runtime-->>Model: output tensors
Model->>TtsImpl: audio_chunk
TtsImpl->>App: callback(audio_chunk)
end
TtsImpl-->>App: GeneratedAudio (concatenated)
sequenceDiagram
participant Model as OfflineTtsKokoroModel
participant Ax1 as Axera/AXCL Model1
participant Ax2 as Axera/AXCL Model2
participant Ax3 as Axera/AXCL Model3
participant Ort as ONNX Runtime (Model4)
participant Audio as ISTFT Synth
Model->>Ax1: Run(text inputs)
Ax1-->>Model: duration, alignment
Model->>Ax2: Run(features, alignment targets)
Ax2-->>Model: F0_pred, N_pred, asr
Model->>Ort: Run(F0_pred / features)
Ort-->>Model: har
Model->>Ax3: Run(asr, F0_pred, har, ...)
Ax3-->>Model: spectral frames x
Model->>Audio: PostprocessXToAudio(x)
Audio-->>Model: time_domain_audio
Estimated code review effort🎯 4 (Complex) | ⏱️ ~65 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the hardware compatibility of the Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request adds support for axera and axcl backends for Kokoro TTS models. However, several security issues related to input validation and memory safety were identified. Specifically, the handling of the Speaker ID (sid) parameter is inconsistent and lacks proper bounds checking, which could lead to buffer over-reads or under-reads (out-of-bounds memory access). This is particularly evident when dealing with single-speaker models where the sid is ignored but still used. Additionally, there is significant code duplication between the axera and axcl implementations, particularly in files like sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc and sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc, and their corresponding header files. Further improvements are needed regarding magic numbers, thread safety, and overall code structure.
| std::vector<float> LoadVoiceEmbedding(int32_t sid, int32_t phoneme_len) { | ||
| int32_t style_dim0 = 510; // max_token_len | ||
| int32_t style_dim1 = 256; | ||
| phoneme_len = std::max(phoneme_len, 0); | ||
|
|
||
| std::vector<float> ref_s(style_dim1); | ||
| if (phoneme_len < style_dim0) { | ||
| const float *p = styles_.data() + sid * style_dim0 * style_dim1 + | ||
| phoneme_len * style_dim1; | ||
| std::copy(p, p + style_dim1, ref_s.begin()); | ||
| } else { | ||
| int32_t idx = style_dim0 / 2; | ||
| const float *p = | ||
| styles_.data() + sid * style_dim0 * style_dim1 + idx * style_dim1; | ||
| std::copy(p, p + style_dim1, ref_s.begin()); | ||
| } | ||
| return ref_s; |
There was a problem hiding this comment.
The LoadVoiceEmbedding function uses the sid parameter to calculate an offset into the styles_ buffer without any bounds checking. If sid is negative or large enough to cause an overflow or exceed the buffer size, it results in a buffer under-read or over-read. This is particularly dangerous if num_speakers is 0, as styles_ might be empty or smaller than expected.
Recommendation: Implement strict bounds checking for sid against meta_data_.num_speakers before using it to calculate memory offsets.
std::vector<float> LoadVoiceEmbedding(int32_t sid, int32_t phoneme_len) {
int32_t style_dim0 = 510; // max_token_len
int32_t style_dim1 = 256;
phoneme_len = std::max(phoneme_len, 0);
sid = std::max(sid, 0);
if (meta_data_.num_speakers > 0) {
sid = std::min(sid, meta_data_.num_speakers - 1);
} else {
sid = 0;
}
std::vector<float> ref_s(style_dim1);
if (styles_.empty()) {
return ref_s;
}
if (phoneme_len < style_dim0) {
const float *p = styles_.data() + static_cast<size_t>(sid) * style_dim0 * style_dim1 +
phoneme_len * style_dim1;
std::copy(p, p + style_dim1, ref_s.begin());
} else {
int32_t idx = style_dim0 / 2;
const float *p =
styles_.data() + static_cast<size_t>(sid) * style_dim0 * style_dim1 + idx * style_dim1;
std::copy(p, p + style_dim1, ref_s.begin());
}
return ref_s;
}| std::vector<float> LoadVoiceEmbedding(int32_t sid, int32_t phoneme_len) { | ||
| int32_t style_dim0 = 510; | ||
| int32_t style_dim1 = 256; | ||
| phoneme_len = std::max(phoneme_len, 0); | ||
|
|
||
| sid = std::max(sid, 0); | ||
| if (meta_data_.num_speakers > 0) { | ||
| sid = std::min(sid, meta_data_.num_speakers - 1); | ||
| } | ||
|
|
||
| std::vector<float> ref_s(style_dim1); | ||
| if (phoneme_len < style_dim0) { | ||
| const float *p = styles_.data() + sid * style_dim0 * style_dim1 + | ||
| phoneme_len * style_dim1; | ||
| std::copy(p, p + style_dim1, ref_s.begin()); | ||
| } else { | ||
| int32_t idx = style_dim0 / 2; | ||
| const float *p = | ||
| styles_.data() + sid * style_dim0 * style_dim1 + idx * style_dim1; | ||
| std::copy(p, p + style_dim1, ref_s.begin()); | ||
| } | ||
| return ref_s; | ||
| } |
There was a problem hiding this comment.
In LoadVoiceEmbedding, if meta_data_.num_speakers is 0, the sid parameter is not clamped from above. This can lead to a buffer over-read when accessing styles_ if a non-zero sid is provided.
Recommendation: Ensure sid is set to 0 if num_speakers is 0.
std::vector<float> LoadVoiceEmbedding(int32_t sid, int32_t phoneme_len) {
int32_t style_dim0 = 510;
int32_t style_dim1 = 256;
phoneme_len = std::max(phoneme_len, 0);
sid = std::max(sid, 0);
if (meta_data_.num_speakers > 0) {
sid = std::min(sid, meta_data_.num_speakers - 1);
} else {
sid = 0;
}
std::vector<float> ref_s(style_dim1);
if (styles_.empty()) {
return ref_s;
}
if (phoneme_len < style_dim0) {
const float *p = styles_.data() + static_cast<size_t>(sid) * style_dim0 * style_dim1 +
phoneme_len * style_dim1;
std::copy(p, p + style_dim1, ref_s.begin());
} else {
int32_t idx = style_dim0 / 2;
const float *p =
styles_.data() + static_cast<size_t>(sid) * style_dim0 * style_dim1 + idx * style_dim1;
std::copy(p, p + style_dim1, ref_s.begin());
}
return ref_s;
}| return meta_data_; | ||
| } | ||
|
|
||
| std::vector<float> Run(std::vector<int64_t> input_ids, int64_t sid, float speed) { |
There was a problem hiding this comment.
The Run method is not thread-safe. The corresponding implementation for the axera backend (sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc) uses a std::mutex to protect its Run method. To ensure thread safety and consistency across backends, you should consider adding a mutex to this class and locking it within the Run method.
| } | ||
|
|
||
| std::vector<float> x_out = model3_->GetOutputTensorData("x"); | ||
| PostprocessXToAudio(x_out, 23041, audio); |
There was a problem hiding this comment.
The magic number 23041 is hardcoded here. Throughout this file, there are several other magic numbers (e.g., 32 for DOUBLE_INPUT_THRESHOLD, 50 for dur_shape_2, 640 for d_shape_2, 20 and 5 for STFT parameters). These should be defined as named constants with descriptive names to improve code readability and maintainability. For instance: static constexpr int32_t kNumFrames = 23041;.
| } | ||
|
|
||
| std::vector<float> x_out = model3_->GetOutputTensorData("x"); | ||
| PostprocessXToAudio(x_out, 23041, audio); |
There was a problem hiding this comment.
The magic number 23041 is hardcoded here. Throughout this file, there are several other magic numbers (e.g., 32 for DOUBLE_INPUT_THRESHOLD, 50 for dur_shape_2, 640 for d_shape_2, 20 and 5 for STFT parameters). These should be defined as named constants with descriptive names to improve code readability and maintainability. For instance: static constexpr int32_t kNumFrames = 23041;.
| if (num_speakers == 0 && sid != 0) { | ||
| #if __OHOS__ | ||
| SHERPA_ONNX_LOGE( | ||
| "This is a single-speaker model and supports only sid 0. Given sid: " | ||
| "%{public}d. sid is ignored", | ||
| static_cast<int32_t>(sid)); | ||
| #else | ||
| SHERPA_ONNX_LOGE( | ||
| "This is a single-speaker model and supports only sid 0. Given sid: " | ||
| "%d. sid is ignored", | ||
| static_cast<int32_t>(sid)); | ||
| #endif | ||
| } |
There was a problem hiding this comment.
The Generate function identifies that the model is single-speaker when num_speakers == 0, but it fails to reset sid to 0. This allows an invalid sid to be passed to the underlying model implementation, which can lead to out-of-bounds memory access.
Recommendation: Reset sid to 0 when num_speakers == 0.
| if (num_speakers == 0 && sid != 0) { | |
| #if __OHOS__ | |
| SHERPA_ONNX_LOGE( | |
| "This is a single-speaker model and supports only sid 0. Given sid: " | |
| "%{public}d. sid is ignored", | |
| static_cast<int32_t>(sid)); | |
| #else | |
| SHERPA_ONNX_LOGE( | |
| "This is a single-speaker model and supports only sid 0. Given sid: " | |
| "%d. sid is ignored", | |
| static_cast<int32_t>(sid)); | |
| #endif | |
| } | |
| if (num_speakers == 0 && sid != 0) { | |
| #if __OHOS__ | |
| SHERPA_ONNX_LOGE( | |
| "This is a single-speaker model and supports only sid 0. Given sid: " | |
| "%{public}d. sid is ignored", | |
| static_cast<int32_t>(sid)); | |
| #else | |
| SHERPA_ONNX_LOGE( | |
| "This is a single-speaker model and supports only sid 0. Given sid: " | |
| "%d. sid is ignored", | |
| static_cast<int32_t>(sid)); | |
| #endif | |
| sid = 0; | |
| } |
| if (num_speakers == 0 && sid != 0) { | ||
| #if __OHOS__ | ||
| SHERPA_ONNX_LOGE( | ||
| "This is a single-speaker model and supports only sid 0. Given sid: " | ||
| "%{public}d. sid is ignored", | ||
| static_cast<int32_t>(sid)); | ||
| #else | ||
| SHERPA_ONNX_LOGE( | ||
| "This is a single-speaker model and supports only sid 0. Given sid: " | ||
| "%d. sid is ignored", | ||
| static_cast<int32_t>(sid)); | ||
| #endif | ||
| } |
There was a problem hiding this comment.
Similar to the AXCL implementation, the Generate function logs that sid is ignored when num_speakers == 0 but fails to reset it to 0, leading to potential out-of-bounds access in the model implementation.
Recommendation: Reset sid to 0 when num_speakers == 0.
| if (num_speakers == 0 && sid != 0) { | |
| #if __OHOS__ | |
| SHERPA_ONNX_LOGE( | |
| "This is a single-speaker model and supports only sid 0. Given sid: " | |
| "%{public}d. sid is ignored", | |
| static_cast<int32_t>(sid)); | |
| #else | |
| SHERPA_ONNX_LOGE( | |
| "This is a single-speaker model and supports only sid 0. Given sid: " | |
| "%d. sid is ignored", | |
| static_cast<int32_t>(sid)); | |
| #endif | |
| } | |
| if (num_speakers == 0 && sid != 0) { | |
| #if __OHOS__ | |
| SHERPA_ONNX_LOGE( | |
| "This is a single-speaker model and supports only sid 0. Given sid: " | |
| "%{public}d. sid is ignored", | |
| static_cast<int32_t>(sid)); | |
| #else | |
| SHERPA_ONNX_LOGE( | |
| "This is a single-speaker model and supports only sid 0. Given sid: " | |
| "%d. sid is ignored", | |
| static_cast<int32_t>(sid)); | |
| #endif | |
| sid = 0; | |
| } |
| template <typename T> | ||
| static std::vector<T> Linspace(T a, T b, size_t N) { | ||
| if (N == 0) return {}; | ||
| if (N == 1) return {a}; | ||
| T h = (b - a) / static_cast<T>(N - 1); | ||
| std::vector<T> xs(N); | ||
| T val = a; | ||
| for (size_t i = 0; i < N; ++i) { | ||
| xs[i] = val; | ||
| val += h; | ||
| } | ||
| return xs; | ||
| } |
| }; | ||
|
|
||
| } // namespace sherpa_onnx | ||
| #endif // SHERPA_ONNX_CSRC_AXERA_OFFLINE_TTS_KOKORO_AXERA_IMPL_H_ No newline at end of file |
| class AxeraModel { | ||
| public: | ||
| explicit AxeraModel(const std::string &filename) { | ||
| auto buf = ReadFile(filename); | ||
| Init(buf.data(), buf.size()); | ||
| } | ||
|
|
||
| AxeraModel(const void *cpu_buf, size_t buf_len_in_bytes) { | ||
| Init(cpu_buf, buf_len_in_bytes); | ||
| } | ||
|
|
||
| ~AxeraModel() { | ||
| if (io_data_.pInputs || io_data_.pOutputs) { | ||
| FreeIO(&io_data_); | ||
| } | ||
|
|
||
| if (handle_) { | ||
| AX_ENGINE_DestroyHandle(handle_); | ||
| } | ||
| } | ||
|
|
||
| std::vector<int32_t> TensorShape(const std::string &name) const { | ||
| auto it = input_name_to_index_.find(name); | ||
| if (it != input_name_to_index_.end()) { | ||
| return input_tensor_shapes_[it->second]; | ||
| } | ||
|
|
||
| auto it2 = output_name_to_index_.find(name); | ||
| if (it2 != output_name_to_index_.end()) { | ||
| return output_tensor_shapes_[it2->second]; | ||
| } | ||
|
|
||
| SHERPA_ONNX_LOGE("Found no tensor with name: '%s'", name.c_str()); | ||
| return {}; | ||
| } | ||
|
|
||
| template <typename T> | ||
| bool SetInputTensorData(const std::string &name, const T *p, int32_t n) { | ||
| auto it = input_name_to_index_.find(name); | ||
| if (it == input_name_to_index_.end()) { | ||
| SHERPA_ONNX_LOGE("Found no input tensor with name: '%s'", name.c_str()); | ||
| return false; | ||
| } | ||
|
|
||
| int32_t i = it->second; | ||
| size_t expected_size = io_info_->pInputs[i].nSize; | ||
| size_t given_size = n * sizeof(T); | ||
|
|
||
| if (expected_size != given_size) { | ||
| SHERPA_ONNX_LOGE( | ||
| "Input tensor '%s' size mismatch. Expected %zu bytes, got %zu bytes", | ||
| name.c_str(), expected_size, given_size); | ||
| return false; | ||
| } | ||
|
|
||
| std::memcpy(io_data_.pInputs[i].pVirAddr, p, expected_size); | ||
| return true; | ||
| } | ||
|
|
||
| std::vector<float> GetOutputTensorData(const std::string &name) { | ||
| auto it = output_name_to_index_.find(name); | ||
| if (it == output_name_to_index_.end()) { | ||
| SHERPA_ONNX_LOGE("Found no output tensor with name: '%s'", name.c_str()); | ||
| return {}; | ||
| } | ||
|
|
||
| int32_t i = it->second; | ||
| const auto &out_meta = io_info_->pOutputs[i]; | ||
| auto &out_buf = io_data_.pOutputs[i]; | ||
|
|
||
| AX_SYS_MinvalidateCache(out_buf.phyAddr, out_buf.pVirAddr, out_meta.nSize); | ||
|
|
||
| size_t out_elems = out_meta.nSize / sizeof(float); | ||
| std::vector<float> out(out_elems); | ||
| std::memcpy(out.data(), out_buf.pVirAddr, out_meta.nSize); | ||
|
|
||
| return out; | ||
| } | ||
|
|
||
| bool Run() { | ||
| auto ret = AX_ENGINE_RunSync(handle_, &io_data_); | ||
| if (ret != 0) { | ||
| SHERPA_ONNX_LOGE("AX_ENGINE_RunSync failed, ret = %d", ret); | ||
| return false; | ||
| } | ||
|
|
||
| return true; | ||
| } | ||
|
|
||
| private: | ||
| void Init(const void *model_data, size_t model_data_length) { | ||
| InitContext(model_data, model_data_length, false, &handle_); | ||
| InitInputOutputAttrs(handle_, false, &io_info_); | ||
| PrepareIO(io_info_, &io_data_, false); | ||
|
|
||
| input_tensor_shapes_.reserve(io_info_->nInputSize); | ||
| for (uint32_t i = 0; i < io_info_->nInputSize; ++i) { | ||
| const auto &in = io_info_->pInputs[i]; | ||
| std::string name = in.pName; | ||
| input_name_to_index_[name] = i; | ||
| input_tensor_shapes_.emplace_back(in.pShape, in.pShape + in.nShapeSize); | ||
| } | ||
|
|
||
| output_tensor_shapes_.reserve(io_info_->nOutputSize); | ||
| for (uint32_t i = 0; i < io_info_->nOutputSize; ++i) { | ||
| const auto &out = io_info_->pOutputs[i]; | ||
| std::string name = out.pName; | ||
| output_name_to_index_[name] = i; | ||
| output_tensor_shapes_.emplace_back(out.pShape, | ||
| out.pShape + out.nShapeSize); | ||
| } | ||
| } | ||
|
|
||
| private: | ||
| AX_ENGINE_HANDLE handle_ = nullptr; | ||
| AX_ENGINE_IO_INFO_T *io_info_ = nullptr; | ||
| AX_ENGINE_IO_T io_data_{}; | ||
|
|
||
| std::unordered_map<std::string, int32_t> input_name_to_index_; | ||
| std::unordered_map<std::string, int32_t> output_name_to_index_; | ||
|
|
||
| std::vector<std::vector<int32_t>> input_tensor_shapes_; | ||
| std::vector<std::vector<int32_t>> output_tensor_shapes_; | ||
| }; |
There was a problem hiding this comment.
The AxeraModel class is defined locally within this .cc file. It's a substantial piece of code (over 120 lines) and is very similar in structure and purpose to the AxclModel class. To improve code organization and reduce duplication, consider refactoring this into a separate axera-model.h and axera-model.cc, similar to how AxclModel is handled. This would also make it easier to reuse if needed elsewhere.
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (6)
sherpa-onnx/csrc/axera/offline-tts-kokoro-axera-impl.h (1)
38-90: Significant code duplication in constructors for FST/FAR loading.Both constructors contain nearly identical logic for loading
rule_fstsandrule_fars, differing only in file reading (direct path vs. Manager-based). This pattern appears to be duplicated from existing impl files.For maintainability, consider extracting the FST/FAR processing logic into helper methods or a separate utility class that can work with both file paths and Manager-based reading.
Also applies to: 92-147
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-axera-impl.h` around lines 38 - 90, The constructor OfflineTtsKokoroAxeraImpl duplicates the logic that parses rule_fsts and rule_fars and populates tn_list_ (using SplitStringToVector, creating kaldifst::TextNormalizer from file paths or from fst::FarReader), which also appears in the other constructor region; extract that shared behavior into one or two helper methods (e.g., LoadRuleFstsFromPaths(const std::string& csv, std::vector<std::unique_ptr<kaldifst::TextNormalizer>>* tn_list) and LoadRuleFarsFromPaths(const std::string& csv, std::vector<std::unique_ptr<kaldifst::TextNormalizer>>* tn_list) or a single LoadRules helper that handles both FST and FAR inputs), call those helpers from the constructor(s) after InitFrontend(), keep existing debug logging and use tn_list_.reserve as appropriate, and reuse the same helpers in the Manager-based constructors to eliminate duplication.sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc (2)
296-298: Consider extracting hardcoded dimension constants.
d_shape_2 = 640is hardcoded with a comment suggesting it relates tostyle_dim_[2]. This magic number, along with others likedur_shape_2 = 50(line 375) andDOUBLE_INPUT_THRESHOLD = 32(line 510), would benefit from being named constants for clarity and maintainability.Example named constants
// At class level or namespace scope static constexpr int32_t kStyleDim2 = 640; static constexpr int32_t kDurShape2 = 50; static constexpr int32_t kDoubleInputThreshold = 32;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 296 - 298, Replace magic numbers with named constexpr constants and use them where d_shape_2, dur_shape_2 and DOUBLE_INPUT_THRESHOLD are set/used: define e.g. kStyleDim2, kDurShape2, kDoubleInputThreshold at class or anonymous-namespace scope and replace the literal 640 (used to initialize d_shape_2 / en), 50 (used for dur_shape_2) and 32 (DOUBLE_INPUT_THRESHOLD) with those constants; ensure any comment referring to style_dim_[2] now references kStyleDim2 and update variable initializations where functions or members (e.g., d_shape_2, dur_shape_2, DOUBLE_INPUT_THRESHOLD) are declared to use the new constants for clarity and maintainability.
231-240: Input truncation warning should consider providing alternative handling.When
input_idsexceedsmax_seq_len_, the code logs a warning and truncates. This is reasonable, but users might want to know this limitation upfront or have the option to split long inputs into multiple calls.The truncation behavior is safe but worth documenting in the public API.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 231 - 240, The current RunModels implementation silently truncates input_ids when actual_len > max_seq_len_; change this to either (A) make truncation explicit/configurable by adding a parameter (e.g., bool allow_truncate) to RunModels and only resize when allow_truncate is true, or (B) refuse to run and return false (or an error code) if input_ids.size() > max_seq_len_, and in both cases update the SHERPA_ONNX_LOGE call to include the offending length and a suggested action (e.g., "increase max_seq_len_, set allow_truncate=true, or split input into multiple calls"); ensure references to RunModels, input_ids, max_seq_len_, and the log message are updated accordingly and document the behavior in the public API.sherpa-onnx/csrc/offline-tts-impl.cc (2)
51-71: Consider case-insensitive provider comparison for user convenience.The provider string comparison is case-sensitive (e.g.,
"axera","axcl"). Users specifying"AXERA"or"Axera"would fall through to the wrong code path. Consider normalizing the provider string to lowercase before comparison.Example approach
std::string provider = config.model.provider; std::transform(provider.begin(), provider.end(), provider.begin(), ::tolower); if (provider == "axera") { // ... }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/offline-tts-impl.cc` around lines 51 - 71, Normalize config.model.provider to lowercase before comparing (e.g., create a local lowercased string from config.model.provider using std::transform) and then use that lowercase variable for the if checks (e.g., the existing if (provider == "axera") branch). Update any other provider comparisons in this file (such as "axcl") to use the same lowercased variable so comparisons for symbols like config.model.provider, OfflineTtsKokoroAxeraImpl, and related provider branches become case-insensitive.
51-94: Duplicated provider-checking logic between Create overloads.The provider-specific branching for
axeraandaxclis duplicated nearly identically in bothCreate(const OfflineTtsConfig&)andCreate(Manager*, const OfflineTtsConfig&). This increases maintenance burden when adding new providers or modifying behavior.Consider extracting provider validation into a helper function or macro to reduce duplication.
Also applies to: 121-164
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/offline-tts-impl.cc` around lines 51 - 94, The provider-specific branching for "axera" and "axcl" is duplicated between the two Create overloads (Create(const OfflineTtsConfig&) and Create(Manager*, const OfflineTtsConfig&)); extract the common logic into a helper function (e.g. BuildProviderImpl or ValidateAndCreateProvider) that takes the OfflineTtsConfig (and optionally Manager*) and returns the appropriate std::unique_ptr<OfflineTts> or nullptr after handling the SHERPA_ONNX_ENABLE_AXERA / SHERPA_ONNX_ENABLE_AXCL compile-time guards and logging (SHERPA_ONNX_LOGE / SHERPA_ONNX_EXIT). Move the checks for config.model.provider == "axera" and "axcl", the kokoro model existence check, and the platform-specific error messages into that helper, and have both Create overloads call it to reduce duplication and keep behavior identical.sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.h (1)
8-9: Missing<cstdint>include forint64_t.The header uses
int64_tin theRunmethod signature (line 25) but doesn't include<cstdint>. While it may be transitively included, explicit inclusion is recommended for types used in public interfaces.Proposed fix
+#include <cstdint> `#include` <memory> `#include` <vector>🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.h` around lines 8 - 9, Add a direct include of <cstdint> so the public API that uses int64_t (the Run method signature) has the required typedef available; update the header includes at the top (next to the existing <memory> and <vector>) to add <cstdint> to ensure int64_t is defined for the Run method and any other public declarations referencing fixed-width integer types.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h`:
- Around line 163-175: The code logs when a non-zero sid is provided for
single-speaker models but doesn't change the value; update the branch in
OfflineTtsKokoroAxclImpl (where num_speakers and sid are checked) to set sid = 0
when num_speakers == 0 && sid != 0 so the sanitized speaker ID is passed into
Process() (and anywhere else the method uses sid) instead of the original value;
keep the existing log statements.
- Around line 75-83: The code calls fst::FarReader<fst::StdArc>::Open(...) and
immediately dereferences reader (reader->Done()) which can be null; update the
Open call sites (the local reader variable and the in-memory Open(std::move(s))
path) to check whether reader is non-null before entering the for loop, and
handle the failure (log an error/return or skip processing) instead of
dereferencing; specifically add a guard after the Open(...) that validates
reader and only runs the loop that creates kaldifst::TextNormalizer and pushes
into tn_list_ when reader is valid.
In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc`:
- Around line 194-210: LoadVoices currently hardcodes model metadata
(sample_rate, style dims, max_token_len) which breaks compatibility; instead
read these values from the model metadata on model1_ (e.g., fetch keys like
"sample_rate", "style_dim0"/"style_dim1" or "style_dims", "max_token_len", and
optional "voice"/"has_espeak"), parse them into meta_data_ (sample_rate,
max_token_len, voice, has_espeak), compute num_speakers using the retrieved
style dimensions when calculating meta_data_.num_speakers and when slicing
styles_, and keep max_seq_len_ from model1_->TensorShape("input_ids") as before
so the function adapts to different Kokoro model variants.
In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-axera-impl.h`:
- Around line 389-390: The Process method's sid parameter is declared as int32_t
causing narrowing from Generate (which passes an int64_t) and mismatch with
model_->Run (which expects int64_t); change the Process signature to accept
int64_t sid (update GeneratedAudio Process(...) to use int64_t), update any
internal uses and calls (e.g., Generate → Process) to pass the 64-bit sid
unchanged, and ensure the argument passed into model_->Run remains int64_t to
avoid implicit narrowing/widening conversions.
In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc`:
- Around line 596-606: In PostprocessXToAudio validate that x.size() is at least
expected_size = static_cast<size_t>(half_n_fft) * num_frames * 2 before using
x.begin() + half_n_fft * num_frames; if it is smaller, bail out gracefully
(e.g., log/error/throw and return) or pad/resize x to expected_size so creating
spec_part and phase_part with iterators is safe; update the code around the
spec_part/phase_part assignment to check x.size(), compute expected_size using
half_n_fft and num_frames, and handle the error path instead of unconditionally
using the iterators.
- Around line 321-333: Validate that voices_data_length is a multiple of
sizeof(float) * kStyleDim0 * kStyleDim1 before computing meta_data_.num_speakers
and building styles_; replace the hardcoded 510 and 256 with kStyleDim0 and
kStyleDim1 so both LoadVoices and LoadVoiceEmbedding share the same constants,
compute actual_num_floats and num_speakers using those constants, and if the
blob size is not a whole number of speaker blocks fail early (log/throw/return)
rather than creating a truncated styles_ vector to ensure data integrity.
- Around line 250-252: Validate kokoro.length_scale and computed speed to ensure
they are positive before using them: check config_.kokoro.length_scale > 0
before performing the adjustment in the block that sets speed (the spot using
config_.kokoro.length_scale) and after computing speed ensure speed > 0 (or
clamp to a sane positive minimum / return an error status) so ProcessDuration()
never receives a non-positive divisor; update the caller/flow that relies on
speed to handle error/invalid-config cases and add clear error messages
referencing config_.kokoro.length_scale and speed.
---
Nitpick comments:
In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc`:
- Around line 296-298: Replace magic numbers with named constexpr constants and
use them where d_shape_2, dur_shape_2 and DOUBLE_INPUT_THRESHOLD are set/used:
define e.g. kStyleDim2, kDurShape2, kDoubleInputThreshold at class or
anonymous-namespace scope and replace the literal 640 (used to initialize
d_shape_2 / en), 50 (used for dur_shape_2) and 32 (DOUBLE_INPUT_THRESHOLD) with
those constants; ensure any comment referring to style_dim_[2] now references
kStyleDim2 and update variable initializations where functions or members (e.g.,
d_shape_2, dur_shape_2, DOUBLE_INPUT_THRESHOLD) are declared to use the new
constants for clarity and maintainability.
- Around line 231-240: The current RunModels implementation silently truncates
input_ids when actual_len > max_seq_len_; change this to either (A) make
truncation explicit/configurable by adding a parameter (e.g., bool
allow_truncate) to RunModels and only resize when allow_truncate is true, or (B)
refuse to run and return false (or an error code) if input_ids.size() >
max_seq_len_, and in both cases update the SHERPA_ONNX_LOGE call to include the
offending length and a suggested action (e.g., "increase max_seq_len_, set
allow_truncate=true, or split input into multiple calls"); ensure references to
RunModels, input_ids, max_seq_len_, and the log message are updated accordingly
and document the behavior in the public API.
In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.h`:
- Around line 8-9: Add a direct include of <cstdint> so the public API that uses
int64_t (the Run method signature) has the required typedef available; update
the header includes at the top (next to the existing <memory> and <vector>) to
add <cstdint> to ensure int64_t is defined for the Run method and any other
public declarations referencing fixed-width integer types.
In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-axera-impl.h`:
- Around line 38-90: The constructor OfflineTtsKokoroAxeraImpl duplicates the
logic that parses rule_fsts and rule_fars and populates tn_list_ (using
SplitStringToVector, creating kaldifst::TextNormalizer from file paths or from
fst::FarReader), which also appears in the other constructor region; extract
that shared behavior into one or two helper methods (e.g.,
LoadRuleFstsFromPaths(const std::string& csv,
std::vector<std::unique_ptr<kaldifst::TextNormalizer>>* tn_list) and
LoadRuleFarsFromPaths(const std::string& csv,
std::vector<std::unique_ptr<kaldifst::TextNormalizer>>* tn_list) or a single
LoadRules helper that handles both FST and FAR inputs), call those helpers from
the constructor(s) after InitFrontend(), keep existing debug logging and use
tn_list_.reserve as appropriate, and reuse the same helpers in the Manager-based
constructors to eliminate duplication.
In `@sherpa-onnx/csrc/offline-tts-impl.cc`:
- Around line 51-71: Normalize config.model.provider to lowercase before
comparing (e.g., create a local lowercased string from config.model.provider
using std::transform) and then use that lowercase variable for the if checks
(e.g., the existing if (provider == "axera") branch). Update any other provider
comparisons in this file (such as "axcl") to use the same lowercased variable so
comparisons for symbols like config.model.provider, OfflineTtsKokoroAxeraImpl,
and related provider branches become case-insensitive.
- Around line 51-94: The provider-specific branching for "axera" and "axcl" is
duplicated between the two Create overloads (Create(const OfflineTtsConfig&) and
Create(Manager*, const OfflineTtsConfig&)); extract the common logic into a
helper function (e.g. BuildProviderImpl or ValidateAndCreateProvider) that takes
the OfflineTtsConfig (and optionally Manager*) and returns the appropriate
std::unique_ptr<OfflineTts> or nullptr after handling the
SHERPA_ONNX_ENABLE_AXERA / SHERPA_ONNX_ENABLE_AXCL compile-time guards and
logging (SHERPA_ONNX_LOGE / SHERPA_ONNX_EXIT). Move the checks for
config.model.provider == "axera" and "axcl", the kokoro model existence check,
and the platform-specific error messages into that helper, and have both Create
overloads call it to reduce duplication and keep behavior identical.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4c749349-7b29-40e8-935c-83313b17a0ed
📒 Files selected for processing (10)
sherpa-onnx/csrc/CMakeLists.txtsherpa-onnx/csrc/axcl/axcl-model.ccsherpa-onnx/csrc/axcl/axcl-model.hsherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.hsherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.ccsherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.hsherpa-onnx/csrc/axera/offline-tts-kokoro-axera-impl.hsherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.ccsherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.hsherpa-onnx/csrc/offline-tts-impl.cc
| std::unique_ptr<fst::FarReader<fst::StdArc>> reader( | ||
| fst::FarReader<fst::StdArc>::Open(f)); | ||
| for (; !reader->Done(); reader->Next()) { | ||
| std::unique_ptr<fst::StdConstFst> r( | ||
| fst::CastOrConvertToConstFst(reader->GetFst()->Copy())); | ||
|
|
||
| tn_list_.push_back( | ||
| std::make_unique<kaldifst::TextNormalizer>(std::move(r))); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
find . -name "offline-tts-kokoro-axcl-impl.h" -type fRepository: k2-fsa/sherpa-onnx
Length of output: 116
🏁 Script executed:
cat -n ./sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h | sed -n '60,90p'Repository: k2-fsa/sherpa-onnx
Length of output: 1224
🏁 Script executed:
cat -n ./sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h | sed -n '120,155p'Repository: k2-fsa/sherpa-onnx
Length of output: 1412
🏁 Script executed:
# Check if FarReader is from FST library and understand its behavior
rg "FarReader" sherpa-onnx/csrc/axcl/ -A 3 -B 1Repository: k2-fsa/sherpa-onnx
Length of output: 1212
🏁 Script executed:
# Search for other FarReader usage patterns in the codebase
rg "FarReader.*Open" --type cpp --type h -A 2 | head -60Repository: k2-fsa/sherpa-onnx
Length of output: 4026
🏁 Script executed:
# Check if any file has null checks for FarReader::Open
rg "if.*!reader" --type cpp --type h -B 2 -A 1Repository: k2-fsa/sherpa-onnx
Length of output: 44
🏁 Script executed:
# Search for any checks after FarReader::Open
rg "FarReader.*Open" --type cpp --type h -A 5 | grep -E "(if|nullptr|NULL)" | head -20Repository: k2-fsa/sherpa-onnx
Length of output: 44
🏁 Script executed:
# Check the error handling macros to understand the pattern
rg "SHERPA_ONNX_LOGE|SHERPA_ONNX_EXIT" sherpa-onnx/csrc/ --type h --type cpp | head -20Repository: k2-fsa/sherpa-onnx
Length of output: 1824
🏁 Script executed:
# Search for how FarReader is being used and if there are any comments about it
rg "FarReader" --type h --type cpp -B 5 -A 10 | grep -E "(nullptr|NULL|error|fail)" | head -30Repository: k2-fsa/sherpa-onnx
Length of output: 44
🏁 Script executed:
# Check if there's any documentation or comments about FarReader in the codebase
rg "FarReader" sherpa-onnx/csrc/ -B 10 | grep -E "(//|/\*)" | head -20Repository: k2-fsa/sherpa-onnx
Length of output: 44
🏁 Script executed:
# Look at the complete context around one of the FarReader usages to confirm no null check
cat -n sherpa-onnx/csrc/offline-recognizer-impl.cc | sed -n '1,50p' | tail -30Repository: k2-fsa/sherpa-onnx
Length of output: 1699
🏁 Script executed:
# Search for FarReader usage in offline-recognizer-impl.cc with full context
rg "FarReader.*Open" sherpa-onnx/csrc/offline-recognizer-impl.cc -B 5 -A 8Repository: k2-fsa/sherpa-onnx
Length of output: 1131
🏁 Script executed:
# Search for any FAR-related error handling or documentation in the codebase
rg "FarReader" -B 20 | grep -E "(nullptr|ERROR|check|valid|status)" | head -20Repository: k2-fsa/sherpa-onnx
Length of output: 44
🏁 Script executed:
# Check if there's an fst directory in the repo or if it's an external dependency
find . -name "far.h" -o -name "FarReader*" 2>/dev/null | head -10Repository: k2-fsa/sherpa-onnx
Length of output: 44
🏁 Script executed:
# Look for imports of far.h to understand it's an external dependency
rg "far.h" sherpa-onnx/csrc/ -B 2 -A 2Repository: k2-fsa/sherpa-onnx
Length of output: 3340
Add null check after FarReader::Open() to prevent null pointer dereference.
Open() can fail if the FAR archive is missing or corrupt, returning a null pointer. The code immediately calls reader->Done() without validation, causing a crash. Add a guard before the loop:
Proposed fix
std::unique_ptr<fst::FarReader<fst::StdArc>> reader(
fst::FarReader<fst::StdArc>::Open(f));
+ if (!reader) {
+ SHERPA_ONNX_LOGE("Failed to open FAR archive: %s", f.c_str());
+ SHERPA_ONNX_EXIT(-1);
+ }
for (; !reader->Done(); reader->Next()) {Apply the same guard to the in-memory Open(std::move(s)) path at lines 135–144 as well.
Note: This pattern appears in at least 9 other files using FarReader::Open() without validation. Consider a systematic fix across the codebase.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| std::unique_ptr<fst::FarReader<fst::StdArc>> reader( | |
| fst::FarReader<fst::StdArc>::Open(f)); | |
| for (; !reader->Done(); reader->Next()) { | |
| std::unique_ptr<fst::StdConstFst> r( | |
| fst::CastOrConvertToConstFst(reader->GetFst()->Copy())); | |
| tn_list_.push_back( | |
| std::make_unique<kaldifst::TextNormalizer>(std::move(r))); | |
| } | |
| std::unique_ptr<fst::FarReader<fst::StdArc>> reader( | |
| fst::FarReader<fst::StdArc>::Open(f)); | |
| if (!reader) { | |
| SHERPA_ONNX_LOGE("Failed to open FAR archive: %s", f.c_str()); | |
| SHERPA_ONNX_EXIT(-1); | |
| } | |
| for (; !reader->Done(); reader->Next()) { | |
| std::unique_ptr<fst::StdConstFst> r( | |
| fst::CastOrConvertToConstFst(reader->GetFst()->Copy())); | |
| tn_list_.push_back( | |
| std::make_unique<kaldifst::TextNormalizer>(std::move(r))); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h` around lines 75 - 83,
The code calls fst::FarReader<fst::StdArc>::Open(...) and immediately
dereferences reader (reader->Done()) which can be null; update the Open call
sites (the local reader variable and the in-memory Open(std::move(s)) path) to
check whether reader is non-null before entering the for loop, and handle the
failure (log an error/return or skip processing) instead of dereferencing;
specifically add a guard after the Open(...) that validates reader and only runs
the loop that creates kaldifst::TextNormalizer and pushes into tn_list_ when
reader is valid.
| if (num_speakers == 0 && sid != 0) { | ||
| #if __OHOS__ | ||
| SHERPA_ONNX_LOGE( | ||
| "This is a single-speaker model and supports only sid 0. Given sid: " | ||
| "%{public}d. sid is ignored", | ||
| static_cast<int32_t>(sid)); | ||
| #else | ||
| SHERPA_ONNX_LOGE( | ||
| "This is a single-speaker model and supports only sid 0. Given sid: " | ||
| "%d. sid is ignored", | ||
| static_cast<int32_t>(sid)); | ||
| #endif | ||
| } |
There was a problem hiding this comment.
Actually reset sid for single-speaker models.
This branch only logs that non-zero sid is ignored, but it still passes the original value into Process(). That means a single-speaker model can still receive a non-zero speaker ID here.
🐛 Proposed fix
if (num_speakers == 0 && sid != 0) {
`#if` __OHOS__
SHERPA_ONNX_LOGE(
"This is a single-speaker model and supports only sid 0. Given sid: "
"%{public}d. sid is ignored",
static_cast<int32_t>(sid));
`#else`
SHERPA_ONNX_LOGE(
"This is a single-speaker model and supports only sid 0. Given sid: "
"%d. sid is ignored",
static_cast<int32_t>(sid));
`#endif`
+ sid = 0;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (num_speakers == 0 && sid != 0) { | |
| #if __OHOS__ | |
| SHERPA_ONNX_LOGE( | |
| "This is a single-speaker model and supports only sid 0. Given sid: " | |
| "%{public}d. sid is ignored", | |
| static_cast<int32_t>(sid)); | |
| #else | |
| SHERPA_ONNX_LOGE( | |
| "This is a single-speaker model and supports only sid 0. Given sid: " | |
| "%d. sid is ignored", | |
| static_cast<int32_t>(sid)); | |
| #endif | |
| } | |
| if (num_speakers == 0 && sid != 0) { | |
| `#if` __OHOS__ | |
| SHERPA_ONNX_LOGE( | |
| "This is a single-speaker model and supports only sid 0. Given sid: " | |
| "%{public}d. sid is ignored", | |
| static_cast<int32_t>(sid)); | |
| `#else` | |
| SHERPA_ONNX_LOGE( | |
| "This is a single-speaker model and supports only sid 0. Given sid: " | |
| "%d. sid is ignored", | |
| static_cast<int32_t>(sid)); | |
| `#endif` | |
| sid = 0; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h` around lines 163 - 175,
The code logs when a non-zero sid is provided for single-speaker models but
doesn't change the value; update the branch in OfflineTtsKokoroAxclImpl (where
num_speakers and sid are checked) to set sid = 0 when num_speakers == 0 && sid
!= 0 so the sanitized speaker ID is passed into Process() (and anywhere else the
method uses sid) instead of the original value; keep the existing log
statements.
| GeneratedAudio Process(const std::vector<std::vector<int64_t>> &tokens, | ||
| int32_t sid, float speed) const { |
There was a problem hiding this comment.
Type mismatch: sid narrowed from int64_t to int32_t.
The Process method declares int32_t sid (line 390), but it's called from Generate which receives int64_t sid (line 158). The model's Run method (per context snippet at offline-tts-kokoro-model-axera.h:26-27) also expects int64_t sid.
This causes implicit narrowing conversions:
Generate→Process:int64_t→int32_t(potential data loss for large speaker IDs)Process→model_->Run:int32_t→int64_t(safe widening)
Proposed fix
- GeneratedAudio Process(const std::vector<std::vector<int64_t>> &tokens,
- int32_t sid, float speed) const {
+ GeneratedAudio Process(const std::vector<std::vector<int64_t>> &tokens,
+ int64_t sid, float speed) const {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| GeneratedAudio Process(const std::vector<std::vector<int64_t>> &tokens, | |
| int32_t sid, float speed) const { | |
| GeneratedAudio Process(const std::vector<std::vector<int64_t>> &tokens, | |
| int64_t sid, float speed) const { |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-axera-impl.h` around lines 389 -
390, The Process method's sid parameter is declared as int32_t causing narrowing
from Generate (which passes an int64_t) and mismatch with model_->Run (which
expects int64_t); change the Process signature to accept int64_t sid (update
GeneratedAudio Process(...) to use int64_t), update any internal uses and calls
(e.g., Generate → Process) to pass the 64-bit sid unchanged, and ensure the
argument passed into model_->Run remains int64_t to avoid implicit
narrowing/widening conversions.
| if (config_.kokoro.length_scale != 1 && speed == 1) { | ||
| speed = 1. / config_.kokoro.length_scale; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
find . -name "*offline-tts-kokoro-model-axera.cc" -type fRepository: k2-fsa/sherpa-onnx
Length of output: 120
🏁 Script executed:
wc -l ./sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.ccRepository: k2-fsa/sherpa-onnx
Length of output: 124
🏁 Script executed:
cat -n ./sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc | sed -n '240,260p'Repository: k2-fsa/sherpa-onnx
Length of output: 782
🏁 Script executed:
cat -n ./sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc | sed -n '510,540p'Repository: k2-fsa/sherpa-onnx
Length of output: 1372
🏁 Script executed:
cat -n ./sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc | sed -n '1,50p'Repository: k2-fsa/sherpa-onnx
Length of output: 1690
🏁 Script executed:
cat -n ./sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc | sed -n '240,350p'Repository: k2-fsa/sherpa-onnx
Length of output: 4916
🏁 Script executed:
cat -n ./sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc | sed -n '380,450p'Repository: k2-fsa/sherpa-onnx
Length of output: 3066
🏁 Script executed:
cat -n ./sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc | sed -n '517,545p'Repository: k2-fsa/sherpa-onnx
Length of output: 1419
🏁 Script executed:
rg -n "length_scale\|speed" ./sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc | head -30Repository: k2-fsa/sherpa-onnx
Length of output: 44
🏁 Script executed:
grep -n "speed\|length_scale" ./sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.ccRepository: k2-fsa/sherpa-onnx
Length of output: 789
🏁 Script executed:
cat -n ./sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc | sed -n '517,532p'Repository: k2-fsa/sherpa-onnx
Length of output: 822
Add validation for non-positive speed and kokoro.length_scale values.
At line 529, ProcessDuration() divides by speed. If speed ≤ 0, this produces infinity, which is later rounded and cast to int32_t (line 531), resulting in undefined behavior and potentially massive allocations. Additionally, length_scale can indirectly cause invalid speed values through the calculation at line 251.
🛡️ Proposed fix
+ if (speed <= 0) {
+ SHERPA_ONNX_LOGE("speed must be > 0. Given: %f", speed);
+ return {};
+ }
+
+ if (config_.kokoro.length_scale <= 0) {
+ SHERPA_ONNX_LOGE("kokoro.length_scale must be > 0. Given: %f",
+ config_.kokoro.length_scale);
+ return {};
+ }
+
if (config_.kokoro.length_scale != 1 && speed == 1) {
speed = 1. / config_.kokoro.length_scale;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (config_.kokoro.length_scale != 1 && speed == 1) { | |
| speed = 1. / config_.kokoro.length_scale; | |
| } | |
| if (speed <= 0) { | |
| SHERPA_ONNX_LOGE("speed must be > 0. Given: %f", speed); | |
| return {}; | |
| } | |
| if (config_.kokoro.length_scale <= 0) { | |
| SHERPA_ONNX_LOGE("kokoro.length_scale must be > 0. Given: %f", | |
| config_.kokoro.length_scale); | |
| return {}; | |
| } | |
| if (config_.kokoro.length_scale != 1 && speed == 1) { | |
| speed = 1. / config_.kokoro.length_scale; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 250 -
252, Validate kokoro.length_scale and computed speed to ensure they are positive
before using them: check config_.kokoro.length_scale > 0 before performing the
adjustment in the block that sets speed (the spot using
config_.kokoro.length_scale) and after computing speed ensure speed > 0 (or
clamp to a sane positive minimum / return an error status) so ProcessDuration()
never receives a non-positive divisor; update the caller/flow that relies on
speed to handle error/invalid-config cases and add clear error messages
referencing config_.kokoro.length_scale and speed.
| void PostprocessXToAudio(const std::vector<float> &x, int32_t num_frames, | ||
| std::vector<float> &audio) { | ||
| int32_t n_fft = 20; | ||
| int32_t hop_length = 5; | ||
| int32_t half_n_fft = n_fft / 2 + 1; | ||
|
|
||
| std::vector<float> spec_part(half_n_fft * num_frames); | ||
| std::vector<float> phase_part(half_n_fft * num_frames); | ||
| std::vector<float> cos_part(half_n_fft * num_frames); | ||
| spec_part.assign(x.begin(), x.begin() + half_n_fft * num_frames); | ||
| phase_part.assign(x.begin() + half_n_fft * num_frames, x.end()); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cd sherpa-onnx && find . -name "offline-tts-kokoro-model-axera.cc" -type fRepository: k2-fsa/sherpa-onnx
Length of output: 108
🏁 Script executed:
cat -n "csrc/axera/offline-tts-kokoro-model-axera.cc" | sed -n '590,620p'Repository: k2-fsa/sherpa-onnx
Length of output: 138
🏁 Script executed:
# Let me also search for the model3 call and understand the data flow
rg "PostprocessXToAudio" "csrc/axera/offline-tts-kokoro-model-axera.cc" -B 3 -A 20Repository: k2-fsa/sherpa-onnx
Length of output: 146
🏁 Script executed:
# Check if there are any size validations in the entire file related to model outputs
rg "\.size\(\)" "csrc/axera/offline-tts-kokoro-model-axera.cc" | head -20Repository: k2-fsa/sherpa-onnx
Length of output: 146
🏁 Script executed:
# Look for model3 invocation to see if output size is validated
rg "model3|model_out" "csrc/axera/offline-tts-kokoro-model-axera.cc" -B 2 -A 2Repository: k2-fsa/sherpa-onnx
Length of output: 146
🏁 Script executed:
find . -name "*kokoro*" -o -name "*axera*" | head -20Repository: k2-fsa/sherpa-onnx
Length of output: 944
🏁 Script executed:
ls -la csrc/ | head -20Repository: k2-fsa/sherpa-onnx
Length of output: 114
🏁 Script executed:
# Check if the file exists at the root or with a different structure
git ls-files | grep -i kokoroRepository: k2-fsa/sherpa-onnx
Length of output: 4340
🏁 Script executed:
git ls-files | grep -i axeraRepository: k2-fsa/sherpa-onnx
Length of output: 562
🏁 Script executed:
cat -n "sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc" | sed -n '590,620p'Repository: k2-fsa/sherpa-onnx
Length of output: 1364
🏁 Script executed:
# Get more context around the function
cat -n "sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc" | sed -n '580,650p'Repository: k2-fsa/sherpa-onnx
Length of output: 3145
🏁 Script executed:
# Find all occurrences of PostprocessXToAudio
rg "PostprocessXToAudio" "sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc" -B 5 -A 30Repository: k2-fsa/sherpa-onnx
Length of output: 2650
🏁 Script executed:
# Look for model3 or invocation of this function to understand how x is passed
rg "model3|PostprocessXToAudio" "sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc" -B 3 -A 5Repository: k2-fsa/sherpa-onnx
Length of output: 3228
Validate x size before splitting spectrogram buffers.
x.begin() + half_n_fft * num_frames (253,451) is used unconditionally to create iterators. If model3 returns a shorter tensor than expected, this creates invalid iterators and invokes undefined behavior. The expected size is half_n_fft * num_frames * 2 = 506,902 floats (spec_part + phase_part).
Proposed fix
void PostprocessXToAudio(const std::vector<float> &x, int32_t num_frames,
std::vector<float> &audio) {
int32_t n_fft = 20;
int32_t hop_length = 5;
int32_t half_n_fft = n_fft / 2 + 1;
+ size_t expected =
+ static_cast<size_t>(half_n_fft) * static_cast<size_t>(num_frames) * 2;
+ if (x.size() < expected) {
+ SHERPA_ONNX_LOGE(
+ "Unexpected model3 output size. Expected at least %zu floats, got "
+ "%zu",
+ expected, x.size());
+ audio.clear();
+ return;
+ }
std::vector<float> spec_part(half_n_fft * num_frames);
std::vector<float> phase_part(half_n_fft * num_frames);
std::vector<float> cos_part(half_n_fft * num_frames);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 596 -
606, In PostprocessXToAudio validate that x.size() is at least expected_size =
static_cast<size_t>(half_n_fft) * num_frames * 2 before using x.begin() +
half_n_fft * num_frames; if it is smaller, bail out gracefully (e.g.,
log/error/throw and return) or pad/resize x to expected_size so creating
spec_part and phase_part with iterators is safe; update the code around the
spec_part/phase_part assignment to check x.size(), compute expected_size using
half_n_fft and num_frames, and handle the error path instead of unconditionally
using the iterators.
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (2)
sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc (2)
625-629:⚠️ Potential issue | 🔴 CriticalValidate
xbefore slicing spectrogram and phase buffers.Line 629 is only valid if
model3returned at leasthalf_n_fft * num_frames * 2floats. A shorter tensor makesx.begin() + half_n_fft * num_framesan invalid iterator and invokes undefined behavior.🛡️ Suggested fix
int32_t n_fft = 20; int32_t hop_length = 5; int32_t half_n_fft = n_fft / 2 + 1; + size_t expected = + static_cast<size_t>(half_n_fft) * static_cast<size_t>(num_frames) * 2; + if (x.size() < expected) { + SHERPA_ONNX_LOGE( + "Unexpected model3 output size. Expected at least %zu floats, got " + "%zu", + expected, x.size()); + audio.clear(); + return; + } std::vector<float> spec_part(half_n_fft * num_frames); std::vector<float> phase_part(half_n_fft * num_frames);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 625 - 629, Before slicing into spec_part and phase_part, validate the length of the input vector x: ensure x.size() >= static_cast<size_t>(half_n_fft) * num_frames * 2 (or the exact expected length for model3 output) and handle the short-input case (e.g., log/error and return or throw) so that x.begin() + half_n_fft * num_frames is always a valid iterator; update the code around the spec_part/phase_part assignments (referencing variables x, spec_part, phase_part, half_n_fft, num_frames, and the model3 output) to perform this check and early error handling before calling assign.
247-249:⚠️ Potential issue | 🔴 CriticalReject non-positive
length_scale/speedbefore duration scaling.
speed = 1. / config_.kokoro.length_scalestill allows<= 0.ProcessDuration()divides byspeedon Line 552, so a zero/negative value can produce invalid rounded durations and break the rest of the pipeline.🛡️ Suggested fix
- if (config_.kokoro.length_scale != 1 && speed == 1) { + if (config_.kokoro.length_scale <= 0) { + SHERPA_ONNX_LOGE("kokoro.length_scale must be > 0. Given: %f", + config_.kokoro.length_scale); + return {}; + } + + if (config_.kokoro.length_scale != 1 && speed == 1) { speed = 1. / config_.kokoro.length_scale; } + + if (speed <= 0) { + SHERPA_ONNX_LOGE("speed must be > 0. Given: %f", speed); + return {}; + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 247 - 249, Before using length_scale to compute speed (the snippet that sets speed = 1. / config_.kokoro.length_scale), validate config_.kokoro.length_scale > 0 and ensure computed speed > 0; if the value is non-positive, fail early (return an error/Status or throw) with a clear message rather than proceeding to ProcessDuration(), so ProcessDuration() never receives zero/negative speed. Update the block that references config_.kokoro.length_scale and speed and add the guard+error path (or clamp only after explicit decision), ensuring callers observe the failure instead of continuing with invalid durations.
🧹 Nitpick comments (4)
sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc (4)
304-307: Remove unusedinput_lengthsparameter.
input_lengthsis computed inComputeExternalPreprocessingbut never used by the caller. This is dead code that adds confusion.♻️ Proposed fix
Update the function signature and call site:
- void ComputeExternalPreprocessing(const std::vector<int64_t> &input_ids, - int32_t actual_len, - std::vector<int32_t> &input_lengths, - std::vector<int32_t> &text_mask) { - input_lengths = std::vector<int32_t>{actual_len}; + void ComputeExternalPreprocessing(int32_t actual_len, + std::vector<int32_t> &text_mask) { text_mask.resize(max_seq_len_); for (int32_t i = 0; i < max_seq_len_; i++) { text_mask[i] = (i >= actual_len) ? 1 : 0; } }And update the call site in
InferenceSingleChunk:- std::vector<int32_t> input_lengths; std::vector<int32_t> text_mask; - ComputeExternalPreprocessing(input_ids, actual_len, input_lengths, - text_mask); + ComputeExternalPreprocessing(actual_len, text_mask);Also applies to: 576-585
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 304 - 307, The local vector input_lengths returned from ComputeExternalPreprocessing is unused — remove it from the API and all call sites: update ComputeExternalPreprocessing to drop the output parameter (remove the input_lengths parameter from its signature and implementation) and remove the corresponding argument at each call site (e.g., the call in InferenceSingleChunk and the other call around lines 576-585). Ensure compile by adjusting any internal logic that previously wrote into input_lengths and update any tests or callers that relied on that parameter.
74-86: Remove unusedLinspacefunction.This function is defined but never called anywhere in this file. Dead code should be removed to improve maintainability.
🧹 Suggested removal
-template <typename T> -static std::vector<T> Linspace(T a, T b, size_t N) { - if (N == 0) return {}; - if (N == 1) return {a}; - T h = (b - a) / static_cast<T>(N - 1); - std::vector<T> xs(N); - T val = a; - for (size_t i = 0; i < N; ++i) { - xs[i] = val; - val += h; - } - return xs; -}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 74 - 86, The template function Linspace (template<typename T> static std::vector<T> Linspace(T a, T b, size_t N)) is dead code in this translation unit; remove its entire definition to eliminate unused code and improve maintainability, and ensure there are no remaining references or forward declarations to Linspace in this file or its headers—after removal, rebuild/tests to confirm no regressions.
413-422: Consider readingdur_shape_2from model tensor shape.The hardcoded value
50represents the duration tensor's second dimension. For better model compatibility, consider querying this frommodel1_'s output tensor shape for"duration", similar to howmax_seq_len_is obtained fromTensorShape("input_ids").🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 413 - 422, The code hardcodes dur_shape_2 = 50 when summing duration_processed, which can break if the model's "duration" output shape changes; instead query the second dimension from model1_'s output tensor shape for "duration" (similar to how max_seq_len_ is obtained from TensorShape("input_ids")) and use that value for the loop bounds and indexing into duration_processed and pred_dur_original; update references to dur_shape_2, duration_processed, and pred_dur_original accordingly so the loop uses the runtime shape rather than a magic constant.
552-552: Extract magic number to a named constant.The comment
// DOUBLE_INPUT_THRESHOLDindicates this should be a named constant for clarity.♻️ Suggested improvement
Add as a private constant in the Impl class:
class OfflineTtsKokoroModelAxcl::Impl { private: + static constexpr int32_t kDoubleInputThreshold = 32; int32_t max_seq_len_ = 0;And use it:
- if (actual_len <= 32) { // DOUBLE_INPUT_THRESHOLD + if (actual_len <= kDoubleInputThreshold) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` at line 552, Extract the magic number 32 into a named constant by adding a private constant inside the Impl class (e.g., a private static constexpr size_t DOUBLE_INPUT_THRESHOLD = 32 or an appropriate integral type matching actual_len) and replace the literal in the condition `if (actual_len <= 32)` with `if (actual_len <= DOUBLE_INPUT_THRESHOLD)` (remove the trailing comment), ensuring the constant name matches exactly `DOUBLE_INPUT_THRESHOLD` and the type is compatible with `actual_len`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc`:
- Around line 330-343: The code currently hardcodes d_shape_2 = 640 which can be
inconsistent with model metadata; replace that assignment with d_shape_2 =
style_dim_[2] (ensure style_dim_ has at least 3 elements and is the same integer
type as d_shape_2) and update the en vector allocation to use total_frames *
d_shape_2 as before. Verify indices that use d_shape_2 (the nested loops and
accesses to d[i * d_shape_2 + j]) remain correct and that style_dim_ is
initialized by LoadVoices before this code runs.
- Around line 392-393: Replace the hardcoded 23041 with a dynamic computation:
after calling model3_->GetOutputTensorData("x"), compute num_frames =
x_out.size() / (2 * half_n_fft) where half_n_fft is 11 (used by
PostprocessXToAudio), ensure integer division yields a non-negative size_t and
optionally validate that x_out.size() % (2 * half_n_fft) == 0 (or handle
remainder) before passing num_frames to PostprocessXToAudio so the call becomes
PostprocessXToAudio(x_out, num_frames, audio).
In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc`:
- Around line 451-455: The code assumes model1_->GetOutputTensorData("duration")
and ("d") contain at least expected lengths before indexing; add explicit size
checks after retrieving std::vector<float> duration and d to verify
duration.size() >= expected_duration_len (e.g., actual_len * 50 or max_seq_len_
* 640 as used) and d.size() >= expected_d_len before calling ProcessDuration or
iterating over them, and if the checks fail log an error (or return/throw) to
fail cleanly; update callers such as the call to ProcessDuration(duration,
actual_len, speed, pred_dur, total_frames) and any loops that index duration or
d to rely on the validated sizes.
- Around line 567-576: The code currently decrements each token at most once
(for loop over indices) which can leave total_frames too large when abs(diff) >
actual_len; change the logic so you keep decrementing tokens (in order given by
Argsort(pred_dur, actual_len, true)) until decreased reaches abs(diff) or no
more decrements are possible: replace the single-pass for(auto idx : indices)
with a loop like while(decreased < std::abs(diff)) { bool any = false; for(auto
idx : indices) { if(pred_dur[idx] > 1) { pred_dur[idx]--; decreased++; any =
true; if(decreased >= std::abs(diff)) break; } } if(!any) break; } so that
pred_dur entries can be decremented multiple times and total_frames will match
fixed_total_frames before calling SetInputTensorData("pred_aln_trg", ...).
---
Duplicate comments:
In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc`:
- Around line 625-629: Before slicing into spec_part and phase_part, validate
the length of the input vector x: ensure x.size() >=
static_cast<size_t>(half_n_fft) * num_frames * 2 (or the exact expected length
for model3 output) and handle the short-input case (e.g., log/error and return
or throw) so that x.begin() + half_n_fft * num_frames is always a valid
iterator; update the code around the spec_part/phase_part assignments
(referencing variables x, spec_part, phase_part, half_n_fft, num_frames, and the
model3 output) to perform this check and early error handling before calling
assign.
- Around line 247-249: Before using length_scale to compute speed (the snippet
that sets speed = 1. / config_.kokoro.length_scale), validate
config_.kokoro.length_scale > 0 and ensure computed speed > 0; if the value is
non-positive, fail early (return an error/Status or throw) with a clear message
rather than proceeding to ProcessDuration(), so ProcessDuration() never receives
zero/negative speed. Update the block that references
config_.kokoro.length_scale and speed and add the guard+error path (or clamp
only after explicit decision), ensuring callers observe the failure instead of
continuing with invalid durations.
---
Nitpick comments:
In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc`:
- Around line 304-307: The local vector input_lengths returned from
ComputeExternalPreprocessing is unused — remove it from the API and all call
sites: update ComputeExternalPreprocessing to drop the output parameter (remove
the input_lengths parameter from its signature and implementation) and remove
the corresponding argument at each call site (e.g., the call in
InferenceSingleChunk and the other call around lines 576-585). Ensure compile by
adjusting any internal logic that previously wrote into input_lengths and update
any tests or callers that relied on that parameter.
- Around line 74-86: The template function Linspace (template<typename T> static
std::vector<T> Linspace(T a, T b, size_t N)) is dead code in this translation
unit; remove its entire definition to eliminate unused code and improve
maintainability, and ensure there are no remaining references or forward
declarations to Linspace in this file or its headers—after removal,
rebuild/tests to confirm no regressions.
- Around line 413-422: The code hardcodes dur_shape_2 = 50 when summing
duration_processed, which can break if the model's "duration" output shape
changes; instead query the second dimension from model1_'s output tensor shape
for "duration" (similar to how max_seq_len_ is obtained from
TensorShape("input_ids")) and use that value for the loop bounds and indexing
into duration_processed and pred_dur_original; update references to dur_shape_2,
duration_processed, and pred_dur_original accordingly so the loop uses the
runtime shape rather than a magic constant.
- Line 552: Extract the magic number 32 into a named constant by adding a
private constant inside the Impl class (e.g., a private static constexpr size_t
DOUBLE_INPUT_THRESHOLD = 32 or an appropriate integral type matching actual_len)
and replace the literal in the condition `if (actual_len <= 32)` with `if
(actual_len <= DOUBLE_INPUT_THRESHOLD)` (remove the trailing comment), ensuring
the constant name matches exactly `DOUBLE_INPUT_THRESHOLD` and the type is
compatible with `actual_len`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0763a7a7-83ea-4360-86cf-8445357b05d4
📒 Files selected for processing (2)
sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.ccsherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc
| int32_t d_shape_1 = max_seq_len_; | ||
| int32_t d_shape_2 = | ||
| 640; // style_dim_[2] essentially, or based on d_shape_[2] | ||
| std::vector<float> en(total_frames * d_shape_2, 0.0f); | ||
|
|
||
| for (int32_t j = 0; j < d_shape_2; ++j) { | ||
| for (int32_t k = 0; k < total_frames; ++k) { | ||
| float sum = 0.0f; | ||
| for (int32_t i = 0; i < max_seq_len_; ++i) { | ||
| sum += d[i * d_shape_2 + j] * pred_aln_trg[i * total_frames + k]; | ||
| } | ||
| en[j * total_frames + k] = sum; | ||
| } | ||
| } |
There was a problem hiding this comment.
Replace hardcoded d_shape_2 = 640 with style_dim_[2].
The comment indicates this value should be style_dim_[2], which is already read from model metadata in LoadVoices. Using the hardcoded value creates inconsistency and may break compatibility with model variants that have different embedding dimensions.
🔧 Proposed fix
int32_t d_shape_1 = max_seq_len_;
- int32_t d_shape_2 =
- 640; // style_dim_[2] essentially, or based on d_shape_[2]
+ int32_t d_shape_2 = style_dim_[2];
std::vector<float> en(total_frames * d_shape_2, 0.0f);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| int32_t d_shape_1 = max_seq_len_; | |
| int32_t d_shape_2 = | |
| 640; // style_dim_[2] essentially, or based on d_shape_[2] | |
| std::vector<float> en(total_frames * d_shape_2, 0.0f); | |
| for (int32_t j = 0; j < d_shape_2; ++j) { | |
| for (int32_t k = 0; k < total_frames; ++k) { | |
| float sum = 0.0f; | |
| for (int32_t i = 0; i < max_seq_len_; ++i) { | |
| sum += d[i * d_shape_2 + j] * pred_aln_trg[i * total_frames + k]; | |
| } | |
| en[j * total_frames + k] = sum; | |
| } | |
| } | |
| int32_t d_shape_1 = max_seq_len_; | |
| int32_t d_shape_2 = style_dim_[2]; | |
| std::vector<float> en(total_frames * d_shape_2, 0.0f); | |
| for (int32_t j = 0; j < d_shape_2; ++j) { | |
| for (int32_t k = 0; k < total_frames; ++k) { | |
| float sum = 0.0f; | |
| for (int32_t i = 0; i < max_seq_len_; ++i) { | |
| sum += d[i * d_shape_2 + j] * pred_aln_trg[i * total_frames + k]; | |
| } | |
| en[j * total_frames + k] = sum; | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 330 -
343, The code currently hardcodes d_shape_2 = 640 which can be inconsistent with
model metadata; replace that assignment with d_shape_2 = style_dim_[2] (ensure
style_dim_ has at least 3 elements and is the same integer type as d_shape_2)
and update the en vector allocation to use total_frames * d_shape_2 as before.
Verify indices that use d_shape_2 (the nested loops and accesses to d[i *
d_shape_2 + j]) remain correct and that style_dim_ is initialized by LoadVoices
before this code runs.
| std::vector<float> x_out = model3_->GetOutputTensorData("x"); | ||
| PostprocessXToAudio(x_out, 23041, audio); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
fd "offline-tts-kokoro-model-axcl" -t fRepository: k2-fsa/sherpa-onnx
Length of output: 170
🏁 Script executed:
cat -n sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc | head -450 | tail -100Repository: k2-fsa/sherpa-onnx
Length of output: 4657
🏁 Script executed:
rg "PostprocessXToAudio" -A 10 -B 2Repository: k2-fsa/sherpa-onnx
Length of output: 5050
🏁 Script executed:
rg "23041" -nRepository: k2-fsa/sherpa-onnx
Length of output: 273
🏁 Script executed:
rg "ProcessDuration|total_frames" -A 5 -B 2Repository: k2-fsa/sherpa-onnx
Length of output: 21852
🏁 Script executed:
rg "n_fft|hop_length" sherpa-onnx/csrc/axcl/ -A 2 -B 2Repository: k2-fsa/sherpa-onnx
Length of output: 3597
🏁 Script executed:
cat -n sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc | sed -n '517,570p'Repository: k2-fsa/sherpa-onnx
Length of output: 2468
🏁 Script executed:
rg "x_out\.size()" sherpa-onnx/csrc/axcl/ -A 3 -B 3Repository: k2-fsa/sherpa-onnx
Length of output: 44
🏁 Script executed:
cat -n sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc | sed -n '300,410p'Repository: k2-fsa/sherpa-onnx
Length of output: 5317
🏁 Script executed:
cat -n sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc | sed -n '429,465p'Repository: k2-fsa/sherpa-onnx
Length of output: 1603
🏁 Script executed:
rg "max_seq_len_\s*=" sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc -B 2 -A 2Repository: k2-fsa/sherpa-onnx
Length of output: 262
🏁 Script executed:
cat sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.hRepository: k2-fsa/sherpa-onnx
Length of output: 1054
🏁 Script executed:
python3 << 'EOF'
# Verify the calculation
half_n_fft = 11
expected_hardcoded = 23041
expected_size = 2 * half_n_fft * expected_hardcoded
print(f"If num_frames = {expected_hardcoded}:")
print(f"Expected x_out.size() = 2 * {half_n_fft} * {expected_hardcoded} = {expected_size}")
print(f"Suggested formula: num_frames = x_out.size() / {2 * half_n_fft}")
print(f"For x_out.size() = {expected_size}: num_frames = {expected_size} / {2 * half_n_fft} = {expected_size // (2 * half_n_fft)}")
EOFRepository: k2-fsa/sherpa-onnx
Length of output: 242
Replace hardcoded num_frames = 23041 with dynamic calculation from output tensor size.
The value 23041 is hardcoded but should be derived from the actual x_out tensor produced by model3. Since PostprocessXToAudio expects the input to contain 2 * half_n_fft * num_frames elements (spec and phase parts), and half_n_fft = 11, calculate num_frames from the actual output:
std::vector<float> x_out = model3_->GetOutputTensorData("x");
- PostprocessXToAudio(x_out, 23041, audio);
+ int32_t half_n_fft = 11;
+ int32_t num_frames = static_cast<int32_t>(x_out.size()) / (2 * half_n_fft);
+ PostprocessXToAudio(x_out, num_frames, audio);This ensures robustness when total_frames varies based on input duration prediction or if model configuration changes.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| std::vector<float> x_out = model3_->GetOutputTensorData("x"); | |
| PostprocessXToAudio(x_out, 23041, audio); | |
| std::vector<float> x_out = model3_->GetOutputTensorData("x"); | |
| int32_t half_n_fft = 11; | |
| int32_t num_frames = static_cast<int32_t>(x_out.size()) / (2 * half_n_fft); | |
| PostprocessXToAudio(x_out, num_frames, audio); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 392 -
393, Replace the hardcoded 23041 with a dynamic computation: after calling
model3_->GetOutputTensorData("x"), compute num_frames = x_out.size() / (2 *
half_n_fft) where half_n_fft is 11 (used by PostprocessXToAudio), ensure integer
division yields a non-negative size_t and optionally validate that x_out.size()
% (2 * half_n_fft) == 0 (or handle remainder) before passing num_frames to
PostprocessXToAudio so the call becomes PostprocessXToAudio(x_out, num_frames,
audio).
| std::vector<float> duration = model1_->GetOutputTensorData("duration"); | ||
| std::vector<float> d = model1_->GetOutputTensorData("d"); | ||
|
|
||
| std::vector<int32_t> pred_dur; | ||
| ProcessDuration(duration, actual_len, speed, pred_dur, total_frames); |
There was a problem hiding this comment.
Validate duration and d sizes before indexing them.
Line 467 and Line 550 assume part1 always returns at least max_seq_len_ * 640 and actual_len * 50 floats. If the asset is mismatched or truncated, these loops read past the returned buffers instead of failing cleanly.
🛡️ Suggested guard
+ constexpr int32_t kDurationBins = 50;
+ constexpr int32_t kEncoderDim = 640;
+
std::vector<float> duration = model1_->GetOutputTensorData("duration");
std::vector<float> d = model1_->GetOutputTensorData("d");
+
+ if (duration.size() <
+ static_cast<size_t>(actual_len) * kDurationBins ||
+ d.size() < static_cast<size_t>(max_seq_len_) * kEncoderDim) {
+ SHERPA_ONNX_LOGE(
+ "Unexpected model1 output size. duration=%zu, d=%zu, actual_len=%d, "
+ "max_seq_len=%d",
+ duration.size(), d.size(), actual_len, max_seq_len_);
+ return false;
+ }Also applies to: 463-470, 546-550
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 451 -
455, The code assumes model1_->GetOutputTensorData("duration") and ("d") contain
at least expected lengths before indexing; add explicit size checks after
retrieving std::vector<float> duration and d to verify duration.size() >=
expected_duration_len (e.g., actual_len * 50 or max_seq_len_ * 640 as used) and
d.size() >= expected_d_len before calling ProcessDuration or iterating over
them, and if the checks fail log an error (or return/throw) to fail cleanly;
update callers such as the call to ProcessDuration(duration, actual_len, speed,
pred_dur, total_frames) and any loops that index duration or d to rely on the
validated sizes.
| if (diff < 0) { | ||
| auto indices = Argsort(pred_dur, actual_len, true); | ||
| int32_t decreased = 0; | ||
| for (auto idx : indices) { | ||
| if (pred_dur[idx] > 1 && decreased < std::abs(diff)) { | ||
| pred_dur[idx]--; | ||
| decreased++; | ||
| } | ||
| if (decreased >= std::abs(diff)) break; | ||
| } |
There was a problem hiding this comment.
Reduce overshoot by the full abs(diff), not one frame per token.
This loop decrements each token at most once. With slower speech (speed < 1), abs(diff) can easily exceed actual_len, leaving total_frames above fixed_total_frames; then pred_aln_trg no longer matches the fixed AXERA input shape and SetInputTensorData("pred_aln_trg", ...) fails.
🔧 Suggested fix
if (diff < 0) {
auto indices = Argsort(pred_dur, actual_len, true);
- int32_t decreased = 0;
+ int32_t remaining = std::abs(diff);
for (auto idx : indices) {
- if (pred_dur[idx] > 1 && decreased < std::abs(diff)) {
- pred_dur[idx]--;
- decreased++;
- }
- if (decreased >= std::abs(diff)) break;
+ if (remaining == 0) break;
+ if (pred_dur[idx] <= 1) continue;
+
+ int32_t delta = std::min(pred_dur[idx] - 1, remaining);
+ pred_dur[idx] -= delta;
+ remaining -= delta;
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (diff < 0) { | |
| auto indices = Argsort(pred_dur, actual_len, true); | |
| int32_t decreased = 0; | |
| for (auto idx : indices) { | |
| if (pred_dur[idx] > 1 && decreased < std::abs(diff)) { | |
| pred_dur[idx]--; | |
| decreased++; | |
| } | |
| if (decreased >= std::abs(diff)) break; | |
| } | |
| if (diff < 0) { | |
| auto indices = Argsort(pred_dur, actual_len, true); | |
| int32_t remaining = std::abs(diff); | |
| for (auto idx : indices) { | |
| if (remaining == 0) break; | |
| if (pred_dur[idx] <= 1) continue; | |
| int32_t delta = std::min(pred_dur[idx] - 1, remaining); | |
| pred_dur[idx] -= delta; | |
| remaining -= delta; | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 567 -
576, The code currently decrements each token at most once (for loop over
indices) which can leave total_frames too large when abs(diff) > actual_len;
change the logic so you keep decrementing tokens (in order given by
Argsort(pred_dur, actual_len, true)) until decreased reaches abs(diff) or no
more decrements are possible: replace the single-pass for(auto idx : indices)
with a loop like while(decreased < std::abs(diff)) { bool any = false; for(auto
idx : indices) { if(pred_dur[idx] > 1) { pred_dur[idx]--; decreased++; any =
true; if(decreased >= std::abs(diff)) break; } } if(!any) break; } so that
pred_dur entries can be decremented multiple times and total_frames will match
fixed_total_frames before calling SetInputTensorData("pred_aln_trg", ...).
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc`:
- Around line 371-399: After reading meta_data_.num_speakers (via
SHERPA_ONNX_READ_META_DATA) in LoadVoices(), validate that it is > 0; if not,
log an error with SHERPA_ONNX_LOGE (include the bad value) and abort
initialization with SHERPA_ONNX_EXIT(-1). This prevents later out-of-bounds
access in LoadVoiceEmbedding() which assumes meta_data_.num_speakers >= 1;
update the existing validation block (near style_dim_ checks and
expected_num_floats calculation) to perform this non-positive check before
computing expected_num_floats or accessing styles_.
- Around line 400-402: The code constructs styles_ by reinterpreting voices_data
bytes as const float*, which can violate alignment and is undefined on some
architectures; instead, keep voices_data as the raw bytes from ReadFile(),
resize styles_ to actual_num_floats, and use memcpy (or std::copy with
reinterpret_cast<const char*>) to copy actual_num_floats * sizeof(float) bytes
from voices_data.data() into styles_. Update the block that currently uses
reinterpret_cast<const float *>(voices_data) and the styles_ initializer so it
resizes styles_ and performs a safe byte-wise copy from voices_data into
styles_.
In `@sherpa-onnx/csrc/offline-tts-kokoro-model-config.cc`:
- Around line 48-56: The current Validate() in
offline-tts-kokoro-model-config.cc unconditionally splits config.kokoro.model
with SplitStringToVector and checks each part, which allows multi-file strings
to pass but later breaks when offline-tts-kokoro-model.cc calls
ReadFile(config.kokoro.model). Change the validation to be provider-aware:
inside Validate(), check the provider type (the same provider enum/field used by
offline-tts-kokoro-model.cc) and if the provider is AXERA/AXCL allow splitting
and validate each file with FileExists; otherwise require a single file (no
commas), verify FileExists on the full config.kokoro.model string, and return
false with a clear error if multiple files are provided for non-multi-file
providers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d4bdc398-b355-4041-934f-29f20ada8698
📒 Files selected for processing (3)
sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.ccsherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.ccsherpa-onnx/csrc/offline-tts-kokoro-model-config.cc
🚧 Files skipped from review as they are similar to previous changes (1)
- sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc
| SHERPA_ONNX_READ_META_DATA(meta_data_.num_speakers, "n_speakers"); | ||
| SHERPA_ONNX_READ_META_DATA(meta_data_.has_espeak, "has_espeak"); | ||
| SHERPA_ONNX_READ_META_DATA_STR_WITH_DEFAULT(meta_data_.voice, "voice", | ||
| "en-us"); | ||
| SHERPA_ONNX_READ_META_DATA_VEC(style_dim_, "style_dim"); | ||
|
|
||
| if (style_dim_.size() != 3) { | ||
| SHERPA_ONNX_LOGE("style_dim should be 3-d, given: %d", | ||
| static_cast<int32_t>(style_dim_.size())); | ||
| SHERPA_ONNX_EXIT(-1); | ||
| } | ||
|
|
||
| if (style_dim_[1] != 1) { | ||
| SHERPA_ONNX_LOGE("style_dim[1] should be 1, given: %d", style_dim_[1]); | ||
| SHERPA_ONNX_EXIT(-1); | ||
| } | ||
|
|
||
| meta_data_.max_token_len = style_dim_[0]; | ||
|
|
||
| int32_t actual_num_floats = voices_data_length / sizeof(float); | ||
| int32_t expected_num_floats = | ||
| style_dim_[0] * style_dim_[2] * meta_data_.num_speakers; | ||
|
|
||
| if (actual_num_floats != expected_num_floats) { | ||
| SHERPA_ONNX_LOGE( | ||
| "Corrupted voices file. Expected #floats: %d, actual: %d", | ||
| expected_num_floats, actual_num_floats); | ||
| SHERPA_ONNX_EXIT(-1); | ||
| } |
There was a problem hiding this comment.
Reject non-positive n_speakers during initialization.
LoadVoices() currently accepts n_speakers == 0 when the voices blob is empty, but LoadVoiceEmbedding() later copies style_dim_[2] floats from styles_.data() anyway. That turns malformed metadata into an out-of-bounds read on the first Run() instead of a clean init failure.
🛡️ Proposed fix
SHERPA_ONNX_READ_META_DATA(meta_data_.num_speakers, "n_speakers");
SHERPA_ONNX_READ_META_DATA(meta_data_.has_espeak, "has_espeak");
SHERPA_ONNX_READ_META_DATA_STR_WITH_DEFAULT(meta_data_.voice, "voice",
"en-us");
SHERPA_ONNX_READ_META_DATA_VEC(style_dim_, "style_dim");
+
+ if (meta_data_.num_speakers <= 0) {
+ SHERPA_ONNX_LOGE("n_speakers should be > 0, given: %d",
+ meta_data_.num_speakers);
+ SHERPA_ONNX_EXIT(-1);
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 371 -
399, After reading meta_data_.num_speakers (via SHERPA_ONNX_READ_META_DATA) in
LoadVoices(), validate that it is > 0; if not, log an error with
SHERPA_ONNX_LOGE (include the bad value) and abort initialization with
SHERPA_ONNX_EXIT(-1). This prevents later out-of-bounds access in
LoadVoiceEmbedding() which assumes meta_data_.num_speakers >= 1; update the
existing validation block (near style_dim_ checks and expected_num_floats
calculation) to perform this non-positive check before computing
expected_num_floats or accessing styles_.
| styles_ = std::vector<float>( | ||
| reinterpret_cast<const float *>(voices_data), | ||
| reinterpret_cast<const float *>(voices_data) + actual_num_floats); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
fd 'file-utils\.(h|cc)$' -t f | while read -r f; do
echo "=== $f ==="
rg -n -C3 '\bReadFile\b' "$f"
doneRepository: k2-fsa/sherpa-onnx
Length of output: 1304
🏁 Script executed:
cd sherpa-onnx/csrc/axera && head -n 410 offline-tts-kokoro-model-axera.cc | tail -n 20 && echo "---" && head -n 420 offline-tts-kokoro-model-axera.cc | tail -n 30Repository: k2-fsa/sherpa-onnx
Length of output: 1862
🏁 Script executed:
cd sherpa-onnx/csrc/axera && rg -n -B10 'styles_.*=.*std::vector<float>' offline-tts-kokoro-model-axera.cc | head -n 50Repository: k2-fsa/sherpa-onnx
Length of output: 537
🏁 Script executed:
cd sherpa-onnx/csrc/axera && rg -n -B30 'int32_t actual_num_floats = voices_data_length' offline-tts-kokoro-model-axera.cc | head -n 60Repository: k2-fsa/sherpa-onnx
Length of output: 1471
Copy the voices blob with memcpy instead of iterating a cast float*.
This constructs styles_ from a const float* obtained by reinterpreting raw file bytes from ReadFile().data(). The underlying storage is a std::vector<char>, which has no alignment guarantee for float. Reinterpreting this as const float* violates alignment requirements and is undefined behavior on ARM-class targets. Resize styles_ and copy bytes into it instead.
🔧 Proposed fix
- styles_ = std::vector<float>(
- reinterpret_cast<const float *>(voices_data),
- reinterpret_cast<const float *>(voices_data) + actual_num_floats);
+ styles_.resize(actual_num_floats);
+ std::memcpy(styles_.data(), voices_data,
+ static_cast<size_t>(actual_num_floats) * sizeof(float));🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 400 -
402, The code constructs styles_ by reinterpreting voices_data bytes as const
float*, which can violate alignment and is undefined on some architectures;
instead, keep voices_data as the raw bytes from ReadFile(), resize styles_ to
actual_num_floats, and use memcpy (or std::copy with reinterpret_cast<const
char*>) to copy actual_num_floats * sizeof(float) bytes from voices_data.data()
into styles_. Update the block that currently uses reinterpret_cast<const float
*>(voices_data) and the styles_ initializer so it resizes styles_ and performs a
safe byte-wise copy from voices_data into styles_.
| if (!model.empty()) { | ||
| std::vector<std::string> files; | ||
| SplitStringToVector(model, ",", false, &files); | ||
| for (const auto &f : files) { | ||
| if (!FileExists(f)) { | ||
| SHERPA_ONNX_LOGE("--kokoro-model: '%s' does not exist", model.c_str()); | ||
| return false; | ||
| } | ||
| } |
There was a problem hiding this comment.
This validation is too permissive for the non-AXERA/AXCL Kokoro path.
Splitting --kokoro-model unconditionally lets a.onnx,b.onnx pass Validate(), but the standard backend in sherpa-onnx/csrc/offline-tts-kokoro-model.cc:32-52 still does ReadFile(config.kokoro.model) on the unsplit string. That turns a config error into a later runtime failure. Please keep multi-file validation provider-specific, or validate the expected file arity at a higher layer where the provider is known.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sherpa-onnx/csrc/offline-tts-kokoro-model-config.cc` around lines 48 - 56,
The current Validate() in offline-tts-kokoro-model-config.cc unconditionally
splits config.kokoro.model with SplitStringToVector and checks each part, which
allows multi-file strings to pass but later breaks when
offline-tts-kokoro-model.cc calls ReadFile(config.kokoro.model). Change the
validation to be provider-aware: inside Validate(), check the provider type (the
same provider enum/field used by offline-tts-kokoro-model.cc) and if the
provider is AXERA/AXCL allow splitting and validate each file with FileExists;
otherwise require a single file (no commas), verify FileExists on the full
config.kokoro.model string, and return false with a clear error if multiple
files are provided for non-multi-file providers.
Summary by CodeRabbit
New Features
Bug Fixes
Chores