
kokoro TTS Supported axera backend #3281

Open
Abandon-ht wants to merge 5 commits into k2-fsa:master from Abandon-ht:kokoro_axera

Conversation

@Abandon-ht
Contributor

@Abandon-ht Abandon-ht commented Mar 11, 2026

Summary by CodeRabbit

  • New Features

    • Kokoro offline Text-to-Speech (TTS) added with AXERA and AXCL acceleration, multi‑speaker support, speed control, and batched incremental generation with callback support.
    • --kokoro-model accepts comma-separated lists to load multiple model parts.
  • Bug Fixes

    • Improved input handling to accept raw byte buffers in addition to existing numeric tensor formats.
  • Chores

    • Clear guidance emitted when AXERA/AXCL support is not available in the build.

@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Mar 11, 2026
@coderabbitai

coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

Walkthrough

Adds AXERA and AXCL Kokoro offline TTS backends: new model implementations and PIMPL APIs, AXCL model API overload, build/CMake entries, frontend/normalizer integration, and factory wiring to instantiate the new backends when configured.

Changes

  • Build System (sherpa-onnx/csrc/CMakeLists.txt): Conditionally include AXERA/AXCL Kokoro source files when SHERPA_ONNX_ENABLE_TTS is enabled for each provider.
  • AXCL Model API (sherpa-onnx/csrc/axcl/axcl-model.h, sherpa-onnx/csrc/axcl/axcl-model.cc): Add a SetInputTensorData(const uint8_t *...) overload to accept raw byte buffers alongside the existing float/int overloads.
  • AXCL TTS Impl & Model (sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h, sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.h, sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc): Introduce the OfflineTtsKokoroAxclImpl template and OfflineTtsKokoroModelAxcl, implementing multi-stage inference, text normalization, frontend tokenization, batching, voice embeddings, and audio postprocessing.
  • AXERA TTS Impl & Model (sherpa-onnx/csrc/axera/offline-tts-kokoro-axera-impl.h, sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.h, sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc): Introduce the OfflineTtsKokoroAxeraImpl template, an AxeraModel wrapper, and OfflineTtsKokoroModelAxera with multi-model orchestration, a Run pipeline, and STFT-based postprocessing.
  • Factory Integration (sherpa-onnx/csrc/offline-tts-impl.cc): Add provider-specific branches and conditional includes to create AXERA/AXCL Kokoro implementations; gate behavior on feature macros and kokoro.model presence.
  • Kokoro Model Config Validation (sherpa-onnx/csrc/offline-tts-kokoro-model-config.cc): Support comma-separated --kokoro-model lists and validate the existence of each listed path.
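As a rough illustration of the comma-separated --kokoro-model handling described above, the splitting could look like the following sketch (SplitModelList is a hypothetical stand-in for the project's own string utilities):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a comma-separated --kokoro-model value into
// individual model paths, skipping empty entries.
std::vector<std::string> SplitModelList(const std::string &s) {
  std::vector<std::string> parts;
  std::stringstream ss(s);
  std::string item;
  while (std::getline(ss, item, ',')) {
    if (!item.empty()) parts.push_back(item);
  }
  return parts;
}
```

In the real Validate(), each returned path would additionally be checked for existence, with an error reported for any missing part.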

Sequence Diagram(s)

sequenceDiagram
    rect rgba(200,220,255,0.5)
    participant App as Application
    participant Factory as OfflineTtsImpl::Create
    participant TtsImpl as OfflineTtsKokoroAxcl/AxeraImpl
    participant Normalizer as TextNormalizer Chain
    end
    rect rgba(200,255,200,0.5)
    participant Frontend as OfflineTtsFrontend
    participant Model as OfflineTtsKokoroModel
    participant Runtime as AXCL/AXERA/ORT Runtime
    end

    App->>Factory: Create(config)
    Factory->>TtsImpl: instantiate KokoroAxcl/AxeraImpl
    App->>TtsImpl: Generate(text, sid, speed, callback)
    TtsImpl->>Normalizer: apply normalization chain
    Normalizer-->>TtsImpl: normalized_text
    TtsImpl->>Frontend: convert text -> token_ids
    Frontend-->>TtsImpl: token_ids
    loop per batch
      TtsImpl->>Model: Process(tokens, sid, speed)
      Model->>Runtime: SetInputTensorData (uint8_t/float/int)
      Model->>Runtime: Run inference (multi-stage)
      Runtime-->>Model: output tensors
      Model->>TtsImpl: audio_chunk
      TtsImpl->>App: callback(audio_chunk)
    end
    TtsImpl-->>App: GeneratedAudio (concatenated)
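The per-batch loop in the diagram above can be sketched roughly as follows; the callback convention (non-zero return to continue, 0 to stop) and all names here are illustrative, not the PR's actual API:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

using GeneratedCallback =
    std::function<int32_t(const float *samples, int32_t n, float progress)>;
using RunBatchFn = std::function<std::vector<float>(
    const std::vector<int64_t> &tokens, int64_t sid, float speed)>;

// Run each token batch through the model, append the audio chunk, and let
// the callback observe progress (and optionally stop generation early).
std::vector<float> GenerateBatched(
    const std::vector<std::vector<int64_t>> &token_batches, int64_t sid,
    float speed, const GeneratedCallback &callback,
    const RunBatchFn &run_batch) {
  std::vector<float> audio;
  int32_t n = static_cast<int32_t>(token_batches.size());
  for (int32_t i = 0; i < n; ++i) {
    std::vector<float> chunk = run_batch(token_batches[i], sid, speed);
    audio.insert(audio.end(), chunk.begin(), chunk.end());
    if (callback) {
      float progress = static_cast<float>(i + 1) / n;
      if (callback(chunk.data(), static_cast<int32_t>(chunk.size()),
                   progress) == 0) {
        break;  // caller asked to stop
      }
    }
  }
  return audio;  // concatenated GeneratedAudio samples
}
```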
sequenceDiagram
    participant Model as OfflineTtsKokoroModel
    participant Ax1 as Axera/AXCL Model1
    participant Ax2 as Axera/AXCL Model2
    participant Ax3 as Axera/AXCL Model3
    participant Ort as ONNX Runtime (Model4)
    participant Audio as ISTFT Synth

    Model->>Ax1: Run(text inputs)
    Ax1-->>Model: duration, alignment
    Model->>Ax2: Run(features, alignment targets)
    Ax2-->>Model: F0_pred, N_pred, asr
    Model->>Ort: Run(F0_pred / features)
    Ort-->>Model: har
    Model->>Ax3: Run(asr, F0_pred, har, ...)
    Ax3-->>Model: spectral frames x
    Model->>Audio: PostprocessXToAudio(x)
    Audio-->>Model: time_domain_audio
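The PostprocessXToAudio step in the second diagram boils down to an inverse STFT followed by overlap-add. A minimal sketch of just the overlap-add part, assuming the frames have already been inverse-transformed and windowed (the real code also performs the transform and window normalization):

```cpp
#include <cstddef>
#include <vector>

// Overlap-add time-domain frames spaced `hop` samples apart into one signal.
std::vector<float> OverlapAdd(const std::vector<std::vector<float>> &frames,
                              int hop) {
  if (frames.empty()) return {};
  int frame_len = static_cast<int>(frames[0].size());
  int total = hop * static_cast<int>(frames.size() - 1) + frame_len;
  std::vector<float> audio(total, 0.0f);
  for (size_t i = 0; i < frames.size(); ++i) {
    int offset = static_cast<int>(i) * hop;
    for (int j = 0; j < frame_len; ++j) {
      audio[offset + j] += frames[i][j];  // overlapping regions sum
    }
  }
  return audio;
}
```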

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Suggested reviewers

  • csukuangfj

Poem

🐇 I hopped through CMake and model trees,
I stitched tokens into audio breeze,
AXERA and AXCL hum in tune,
Kokoro sings beneath the moon,
Bytes turned into a bunny tune 🎶

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: the PR title 'kokoro TTS Supported axera backend' accurately describes the main change: adding axera backend support for the Kokoro TTS implementation across multiple new files and configurations.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the hardware compatibility of the sherpa-onnx project by integrating support for Kokoro Text-to-Speech models on Axera and AXCL platforms. This allows users to leverage specialized hardware acceleration for TTS inference, potentially leading to improved performance and efficiency for supported devices.

Highlights

  • Axera and AXCL TTS Support: Introduced support for Kokoro Text-to-Speech (TTS) models on both Axera and AXCL hardware platforms, enabling accelerated inference for these models.
  • Conditional Compilation: Updated the CMake build system to conditionally include the new Axera and AXCL specific TTS source files only when the respective SHERPA_ONNX_ENABLE_AXERA or SHERPA_ONNX_ENABLE_AXCL flags are enabled.
  • AXCL Model Enhancements: Extended the AxclModel class to handle uint8_t input tensor data, which is necessary for certain model inputs in the new Kokoro TTS implementation.
  • New TTS Implementations: Added dedicated C++ implementations (OfflineTtsKokoroAxeraImpl, OfflineTtsKokoroModelAxera, OfflineTtsKokoroAxclImpl, OfflineTtsKokoroModelAxcl) to encapsulate the platform-specific logic for running Kokoro TTS models on Axera and AXCL.
  • Factory Method Integration: Modified the OfflineTtsImpl::Create factory method to dynamically instantiate the appropriate Axera or AXCL Kokoro TTS implementation based on the config.model.provider setting.
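The uint8_t overload mentioned above can be reduced to a small sketch; FakeTensor stands in for the real AXCL device buffer, and only the size-check-then-copy idea is shown:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for an AXCL input buffer of a fixed byte size.
struct FakeTensor {
  std::vector<uint8_t> storage;
};

// Mirrors the existing float/int overloads: validate the byte count against
// the tensor's expected size, then copy the raw bytes.
bool SetInputTensorData(FakeTensor *t, const uint8_t *p, int32_t n) {
  if (t == nullptr || p == nullptr) return false;
  if (static_cast<size_t>(n) != t->storage.size()) return false;  // mismatch
  std::memcpy(t->storage.data(), p, n);
  return true;
}
```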
Changelog
  • sherpa-onnx/csrc/CMakeLists.txt
    • Updated build configuration to include Kokoro TTS source files for Axera and AXCL.
  • sherpa-onnx/csrc/axcl/axcl-model.cc
    • Added an overload for SetInputTensorData to accept uint8_t pointers.
  • sherpa-onnx/csrc/axcl/axcl-model.h
    • Declared an overload for SetInputTensorData to accept uint8_t pointers.
  • sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h
    • Added a new header defining the OfflineTtsKokoroAxclImpl class for AXCL Kokoro TTS.
  • sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc
    • Implemented the OfflineTtsKokoroModelAxcl class, providing AXCL-specific logic for Kokoro TTS model execution.
  • sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.h
    • Added a new header defining the OfflineTtsKokoroModelAxcl class.
  • sherpa-onnx/csrc/axera/offline-tts-kokoro-axera-impl.h
    • Added a new header defining the OfflineTtsKokoroAxeraImpl class for Axera Kokoro TTS.
  • sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc
    • Implemented the OfflineTtsKokoroModelAxera class, providing Axera-specific logic for Kokoro TTS model execution.
  • sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.h
    • Added a new header defining the OfflineTtsKokoroModelAxera class.
  • sherpa-onnx/csrc/offline-tts-impl.cc
    • Modified the Create factory method to support instantiation of Axera and AXCL Kokoro TTS implementations.
Activity
  • Abandon-ht created this pull request.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review — /gemini review: Performs a code review for the current pull request in its current state.
  • Pull Request Summary — /gemini summary: Provides a summary of the current pull request in its current state.
  • Comment — @gemini-code-assist: Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help — /gemini help: Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for axera and axcl backends for Kokoro TTS models. However, several security issues related to input validation and memory safety were identified. Specifically, the handling of the speaker ID (sid) parameter is inconsistent and lacks proper bounds checking, which could lead to out-of-bounds memory access (buffer over-reads or under-reads). This is particularly evident for single-speaker models, where sid is logged as ignored but is still used. Additionally, there is significant code duplication between the axera and axcl implementations, particularly in sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc and sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc and their corresponding header files. Further improvements are needed regarding magic numbers, thread safety, and overall code structure.

Comment on lines +212 to +228
  std::vector<float> LoadVoiceEmbedding(int32_t sid, int32_t phoneme_len) {
    int32_t style_dim0 = 510;  // max_token_len
    int32_t style_dim1 = 256;
    phoneme_len = std::max(phoneme_len, 0);

    std::vector<float> ref_s(style_dim1);
    if (phoneme_len < style_dim0) {
      const float *p = styles_.data() + sid * style_dim0 * style_dim1 +
                       phoneme_len * style_dim1;
      std::copy(p, p + style_dim1, ref_s.begin());
    } else {
      int32_t idx = style_dim0 / 2;
      const float *p =
          styles_.data() + sid * style_dim0 * style_dim1 + idx * style_dim1;
      std::copy(p, p + style_dim1, ref_s.begin());
    }
    return ref_s;
  }


security-high high

The LoadVoiceEmbedding function uses the sid parameter to calculate an offset into the styles_ buffer without any bounds checking. If sid is negative or large enough to cause an overflow or exceed the buffer size, it results in a buffer under-read or over-read. This is particularly dangerous if num_speakers is 0, as styles_ might be empty or smaller than expected.

Recommendation: Implement strict bounds checking for sid against meta_data_.num_speakers before using it to calculate memory offsets.

  std::vector<float> LoadVoiceEmbedding(int32_t sid, int32_t phoneme_len) {
    int32_t style_dim0 = 510; // max_token_len
    int32_t style_dim1 = 256;
    phoneme_len = std::max(phoneme_len, 0);

    sid = std::max(sid, 0);
    if (meta_data_.num_speakers > 0) {
      sid = std::min(sid, meta_data_.num_speakers - 1);
    } else {
      sid = 0;
    }

    std::vector<float> ref_s(style_dim1);
    if (styles_.empty()) {
        return ref_s;
    }

    if (phoneme_len < style_dim0) {
      const float *p = styles_.data() + static_cast<size_t>(sid) * style_dim0 * style_dim1 +
                       phoneme_len * style_dim1;
      std::copy(p, p + style_dim1, ref_s.begin());
    } else {
      int32_t idx = style_dim0 / 2;
      const float *p =
          styles_.data() + static_cast<size_t>(sid) * style_dim0 * style_dim1 + idx * style_dim1;
      std::copy(p, p + style_dim1, ref_s.begin());
    }
    return ref_s;
  }

Comment on lines +345 to +367
  std::vector<float> LoadVoiceEmbedding(int32_t sid, int32_t phoneme_len) {
    int32_t style_dim0 = 510;
    int32_t style_dim1 = 256;
    phoneme_len = std::max(phoneme_len, 0);

    sid = std::max(sid, 0);
    if (meta_data_.num_speakers > 0) {
      sid = std::min(sid, meta_data_.num_speakers - 1);
    }

    std::vector<float> ref_s(style_dim1);
    if (phoneme_len < style_dim0) {
      const float *p = styles_.data() + sid * style_dim0 * style_dim1 +
                       phoneme_len * style_dim1;
      std::copy(p, p + style_dim1, ref_s.begin());
    } else {
      int32_t idx = style_dim0 / 2;
      const float *p =
          styles_.data() + sid * style_dim0 * style_dim1 + idx * style_dim1;
      std::copy(p, p + style_dim1, ref_s.begin());
    }
    return ref_s;
  }


security-high high

In LoadVoiceEmbedding, if meta_data_.num_speakers is 0, the sid parameter is not clamped from above. This can lead to a buffer over-read when accessing styles_ if a non-zero sid is provided.

Recommendation: Ensure sid is set to 0 if num_speakers is 0.

  std::vector<float> LoadVoiceEmbedding(int32_t sid, int32_t phoneme_len) {
    int32_t style_dim0 = 510;
    int32_t style_dim1 = 256;
    phoneme_len = std::max(phoneme_len, 0);

    sid = std::max(sid, 0);
    if (meta_data_.num_speakers > 0) {
      sid = std::min(sid, meta_data_.num_speakers - 1);
    } else {
      sid = 0;
    }

    std::vector<float> ref_s(style_dim1);
    if (styles_.empty()) {
        return ref_s;
    }

    if (phoneme_len < style_dim0) {
      const float *p = styles_.data() + static_cast<size_t>(sid) * style_dim0 * style_dim1 +
                       phoneme_len * style_dim1;
      std::copy(p, p + style_dim1, ref_s.begin());
    } else {
      int32_t idx = style_dim0 / 2;
      const float *p =
          styles_.data() + static_cast<size_t>(sid) * style_dim0 * style_dim1 + idx * style_dim1;
      std::copy(p, p + style_dim1, ref_s.begin());
    }
    return ref_s;
  }

return meta_data_;
}

std::vector<float> Run(std::vector<int64_t> input_ids, int64_t sid, float speed) {


high

The Run method is not thread-safe. The corresponding implementation for the axera backend (sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc) uses a std::mutex to protect its Run method. To ensure thread safety and consistency across backends, you should consider adding a mutex to this class and locking it within the Run method.
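The locking pattern the reviewer describes would look roughly like this (class and member names are illustrative, not the PR's actual code):

```cpp
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

class KokoroModelSketch {
 public:
  std::vector<float> Run(std::vector<int64_t> input_ids, int64_t sid,
                         float speed) {
    // The engine's IO buffers are shared state, so concurrent Run calls
    // would race without serialization.
    std::lock_guard<std::mutex> lock(mutex_);
    return RunImpl(std::move(input_ids), sid, speed);
  }

 private:
  std::vector<float> RunImpl(std::vector<int64_t> input_ids, int64_t /*sid*/,
                             float /*speed*/) {
    // Placeholder for the multi-stage inference pipeline.
    return std::vector<float>(input_ids.size(), 0.0f);
  }

  std::mutex mutex_;
};
```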

}

std::vector<float> x_out = model3_->GetOutputTensorData("x");
PostprocessXToAudio(x_out, 23041, audio);


high

The magic number 23041 is hardcoded here. Throughout this file, there are several other magic numbers (e.g., 32 for DOUBLE_INPUT_THRESHOLD, 50 for dur_shape_2, 640 for d_shape_2, 20 and 5 for STFT parameters). These should be defined as named constants with descriptive names to improve code readability and maintainability. For instance: static constexpr int32_t kNumFrames = 23041;.

}

std::vector<float> x_out = model3_->GetOutputTensorData("x");
PostprocessXToAudio(x_out, 23041, audio);


high

The magic number 23041 is hardcoded here. Throughout this file, there are several other magic numbers (e.g., 32 for DOUBLE_INPUT_THRESHOLD, 50 for dur_shape_2, 640 for d_shape_2, 20 and 5 for STFT parameters). These should be defined as named constants with descriptive names to improve code readability and maintainability. For instance: static constexpr int32_t kNumFrames = 23041;.

Comment on lines +163 to +175
  if (num_speakers == 0 && sid != 0) {
#if __OHOS__
    SHERPA_ONNX_LOGE(
        "This is a single-speaker model and supports only sid 0. Given sid: "
        "%{public}d. sid is ignored",
        static_cast<int32_t>(sid));
#else
    SHERPA_ONNX_LOGE(
        "This is a single-speaker model and supports only sid 0. Given sid: "
        "%d. sid is ignored",
        static_cast<int32_t>(sid));
#endif
  }


security-medium medium

The Generate function identifies that the model is single-speaker when num_speakers == 0, but it fails to reset sid to 0. This allows an invalid sid to be passed to the underlying model implementation, which can lead to out-of-bounds memory access.

Recommendation: Reset sid to 0 when num_speakers == 0.

Suggested change

   if (num_speakers == 0 && sid != 0) {
 #if __OHOS__
     SHERPA_ONNX_LOGE(
         "This is a single-speaker model and supports only sid 0. Given sid: "
         "%{public}d. sid is ignored",
         static_cast<int32_t>(sid));
 #else
     SHERPA_ONNX_LOGE(
         "This is a single-speaker model and supports only sid 0. Given sid: "
         "%d. sid is ignored",
         static_cast<int32_t>(sid));
 #endif
+    sid = 0;
   }

Comment on lines +163 to +175
if (num_speakers == 0 && sid != 0) {
#if __OHOS__
SHERPA_ONNX_LOGE(
"This is a single-speaker model and supports only sid 0. Given sid: "
"%{public}d. sid is ignored",
static_cast<int32_t>(sid));
#else
SHERPA_ONNX_LOGE(
"This is a single-speaker model and supports only sid 0. Given sid: "
"%d. sid is ignored",
static_cast<int32_t>(sid));
#endif
}


security-medium medium

Similar to the AXCL implementation, the Generate function logs that sid is ignored when num_speakers == 0 but fails to reset it to 0, leading to potential out-of-bounds access in the model implementation.

Recommendation: Reset sid to 0 when num_speakers == 0.

Suggested change

   if (num_speakers == 0 && sid != 0) {
 #if __OHOS__
     SHERPA_ONNX_LOGE(
         "This is a single-speaker model and supports only sid 0. Given sid: "
         "%{public}d. sid is ignored",
         static_cast<int32_t>(sid));
 #else
     SHERPA_ONNX_LOGE(
         "This is a single-speaker model and supports only sid 0. Given sid: "
         "%d. sid is ignored",
         static_cast<int32_t>(sid));
 #endif
+    sid = 0;
   }

Comment on lines +73 to +85
  template <typename T>
  static std::vector<T> Linspace(T a, T b, size_t N) {
    if (N == 0) return {};
    if (N == 1) return {a};
    T h = (b - a) / static_cast<T>(N - 1);
    std::vector<T> xs(N);
    T val = a;
    for (size_t i = 0; i < N; ++i) {
      xs[i] = val;
      val += h;
    }
    return xs;
  }


medium

The function Linspace is defined but not used in this file. It should be removed to keep the codebase clean.

};

} // namespace sherpa_onnx
#endif  // SHERPA_ONNX_CSRC_AXERA_OFFLINE_TTS_KOKORO_AXERA_IMPL_H_
(No newline at end of file)


medium

This file is missing a newline character at the end. It's a common convention to end files with a newline.

Comment on lines +78 to +201
class AxeraModel {
 public:
  explicit AxeraModel(const std::string &filename) {
    auto buf = ReadFile(filename);
    Init(buf.data(), buf.size());
  }

  AxeraModel(const void *cpu_buf, size_t buf_len_in_bytes) {
    Init(cpu_buf, buf_len_in_bytes);
  }

  ~AxeraModel() {
    if (io_data_.pInputs || io_data_.pOutputs) {
      FreeIO(&io_data_);
    }

    if (handle_) {
      AX_ENGINE_DestroyHandle(handle_);
    }
  }

  std::vector<int32_t> TensorShape(const std::string &name) const {
    auto it = input_name_to_index_.find(name);
    if (it != input_name_to_index_.end()) {
      return input_tensor_shapes_[it->second];
    }

    auto it2 = output_name_to_index_.find(name);
    if (it2 != output_name_to_index_.end()) {
      return output_tensor_shapes_[it2->second];
    }

    SHERPA_ONNX_LOGE("Found no tensor with name: '%s'", name.c_str());
    return {};
  }

  template <typename T>
  bool SetInputTensorData(const std::string &name, const T *p, int32_t n) {
    auto it = input_name_to_index_.find(name);
    if (it == input_name_to_index_.end()) {
      SHERPA_ONNX_LOGE("Found no input tensor with name: '%s'", name.c_str());
      return false;
    }

    int32_t i = it->second;
    size_t expected_size = io_info_->pInputs[i].nSize;
    size_t given_size = n * sizeof(T);

    if (expected_size != given_size) {
      SHERPA_ONNX_LOGE(
          "Input tensor '%s' size mismatch. Expected %zu bytes, got %zu bytes",
          name.c_str(), expected_size, given_size);
      return false;
    }

    std::memcpy(io_data_.pInputs[i].pVirAddr, p, expected_size);
    return true;
  }

  std::vector<float> GetOutputTensorData(const std::string &name) {
    auto it = output_name_to_index_.find(name);
    if (it == output_name_to_index_.end()) {
      SHERPA_ONNX_LOGE("Found no output tensor with name: '%s'", name.c_str());
      return {};
    }

    int32_t i = it->second;
    const auto &out_meta = io_info_->pOutputs[i];
    auto &out_buf = io_data_.pOutputs[i];

    AX_SYS_MinvalidateCache(out_buf.phyAddr, out_buf.pVirAddr, out_meta.nSize);

    size_t out_elems = out_meta.nSize / sizeof(float);
    std::vector<float> out(out_elems);
    std::memcpy(out.data(), out_buf.pVirAddr, out_meta.nSize);

    return out;
  }

  bool Run() {
    auto ret = AX_ENGINE_RunSync(handle_, &io_data_);
    if (ret != 0) {
      SHERPA_ONNX_LOGE("AX_ENGINE_RunSync failed, ret = %d", ret);
      return false;
    }

    return true;
  }

 private:
  void Init(const void *model_data, size_t model_data_length) {
    InitContext(model_data, model_data_length, false, &handle_);
    InitInputOutputAttrs(handle_, false, &io_info_);
    PrepareIO(io_info_, &io_data_, false);

    input_tensor_shapes_.reserve(io_info_->nInputSize);
    for (uint32_t i = 0; i < io_info_->nInputSize; ++i) {
      const auto &in = io_info_->pInputs[i];
      std::string name = in.pName;
      input_name_to_index_[name] = i;
      input_tensor_shapes_.emplace_back(in.pShape, in.pShape + in.nShapeSize);
    }

    output_tensor_shapes_.reserve(io_info_->nOutputSize);
    for (uint32_t i = 0; i < io_info_->nOutputSize; ++i) {
      const auto &out = io_info_->pOutputs[i];
      std::string name = out.pName;
      output_name_to_index_[name] = i;
      output_tensor_shapes_.emplace_back(out.pShape,
                                         out.pShape + out.nShapeSize);
    }
  }

 private:
  AX_ENGINE_HANDLE handle_ = nullptr;
  AX_ENGINE_IO_INFO_T *io_info_ = nullptr;
  AX_ENGINE_IO_T io_data_{};

  std::unordered_map<std::string, int32_t> input_name_to_index_;
  std::unordered_map<std::string, int32_t> output_name_to_index_;

  std::vector<std::vector<int32_t>> input_tensor_shapes_;
  std::vector<std::vector<int32_t>> output_tensor_shapes_;
};


medium

The AxeraModel class is defined locally within this .cc file. It's a substantial piece of code (over 120 lines) and is very similar in structure and purpose to the AxclModel class. To improve code organization and reduce duplication, consider refactoring this into a separate axera-model.h and axera-model.cc, similar to how AxclModel is handled. This would also make it easier to reuse if needed elsewhere.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (6)
sherpa-onnx/csrc/axera/offline-tts-kokoro-axera-impl.h (1)

38-90: Significant code duplication in constructors for FST/FAR loading.

Both constructors contain nearly identical logic for loading rule_fsts and rule_fars, differing only in file reading (direct path vs. Manager-based). This pattern appears to be duplicated from existing impl files.

For maintainability, consider extracting the FST/FAR processing logic into helper methods or a separate utility class that can work with both file paths and Manager-based reading.

Also applies to: 92-147

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-axera-impl.h` around lines 38 - 90,
The constructor OfflineTtsKokoroAxeraImpl duplicates the logic that parses
rule_fsts and rule_fars and populates tn_list_ (using SplitStringToVector,
creating kaldifst::TextNormalizer from file paths or from fst::FarReader), which
also appears in the other constructor region; extract that shared behavior into
one or two helper methods (e.g., LoadRuleFstsFromPaths(const std::string& csv,
std::vector<std::unique_ptr<kaldifst::TextNormalizer>>* tn_list) and
LoadRuleFarsFromPaths(const std::string& csv,
std::vector<std::unique_ptr<kaldifst::TextNormalizer>>* tn_list) or a single
LoadRules helper that handles both FST and FAR inputs), call those helpers from
the constructor(s) after InitFrontend(), keep existing debug logging and use
tn_list_.reserve as appropriate, and reuse the same helpers in the Manager-based
constructors to eliminate duplication.
sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc (2)

296-298: Consider extracting hardcoded dimension constants.

d_shape_2 = 640 is hardcoded with a comment suggesting it relates to style_dim_[2]. This magic number, along with others like dur_shape_2 = 50 (line 375) and DOUBLE_INPUT_THRESHOLD = 32 (line 510), would benefit from being named constants for clarity and maintainability.

Example named constants
// At class level or namespace scope
static constexpr int32_t kStyleDim2 = 640;
static constexpr int32_t kDurShape2 = 50;
static constexpr int32_t kDoubleInputThreshold = 32;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 296 -
298, Replace magic numbers with named constexpr constants and use them where
d_shape_2, dur_shape_2 and DOUBLE_INPUT_THRESHOLD are set/used: define e.g.
kStyleDim2, kDurShape2, kDoubleInputThreshold at class or anonymous-namespace
scope and replace the literal 640 (used to initialize d_shape_2 / en), 50 (used
for dur_shape_2) and 32 (DOUBLE_INPUT_THRESHOLD) with those constants; ensure
any comment referring to style_dim_[2] now references kStyleDim2 and update
variable initializations where functions or members (e.g., d_shape_2,
dur_shape_2, DOUBLE_INPUT_THRESHOLD) are declared to use the new constants for
clarity and maintainability.

231-240: Input truncation warning should consider providing alternative handling.

When input_ids exceeds max_seq_len_, the code logs a warning and truncates. This is reasonable, but users might want to know this limitation upfront or have the option to split long inputs into multiple calls.

The truncation behavior is safe but worth documenting in the public API.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 231 -
240, The current RunModels implementation silently truncates input_ids when
actual_len > max_seq_len_; change this to either (A) make truncation
explicit/configurable by adding a parameter (e.g., bool allow_truncate) to
RunModels and only resize when allow_truncate is true, or (B) refuse to run and
return false (or an error code) if input_ids.size() > max_seq_len_, and in both
cases update the SHERPA_ONNX_LOGE call to include the offending length and a
suggested action (e.g., "increase max_seq_len_, set allow_truncate=true, or
split input into multiple calls"); ensure references to RunModels, input_ids,
max_seq_len_, and the log message are updated accordingly and document the
behavior in the public API.
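One way to realize the "split input into multiple calls" alternative mentioned in this comment could be a helper along these lines (SplitIntoChunks is hypothetical, not part of the PR):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Split an over-long token sequence into chunks of at most max_seq_len
// tokens instead of silently truncating it.
std::vector<std::vector<int64_t>> SplitIntoChunks(
    const std::vector<int64_t> &input_ids, size_t max_seq_len) {
  std::vector<std::vector<int64_t>> chunks;
  if (max_seq_len == 0) return chunks;  // guard against an infinite loop
  for (size_t start = 0; start < input_ids.size(); start += max_seq_len) {
    size_t end = std::min(start + max_seq_len, input_ids.size());
    chunks.emplace_back(input_ids.begin() + start, input_ids.begin() + end);
  }
  return chunks;
}
```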
sherpa-onnx/csrc/offline-tts-impl.cc (2)

51-71: Consider case-insensitive provider comparison for user convenience.

The provider string comparison is case-sensitive (e.g., "axera", "axcl"). Users specifying "AXERA" or "Axera" would fall through to the wrong code path. Consider normalizing the provider string to lowercase before comparison.

Example approach
// Needs <algorithm> and <cctype>; the lambda avoids undefined behavior
// that std::tolower has for negative char values.
std::string provider = config.model.provider;
std::transform(provider.begin(), provider.end(), provider.begin(),
               [](unsigned char c) { return std::tolower(c); });
if (provider == "axera") {
  // ...
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/offline-tts-impl.cc` around lines 51 - 71, Normalize
config.model.provider to lowercase before comparing (e.g., create a local
lowercased string from config.model.provider using std::transform) and then use
that lowercase variable for the if checks (e.g., the existing if (provider ==
"axera") branch). Update any other provider comparisons in this file (such as
"axcl") to use the same lowercased variable so comparisons for symbols like
config.model.provider, OfflineTtsKokoroAxeraImpl, and related provider branches
become case-insensitive.

51-94: Duplicated provider-checking logic between Create overloads.

The provider-specific branching for axera and axcl is duplicated nearly identically in both Create(const OfflineTtsConfig&) and Create(Manager*, const OfflineTtsConfig&). This increases maintenance burden when adding new providers or modifying behavior.

Consider extracting provider validation into a helper function or macro to reduce duplication.

Also applies to: 121-164
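A minimal sketch of the suggested extraction, assuming hypothetical names (CreateProviderImpl, has_kokoro_model, and a stub OfflineTts) in place of the real config and impl types:

```cpp
#include <cassert>
#include <memory>
#include <string>

struct OfflineTts {};  // placeholder for the real interface

// Returns an impl for the hardware providers, or nullptr so the caller
// falls through to the default ONNX Runtime path. The real version would
// hold the SHERPA_ONNX_ENABLE_AXERA / SHERPA_ONNX_ENABLE_AXCL guards and
// logging in this one place.
std::unique_ptr<OfflineTts> CreateProviderImpl(const std::string &provider,
                                               bool has_kokoro_model) {
  if (provider == "axera" || provider == "axcl") {
    if (!has_kokoro_model) return nullptr;  // only kokoro is supported here
    return std::make_unique<OfflineTts>();
  }
  return nullptr;
}
```

Both Create overloads would then call this helper, optionally passing the Manager* through as an extra parameter.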

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/offline-tts-impl.cc` around lines 51 - 94, The
provider-specific branching for "axera" and "axcl" is duplicated between the two
Create overloads (Create(const OfflineTtsConfig&) and Create(Manager*, const
OfflineTtsConfig&)); extract the common logic into a helper function (e.g.
BuildProviderImpl or ValidateAndCreateProvider) that takes the OfflineTtsConfig
(and optionally Manager*) and returns the appropriate
std::unique_ptr<OfflineTts> or nullptr after handling the
SHERPA_ONNX_ENABLE_AXERA / SHERPA_ONNX_ENABLE_AXCL compile-time guards and
logging (SHERPA_ONNX_LOGE / SHERPA_ONNX_EXIT). Move the checks for
config.model.provider == "axera" and "axcl", the kokoro model existence check,
and the platform-specific error messages into that helper, and have both Create
overloads call it to reduce duplication and keep behavior identical.
sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.h (1)

8-9: Missing <cstdint> include for int64_t.

The header uses int64_t in the Run method signature (line 25) but doesn't include <cstdint>. While it may be transitively included, explicit inclusion is recommended for types used in public interfaces.

Proposed fix
+#include <cstdint>
 #include <memory>
 #include <vector>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.h` around lines 8 - 9,
Add a direct include of <cstdint> so the public API that uses int64_t (the Run
method signature) has the required typedef available; update the header includes
at the top (next to the existing <memory> and <vector>) to add <cstdint> to
ensure int64_t is defined for the Run method and any other public declarations
referencing fixed-width integer types.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h`:
- Around line 163-175: The code logs when a non-zero sid is provided for
single-speaker models but doesn't change the value; update the branch in
OfflineTtsKokoroAxclImpl (where num_speakers and sid are checked) to set sid = 0
when num_speakers == 0 && sid != 0 so the sanitized speaker ID is passed into
Process() (and anywhere else the method uses sid) instead of the original value;
keep the existing log statements.
- Around line 75-83: The code calls fst::FarReader<fst::StdArc>::Open(...) and
immediately dereferences reader (reader->Done()) which can be null; update the
Open call sites (the local reader variable and the in-memory Open(std::move(s))
path) to check whether reader is non-null before entering the for loop, and
handle the failure (log an error/return or skip processing) instead of
dereferencing; specifically add a guard after the Open(...) that validates
reader and only runs the loop that creates kaldifst::TextNormalizer and pushes
into tn_list_ when reader is valid.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc`:
- Around line 194-210: LoadVoices currently hardcodes model metadata
(sample_rate, style dims, max_token_len) which breaks compatibility; instead
read these values from the model metadata on model1_ (e.g., fetch keys like
"sample_rate", "style_dim0"/"style_dim1" or "style_dims", "max_token_len", and
optional "voice"/"has_espeak"), parse them into meta_data_ (sample_rate,
max_token_len, voice, has_espeak), compute num_speakers using the retrieved
style dimensions when calculating meta_data_.num_speakers and when slicing
styles_, and keep max_seq_len_ from model1_->TensorShape("input_ids") as before
so the function adapts to different Kokoro model variants.

In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-axera-impl.h`:
- Around line 389-390: The Process method's sid parameter is declared as int32_t
causing narrowing from Generate (which passes an int64_t) and mismatch with
model_->Run (which expects int64_t); change the Process signature to accept
int64_t sid (update GeneratedAudio Process(...) to use int64_t), update any
internal uses and calls (e.g., Generate → Process) to pass the 64-bit sid
unchanged, and ensure the argument passed into model_->Run remains int64_t to
avoid implicit narrowing/widening conversions.

In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc`:
- Around line 596-606: In PostprocessXToAudio validate that x.size() is at least
expected_size = static_cast<size_t>(half_n_fft) * num_frames * 2 before using
x.begin() + half_n_fft * num_frames; if it is smaller, bail out gracefully
(e.g., log/error/throw and return) or pad/resize x to expected_size so creating
spec_part and phase_part with iterators is safe; update the code around the
spec_part/phase_part assignment to check x.size(), compute expected_size using
half_n_fft and num_frames, and handle the error path instead of unconditionally
using the iterators.
- Around line 321-333: Validate that voices_data_length is a multiple of
sizeof(float) * kStyleDim0 * kStyleDim1 before computing meta_data_.num_speakers
and building styles_; replace the hardcoded 510 and 256 with kStyleDim0 and
kStyleDim1 so both LoadVoices and LoadVoiceEmbedding share the same constants,
compute actual_num_floats and num_speakers using those constants, and if the
blob size is not a whole number of speaker blocks fail early (log/throw/return)
rather than creating a truncated styles_ vector to ensure data integrity.
- Around line 250-252: Validate kokoro.length_scale and computed speed to ensure
they are positive before using them: check config_.kokoro.length_scale > 0
before performing the adjustment in the block that sets speed (the spot using
config_.kokoro.length_scale) and after computing speed ensure speed > 0 (or
clamp to a sane positive minimum / return an error status) so ProcessDuration()
never receives a non-positive divisor; update the caller/flow that relies on
speed to handle error/invalid-config cases and add clear error messages
referencing config_.kokoro.length_scale and speed.
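The metadata-driven approach suggested for LoadVoices can be sketched with a lookup helper. The key names and the LookupInt signature are illustrative, not the real axcl model API:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical helper: read an integer metadata value, falling back to the
// current hardcoded default when the model does not provide the key.
int32_t LookupInt(const std::unordered_map<std::string, std::string> &meta,
                  const std::string &key, int32_t fallback) {
  auto it = meta.find(key);
  if (it == meta.end()) return fallback;
  return std::stoi(it->second);
}
```

LoadVoices would then call this for keys like "sample_rate" and "max_token_len" instead of using bare literals, so different Kokoro variants keep working.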

---

Nitpick comments:
In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc`:
- Around line 296-298: Replace magic numbers with named constexpr constants and
use them where d_shape_2, dur_shape_2 and DOUBLE_INPUT_THRESHOLD are set/used:
define e.g. kStyleDim2, kDurShape2, kDoubleInputThreshold at class or
anonymous-namespace scope and replace the literal 640 (used to initialize
d_shape_2 / en), 50 (used for dur_shape_2) and 32 (DOUBLE_INPUT_THRESHOLD) with
those constants; ensure any comment referring to style_dim_[2] now references
kStyleDim2 and update variable initializations where functions or members (e.g.,
d_shape_2, dur_shape_2, DOUBLE_INPUT_THRESHOLD) are declared to use the new
constants for clarity and maintainability.
- Around line 231-240: The current RunModels implementation silently truncates
input_ids when actual_len > max_seq_len_; change this to either (A) make
truncation explicit/configurable by adding a parameter (e.g., bool
allow_truncate) to RunModels and only resize when allow_truncate is true, or (B)
refuse to run and return false (or an error code) if input_ids.size() >
max_seq_len_, and in both cases update the SHERPA_ONNX_LOGE call to include the
offending length and a suggested action (e.g., "increase max_seq_len_, set
allow_truncate=true, or split input into multiple calls"); ensure references to
RunModels, input_ids, max_seq_len_, and the log message are updated accordingly
and document the behavior in the public API.
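The named-constant suggestion for the 640/50/32 literals might look like this; the constant names are illustrative:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for the magic numbers called out above.
constexpr int32_t kStyleDim2 = 640;            // third style dimension
constexpr int32_t kDurShape2 = 50;             // second dim of the duration tensor
constexpr int32_t kDoubleInputThreshold = 32;  // batch-doubling cutoff

// Example use replacing an expression like `batch * 50`.
constexpr int32_t DurationBufferSize(int32_t batch) {
  return batch * kDurShape2;
}
```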

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.h`:
- Around line 8-9: Add a direct include of <cstdint> so the public API that uses
int64_t (the Run method signature) has the required typedef available; update
the header includes at the top (next to the existing <memory> and <vector>) to
add <cstdint> to ensure int64_t is defined for the Run method and any other
public declarations referencing fixed-width integer types.

In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-axera-impl.h`:
- Around line 38-90: The constructor OfflineTtsKokoroAxeraImpl duplicates the
logic that parses rule_fsts and rule_fars and populates tn_list_ (using
SplitStringToVector, creating kaldifst::TextNormalizer from file paths or from
fst::FarReader), which also appears in the other constructor region; extract
that shared behavior into one or two helper methods (e.g.,
LoadRuleFstsFromPaths(const std::string& csv,
std::vector<std::unique_ptr<kaldifst::TextNormalizer>>* tn_list) and
LoadRuleFarsFromPaths(const std::string& csv,
std::vector<std::unique_ptr<kaldifst::TextNormalizer>>* tn_list) or a single
LoadRules helper that handles both FST and FAR inputs), call those helpers from
the constructor(s) after InitFrontend(), keep existing debug logging and use
tn_list_.reserve as appropriate, and reuse the same helpers in the Manager-based
constructors to eliminate duplication.

In `@sherpa-onnx/csrc/offline-tts-impl.cc`:
- Around line 51-71: Normalize config.model.provider to lowercase before
comparing (e.g., create a local lowercased string from config.model.provider
using std::transform) and then use that lowercase variable for the if checks
(e.g., the existing if (provider == "axera") branch). Update any other provider
comparisons in this file (such as "axcl") to use the same lowercased variable so
comparisons for symbols like config.model.provider, OfflineTtsKokoroAxeraImpl,
and related provider branches become case-insensitive.
- Around line 51-94: The provider-specific branching for "axera" and "axcl" is
duplicated between the two Create overloads (Create(const OfflineTtsConfig&) and
Create(Manager*, const OfflineTtsConfig&)); extract the common logic into a
helper function (e.g. BuildProviderImpl or ValidateAndCreateProvider) that takes
the OfflineTtsConfig (and optionally Manager*) and returns the appropriate
std::unique_ptr<OfflineTts> or nullptr after handling the
SHERPA_ONNX_ENABLE_AXERA / SHERPA_ONNX_ENABLE_AXCL compile-time guards and
logging (SHERPA_ONNX_LOGE / SHERPA_ONNX_EXIT). Move the checks for
config.model.provider == "axera" and "axcl", the kokoro model existence check,
and the platform-specific error messages into that helper, and have both Create
overloads call it to reduce duplication and keep behavior identical.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4c749349-7b29-40e8-935c-83313b17a0ed

📥 Commits

Reviewing files that changed from the base of the PR and between 71cf9c9 and 1b940fa.

📒 Files selected for processing (10)
  • sherpa-onnx/csrc/CMakeLists.txt
  • sherpa-onnx/csrc/axcl/axcl-model.cc
  • sherpa-onnx/csrc/axcl/axcl-model.h
  • sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h
  • sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc
  • sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.h
  • sherpa-onnx/csrc/axera/offline-tts-kokoro-axera-impl.h
  • sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc
  • sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.h
  • sherpa-onnx/csrc/offline-tts-impl.cc

Comment on lines +75 to +83
        std::unique_ptr<fst::FarReader<fst::StdArc>> reader(
            fst::FarReader<fst::StdArc>::Open(f));
        for (; !reader->Done(); reader->Next()) {
          std::unique_ptr<fst::StdConstFst> r(
              fst::CastOrConvertToConstFst(reader->GetFst()->Copy()));

          tn_list_.push_back(
              std::make_unique<kaldifst::TextNormalizer>(std::move(r)));
        }

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Repository scripts confirmed the finding: every FarReader::Open() call site in the codebase (for example in offline-recognizer-impl.cc) dereferences the returned reader without a null check, and no existing code validates the pointer after Open().


Add null check after FarReader::Open() to prevent null pointer dereference.

Open() can fail if the FAR archive is missing or corrupt, returning a null pointer. The code immediately calls reader->Done() without validation, causing a crash. Add a guard before the loop:

Proposed fix
         std::unique_ptr<fst::FarReader<fst::StdArc>> reader(
             fst::FarReader<fst::StdArc>::Open(f));
+        if (!reader) {
+          SHERPA_ONNX_LOGE("Failed to open FAR archive: %s", f.c_str());
+          SHERPA_ONNX_EXIT(-1);
+        }
         for (; !reader->Done(); reader->Next()) {

Apply the same guard to the in-memory Open(std::move(s)) path at lines 135–144 as well.

Note: This pattern appears in at least 9 other files using FarReader::Open() without validation. Consider a systematic fix across the codebase.
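Since the pattern recurs across the codebase, the guard can live in one place. A generic, self-contained sketch follows; the stub Reader is hypothetical and stands in for fst::FarReader<fst::StdArc>, and real code would call its Open() and log via SHERPA_ONNX_LOGE:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stub standing in for fst::FarReader<fst::StdArc> (hypothetical).
struct Reader {
  std::vector<int> items;
  size_t pos = 0;
  bool Done() const { return pos >= items.size(); }
  void Next() { ++pos; }
  int Get() const { return items[pos]; }
};

// Models FarReader::Open(), which returns null on a missing/corrupt archive.
std::unique_ptr<Reader> Open(bool ok) {
  if (!ok) return nullptr;
  auto r = std::make_unique<Reader>();
  r->items = {1, 2, 3};
  return r;
}

// The guard runs before the first Done() call, so a failed open is
// reported (-1) instead of dereferenced.
int CollectAll(bool ok, std::vector<int> *out) {
  auto reader = Open(ok);
  if (!reader) return -1;
  for (; !reader->Done(); reader->Next()) out->push_back(reader->Get());
  return 0;
}
```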

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        std::unique_ptr<fst::FarReader<fst::StdArc>> reader(
            fst::FarReader<fst::StdArc>::Open(f));
        for (; !reader->Done(); reader->Next()) {
          std::unique_ptr<fst::StdConstFst> r(
              fst::CastOrConvertToConstFst(reader->GetFst()->Copy()));
          tn_list_.push_back(
              std::make_unique<kaldifst::TextNormalizer>(std::move(r)));
        }
        std::unique_ptr<fst::FarReader<fst::StdArc>> reader(
            fst::FarReader<fst::StdArc>::Open(f));
        if (!reader) {
          SHERPA_ONNX_LOGE("Failed to open FAR archive: %s", f.c_str());
          SHERPA_ONNX_EXIT(-1);
        }
        for (; !reader->Done(); reader->Next()) {
          std::unique_ptr<fst::StdConstFst> r(
              fst::CastOrConvertToConstFst(reader->GetFst()->Copy()));
          tn_list_.push_back(
              std::make_unique<kaldifst::TextNormalizer>(std::move(r)));
        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h` around lines 75 - 83,
The code calls fst::FarReader<fst::StdArc>::Open(...) and immediately
dereferences reader (reader->Done()) which can be null; update the Open call
sites (the local reader variable and the in-memory Open(std::move(s)) path) to
check whether reader is non-null before entering the for loop, and handle the
failure (log an error/return or skip processing) instead of dereferencing;
specifically add a guard after the Open(...) that validates reader and only runs
the loop that creates kaldifst::TextNormalizer and pushes into tn_list_ when
reader is valid.

Comment on lines +163 to +175
if (num_speakers == 0 && sid != 0) {
#if __OHOS__
  SHERPA_ONNX_LOGE(
      "This is a single-speaker model and supports only sid 0. Given sid: "
      "%{public}d. sid is ignored",
      static_cast<int32_t>(sid));
#else
  SHERPA_ONNX_LOGE(
      "This is a single-speaker model and supports only sid 0. Given sid: "
      "%d. sid is ignored",
      static_cast<int32_t>(sid));
#endif
}

⚠️ Potential issue | 🟠 Major

Actually reset sid for single-speaker models.

This branch only logs that non-zero sid is ignored, but it still passes the original value into Process(). That means a single-speaker model can still receive a non-zero speaker ID here.

🐛 Proposed fix
     if (num_speakers == 0 && sid != 0) {
 #if __OHOS__
       SHERPA_ONNX_LOGE(
           "This is a single-speaker model and supports only sid 0. Given sid: "
           "%{public}d. sid is ignored",
           static_cast<int32_t>(sid));
 #else
       SHERPA_ONNX_LOGE(
           "This is a single-speaker model and supports only sid 0. Given sid: "
           "%d. sid is ignored",
           static_cast<int32_t>(sid));
 #endif
+      sid = 0;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h` around lines 163 - 175,
The code logs when a non-zero sid is provided for single-speaker models but
doesn't change the value; update the branch in OfflineTtsKokoroAxclImpl (where
num_speakers and sid are checked) to set sid = 0 when num_speakers == 0 && sid
!= 0 so the sanitized speaker ID is passed into Process() (and anywhere else the
method uses sid) instead of the original value; keep the existing log
statements.

Comment on lines +389 to +390
GeneratedAudio Process(const std::vector<std::vector<int64_t>> &tokens,
                       int32_t sid, float speed) const {

⚠️ Potential issue | 🟡 Minor

Type mismatch: sid narrowed from int64_t to int32_t.

The Process method declares int32_t sid (line 390), but it's called from Generate which receives int64_t sid (line 158). The model's Run method (per context snippet at offline-tts-kokoro-model-axera.h:26-27) also expects int64_t sid.

This causes implicit narrowing conversions:

  1. Generate → Process: int64_t → int32_t (potential data loss for large speaker IDs)
  2. Process → model_->Run: int32_t → int64_t (safe widening)
Proposed fix
-  GeneratedAudio Process(const std::vector<std::vector<int64_t>> &tokens,
-                         int32_t sid, float speed) const {
+  GeneratedAudio Process(const std::vector<std::vector<int64_t>> &tokens,
+                         int64_t sid, float speed) const {
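The data-loss direction called out above is easy to demonstrate. NarrowSid below is only an illustration of what the implicit int64_t to int32_t conversion does to a speaker ID above INT32_MAX:

```cpp
#include <cassert>
#include <cstdint>

// What the implicit conversion at the Generate -> Process boundary does:
// the value wraps modulo 2^32 instead of being preserved.
int32_t NarrowSid(int64_t sid) { return static_cast<int32_t>(sid); }
```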
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-axera-impl.h` around lines 389 -
390, The Process method's sid parameter is declared as int32_t causing narrowing
from Generate (which passes an int64_t) and mismatch with model_->Run (which
expects int64_t); change the Process signature to accept int64_t sid (update
GeneratedAudio Process(...) to use int64_t), update any internal uses and calls
(e.g., Generate → Process) to pass the 64-bit sid unchanged, and ensure the
argument passed into model_->Run remains int64_t to avoid implicit
narrowing/widening conversions.

Comment on lines +250 to +252
if (config_.kokoro.length_scale != 1 && speed == 1) {
  speed = 1. / config_.kokoro.length_scale;
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Repository scripts confirmed the finding: in offline-tts-kokoro-model-axera.cc the speed computed at lines 250-252 flows into ProcessDuration(), which divides by it (line 529), and nothing in the file validates speed or config_.kokoro.length_scale as positive.


Add validation for non-positive speed and kokoro.length_scale values.

At line 529, ProcessDuration() divides by speed. If speed is 0 this produces infinity, and a negative speed yields negative durations; either result is later rounded and cast to int32_t (line 531), resulting in undefined behavior and potentially massive allocations. Additionally, length_scale can indirectly cause invalid speed values through the calculation at line 251.

🛡️ Proposed fix
+    if (speed <= 0) {
+      SHERPA_ONNX_LOGE("speed must be > 0. Given: %f", speed);
+      return {};
+    }
+
+    if (config_.kokoro.length_scale <= 0) {
+      SHERPA_ONNX_LOGE("kokoro.length_scale must be > 0. Given: %f",
+                       config_.kokoro.length_scale);
+      return {};
+    }
+
     if (config_.kokoro.length_scale != 1 && speed == 1) {
       speed = 1. / config_.kokoro.length_scale;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 250 -
252, Validate kokoro.length_scale and computed speed to ensure they are positive
before using them: check config_.kokoro.length_scale > 0 before performing the
adjustment in the block that sets speed (the spot using
config_.kokoro.length_scale) and after computing speed ensure speed > 0 (or
clamp to a sane positive minimum / return an error status) so ProcessDuration()
never receives a non-positive divisor; update the caller/flow that relies on
speed to handle error/invalid-config cases and add clear error messages
referencing config_.kokoro.length_scale and speed.

Comment on lines +596 to +606
void PostprocessXToAudio(const std::vector<float> &x, int32_t num_frames,
                         std::vector<float> &audio) {
  int32_t n_fft = 20;
  int32_t hop_length = 5;
  int32_t half_n_fft = n_fft / 2 + 1;

  std::vector<float> spec_part(half_n_fft * num_frames);
  std::vector<float> phase_part(half_n_fft * num_frames);
  std::vector<float> cos_part(half_n_fft * num_frames);
  spec_part.assign(x.begin(), x.begin() + half_n_fft * num_frames);
  phase_part.assign(x.begin() + half_n_fft * num_frames, x.end());

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Repository scripts confirmed the finding: PostprocessXToAudio() in sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc slices x at x.begin() + half_n_fft * num_frames without checking x.size(), and no validation of the model3 output size exists elsewhere in the file.


Validate x size before splitting spectrogram buffers.

x.begin() + half_n_fft * num_frames (253,451) is used unconditionally to create iterators. If model3 returns a shorter tensor than expected, this creates invalid iterators and invokes undefined behavior. The expected size is half_n_fft * num_frames * 2 = 506,902 floats (spec_part + phase_part).

Proposed fix
   void PostprocessXToAudio(const std::vector<float> &x, int32_t num_frames,
                            std::vector<float> &audio) {
     int32_t n_fft = 20;
     int32_t hop_length = 5;
     int32_t half_n_fft = n_fft / 2 + 1;
+    size_t expected =
+        static_cast<size_t>(half_n_fft) * static_cast<size_t>(num_frames) * 2;
+    if (x.size() < expected) {
+      SHERPA_ONNX_LOGE(
+          "Unexpected model3 output size. Expected at least %zu floats, got "
+          "%zu",
+          expected, x.size());
+      audio.clear();
+      return;
+    }

     std::vector<float> spec_part(half_n_fft * num_frames);
     std::vector<float> phase_part(half_n_fft * num_frames);
     std::vector<float> cos_part(half_n_fft * num_frames);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 596 -
606, In PostprocessXToAudio validate that x.size() is at least expected_size =
static_cast<size_t>(half_n_fft) * num_frames * 2 before using x.begin() +
half_n_fft * num_frames; if it is smaller, bail out gracefully (e.g.,
log/error/throw and return) or pad/resize x to expected_size so creating
spec_part and phase_part with iterators is safe; update the code around the
spec_part/phase_part assignment to check x.size(), compute expected_size using
half_n_fft and num_frames, and handle the error path instead of unconditionally
using the iterators.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (2)
sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc (2)

625-629: ⚠️ Potential issue | 🔴 Critical

Validate x before slicing spectrogram and phase buffers.

Line 629 is only valid if model3 returned at least half_n_fft * num_frames * 2 floats. A shorter tensor makes x.begin() + half_n_fft * num_frames an invalid iterator and invokes undefined behavior.

🛡️ Suggested fix
     int32_t n_fft = 20;
     int32_t hop_length = 5;
     int32_t half_n_fft = n_fft / 2 + 1;
+    size_t expected =
+        static_cast<size_t>(half_n_fft) * static_cast<size_t>(num_frames) * 2;
+    if (x.size() < expected) {
+      SHERPA_ONNX_LOGE(
+          "Unexpected model3 output size. Expected at least %zu floats, got "
+          "%zu",
+          expected, x.size());
+      audio.clear();
+      return;
+    }
 
     std::vector<float> spec_part(half_n_fft * num_frames);
     std::vector<float> phase_part(half_n_fft * num_frames);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 625 -
629, Before slicing into spec_part and phase_part, validate the length of the
input vector x: ensure x.size() >= static_cast<size_t>(half_n_fft) * num_frames
* 2 (or the exact expected length for model3 output) and handle the short-input
case (e.g., log/error and return or throw) so that x.begin() + half_n_fft *
num_frames is always a valid iterator; update the code around the
spec_part/phase_part assignments (referencing variables x, spec_part,
phase_part, half_n_fft, num_frames, and the model3 output) to perform this check
and early error handling before calling assign.

247-249: ⚠️ Potential issue | 🔴 Critical

Reject non-positive length_scale / speed before duration scaling.

speed = 1. / config_.kokoro.length_scale still allows <= 0. ProcessDuration() divides by speed on Line 552, so a zero/negative value can produce invalid rounded durations and break the rest of the pipeline.

🛡️ Suggested fix
-    if (config_.kokoro.length_scale != 1 && speed == 1) {
+    if (config_.kokoro.length_scale <= 0) {
+      SHERPA_ONNX_LOGE("kokoro.length_scale must be > 0. Given: %f",
+                       config_.kokoro.length_scale);
+      return {};
+    }
+
+    if (config_.kokoro.length_scale != 1 && speed == 1) {
       speed = 1. / config_.kokoro.length_scale;
     }
+
+    if (speed <= 0) {
+      SHERPA_ONNX_LOGE("speed must be > 0. Given: %f", speed);
+      return {};
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 247 -
249, Before using length_scale to compute speed (the snippet that sets speed =
1. / config_.kokoro.length_scale), validate config_.kokoro.length_scale > 0 and
ensure computed speed > 0; if the value is non-positive, fail early (return an
error/Status or throw) with a clear message rather than proceeding to
ProcessDuration(), so ProcessDuration() never receives zero/negative speed.
Update the block that references config_.kokoro.length_scale and speed and add
the guard+error path (or clamp only after explicit decision), ensuring callers
observe the failure instead of continuing with invalid durations.
🧹 Nitpick comments (4)
sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc (4)

304-307: Remove unused input_lengths parameter.

input_lengths is computed in ComputeExternalPreprocessing but never used by the caller. This is dead code that adds confusion.

♻️ Proposed fix

Update the function signature and call site:

-  void ComputeExternalPreprocessing(const std::vector<int64_t> &input_ids,
-                                    int32_t actual_len,
-                                    std::vector<int32_t> &input_lengths,
-                                    std::vector<int32_t> &text_mask) {
-    input_lengths = std::vector<int32_t>{actual_len};
+  void ComputeExternalPreprocessing(int32_t actual_len,
+                                    std::vector<int32_t> &text_mask) {
     text_mask.resize(max_seq_len_);
     for (int32_t i = 0; i < max_seq_len_; i++) {
       text_mask[i] = (i >= actual_len) ? 1 : 0;
     }
   }

And update the call site in InferenceSingleChunk:

-    std::vector<int32_t> input_lengths;
     std::vector<int32_t> text_mask;
-    ComputeExternalPreprocessing(input_ids, actual_len, input_lengths,
-                                 text_mask);
+    ComputeExternalPreprocessing(actual_len, text_mask);

Also applies to: 576-585

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 304 -
307, The local vector input_lengths returned from ComputeExternalPreprocessing
is unused — remove it from the API and all call sites: update
ComputeExternalPreprocessing to drop the output parameter (remove the
input_lengths parameter from its signature and implementation) and remove the
corresponding argument at each call site (e.g., the call in InferenceSingleChunk
and the other call around lines 576-585). Ensure compile by adjusting any
internal logic that previously wrote into input_lengths and update any tests or
callers that relied on that parameter.

74-86: Remove unused Linspace function.

This function is defined but never called anywhere in this file. Dead code should be removed to improve maintainability.

🧹 Suggested removal
-template <typename T>
-static std::vector<T> Linspace(T a, T b, size_t N) {
-  if (N == 0) return {};
-  if (N == 1) return {a};
-  T h = (b - a) / static_cast<T>(N - 1);
-  std::vector<T> xs(N);
-  T val = a;
-  for (size_t i = 0; i < N; ++i) {
-    xs[i] = val;
-    val += h;
-  }
-  return xs;
-}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 74 - 86,
The template function Linspace (template<typename T> static std::vector<T>
Linspace(T a, T b, size_t N)) is dead code in this translation unit; remove its
entire definition to eliminate unused code and improve maintainability, and
ensure there are no remaining references or forward declarations to Linspace in
this file or its headers—after removal, rebuild/tests to confirm no regressions.

413-422: Consider reading dur_shape_2 from model tensor shape.

The hardcoded value 50 represents the duration tensor's second dimension. For better model compatibility, consider querying this from model1_'s output tensor shape for "duration", similar to how max_seq_len_ is obtained from TensorShape("input_ids").

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 413 -
422, The code hardcodes dur_shape_2 = 50 when summing duration_processed, which
can break if the model's "duration" output shape changes; instead query the
second dimension from model1_'s output tensor shape for "duration" (similar to
how max_seq_len_ is obtained from TensorShape("input_ids")) and use that value
for the loop bounds and indexing into duration_processed and pred_dur_original;
update references to dur_shape_2, duration_processed, and pred_dur_original
accordingly so the loop uses the runtime shape rather than a magic constant.

552-552: Extract magic number to a named constant.

The comment // DOUBLE_INPUT_THRESHOLD indicates this should be a named constant for clarity.

♻️ Suggested improvement

Add as a private constant in the Impl class:

  class OfflineTtsKokoroModelAxcl::Impl {
   private:
+    static constexpr int32_t kDoubleInputThreshold = 32;
     int32_t max_seq_len_ = 0;

And use it:

-    if (actual_len <= 32) {  // DOUBLE_INPUT_THRESHOLD
+    if (actual_len <= kDoubleInputThreshold) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` at line 552, Extract
the magic number 32 into a named constant by adding a private constant inside
the Impl class (e.g., a private static constexpr size_t DOUBLE_INPUT_THRESHOLD =
32 or an appropriate integral type matching actual_len) and replace the literal
in the condition `if (actual_len <= 32)` with `if (actual_len <=
DOUBLE_INPUT_THRESHOLD)` (remove the trailing comment), ensuring the constant
name matches exactly `DOUBLE_INPUT_THRESHOLD` and the type is compatible with
`actual_len`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc`:
- Around line 330-343: The code currently hardcodes d_shape_2 = 640 which can be
inconsistent with model metadata; replace that assignment with d_shape_2 =
style_dim_[2] (ensure style_dim_ has at least 3 elements and is the same integer
type as d_shape_2) and update the en vector allocation to use total_frames *
d_shape_2 as before. Verify indices that use d_shape_2 (the nested loops and
accesses to d[i * d_shape_2 + j]) remain correct and that style_dim_ is
initialized by LoadVoices before this code runs.
- Around line 392-393: Replace the hardcoded 23041 with a dynamic computation:
after calling model3_->GetOutputTensorData("x"), compute num_frames =
x_out.size() / (2 * half_n_fft) where half_n_fft is 11 (used by
PostprocessXToAudio), ensure integer division yields a non-negative size_t and
optionally validate that x_out.size() % (2 * half_n_fft) == 0 (or handle
remainder) before passing num_frames to PostprocessXToAudio so the call becomes
PostprocessXToAudio(x_out, num_frames, audio).

In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc`:
- Around line 451-455: The code assumes model1_->GetOutputTensorData("duration")
and ("d") contain at least expected lengths before indexing; add explicit size
checks after retrieving std::vector<float> duration and d to verify
duration.size() >= expected_duration_len (e.g., actual_len * 50 or max_seq_len_
* 640 as used) and d.size() >= expected_d_len before calling ProcessDuration or
iterating over them, and if the checks fail log an error (or return/throw) to
fail cleanly; update callers such as the call to ProcessDuration(duration,
actual_len, speed, pred_dur, total_frames) and any loops that index duration or
d to rely on the validated sizes.
- Around line 567-576: The code currently decrements each token at most once
(for loop over indices) which can leave total_frames too large when abs(diff) >
actual_len; change the logic so you keep decrementing tokens (in order given by
Argsort(pred_dur, actual_len, true)) until decreased reaches abs(diff) or no
more decrements are possible: replace the single-pass for(auto idx : indices)
with a loop like while(decreased < std::abs(diff)) { bool any = false; for(auto
idx : indices) { if(pred_dur[idx] > 1) { pred_dur[idx]--; decreased++; any =
true; if(decreased >= std::abs(diff)) break; } } if(!any) break; } so that
pred_dur entries can be decremented multiple times and total_frames will match
fixed_total_frames before calling SetInputTensorData("pred_aln_trg", ...).

---

Duplicate comments:
In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc`:
- Around line 625-629: Before slicing into spec_part and phase_part, validate
the length of the input vector x: ensure x.size() >=
static_cast<size_t>(half_n_fft) * num_frames * 2 (or the exact expected length
for model3 output) and handle the short-input case (e.g., log/error and return
or throw) so that x.begin() + half_n_fft * num_frames is always a valid
iterator; update the code around the spec_part/phase_part assignments
(referencing variables x, spec_part, phase_part, half_n_fft, num_frames, and the
model3 output) to perform this check and early error handling before calling
assign.
- Around line 247-249: Before using length_scale to compute speed (the snippet
that sets speed = 1. / config_.kokoro.length_scale), validate
config_.kokoro.length_scale > 0 and ensure computed speed > 0; if the value is
non-positive, fail early (return an error/Status or throw) with a clear message
rather than proceeding to ProcessDuration(), so ProcessDuration() never receives
zero/negative speed. Update the block that references
config_.kokoro.length_scale and speed and add the guard+error path (or clamp
only after explicit decision), ensuring callers observe the failure instead of
continuing with invalid durations.

---

Nitpick comments:
In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc`:
- Around line 304-307: The local vector input_lengths returned from
ComputeExternalPreprocessing is unused — remove it from the API and all call
sites: update ComputeExternalPreprocessing to drop the output parameter (remove
the input_lengths parameter from its signature and implementation) and remove
the corresponding argument at each call site (e.g., the call in
InferenceSingleChunk and the other call around lines 576-585). Ensure compile by
adjusting any internal logic that previously wrote into input_lengths and update
any tests or callers that relied on that parameter.
- Around line 74-86: The template function Linspace (template<typename T> static
std::vector<T> Linspace(T a, T b, size_t N)) is dead code in this translation
unit; remove its entire definition to eliminate unused code and improve
maintainability, and ensure there are no remaining references or forward
declarations to Linspace in this file or its headers—after removal,
rebuild/tests to confirm no regressions.
- Around line 413-422: The code hardcodes dur_shape_2 = 50 when summing
duration_processed, which can break if the model's "duration" output shape
changes; instead query the second dimension from model1_'s output tensor shape
for "duration" (similar to how max_seq_len_ is obtained from
TensorShape("input_ids")) and use that value for the loop bounds and indexing
into duration_processed and pred_dur_original; update references to dur_shape_2,
duration_processed, and pred_dur_original accordingly so the loop uses the
runtime shape rather than a magic constant.
- Line 552: Extract the magic number 32 into a named constant by adding a
private constant inside the Impl class (e.g., a private static constexpr size_t
DOUBLE_INPUT_THRESHOLD = 32 or an appropriate integral type matching actual_len)
and replace the literal in the condition `if (actual_len <= 32)` with `if
(actual_len <= DOUBLE_INPUT_THRESHOLD)` (remove the trailing comment), ensuring
the constant name matches exactly `DOUBLE_INPUT_THRESHOLD` and the type is
compatible with `actual_len`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0763a7a7-83ea-4360-86cf-8445357b05d4

📥 Commits

Reviewing files that changed from the base of the PR and between 1b940fa and dd29306.

📒 Files selected for processing (2)
  • sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc
  • sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc

Comment on lines +330 to +343
    int32_t d_shape_1 = max_seq_len_;
    int32_t d_shape_2 =
        640;  // style_dim_[2] essentially, or based on d_shape_[2]
    std::vector<float> en(total_frames * d_shape_2, 0.0f);

    for (int32_t j = 0; j < d_shape_2; ++j) {
      for (int32_t k = 0; k < total_frames; ++k) {
        float sum = 0.0f;
        for (int32_t i = 0; i < max_seq_len_; ++i) {
          sum += d[i * d_shape_2 + j] * pred_aln_trg[i * total_frames + k];
        }
        en[j * total_frames + k] = sum;
      }
    }

⚠️ Potential issue | 🟡 Minor

Replace hardcoded d_shape_2 = 640 with style_dim_[2].

The comment indicates this value should be style_dim_[2], which is already read from model metadata in LoadVoices. Using the hardcoded value creates inconsistency and may break compatibility with model variants that have different embedding dimensions.

🔧 Proposed fix
     int32_t d_shape_1 = max_seq_len_;
-    int32_t d_shape_2 =
-        640;  // style_dim_[2] essentially, or based on d_shape_[2]
+    int32_t d_shape_2 = style_dim_[2];
     std::vector<float> en(total_frames * d_shape_2, 0.0f);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 330 -
343, The code currently hardcodes d_shape_2 = 640 which can be inconsistent with
model metadata; replace that assignment with d_shape_2 = style_dim_[2] (ensure
style_dim_ has at least 3 elements and is the same integer type as d_shape_2)
and update the en vector allocation to use total_frames * d_shape_2 as before.
Verify indices that use d_shape_2 (the nested loops and accesses to d[i *
d_shape_2 + j]) remain correct and that style_dim_ is initialized by LoadVoices
before this code runs.

Comment on lines +392 to +393
std::vector<float> x_out = model3_->GetOutputTensorData("x");
PostprocessXToAudio(x_out, 23041, audio);

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

fd "offline-tts-kokoro-model-axcl" -t f

Repository: k2-fsa/sherpa-onnx

Length of output: 170


🏁 Script executed:

cat -n sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc | head -450 | tail -100



🏁 Script executed:

rg "PostprocessXToAudio" -A 10 -B 2



🏁 Script executed:

rg "23041" -n



🏁 Script executed:

rg "ProcessDuration|total_frames" -A 5 -B 2



🏁 Script executed:

rg "n_fft|hop_length" sherpa-onnx/csrc/axcl/ -A 2 -B 2



🏁 Script executed:

cat -n sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc | sed -n '517,570p'



🏁 Script executed:

rg "x_out\.size()" sherpa-onnx/csrc/axcl/ -A 3 -B 3



🏁 Script executed:

cat -n sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc | sed -n '300,410p'



🏁 Script executed:

cat -n sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc | sed -n '429,465p'



🏁 Script executed:

rg "max_seq_len_\s*=" sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc -B 2 -A 2



🏁 Script executed:

cat sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.h



🏁 Script executed:

python3 << 'EOF'
# Verify the calculation
half_n_fft = 11
expected_hardcoded = 23041
expected_size = 2 * half_n_fft * expected_hardcoded
print(f"If num_frames = {expected_hardcoded}:")
print(f"Expected x_out.size() = 2 * {half_n_fft} * {expected_hardcoded} = {expected_size}")
print(f"Suggested formula: num_frames = x_out.size() / {2 * half_n_fft}")
print(f"For x_out.size() = {expected_size}: num_frames = {expected_size} / {2 * half_n_fft} = {expected_size // (2 * half_n_fft)}")
EOF



Replace hardcoded num_frames = 23041 with dynamic calculation from output tensor size.

The value 23041 is hardcoded but should be derived from the actual x_out tensor produced by model3. Since PostprocessXToAudio expects the input to contain 2 * half_n_fft * num_frames elements (spec and phase parts), and half_n_fft = 11, calculate num_frames from the actual output:

    std::vector<float> x_out = model3_->GetOutputTensorData("x");
-    PostprocessXToAudio(x_out, 23041, audio);
+    int32_t half_n_fft = 11;
+    int32_t num_frames = static_cast<int32_t>(x_out.size()) / (2 * half_n_fft);
+    PostprocessXToAudio(x_out, num_frames, audio);

This ensures robustness when total_frames varies based on input duration prediction or if model configuration changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 392 -
393, Replace the hardcoded 23041 with a dynamic computation: after calling
model3_->GetOutputTensorData("x"), compute num_frames = x_out.size() / (2 *
half_n_fft) where half_n_fft is 11 (used by PostprocessXToAudio), ensure integer
division yields a non-negative size_t and optionally validate that x_out.size()
% (2 * half_n_fft) == 0 (or handle remainder) before passing num_frames to
PostprocessXToAudio so the call becomes PostprocessXToAudio(x_out, num_frames,
audio).

Comment on lines +451 to +455
    std::vector<float> duration = model1_->GetOutputTensorData("duration");
    std::vector<float> d = model1_->GetOutputTensorData("d");

    std::vector<int32_t> pred_dur;
    ProcessDuration(duration, actual_len, speed, pred_dur, total_frames);

⚠️ Potential issue | 🔴 Critical

Validate duration and d sizes before indexing them.

Line 467 and Line 550 assume model1 always returns at least max_seq_len_ * 640 and actual_len * 50 floats. If the model asset is mismatched or truncated, these loops read past the returned buffers instead of failing cleanly.

🛡️ Suggested guard
+    constexpr int32_t kDurationBins = 50;
+    constexpr int32_t kEncoderDim = 640;
+
     std::vector<float> duration = model1_->GetOutputTensorData("duration");
     std::vector<float> d = model1_->GetOutputTensorData("d");
+
+    if (duration.size() <
+            static_cast<size_t>(actual_len) * kDurationBins ||
+        d.size() < static_cast<size_t>(max_seq_len_) * kEncoderDim) {
+      SHERPA_ONNX_LOGE(
+          "Unexpected model1 output size. duration=%zu, d=%zu, actual_len=%d, "
+          "max_seq_len=%d",
+          duration.size(), d.size(), actual_len, max_seq_len_);
+      return false;
+    }

Also applies to: 463-470, 546-550

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 451 -
455, The code assumes model1_->GetOutputTensorData("duration") and ("d") contain
at least expected lengths before indexing; add explicit size checks after
retrieving std::vector<float> duration and d to verify duration.size() >=
expected_duration_len (e.g., actual_len * 50 or max_seq_len_ * 640 as used) and
d.size() >= expected_d_len before calling ProcessDuration or iterating over
them, and if the checks fail log an error (or return/throw) to fail cleanly;
update callers such as the call to ProcessDuration(duration, actual_len, speed,
pred_dur, total_frames) and any loops that index duration or d to rely on the
validated sizes.

Comment on lines +567 to +576
    if (diff < 0) {
      auto indices = Argsort(pred_dur, actual_len, true);
      int32_t decreased = 0;
      for (auto idx : indices) {
        if (pred_dur[idx] > 1 && decreased < std::abs(diff)) {
          pred_dur[idx]--;
          decreased++;
        }
        if (decreased >= std::abs(diff)) break;
      }

⚠️ Potential issue | 🟠 Major

Reduce overshoot by the full abs(diff), not one frame per token.

This loop decrements each token at most once. With slower speech (speed < 1), abs(diff) can easily exceed actual_len, leaving total_frames above fixed_total_frames; then pred_aln_trg no longer matches the fixed AXERA input shape and SetInputTensorData("pred_aln_trg", ...) fails.

🔧 Suggested fix
     if (diff < 0) {
       auto indices = Argsort(pred_dur, actual_len, true);
-      int32_t decreased = 0;
+      int32_t remaining = std::abs(diff);
       for (auto idx : indices) {
-        if (pred_dur[idx] > 1 && decreased < std::abs(diff)) {
-          pred_dur[idx]--;
-          decreased++;
-        }
-        if (decreased >= std::abs(diff)) break;
+        if (remaining == 0) break;
+        if (pred_dur[idx] <= 1) continue;
+
+        int32_t delta = std::min(pred_dur[idx] - 1, remaining);
+        pred_dur[idx] -= delta;
+        remaining -= delta;
       }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 567 -
576, The code currently decrements each token at most once (for loop over
indices) which can leave total_frames too large when abs(diff) > actual_len;
change the logic so you keep decrementing tokens (in order given by
Argsort(pred_dur, actual_len, true)) until decreased reaches abs(diff) or no
more decrements are possible: replace the single-pass for(auto idx : indices)
with a loop like while(decreased < std::abs(diff)) { bool any = false; for(auto
idx : indices) { if(pred_dur[idx] > 1) { pred_dur[idx]--; decreased++; any =
true; if(decreased >= std::abs(diff)) break; } } if(!any) break; } so that
pred_dur entries can be decremented multiple times and total_frames will match
fixed_total_frames before calling SetInputTensorData("pred_aln_trg", ...).


@coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc`:
- Around line 371-399: After reading meta_data_.num_speakers (via
SHERPA_ONNX_READ_META_DATA) in LoadVoices(), validate that it is > 0; if not,
log an error with SHERPA_ONNX_LOGE (include the bad value) and abort
initialization with SHERPA_ONNX_EXIT(-1). This prevents later out-of-bounds
access in LoadVoiceEmbedding() which assumes meta_data_.num_speakers >= 1;
update the existing validation block (near style_dim_ checks and
expected_num_floats calculation) to perform this non-positive check before
computing expected_num_floats or accessing styles_.
- Around line 400-402: The code constructs styles_ by reinterpreting voices_data
bytes as const float*, which can violate alignment and is undefined on some
architectures; instead, keep voices_data as the raw bytes from ReadFile(),
resize styles_ to actual_num_floats, and use memcpy (or std::copy with
reinterpret_cast<const char*>) to copy actual_num_floats * sizeof(float) bytes
from voices_data.data() into styles_. Update the block that currently uses
reinterpret_cast<const float *>(voices_data) and the styles_ initializer so it
resizes styles_ and performs a safe byte-wise copy from voices_data into
styles_.

In `@sherpa-onnx/csrc/offline-tts-kokoro-model-config.cc`:
- Around line 48-56: The current Validate() in
offline-tts-kokoro-model-config.cc unconditionally splits config.kokoro.model
with SplitStringToVector and checks each part, which allows multi-file strings
to pass but later breaks when offline-tts-kokoro-model.cc calls
ReadFile(config.kokoro.model). Change the validation to be provider-aware:
inside Validate(), check the provider type (the same provider enum/field used by
offline-tts-kokoro-model.cc) and if the provider is AXERA/AXCL allow splitting
and validate each file with FileExists; otherwise require a single file (no
commas), verify FileExists on the full config.kokoro.model string, and return
false with a clear error if multiple files are provided for non-multi-file
providers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d4bdc398-b355-4041-934f-29f20ada8698

📥 Commits

Reviewing files that changed from the base of the PR and between dd29306 and 4cdc81f.

📒 Files selected for processing (3)
  • sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc
  • sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc
  • sherpa-onnx/csrc/offline-tts-kokoro-model-config.cc
🚧 Files skipped from review as they are similar to previous changes (1)
  • sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc

Comment on lines +371 to +399
    SHERPA_ONNX_READ_META_DATA(meta_data_.num_speakers, "n_speakers");
    SHERPA_ONNX_READ_META_DATA(meta_data_.has_espeak, "has_espeak");
    SHERPA_ONNX_READ_META_DATA_STR_WITH_DEFAULT(meta_data_.voice, "voice",
                                                "en-us");
    SHERPA_ONNX_READ_META_DATA_VEC(style_dim_, "style_dim");

    if (style_dim_.size() != 3) {
      SHERPA_ONNX_LOGE("style_dim should be 3-d, given: %d",
                       static_cast<int32_t>(style_dim_.size()));
      SHERPA_ONNX_EXIT(-1);
    }

    if (style_dim_[1] != 1) {
      SHERPA_ONNX_LOGE("style_dim[1] should be 1, given: %d", style_dim_[1]);
      SHERPA_ONNX_EXIT(-1);
    }

    meta_data_.max_token_len = style_dim_[0];

    int32_t actual_num_floats = voices_data_length / sizeof(float);
    int32_t expected_num_floats =
        style_dim_[0] * style_dim_[2] * meta_data_.num_speakers;

    if (actual_num_floats != expected_num_floats) {
      SHERPA_ONNX_LOGE(
          "Corrupted voices file. Expected #floats: %d, actual: %d",
          expected_num_floats, actual_num_floats);
      SHERPA_ONNX_EXIT(-1);
    }

⚠️ Potential issue | 🔴 Critical

Reject non-positive n_speakers during initialization.

LoadVoices() currently accepts n_speakers == 0 when the voices blob is empty, but LoadVoiceEmbedding() later copies style_dim_[2] floats from styles_.data() anyway. That turns malformed metadata into an out-of-bounds read on the first Run() instead of a clean init failure.

🛡️ Proposed fix
     SHERPA_ONNX_READ_META_DATA(meta_data_.num_speakers, "n_speakers");
     SHERPA_ONNX_READ_META_DATA(meta_data_.has_espeak, "has_espeak");
     SHERPA_ONNX_READ_META_DATA_STR_WITH_DEFAULT(meta_data_.voice, "voice",
                                                 "en-us");
     SHERPA_ONNX_READ_META_DATA_VEC(style_dim_, "style_dim");
+
+    if (meta_data_.num_speakers <= 0) {
+      SHERPA_ONNX_LOGE("n_speakers should be > 0, given: %d",
+                       meta_data_.num_speakers);
+      SHERPA_ONNX_EXIT(-1);
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 371-399:
after reading meta_data_.num_speakers (via SHERPA_ONNX_READ_META_DATA) in
LoadVoices(), validate that it is > 0; if not, log an error with
SHERPA_ONNX_LOGE (include the bad value) and abort initialization with
SHERPA_ONNX_EXIT(-1). This prevents later out-of-bounds access in
LoadVoiceEmbedding() which assumes meta_data_.num_speakers >= 1; update the
existing validation block (near style_dim_ checks and expected_num_floats
calculation) to perform this non-positive check before computing
expected_num_floats or accessing styles_.
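As a standalone illustration of this finding: with n_speakers == 0 the expected-floats check passes for an empty voices blob, yet the later per-speaker embedding copy still reads style_dim_[2] floats from an empty buffer. A minimal sketch of the guarded validation, under assumed names (VoicesMeta and ValidateVoicesMeta are illustrative, not the sherpa-onnx API):

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the metadata read in LoadVoices().
struct VoicesMeta {
  int32_t num_speakers;
  std::vector<int32_t> style_dim;  // expected shape: {max_token_len, 1, embed_dim}
};

// Returns the number of floats the voices blob must contain, rejecting
// metadata that would otherwise cause an out-of-bounds read in the
// per-speaker embedding lookup.
int32_t ValidateVoicesMeta(const VoicesMeta &meta, std::size_t voices_bytes) {
  if (meta.style_dim.size() != 3 || meta.style_dim[1] != 1) {
    throw std::runtime_error("style_dim should be {T, 1, D}");
  }
  if (meta.num_speakers <= 0) {  // the check this review asks for
    throw std::runtime_error("n_speakers should be > 0");
  }
  int32_t expected =
      meta.style_dim[0] * meta.style_dim[2] * meta.num_speakers;
  if (voices_bytes != static_cast<std::size_t>(expected) * sizeof(float)) {
    throw std::runtime_error("corrupted voices file");
  }
  return expected;
}
```

With the guard in place, a zero speaker count fails at load time instead of corrupting the first Run().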

Comment on lines +400 to +402
styles_ = std::vector<float>(
reinterpret_cast<const float *>(voices_data),
reinterpret_cast<const float *>(voices_data) + actual_num_floats);
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

fd 'file-utils\.(h|cc)$' -t f | while read -r f; do
  echo "=== $f ==="
  rg -n -C3 '\bReadFile\b' "$f"
done

Repository: k2-fsa/sherpa-onnx

Length of output: 1304


🏁 Script executed:

cd sherpa-onnx/csrc/axera && head -n 410 offline-tts-kokoro-model-axera.cc | tail -n 20 && echo "---" && head -n 420 offline-tts-kokoro-model-axera.cc | tail -n 30

Repository: k2-fsa/sherpa-onnx

Length of output: 1862


🏁 Script executed:

cd sherpa-onnx/csrc/axera && rg -n -B10 'styles_.*=.*std::vector<float>' offline-tts-kokoro-model-axera.cc | head -n 50

Repository: k2-fsa/sherpa-onnx

Length of output: 537


🏁 Script executed:

cd sherpa-onnx/csrc/axera && rg -n -B30 'int32_t actual_num_floats = voices_data_length' offline-tts-kokoro-model-axera.cc | head -n 60

Repository: k2-fsa/sherpa-onnx

Length of output: 1471


Copy the voices blob with memcpy instead of iterating a cast float*.

This constructs styles_ from a const float* obtained by reinterpreting raw file bytes from ReadFile().data(). The underlying storage is a std::vector<char>, so the bytes were never float objects: reading them through a reinterpret-cast const float* violates the strict-aliasing rules, and if the buffer is not suitably aligned for float the load is additionally misaligned, which can fault on ARM-class targets. Either way the behavior is undefined. Resize styles_ and copy the bytes into it instead.

🔧 Proposed fix
-    styles_ = std::vector<float>(
-        reinterpret_cast<const float *>(voices_data),
-        reinterpret_cast<const float *>(voices_data) + actual_num_floats);
+    styles_.resize(actual_num_floats);
+    std::memcpy(styles_.data(), voices_data,
+                static_cast<size_t>(actual_num_floats) * sizeof(float));

(std::memcpy needs `<cstring>`; add the include if the file does not already have it.)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axera/offline-tts-kokoro-model-axera.cc` around lines 400-402:
the code constructs styles_ by reinterpreting voices_data bytes as const
float*, which can violate alignment and is undefined on some architectures;
instead, keep voices_data as the raw bytes from ReadFile(), resize styles_ to
actual_num_floats, and use memcpy (or std::copy with reinterpret_cast<const
char*>) to copy actual_num_floats * sizeof(float) bytes from voices_data.data()
into styles_. Update the block that currently uses reinterpret_cast<const float
*>(voices_data) and the styles_ initializer so it resizes styles_ and performs a
safe byte-wise copy from voices_data into styles_.
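The safe pattern can be shown standalone: std::memcpy places no alignment or aliasing requirement on its source buffer, unlike dereferencing a reinterpret-cast float*. A sketch, assuming the raw bytes arrive as std::vector<char> the way ReadFile() returns them (BytesToFloats is an illustrative helper, not sherpa-onnx code):

```cpp
#include <cstring>
#include <vector>

// Safely reinterprets a raw byte buffer (e.g. a voices blob read from disk
// into std::vector<char>) as floats. memcpy copies bytes, so no float
// object ever needs to exist at the source address.
std::vector<float> BytesToFloats(const std::vector<char> &bytes) {
  std::vector<float> out(bytes.size() / sizeof(float));
  std::memcpy(out.data(), bytes.data(), out.size() * sizeof(float));
  return out;
}
```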

Comment on lines +48 to +56
if (!model.empty()) {
std::vector<std::string> files;
SplitStringToVector(model, ",", false, &files);
for (const auto &f : files) {
if (!FileExists(f)) {
SHERPA_ONNX_LOGE("--kokoro-model: '%s' does not exist", model.c_str());
return false;
}
}
⚠️ Potential issue | 🟠 Major

This validation is too permissive for the non-AXERA/AXCL Kokoro path.

Splitting --kokoro-model unconditionally lets a.onnx,b.onnx pass Validate(), but the standard backend in sherpa-onnx/csrc/offline-tts-kokoro-model.cc:32-52 still does ReadFile(config.kokoro.model) on the unsplit string. That turns a config error into a later runtime failure. Please keep multi-file validation provider-specific, or validate the expected file arity at a higher layer where the provider is known.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/offline-tts-kokoro-model-config.cc` around lines 48-56:
the current Validate() in offline-tts-kokoro-model-config.cc unconditionally
splits config.kokoro.model with SplitStringToVector and checks each part, which
allows multi-file strings to pass but later breaks when
offline-tts-kokoro-model.cc calls ReadFile(config.kokoro.model). Change the
validation to be provider-aware: inside Validate(), check the provider type (the
same provider enum/field used by offline-tts-kokoro-model.cc) and if the
provider is AXERA/AXCL allow splitting and validate each file with FileExists;
otherwise require a single file (no commas), verify FileExists on the full
config.kokoro.model string, and return false with a clear error if multiple
files are provided for non-multi-file providers.
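The provider-aware arity check described above can be sketched in isolation. SplitByComma stands in for SplitStringToVector, and the "axera"/"axcl" provider strings are assumptions about how the provider is spelled in the config, not the verified sherpa-onnx API:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Illustrative stand-in for SplitStringToVector(model, ",", false, &files).
std::vector<std::string> SplitByComma(const std::string &s) {
  std::vector<std::string> out;
  std::stringstream ss(s);
  std::string part;
  while (std::getline(ss, part, ',')) {
    if (!part.empty()) out.push_back(part);
  }
  return out;
}

// Provider-aware arity check: AXERA/AXCL load multiple model parts, while
// every other backend passes --kokoro-model to ReadFile() unsplit, so a
// comma there should fail Validate() instead of surfacing at runtime.
bool ValidateKokoroModelArity(const std::string &model,
                              const std::string &provider) {
  auto files = SplitByComma(model);
  if (provider == "axera" || provider == "axcl") {
    return !files.empty();  // each part would also be checked with FileExists()
  }
  return files.size() == 1;  // single-file backends reject comma lists
}
```

Each listed file would still be checked with FileExists(); this sketch only shows where the arity decision belongs.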
