[Bug/Proposal] espnet-asr installation error under NumPy 2.x and proposal to migrate to torchaudio alignment


**Description**
The ESPnet ASR module of ReazonSpeech depends on the `ctc-segmentation` library, which is incompatible with NumPy 2.x. As a result, the module can neither be installed nor run.

ESPnet itself already supports NumPy >= 2.0, but `ctc-segmentation` is compiled against the NumPy 1.x C API (ABI), so calling `ctc.py` in a current environment crashes.

**Environment**
* OS: Linux (Pop!_OS) / Google Colab
* Python: 3.13
* NumPy: 2.3.5 / 2.0.2
* TorchAudio: 2.9
* ReazonSpeech: 3.0.0

**Error details**
The following error occurs when the module is imported.
```text
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.3.5 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.

ImportError: numpy.core.multiarray failed to import (auto-generated because you didn't call 'numpy.import_array()' after cimporting numpy; use '<void>numpy._import_array' to disable if you are certain you don't need it).
```
Note: for reference, the ESPnet change that added NumPy 2.x support is here:
[https://github.com/espnet/espnet/pull/6221](https://github.com/espnet/espnet/pull/6221)

**Steps to reproduce**
```bash
# Quote the requirement so the shell does not treat ">" as a redirect
pip install "numpy>=2.0"
pip install ctc-segmentation
python -c "import ctc_segmentation"
```
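
To check whether a given environment is affected without aborting a larger script, the import can be wrapped in a guard. The following is a minimal sketch; the helper name `probe_import` is hypothetical and is not part of ReazonSpeech or `ctc-segmentation`.

```python
import importlib


def probe_import(module_name):
    """Try to import a module; return (module, None) or (None, error message).

    Hypothetical helper for illustration only.
    """
    try:
        return importlib.import_module(module_name), None
    except ImportError as exc:
        # The NumPy ABI mismatch shown above surfaces as an ImportError,
        # as does a module that is simply not installed.
        return None, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    module, error = probe_import("ctc_segmentation")
    if module is None:
        print(f"ctc_segmentation is unusable here: {error}")
    else:
        print("ctc_segmentation imported successfully")
```

The returned message distinguishes a missing package (`ModuleNotFoundError`) from the ABI failure, which is reported as a plain `ImportError`.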


**Proposal**
To fix this at the root, I propose replacing the `ctc-segmentation` logic with an implementation based on **`torchaudio.functional.forced_align`**.

The project already uses `torchaudio` as a dependency, so the move to a native PyTorch implementation adds no new dependencies.
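
`forced_align` returns one token index per frame, so per-token start times have to be recovered from that frame-level path. Below is a minimal sketch of that step. It assumes the path follows the usual CTC convention: consecutive frames with the same token belong to one emission, and a repeated target token is separated by at least one blank frame. The function name `token_start_frames` is hypothetical, and this is not the code from the linked commit.

```python
def token_start_frames(alignment, blank_id=0):
    """Return the first frame index of each emitted token in a CTC path.

    alignment: sequence of per-frame token ids (for example, one row of
               the alignment tensor converted with .tolist()).
    """
    starts = []
    prev = blank_id
    for frame_idx, token_id in enumerate(alignment):
        # A new emission begins when a non-blank token differs from the
        # token on the previous frame.
        if token_id != blank_id and token_id != prev:
            starts.append(frame_idx)
        prev = token_id
    return starts


# Path for target [3, 3, 5]: blank, 3, 3, blank, 3, 5, 5
print(token_start_frames([0, 3, 3, 0, 3, 5, 5]))  # [1, 4, 5]
```

Multiplying each start frame by the per-frame duration then converts frame indices into the same unit that `ctc-segmentation` reports.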

**Verification results**
I ran a performance comparison of `ctc-segmentation` and `torchaudio` on dummy data in a Google Colab CPU environment.
The exact figures vary with the random seed, but for the same alignment task `torchaudio` ran **roughly 3-5x faster**, and the seed-46 run below measured 4.6x to 7.1x.
* **Implementation of the proposed fix**: [https://github.com/deeplearningcafe/ReazonSpeech/commit/b74577b29840379e96d3268167d046e32bb52fdd](https://github.com/deeplearningcafe/ReazonSpeech/commit/b74577b29840379e96d3268167d046e32bb52fdd)

**Execution log (seed 46):**
```text
--- Scenario: Short (50 chars, 500 frames) ---
Results Match (atol=1e-7) : True
Original ctc_segmentation : 0.00518 seconds
New torchaudio.forced_align : 0.00112 seconds
Speedup Factor            : 4.60x

--- Scenario: Medium (200 chars, 2000 frames) ---
Results Match (atol=1e-7) : True
Original ctc_segmentation : 0.03200 seconds
New torchaudio.forced_align : 0.00555 seconds
Speedup Factor            : 5.76x

--- Scenario: Long (1000 chars, 10000 frames) ---
Results Match (atol=1e-7) : True
Original ctc_segmentation : 0.68273 seconds
New torchaudio.forced_align : 0.09612 seconds
Speedup Factor            : 7.10x

```

* Benchmark code:
```python
import time
import numpy as np
import torch
import torchaudio
import ctc_segmentation

class MockASRModel:
    def __init__(self, token_list, blank_id):
        self.token_list = token_list
        self.blank_id = blank_id

class MockModel:
    def __init__(self, token_list, blank_id):
        self.asr_model = MockASRModel(token_list, blank_id)

def generate_dummy_data(num_chars, num_frames):
    """
    Generates sensible dummy data where the target text is highly
    probable in the lpz matrix to prevent ctc_segmentation from
    failing or backtracking infinitely.
    """

    vocab = ['<blank>'] + [chr(i) for i in range(97, 123)] + ['<eos>']
    model = MockModel(vocab, 0)

    # Generate random text from the valid alphabet ONLY (exclude blank and eos)
    text = "".join(np.random.choice(vocab[1:-1], num_chars))

    # Dummy audio samples (used only for index_duration calculation)
    # Assuming 1 frame = 320 samples
    samples = np.zeros(num_frames * 320)

    # Create logits with strong blank predictions everywhere
    logits = np.full((num_frames, len(vocab)), -10.0)
    logits[:, 0] = 10.0

    # Inject high probabilities for the target text evenly spaced
    step = num_frames // (num_chars + 1)
    for i, char in enumerate(text):
        frame_idx = (i + 1) * step
        char_idx = vocab.index(char)
        logits[frame_idx, 0] = -10.0
        logits[frame_idx, char_idx] = 10.0

    # Convert logits to probabilities
    lpz_probs = torch.softmax(torch.tensor(logits), dim=-1).numpy()

    return model, samples, text, lpz_probs

# Baseline: ctc_segmentation
def get_timings_original(model, samples, text, lpz):
    """Original method using the Cython-based ctc_segmentation."""
    opt = ctc_segmentation.CtcSegmentationParameters(
        index_duration = len(samples) / (lpz.shape[0] + 1),
        char_list = model.asr_model.token_list[:-1] # Exclude EOS
    )
    matrix, indices = ctc_segmentation.prepare_text(opt, [text])
    timings = ctc_segmentation.ctc_segmentation(opt, lpz, matrix)[0]

    return timings[indices[0]+1:indices[1]]

# Replacement: torchaudio forced_align
def get_timings_new(model, samples, text, lpz):
    """New method using torchaudio native forced alignment."""
    token_list = model.asr_model.token_list
    blank_id = model.asr_model.blank_id

    tokens = []
    char_to_token_idx = []
    max_token_len = max(len(t) for t in token_list)

    # Greedy token matching
    i = 0
    while i < len(text):
        match_found = False
        for length in range(max_token_len, 0, -1):
            if i + length <= len(text):
                span = text[i:i+length]
                if span in token_list:
                    tokens.append(token_list.index(span))
                    char_to_token_idx.append(i)
                    i += length
                    match_found = True
                    break
        if not match_found:
            i += 1

    if not tokens:
        return np.zeros(len(text))

    # Convert probabilities to log-probabilities (clamped to avoid log(0));
    # tensors stay on the CPU for a fair comparison
    log_probs = torch.from_numpy(lpz).clamp(min=1e-7).log().unsqueeze(0)
    targets = torch.tensor([tokens], dtype=torch.long)

    in_lens = torch.tensor([log_probs.shape[1]], dtype=torch.long)
    tgt_lens = torch.tensor([targets.shape[1]], dtype=torch.long)

    # Native alignment
    alignments, _ = torchaudio.functional.forced_align(
        log_probs,
        targets,
        input_lengths=in_lens,
        target_lengths=tgt_lens,
        blank=blank_id
    )

    alignments = alignments[0].tolist()

    timings = np.zeros(len(text))
    target_idx = 0
    prev_token_id = blank_id

    for frame_idx, token_id in enumerate(alignments):
        if target_idx < len(tokens):
            expected = tokens[target_idx]
            if token_id == expected and prev_token_id != expected:
                orig_char_idx = char_to_token_idx[target_idx]
                timings[orig_char_idx] = frame_idx
                target_idx += 1
        prev_token_id = token_id

    char_to_token_set = set(char_to_token_idx)
    for i in range(len(text)):
        if i not in char_to_token_set:
            timings[i] = timings[i-1] if i > 0 else 0

    index_duration = len(samples) / (lpz.shape[0] + 1)
    timings = timings * index_duration

    return timings


def run_benchmark():
    # Test cases: (num_chars, num_frames)
    scenarios = [
        ("Short", 50, 500),
        ("Medium", 200, 2000),
        ("Long", 1000, 10000)
    ]

    iterations = 10

    print("Starting Alignment Benchmark (CPU Only)...\n")

    for name, n_chars, n_frames in scenarios:
        print(f"--- Scenario: {name} ({n_chars} chars, {n_frames} frames) ---")
        model, samples, text, lpz = generate_dummy_data(n_chars, n_frames)

        # Correctness Check & Warmup
        timings_orig = get_timings_original(model, samples, text, lpz)
        timings_new = get_timings_new(model, samples, text, lpz)

        # Store outputs on a list as float32 tensors
        outputs = [
            torch.from_numpy(timings_orig).float(),
            torch.from_numpy(timings_new).float()
        ]

        # Check equality with a tolerance of 1e-7
        is_equal = torch.all(
            torch.isclose(outputs[0], outputs[1], atol=1e-7)
        ).item()

        print(f"Results Match (atol=1e-7) : {is_equal}")
        if not is_equal:
            max_diff = torch.max(torch.abs(outputs[0] - outputs[1])).item()
            print(f"Warning: Max timing difference = {max_diff:.6f}s")

        # Benchmark Original
        start_time = time.perf_counter()
        for _ in range(iterations):
            _ = get_timings_original(model, samples, text, lpz)
        orig_time = (time.perf_counter() - start_time) / iterations

        # Benchmark New
        start_time = time.perf_counter()
        for _ in range(iterations):
            _ = get_timings_new(model, samples, text, lpz)
        new_time = (time.perf_counter() - start_time) / iterations

        print(f"Original ctc_segmentation : {orig_time:.5f} seconds")
        print(f"New torchaudio.forced_align : {new_time:.5f} seconds")

        speedup = orig_time / new_time if new_time > 0 else float('inf')
        print(f"Speedup Factor            : {speedup:.2f}x\n")

# Run the benchmark
torch.manual_seed(46)
np.random.seed(46)

torch.set_default_device('cpu')
run_benchmark()
```
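
Because the measured speedup varies between seeds and runs, a median over several repeats may give a steadier figure than a single mean of 10 iterations. A minimal sketch using only the standard library follows; `median_runtime` is a hypothetical helper and is not part of the benchmark above.

```python
import statistics
import time


def median_runtime(fn, repeats=5, iterations=10):
    """Return the median per-call runtime of fn() in seconds.

    Each repeat times `iterations` calls; taking the median across
    repeats reduces the influence of one-off outliers.
    """
    per_call = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(iterations):
            fn()
        per_call.append((time.perf_counter() - start) / iterations)
    return statistics.median(per_call)


# Example with a cheap stand-in workload.
elapsed = median_runtime(lambda: sum(range(1000)))
print(f"{elapsed:.6f} seconds per call")
```

In the benchmark, `fn` could be `lambda: get_timings_new(model, samples, text, lpz)`, with the same call made for `get_timings_original`.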


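**Benefits of migrating**
1. **Fixes the error under NumPy 2.x**: The C-API ABI mismatch goes away, so the module works in current Python/NumPy environments.
2. **Fewer dependencies**: It removes the dependency on an external Cython library whose maintenance tends to lag.
3. **Better performance**: Segmentation on CPU becomes considerably faster.

I have already rewritten the `get_timings` function in `ctc.py` to use `torchaudio`. If it would be helpful, I would be glad to open a Pull Request. What do you think?

Thank you for your time and consideration.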
