
35% faster packing + rename bfd-requeue to bfd_split #5189

Merged
qgallouedec merged 26 commits into huggingface:main from mariosasko:vectorized-bfd-chunking
Mar 13, 2026

Conversation

@mariosasko
Contributor

@mariosasko mariosasko commented Feb 26, 2026

What does this PR do?

This PR improves the packing logic to make it faster, less error-prone, and easier to read.

The main changes are:

  • Replacing the AI-generated BFD splitting (a.k.a. "requeuing") logic from Preserve truncated tokens in BFD packing #4632
    with a vectorized implementation that is significantly shorter and 30% faster.

  • Updating pack_examples to restore the input dataset’s format and perform proper input validation.

  • Applying a minor optimization to the wrapped packing implementation by reusing the offsets across all packed columns.

  • Aligning the naming with the literature (e.g., requeue → split) while preserving backward compatibility.
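For intuition, the "split" overflow behavior amounts to cutting an overlong sequence at multiples of seq_length. A minimal NumPy sketch of the idea (illustrative only; the PR's actual implementation is vectorized over Arrow arrays, not per-sequence Python):

```python
import numpy as np

def split_sequence(ids, seq_length):
    """Split one token sequence into chunks of at most seq_length tokens.

    Hypothetical helper for illustration: every token is kept, and only
    the final chunk may be shorter than seq_length.
    """
    arr = np.asarray(ids)
    # Cut at seq_length, 2*seq_length, ... (empty if the sequence already fits)
    cut_points = np.arange(seq_length, len(arr), seq_length)
    return [chunk.tolist() for chunk in np.split(arr, cut_points)]
```

This contrasts with the "truncate" behavior, which would simply drop everything after the first seq_length tokens.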

P.S. The recent Qwen3-Coder-Next technical report includes a nice comparison of packing techniques, which would be a nice addition to the docs. However, the report is not yet available on arXiv, so it cannot be cited as an HF paper. Done 😄

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


Note

Medium Risk
Touches core SFT data preprocessing (packing/truncation) and changes strategy semantics/validation, which can alter token ordering/retention and training results if misconfigured; backward compatibility is handled via a deprecation warning.

Overview
Improves SFT dataset packing and clarifies packing strategies. The BFD packer is rewritten with a more vectorized PyArrow implementation, adds shared column validation, filters empty sequences, and introduces an explicit overflow behavior switch (truncate vs split) used by the new "bfd_split" strategy.

API/UX updates around packing. pack_dataset now validates strategies, preserves/restores the original dataset format (including IterableDatasets) and ensures seq_lengths is included in formatted columns for BFD-based strategies; truncate_dataset is unified to the same Arrow-based map + format restore path.

Naming + docs/tests alignment. SFTConfig deprecates bfd-requeue by warning and mapping it to bfd_split, SFTTrainer treats bfd_split like bfd for padding-free/FlashAttention checks, tests are updated to assert format preservation and bfd_split behavior (including empty sequences), and docs add/expand guidance and citations for bfd_split vs wrapped.
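The deprecation path described above could look roughly like this (a hypothetical sketch; `resolve_packing_strategy` is an invented name, not TRL's API):

```python
import warnings

VALID_STRATEGIES = {"bfd", "bfd_split", "wrapped"}  # assumed set for illustration

def resolve_packing_strategy(strategy: str) -> str:
    """Map the legacy 'bfd-requeue' name to 'bfd_split' with a warning."""
    if strategy == "bfd-requeue":
        warnings.warn(
            "packing strategy 'bfd-requeue' is deprecated; use 'bfd_split' instead.",
            DeprecationWarning,
        )
        return "bfd_split"
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"Unknown packing strategy: {strategy!r}")
    return strategy
```

Warning rather than raising keeps existing configs working while steering users toward the new name.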

Written by Cursor Bugbot for commit b265bbf.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@qgallouedec qgallouedec left a comment


thanks, just

@mariosasko mariosasko requested a review from qgallouedec March 13, 2026 17:25
@qgallouedec
Member

Thanks @mariosasko, reviewing it today

@qgallouedec qgallouedec changed the title Misc packing improvements 35% faster packing + rename bfd-requeue to bfd_split Mar 13, 2026
@qgallouedec
Member

import gc
import time

import numpy as np
from datasets import Dataset

from trl import pack_dataset


def generate_dataset(num_examples, mean_length=256, std_length=128, seed=42):
    """Generate a synthetic dataset with variable-length sequences."""
    rng = np.random.default_rng(seed)
    lengths = rng.normal(mean_length, std_length, num_examples).astype(int).clip(min=1)
    input_ids = [rng.integers(0, 32000, size=length).tolist() for length in lengths]
    attention_mask = [[1] * length for length in lengths]
    return Dataset.from_dict({"input_ids": input_ids, "attention_mask": attention_mask})


def benchmark(num_examples, seq_length, strategy, min_runs=5, min_time=2.0, **kwargs):
    dataset = generate_dataset(num_examples, **kwargs)
    # Warmup (3 runs to stabilize caches)
    for _ in range(3):
        pack_dataset(dataset, seq_length=seq_length, strategy=strategy)

    # Run until we have enough samples and enough wall time
    times = []
    total_elapsed = 0.0
    while len(times) < min_runs or total_elapsed < min_time:
        gc.collect()
        gc.disable()
        start = time.perf_counter()
        packed = pack_dataset(dataset, seq_length=seq_length, strategy=strategy)
        elapsed = time.perf_counter() - start
        gc.enable()
        times.append(elapsed)
        total_elapsed += elapsed

    # Use median instead of mean to reduce sensitivity to outliers
    median_time = np.median(times)
    iqr = np.percentile(times, 75) - np.percentile(times, 25)
    total_tokens = sum(len(ids) for ids in dataset["input_ids"])
    packed_examples = len(packed)

    return {
        "num_examples": num_examples,
        "seq_length": seq_length,
        "strategy": strategy,
        "total_tokens": total_tokens,
        "packed_examples": packed_examples,
        "median_time": median_time,
        "iqr": iqr,
        "num_runs": len(times),
        "throughput_examples_per_sec": num_examples / median_time,
        "throughput_tokens_per_sec": total_tokens / median_time,
    }


if __name__ == "__main__":
    configs = [
        # Vary dataset size
        {"num_examples": 1_000, "seq_length": 512, "strategy": "bfd"},
        {"num_examples": 10_000, "seq_length": 512, "strategy": "bfd"},
        {"num_examples": 100_000, "seq_length": 512, "strategy": "bfd"},
        # Vary seq_length
        {"num_examples": 10_000, "seq_length": 256, "strategy": "bfd"},
        {"num_examples": 10_000, "seq_length": 1024, "strategy": "bfd"},
        {"num_examples": 10_000, "seq_length": 4096, "strategy": "bfd"},
        # Compare strategies
        {"num_examples": 10_000, "seq_length": 512, "strategy": "bfd_split"},
        {"num_examples": 10_000, "seq_length": 512, "strategy": "wrapped"},
    ]

    header = f"{'strategy':<12} {'examples':>10} {'seq_len':>8} {'packed':>8} {'runs':>5} {'median (s)':>14} {'tok/s':>14}"
    print(header)
    print("-" * len(header))

    for cfg in configs:
        result = benchmark(**cfg)
        print(
            f"{result['strategy']:<12} "
            f"{result['num_examples']:>10,} "
            f"{result['seq_length']:>8} "
            f"{result['packed_examples']:>8,} "
            f"{result['num_runs']:>5} "
            f"{result['median_time']:>8.3f} ± {result['iqr']:.3f} "
            f"{result['throughput_tokens_per_sec']:>14,.0f}"
        )

Nice!!

TBH, I didn't look deeply into the new logic. My understanding is pretty basic, I get the general idea of how it works, I don't see any obvious mistakes, the tests are passing, and the agents are happy. So I'm going to approve and merge this. Albert, feel free to take a look too later; you have more experience with datasets than I do.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


@qgallouedec qgallouedec merged commit 6c0fccd into huggingface:main Mar 13, 2026
1 check passed
@mariosasko mariosasko deleted the vectorized-bfd-chunking branch March 13, 2026 23:27
