Numerical Duplication in XTransformer Inference #6

@Lakshmi-bashyam

Bug Report: Numerical Duplication in XTransformer Inference

Summary

A critical numerical regression has been identified in the pecos.xmc.xtransformer module. During inference, the output embeddings for distinct input instances come out as near-exact numerical duplicates of one another. The failure is observed only in multi-GPU environments, and only when the --max-pred-chunk 1 flag is passed.

Observed Behavior

  • Widespread Test Failures: The same duplication and numerical mismatch appear in multiple tests, including test_encode and test_bert.
  • Environment-Dependent: The bug occurs only in multi-GPU setups. All relevant tests pass on CPU and in single-GPU configurations (e.g., with CUDA_VISIBLE_DEVICES=0).
  • Row-Level Shadowing: Under the failing conditions, Row 1 of the obtained embedding matrix is a near-duplicate of Row 0 (in the log below, the two rows differ by less than 1e-4 in every element, while the expected Row 1 differs by more than 1). This happens even though the input text sequences for the rows are confirmed to be distinct.
  • Chunking Logic Inconsistency: Using --max-pred-chunk 1 in a multi-GPU environment appears to break data isolation between chunks, so that the first instance's result is repeated in the output instead of each chunk producing its own result.
  • Software Context: Reproduces consistently with transformers==4.49.0 and Python 3.10.
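The row-level symptom can be checked independently of the test suite. The following is a minimal sketch, not part of PECOS: the helper name find_near_duplicate_rows and the default tolerance of 1e-4 are assumptions chosen for illustration (the tolerance is picked to be slightly above the row-to-row differences visible in the log). It lists pairs of rows in an embedding matrix that agree element-wise within that tolerance.

```python
import numpy as np


def find_near_duplicate_rows(X, atol=1e-4):
    """Return index pairs (i, j), i < j, whose rows agree within `atol`.

    Hypothetical helper for triage; not a PECOS API.
    """
    X = np.asarray(X, dtype=np.float64)
    pairs = []
    for i in range(X.shape[0]):
        for j in range(i + 1, X.shape[0]):
            # rtol=0.0 makes this a pure absolute-tolerance comparison.
            if np.allclose(X[i], X[j], rtol=0.0, atol=atol):
                pairs.append((i, j))
    return pairs


# Synthetic example (not data from the failing run): row 1 repeats row 0.
X_demo = np.array([[0.5, -0.25], [0.5, -0.25], [2.0, 1.0]], dtype=np.float32)
print(find_near_duplicate_rows(X_demo))
```

A flagged pair is only a hint: inputs that are genuinely similar can legitimately produce close embeddings, so any hit should be compared against the single-GPU reference before being treated as a bug.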

Full Error Log

The assertion failure below shows that the obtained Row 1 (0.0646193 in column 0) nearly matches Row 0 (0.06460965), whereas the expected ground truth for Row 1 is -1.2388697.

_______________________________________________________ test_encode ________________________________________________________

    def test_encode(tmpdir):
        ...
        X_emb_pred_B1 = np.load(str(emb_path_B1))
>       assert X_emb_pred_B1 == approx(X_emb_pred, abs=1e-6)
E       assert array([[ 0.06460965,  0.03247809, -0.02677868,  0.02476774, -0.04013923,
E                0.0062143 ,  0.06983176,  0.01013092],
E              [ 0.0646193 ,  0.03247118, -0.026836  ,  0.02469238, -0.04022763,
E                0.0061893 ,  0.06984028,  0.01007433],
E              [ 0.06460024,  0.03247342, -0.02679601,  0.02474649, -0.04013934,
E                0.00621432,  0.06983399,  0.01009427]], dtype=float32) == approx([[0.06460965...], [-1.23886978...], [0.06460023...]])
E
E         comparison failed. Mismatched elements: 8 / 24:
E         Max absolute difference: 2.264845132827759
E         Max relative difference: 91.10012817382812
E
E         Index  | Obtained     | Expected
E         (0, 0) | 0.064609654  | 0.064609654 ± 1.0e-06 <-- [REFERENCE ROW 0]
E         (1, 0) | 0.064619295  | -1.2388697862 ± 1.0e-06 <-- [ERROR: Duplicate of Row 0]
E         (1, 1) | 0.03247118   | 2.2973163127 ± 1.0e-06  <-- [ERROR: Duplicate of Row 0]
E         (1, 2) | -0.026836    | -0.1875699758 ± 1.0e-06
E         ...
E
E         Traceback (most recent call last):
E           File "test/pecos/xmc/xtransformer/test_xtransformer.py", line 202, in test_encode
E             assert X_emb_pred_B1 == approx(X_emb_pred, abs=1e-6)
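To narrow a failure like the one above to specific rows, the obtained and expected matrices can be compared row by row. This is a minimal sketch under stated assumptions: the helper name mismatched_rows is hypothetical, and the arrays in the demo are synthetic rather than the elided values from the log. The tolerance mirrors the abs=1e-6 used in the failing assertion.

```python
import numpy as np


def mismatched_rows(obtained, expected, atol=1e-6):
    """Return indices of rows where any element differs by more than `atol`.

    Hypothetical helper for triage; not a PECOS or pytest API.
    """
    obtained = np.asarray(obtained, dtype=np.float64)
    expected = np.asarray(expected, dtype=np.float64)
    if obtained.shape != expected.shape:
        raise ValueError("obtained and expected must have the same shape")
    bad = np.abs(obtained - expected) > atol
    return np.flatnonzero(bad.any(axis=1)).tolist()


# Synthetic example (not data from the failing run): only row 1 is wrong.
obtained_demo = np.array([[0.5, 0.25], [0.5, 0.25], [2.0, 1.0]])
expected_demo = np.array([[0.5, 0.25], [-1.0, 2.0], [2.0, 1.0]])
print(mismatched_rows(obtained_demo, expected_demo))
```

In the log above, 8 of 24 elements mismatch in a matrix with 8 columns, which is consistent with exactly one full row being wrong.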

Labels: bug
