forked from amzn/pecos
Open
Labels
bug (Something isn't working)
Description
Bug Report: Numerical Duplication in XTransformer Inference
Summary
A critical numerical regression has been identified in the pecos.xmc.xtransformer module. During inference, output embeddings for distinct input instances come out as near-duplicates of one another: in the log below, Row 1 of the output matches Row 0 to within about 1e-4 instead of matching its expected values. The failure is observed in multi-GPU environments when the --max-pred-chunk 1 flag is applied.
Observed Behavior
- Widespread Test Failures: The same duplication error and numerical mismatch appear in multiple tests, including test_encode and test_bert.
- Environment-Dependent: The failure has only been observed on multi-GPU setups. All relevant tests pass on CPU and single-GPU configurations (e.g., with CUDA_VISIBLE_DEVICES=0).
- Row-Level Shadowing: Under the failing conditions, Row 1 of the obtained embedding matrix is a near-copy of Row 0 (the two rows agree to within about 1e-4 in the log below, so they are close but not bit-identical). This persists even though the input text sequences for each row are confirmed to be unique.
- Chunking Logic Inconsistency: Using --max-pred-chunk 1 in a multi-GPU environment appears to break data isolation, so that the first instance's result is repeated in the output instead of each chunk producing its own result.
- Software Context: Highly reproducible with transformers==4.49.0 and Python 3.10.
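The row-level shadowing described above can be checked mechanically. The following is a minimal sketch, not part of the PECOS test suite: the helper name near_duplicate_rows and the 1e-3 tolerance are assumptions chosen for illustration, and the matrix is copied from the obtained values in the assertion output of this report.

```python
import numpy as np

# Obtained embeddings, copied from the assertion output in this report.
X_obtained = np.array(
    [
        [0.06460965, 0.03247809, -0.02677868, 0.02476774,
         -0.04013923, 0.0062143, 0.06983176, 0.01013092],
        [0.0646193, 0.03247118, -0.026836, 0.02469238,
         -0.04022763, 0.0061893, 0.06984028, 0.01007433],
        [0.06460024, 0.03247342, -0.02679601, 0.02474649,
         -0.04013934, 0.00621432, 0.06983399, 0.01009427],
    ],
    dtype=np.float32,
)


def near_duplicate_rows(X, ref=0, atol=1e-3):
    """Return indices of rows (other than ref) whose max abs distance to row ref is <= atol.

    Hypothetical helper for this report; the tolerance is an assumption.
    """
    dist = np.max(np.abs(X - X[ref]), axis=1)
    return [i for i, d in enumerate(dist) if i != ref and d <= atol]


# Rows that sit suspiciously close to Row 0.
suspects = near_duplicate_rows(X_obtained)
# Separate check: are Row 0 and Row 1 exactly equal, element for element?
exactly_equal = bool(np.array_equal(X_obtained[0], X_obtained[1]))
print(suspects, exactly_equal)
```

On the logged values this flags both Row 1 and Row 2 as close to Row 0. Row 2 being close looks legitimate, since its expected first element in the log (0.06460023...) is also close to Row 0, whereas Row 1 is expected to start with -1.23886978.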
Full Error Log
The assertion failure below shows that the obtained Row 1 (first element 0.0646193) nearly duplicates Row 0 (first element 0.06460965), whereas the expected value for that element of Row 1 is -1.2388697.
_______________________________________________________ test_encode ________________________________________________________
def test_encode(tmpdir):
...
X_emb_pred_B1 = np.load(str(emb_path_B1))
> assert X_emb_pred_B1 == approx(X_emb_pred, abs=1e-6)
E assert array([[ 0.06460965, 0.03247809, -0.02677868, 0.02476774, -0.04013923,
E 0.0062143 , 0.06983176, 0.01013092],
E [ 0.0646193 , 0.03247118, -0.026836 , 0.02469238, -0.04022763,
E 0.0061893 , 0.06984028, 0.01007433],
E [ 0.06460024, 0.03247342, -0.02679601, 0.02474649, -0.04013934,
E 0.00621432, 0.06983399, 0.01009427]], dtype=float32) == approx([[0.06460965...], [-1.23886978...], [0.06460023...]])
E
E comparison failed. Mismatched elements: 8 / 24:
E Max absolute difference: 2.264845132827759
E Max relative difference: 91.10012817382812
E
E Index | Obtained | Expected
E (0, 0) | 0.064609654 | 0.064609654 ± 1.0e-06 <-- [REFERENCE ROW 0]
E (1, 0) | 0.064619295 | -1.2388697862 ± 1.0e-06 <-- [ERROR: Duplicate of Row 0]
E (1, 1) | 0.03247118 | 2.2973163127 ± 1.0e-06 <-- [ERROR: Duplicate of Row 0]
E (1, 2) | -0.026836 | -0.1875699758 ± 1.0e-06
E ...
E
E Traceback (most recent call last):
E File "test/pecos/xmc/xtransformer/test_xtransformer.py", line 202, in test_encode
E assert X_emb_pred_B1 == approx(X_emb_pred, abs=1e-6)
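The failing assertion encodes an invariant: predicting one instance per chunk should give the same embeddings as predicting the whole batch at once. The sketch below illustrates that invariant with a toy stand-in and is not the PECOS implementation. The names encode_in_chunks, mismatched_rows, and toy_encoder are hypothetical, and the toy encoder is a fixed row-wise map rather than the XTransformer model.

```python
import numpy as np


def encode_in_chunks(encode_fn, X, chunk_size):
    """Apply encode_fn to consecutive row chunks of X and stack the results."""
    outputs = [
        encode_fn(X[start:start + chunk_size])
        for start in range(0, X.shape[0], chunk_size)
    ]
    return np.vstack(outputs)


def mismatched_rows(obtained, expected, atol=1e-6):
    """Return indices of rows whose max absolute difference exceeds atol."""
    row_err = np.max(np.abs(obtained - expected), axis=1)
    return np.flatnonzero(row_err > atol).tolist()


# Hypothetical stand-in encoder (assumption): a fixed row-wise map,
# so each output row depends only on the matching input row.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))


def toy_encoder(X):
    return np.tanh(X @ W)


X = rng.standard_normal((3, 4))
full = toy_encoder(X)                                      # whole batch at once
chunked = encode_in_chunks(toy_encoder, X, chunk_size=1)   # one row per chunk

# Simulate the reported symptom: Row 1 overwritten with Row 0's result.
shadowed = full.copy()
shadowed[1] = full[0]

print(mismatched_rows(chunked, full), mismatched_rows(shadowed, full))
```

A per-row report like this makes it easier to see that only Row 1 is affected, which matches the "Mismatched elements: 8 / 24" line in the log (one row of eight values).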