test(TPU): relax numerical tests for TPUs + parallel TPU testing#2609

Merged
avik-pal merged 7 commits into main from ap/tpu_test_fixes on Mar 6, 2026

Conversation

avik-pal (Collaborator) commented Mar 4, 2026

avik-pal force-pushed the ap/tpu_test_fixes branch 2 times, most recently from ab2b836 to efd8382, March 4, 2026 23:39
avik-pal (Collaborator, Author) commented Mar 5, 2026

@sbrantq can you take a look at the probprog tests for TPU (https://github.com/EnzymeAD/Reactant.jl/actions/runs/22694756046/job/65798539411?pr=2609)? Most are likely just tolerance issues.
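If the failures are indeed tolerance issues, one way to relax the checks is to choose the tolerance per backend. The following is only a sketch in plain Julia: `RunningOnTPU` is defined here as a hard-coded stand-in for the flag used in the test suite, and the `tolerances` helper and its values are hypothetical, not taken from this PR.

```julia
using Test

# Stand-in for the flag used in the test suite. In the real tests it would be
# derived from the active backend; it is hard-coded so this sketch runs anywhere.
const RunningOnTPU = false

# Hypothetical helper: looser tolerances on TPU, where some operations may run
# at reduced precision. The numbers are illustrative only.
tolerances(on_tpu::Bool) = on_tpu ? (atol=1f-2, rtol=1f-2) : (atol=1f-5, rtol=1f-5)

expected = Float32[1.0, 2.0, 3.0]
computed = expected .+ 1f-6   # pretend this came from the accelerator

tol = tolerances(RunningOnTPU)
@test isapprox(computed, expected; atol=tol.atol, rtol=tol.rtol)
```

Keeping the tolerance choice in one helper means a later tightening only has to touch one place.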

@avik-pal avik-pal force-pushed the ap/tpu_test_fixes branch 4 times, most recently from 119ee80 to 9d5d92a Compare March 5, 2026 20:03
sbrantq (Member) commented Mar 5, 2026

> @sbrantq can you take a look at the probprog tests for TPU https://github.com/EnzymeAD/Reactant.jl/actions/runs/22694756046/job/65798539411?pr=2609, most are likely just tolerance issues

It looks like the JAX run used to generate the reference results falls back to CPU here, which gives different RNG results. I guess the fix would be to do the pointwise check only when Reactant is using a CPU backend (will commit a fix shortly).
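The idea of gating the pointwise check on the backend could look roughly like this. This is a sketch under assumptions, not the fix that was committed: `check_samples` is a hypothetical helper, `on_cpu` is a plain `Bool` standing in for a real backend query, and the fallback to summary statistics on other backends is one possible choice.

```julia
using Statistics, Test

# Hypothetical helper, not part of Reactant or its test suite.
# `on_cpu` would come from querying the active backend; here it is a plain Bool.
function check_samples(samples, reference; on_cpu::Bool, atol=1e-6, stat_atol=0.2)
    if on_cpu
        # Same RNG stream as the reference, so element-wise agreement is expected.
        return isapprox(samples, reference; atol=atol)
    end
    # Different RNG stream: only shape and summary statistics are comparable.
    return size(samples) == size(reference) &&
           all(isfinite, samples) &&
           abs(mean(samples) - mean(reference)) < stat_atol &&
           abs(std(samples) - std(reference)) < stat_atol
end

reference = [0.1, -0.4, 0.9, -0.7, 0.3, -0.2]
other_stream = [-0.3, 0.8, -0.6, 0.2, 0.4, -0.5]   # similar spread, different draws

@test check_samples(copy(reference), reference; on_cpu=true)
@test check_samples(other_stream, reference; on_cpu=false)
@test !check_samples(other_stream, reference; on_cpu=true)
```

The last assertion shows why the gate matters: samples from a different RNG stream fail an element-wise comparison even when they are statistically reasonable.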

avik-pal force-pushed the ap/tpu_test_fixes branch 2 times, most recently from 8c2a9c5 to 0bf22d5, March 5, 2026 21:23
avik-pal marked this pull request as ready for review, March 5, 2026 21:46
avik-pal requested a review from wsmoses, March 5, 2026 21:46
avik-pal changed the title from "test(TPU): relax numerical tests for TPUs" to "test(TPU): relax numerical tests for TPUs + parallel TPU testing", Mar 5, 2026
    end
else
    if RunningOnTPU
        @warn "Skipping MultiRotate test on TPU"
Member

What crashes here? Will investigate concurrently.

avik-pal force-pushed the ap/tpu_test_fixes branch 2 times, most recently from fd1ad4f to 9eda871, March 6, 2026 00:56
avik-pal force-pushed the ap/tpu_test_fixes branch from 50c6653 to 4da2f49, March 6, 2026 01:32
avik-pal (Collaborator, Author) commented Mar 6, 2026

@sbrantq should we disable these on TPU? This seems like an XLA internal bug:

│ F0306 03:13:48.227292   15978 shape_util.cc:1214] Check failed: return_shape->IsTuple() Invalid index {0} for shape u32[2,2]{1,0}
│ *** Check failure stack trace: ***
│     @     0x7a3282749064  absl::log_internal::LogMessage::SendToLog()
│     @     0x7a3282749018  absl::log_internal::LogMessage::Flush()
│     @     0x7a32820fa4d4  xla::ShapeUtil::GetSubshape()
│     @     0x7a327527bdfb  xla::ShapeTree<>::CopySubtreeFrom()
│     @     0x7a327527a843  xla::HloReplicationAnalysis::ComputeHloReplicationOnComputation()
│     @     0x7a327527b1da  xla::HloReplicationAnalysis::ComputeHloReplicationOnComputation()
│     @     0x7a3275279d60  xla::HloReplicationAnalysis::ComputeHloReplicationOnComputation()
│     @     0x7a3275279d60  xla::HloReplicationAnalysis::ComputeHloReplicationOnComputation()
│     @     0x7a3275279d60  xla::HloReplicationAnalysis::ComputeHloReplicationOnComputation()
│     @     0x7a327527bf3a  xla::HloReplicationAnalysis::ComputeHloReplication()
│     @     0x7a327527e33b  xla::HloReplicationAnalysis::Run()
│     @     0x7a327527e245  xla::HloReplicationAnalysis::Run()
│     @     0x7a327523ce75  xla::AllReduceSimplifier::RunImpl()
│     @     0x7a327e7320e5  xla::HloPassPipeline::RunPassesInternal<>()
│     @     0x7a327e73192f  xla::HloPassPipeline::RunImpl()
│     @     0x7a3273223fca  xla::HloPassFix<>::RunOnChangedComputationsOnce()
│     @     0x7a3273223681  xla::HloPassFix<>::RunToFixPoint()
│     @     0x7a3273223260  xla::HloPassFix<>::RunImpl()
│     @     0x7a327e7320e5  xla::HloPassPipeline::RunPassesInternal<>()
│     @     0x7a327e73192f  xla::HloPassPipeline::RunImpl()
│     @     0x7a32732196c9  xla::jellyfish::(anonymous namespace)::HloOptimizeThroughLayoutAssignment()::$_0::operator()()
│     @     0x7a327df59c93  absl::internal_any_invocable::LocalInvoker<>()
│     @     0x7a32825a42d6  Thread::ThreadBody()
│     @     0x7a39edd17aa4  (unknown)
│     @     0x7a39edda4c6c  (unknown)
│ https://symbolize.stripped_domain/r/?trace=7a3282749063,7a3282749017,7a32820fa4d3,7a327527bdfa,7a327527a842,7a327527b1d9,7a3275279d5f,7a3275279d5f,7a3275279d5f,7a327527bf39,7a327527e33a,7a327527e244,7a327523ce74,7a327e7320e4,7a327e73192e,7a3273223fc9,7a3273223680,7a327322325f,7a327e7320e4,7a327e73192e,7a32732196c8,7a327df59c92,7a32825a42d5,7a39edd17aa3,7a39edda4c6b&map= 
│ 
│ [13014] signal 6 (-6): Aborted
│ in expression starting at /__w/Reactant.jl/Reactant.jl/test/probprog/mcmc_logpdf.jl:38
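Because the failure above is a process abort (signal 6) rather than a Julia exception, `@test_broken` or a `try`/`catch` around the test would not help: the expression would still be evaluated and take the test process down with it. Disabling on TPU therefore means not running the code at all. A minimal sketch, assuming a hard-coded `RunningOnTPU` stand-in and a hypothetical `run_mcmc_logpdf_checks` placeholder for the real test body:

```julia
using Test

# Stand-in for the flag used in the test suite; hard-coded for illustration.
const RunningOnTPU = true

# Hypothetical placeholder for the real test body.
run_mcmc_logpdf_checks() = 1 + 1 == 2

@testset "mcmc_logpdf (sketch)" begin
    if RunningOnTPU
        @warn "Skipping mcmc_logpdf tests on TPU (XLA check failure during compilation)"
        # @test_skip records the test as broken without evaluating the expression.
        @test_skip run_mcmc_logpdf_checks()
    else
        @test run_mcmc_logpdf_checks()
    end
end
```

Using `@test_skip` keeps the skipped test visible in the summary, which makes it easier to remember to re-enable it once the upstream issue is resolved.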

avik-pal force-pushed the ap/tpu_test_fixes branch from 8dfb352 to 1ac45b8, March 6, 2026 14:45
avik-pal merged commit 49642d2 into main, Mar 6, 2026
115 of 125 checks passed
avik-pal deleted the ap/tpu_test_fixes branch, March 6, 2026 17:41
4 participants