@mhucka (Member) commented Nov 30, 2024

The aim of this PR is to stop the current CI workflow failures in ci.yaml and cirq_compatibility.yaml (both in .github/workflows/).

In the case of the failure in ci.yaml, the change is only a stopgap measure: it disables the address sanitizer (ASAN) tests. The failures appear once the workflow runners are updated from Ubuntu 16.04 to 20.04; this update is necessary because GitHub no longer offers Ubuntu 16.04 runners.

After spending a ridiculous amount of time testing various combinations of TensorFlow, TensorFlow Quantum (TFQ), and compiler toolchains on a more recent Linux, my conclusion is that the ASAN failures stem from differences between the toolchain used to produce the copy of TensorFlow 2.15.0 we get from PyPI and the toolchain we get under Ubuntu 20.04 when compiling TFQ on GitHub. This conclusion comes from the fact that if I build a local copy of TensorFlow 2.15.0, and then build TFQ against that, using Clang for everything, the ASAN failures go away.

Given that we can't build TensorFlow as part of this workflow (it takes 2 hours to build using 24 cores on a fast machine), it's not clear what can be done to resolve the ASAN failures correctly. So, I'm temporarily commenting out the leak tests in this workflow so that we can proceed with other updates and release a new version of TFQ. However, this needs to be revisited at some point.
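
For illustration only, here is a minimal sketch of what this kind of stopgap could look like in a GitHub Actions workflow. The job name, step name, and checkout action version are hypothetical and not copied from the repository; only the runner change and the `scripts/msan_test.sh` path come from this PR's description and commit messages.

```yaml
# Hypothetical excerpt of .github/workflows/ci.yaml.
# Job and step names are illustrative assumptions, not the repository's own.
jobs:
  leak-tests:
    # Previously ubuntu-16.04, which GitHub no longer offers.
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v4
      # Temporarily disabled: ASAN fails when TFQ is built against the
      # PyPI copy of TensorFlow 2.15.0. Revisit once the toolchain
      # mismatch is resolved.
      # - name: Address sanitizer tests
      #   run: ./scripts/msan_test.sh
```

Leaving the step commented out rather than deleting it keeps the intended test visible in the workflow file, which should make it easier to restore later.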

Ubuntu 16.04 is no longer supported by GitHub. Updated the runner to
use Ubuntu 20.04.

The current failures in the Cirq compatibility CI workflow are limited
to the Address Sanitizer (ASAN) tests in `scripts/msan_test.sh`. They
started happening only when we updated the version of Linux used by
the workflow from Ubuntu 16.04 to 20.04, because GitHub no longer
offers the Ubuntu 16 runners.

After spending a ridiculous amount of time testing various
combinations of TensorFlow, TensorFlow Quantum, and compiler
toolchains on a more recent Linux, my conclusion is that the ASAN
failures stem from differences in the toolchains used to produce the
copy of TensorFlow 2.15.0 we get from PyPI, and the current toolchain
used to compile TFQ on GitHub. This conclusion comes from the fact
that if I build a local copy of TensorFlow, and then build TFQ against
that, using Clang for everything, the ASAN failures go away.

Given that we can't build TensorFlow as part of this workflow (it
takes 2 hours to build using 24 cores on a fast machine), it's not
clear what can be done to stop the ASAN failures.

I'm temporarily commenting out the leak tests in this workflow so that
we can proceed with other updates and release a new version of
TFQ. However, this needs to be revisited at some point.
@mhucka mhucka self-assigned this Nov 30, 2024
@mhucka mhucka marked this pull request as ready for review November 30, 2024 23:30
@MichaelBroughton MichaelBroughton merged commit 605d282 into tensorflow:master Dec 2, 2024
6 checks passed
@mhucka mhucka deleted the mhucka-stopgap-ci-fixes branch December 3, 2024 23:51
@mhucka mhucka restored the mhucka-stopgap-ci-fixes branch December 4, 2024 00:04
@mhucka mhucka added the area/devops Involves build systems, Make files, Bazel files, continuous integration, and/or other DevOps topics label Dec 4, 2024