
Add --experimental_check_external_other_files option #25957


Closed
wants to merge 2 commits into from


hanwen-flow
Contributor

@hanwen-flow hanwen-flow commented Apr 28, 2025

This option disables the code path that checks EXTERNAL_OTHER files.
These are files outside of roots that Bazel tracks; they occur when
outputs/sources/repos symlink files from outside directories.

This is useful for process migration scenarios using containers. Here
ctime and inode numbers change from under the bazel server without
file content changes.

Addresses #25954.

RELNOTES: File change checks for non-output, non-repo external files
can now be disabled with the
--experimental_check_external_other_files flag.
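
To make the scenario concrete, here is a minimal Python sketch (not Bazel code) of how a file can get a new identity while its content stays the same. It assumes a POSIX filesystem. The helper names `identity`, `digest`, and `simulate_migration` are made up for this illustration, and replacing the file with a byte-identical copy only stands in for what a container restore does to inode numbers and ctime.

```python
import hashlib
import os
import shutil
import tempfile


def identity(path):
    """Return the (inode, ctime) pair that a metadata-based change check would compare."""
    st = os.stat(path)
    return (st.st_ino, st.st_ctime_ns)


def digest(path):
    """Return the SHA-256 hex digest of the file content."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def simulate_migration(path):
    """Replace `path` with a byte-identical copy, so it gets a new inode.

    Returns (identity_changed, content_changed).
    """
    before_id, before_hash = identity(path), digest(path)
    copy = path + ".copy"
    shutil.copyfile(path, copy)  # a new file, hence a new inode
    os.replace(copy, path)       # swap the copy into place
    return identity(path) != before_id, digest(path) != before_hash


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tool = os.path.join(tmp, "gcc")
        with open(tool, "w") as f:
            f.write("pretend compiler\n")
        print(simulate_migration(tool))
```

A check that only compares the metadata pair reports such a file as changed, even though a checksum would show it is not.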

@github-actions github-actions bot added the awaiting-review PR is awaiting review from an assigned reviewer label Apr 28, 2025
@meisterT meisterT added the team-Core Skyframe, bazel query, BEP, options parsing, bazelrc label Apr 28, 2025
@meteorcloudy meteorcloudy requested a review from lberki May 7, 2025 08:37
@meteorcloudy
Member

@lberki Can you take a look?

@lberki
Contributor

lberki commented May 8, 2025

The change itself looks reasonable, although it's lacking a test case. Mind adding one?

But before that: In what scenario is this problematic? As far as I can tell, this can only make a difference when the files underneath actually changed, and even then, Bazel would eventually checksum them and realize that they haven't.

I think it's very reasonable not to check the files under the install base (after all, they are already checked for changes by the Bazel launcher as you point it out), but if e.g. someone symlinks a file from /usr/bin to an external repository and that file changes, it makes total sense for Bazel to notice that change, doesn't it?

@lberki lberki requested a review from Wyverald May 8, 2025 09:10
@lberki
Contributor

lberki commented May 8, 2025

Also adding @Wyverald to the review thread since it's been a long time since I looked at external repositories.

@hanwen-flow
Contributor Author

> Mind adding one?

not at all; i'll look into it.

> In what scenario is this problematic?

If, as you suggested some months ago, one uses CRIU for freezing bazel, and then revives the server on a new machine, the files have new identities. In particular the ctime and inode number are not under control of Docker/CRI-O/Podman, so they always change. This has no effect on correctness (the files are unchanged and hash to the same checksum), but it causes some invalidations to occur. In our test repo (30k source files, 20k actions), this makes a null build go from 0.5s to 30s.

You can try this out for yourself, by hacking the launcher to ignore mtime changes, and then touching embedded_tools on a live server.

Maybe these bogus invalidations should be handled more cleverly (are they hashed with sha256 as well, or handled differently?), but diagnosing this goes beyond my Bazel-fu. I tried reading the commits that introduced this and couldn't make sense of how this works, or why the check happens in an else, rather than being together with the rest of the checks.

> it's very reasonable not to check the files under the install base (after all, they are already checked for changes by the Bazel launcher as you point it out)

if you can give me a hint where these checks are scheduled, I can send a separate PR to disable it. Then we can change the "non-output-external-file" name to a more descriptive "host-file".

> but if e.g. someone symlinks a file from /usr/bin to an external repository and that file changes, it makes total sense for Bazel to notice that change, doesn't it?

I agree.

@Wyverald Wyverald requested a review from haxorz May 8, 2025 14:39
@Wyverald
Member

Wyverald commented May 8, 2025

If possible, I'd like @haxorz to take a look too, as he's reviewed some previous changes in this area. I admit I understand all the external/external-repo file checking logic very poorly, despite having worked on external dependencies for a few years.

Just regarding the PR itself, it seems okay to me to get it submitted since it's guarded behind a flag and all. I'd love to understand this part of the code better.

@haxorz haxorz added the team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. label May 8, 2025
@haxorz
Contributor

haxorz commented May 8, 2025

Sorry but I probably won't have time to do a deep review for a few weeks.

@lberki
Contributor

lberki commented May 8, 2025

@hanwen-flow any ideas as to how Bazel should tell the "CRIU snapshot got revived, now everything is different" case apart from the "/usr/bin/ls really changed" one? AFAICT Bazel sees the exact same thing in both cases: a changed inode in the file system. I suspect that the checksumming happens only as part of checking the local action cache, which would inflict a Skyframe invalidation upon you.

If there is no way to do so, the command line flag would only turn one kind of badness (slow startup) into another kind (incorrectness) and Bazel's ethos is that we'd rather be correct than fast if both at the same time is completely impossible.

@hanwen-flow
Contributor Author

hanwen-flow commented May 8, 2025

> any ideas as to how Bazel should tell the "CRIU snapshot got revived, now everything is different" case apart from the "/usr/bin/ls really changed" one?

Bazel can't; that is exactly why we need the flag, because we can tell the cases apart and tell bazel what is going on. We want to use CRIU to provide 'warm bazel' CI infrastructure. In that case, we know that we are reviving a CRIU snapshot, so we know that all the file changes are bogus. We also know that the file system is not changed, because in a hermetic CI run, we only change the source under test and leave everything else constant.

I would not recommend using this flag for interactive use (ie. where a user is able to change /usr/bin/ls). In fact, it's an undocumented flag.

> If there is no way to do so, the command line flag would only turn one kind of badness (slow startup) into another kind (incorrectness) and Bazel's ethos is that we'd rather be correct than fast if both at the same time is completely impossible.

How does this reasoning work for the other flags? (--experimental_check_external_repository_files=false, --experimental_check_output_files=false) ?

@lberki
Contributor

lberki commented May 8, 2025

> > any ideas as to how Bazel should tell the "CRIU snapshot got revived, now everything is different" case apart from the "/usr/bin/ls really changed" one?

> Bazel can't; that is exactly why we need the flag, because we can tell the cases apart and tell bazel what is going on. We want to use CRIU to provide 'warm bazel' CI infrastructure. In that case, we know that we are reviving a CRIU snapshot, so we know that all the file changes are bogus. We also know that the file system is not changed, because in a hermetic CI run, we only change the source under test and leave everything else constant.

Wouldn't this change just postpone the cost to the second Bazel invocation? (presumably you'd only pass this flag to the first invocation after the CRIU restore?)

> I would not recommend using this flag for interactive use (ie. where a user is able to change /usr/bin/ls). In fact, it's an undocumented flag.

> > If there is no way to do so, the command line flag would only turn one kind of badness (slow startup) into another kind (incorrectness) and Bazel's ethos is that we'd rather be correct than fast if both at the same time is completely impossible.

> How does this reasoning work for the other flags? (--experimental_check_external_repository_files=false, --experimental_check_output_files=false) ?

It doesn't, but surely you understand why I'm trying to minimize the number of such holes. Now that you mention it, wouldn't --experimental_check_external_repository_files=false work for your use case as well as what you propose?

@hanwen-flow
Contributor Author

> Wouldn't this change just postpone the cost to the second Bazel invocation? (presumably you'd only pass this flag to the first invocation after the CRIU restore?)

In case of CI, there is no second invocation. Once we get the test results out to github/buildkite/etc., the container is terminated, and Bazel passes away, blissfully unaware of the cruel, chaotic world around it. :-)

> Now that you mention it, wouldn't --experimental_check_external_repository_files=false work for your use case as well as what you propose?

No, it's a separate code path. See

and the if that goes with the else.

@lberki
Contributor

lberki commented May 12, 2025

Let me see if I understand this correctly:

  • If the condition of the if branch in line 3650 is true, it means that the knowledge of ExternalFilesKnowledge is not usable, and Bazel regenerates the whole thing
  • If the condition of the branch in line 3734 is true, it means that Bazel uses the knowledge in ExternalFilesKnowledge to verify what changed
  • If neither is true, no external file has changed; that's the happy path, and that's what you want

From this, it looks like what you practically want is --noexperimental_check_external_repository_files to actually not check for external repository files and you are adding a new flag for it. But then why not fix --noexperimental_check_external_repository_files? Or do you want to single out files under the install base and still want to check other files in external repositories, files in external repositories that are symlinks to the output tree, etc.? (if so, the code doesn't seem to be doing that)

@hanwen-flow
Contributor Author

> From this, it looks like what you practically want is --noexperimental_check_external_repository_files to actually not check for external repository files and you are adding a new flag for it. But then why not fix --noexperimental_check_external_repository_files?

I want to disable all file dirtiness checks based on (ctime + inode number); your suggestion here also works for me.

Let me rephrase how I understand your explanation to make sure we are on the same page:

This piece of code does a two-step check, where the first if gathers data to use in the else if branch for the next run (this could use some more commenting). The host dependencies and embedded_tools are considered a special flavor of external repository files, so any of their dirtiness checks should be controlled with the existing flag.

(I suppose the checks for embedded_tools paths are just generated as bazel derefs symlinks in the toolchain setup, which for Java uses the embedded tools by default, and for C++ uses /usr/bin/gcc and friends. There is probably no separate call to schedule dirtiness checks for embedded_tools as a whole)

The name "non-output external file" remains a mystery to me.

@lberki
Contributor

lberki commented May 14, 2025

At least I know the answer to the mystery of "non-output external file": it's files that are not under a --package_path entry, not under an external repository under $OUTPUT_BASE/external and not in the output tree. IOW, it's a file Bazel doesn't have control over (/usr/bin/gcc, $HOME/.bashrc and the like). It's not very good naming, I'm afraid.
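
The definition above can be sketched as a small path classifier. This is an illustrative Python model, not Bazel's implementation: the function names, the string labels, and the purely lexical path comparison are all assumptions made for the example, and the output tree location is simply passed in as a parameter.

```python
from pathlib import PurePosixPath


def is_under(path, root):
    """True if `path` equals `root` or lies below it (purely lexical check)."""
    return path.parts[: len(root.parts)] == root.parts


def classify(path, package_roots, output_base, output_tree):
    """Classify an absolute path following the definition in the comment above.

    The labels are illustrative, not the names used in Bazel's source.
    """
    p = PurePosixPath(path)
    if is_under(p, PurePosixPath(output_base) / "external"):
        return "external_repo"
    if is_under(p, PurePosixPath(output_tree)):
        return "output"
    if any(is_under(p, PurePosixPath(r)) for r in package_roots):
        return "source"
    # None of the above: a file Bazel has no control over,
    # such as /usr/bin/gcc or $HOME/.bashrc.
    return "external_other"
```

For example, with a hypothetical package root `/home/ci/project` and output base `/home/ci/.cache/bazel/base`, `classify("/usr/bin/gcc", ...)` falls through every check and ends up in the last category, which is the one this PR's flag is about.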

So you want to disable all inode/ctime-based checks everywhere, including in the source tree(s), external repositories and what you call "host files"? But then what would Bazel do? Are you planning to change the top-level targets or the configuration flags? (if not, nothing changes and then it would be a null build) Or did I misunderstand you and you still want the checks in the source tree to be done? (but then how come those would retain their inode number / ctime?)

I stared at that code a bit, including with a debugger, and it looks like my initial assessment was wrong: the first if (in line 3650, with the many || operators in its condition) triggers if there is any file which is missing diff awareness information, which is almost always.

@hanwen-flow
Contributor Author

> So you want to disable all inode/ctime-based checks everywhere, including in the source tree(s), external repositories and what you call "host files"? But then what would Bazel do?

IIUC, inotify (--watchfs) will report changed files? But yes, good point, I should check that.

@lberki
Contributor

lberki commented May 14, 2025

> > So you want to disable all inode/ctime-based checks everywhere, including in the source tree(s), external repositories and what you call "host files"? But then what would Bazel do?

> IIUC, inotify (--watchfs) will report changed files? But yes, good point, I should check that.

Yeah, --watchfs and --noexperimental_check_external_repository_files together make the condition false.

(Have you seen my above question as to what you expect Bazel to do after you disable all ctime/inode based invalidation? My understanding is that as long as you don't change the command line, the build will be a no-op)

@hanwen-flow
Contributor Author

In case of CI, fetching and checking out the git commit under test will generate a write under the source tree, which would trigger the inotify based diff awareness, leading to a partial invalidation. If there is no write in the source tree, then obviously the build should be a null build.

@lberki
Contributor

lberki commented May 14, 2025

Okay, so your game plan is:

  • Disable ctime/inode number based invalidation for files under $OUTPUT_BASE/external, "host files", source files and the output tree
  • Rely on inotify and --watchfs to invalidate the source tree

Did I get this right?

@hanwen-flow
Contributor Author

Almost: if watchfs generates a reliable stream of diffs, I don't need to disable ctime/ino invalidation for source files.
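
As a rough sketch of what this plan could look like on the CI side, the Python helper below assembles a Bazel command line for the single build that runs after a CRIU restore. The function name `restored_ci_command` and the `//...` target are hypothetical. The flags are the ones named in this thread, written in Bazel's `--no<flag>` form for booleans; `--noexperimental_check_external_other_files` assumes the flag ships under the name in this PR's title, and including `--noexperimental_check_output_files` is an assumption based on the plan listing the output tree.

```python
def restored_ci_command(targets):
    """Build the argv for the one build that runs after a snapshot restore.

    Source files keep their normal checks; changes there are expected to be
    reported through --watchfs.
    """
    return [
        "bazel",
        "build",
        "--watchfs",
        "--noexperimental_check_external_repository_files",
        "--noexperimental_check_output_files",
        # Assumed name, taken from this PR's title.
        "--noexperimental_check_external_other_files",
        *targets,
    ]


if __name__ == "__main__":
    print(" ".join(restored_ci_command(["//..."])))
```

Such a command line only makes sense where the environment outside the source tree is known to be unchanged, as in the hermetic CI setup described earlier in the thread.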

@lberki
Contributor

lberki commented May 14, 2025

> Almost: if watchfs generates a reliable stream of diffs, I don't need to disable ctime/ino invalidation for source files.

If you can't rely on watchfs, you are in trouble anyway, because then you can't tell apart source files that really changed from those whose ctime/inode number changed due to CRIU, and then you only have the choice between "all source files changed" and "no-op build".

Given this and the semantics of the existing flags, it looks like --experimental_check_non_output_external_files is indeed the best option.

I propose that you add a test case and then I review it?

(cc @meteorcloudy and @Wyverald in case they disagree with this plan)

@hanwen-flow
Contributor Author

> If you can't rely on watchfs, you are in trouble anyway, because then you can't tell apart source files that really changed from those whose ctime/inode number changed due to CRIU, and then you only have the choice between "all source files changed" and "no-op build".

not all hope is lost. Containers can control mtime, so the updated mtime timestamps are an indication that something changed, but I agree we'd need further hacks to make bazel work with a container-appropriate idea of file identity.

@hanwen-flow hanwen-flow force-pushed the nonoutput branch 2 times, most recently from dc0b468 to 1e2a8ea Compare May 19, 2025 08:48
@hanwen-flow
Contributor Author

PTAL.

Added a test.

I've added a separate commit that more clearly describes what these files are. Hope it's OK to change the ordering of the enum.

I think it would be good to improve the naming, though, because by adding a command-line option, we are setting the name in stone.

Current name: "non output external files". Some ideas:

  • non-bazel managed files
  • external files (actually the best description, but unfortunately we have "external" repositories as well)
  • host files

WDYT ?

@meteorcloudy
Member

@Wyverald has been looking at the relevant code recently, please wait for his review.

@Wyverald
Member

Wyverald commented May 19, 2025

> Let me rephrase how I understand your explanation to make sure we are on the same page:
>
> This piece of code does a two-step check, where the first if gathers data to use in the else if branch for the next run (this could use some more commenting).

This part seems incorrect. The difference between the two branches is simply whether we try to check dirtiness of all known values (if branch, line 3719) or only specific "seen" external files in the previous evaluation (else if branch, line 3760). The if branch doesn't pass information to the else if branch beyond noting which external files it saw.

I'm still reading through this code (and environs). Note to any interested parties -- the previous PR that introduced --experimental_check_external_repository_files is at #14404 and has a lot more background to go through. The more I read, the more I feel like our external file tracking code is just a tangled web of historical accidents that hasn't received much attention since it's not exercised inside Google. Maybe we need to think about this part more carefully given the repo contents cache suddenly turns a lot of EXTERNAL_REPO files into EXTERNAL files.

(That's not to say that this specific PR is gated on me figuring everything out -- I'm still happy to get it submitted, provided that we recognize it's an experimental flag and could go away with little notice.)

@hanwen-flow
Contributor Author

> Maybe we need to think about this part more carefully given the repo contents cache suddenly turns a lot of EXTERNAL_REPO files into EXTERNAL files.

that is interesting; that means that my comment is misleading. In the case of a write to the repo contents cache (eg. dependency is upgraded through a normal bzlmod update), would the update be noticed correctly? Or do the mechanics depend on a write to the cache triggering an update to EXTERNAL files, which causes further invalidation? That would be concerning.

@Wyverald
Member

> > Maybe we need to think about this part more carefully given the repo contents cache suddenly turns a lot of EXTERNAL_REPO files into EXTERNAL files.

> that is interesting; that means that my comment is misleading.

Which comment are you referring to?

> In the case of a write to the repo contents cache (eg. dependency is upgraded through a normal bzlmod update), would the update be noticed correctly? Or do the mechanics depend on a write to the cache triggering an update to EXTERNAL files, which causes further invalidation? That would be concerning.

The repo contents cache as currently designed/implemented is "append-only", so there shouldn't be any need to worry here (IIUC, this PR only changes the behavior of detecting file changes, not additions). For a little bit more context: when you upgrade a bazel_dep, for example, the repo definition would change, which would make it point to a different repo contents cache entry instead of overwriting an existing one.
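
The "append-only" property can be illustrated with a toy model in Python. This is not how Bazel stores repo contents; the class `AppendOnlyRepoCache`, its methods, and the use of a SHA-256 of the definition string as the key are all invented for the sketch. It only shows why a changed repo definition leads to a new entry rather than a modified one.

```python
import hashlib


class AppendOnlyRepoCache:
    """Toy model: entries are keyed by a hash of the repo definition and never overwritten."""

    def __init__(self):
        self._entries = {}

    def key_for(self, repo_definition):
        """Derive the cache key from the repo definition text."""
        return hashlib.sha256(repo_definition.encode()).hexdigest()

    def fetch(self, repo_definition, fetcher):
        """Return (key, contents), calling `fetcher` only if the entry is missing."""
        key = self.key_for(repo_definition)
        if key not in self._entries:  # only ever add, never modify
            self._entries[key] = fetcher(repo_definition)
        return key, self._entries[key]
```

In this model, fetching a hypothetical `foo@1.0` and later `foo@1.1` yields two distinct keys, and the first entry's contents stay exactly as they were, so existing files are added to but never changed in place.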

@hanwen-flow
Contributor Author

    /**
     * None of the above. We encounter these paths when outputs, source files or external repos symlink
     * to files outside of Bazel-managed directories. For example, C compilation by the host compiler may
     * depend on /usr/bin/gcc. Bazel makes a best-effort attempt to detect changes in such files.
     */
    EXTERNAL,
  }

arguably, the repo cache is bazel managed, so the above is a bit imprecise.

> detecting file changes, not additions

I hadn't realized that additions are detected through a different path, but now that you say it, it makes sense.

@lberki
Contributor

lberki commented May 21, 2025

I was about to say "looks good to me, merge this", but then I read up on @Wyverald 's comments about the external repository cache.

I wouldn't feel too bad about renaming an experimental option (that liberty is exactly why it's experimental), but we should nevertheless make an effort to come up with a good nomenclature. How about calling these "foreign" files? "external" is way too overloaded. That word still conflicts with rules_foreign_cc, but arguably it's the very same concept: files outside of the domain of knowledge of Bazel.

@Wyverald what do you mean by "not detecting additions"? My understanding is that Bazel still lists directories and dirties Skyframe if it detects an additional file.

Since this whole flag is about telling Bazel "trust me, you don't need to look there", the usual "correctness first" ethos doesn't apply here. And from my understanding of @hanwen-flow 's use case, it looks like all mtimes and inode numbers can change, including those of the true repository cache, so he presumably would want to turn this check off for the true repository cache, too.

So I propose that we rename EXTERNAL to FOREIGN, and maybe add another enum member for the true repository cache but either way, treat them the same way as far as this flag is concerned.

@Wyverald I agree with your "tangled web of historical accidents" assessment :(

@hanwen-flow
Contributor Author

"foreign" is excellent.

yes, we disable ctime/inode checks for the repository cache too.

"not detecting additions"?

I meant: in our scenario, Bazel cannot get confused for additions, because there is no previous ino/ctime to compare to.

@Wyverald
Member

I actually have a slight preference for keeping the name "external", not least because everything around it is called "external" (eg. ExternalFilesHelper). idk, maybe EXTERNAL_OTHER, to contrast with EXTERNAL_REPO. The "non-output" part of the flag name should really be "non-output non-repo", but that's a mouthful.

That's a long way of saying I don't have a great idea :P but I'm still happy to get this merged, pending some code comment fixes as you noted. Just waiting for Lukács's approval now.

@hanwen-flow
Contributor Author

> idk, maybe EXTERNAL_OTHER, to contrast with EXTERNAL_REPO.

works for me.

Clarify what these files are, and rename nonOutputExternal to
ExternalOther throughout.

RELNOTES: n/a
@hanwen-flow
Contributor Author

I opened #26121 for the rename, which should not be squashed with the functional change.

This option disables the code path that checks EXTERNAL_OTHER files.
These are files outside of roots that Bazel tracks; they occur when
outputs/sources/repos symlink files from outside directories.

This is useful for process migration scenarios using containers. Here
ctime and inode numbers change from under the bazel server without
file content changes.

Addresses bazelbuild#25954.

RELNOTES: File change checks for non-output, non-repo external files
can now be disabled with the
`--experimental_check_external_other_files` flag.
@lberki
Contributor

lberki commented May 22, 2025

I'm fine either way. I'll approve this change, but please either wait until @meteorcloudy has a chance to look at it (or else signal that you're impatient and then I'll ping him :) )

@lberki
Contributor

lberki commented May 22, 2025

...I mean until @Wyverald has a chance to look at it.

@Wyverald Wyverald changed the title Add --experimental_check_non_output_external_files option Add --experimental_check_external_other_files option May 22, 2025
@Wyverald
Member

I'll handle the import.

@github-actions github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label May 23, 2025