Skip to content

feat: data and pyi files in the venv #2936

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

aignas
Copy link
Collaborator

@aignas aignas commented May 27, 2025

This adds the remaining of the files into the venv and should get us
reasonably close to handling 99% of the cases.

The expected differences from this and a venv built by uv would be:

  • The RECORD files are excluded from the venvs for better cache hit
    rate in bazel, however I am not sure if we should do that for actual
    wheels that are downloaded from the internet.

Tested:

  • Building the //docs and manually checking the symlinks.
  • Ensure that the .pyi files get included.
  • Ensure that the libs get included.
  • Tests that .dist_info gets included.
  • Tests that ensure that there are files only from one version of the package.

Work towards #2156

This adds the necessary `.dist-info` files into the mix
and should get us reasonably close to handling 99% of the cases.

The expected differences from this and a `venv` built by `uv` would be:
* Shared libraries are stored in `<package>.libs` in `uv` venvs. This
  can be achieved in `rules_python` by changing the `installer` settings
  in the `wheel_installer/wheel.py#unzip` function.
* The `RECORD` files are excluded from the `venv`s for better cache hit
  rate in `bazel`, however I am not sure if we should do that for actual
  wheels that are downloaded from the internet.

Tested:
- [x] Building the `//docs` and manually checking the symlinks.
- [ ] Unit tests

Work towards bazel-contrib#2156
@aignas aignas force-pushed the exp/distinfo-venv branch from ebed371 to 9dd7589 Compare May 28, 2025 14:16
@aignas
Copy link
Collaborator Author

aignas commented May 28, 2025

I am thinking I should also link other things passed as data for completeness.

I think excluding anything outside site-packages directory for now would be wise.

@aignas aignas changed the title wip: dist-info folders in the venv feat: dist-info folders in the venv May 28, 2025
@aignas aignas changed the title feat: dist-info folders in the venv feat: data files in the venv May 28, 2025
@aignas
Copy link
Collaborator Author

aignas commented May 30, 2025

I'll push some of the changes later but I have a hard time writing a test that would fullfill some of the requirements in the PR todo list.

@aignas aignas marked this pull request as ready for review May 30, 2025 09:00
@aignas aignas requested a review from rickeylev as a code owner May 30, 2025 09:00
@aignas aignas changed the title feat: data files in the venv feat: data and pyi files in the venv May 30, 2025
Copy link
Collaborator

@rickeylev rickeylev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: "src" field name: Let's pick a better name. "src" is an established name in bazel/starlark, so I don't think it should be overloaded.

  • package
  • dist
  • link_group
  • group_name
  • group
  • group_key1 (e.g. "simple"), group_key2 (e.g. "1.0.0")
  • dist, version (simple, 1.0)

?

The basic change in logic here is associating a set of paths with a dist-info name, right? And only one of those sets of paths will be used? The idea being, given:

("foo", <foo v1 paths>)
("foo", <foo v2 paths>)

Only one of those tuples of info is used.

@aignas aignas requested review from rickeylev and groodt May 31, 2025 10:49
@@ -682,15 +682,30 @@ def _create_venv_symlinks(ctx, venv_dir_map):
def _build_link_map(entries):
# dict[str kind, dict[str rel_path, str link_to_path]]
link_map = {}

# Here we store venv paths by package
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type description makes it clear they're keyed by package, so no need to restate it in prose.

Suggested change
# Here we store venv paths by package

Comment on lines +693 to +694
# If we detect that we are adding a dist-info for an already existing package
# we need to pop all of the previous symlinks from the link_map
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the semantics to last-wins. We want first-wins so that the topological ordering has meaning.

pkg_venv_paths = pkg_map.setdefault(entry.src, {}).setdefault(entry.kind, [])
pkg_venv_paths.append(entry.venv_path)

# We overwrite duplicates by design. The dependency closer to the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above: if an entry is being overwritten, it means the first-wins semantics aren't occurring.

def _get_imports_and_venv_symlinks(ctx, semantics):
imports = depset()
venv_symlinks = depset()
if VenvsSitePackages.is_enabled(ctx):
venv_symlinks = _get_venv_symlinks(ctx)
dist_info_metadata = _get_distinfo_metadata(ctx)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move the get_distinfo_metadata() call inside the _get_venv_symlinks function. The value isn't used outside that function and that function already has the necessary inputs to call it.

package = normalize_name(package)

repo_runfiles_dirname = runfiles_root_path(ctx, dist_info_metadata.short_path).partition("/")[0]
venv_symlinks.append(VenvSymlinkEntry(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There needs to be a comment somewhere around here that the particular ordering matters -- the dist-info must come before the other symlinks to ensure things are properly respected.

venv_path = dist_info_dir,
))

for src in ctx.files.srcs + ctx.files.data:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something seems wrong here, but I can't quite put my finger on it.

Files in srcs have to have the init.py logic applied to detect namespace package boundaries to auto-detect the proper paths to create.

Files in data shouldn't be part of that logic. If the file is within a directory covered by something in srcs (i.e *.py), then there's no need to look at the data file. If the data file isn't in a directory covered by something in srcs, then it shouldn't treat a .so/.pyi/.pyc file as some sort of boundary marker.

@@ -67,6 +67,11 @@ the venv to create the path under.

A runfiles-root relative path that `venv_path` will symlink to. If `None`,
it means to not create a symlink.
""",
"src": """
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No comment on renaming this field?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants