
High CPU on large projects dominated by filesystem syscalls during import resolution (Windows) #3993

@rchiodo

Summary

On a large project (the PyTorch repo: ~2,430 modules / ~1.32M lines), a cold full-project
analysis spends the majority of its CPU in filesystem syscalls during import/module
resolution, not in type checking. On Windows this is heavily amplified because every probe
traverses the NTFS minifilter stack (Microsoft Defender + cloud-files / OneDrive filter).

I profiled a locally-built, symbolized release binary with an ETW sampling profiler (samply)
and symbolicated against the matching PDB. Sharing the breakdown plus a few mitigation ideas.

Environment

  • OS: Windows 11 (NTFS; Microsoft Defender real-time protection on; OneDrive cloud-files filter active)
  • Pyrefly: locally built target/release with debug symbols
  • Workload: PyTorch repository, ~2,430 modules, ~1,323,529 lines, ~17,760 type errors
  • Repro: pyrefly check (same engine the TSP/IDE server runs), CPU-weighted via per-sample CPU time

Methodology

  • pyrefly check --report-timings for phase/module timing
  • samply (ETW, elevated) for a full CPU sampling profile
  • Symbolicated pyrefly.exe frames against pyrefly.pdb; aggregated self/inclusive time using
    per-sample CPU deltas (so idle/blocked samples are not counted)
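
The CPU-weighted aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the `Sample` struct, its field names, and the `aggregate` function are hypothetical, and the profile is assumed to already be symbolicated into plain frame names.

```rust
use std::collections::{HashMap, HashSet};

/// One profiler sample (hypothetical shape): the call stack, outermost frame
/// first, and the CPU time the thread consumed since its previous sample.
struct Sample<'a> {
    stack: Vec<&'a str>,
    cpu_delta_us: u64,
}

/// Returns (self time, inclusive time) per frame, weighted by CPU delta
/// instead of by sample count.
fn aggregate<'a>(
    samples: &[Sample<'a>],
) -> (HashMap<&'a str, u64>, HashMap<&'a str, u64>) {
    let mut self_time = HashMap::new();
    let mut inclusive = HashMap::new();
    for sample in samples {
        // A zero delta means the thread was idle or blocked: skip it.
        if sample.cpu_delta_us == 0 {
            continue;
        }
        // Self time is charged to the innermost (leaf) frame only.
        if let Some(leaf) = sample.stack.last() {
            *self_time.entry(*leaf).or_insert(0) += sample.cpu_delta_us;
        }
        // Inclusive time is charged once to every distinct frame on the
        // stack, so recursive frames are not double counted.
        let mut seen = HashSet::new();
        for frame in &sample.stack {
            if seen.insert(*frame) {
                *inclusive.entry(*frame).or_insert(0) += sample.cpu_delta_us;
            }
        }
    }
    (self_time, inclusive)
}
```

Weighting by CPU delta rather than sample count is what keeps threads parked on a lock or waiting for work out of the percentages reported below.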

Phase timing (--report-timings, thread-time)

  • Solutions (type solving): ~28.4s (78%)
  • Answers (bindings): ~6.8s (19%)
  • Ast (parse): ~1.0s
  • Exports: ~0.1s

No single pathological module — heaviest was torch.testing._internal.common_methods_invocations
at ~1.6s; cost is a long tail spread across all modules.

Where the CPU actually goes (sampling self-time, CPU-weighted, ~94 CPU-seconds total)

  • ~57% Filesystem / kernel I/O (ntoskrnl ~28%, Ntfs.sys ~12%, ntdll ~11%, FLTMGR + minifilters)
  • ~29% Pyrefly type-checking compute (solver / bindings / types) — each function <1% self; long tail
  • ~8% Allocator churn (mimalloc: Type clone/drop, alloc/free)
  • ~2.7% Antivirus (Defender mssecflt.sys / WdFilter.sys)
  • remainder: memcpy/memset, glob matching, path UTF-16 conversion

Root cause

  • ~62% of all CPU is inclusive under import/module resolution, along the path
    LoaderFindCache::find_import -> find_import_internal -> find_one_part_in_root ->
    std::fs::metadata / File::open_native / DirEntryCache::file_exists.
  • ~96% of all filesystem self-time is under import resolution.
  • The dominant cost is per-candidate filesystem probing to locate modules (stat/open each
    candidate path across each search root), not reading file contents (parse is <1s) and not
    pure type math.
  • Amplifiers:
    • Large search path (project + typeshed + site-packages + many namespace packages) means each
      of ~2,430 modules is probed across many candidate roots.
    • On Windows each metadata/open traverses NTFS + Defender (mssecflt/WdFilter) + the
      cloud-files filter (cldflt, OneDrive), so kernel/minifilter time dominates per syscall.
  • Note: a DirEntryCache exists, yet std::fs::metadata is still ~52% inclusive — suggesting many
    probes stat per-candidate before (or instead of) consulting a cached directory listing.
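
To make the multiplication concrete, here is a simplified sketch of how one module name fans out into filesystem probes. It is an assumption-laden illustration, not Pyrefly's actual resolver (which, per the stack above, resolves one path component at a time): `candidate_paths` is a hypothetical helper, and the candidate set of `.pyi`/`.py` files plus `__init__` files is only the usual Python layout.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: every path a naive resolver might stat for `module`
/// across all search roots.
fn candidate_paths(roots: &[&Path], module: &str) -> Vec<PathBuf> {
    // "torch.nn" -> "torch/nn" (relative path built from the dotted name).
    let rel: PathBuf = module.split('.').collect();
    let mut out = Vec::new();
    for root in roots {
        let base = root.join(&rel);
        for ext in ["pyi", "py"] {
            // Module as a single file: <root>/torch/nn.pyi, <root>/torch/nn.py
            out.push(base.with_extension(ext));
            // Module as a package: <root>/torch/nn/__init__.pyi, .../__init__.py
            out.push(base.join(format!("__init__.{}", ext)));
        }
    }
    out
}
```

With three roots this already yields 12 candidate paths for one import. If each candidate costs its own metadata or open call, the probe count scales with roots × candidates × imports, which matches the shape of the profile above.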

IDE / TSP amplification

The same workspace opened in VS Code (Pylance spawning pyrefly tsp) accumulated ~1,196
CPU-seconds and ~2.35 GB before going idle, roughly 12x the one-shot CLI (94 CPU-s). This
appears to be the IDE server pulling far more modules into scope (workspace indexing for
completions/auto-import), i.e. the same bottleneck amplified by scale, not a different one.

Suggested mitigations

  1. Collapse per-candidate probing into one directory enumeration per directory. Instead of
    stat/exists on each candidate file, readdir the directory once into an in-memory set and
    answer all candidates from it. Resolving N modules in a directory then costs roughly one
    enumeration rather than N separate probes.
    (The existing DirEntryCache seems to be bypassed on the hot path.)
  2. Cache negative lookups (module-not-found per root) so the same missing candidates are not
    re-probed across many importers.
  3. Windows-specific: prefer directory enumeration (FindFirstFile/FindNextFile) over
    per-file CreateFile/stat to reduce the number of minifilter/Defender traversals.
  4. Environmental (for users, not a code fix): Defender real-time scanning and the OneDrive
    cloud-files filter measurably inflate per-syscall cost; analyzing a project outside a
    cloud-synced folder and/or excluding it from real-time scanning reduces wall time noticeably.
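
Mitigations 1 and 2 could be combined along the lines of the sketch below. This is only an illustration of the idea under stated assumptions: `DirListingCache` and its methods are hypothetical names, not the existing DirEntryCache API. Names are matched exactly (case-sensitively), and invalidation on file changes, which a long-running IDE server would need, is left out.

```rust
use std::collections::{HashMap, HashSet};
use std::ffi::{OsStr, OsString};
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical cache: one enumeration per directory answers every later
/// probe for a name in that directory, whether it hits or misses.
#[derive(Default)]
struct DirListingCache {
    /// `None` records a directory that is missing or unreadable, so that
    /// negative result is cached too.
    listings: HashMap<PathBuf, Option<HashSet<OsString>>>,
    /// Number of real directory enumerations performed (for illustration).
    enumerations: usize,
}

impl DirListingCache {
    /// Returns true if `dir` contains an entry named exactly `name`.
    fn contains(&mut self, dir: &Path, name: &str) -> bool {
        if !self.listings.contains_key(dir) {
            self.enumerations += 1;
            // Enumerate once; entries that fail to read are skipped.
            let listing = fs::read_dir(dir).ok().map(|entries| {
                entries
                    .filter_map(|entry| entry.ok())
                    .map(|entry| entry.file_name())
                    .collect::<HashSet<OsString>>()
            });
            self.listings.insert(dir.to_path_buf(), listing);
        }
        match &self.listings[dir] {
            Some(names) => names.contains(OsStr::new(name)),
            None => false,
        }
    }
}
```

A caller would ask `cache.contains(dir, "nn.pyi")`, then `cache.contains(dir, "nn.py")`, and so on. Only the first call for a given directory reaches the filesystem, and a missing candidate is answered from memory just like a present one. To my understanding, Rust's `std::fs::read_dir` is implemented on Windows via the FindFirstFile/FindNextFile family, so this would also line up with mitigation 3.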

Notes

  • Symbol names are from a symbolicated release build and may differ slightly from source.
  • Happy to share the profile artifacts or re-run with specific flags if useful.

Labels: language-server (issues specific to our IDE integration rather than type checking), pytorch
