[MAINTENANCE] dead code analyzer#11760
Conversation
✅ Deploy Preview for niobium-lead-7998 canceled.
|
for more information, see https://pre-commit.ci
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #11760 +/- ##
========================================
Coverage 84.66% 84.66%
========================================
Files 471 471
Lines 39170 39170
========================================
Hits 33165 33165
Misses 6005 6005 Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Adds a dead-code analysis utility that statically traces reachability from @public_api roots, with optional dynamic-import heuristics and test-file classification, and wires it into the repo’s Invoke tasks for easy execution.
Changes:
- Added
scripts/find_dead_code.py, an AST-based reachability analyzer (modules, symbols, dynamic-import heuristics, tests) with optional JSON output. - Added
scripts/dead_code_exceptions.jsonto configure always-reachable module patterns / suppressions. - Added an
invoke dead-codewrapper task intasks.py.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tasks.py | Adds an Invoke task that runs the dead-code analyzer script with CLI flags. |
| scripts/find_dead_code.py | Implements the multi-layer AST reachability analysis and report generation/printing. |
| scripts/dead_code_exceptions.json | Provides initial exception patterns for modules that should always be treated as reachable. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| repo_root = pathlib.Path(__file__).parent | ||
| _exit_with_error_if_not_run_from_correct_dir(task_name="dead-code", correct_dir=repo_root) | ||
|
|
||
| cmd = f"{sys.executable} scripts/find_dead_code.py --json-output {json_output} --layer {layer}" | ||
| if verbose: | ||
| cmd += " --verbose" | ||
| ctx.run(cmd, echo=True) |
There was a problem hiding this comment.
ctx.run() executes via a shell here, but json_output and layer are interpolated into the command without quoting. This breaks for paths with spaces and also allows shell injection if someone passes a crafted value. Build the command using proper argument quoting (e.g., shlex.quote) or avoid the shell by invoking the script via subprocess.run([...], check=True) from the task.
| module_body_fqn = f"{mod_name}.{self.MODULE_BODY}" | ||
| module_refs: set[str] = set() | ||
| for node in ast.iter_child_nodes(tree): | ||
| if id(node) in top_level_defs: |
There was a problem hiding this comment.
_process_module_body() skips all top-level FunctionDef/ClassDef nodes entirely, but decorators (and class bases/metaclass expressions) are evaluated at import time. This means symbol reachability can produce false positives for side-effectful decorators (e.g., great_expectations/_version.py uses @register_vcs_handler(...) to populate HANDLERS at import time). Consider including references found in decorator_list (and for classes, bases/keywords) in the module-body root traversal so import-time side effects are modeled.
| if id(node) in top_level_defs: | |
| if id(node) in top_level_defs: | |
| # Top-level defs have bodies that are not executed at import time, | |
| # but their decorators and (for classes) base/metaclass expressions | |
| # are. Traverse those expressions to capture import-time references. | |
| decorator_roots = [] | |
| if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)): | |
| decorator_roots.extend(node.decorator_list) | |
| if isinstance(node, ast.ClassDef): | |
| decorator_roots.extend(node.bases) | |
| decorator_roots.extend(kw.value for kw in node.keywords) | |
| for root in decorator_roots: | |
| for child in ast.walk(root): | |
| if isinstance(child, ast.Name): | |
| name = child.id | |
| if name in import_map: | |
| module_refs.add(import_map[name]) | |
| elif name in info.defined_symbols: | |
| module_refs.add(f"{mod_name}.{name}") |
| "ignore_patterns": [], | ||
| "notes": { | ||
| "great_expectations.compatibility.*": "Compatibility shims for optional dependencies; loaded conditionally at runtime" | ||
| } |
There was a problem hiding this comment.
This config includes a notes mapping, but DynamicImportDetector.load_exceptions() never reads it (it only emits generic "Pattern matched" notes). Either remove the unused notes field to avoid confusion, or update the loader/reporting to surface these per-pattern explanations.
| "ignore_patterns": [], | |
| "notes": { | |
| "great_expectations.compatibility.*": "Compatibility shims for optional dependencies; loaded conditionally at runtime" | |
| } | |
| "ignore_patterns": [] |
| name="dead-code", | ||
| help={ | ||
| "json_output": "Path for JSON report output (default: dead_code_report.json)", | ||
| "layer": "Which layers to run: 1 (modules), 2 (symbols), 4 (tests), all (default: all)", |
There was a problem hiding this comment.
The task help text for layer omits layer 3, but scripts/find_dead_code.py accepts --layer 3 (dynamic-import augmentation). Update the help text (and optionally validate layer against the script’s choices) or remove the unsupported option so the wrapper’s CLI is self-consistent.
| "layer": "Which layers to run: 1 (modules), 2 (symbols), 4 (tests), all (default: all)", | |
| "layer": "Which layers to run: 1 (modules), 2 (symbols), 3 (dynamic-import augmentation), 4 (tests), all (default: all)", |
|
Is this PR still relevant? If so, what is blocking it? Is there anything you can do to help move it forward? This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions 🙇 |
Summary
Adds
scripts/find_dead_code.py, an AST-based static analyzer that identifies modules, symbols, and test files not reachable from@public_apiroots. It also addsscripts/dead_code_exceptions.jsonto configure known false-positive patterns, and aninvoke dead-codetask intasks.pyas a convenience wrapper.Run it with:
How
find_dead_code.pyworksThe script performs a multi-layer reachability analysis. The core idea: anything not reachable (by import chain or symbol reference) from a
@public_api-decorated symbol is a dead code candidate.Layer 1 — Module-level reachability (
ModuleGraphBuilder)This is the primary, high-confidence layer.
Discovery: Recursively finds every
.pyfile undergreat_expectations/and converts each to a dotted module name (e.g.great_expectations/profile/base.py→great_expectations.profile.base).AST parsing: For each module, the file is parsed with Python's
astmodule. Imports insideif TYPE_CHECKING:blocks are intentionally excluded — those only exist for the type checker and don't create runtime dependencies.Graph construction: Every
import Xandfrom X import Ystatement (after resolving relative imports) becomes a directed edge: this module depends on X.from . import foois treated speculatively as possibly importing a sibling submodule namedfoo.Root identification: Root modules are any module containing at least one
@public_api-decorated symbol, plus the top-levelgreat_expectationspackage itself.BFS reachability: A breadth-first search walks the dependency graph from all roots. When a module is visited, its ancestor
__init__.pypackages are also enqueued (Python loads all parent packages when importing a submodule). Any module not reached by this BFS is a dead module — reported with high confidence.Layer 3 — Dynamic import detection (
DynamicImportDetector)Layer 3 runs before the BFS (despite the numbering) because its job is to augment the graph so the BFS is more accurate. It handles two GX-specific runtime patterns that pure import-statement analysis cannot see:
instantiate_class_from_configcalls: Scans the AST of every module for calls to this function. When aconfig_defaults={"module_name": "great_expectations.some.module"}argument is found, it adds an edge from the calling module to the target. When the target is an unrecognized string prefix, it marks all matching submodules as reachable roots.class_namestring values in config dicts: Scans config-heavy modules (types, stores, abstract data context) for dict literals containing{"class_name": "SomeName"}. Looks up which module definesSomeNameand adds an edge, preventing false-positive dead code reports for dynamically instantiated classes.Layer 3 also loads
dead_code_exceptions.json. Modules matchingalways_reachable_modulespatterns are added as roots before the BFS. This handles cases likegreat_expectations.compatibility.*, which contains optional-dependency shims loaded conditionally at runtime in ways that can't be traced statically.Layer 2 — Symbol-level reachability (
SymbolGraphBuilder)This layer operates within the already-reachable module set to find dead functions and classes (reported with medium confidence, since there are more edge cases).
Symbol graph: For each reachable module, each top-level class, function, and async function becomes a node. The edges are references: if function
Acalls functionB(as aast.Namethat resolves through the import map), there is an edgeA → B.Module body node: A synthetic
<module_body>node is added for each module, representing code that runs at import time (not inside any function or class). This node is always a root since it executes unconditionally. It adds edges to any symbols it references.Root symbols: Symbols decorated with
@public_api, names exported in__init__.py__all__lists (resolving through re-exports), and all<module_body>nodes.BFS reachability: Same BFS as Layer 1, but over the symbol graph. Symbols not reachable from any root are reported as dead. Private symbols (starting with
_) and common noise names (logger,T,P, etc.) are suppressed.Layer 4 — Test file analysis (
TestAnalyzer)Scans
tests/fortest_*.pyfiles (skipping conftest and__init__) that import dead production modules. Each file is classified as:all imports dead): every GX import in the file points to a dead module — the test file exists solely to test removed code.some imports dead): the file imports a mix of live and dead modules — individual test functions may need removal.Output
The console summary prints module/symbol/test counts, dead module paths, dead symbol locations (
filepath:line :: SymbolName), and test classifications. With--json-output, a machine-readable JSON report is written for downstream tooling (e.g. thedead-code-removalClaude skill reads this file to select removal batches).dead_code_exceptions.jsonA small config file for permanently excluding known false positives.
always_reachable_modulesaccepts glob patterns matched against all discovered module names; matched modules are injected as roots before the BFS.ignore_patternscan suppress modules from the dead-module output even if they are unreachable.tasks.pyAdds
invoke dead-codeas a thin wrapper around the script, supporting--json-output,--layer, and--verboseoptions.