[MAINTENANCE] dead code analyzer #11760

Open
joshua-stauffer wants to merge 5 commits into develop from m/_/dead_code_analyzer
Conversation

@joshua-stauffer
Member

@joshua-stauffer joshua-stauffer commented Apr 1, 2026

Summary

Adds scripts/find_dead_code.py, an AST-based static analyzer that identifies modules, symbols, and test files not reachable from @public_api roots. It also adds scripts/dead_code_exceptions.json to configure known false-positive patterns, and an invoke dead-code task in tasks.py as a convenience wrapper.

Run it with:

invoke dead-code --verbose
python scripts/find_dead_code.py --json-output dead_code_report.json --layer all --verbose

How find_dead_code.py works

The script performs a multi-layer reachability analysis. The core idea: anything not reachable (by import chain or symbol reference) from a @public_api-decorated symbol is a dead code candidate.

Layer 1 — Module-level reachability (ModuleGraphBuilder)

This is the primary, high-confidence layer.

  1. Discovery: Recursively finds every .py file under great_expectations/ and converts each to a dotted module name (e.g. great_expectations/profile/base.py → great_expectations.profile.base).

  2. AST parsing: For each module, the file is parsed with Python's ast module. Imports inside if TYPE_CHECKING: blocks are intentionally excluded — those only exist for the type checker and don't create runtime dependencies.

  3. Graph construction: Every import X and from X import Y statement (after resolving relative imports) becomes a directed edge: this module depends on X. from . import foo is treated speculatively as possibly importing a sibling submodule named foo.

  4. Root identification: Root modules are any module containing at least one @public_api-decorated symbol, plus the top-level great_expectations package itself.

  5. BFS reachability: A breadth-first search walks the dependency graph from all roots. When a module is visited, its ancestor __init__.py packages are also enqueued (Python loads all parent packages when importing a submodule). Any module not reached by this BFS is a dead module — reported with high confidence.
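
The five steps above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/find_dead_code.py: the function names runtime_imports and reachable_modules, and the pkg.* module names, are hypothetical, and root detection is assumed to have happened already.

```python
import ast
from collections import deque


def _is_type_checking_guard(node: ast.AST) -> bool:
    """True for `if TYPE_CHECKING:` or `if typing.TYPE_CHECKING:`."""
    if not isinstance(node, ast.If):
        return False
    test = node.test
    if isinstance(test, ast.Name):
        return test.id == "TYPE_CHECKING"
    return isinstance(test, ast.Attribute) and test.attr == "TYPE_CHECKING"


def runtime_imports(source: str, module: str, is_package: bool = False) -> set[str]:
    """Collect absolute module names that `source` imports at runtime."""
    found: set[str] = set()
    # A relative import of level 1 is resolved against the containing package.
    package = module.split(".") if is_package else module.split(".")[:-1]

    def visit(node: ast.AST) -> None:
        if _is_type_checking_guard(node):
            # Skip the guarded body; the else branch still runs at runtime.
            for child in node.orelse:
                visit(child)
            return
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            keep = max(0, len(package) - (node.level - 1)) if node.level else 0
            base = package[:keep]
            target = ".".join(base + (node.module.split(".") if node.module else []))
            if target:
                found.add(target)
            if node.level and node.module is None:
                # `from . import foo` may name a sibling submodule.
                found.update(f"{target}.{a.name}" for a in node.names if target)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return found


def reachable_modules(graph: dict[str, set[str]], roots: set[str]) -> set[str]:
    """Breadth-first search over module -> imported-modules edges."""
    seen: set[str] = set()
    queue = deque(root for root in roots if root in graph)
    while queue:
        mod = queue.popleft()
        if mod in seen:
            continue
        seen.add(mod)
        # Importing a.b.c also loads the packages a and a.b.
        parts = mod.split(".")
        ancestors = [".".join(parts[:i]) for i in range(1, len(parts))]
        for nxt in [*ancestors, *graph[mod]]:
            if nxt in graph and nxt not in seen:
                queue.append(nxt)
    return seen


graph = {
    "pkg": set(),
    "pkg.api": {"pkg.core"},
    "pkg.core": set(),
    "pkg.legacy": {"pkg.core"},
}
dead = sorted(set(graph) - reachable_modules(graph, roots={"pkg.api"}))
print(dead)  # ['pkg.legacy']
```

Speculative edges such as pkg.sibling are harmless in this sketch because the search only follows names that exist in the discovered module set.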

Layer 3 — Dynamic import detection (DynamicImportDetector)

Layer 3 runs before the BFS (despite the numbering) because its job is to augment the graph so the BFS is more accurate. It handles two GX-specific runtime patterns that pure import-statement analysis cannot see:

  1. instantiate_class_from_config calls: Scans the AST of every module for calls to this function. When a config_defaults={"module_name": "great_expectations.some.module"} argument is found, it adds an edge from the calling module to the target. When the target is an unrecognized string prefix, it marks all matching submodules as reachable roots.

  2. class_name string values in config dicts: Scans config-heavy modules (types, stores, abstract data context) for dict literals containing {"class_name": "SomeName"}. Looks up which module defines SomeName and adds an edge, preventing false-positive dead code reports for dynamically instantiated classes.
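
As an illustration of the first pattern, the sketch below finds literal module_name values passed through config_defaults. It is an assumption-laden simplification: the helper name config_default_modules is hypothetical, only dict literals with string constants are handled, and the prefix-matching fallback described above is not modelled.

```python
import ast


def config_default_modules(source: str) -> set[str]:
    """Return literal module_name strings passed via config_defaults={...}."""
    targets: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Accept both `instantiate_class_from_config(...)` and `x.instantiate_class_from_config(...)`.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "instantiate_class_from_config":
            continue
        for kw in node.keywords:
            if kw.arg != "config_defaults" or not isinstance(kw.value, ast.Dict):
                continue
            for key, value in zip(kw.value.keys, kw.value.values):
                if (
                    isinstance(key, ast.Constant)
                    and key.value == "module_name"
                    and isinstance(value, ast.Constant)
                    and isinstance(value.value, str)
                ):
                    targets.add(value.value)
    return targets


example = '''
store = instantiate_class_from_config(
    config=cfg,
    runtime_environment={},
    config_defaults={"module_name": "great_expectations.some.module"},
)
'''
print(config_default_modules(example))  # {'great_expectations.some.module'}
```

Each returned name would become an edge from the calling module to the target before the BFS runs.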

Layer 3 also loads dead_code_exceptions.json. Modules matching always_reachable_modules patterns are added as roots before the BFS. This handles cases like great_expectations.compatibility.*, which contains optional-dependency shims loaded conditionally at runtime in ways that can't be traced statically.

Layer 2 — Symbol-level reachability (SymbolGraphBuilder)

This layer operates within the already-reachable module set to find dead functions and classes (reported with medium confidence, since there are more edge cases).

  1. Symbol graph: For each reachable module, each top-level class, function, and async function becomes a node. The edges are references: if function A calls function B (as an ast.Name that resolves through the import map), there is an edge A → B.

  2. Module body node: A synthetic <module_body> node is added for each module, representing code that runs at import time (not inside any function or class). This node is always a root since it executes unconditionally. It adds edges to any symbols it references.

  3. Root symbols: The roots are symbols decorated with @public_api, names exported in __init__.py __all__ lists (resolving through re-exports), and all <module_body> nodes.

  4. BFS reachability: Same BFS as Layer 1, but over the symbol graph. Symbols not reachable from any root are reported as dead. Private symbols (starting with _) and common noise names (logger, T, P, etc.) are suppressed.
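
A single-module sketch of the symbol layer is shown below. It assumes no import map, no __all__ handling, and a bare @public_api decorator name; build_symbol_graph and dead_symbols are hypothetical names chosen for illustration, not the methods of SymbolGraphBuilder.

```python
import ast
from collections import deque

MODULE_BODY = "<module_body>"
DEF_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def build_symbol_graph(source: str) -> tuple[dict[str, set[str]], set[str]]:
    """Return (edges, roots) for the top-level symbols of one module."""
    tree = ast.parse(source)
    defined = {stmt.name for stmt in tree.body if isinstance(stmt, DEF_TYPES)}
    graph: dict[str, set[str]] = {name: set() for name in defined}
    graph[MODULE_BODY] = set()
    roots = {MODULE_BODY}  # import-time code always runs
    for stmt in tree.body:
        if isinstance(stmt, DEF_TYPES):
            owner = stmt.name
            if any(isinstance(d, ast.Name) and d.id == "public_api" for d in stmt.decorator_list):
                roots.add(owner)
        else:
            owner = MODULE_BODY
        # Every ast.Name that resolves to a local definition becomes an edge.
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and node.id in defined and node.id != owner:
                graph[owner].add(node.id)
    return graph, roots


def dead_symbols(graph: dict[str, set[str]], roots: set[str]) -> list[str]:
    """BFS from the roots; report unreached public names."""
    seen: set[str] = set()
    queue = deque(roots)
    while queue:
        sym = queue.popleft()
        if sym in seen:
            continue
        seen.add(sym)
        queue.extend(graph[sym] - seen)
    # Private names are suppressed, as in the description above.
    return sorted(s for s in graph if s not in seen and not s.startswith("_"))


SOURCE = '''
@public_api
def validate(x):
    return helper(x)

def helper(x):
    return x

def orphan(x):
    return helper(x)

def _private_orphan():
    pass

DEFAULT = helper(1)
'''
graph, roots = build_symbol_graph(SOURCE)
print(dead_symbols(graph, roots))  # ['orphan']
```

Note that orphan references helper, but nothing reachable references orphan, so only the outgoing direction of an edge keeps a symbol alive.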

Layer 4 — Test file analysis (TestAnalyzer)

Scans tests/ for test_*.py files (skipping conftest and __init__) that import dead production modules. Each file is classified as:

  • High confidence (all imports dead): every GX import in the file points to a dead module — the test file exists solely to test removed code.
  • Medium confidence (some imports dead): the file imports a mix of live and dead modules — individual test functions may need removal.
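
The classification rule reduces to a set comparison. The sketch below assumes the GX imports of a test file and the dead-module set have already been computed; classify_test_file and the module names are hypothetical.

```python
from typing import Optional


def classify_test_file(gx_imports: set[str], dead_modules: set[str]) -> Optional[str]:
    """Return 'high', 'medium', or None for a single test file."""
    dead = gx_imports & dead_modules
    if not dead:
        return None  # the file does not touch dead code
    # All imports dead: the whole file tests removed code.
    return "high" if dead == gx_imports else "medium"


dead_modules = {"pkg.legacy", "pkg.old_util"}
print(classify_test_file({"pkg.legacy"}, dead_modules))              # high
print(classify_test_file({"pkg.legacy", "pkg.core"}, dead_modules))  # medium
print(classify_test_file({"pkg.core"}, dead_modules))                # None
```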

Output

The console summary prints module/symbol/test counts, dead module paths, dead symbol locations (filepath:line :: SymbolName), and test classifications. With --json-output, a machine-readable JSON report is written for downstream tooling (e.g. the dead-code-removal Claude skill reads this file to select removal batches).

dead_code_exceptions.json

A small config file for permanently excluding known false positives. always_reachable_modules accepts glob patterns matched against all discovered module names; matched modules are injected as roots before the BFS. ignore_patterns can suppress modules from the dead-module output even if they are unreachable.
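
One plausible way to apply the two keys is shown below, using fnmatch-style globbing from the standard library. Whether the script uses fnmatch or another matcher is an assumption, and extra_roots and filter_dead are hypothetical helper names.

```python
import json
from fnmatch import fnmatchcase


def _matches(module: str, patterns: list[str]) -> bool:
    return any(fnmatchcase(module, pattern) for pattern in patterns)


def extra_roots(config: dict, modules: set[str]) -> set[str]:
    """Modules forced to be roots before the BFS."""
    return {m for m in modules if _matches(m, config.get("always_reachable_modules", []))}


def filter_dead(config: dict, dead: set[str]) -> set[str]:
    """Dead modules that remain in the report after suppression."""
    return {m for m in dead if not _matches(m, config.get("ignore_patterns", []))}


config = json.loads(
    '{"always_reachable_modules": ["great_expectations.compatibility.*"],'
    ' "ignore_patterns": []}'
)
modules = {
    "great_expectations.compatibility.sqlalchemy",
    "great_expectations.core.batch",
}
print(extra_roots(config, modules))
# {'great_expectations.compatibility.sqlalchemy'}
```

With fnmatch semantics, `*` also matches dots, so the pattern covers nested submodules but not the great_expectations.compatibility package itself.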

tasks.py

Adds invoke dead-code as a thin wrapper around the script, supporting --json-output, --layer, and --verbose options.

@netlify

netlify Bot commented Apr 1, 2026

Deploy Preview for niobium-lead-7998 canceled.

Name Link
🔨 Latest commit 72ee912
🔍 Latest deploy log https://app.netlify.com/projects/niobium-lead-7998/deploys/69cf722fc38ddc00086e12b9

@joshua-stauffer joshua-stauffer changed the title from "dead code analyzer" to "[MAINTENANCE] dead code analyzer" Apr 2, 2026
@codecov

codecov Bot commented Apr 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.66%. Comparing base (65ee05f) to head (72ee912).
⚠️ Report is 80 commits behind head on develop.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop   #11760   +/-   ##
========================================
  Coverage    84.66%   84.66%           
========================================
  Files          471      471           
  Lines        39170    39170           
========================================
  Hits         33165    33165           
  Misses        6005     6005           
Flag Coverage Δ
3.10 73.56% <ø> (ø)
3.10 athena ?
3.10 aws_deps ?
3.10 big ?
3.10 clickhouse ?
3.10 filesystem ?
3.10 mysql ?
3.10 openpyxl or pyarrow or project or sqlite or aws_creds ?
3.10 postgresql ?
3.10 spark ?
3.10 spark_connect ?
3.10 sql_server ?
3.10 trino ?
3.11 73.60% <ø> (ø)
3.11 athena ?
3.11 aws_deps ?
3.11 big ?
3.11 clickhouse ?
3.11 filesystem ?
3.11 mysql ?
3.11 openpyxl or pyarrow or project or sqlite or aws_creds ?
3.11 postgresql ?
3.11 spark ?
3.11 spark_connect ?
3.11 sql_server ?
3.11 trino ?
3.12 73.59% <ø> (-0.02%) ⬇️
3.12 athena ?
3.12 aws_deps ?
3.12 big ?
3.12 filesystem ?
3.12 mysql ?
3.12 openpyxl or pyarrow or project or sqlite or aws_creds ?
3.12 postgresql ?
3.12 spark ?
3.12 spark_connect ?
3.12 sql_server ?
3.12 trino ?
3.13 73.61% <ø> (+0.01%) ⬆️
3.13 athena 41.93% <ø> (ø)
3.13 aws_deps 45.18% <ø> (ø)
3.13 big 55.27% <ø> (ø)
3.13 bigquery 51.25% <ø> (ø)
3.13 clickhouse 41.94% <ø> (ø)
3.13 databricks 53.06% <ø> (ø)
3.13 filesystem 64.37% <ø> (ø)
3.13 gx-redshift 51.41% <ø> (ø)
3.13 mysql 51.81% <ø> (ø)
3.13 openpyxl or pyarrow or project or sqlite or aws_creds 59.97% <ø> (ø)
3.13 postgresql 55.22% <ø> (ø)
3.13 snowflake 53.90% <ø> (+<0.01%) ⬆️
3.13 spark 55.92% <ø> (ø)
3.13 spark_connect 46.85% <ø> (ø)
3.13 sql_server 53.23% <ø> (ø)
3.13 trino 48.75% <ø> (ø)
cloud 0.00% <ø> (ø)
docs-basic 59.52% <ø> (ø)
docs-creds-needed 58.11% <ø> (ø)
docs-spark 57.57% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Copilot AI review requested due to automatic review settings April 2, 2026 17:35
Contributor

Copilot AI left a comment

Pull request overview

Adds a dead-code analysis utility that statically traces reachability from @public_api roots, with optional dynamic-import heuristics and test-file classification, and wires it into the repo’s Invoke tasks for easy execution.

Changes:

  • Added scripts/find_dead_code.py, an AST-based reachability analyzer (modules, symbols, dynamic-import heuristics, tests) with optional JSON output.
  • Added scripts/dead_code_exceptions.json to configure always-reachable module patterns / suppressions.
  • Added an invoke dead-code wrapper task in tasks.py.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File | Description
tasks.py | Adds an Invoke task that runs the dead-code analyzer script with CLI flags.
scripts/find_dead_code.py | Implements the multi-layer AST reachability analysis and report generation/printing.
scripts/dead_code_exceptions.json | Provides initial exception patterns for modules that should always be treated as reachable.

Comment thread tasks.py
Comment on lines +769 to +775
repo_root = pathlib.Path(__file__).parent
_exit_with_error_if_not_run_from_correct_dir(task_name="dead-code", correct_dir=repo_root)

cmd = f"{sys.executable} scripts/find_dead_code.py --json-output {json_output} --layer {layer}"
if verbose:
    cmd += " --verbose"
ctx.run(cmd, echo=True)
Copilot AI Apr 2, 2026

ctx.run() executes via a shell here, but json_output and layer are interpolated into the command without quoting. This breaks for paths with spaces and also allows shell injection if someone passes a crafted value. Build the command using proper argument quoting (e.g., shlex.quote) or avoid the shell by invoking the script via subprocess.run([...], check=True) from the task.
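
A minimal sketch of the quoting approach, assuming a POSIX shell: shlex.join quotes each argument so that spaces and shell metacharacters stay inside a single argument. The helper name build_dead_code_command is hypothetical and is not part of tasks.py.

```python
import shlex
import sys


def build_dead_code_command(json_output: str, layer: str, verbose: bool = False) -> str:
    """Build a shell-safe command string for the analyzer script."""
    args = [
        sys.executable,
        "scripts/find_dead_code.py",
        "--json-output",
        json_output,
        "--layer",
        layer,
    ]
    if verbose:
        args.append("--verbose")
    # shlex.join applies shlex.quote to every element.
    return shlex.join(args)


cmd = build_dead_code_command("my report.json; rm -rf /", "all", verbose=True)
# The hostile value survives as one argument instead of a second command.
print(shlex.split(cmd)[3])  # my report.json; rm -rf /
```

The resulting string could then be passed to ctx.run(cmd, echo=True) unchanged; shlex quoting follows POSIX shell rules and is not suitable for cmd.exe.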

Comment thread scripts/find_dead_code.py
module_body_fqn = f"{mod_name}.{self.MODULE_BODY}"
module_refs: set[str] = set()
for node in ast.iter_child_nodes(tree):
    if id(node) in top_level_defs:
Copilot AI Apr 2, 2026

_process_module_body() skips all top-level FunctionDef/ClassDef nodes entirely, but decorators (and class bases/metaclass expressions) are evaluated at import time. This means symbol reachability can produce false positives for side-effectful decorators (e.g., great_expectations/_version.py uses @register_vcs_handler(...) to populate HANDLERS at import time). Consider including references found in decorator_list (and for classes, bases/keywords) in the module-body root traversal so import-time side effects are modeled.
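
To see why decorators matter for import-time reachability, consider this simplified stand-in for a registering decorator. The registry and function names below are illustrative and are not copied from great_expectations/_version.py.

```python
HANDLERS: dict = {}


def register_vcs_handler(vcs: str, method: str):
    """Decorator factory that records the decorated function in HANDLERS."""

    def decorate(func):
        HANDLERS.setdefault(vcs, {})[method] = func
        return func

    return decorate


@register_vcs_handler("git", "get_keywords")
def git_get_keywords():
    return {}


# git_get_keywords was never called, yet the decorator already ran when the
# def statement executed, so the registry is populated at import time.
print(sorted(HANDLERS["git"]))  # ['get_keywords']
```

An analyzer that ignores decorator_list sees no reference to git_get_keywords and may report it as dead, even though it is reachable through HANDLERS.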

Suggested change
if id(node) in top_level_defs:
if id(node) in top_level_defs:
    # Top-level defs have bodies that are not executed at import time,
    # but their decorators and (for classes) base/metaclass expressions
    # are. Traverse those expressions to capture import-time references.
    decorator_roots = []
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
        decorator_roots.extend(node.decorator_list)
    if isinstance(node, ast.ClassDef):
        decorator_roots.extend(node.bases)
        decorator_roots.extend(kw.value for kw in node.keywords)
    for root in decorator_roots:
        for child in ast.walk(root):
            if isinstance(child, ast.Name):
                name = child.id
                if name in import_map:
                    module_refs.add(import_map[name])
                elif name in info.defined_symbols:
                    module_refs.add(f"{mod_name}.{name}")

Comment on lines +5 to +8
"ignore_patterns": [],
"notes": {
"great_expectations.compatibility.*": "Compatibility shims for optional dependencies; loaded conditionally at runtime"
}
Copilot AI Apr 2, 2026

This config includes a notes mapping, but DynamicImportDetector.load_exceptions() never reads it (it only emits generic "Pattern matched" notes). Either remove the unused notes field to avoid confusion, or update the loader/reporting to surface these per-pattern explanations.

Suggested change
"ignore_patterns": [],
"notes": {
"great_expectations.compatibility.*": "Compatibility shims for optional dependencies; loaded conditionally at runtime"
}
"ignore_patterns": []

Comment thread tasks.py
name="dead-code",
help={
"json_output": "Path for JSON report output (default: dead_code_report.json)",
"layer": "Which layers to run: 1 (modules), 2 (symbols), 4 (tests), all (default: all)",
Copilot AI Apr 2, 2026

The task help text for layer omits layer 3, but scripts/find_dead_code.py accepts --layer 3 (dynamic-import augmentation). Update the help text (and optionally validate layer against the script’s choices) or remove the unsupported option so the wrapper’s CLI is self-consistent.

Suggested change
"layer": "Which layers to run: 1 (modules), 2 (symbols), 4 (tests), all (default: all)",
"layer": "Which layers to run: 1 (modules), 2 (symbols), 3 (dynamic-import augmentation), 4 (tests), all (default: all)",

@github-actions
Contributor

Is this PR still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity.

It will be closed if no further activity occurs. Thank you for your contributions 🙇

@github-actions github-actions Bot added the stale Stale issues and PRs label May 22, 2026