feat: add graphify dry-run command #157

Open
nuthalapativarun wants to merge 3 commits into safishamsi:v3 from nuthalapativarun:feat/dry-run-command

@nuthalapativarun nuthalapativarun commented Apr 9, 2026

Summary

Adds a graphify dry-run [path] CLI command that scans the corpus and prints a file-count/health summary without writing any output files or building the graph.

This is a safe preview step — useful for validating what graphify sees before committing to a full extraction run that may consume LLM tokens.

Usage

$ graphify dry-run ./my-project
Corpus scan: /abs/path/my-project

  Code files          23
  Documents            7
  Total               30  (~84,200 words)

Corpus looks healthy — no warnings.

No files were written. Run without dry-run to build the graph.

With a large corpus:

warning: Large corpus: 312 files · ~620,000 words. Semantic extraction
will be expensive (many Claude tokens). Consider running on a subfolder,
or use --no-semantic to run AST-only.

Implementation

  • graphify/__main__.py — new elif cmd == "dry-run" branch + help text entry
  • Reuses detect.detect() entirely — no new detection logic
  • graphify-out/ is never created or touched
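
A minimal sketch of what such a branch could look like, not the PR's actual code. `fake_detect` is a hypothetical stand-in for `detect.detect()`, and the result keys (`code`, `docs`, `total_words`, `warning`) are assumptions, since the real return shape is not shown in this thread. The column widths are chosen to match the usage output above.

```python
import sys
from pathlib import Path
from typing import Callable


def fake_detect(root: Path) -> dict:
    """Hypothetical stand-in for detect.detect(): reads files, writes nothing."""
    code = sorted(p for p in root.rglob("*.py") if p.is_file())
    docs = sorted(p for p in root.rglob("*.md") if p.is_file())
    words = sum(len(p.read_text(errors="ignore").split()) for p in code + docs)
    return {"code": code, "docs": docs, "total_words": words, "warning": None}


def dry_run(args: list[str], detect: Callable[[Path], dict] = fake_detect) -> int:
    """Print a corpus summary and return an exit code without creating any files."""
    # Default to the current directory when no path is given.
    root = Path(args[0] if args else ".").resolve()
    if not root.exists():
        print(f"error: path not found: {root}", file=sys.stderr)
        return 1

    result = detect(root)
    n_code, n_docs = len(result["code"]), len(result["docs"])
    print(f"Corpus scan: {root}\n")
    print(f"  {'Code files':<18}{n_code:>4}")
    print(f"  {'Documents':<18}{n_docs:>4}")
    print(f"  {'Total':<18}{n_code + n_docs:>4}  (~{result['total_words']:,} words)\n")

    if result["warning"]:
        print(f"warning: {result['warning']}\n")
    else:
        # \u2014 is the dash used in the message shown in the usage output.
        print("Corpus looks healthy \u2014 no warnings.\n")
    print("No files were written. Run without dry-run to build the graph.")
    return 0
```

Passing `detect` as a parameter is only a convenience for this sketch; it lets the summary logic be exercised without the real package installed.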

Test plan

  • test_dry_run_prints_summary: file-count table appears in output
  • test_dry_run_no_files_written: graphify-out/ is not created
  • test_dry_run_default_path: defaults to current directory when path omitted
  • test_dry_run_missing_path: exits non-zero for a missing path
  • test_dry_run_no_graphify_out_written: "No files were written" in output
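
One way a "no files written" check like the ones above could be made stricter is to compare a snapshot of the directory tree before and after the command runs. This is an illustrative sketch only: `snapshot` and `assert_no_writes` are hypothetical helpers, and `command` stands in for whatever invokes the dry-run in the real test suite.

```python
from pathlib import Path
from typing import Callable


def snapshot(root: Path) -> dict[str, int]:
    """Map each path under root to its size in bytes (directories map to -1)."""
    return {
        p.relative_to(root).as_posix(): (p.stat().st_size if p.is_file() else -1)
        for p in root.rglob("*")
    }


def assert_no_writes(command: Callable[[Path], None], root: Path) -> None:
    """Run command(root) and fail if anything under root was added, removed, or resized."""
    before = snapshot(root)
    command(root)
    after = snapshot(root)
    assert after == before, f"unexpected changes: {sorted(set(after) ^ set(before))}"
    assert not (root / "graphify-out").exists()
```

Checking the whole tree, rather than only `graphify-out/`, would also catch writes to unexpected locations inside the corpus.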

graphify dry-run [path] scans the corpus with detect() and prints a
file-count table with corpus health warnings without writing any
output files or building the graph.
@nuthalapativarun
Author

Hey @safishamsi — just checking in on this one. Happy to rebase or make any adjustments if needed. Let me know!

@Qodo-Free-For-OSS

Hi, graphify dry-run calls graphify.detect.detect(), but detect() can create graphify-out/converted/*.md sidecar files when it encounters .docx/.xlsx files. This violates the dry-run promise and can cause unexpected filesystem writes during what is advertised as a no-write preview step.

Severity: action required | Category: correctness

How to fix: Make detect side-effect-free

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

graphify dry-run must not write any files, but it currently calls graphify.detect.detect() which may write office conversion sidecars into graphify-out/converted/.

Issue Context

  • graphify/__main__.py dry-run branch calls _detect(root) and prints “No files were written”.
  • graphify/detect.py converts .docx/.xlsx by writing markdown sidecars.

Fix Focus Areas

  • Add a dry_run/write_sidecars/convert_office boolean parameter to graphify.detect.detect() (default preserving current behavior).
  • Ensure that when the flag is disabled, detect() does not create directories or write any files (skip conversion, and optionally count words directly from the office file or as 0).
  • Call detect(..., write_sidecars=False) (or equivalent) from the dry-run CLI branch.

References

  • graphify/__main__.py[794-823]
  • graphify/detect.py[347-376]
  • graphify/detect.py[187-213]

Found by Qodo code review

detect() now accepts write_sidecars=False; when disabled, office files
are counted directly without calling convert_office_file() or touching
graphify-out/converted/. The dry-run CLI branch passes this flag so the
no-write promise holds even for .docx/.xlsx corpora.

Adds test_dry_run_office_no_sidecar_written to assert convert_office_file
is never called during dry-run.
@nuthalapativarun
Author

Good catch @qodo-ai-reviewer — fixed in 0b3e6eb.

detect() now accepts a write_sidecars keyword argument. When it is False, office files (.docx/.xlsx) are counted directly without calling convert_office_file() or touching graphify-out/converted/. The dry-run CLI branch passes write_sidecars=False, so the no-write promise holds even for corpora containing office files.
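
A rough illustration of the flag's effect, under stated assumptions: `detect_office` and its injected `to_markdown` converter are hypothetical names, the `True` default is assumed from the reviewer's "default preserving current behavior" suggestion, and the sidecar naming is invented for the example.

```python
from pathlib import Path
from typing import Callable

OFFICE_SUFFIXES = {".docx", ".xlsx"}


def detect_office(
    files: list[Path],
    out_dir: Path,
    to_markdown: Callable[[Path], str],
    write_sidecars: bool = True,  # assumed default
) -> dict:
    """Count words in office files; write sidecars only when write_sidecars is True."""
    total_words = 0
    sidecars: list[Path] = []
    for path in files:
        if path.suffix.lower() not in OFFICE_SUFFIXES:
            continue
        text = to_markdown(path)  # in-memory conversion, no filesystem writes
        total_words += len(text.split())
        if write_sidecars:
            # Only this branch creates directories or files.
            converted = out_dir / "converted"
            converted.mkdir(parents=True, exist_ok=True)
            sidecar = converted / f"{path.name}.md"
            sidecar.write_text(text, encoding="utf-8")
            sidecars.append(sidecar)
    return {"total_words": total_words, "sidecars": sidecars}
```

Keeping every write behind the single `if write_sidecars:` branch makes the no-write guarantee easy to audit.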

Added test_dry_run_office_no_sidecar_written which mocks convert_office_file and asserts it is never called during a dry-run invocation.
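
For readers unfamiliar with the pattern, a test of this kind might look roughly like the sketch below. It uses `unittest.mock.patch.object` and `assert_not_called()` from the standard library; `detect_module` and `scan` are toy stand-ins (assumptions), not the real `graphify.detect` module or its `detect()` function.

```python
from pathlib import Path
from types import SimpleNamespace
from unittest import mock


def _convert_office_file(path: Path, out_dir: Path) -> Path:
    """Placeholder for the real converter, which would write a sidecar file."""
    raise RuntimeError("would write a sidecar")


# Hypothetical stand-in for the graphify.detect module namespace.
detect_module = SimpleNamespace(convert_office_file=_convert_office_file)


def scan(paths, write_sidecars=True):
    """Toy detect(): converts office files only when write_sidecars is True."""
    converted = []
    for p in paths:
        if p.suffix in {".docx", ".xlsx"} and write_sidecars:
            out_dir = Path("graphify-out") / "converted"
            converted.append(detect_module.convert_office_file(p, out_dir))
    return {"files": list(paths), "converted": converted}


def test_office_no_sidecar_written():
    # Replace the converter with a mock for the duration of the scan.
    with mock.patch.object(detect_module, "convert_office_file") as convert:
        scan([Path("report.docx"), Path("data.xlsx")], write_sidecars=False)
    convert.assert_not_called()
```

The mock must replace the name at the place where the code under test looks it up, which is why the sketch patches the attribute on `detect_module` rather than the private function.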

@Qodo-Free-For-OSS

Hi, in dry-run mode, .docx/.xlsx files are counted via count_words() without creating sidecars. However, when the optional office libraries are missing, the result is a silent 0-word count with no skipped/warning signal, so dry-run can report a “healthy” corpus that the real run would not process correctly.

Severity: action required | Category: correctness

How to fix: Warn/skip office without deps

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

graphify dry-run calls detect(..., write_sidecars=False). For office files (.docx/.xlsx), this path currently counts words directly via count_words(p).

If optional office dependencies aren’t installed, docx_to_markdown() / xlsx_to_markdown() return an empty string, so count_words() returns 0 and dry-run prints “Corpus looks healthy — no warnings.” This hides that office content won’t actually be extracted/usable in non-dry-run runs.

Issue Context

The write_sidecars=True code path already treats office conversion failures as a “skipped” condition with an install hint. Dry-run should surface the same problem (or at least warn) rather than silently counting 0 words.

Fix Focus Areas

  • graphify/detect.py[302-410]
    • In the write_sidecars=False office branch, detect the “no office support” case (e.g., if conversion output is empty) and add an entry to skipped_sensitive (or a dedicated skipped_office) plus set a warning.
    • Consider not counting office files as successfully scanned if their text extraction failed.
  • graphify/__main__.py[794-823]
    • Optionally adjust dry-run messaging to explicitly mention office files skipped due to missing extras.
  • tests/test_dry_run.py[65-75]
    • Add a test that simulates missing python-docx / openpyxl (e.g., patch docx_to_markdown to return "") and asserts a warning/skipped message is surfaced.

We noticed a couple of other issues in this PR as well - happy to share if helpful.


Found by Qodo code review

In write_sidecars=False mode, probe office files via docx_to_markdown/
xlsx_to_markdown (which return '' on ImportError). Empty result means
the real run would also extract nothing — add to skipped list with an
install hint instead of silently counting 0 words.

__main__.py surfaces a dedicated 'Skipped (office deps missing)' line
with pip install hint, and suppresses 'Corpus looks healthy' when
office files were skipped.

Adds test_dry_run_office_missing_deps_warns to assert the warning and
install hint appear when docx_to_markdown is patched to return ''.

Closes feedback from qodo-ai-reviewer on PR safishamsi#157.
@nuthalapativarun
Author

Fixed in feb29c3.

In the write_sidecars=False branch, detect() now probes each office file by calling docx_to_markdown/xlsx_to_markdown in-memory (no writes). Both functions already return "" on ImportError, so an empty result means the real run would also extract nothing. Those files are added to skipped_sensitive with a pip install graphify[office] hint instead of being silently counted as 0 words.
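
The probing idea can be sketched as follows. This is an assumption-laden illustration: `docx_to_markdown_stub` only mimics the "return an empty string on ImportError" behavior described above (its `module_name` parameter exists purely so the example can simulate a missing dependency), and `probe_office` is a hypothetical helper, not a function from the PR.

```python
import importlib
from pathlib import Path

INSTALL_HINT = "pip install graphify[office]"


def docx_to_markdown_stub(path: Path, module_name: str = "docx") -> str:
    """Hypothetical converter: returns '' when the optional dependency is missing."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return ""
    return f"converted text of {path.name}"  # real extraction elided


def probe_office(paths, convert):
    """Split office files into (path, word_count) pairs and skipped messages."""
    counted, skipped = [], []
    for p in paths:
        text = convert(p)  # in-memory only, nothing is written
        if text.strip():
            counted.append((p, len(text.split())))
        else:
            # Empty output is treated as "the real run would extract nothing".
            skipped.append(f"{p} (no text extracted; try: {INSTALL_HINT})")
    return counted, skipped
```

One consequence of keying on empty output is that a genuinely empty document would be reported the same way as a missing dependency; whether that matters depends on how the real converters behave.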

__main__.py splits skipped entries into a dedicated "Skipped (office deps missing)" line with the install hint, and suppresses "Corpus looks healthy" when office files were skipped — so dry-run now accurately reflects what a real run would produce.
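
A possible shape for that output logic, offered as a sketch rather than the PR's code. How `__main__.py` actually tells office skips apart from other skipped entries is not shown in this thread; matching on the install-hint text, the `render_health` name, and the exact line wording other than the quoted labels are all assumptions.

```python
OFFICE_MARKER = "graphify[office]"


def render_health(warning, skipped):
    """Return the health lines for a dry-run summary as a list of strings."""
    # Assumption: office skips are recognizable by the install hint in their message.
    office = [s for s in skipped if OFFICE_MARKER in s]
    other = [s for s in skipped if OFFICE_MARKER not in s]

    lines = []
    if office:
        lines.append(f"  Skipped (office deps missing): {len(office)}")
        lines.append(f"    install with: pip install {OFFICE_MARKER}")
    if other:
        lines.append(f"  Skipped (other): {len(other)}")
    if warning:
        lines.append(f"warning: {warning}")
    # The healthy message is suppressed when office files were skipped.
    if not warning and not office:
        lines.append("Corpus looks healthy \u2014 no warnings.")
    return lines
```

Returning lines instead of printing them keeps the suppression rule easy to test in isolation.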

Added test_dry_run_office_missing_deps_warns which patches docx_to_markdown to return "" and asserts the warning and install hint appear in output.


2 participants