perf(site): Avoid file-tree walk when only database usage is needed #548

Open: balamurali27 wants to merge 3 commits into develop from fix/database-usage-skip-file-walk

balamurali27 commented Jun 22, 2026

Paired PR: frappe/press#6765

Problem

get_usage (served by the /info endpoint) always computes full disk usage, including a recursive walk of the site's public, private, and backups directories to sum file sizes. On sites with large file trees this walk exceeds the gunicorn timeout and kills agent web workers (WORKER TIMEOUT).

Press's "refresh database usage" polls /info frequently but only needs the database size — yet pays for the whole file walk each time.

Changes

1. du -sb instead of a Python walk (get_size)
The old get_size ran islink/isfile/isdir/getsize per entry, which costs several stat syscalls per file. Shelling out to du -sb computes the same apparent-size sum in C with one stat per file. ignore_dirs maps to du --exclude.
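
A minimal sketch of what a du-based get_size might look like, assuming GNU du is on the PATH. The helpers build_du_command and parse_du_output are hypothetical names introduced here so that command construction and output parsing can be checked separately; the actual diff may be structured differently.

```python
import subprocess


def build_du_command(path, ignore_dirs=None):
    """Build the argv for du (hypothetical helper, not taken from the diff)."""
    # -s: print one summary line; -b: apparent size in bytes.
    command = ["du", "-sb"]
    for name in ignore_dirs or []:
        command.append(f"--exclude={name}")
    command.append(path)
    return command


def parse_du_output(output):
    """du prints '<bytes><TAB><path>'; keep only the leading byte count."""
    return int(output.split(maxsplit=1)[0])


def get_size(path, ignore_dirs=None):
    """Return the apparent size of path in bytes by shelling out to du."""
    output = subprocess.check_output(build_du_command(path, ignore_dirs), text=True)
    return parse_du_output(output)
```

For example, build_du_command("private", ["backups"]) yields the argv for du -sb --exclude=backups private.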

2. database_only flag (/info route → fetch_site_info → get_usage)
The /info route accepts ?database_only=1; in that mode get_usage returns only the (cheap) database size and skips the file walks entirely.
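
The short-circuit can be sketched roughly as follows. This is an illustration under assumptions, not the code from the diff: the size functions are passed in as callables so the example is self-contained, parse_database_only is a hypothetical helper, and the accepted truthy strings ("1", "true", "True") are the ones listed in the review summary on this PR.

```python
TRUTHY_VALUES = {"1", "true", "True"}


def parse_database_only(query_args):
    """Interpret the ?database_only query arg (query_args is any dict-like)."""
    return query_args.get("database_only", "") in TRUTHY_VALUES


def get_usage(get_database_size, get_size, database_only=False):
    """Return usage in bytes; both callables are injected stand-ins."""
    usage = {"database": get_database_size()}
    if database_only:
        # Fast path: return before any file-tree walk is started.
        return usage
    usage["public"] = get_size("public")
    usage["private"] = get_size("private", ignore_dirs=["backups"])
    usage["backups"] = get_size("private/backups")
    return usage
```

With database_only=True the get_size callable is never invoked, which is the property the flag relies on.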

Relationship to prior work

Complements the DB-size optimizations in #399 / #447 / #364, which made the database portion cheap but left the public/private/backups file walk in place. This skips that walk for callers that don't need it.

Paired change

frappe/press#6765 sends ?database_only=1 from the database-usage refresh path. The two can deploy in either order — an older Press simply omits the param and gets the full walk (current behaviour).
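
A small sketch of why the deploy order does not matter: the parameter is purely additive on the caller's side. The function info_url and the base URL below are hypothetical; the real Press code and the full route path are not shown in this PR.

```python
from urllib.parse import urlencode


def info_url(base_url, database_only=False):
    """Build the /info URL, adding the flag only when requested (hypothetical helper)."""
    url = base_url.rstrip("/") + "/info"
    if database_only:
        url += "?" + urlencode({"database_only": 1})
    return url
```

A caller that never passes database_only produces the same URL as before, so it keeps getting the full usage payload.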

🤖 Generated with Claude Code

balamurali27 and others added 2 commits June 22, 2026 13:36
get_size recursively listed every entry and ran islink/isfile/isdir/
getsize on each — several stat syscalls per file. On sites with large
file trees this is slow enough to time out gunicorn workers. Shelling
out to `du -sb` does the same apparent-size sum in C with one stat per
file. ignore_dirs is mapped to du's --exclude.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
The /info endpoint always computes full disk usage, including the
recursive file-tree walk for public/private/backups. Callers that only
need the database size (Press's database-usage refresh) pay for the
whole walk, which times out workers on large sites.

Accept a database_only query param on the /info route and thread it to
get_usage; when set, return only the (cheap) database size and skip the
file walks entirely.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

greptile-apps Bot commented Jun 22, 2026

Confidence Score: 4/5

The change is correct and the fast path works as intended; the main residual risk is in agent/utils.py where du can still block a gunicorn worker indefinitely on a stalled mount — the same class of problem the PR set out to solve.

The database_only short-circuit is clean and the du -sb replacement is logically equivalent to the old walk. The subprocess call in get_size has no timeout, so a slow or unresponsive filesystem can still pin a worker for the full-walk code path — a concern already raised in review but not yet addressed in this diff.

agent/utils.py — the subprocess.check_output call lacks a timeout

Important Files Changed

| Filename | Overview |
| --- | --- |
| agent/utils.py | Replaces Python recursive walk with du -sb; no timeout guard on the subprocess call (flagged in previous thread) |
| agent/site.py | Adds database_only short-circuit to get_usage and threads the flag through fetch_site_info; logic is correct and well-commented |
| agent/web.py | Parses ?database_only query arg and forwards it; truthy-string handling covers the intended values ("1", "true", "True") |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Press
    participant web.py as web.py (/info)
    participant site.py as site.py (fetch_site_info)
    participant get_usage as get_usage
    participant db as get_database_size()
    participant du as du -sb

    Press->>web.py: "GET /info?database_only=1"
    web.py->>site.py: "fetch_site_info(database_only=True)"
    site.py->>get_usage: "get_usage(database_only=True)"
    get_usage->>db: get_database_size()
    db-->>get_usage: size (bytes)
    get_usage-->>site.py: "{database: size}"
    site.py-->>web.py: "{config, timezone, usage}"
    web.py-->>Press: "{data: {...}}"

    Press->>web.py: GET /info (no param)
    web.py->>site.py: "fetch_site_info(database_only=False)"
    site.py->>get_usage: "get_usage(database_only=False)"
    get_usage->>db: get_database_size()
    db-->>get_usage: size
    get_usage->>du: du -sb public/
    du-->>get_usage: public size
    get_usage->>du: "du -sb --exclude=backups private/"
    du-->>get_usage: private size
    get_usage->>du: du -sb private/backups/
    du-->>get_usage: backups size
    get_usage-->>site.py: "{database, public, private, backups}"
    site.py-->>web.py: "{config, timezone, usage}"
    web.py-->>Press: "{data: {...}}"

Reviews (2): Last reviewed commit: "docs(site): Trim file-walk comments; rat..."

Comment thread agent/utils.py Outdated
Comment on lines +87 to +91
Shells out to `du` (C) instead of walking the tree in Python: a recursive
Python stat of every file was timing out gunicorn workers on sites with
large file trees. `du -b` reports apparent size (st_size), matching the
old behaviour. ignore_dirs is applied at the top level only, as before.
"""

P2 The docstring claims ignore_dirs is still applied "at the top level only, as before", but du --exclude=PATTERN matches the pattern against basenames at every depth in the tree — not just immediate children. In practice this doesn't matter for the current sole caller (ignore_dirs=["backups"], where "backups" only exists at the top of the private tree), but the comment documents wrong behaviour and could mislead future callers that pass names which happen to appear deeper in the tree.
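
The difference between the two exclusion semantics can be illustrated with a pure-Python emulation. This is not how du works internally and not the old get_size; walk_size is a hypothetical function that counts regular files only and prunes directory names either at the top level or at every depth.

```python
import os


def walk_size(root, ignore_names, top_level_only):
    """Sum regular-file sizes under root, pruning ignored directory names."""
    ignore = set(ignore_names)
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == root or not top_level_only:
            # Editing dirnames in place stops os.walk from descending into them.
            dirnames[:] = [d for d in dirnames if d not in ignore]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total
```

Given a tree containing both backups/ and files/backups/, the top-level-only mode still counts files/backups/, while the every-depth mode (the behaviour described above for du --exclude) drops both.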

Suggested change
- Shells out to `du` (C) instead of walking the tree in Python: a recursive
- Python stat of every file was timing out gunicorn workers on sites with
- large file trees. `du -b` reports apparent size (st_size), matching the
- old behaviour. ignore_dirs is applied at the top level only, as before.
- """
+ Shells out to `du` (C) instead of walking the tree in Python: a recursive
+ Python stat of every file was timing out gunicorn workers on sites with
+ large file trees. `du -b` reports apparent size (st_size), matching the
+ old behaviour. Note: `du --exclude` matches basenames at every depth in
+ the tree, not just the top level (unlike the old Python walk).
+ """

Comment thread agent/utils.py
Comment on lines +97 to +98
output = subprocess.check_output(command, text=True)
return int(output.split(maxsplit=1)[0])

P2 subprocess.check_output without a timeout blocks indefinitely if du stalls on an unresponsive NFS/FUSE mount, holding the gunicorn worker in the same way the old Python walk did. A reasonable upper bound (e.g. 120 s) would surface the failure quickly instead of silently pinning a worker.

Suggested change
- output = subprocess.check_output(command, text=True)
- return int(output.split(maxsplit=1)[0])
+ output = subprocess.check_output(command, text=True, timeout=120)
+ return int(output.split(maxsplit=1)[0])
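
One way the caller could handle the timeout is sketched below. The function size_with_timeout and the choice to return None are assumptions made for illustration; the PR does not say how a timeout should be reported. The 120 s default mirrors the suggestion above.

```python
import subprocess


def size_with_timeout(command, timeout_seconds=120):
    """Run a du-style command; return bytes, or None if it times out."""
    try:
        output = subprocess.check_output(
            command, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # check_output kills the child and waits for it before re-raising.
        return None
    return int(output.split(maxsplit=1)[0])
```

Returning None lets the endpoint report a missing size instead of failing the whole request, though raising may be preferable if callers treat the size as mandatory. Note that a child stuck in uninterruptible I/O on a hung mount may not exit promptly even after being killed, so the timeout is a mitigation rather than a guarantee.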

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>