Skip to content

Adding harvester robustness #19373

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

wallentx
Copy link
Contributor

@wallentx wallentx commented Mar 12, 2025

This fixes #19383

This PR addresses the issue where the harvester abandons an entire plot directory when encountering I/O errors in subdirectories during recursive plot scanning.
It also introduces some new CLI commands and RPC endpoints which are useful for recovering from such a scenario.

Features Added/Changed:

  1. Cache Clear Method:

    • Added a new clear() method to the Cache class that removes all entries from the cache
    • This allows for a complete cache reset without having to manually delete the ~/.chia/mainnet/cache/plot_manager_v2.dat file, and restarting the harvester
  2. New harvester RPC Endpoints:

    • Added a new /refresh_plots endpoint to the harvester RPC API for normal plot refreshing
    • Added a new /hard_refresh_plots endpoint that clears the plot cache and triggers a full refresh
    • These endpoints allow programmatic refreshing of plots without restarting the harvester
  3. New CLI Commands:

    • Added chia plots refresh command to trigger a normal refresh of plots
    • Added chia plots refresh --hard option to perform a hard refresh (clearing cache)
    • Provides users with convenient ways to refresh plots without restarting the harvester

These changes improve the harvester's resilience to I/O errors by providing a way to clear the cache and force a complete refresh without restarting the service. This is particularly valuable for large farms with multiple drive mounts where occasional I/O errors are inevitable.

@wallentx
Copy link
Contributor Author

I have some tests that don't fully work yet in another branch of my fork if you really want them.

@wallentx wallentx marked this pull request as ready for review March 12, 2025 23:57
@wallentx wallentx requested a review from a team as a code owner March 12, 2025 23:57
Copy link
Contributor

@arvidn arvidn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this PR have two independent changes. Making traversing plot directories more robust to failure seems uncontroversial. By splitting that out into a serparate PR, I think it's easier to land them, as we don't have to wait for both parts to be ready before landing anything

Comment on lines +252 to +263
try:
files = glob.glob(str(directory / "**" / "*.plot"), recursive=True)
for file in files:
try:
filepath = Path(file).resolve()
if filepath.is_file() and not filepath.name.startswith("._"):
all_files.append(filepath)
except Exception as e:
# If we can't process a specific file, log and continue
log.warning(f"Error processing file {file}: {e}")
continue
except Exception as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems unnecessary to keep the glob() path. The idea is that the "manual" traversal works too, right? It's much easier to test if there's only one implementation

@@ -23,6 +23,7 @@ def get_routes(self) -> dict[str, Endpoint]:
return {
"/get_plots": self.get_plots,
"/refresh_plots": self.refresh_plots,
"/hard_refresh_plots": self.hard_refresh_plots,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would seem more intuitive to add an argument to /refresh_plots, something like force=True, rather than adding a whole new endpoint. Maybe there's a good reason.

Comment on lines +127 to +131
try:
# Get all plot files in this directory
plot_files = get_filenames(directory, recursive_scan, recursive_follow_links)
all_files[directory] = plot_files
except Exception as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with your change, does get_filenames() ever throw an exception?

@wallentx
Copy link
Contributor Author

Thanks @arvidn . Just want to make sure:
Make a separate PR for "preventing the failure" and keep this PR for "how to recover the state in the event of a failure", so that I get to keep the benefit of your review?

@arvidn
Copy link
Contributor

arvidn commented Mar 26, 2025

yeah. It seems the RPC to force-refresh the plots is somewhat independent from making directory traversal more robust

@emlowe
Copy link
Contributor

emlowe commented Apr 2, 2025

The manual method using os.walk and iterating might be slow, but I have no data for this. Have you tried to benchmark this at all?

I concur with Arvid that two PRs, one where you replace glob with walk and a second where you introduced new chia commands, are a good idea. As the new chia commands would work regardless of the iteration method

Copy link
Contributor

This PR has been flagged as stale due to no activity for over 60 days. It will not be automatically closed, but it has been given a stale-pr label and should be manually reviewed by the relevant parties.

@github-actions github-actions bot added the stale-pr Flagged as stale and in need of manual review label May 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-pr stale-pr Flagged as stale and in need of manual review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Harvester abandons entire directory tree when encountering I/O errors during recursive plot scanning
4 participants