Implement storage- and relationship-aware cleanup #973
Conversation
Codecov Report
✅ All tests successful. No failed tests found.
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #973      +/-   ##
==========================================
- Coverage   97.75%   97.74%   -0.02%
==========================================
  Files         451      458       +7
  Lines       36812    36985     +173
==========================================
+ Hits        35986    36150     +164
- Misses        826      835       +9
```
This implements a generic mechanism to resolve the model relationship graph based on Django metadata. That graph is then sorted in topological order, and all models are deleted in that order. Special manual deletion code is written for the models that have files in storage, to make sure those files are cleaned up properly as well.
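For illustration, here is a minimal sketch of how such a graph could be derived from Django's model metadata and ordered topologically. The function names and the use of `graphlib` are assumptions for the sketch, not this PR's actual code:

```python
# Hypothetical sketch: build a reverse-relation graph from Django metadata and
# walk it in topological order. Names here are illustrative assumptions.
from graphlib import TopologicalSorter

from django.db import models

ModelType = type[models.Model]


def build_relation_graph(root: ModelType) -> dict[ModelType, set[ModelType]]:
    """Map each model to the set of models that reference it
    (and therefore must be deleted first)."""
    graph: dict[ModelType, set[ModelType]] = {root: set()}
    queue = [root]
    while queue:
        model = queue.pop()
        for relation in model._meta.related_objects:  # reverse FKs into `model`
            child = relation.related_model
            if child is model:
                continue  # skip self-references to keep the graph acyclic
            graph[model].add(child)
            if child not in graph:
                graph[child] = set()
                queue.append(child)
    return graph


def delete_in_topological_order(root: ModelType) -> None:
    # `static_order()` yields a model only after everything referencing it,
    # so leaf models are deleted first and no DB-side cascades are triggered.
    for model in TopologicalSorter(build_relation_graph(root)).static_order():
        model.objects.all().delete()  # the real code scopes this to `root`
```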
Lots to unpack here... I like this in general. I think it's pretty cool and the code looks quite sophisticated.
Just so I understand this better: we currently have two entry points to use as "root" nodes when traversing the graph of dependencies, the repo or the owner. Is that correct?
services/cleanup/models.py (outdated)

```python
) -> int:
    cleaned_files = 0

    # TODO: maybe reuse the executor across calls?
```
We certainly stand to gain from not having to initialize and tear down the pool every time.
But it seems that the calls to cleanup_files_batched themselves are not concurrent. Is that on purpose? The way things are now, it seems that at any given time there would always be at most one pool running anyway.
Indeed, it is running with only one pool at a time.
The regular deletions still run serially in topological sort order, so it never deletes two unrelated models concurrently.
I was thinking about that a bit, but I thought it would be too complex to actually implement. Also, reusing the thread pool would mean that I would have to introduce some kind of "waitpool" to make sure all the file deletes are complete before returning, while still keeping the pool alive.
I wanted to avoid that complexity for now.
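For what it's worth, a sketch of what that discussed variant could look like: a shared, long-lived executor plus a per-call wait. `storage.delete_files` and all other names here are assumptions, not the real API:

```python
# Hedged sketch of the "reuse the executor across calls" idea from this thread.
# A single long-lived pool is shared; each call collects its own futures so it
# can still block until its deletes are done.
from concurrent.futures import ThreadPoolExecutor, wait

_executor = ThreadPoolExecutor(max_workers=8)  # kept alive across cleanup calls

BATCH_SIZE = 100


def cleanup_files_batched(storage, bucket: str, paths: list[str]) -> int:
    futures = [
        _executor.submit(storage.delete_files, bucket, paths[i : i + BATCH_SIZE])
        for i in range(0, len(paths), BATCH_SIZE)
    ]
    # The "waitpool": block until this call's deletes have finished, while the
    # pool itself stays alive for the next call.
    wait(futures)
    return sum(f.result() for f in futures)  # assumes each batch returns a count
```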
```python
break

buckets_paths: dict[str, list[str]] = defaultdict(list)
for (
```
[very nit] maybe a NamedTuple would make the code more ergonomic here (but less performant, for sure)
I don’t fully understand? Would you prefer not to destructure the tuple into variables right away, but rather access the fields through a NamedTuple object? I’m not sure that would make things more readable 🤔
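To make the two styles under discussion concrete, a toy comparison (the field names and sample data are invented for the example):

```python
# Toy example contrasting tuple destructuring with NamedTuple attribute access.
from collections import defaultdict
from typing import NamedTuple


class FileRef(NamedTuple):
    bucket: str
    path: str


rows = [("archive", "v4/raw/a.txt"), ("archive", "v4/raw/b.txt")]
buckets_paths: dict[str, list[str]] = defaultdict(list)

# Destructuring the plain tuple into variables right away (current style):
for bucket, path in rows:
    buckets_paths[bucket].append(path)

# Accessing the same fields through a NamedTuple instead:
for ref in map(FileRef._make, rows):
    buckets_paths[ref.bucket].append(ref.path)
```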
```python
- Owner entry from db
- Cascading deletes of repos, pulls, and branches for the owner
"""
acks_late = True  # retry the task when the worker dies for whatever reason
```
It won't retry if the task ends with an Exception... just saying, in case that is part of the "whatever" expectation.
Do you have more details about the execution model of Celery? I feel like after all this time I still don’t understand how it works and behaves under various circumstances.
For these tasks, I would want them to "resume" until they are completed.
This means resuming on timeouts (I’m not sure if you can configure them to have no timeout at all?). It means resuming when the worker is killed for whatever reason. It also means resuming on any kind of random infrastructure exception.
Even for a fixable programmer error, I would like these tasks to always run to "successful" completion.
I’m not entirely sure how to achieve that with Celery 🤷🏻‍♂️
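For reference, the Celery task options that get closest to that behavior, as far as I understand them. This is a sketch, not this PR's actual task definition, and `cleanup_owner` is a made-up name:

```python
# Sketch of Celery options approximating "run until successful completion".
# `acks_late` re-queues the task if the worker dies; `autoretry_for` plus
# `max_retries=None` retries on any exception without a retry limit. Time
# limits still apply unless configured otherwise on the worker.
from celery import shared_task


@shared_task(
    acks_late=True,              # redeliver if the worker dies mid-task
    autoretry_for=(Exception,),  # also retry when the task raises
    retry_backoff=True,          # exponential backoff between attempts
    max_retries=None,            # no upper bound on retries
)
def cleanup_owner(owner_id: int) -> None:
    ...  # idempotent body that picks up where the last attempt left off
```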
Right now, yes, those two need a bit of extra treatment, and they already have associated tasks for them. But you can just put any arbitrary model in as the entry point.
Thank you for the clarifications
I particularly like the usage of cleanup_context in a contextmanager. Very nice 👍
This implements a generic framework for "delete anything referencing this object".
To do so, it builds a relationship graph based on Django models/metadata, augmented with some manually defined relationships that are not encoded in the Django models.
This graph is then processed in topological sort order, meaning that it deletes leaf models first before advancing to other models, thus avoiding cascading deletes on the database side.
It is also possible to hook up custom code for specific models. This is used to properly delete any kind of file in storage that is referenced by those models (see the sketch below).
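As a sketch of what such a hook could look like — the registry and every name below are illustrative assumptions, not the PR's actual API:

```python
# Illustrative sketch: a registry mapping a model class to a custom cleanup
# callable that runs instead of a plain `.delete()`.
from typing import Callable

from django.db import models

CleanupFn = Callable[[models.QuerySet], int]

# Models whose rows reference files in storage get a manual cleanup function;
# everything else falls through to the default row deletion.
MANUAL_CLEANUP: dict[type[models.Model], CleanupFn] = {}


def cleanup_queryset(model: type[models.Model], queryset: models.QuerySet) -> int:
    if handler := MANUAL_CLEANUP.get(model):
        return handler(queryset)  # e.g. delete storage files first, then rows
    deleted, _ = queryset.delete()
    return deleted
```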
Previous code existed to delete all the "repo files"; it did so by first enumerating all the files in storage matching a "per repository" naming pattern. Even though that code might have worked for smaller repositories, I assume it would not have worked properly for larger repositories with a huge number of files.
The new storage deletion code instead deletes files in batches using a thread pool, and it also covers file name patterns (and different storage buckets, in the case of BA) that do not match the "repo files" pattern.
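A minimal sketch of that batched, threaded delete, assuming a storage client with a `delete_files(bucket, paths)` method; the names and batch size are illustrative, not the actual implementation:

```python
# Minimal sketch of the batched, thread-pooled storage deletion described
# above. `storage.delete_files` is an assumed client method.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 100


def delete_stored_files(storage, refs: list[tuple[str, str]]) -> None:
    # Group the (bucket, path) pairs referenced by the deleted rows; some
    # files (e.g. the BA ones) live in a different bucket than repo files.
    buckets_paths: dict[str, list[str]] = defaultdict(list)
    for bucket, path in refs:
        buckets_paths[bucket].append(path)

    # A fresh pool per call, matching the current (non-reused) design.
    with ThreadPoolExecutor() as pool:
        for bucket, paths in buckets_paths.items():
            for i in range(0, len(paths), BATCH_SIZE):
                pool.submit(storage.delete_files, bucket, paths[i : i + BATCH_SIZE])
```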
This new generic deletion framework is hooked up to replace the existing FlushRepo and DeleteOwner tasks.
The tasks are built in such a way that they can time out and be rerun, and they will just pick up where they left off (going over the same topological sort order), assuming the whole task is not being run in a transaction. (I hope not, but given how many problems we have with DB locking and transactions, who knows 🤷🏻‍♂️)
This is work towards codecov/engineering-team#1127