[R&D] feat: add object storage cleaner script and testing #1445
Conversation
Currently QFieldCloud keeps all project files on the non-legacy storage even if they are deleted by the user. This is due to versioning being enabled on Exoscale and the lack of an Exoscale mechanism to delete files a certain amount of time after they are deleted. This causes excessive file storage and therefore costs.

- Add `storage_cleaner.py` for scanning and deleting logically deleted objects in S3 compatible storages
- Add `test.py` for testing the script
gounux left a comment:
Thanks a lot, looks promising!
This is a first review round; I did not check it out locally nor test it yet, mainly some typing & formalism comments.
suricactus left a comment:
Summary of the review:
The script needs a few iterations before it gets into better shape, so be patient :)
There are a few things that are still not addressed since the preliminary review via chat.
In short, I think the script is more complex than it should be, both in terms of algorithmic approach and the way it is written. A bit more OOP than to my taste; this is a script in the end.
Note this script will be used by people who are not necessarily experts in Python. Write the script as if someone who sees Python for the first time will read it.
Please ping me when the current comments are addressed for a second round. Feel free to simplify things that you recognize as complex or not obvious, but that were not commented on in this round.
```python
return status == "Enabled"
```

```python
def _iter_all_versions(self) -> Iterator[VersionRef]:
```
I still don't get the purpose of this and the `VersionRef` object.
Why do we need to "rename" the objects in any way, since they are already properly structured?
If you need it just for typing, use https://youtype.github.io/boto3_stubs_docs/mypy_boto3_s3/service_resource/#objectversion .
I am also curious whether we can't abandon this whole thing and just use `Bucket.list_versions`? See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/collections.html and https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/bucket/object_versions.html#filter .
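For illustration, a minimal sketch of that collections approach, assuming a hypothetical bucket name; delete markers show up in the same collection (their `size` should come back as `None` in the resource API, if I read the docs right):

```python
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-bucket")  # hypothetical bucket name

# Iterates both regular versions and delete markers under the prefix.
for version in bucket.object_versions.filter(Prefix="projects/"):
    # Delete markers are included; their size is None rather than an int.
    print(version.key, version.version_id, version.last_modified, version.size)
```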
> I still don't get the purpose of this and the VersionRef object.

Versions have a `Size` attribute, DeleteMarkers do not. Accessing `.size` on a DeleteMarker object from the raw API would raise an error.

> I am also curious can't we abandon this whole thing and just use Bucket.list_versions? See

I'm not sure I got this right, but I couldn't find anything related to `Bucket.list_versions`; am I missing something?
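For reference, the raw API shape in question, sketched with a hypothetical bucket: `list_object_versions` returns `Versions` (each with a `Size` key) and `DeleteMarkers` (without one) as separate lists:

```python
import boto3

client = boto3.client("s3")
paginator = client.get_paginator("list_object_versions")

for page in paginator.paginate(Bucket="my-bucket"):
    for v in page.get("Versions", []):
        print("version", v["Key"], v["VersionId"], v["Size"])
    for dm in page.get("DeleteMarkers", []):
        # No "Size" key here, so dm["Size"] would raise a KeyError.
        print("delete-marker", dm["Key"], dm["VersionId"])
```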
You are right, there is no size attribute, but I still find:

```python
size = 0
if v.is_delete_marker:
    # delete markers do not have a size attribute
    size += 0
else:
    size += v["size"]
```

OR

```python
# delete markers...
size += v.get("size", 0)  # might be getattr
```

This way we drop the need for two classes and a special function. Less code is more, even if we have the false perception that the classes will give us explicit typing etc.
Again, this is a script we must be able to debug easily by anyone, most likely people outside our team. Keep it as simple as possible. The native types from boto3 will always be better documented.
I think having the data classes makes it easier to read, at least from my perspective, since if I removed `VersionRef` I would need to manually map out the structure, but when using the data class I can clearly check the reference and its structure.
This is my personal way of viewing code: anything with a definition / types makes me more comfortable, even if it takes a couple more lines of code.
At the beginning I didn't lean into having an OOP structure at all, but due to the complexity of this script, making it OOP and having data classes made it a bit easier and clearer for me.
I don't want to overrule here if it is simpler for you. Don't take this comment as a blocker, but more as inspiration.
I will just summarize my suggestions that will make this script and its maintenance easier, based on my experience:

- avoid OOP, especially in scripts. Use imperative code; it's a simpler mental model. The complexity in OOP comes mostly from state management, so even if you stick with OOP, avoid using `self` as a default storage for stuff whenever possible. My experience has shown me that passing state as read-only parameters/return values is not immediately obvious at writing time, but is easier to follow at reading time.
- avoid introducing new types when types already exist and are well documented. If this tiny wrapper helps you follow the code, I am OK with it; I find it redundant though.
- keep it very obvious what is going on by reducing block nesting with early returns (see the tiny sketch after this list). Reduce the number of methods/functions to avoid jumping around the code.
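A tiny illustration of that early-return style (a hypothetical helper with an assumed `is_delete_marker` attribute, not code from the PR):

```python
def version_size(version) -> int:
    # Early return keeps the common path unindented and the intent obvious.
    if version.is_delete_marker:
        # Delete markers have no size.
        return 0

    return version.size
```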
Please check with @toebivankenoebi, who will most likely deploy the script in the end, how he feels about the OOP approach at a bit later stage, when the script is ready for a semi-final review.
```python
aggregate = LogicallyDeletedObject(
    key=key,
    deleted_at=latest.last_modified,
    total_size_bytes=total_size,
    versions_count=count,
)
return aggregate, versions
```
Suggested change:

```diff
-        aggregate = LogicallyDeletedObject(
-            key=key,
-            deleted_at=latest.last_modified,
-            total_size_bytes=total_size,
-            versions_count=count,
-        )
-        return aggregate, versions
+        aggregate = LogicallyDeletedObject(
+            key=key,
+            deleted_at=latest.last_modified,
+            total_size_bytes=total_size,
+            versions_count=count,
+            versions=versions,
+        )
+        return aggregate
```
Avoid tuple returns; they are hard to read, especially when you have an object under your control.
Prefer a newline before returns :)
`versions` are used in the clean function, where we actually delete them.
Please do not use tuple returns
```python
# Filter options
parser.add_argument(
    "--deleted-after",
    type=parse_dt,
    help="Only process objects deleted after this date (ISO 8601 format)",
)
```
Why is that? We don't need a fixed date, we need a rolling date. So it should be `--since-days`; otherwise, how will the caller of the script calculate the proper date?
I remember that this was introduced in the first revision of the script, and we moved forward with a specific datetime argument rather than a rolling days window.
So instead of using `--since-days 30` we use `--deleted-after 2025-12-01`.
The reason behind this approach is to have more control over the date; it is also easier to test locally without mocking.
I am sorry for this misunderstanding; I either didn't mean that or I just said something that does not make sense at all. Feel free to point it out to me over chat so I can print it and hang it on my wall of shame.
`--since-days`/`--deleted-for-days` should be the case, as the script is always rolling forward and a strict date cannot be passed. In theory the arg can be smarter, like `--deleted-for 30 days` or `--deleted-for 30 seconds`. Make sure the days and seconds are in their full names, so there is no accidental overlook.
This is a great idea. While it will add some complexity, I agree with you it's worth it, since it will make it easier for a cron job with predefined calls rather than calculating the dates again.
Thank you for the suggestion.
Please still kill it :)
- Add tests for handling large data uploads and complex version histories.
- Enhance progress tracking by adding a progress callback.
- Refactor scan and clean methods for better readability and maintainability.
- Add logging functionality to replace print.
- Enhance output formatting for bucket information and progress updates.
- Introduce a TRACE logging level for detailed operation logs.
- Update the main function to support debug logging and log file options.
Force-pushed from 7356a59 to 9c72672
- Implemented a new command-line argument `--since` to filter deleted objects based on a specified duration.
- Added a helper function to parse various time formats for the `--since` argument.
- Enhanced tests to validate the functionality of the `--since` filter.

…n object storage cleaner

- Changed regex pattern to accept full time unit words instead of abbreviations.
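For the record, a hedged sketch of what such a parser could look like with full unit words only; the function name `parse_since` and the exact unit list are assumptions, not the PR's actual code:

```python
import re
from datetime import datetime, timedelta, timezone

# Full unit words only ("30 days", not "30d"), per the review discussion.
SINCE_PATTERN = re.compile(r"^\s*(\d+)\s+(seconds|minutes|hours|days|weeks)\s*$")


def parse_since(value: str) -> datetime:
    """Convert a duration like '30 days' into an absolute UTC cutoff datetime."""
    match = SINCE_PATTERN.match(value)
    if not match:
        raise ValueError(f"invalid duration {value!r}, expected e.g. '30 days'")

    amount, unit = int(match.group(1)), match.group(2)
    # timedelta accepts seconds/minutes/hours/days/weeks as keyword arguments.
    return datetime.now(timezone.utc) - timedelta(**{unit: amount})
```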
```python
def setup_logging(debug: bool = False, log_file: str | None = None) -> None:
    """Setup logging for the script."""
    level = TRACE_LEVEL if log_file else (logging.DEBUG if debug else logging.INFO)
```
Please do not use inline if in QFieldCloud code. This is unreadable by the common folks. Do not apply such magic; make as much of the configuration as possible required and explicit. The log level, if not the default, should be passed as a CLI param/envvar and no magical logic should be applied.
```python
# 4. File Handler (The Audit Log)
if log_file:
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)
    file_handler = RotatingFileHandler(
        log_file, maxBytes=1000 * 1024 * 1024, backupCount=5, encoding="utf-8"
    )
    file_handler.setLevel(TRACE_LEVEL)
    file_handler.setFormatter(
        logging.Formatter(
            "%(asctime)s [%(levelname)s] %(message)s", datefmt="%Y-%m-%d %H:%M:%S"
        )
    )
    logger.addHandler(file_handler)
```
Leave this to the outer level; there are log rotators that do this much better than this code. Just kill this functionality.
I'm not sure if I'm getting it quite well; I would appreciate it if you can share other log rotators, so I have a reference.
The main idea behind this was enabling log files so we can trace what will be deleted and what is deleted, taking advantage of the current logging setup. E.g.:

```python
for v in versions:
    logger.log(
        TRACE_LEVEL,
        "Deleting object: Key=%s Version=%s",
        v["Key"],
        v["VersionId"],
    )
```

and

```python
for v in versions:
    logger.log(
        TRACE_LEVEL,
        "Version: Key=%s VersionId=%s Size=%s LastModified=%s",
        v.key,
        v.version_id,
        v.size,
        v.last_modified,
    )
```
One run of the script should produce one log file. We can easily do that with a timestamp suffix.
If we want cleanup or something more advanced, our good friend https://linux.die.net/man/8/logrotate exists.
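A minimal sketch of the one-run-one-file idea with a timestamp suffix (the `logs/` directory and file name pattern are assumptions); rotation and cleanup stay with logrotate:

```python
import logging
from datetime import datetime, timezone
from pathlib import Path


def setup_run_log(log_dir: str = "logs") -> None:
    # One file per run, named by the UTC start time of the run.
    started_at = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    log_path = Path(log_dir) / f"storage_cleaner_{started_at}.log"
    log_path.parent.mkdir(parents=True, exist_ok=True)

    logging.basicConfig(
        filename=log_path,
        level=logging.INFO,
        format="%(asctime)s [%(levelname)s] %(message)s",
    )
```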
```python
# 2. Console Handler (The User UI)
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setLevel(logging.DEBUG if debug else logging.INFO)
console_handler.setFormatter(logging.Formatter("%(message)s"))

# Filter out errors (they go to stderr)
console_handler.addFilter(lambda record: record.levelno <= logging.INFO)
logger.addHandler(console_handler)

# 3. Stderr Handler (Errors)
error_handler = logging.StreamHandler(sys.stderr)
error_handler.setLevel(logging.WARNING)
error_handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(error_handler)
```
Have a single output from the script, ideally to STDOUT.
I believe that the main motivation behind using logging overall was to improve quality and have this flexibility; otherwise, why didn't we simply stick with print from the beginning?
The main motivation behind the separation of handlers is that when piping or redirecting the output, we don't get the user-facing output in the files or pipe.
Also, this approach will allow us to extend the logger later on, for example for log reporting. I think if these are invalid, then print was just fine even if it's not the go-to practice; it will get the job done and will be easy to read.
Let me know if I'm missing something.
Flexibility here is a debt that I am not sure the current team can pay.
A single file with a lot of grepping is easier to manage than multiple files. It might sound ridiculous, but that is the sad truth of my experience managing multiple scripts.
Switching from print to logging has multiple benefits, including log level control, something you can never do with prints without if-ing all the time. Multiple output streams are something we may benefit from in the future, but nobody meant for them to be implemented right away.
By the way, check https://docs.python.org/3/library/logging.config.html#dictionary-schema-details which does similar things without the need to write any code.
I think I know what the main (good) motivation behind all this was, but I am pretty sure we don't need it as of now. I personally don't know when and how this script is going to fail; I guess nobody in the universe knows. Write the basic logs, and write them in an easy-to-manage way. Let's first collect some experience with real-world problems before we try to solve probably non-existing ones by adding too much branching and clumsiness in the code.
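For illustration, a minimal `dictConfig` in the spirit of that suggestion: a single plain stdout handler and the level set in one obvious place (names here are illustrative):

```python
import logging
import logging.config

LOGGING = {
    "version": 1,
    "formatters": {
        "plain": {"format": "%(asctime)s [%(levelname)s] %(message)s"},
    },
    "handlers": {
        "stdout": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",
            "formatter": "plain",
        },
    },
    "root": {"level": "INFO", "handlers": ["stdout"]},
}

logging.config.dictConfig(LOGGING)
logging.getLogger(__name__).info("storage cleaner started")
```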
```python
def _print_progress(self, delta: int, label: str = "Deletable") -> None:
    self._scanned_count += delta
    if sys.stdout.isatty():
```
Do not do any branching on printing based on tty-ness. Just print a regular log from the caller instead. Kill this function altogether.
I think this function is very helpful, since it gives the feeling that the script is actually running: if you have 100k+ files, this script will take time to run a scan or a prune, and without any output / feedback I think it will be a bit confusing.
Regarding "printing based on tty-ness": this was a workaround so it looks better and doesn't keep printing new lines, and also to avoid anything going to pipes or log files.
Let me know if you think there is a better solution for this.
Just print to STDOUT and you don't have to solve any of these problems.
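A sketch of that plain approach, assuming `versions` stands in for the iterable of object versions being scanned and a made-up reporting interval:

```python
PROGRESS_EVERY = 1000  # assumed reporting interval

versions: list = []  # stand-in for the real version iterator
scanned = 0
for version in versions:
    scanned += 1
    if scanned % PROGRESS_EVERY == 0:
        # Plain line-by-line stdout: greppable and pipe-safe, no tty checks needed.
        print(f"scanned {scanned} versions so far")
```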
suricactus left a comment:
We are moving forward, but there are more things to be removed:

- please do not use inline ifs (binary/ternary operators) in Python code related to QField and QFieldCloud. The Python syntax for them is unreadable for people coming from other languages and it is confusing even for regular Pythonistas sometimes. The only reason we don't have https://github.com/afonasev/flake8-if-expr enabled is because it is slow, and one can argue they might be useful in simple situations.
- simplify the logging logic: it should just log simple strings with a minimal dynamic part.
- avoid flushing the buffers: there is PYTHONUNBUFFERED for people who really need it.
- print everything line by line to stdout and leave the caller to do whatever they want with the buffer. It must be easy for grepping.
- do not add functionality that is very well handled by other tools, e.g. log rotation.
- avoid complex logic with logging and if statements where a log appears. That's why different log levels exist. It should be very simple to follow what is going on from the log without any evaluation of if statements.
- please remove the "Connection options" params and use envvars for that purpose.
- avoid using tuples as return values, unless it is well argued and documented why there is no other option.
- avoid using default values for script parameters. If you need a parameter, it should be explicit. If you don't need it explicit, then drop the parameter.
- in scripts like this, do not use inline `for`s and `map`, `filter` and `reduce`. Functional code is less readable in case of emergency.

Once addressed, please consider the following:

- remove the OOP abstraction for state management and use pure imperative code.
- avoid using the non-native container object `VersionRef`.

The short summary is: start small and only add functionality if really required.
| parser.add_argument("--region", help="Region") | ||
| parser.add_argument("--profile", help="Profile") | ||
| parser.add_argument("--endpoint-url", help="Custom S3 endpoint URL") |
Kill all of these args
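For reference, the envvar-based alternative is essentially zero code, since botocore reads the connection settings from the environment (a sketch; `AWS_ENDPOINT_URL` support assumes a reasonably recent boto3):

```python
import boto3

# Assuming the environment provides AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
# AWS_DEFAULT_REGION and AWS_ENDPOINT_URL, no connection CLI args are needed:
s3 = boto3.resource("s3")
```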
```python
# Filter options
parser.add_argument(
    "--deleted-after",
    type=parse_dt,
    help="Only process objects deleted after this date (ISO 8601 format)",
)
```
Please still kill it :)
```python
# Action flags
parser.add_argument(
    "--info", action="store_true", help="Show bucket information (default behavior)"
)
```
Kill it
```python
parser.add_argument(
    "--prune",
```
Suggested change:

```diff
-    parser.add_argument(
-        "--prune",
+    parser.add_argument(
+        "--permanently-delete-versions",
```
Be scary when there is a chance to destroy data. It should be very, very explicit what it is going to do.
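A hypothetical sketch of how scary/explicit such a flag could be; the flag name follows the suggestion above, the prompt wording is illustrative:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--permanently-delete-versions",
    action="store_true",
    help="PERMANENTLY delete all versions of logically deleted objects. IRREVERSIBLE!",
)
parser.add_argument("--noinput", action="store_true", help="Skip confirmation prompt")
args = parser.parse_args()

if args.permanently_delete_versions and not args.noinput:
    # Require the user to type the word out in full before destroying data.
    answer = input("Type 'yes' to PERMANENTLY delete the listed versions: ")
    if answer != "yes":
        raise SystemExit("Aborted, nothing was deleted.")
```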
```python
)

parser.add_argument(
    "--debug",
```
| "--debug", | |
| "--log-level", |
Make it use an enum (or something native, no idea) that accepts the built-in log levels.
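One possible native take, a sketch using the built-in level names as argparse choices (exact wiring is up to the author):

```python
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument(
    "--log-level",
    choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
    default="INFO",
    help="Python logging level name",
)
args = parser.parse_args()

# getattr maps the name to the numeric level, e.g. "INFO" -> logging.INFO.
logging.basicConfig(level=getattr(logging, args.log_level))
```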
```python
parser.add_argument(
    "--log-file",
    default=None,
    help="Write a rotating logbook file (e.g. /tmp/s3-cleaner.log)",
)
```
Kill
```python
return None

# Sort by time descending (latest first)
versions.sort(key=lambda x: x.last_modified, reverse=True)
```
Because I am lazy, a note for myself:

```sh
# note: boto3 expects AWS_SECRET_ACCESS_KEY (fixed from AWS_ACCESS_SECRET_KEY)
export AWS_ENDPOINT_URL=http://172.17.0.1:8009 AWS_ACCESS_KEY_ID=rustfsadmin AWS_SECRET_ACCESS_KEY=rustfsadmin
```

```python
import boto3

s3 = boto3.resource("s3")
bucket_name = "qfieldcloud-local"

# wipe and recreate the bucket with versioning enabled
bucket = s3.Bucket(bucket_name)
bucket.object_versions.delete()
bucket.delete()
bucket = s3.create_bucket(Bucket=bucket_name)
s3.BucketVersioning(bucket_name).enable()

# seed test objects, including one key with two versions
root_key = "file"
special_key = "SPECIAL__FILE__000"
bucket.put_object(Key=special_key, Body="v1")
bucket.put_object(Key=special_key, Body="v2")

for i in range(1, 1000):
    key = f"{root_key}__{i:>03d}"
    bucket.put_object(Key=key, Body=key)

# logically delete two objects (adds delete markers)
bucket.Object(special_key).delete()
bucket.Object("file__998").delete()

for idx, v in enumerate(bucket.object_versions.all()):
    print(f"{idx:>03d}", v.key, v.version_id, v.size)
```
I am closing this, unless we want to keep it open for some reason?
- Add `storage_cleaner.py` for scanning and deleting logically deleted objects in S3 compatible storages
- Add `test.py` for testing the script
- Add `README.MD`

README.MD

A small CLI to analyze and clean logically deleted objects in versioned S3-compatible buckets.

This script is provider-agnostic and works with any S3-compatible service that supports versioning, such as MinIO, etc. It helps you do the following:

Requirements

Usage

1. No action provided or `--info`
2. Scanning for logically deleted objects
3. Deleting logically deleted objects

Arguments

| Argument | Description |
| --- | --- |
| `--noinput` | Skip confirmation prompt (use with `--prune` for automation) |
| `--deleted-after` | Only process objects deleted after this date (ISO 8601 format, e.g. 2024-12-01T00:00:00). Cannot be used with `--deleted-since`. |
| `--deleted-since` | Time duration ago (e.g. '3 days', '2 weeks'). Cannot be used with `--deleted-after`. |
| `--prefix` | Key prefix to scan |
| `--profile` | Object storage profile |
| `--debug` | Enable debug logging |
| `--log-file` | Write a rotating logbook file (e.g. /tmp/s3-cleaner.log) |
| `--scan` | Scan only |
| `--prune` | Delete found logically deleted objects (requires confirmation unless `--noinput`) |
| `--info` | Show bucket information (default behavior) |

By default it will load the `default` local S3 configuration.

Testing

Added multiple test cases.

Demo

Scan

Prune
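For completeness, two hypothetical invocations (flag names follow the Arguments table above; the actual script path may differ):

```sh
# scan only, restricted to a key prefix
python storage_cleaner.py --scan --prefix projects/

# delete logically deleted objects older than two weeks, without prompting
python storage_cleaner.py --prune --deleted-since "2 weeks" --noinput
```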