[R&D] feat: add object storage cleaner script and testing #1445
Conversation
Currently QFieldCloud keeps all project files on the non-legacy storage even if they are deleted by the user. This is due to versioning being enabled on Exoscale and the lack of an Exoscale mechanism to delete files a certain amount of time after they are deleted. This causes excessive file storage and therefore costs.

- Add `storage_cleaner.py` for scanning and deleting logically deleted objects in S3 compatible storages
- Add `test.py` for testing the script
gounux left a comment:
Thanks a lot, looks promising!
This is a first review round; I did not check it out locally nor test it yet, mainly some typing & formalism comments.
suricactus left a comment:
Summary of the review:
The script needs a few iterations before it gets into better shape, so be patient :)
There are a few things that are still not addressed since the preliminary review via chat.
In short, I think the script is more complex than it should be, both in terms of algorithmic approach and the way it is written. A bit more OOP than to my taste; this is a script in the end.
Note this script will be used by people who are not necessarily experts in Python. Write the script as if someone who sees Python for the first time will read it.
Please ping me when the current comments are addressed for a second round. Feel free to simplify things that you recognize as complex or not obvious, but that were not commented on in this round.
```python
return status == "Enabled"
```

```python
def _iter_all_versions(self) -> Iterator[VersionRef]:
```
I still don't get the purpose of this and the `VersionRef` object.
Why do we need to "rename" the objects in any way, since they are already properly structured?
If you need it just for typing, use https://youtype.github.io/boto3_stubs_docs/mypy_boto3_s3/service_resource/#objectversion .
I am also curious whether we can't abandon this whole thing and just use `Bucket.list_versions`? See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/collections.html and https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/bucket/object_versions.html#filter .
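For illustration, a minimal sketch of that collections approach, assuming a hypothetical bucket name; delete markers show up in the same collection (their `size` should come back as `None` in the resource API, if I read the docs right):

```python
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-bucket")  # hypothetical bucket name

# Iterates both regular versions and delete markers under the prefix.
for version in bucket.object_versions.filter(Prefix="projects/"):
    # Delete markers are included; their size is None rather than an int.
    print(version.key, version.version_id, version.last_modified, version.size)
```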
> I still don't get the purpose of this and the VersionRef object.

Versions have a `Size` attribute, DeleteMarkers do not. Accessing `.size` on a DeleteMarker object from the raw API would raise an error.

> I am also curious can't we abandon this whole thing and just use Bucket.list_versions? See

I'm not sure I got this right, but I couldn't find anything related to `Bucket.list_versions`; am I missing something?
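For reference, the raw API shape in question, sketched with a hypothetical bucket: `list_object_versions` returns `Versions` (each with a `Size` key) and `DeleteMarkers` (without one) as separate lists:

```python
import boto3

client = boto3.client("s3")
paginator = client.get_paginator("list_object_versions")

for page in paginator.paginate(Bucket="my-bucket"):
    for v in page.get("Versions", []):
        print("version", v["Key"], v["VersionId"], v["Size"])
    for dm in page.get("DeleteMarkers", []):
        # No "Size" key here, so dm["Size"] would raise a KeyError.
        print("delete-marker", dm["Key"], dm["VersionId"])
```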
You are right, there is no size attribute, but I still find:

```python
size = 0
if v.is_delete_marker:
    # delete markers do not have a size attribute
    size += 0
else:
    size += v["size"]
```

OR

```python
# delete markers...
size += v.get("size", 0)  # might be getattr
```

This way we drop the need for two classes and a special function. Less code is more, even if we have the false perception that the classes will give us explicit typing etc.
Again, this is a script we must be able to debug easily by anyone, most likely people outside our team. Keep it as simple as possible. The native types from boto3 will always be better documented.
I think having the data classes makes it easier to read, at least from my perspective, since if I removed `VersionRef` I would need to manually map out the structure, but when using the data class I can clearly check the reference and its structure.
This is my personal way of viewing code: anything with a definition / types makes me more comfortable, even if it takes a couple more lines of code.
At the beginning I didn't lean into having an OOP structure at all, but due to the complexity of this script, making it OOP and having data classes made it a bit easier and clearer for me.
I don't want to overrule here if it is simpler for you. Don't take this comment as a blocker, but more as inspiration.
I will just summarize my suggestions that will make this script and its maintenance easier, based on my experience:

- avoid OOP, especially in scripts. Use imperative code; it's a simpler mental model. The complexity in OOP comes mostly from state management, so even if you stick with OOP, avoid using `self` as a default storage for stuff whenever possible. My experience has shown me that passing state as read-only parameters/return values is not immediately obvious at writing time, but is easier to follow at reading time.
- avoid introducing new types when types already exist and are well documented. If this tiny wrapper helps you follow the code, I am OK with it; I find it redundant though.
- keep it very obvious what is going on by reducing block nesting with early returns (see the tiny sketch after this list). Reduce the number of methods/functions to avoid jumping around the code.
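A tiny illustration of that early-return style (a hypothetical helper with an assumed `is_delete_marker` attribute, not code from the PR):

```python
def version_size(version) -> int:
    # Early return keeps the common path unindented and the intent obvious.
    if version.is_delete_marker:
        # Delete markers have no size.
        return 0

    return version.size
```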
Please check with @toebivankenoebi, who will most likely deploy the script in the end, how he feels about the OOP approach at a bit later stage, when the script is ready for a semi-final review.
```python
aggregate = LogicallyDeletedObject(
    key=key,
    deleted_at=latest.last_modified,
    total_size_bytes=total_size,
    versions_count=count,
)
return aggregate, versions
```
Suggested change:

```diff
-        aggregate = LogicallyDeletedObject(
-            key=key,
-            deleted_at=latest.last_modified,
-            total_size_bytes=total_size,
-            versions_count=count,
-        )
-        return aggregate, versions
+        aggregate = LogicallyDeletedObject(
+            key=key,
+            deleted_at=latest.last_modified,
+            total_size_bytes=total_size,
+            versions_count=count,
+            versions=versions,
+        )
+        return aggregate
```
Avoid tuple returns; they are hard to read, especially when you have an object under your control.
Prefer a newline before returns :)
`versions` are used in the clean function, where we actually delete them.
Please do not use tuple returns
```python
# Filter options
parser.add_argument(
    "--deleted-after",
    type=parse_dt,
    help="Only process objects deleted after this date (ISO 8601 format)",
)
```
Why is that? We don't need a fixed date, we need a rolling date. So it should be `--since-days`; otherwise, how will the caller of the script calculate the proper date?
I remember that this was introduced in the first revision of the script, and we moved forward with a specific datetime argument rather than a rolling days window.
So instead of using `--since-days 30` we use `--deleted-after 2025-12-01`.
The reason behind this approach is to have more control over the date; it is also easier to test locally without mocking.
I am sorry for this misunderstanding; I either didn't mean that or I just said something that does not make sense at all. Feel free to point it out to me over chat so I can print it and hang it on my wall of shame.
`--since-days`/`--deleted-for-days` should be the case, as the script is always rolling forward and a strict date cannot be passed. In theory the arg can be smarter, like `--deleted-for 30 days` or `--deleted-for 30 seconds`. Make sure the days and seconds are in their full names, so there is no accidental overlook.
This is a great idea. While it will add some complexity, I agree with you it's worth it, since it will make it easier for a cron job with predefined calls rather than calculating the dates again.
Thank you for the suggestion.
Please still kill it :)
- Add tests for handling large data uploads and complex version histories.
- Enhance progress tracking by adding a progress callback.
- Refactor scan and clean methods for better readability and maintainability.
- Add logging functionality to replace print.
- Enhance output formatting for bucket information and progress updates.
- Introduce a TRACE logging level for detailed operation logs.
- Update the main function to support debug logging and log file options.
Force-pushed from 7356a59 to 9c72672
- Implemented a new command-line argument `--since` to filter deleted objects based on a specified duration.
- Added a helper function to parse various time formats for the `--since` argument.
- Enhanced tests to validate the functionality of the `--since` filter.

…n object storage cleaner

- Changed regex pattern to accept full time unit words instead of abbreviations.
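For the record, a hedged sketch of what such a parser could look like with full unit words only; the function name `parse_since` and the exact unit list are assumptions, not the PR's actual code:

```python
import re
from datetime import datetime, timedelta, timezone

# Full unit words only ("30 days", not "30d"), per the review discussion.
SINCE_PATTERN = re.compile(r"^\s*(\d+)\s+(seconds|minutes|hours|days|weeks)\s*$")


def parse_since(value: str) -> datetime:
    """Convert a duration like '30 days' into an absolute UTC cutoff datetime."""
    match = SINCE_PATTERN.match(value)
    if not match:
        raise ValueError(f"invalid duration {value!r}, expected e.g. '30 days'")

    amount, unit = int(match.group(1)), match.group(2)
    # timedelta accepts seconds/minutes/hours/days/weeks as keyword arguments.
    return datetime.now(timezone.utc) - timedelta(**{unit: amount})
```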
```python
def setup_logging(debug: bool = False, log_file: str | None = None) -> None:
    """Setup logging for the script."""
    level = TRACE_LEVEL if log_file else (logging.DEBUG if debug else logging.INFO)
```
Please do not use inline if in QFieldCloud code. This is unreadable by the common folks. Do not apply such magic; make as much of the configuration as possible required and explicit. The log level, if not the default, should be passed as a CLI param/envvar and no magical logic should be applied.
```python
# 4. File Handler (The Audit Log)
if log_file:
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)
    file_handler = RotatingFileHandler(
        log_file, maxBytes=1000 * 1024 * 1024, backupCount=5, encoding="utf-8"
    )
    file_handler.setLevel(TRACE_LEVEL)
    file_handler.setFormatter(
        logging.Formatter(
            "%(asctime)s [%(levelname)s] %(message)s", datefmt="%Y-%m-%d %H:%M:%S"
        )
    )
    logger.addHandler(file_handler)
```
Leave this to the outer level; there are log rotators that do this much better than this code. Just kill this functionality.
I'm not sure if I'm getting it quite well; I would appreciate it if you can share other log rotators, so I have a reference.
The main idea behind this was enabling log files so we can trace what will be deleted and what is deleted, taking advantage of the current logging setup. E.g.:

```python
for v in versions:
    logger.log(
        TRACE_LEVEL,
        "Deleting object: Key=%s Version=%s",
        v["Key"],
        v["VersionId"],
    )
```

and

```python
for v in versions:
    logger.log(
        TRACE_LEVEL,
        "Version: Key=%s VersionId=%s Size=%s LastModified=%s",
        v.key,
        v.version_id,
        v.size,
        v.last_modified,
    )
```
One run of the script should produce one log file. We can easily do that with a timestamp suffix.
If we want cleanup or something more advanced, our good friend https://linux.die.net/man/8/logrotate exists.
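A minimal sketch of the one-run-one-file idea with a timestamp suffix (the `logs/` directory and file name pattern are assumptions); rotation and cleanup stay with logrotate:

```python
import logging
from datetime import datetime, timezone
from pathlib import Path


def setup_run_log(log_dir: str = "logs") -> None:
    # One file per run, named by the UTC start time of the run.
    started_at = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    log_path = Path(log_dir) / f"storage_cleaner_{started_at}.log"
    log_path.parent.mkdir(parents=True, exist_ok=True)

    logging.basicConfig(
        filename=log_path,
        level=logging.INFO,
        format="%(asctime)s [%(levelname)s] %(message)s",
    )
```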
```python
# 2. Console Handler (The User UI)
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setLevel(logging.DEBUG if debug else logging.INFO)
console_handler.setFormatter(logging.Formatter("%(message)s"))

# Filter out errors (they go to stderr)
console_handler.addFilter(lambda record: record.levelno <= logging.INFO)
logger.addHandler(console_handler)

# 3. Stderr Handler (Errors)
error_handler = logging.StreamHandler(sys.stderr)
error_handler.setLevel(logging.WARNING)
error_handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(error_handler)
```
Have a single output from the script, ideally to STDOUT.
I believe that the main motivation behind using logging overall was to improve quality and have this flexibility; otherwise, why didn't we simply stick with print from the beginning?
The main motivation behind the separation of handlers is that when piping or redirecting the output, we don't get the user-facing output in the files or pipe.
Also, this approach will allow us to extend the logger later on, for example for log reporting. I think if these are invalid, then print was just fine even if it's not the go-to practice; it will get the job done and will be easy to read.
Let me know if I'm missing something.
Flexibility here is a debt that I am not sure the current team can pay.
A single file with a lot of grepping is easier to manage than multiple files. It might sound ridiculous, but that is the sad truth of my experience managing multiple scripts.
Switching from print to logging has multiple benefits, including log level control, something you can never do with prints without if-ing all the time. Multiple output streams are something we may benefit from in the future, but nobody meant for them to be implemented right away.
By the way, check https://docs.python.org/3/library/logging.config.html#dictionary-schema-details which does similar things without the need to write any code.
I think I know what the main (good) motivation behind all this was, but I am pretty sure we don't need it as of now. I personally don't know when and how this script is going to fail; I guess nobody in the universe knows. Write the basic logs, and write them in an easy-to-manage way. Let's first collect some experience with real-world problems before we try to solve probably non-existing ones by adding too much branching and clumsiness in the code.
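For illustration, a minimal `dictConfig` in the spirit of that suggestion: a single plain stdout handler and the level set in one obvious place (names here are illustrative):

```python
import logging
import logging.config

LOGGING = {
    "version": 1,
    "formatters": {
        "plain": {"format": "%(asctime)s [%(levelname)s] %(message)s"},
    },
    "handlers": {
        "stdout": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",
            "formatter": "plain",
        },
    },
    "root": {"level": "INFO", "handlers": ["stdout"]},
}

logging.config.dictConfig(LOGGING)
logging.getLogger(__name__).info("storage cleaner started")
```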
```python
def _print_progress(self, delta: int, label: str = "Deletable") -> None:
    self._scanned_count += delta
    if sys.stdout.isatty():
```
Do not do any branching on printing based on tty-ness. Just print a regular log from the caller instead. Kill this function altogether.
I think this function is very helpful, since it gives the feeling that the script is actually running: if you have 100k+ files, this script will take time to run a scan or a prune, and without any output / feedback I think it will be a bit confusing.
Regarding "printing based on tty-ness": this was a workaround so it looks better and doesn't keep printing new lines, and also to avoid anything going to pipes or log files.
Let me know if you think there is a better solution for this.
Just print to STDOUT and you don't have to solve any of these problems.
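A sketch of that plain approach, assuming `versions` stands in for the iterable of object versions being scanned and a made-up reporting interval:

```python
PROGRESS_EVERY = 1000  # assumed reporting interval

versions: list = []  # stand-in for the real version iterator
scanned = 0
for version in versions:
    scanned += 1
    if scanned % PROGRESS_EVERY == 0:
        # Plain line-by-line stdout: greppable and pipe-safe, no tty checks needed.
        print(f"scanned {scanned} versions so far")
```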
suricactus left a comment:
We are moving forward, but there are more things to be removed:

- please do not use inline ifs (binary/ternary operators) in Python code related to QField and QFieldCloud. The Python syntax for them is unreadable for people coming from other languages and it is confusing even for regular Pythonistas sometimes. The only reason we don't have https://github.com/afonasev/flake8-if-expr enabled is because it is slow, and one can argue they might be useful in simple situations.
- simplify the logging logic: it should just log simple strings with a minimal dynamic part.
- avoid flushing the buffers: there is PYTHONUNBUFFERED for people who really need it.
- print everything line by line to stdout and leave the caller to do whatever they want with the buffer. It must be easy for grepping.
- do not add functionality that is very well handled by other tools, e.g. log rotation.
- avoid complex logic with logging and if statements where a log appears. That's why different log levels exist. It should be very simple to follow what is going on from the log without any evaluation of if statements.
- please remove the "Connection options" params and use envvars for that purpose.
- avoid using tuples as return values, unless it is well argued and documented why there is no other option.
- avoid using default values for script parameters. If you need a parameter, it should be explicit. If you don't need it explicit, then drop the parameter.
- in scripts like this, do not use inline `for`s and `map`, `filter` and `reduce`. Functional code is less readable in case of emergency.

Once addressed, please consider the following:

- remove the OOP abstraction for state management and use pure imperative code.
- avoid using the non-native container object `VersionRef`.

The short summary is: start small and only add functionality if really required.
| parser.add_argument("--region", help="Region") | ||
| parser.add_argument("--profile", help="Profile") | ||
| parser.add_argument("--endpoint-url", help="Custom S3 endpoint URL") |
Kill all of these args
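For reference, the envvar-based alternative is essentially zero code, since botocore reads the connection settings from the environment (a sketch; `AWS_ENDPOINT_URL` support assumes a reasonably recent boto3):

```python
import boto3

# Assuming the environment provides AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
# AWS_DEFAULT_REGION and AWS_ENDPOINT_URL, no connection CLI args are needed:
s3 = boto3.resource("s3")
```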
```python
# Filter options
parser.add_argument(
    "--deleted-after",
    type=parse_dt,
    help="Only process objects deleted after this date (ISO 8601 format)",
)
```
Please still kill it :)
```python
# Action flags
parser.add_argument(
    "--info", action="store_true", help="Show bucket information (default behavior)"
)
```
Kill it
```python
parser.add_argument(
    "--prune",
```
Suggested change:

```diff
-    parser.add_argument(
-        "--prune",
+    parser.add_argument(
+        "--permanently-delete-versions",
```
Be scary when there is a chance to destroy data. It should be very, very explicit what it is going to do.
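A hypothetical sketch of how scary/explicit such a flag could be; the flag name follows the suggestion above, the prompt wording is illustrative:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--permanently-delete-versions",
    action="store_true",
    help="PERMANENTLY delete all versions of logically deleted objects. IRREVERSIBLE!",
)
parser.add_argument("--noinput", action="store_true", help="Skip confirmation prompt")
args = parser.parse_args()

if args.permanently_delete_versions and not args.noinput:
    # Require the user to type the word out in full before destroying data.
    answer = input("Type 'yes' to PERMANENTLY delete the listed versions: ")
    if answer != "yes":
        raise SystemExit("Aborted, nothing was deleted.")
```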
```python
)

parser.add_argument(
    "--debug",
```
| "--debug", | |
| "--log-level", |
Make it use an enum (or something native, no idea) that accepts the built-in log levels.
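One possible native take, a sketch using the built-in level names as argparse choices (exact wiring is up to the author):

```python
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument(
    "--log-level",
    choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
    default="INFO",
    help="Python logging level name",
)
args = parser.parse_args()

# getattr maps the name to the numeric level, e.g. "INFO" -> logging.INFO.
logging.basicConfig(level=getattr(logging, args.log_level))
```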
```python
parser.add_argument(
    "--log-file",
    default=None,
    help="Write a rotating logbook file (e.g. /tmp/s3-cleaner.log)",
)
```
Kill
```python
return None

# Sort by time descending (latest first)
versions.sort(key=lambda x: x.last_modified, reverse=True)
```
Because I am lazy, a note for myself:

```sh
# note: boto3 expects AWS_SECRET_ACCESS_KEY (fixed from AWS_ACCESS_SECRET_KEY)
export AWS_ENDPOINT_URL=http://172.17.0.1:8009 AWS_ACCESS_KEY_ID=rustfsadmin AWS_SECRET_ACCESS_KEY=rustfsadmin
```

```python
import boto3

s3 = boto3.resource("s3")
bucket_name = "qfieldcloud-local"

# wipe and recreate the bucket with versioning enabled
bucket = s3.Bucket(bucket_name)
bucket.object_versions.delete()
bucket.delete()
bucket = s3.create_bucket(Bucket=bucket_name)
s3.BucketVersioning(bucket_name).enable()

# seed test objects, including one key with two versions
root_key = "file"
special_key = "SPECIAL__FILE__000"
bucket.put_object(Key=special_key, Body="v1")
bucket.put_object(Key=special_key, Body="v2")

for i in range(1, 1000):
    key = f"{root_key}__{i:>03d}"
    bucket.put_object(Key=key, Body=key)

# logically delete two objects (adds delete markers)
bucket.Object(special_key).delete()
bucket.Object("file__998").delete()

for idx, v in enumerate(bucket.object_versions.all()):
    print(f"{idx:>03d}", v.key, v.version_id, v.size)
```
I am closing this, unless we want to keep it open for some reason?
- Add `storage_cleaner.py` for scanning and deleting logically deleted objects in S3 compatible storages
- Add `test.py` for testing the script
- Add `README.MD`

README.MD

A small CLI to analyze and clean logically deleted objects in versioned S3-compatible buckets.

This script is provider-agnostic and works with any S3-compatible service that supports versioning, such as MinIO, etc. It helps you do the following:

Requirements

Usage

1. No action provided or `--info`
2. Scanning for logically deleted objects
3. Deleting logically deleted objects

Arguments

| Argument | Description |
| --- | --- |
| `--noinput` | Skip confirmation prompt (use with `--prune` for automation) |
| `--deleted-after` | Only process objects deleted after this date (ISO 8601 format, e.g. 2024-12-01T00:00:00). Cannot be used with `--deleted-since`. |
| `--deleted-since` | Time duration ago (e.g. '3 days', '2 weeks'). Cannot be used with `--deleted-after`. |
| `--prefix` | Key prefix to scan |
| `--profile` | Object storage profile |
| `--debug` | Enable debug logging |
| `--log-file` | Write a rotating logbook file (e.g. /tmp/s3-cleaner.log) |
| `--scan` | Scan only |
| `--prune` | Delete found logically deleted objects (requires confirmation unless `--noinput`) |
| `--info` | Show bucket information (default behavior) |

By default it will load the `default` local S3 configuration.

Testing

Added multiple test cases.

Demo

Scan

Prune
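For completeness, two hypothetical invocations (flag names follow the Arguments table above; the actual script path may differ):

```sh
# scan only, restricted to a key prefix
python storage_cleaner.py --scan --prefix projects/

# delete logically deleted objects older than two weeks, without prompting
python storage_cleaner.py --prune --deleted-since "2 weeks" --noinput
```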