
Conversation

@Rakanhf Rakanhf commented Dec 15, 2025

Currently QFieldCloud keeps all project files in the non-legacy storage even after they are deleted by the user. This is due to versioning being enabled on Exoscale and the lack of an Exoscale mechanism to delete files a certain amount of time after they are deleted. This causes excessive file storage and therefore costs.

  • Add storage_cleaner.py for scanning and deleting logically deleted objects in S3-compatible storages
  • Add test.py for testing the script
  • Add a README.md

A small CLI to analyze and clean logically deleted objects in versioned S3-compatible buckets.

This script is provider-agnostic and works with any S3-compatible service that supports versioning, such as MinIO.

It helps you do the following (a sketch follows the list):

  1. Detect whether a bucket has versioning configured.
  2. Find keys whose latest version is a delete marker (logically deleted).
  3. Report how much storage their historical versions consume.
  4. Delete all versions for those keys to free space.
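
A minimal sketch of the scan idea (not the script itself; it assumes boto3 credentials and the endpoint come from the environment, and the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")
bucket = "my-versioned-bucket-name"

logically_deleted = set()
version_bytes = {}

paginator = s3.get_paginator("list_object_versions")
for page in paginator.paginate(Bucket=bucket):
    # A key is logically deleted when its latest entry is a delete marker.
    for marker in page.get("DeleteMarkers", []):
        if marker["IsLatest"]:
            logically_deleted.add(marker["Key"])
    # Delete markers carry no Size; only real versions are listed here.
    for version in page.get("Versions", []):
        key_name = version["Key"]
        version_bytes[key_name] = version_bytes.get(key_name, 0) + version["Size"]

for key in sorted(logically_deleted):
    print(key, version_bytes.get(key, 0), "bytes reclaimable")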

Requirements

  • boto3
  • pytest (just for testing)

Usage

1. No action provided or --info
python storage_cleaner.py my-versioned-bucket-name
No action specified. Use --info, --scan, or --prune.
Use --help for usage information.
Defaulting to --info.

Bucket Information:
--------------------------------------------------------------------------------
Bucket:            my-versioned-bucket-name
Versioning Status: Enabled
Region:            us-east-1
Endpoint URL:      http://localhost:8009
Prefix Filter:     No prefix specified
Deleted After:     No deleted after date specified
--------------------------------------------------------------------------------
✓ Connection successful.
✓ Bucket is accessible.

Use --scan to find logically deleted objects.
2. Scanning for logically deleted objects
python storage_cleaner.py my-versioned-bucket-name --scan
3. Deleting logically deleted objects
python storage_cleaner.py my-versioned-bucket-name --prune

Arguments

  1. --noinput Skip confirmation prompt (use with --prune for automation; see the combined example after this list)
  2. --deleted-after Only process objects deleted after this date (ISO 8601 format, e.g. 2024-12-01T00:00:00). Cannot be used with --deleted-since.
  3. --deleted-since Time duration ago (e.g. '3 days', '2 weeks'). Cannot be used with --deleted-after.
  4. --prefix Key prefix to scan
  5. --profile Object storage profile
  6. --debug Enable debug logging
  7. --log-file Write a rotating log file (e.g. /tmp/s3-cleaner.log)
  8. --scan Scan only
  9. --prune Delete found logically deleted objects (requires confirmation unless --noinput)
  10. --info Show bucket information (default behavior)
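
For unattended runs (e.g. from cron), the flags can be combined; a hypothetical invocation, assuming the flag spellings listed above:

python storage_cleaner.py my-versioned-bucket-name --prune --noinput --deleted-since "30 days"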

By default, it loads the default local S3 configuration.

Testing

Added multiple test cases:

  1. Upload files to object storage - no deleted files (shouldn't find any logically deleted objects) -> Nothing deleted
  2. Upload files to object storage - delete some files (should find logically deleted objects) -> Deleted files
  3. Upload files to object storage - delete some files - restore (shouldn't find any logically deleted objects) -> Nothing deleted (see the restore sketch after this list)
  4. Upload files, delete some, test deleted-after filter.
  5. File created, deleted, created again, deleted again. Ensure ALL versions (history) are cleaned up.
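
In a versioned bucket, "restoring" a deleted object (test case 3) just means removing its delete marker; a sketch, assuming a local endpoint configured via the environment and a hypothetical key name:

import boto3

s3 = boto3.client("s3")
bucket = "my-versioned-bucket-name"
key = "some/file.txt"  # hypothetical key

resp = s3.list_object_versions(Bucket=bucket, Prefix=key)
for marker in resp.get("DeleteMarkers", []):
    if marker["Key"] == key and marker["IsLatest"]:
        # Deleting the delete marker itself makes the previous version
        # latest again, i.e. the object reappears.
        s3.delete_object(Bucket=bucket, Key=key, VersionId=marker["VersionId"])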

Demo

Scan

Screencast from 2025-12-25 00-35-29

Prune

Screencast from 2025-12-25 00-37-48


@Rakanhf Rakanhf changed the title [WIP] feat: add object storage cleaner script and testing feat: add object storage cleaner script and testing Dec 15, 2025

@gounux gounux left a comment


Thanks a lot, looks promising!

This is a first review round; I have not checked out the branch locally or tested it yet. Mainly some typing & formalism comments.


@suricactus suricactus left a comment


Summary of the review:

The script needs a few iterations before it gets into better shape. So be patient :)

There are a few things that are still not addressed since the preliminary review via chat.

In short, I think the script is more complex than it should be, both in its algorithmic approach and in the way it is written. A bit more OOP than to my taste; this is a script in the end.

Note this script will be used by people who are not necessarily experts in Python. Write the script so that someone who sees Python for the first time can read it.

Please ping me when the current comments are addressed for a second round. Feel free to simplify things that you recognize as complex or non-obvious, even if they were not commented on in this round.


return status == "Enabled"

def _iter_all_versions(self) -> Iterator[VersionRef]:
Collaborator

I still don't get the purpose of this and the VersionRef object.

Why do we need to "rename" the objects in any way, since they are already properly structured?

If you need it just for typing, use https://youtype.github.io/boto3_stubs_docs/mypy_boto3_s3/service_resource/#objectversion .

I am also curious: can't we abandon this whole thing and just use Bucket.list_versions? See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/collections.html and https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/bucket/object_versions.html#filter .
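
For reference, the collection behind those links is Bucket.object_versions; a sketch of what the scan could look like with it (assuming delete markers show up in the collection with size None, consistent with the reproduction snippet later in this thread):

import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-versioned-bucket-name")

# filter() passes Prefix through to the underlying ListObjectVersions call.
for v in bucket.object_versions.filter(Prefix="projects/"):
    print(v.key, v.version_id, v.is_latest, v.size)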

Contributor Author

I still don't get the purpose of this and the VersionRef object.

Versions have a Size attribute; DeleteMarkers do not. Accessing .size on a delete marker object from the raw API would raise an error.

I am also curious can't we abandon this whole thing and just use Bucket.list_versions? See

I'm not sure I got this right, but I couldn't find anything related to Bucket.list_versions. Am I missing something?

Collaborator

You are right, there is no size attribute, but I would still prefer:

size = 0
if v.is_delete_marker:
  # delete markers do not have a size attribute
  size += 0
else:
  size += v["size"]

OR

# delete markers...
size += v.get("size", 0) # might be getattr

This way we drop the need for two classes and a special function. Less code is more, even if we have the false perception that the classes will bring explicit typing etc.

Again, this is a script that must be easy to debug by anyone, most likely people outside our team. Keep it as simple as possible. The native types from boto3 will always be better documented.
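
Spelled out with the raw response dicts (a sketch; versions here is assumed to be a merged list of version and delete-marker entries, where only real versions carry a "Size" field):

size = 0
for v in versions:
    # Real versions have "Size"; delete markers do not, so they add 0.
    size += v.get("Size", 0)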

Contributor Author

I think having the data classes makes it easier to read, at least from my perspective. If I removed VersionRef I would need to manually map out the structure, whereas with the data class I can clearly check the reference and its structure.

This is my personal way of viewing code: anything with definitions / types makes me more comfortable, even if it takes a couple more lines of code.

At the beginning I didn't lean into having an OOP structure at all, but due to the complexity of this script, making it OOP with data classes made it a bit easier and clearer for me.

Collaborator

I don't want to overrule here if it is simpler for you. Don't take this comment as a blocker, but more as inspiration.

I will just summarize my suggestions that will make this script and its maintenance easier, based on my experience:

  • avoid OOP, especially in scripts. Use imperative code; it's a simpler mental model. The complexity in OOP comes mostly from state management, so even if you stick with OOP, avoid using self as a default storage for stuff whenever possible. My experience has shown me that passing state as read-only parameters/return values is not immediately obvious at writing time, but easier to follow at reading time.
  • avoid introducing new types when types already exist and are well documented. If this tiny wrapper helps you to follow, I am ok with it, I find it redundant though.
  • Keep it very obvious what is going on by reducing block nesting with early returns. Reduce the number of methods/functions to avoid jumping around the code.

Please check with @toebivankenoebi who will most likely deploy the script in the end how he feels about the OOP approach at a bit later stage when the script is ready for a semi-final review.

Comment on lines 212 to 218
aggregate = LogicallyDeletedObject(
key=key,
deleted_at=latest.last_modified,
total_size_bytes=total_size,
versions_count=count,
)
return aggregate, versions
Collaborator

Suggested change
aggregate = LogicallyDeletedObject(
key=key,
deleted_at=latest.last_modified,
total_size_bytes=total_size,
versions_count=count,
)
return aggregate, versions
aggregate = LogicallyDeletedObject(
key=key,
deleted_at=latest.last_modified,
total_size_bytes=total_size,
versions_count=count,
versions=versions,
)
return aggregate

Avoid tuple returns; they are hard to read, especially when you have an object under your control.

Prefer a newline before returns :)
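
What the suggested single-object return could look like (a sketch; field names taken from the diff above, the versions type assumed):

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class LogicallyDeletedObject:
    key: str
    deleted_at: datetime
    total_size_bytes: int
    versions_count: int
    versions: list = field(default_factory=list)

# The caller reads everything from one value instead of unpacking a tuple.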

Contributor Author

The versions are used by the clean function, where we actually delete them.

Collaborator

Please do not use tuple returns

Comment on lines 460 to 465
# Filter options
parser.add_argument(
"--deleted-after",
type=parse_dt,
help="Only process objects deleted after this date (ISO 8601 format)",
)
Collaborator

Why is that? We don't need a fixed date, we need a rolling date. So it should be --since-days; otherwise, how will the caller of the script calculate the proper date?

Contributor Author

I remember that this was introduced in the first revision of the script, and we moved forward with a specific datetime argument rather than a rolling days window.

So instead of using --since-days 30 we use --deleted-after 2025-12-01

The reasoning behind this approach is to have more control over the date; it is also easier to test locally without mocking.

Collaborator

I am sorry for this misunderstanding; I either didn't mean that or I just said something that does not make sense at all. Feel free to point it out to me over chat so I can print it and hang it on my wall of shame.

--since-days/--deleted-for-days should be the case, as the script is always rolling forward and a strict date cannot be passed. In theory the arg can be smarter, like --deleted-for '30 days' or --deleted-for '30 seconds'. Make sure days and seconds are spelled out in full, so there is no accidental overlook.
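
A sketch of such a parser (a hypothetical helper; only spelled-out unit names are accepted, per the comment above):

from datetime import timedelta

def parse_duration(value: str) -> timedelta:
    # Accepts strings like "30 days" or "45 seconds".
    amount_text, _, unit = value.strip().partition(" ")
    unit = unit.strip().rstrip("s") + "s"  # normalize "day" -> "days"
    if unit not in ("seconds", "minutes", "hours", "days", "weeks"):
        raise ValueError(f"unsupported unit: {unit!r}")
    return timedelta(**{unit: int(amount_text)})

# e.g. cutoff = datetime.now(timezone.utc) - parse_duration("30 days")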

Contributor Author

This is a great idea. While it will add some complexity, I agree with you that it's worth it, since it will make things easier for a cron job with predefined calls rather than calculating the dates again.

Thank you for the suggestion

Collaborator

Please still kill it :)

@Rakanhf Rakanhf changed the title feat: add object storage cleaner script and testing [WIP] feat: add object storage cleaner script and testing Dec 17, 2025
@Rakanhf Rakanhf added enhancement New feature or request do not merge labels Dec 17, 2025
- Add tests for handling large data uploads and complex version histories.
- Enhance progress tracking by adding a progress callback
- Refactored scan and clean methods for better readability and maintainability.
- Add logging functionality to replace print.
- Enhanced output formatting for bucket information and progress updates.
- Introduced a TRACE logging level for detailed operation logs.
- Updated the main function to support debug logging and log file options.
@Rakanhf Rakanhf force-pushed the QF-6039-object-storage-cleanup-script branch from 7356a59 to 9c72672 Compare December 23, 2025 23:42
- Implemented a new command-line argument '--since' to filter deleted objects based on a specified duration.
- Added a helper function to parse various time formats for the 'since' argument.
- Enhanced tests to validate the functionality of the 'since' filter.
@Rakanhf Rakanhf changed the title [WIP] feat: add object storage cleaner script and testing feat: add object storage cleaner script and testing Dec 24, 2025

def setup_logging(debug: bool = False, log_file: str | None = None) -> None:
"""Setup logging for the script."""
level = TRACE_LEVEL if log_file else (logging.DEBUG if debug else logging.INFO)
Collaborator

Please do not use inline ifs in QFieldCloud code. This is unreadable by the common folks. Do not apply such magic; make as much of the configuration as possible required and explicit. The log level, if not the default, should be passed as a CLI param/envvar, and no magical logic should be applied.

Comment on lines +433 to +445
# 4. File Handler (The Audit Log)
if log_file:
Path(log_file).parent.mkdir(parents=True, exist_ok=True)
file_handler = RotatingFileHandler(
log_file, maxBytes=1000 * 1024 * 1024, backupCount=5, encoding="utf-8"
)
file_handler.setLevel(TRACE_LEVEL)
file_handler.setFormatter(
logging.Formatter(
"%(asctime)s [%(levelname)s] %(message)s", datefmt="%Y-%m-%d %H:%M:%S"
)
)
logger.addHandler(file_handler)
@suricactus suricactus Dec 27, 2025

Leave this to the outer level; there are log rotators that do this much better than this code. Just kill this functionality.

Contributor Author

I'm not sure I'm getting it quite right; I would appreciate it if you could share other log rotators so I have a reference.

The main idea behind this was enabling log files so we can trace what will be deleted and what has been deleted, taking advantage of the current logging setup.

e.g.:

           for v in versions:
                logger.log(
                    TRACE_LEVEL,
                    "Deleting object: Key=%s Version=%s",
                    v["Key"],
                    v["VersionId"],
                )

and

                for v in versions:
                    logger.log(
                        TRACE_LEVEL,
                        "Version: Key=%s VersionId=%s Size=%s LastModified=%s",
                        v.key,
                        v.version_id,
                        v.size,
                        v.last_modified,
                    )

Collaborator

One run of the script should produce one log file. We can easily do that with a timestamp suffix.

If we want cleanup or something more advanced, our good friend https://linux.die.net/man/8/logrotate exists.
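
If the script keeps writing its own log file at all, the one-file-per-run idea could look like this (a sketch; the path is arbitrary):

import logging
from datetime import datetime

log_path = f"/tmp/s3-cleaner-{datetime.now():%Y%m%d-%H%M%S}.log"
file_handler = logging.FileHandler(log_path, encoding="utf-8")
# Cleanup of old runs is then left to logrotate or a cron job.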

Comment on lines +418 to +431
# 2. Console Handler (The User UI)
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setLevel(logging.DEBUG if debug else logging.INFO)
console_handler.setFormatter(logging.Formatter("%(message)s"))

# Filter out errors (they go to stderr)
console_handler.addFilter(lambda record: record.levelno <= logging.INFO)
logger.addHandler(console_handler)

# 3. Stderr Handler (Errors)
error_handler = logging.StreamHandler(sys.stderr)
error_handler.setLevel(logging.WARNING)
error_handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(error_handler)
Collaborator

Have a single output from the script, ideally to STDOUT.

Contributor Author

I believe that the main motivation behind using logging overall was to improve quality and have this flexibility; otherwise, why didn't we simply stick to print from the beginning?

The main motivation behind the separation of handlers is that, when piping or redirecting the output, we don't get the user-facing output in the files or the pipe.

Also, this approach will allow us to extend the logger later on, for something like log reporting. If these points are invalid, then print was just fine even if it's not the go-to practice; it will get the job done and will be easy to read.

Let me know If I'm missing something.

Collaborator

Flexibility here is a debt that I am not sure we can pay with the current team.

A single file with a lot of grepping is easier to manage than multiple files. It might sound ridiculous, but that is the sad truth of my experience managing multiple scripts.

Switching from print to logging has multiple benefits, including log level control, something you can never do with prints without if-ing all the time. Multiple output streams are something we may benefit from in the future, but nobody meant for that to be implemented right away.

By the way, check https://docs.python.org/3/library/logging.config.html#dictionary-schema-details which does similar things without the need to write any code.
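
A sketch of a single-stdout setup expressed declaratively with dictConfig (handler name and levels assumed):

import logging.config

logging.config.dictConfig({
    "version": 1,
    "formatters": {"plain": {"format": "%(message)s"}},
    "handlers": {
        "stdout": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",
            "formatter": "plain",
            "level": "INFO",
        },
    },
    "root": {"handlers": ["stdout"], "level": "INFO"},
})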

I think I know what the main (good) motivation behind all this was, but I am pretty sure we don't need it as of now. I personally don't know when and how this script is going to fail; I guess nobody in the universe knows. Write the basic logs, and write them in an easy-to-manage way. Let's first collect some experience with real-world problems before we try to solve probably non-existent ones by adding too much branching and clumsiness to the code.


def _print_progress(self, delta: int, label: str = "Deletable") -> None:
self._scanned_count += delta
if sys.stdout.isatty():
Collaborator

Do not do any branching on printing based on tty-ness. Just print a regular log from the caller instead. Kill this function altogether.

Contributor Author

I think this function is very helpful, since it gives the feel that the script is actually running. If you have 100k+ files, this script will take time to run a scan or a prune,

and without any output / feedback, I think it will be a bit confusing.

Regarding "printing based on tty-ness". this was a workaround so looks better and not keep printing new lines and also to avoid anything going to pipes or logging files.

Let me know if you think that there is a better solution for this.

Collaborator

Just print to STDOUT and you don't have to solve any of these problems.

#1445 (comment)
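
A plain line-per-page print would keep the liveness feedback without any tty branching (a sketch; paginator and bucket as in the scan sketch earlier):

scanned = 0
for page in paginator.paginate(Bucket=bucket):
    scanned += len(page.get("Versions", [])) + len(page.get("DeleteMarkers", []))
    # One greppable, pipe-safe line per page of results.
    print(f"Scanned {scanned} version entries so far...")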

@suricactus suricactus left a comment

We are moving forward, but there are more things to be removed:

  • please do not use inline ifs (binary/ternary operators) in Python code related to QField and QFieldCloud. The Python syntax for them is unreadable for people coming from other languages, and it is sometimes confusing even for regular Pythonistas. The only reason we don't have https://github.com/afonasev/flake8-if-expr enabled is that it is slow, and one can argue they might be useful in simple situations.
  • simplify the logging logic: it should just log simple strings with a minimal dynamic part.
  • avoid flushing the buffers: there is PYTHONUNBUFFERED for people who really need it.
  • print everything line by line to stdout and leave the caller to do whatever they want with the output. It must be easy to grep.
  • do not add functionality that is very well handled by other tools, e.g. log rotation.
  • avoid complex logic with if statements around logging calls. That's why different log levels exist. It should be very simple to follow what is going on from the log without evaluating any if statements.
  • please remove the "Connection options" params and use envvars for that purpose.
  • avoid using tuples as return values, unless it is well argued and documented why there is no other option.
  • avoid default values for script parameters. If you need a parameter, it should be explicit. If you don't need it to be explicit, then drop the parameter.
  • in scripts like this, do not use inline fors or map, filter, and reduce. Functional code is less readable in case of emergency.

Once addressed, please consider the following:

  • remove the OOP abstraction for state management and use pure imperative code.
  • avoid using the non-native container object VersionRef.

The short summary is: start small and only add functionality if really required.

Comment on lines +524 to +526
parser.add_argument("--region", help="Region")
parser.add_argument("--profile", help="Profile")
parser.add_argument("--endpoint-url", help="Custom S3 endpoint URL")
Collaborator

Kill all of these args

Comment on lines 460 to 465
# Filter options
parser.add_argument(
"--deleted-after",
type=parse_dt,
help="Only process objects deleted after this date (ISO 8601 format)",
)
Collaborator

Please still kill it :)

Comment on lines +542 to +545
# Action flags
parser.add_argument(
"--info", action="store_true", help="Show bucket information (default behavior)"
)
Collaborator

Kill it

Comment on lines +548 to +549
parser.add_argument(
"--prune",
Collaborator

Suggested change
parser.add_argument(
"--prune",
parser.add_argument(
"--permanently-delete-versions",

Be scary when there is a chance to destroy data. It should be very, very explicit about what it is going to do.

)

parser.add_argument(
"--debug",
Collaborator

Suggested change
"--debug",
"--log-level",

Make it use an enum (or something native, no idea) that accepts the built-in log levels.
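
One native way to do that (a sketch using argparse choices with the stdlib level names):

import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument(
    "--log-level",
    choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
    default="INFO",
    help="Built-in logging level name",
)
args = parser.parse_args()
logging.basicConfig(level=getattr(logging, args.log_level))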

Comment on lines +567 to +571
parser.add_argument(
"--log-file",
default=None,
help="Write a rotating logbook file (e.g. /tmp/s3-cleaner.log)",
)
Collaborator

Kill

return None

# Sort by time descending (latest first)
versions.sort(key=lambda x: x.last_modified, reverse=True)
Collaborator

Because I am lazy, note for myself:

export AWS_ENDPOINT_URL=http://172.17.0.1:8009 AWS_ACCESS_KEY_ID=rustfsadmin AWS_SECRET_ACCESS_KEY=rustfsadmin
import boto3

s3 = boto3.resource("s3")
bucket_name = "qfieldcloud-local"

# Start from a clean, versioned bucket.
bucket = s3.Bucket(bucket_name)
bucket.object_versions.delete()
bucket.delete()

bucket = s3.create_bucket(Bucket=bucket_name)
s3.BucketVersioning(bucket_name).enable()

root_key = "file"
special_key = "SPECIAL__FILE__000"
bucket.put_object(Key=special_key, Body="v1")
bucket.put_object(Key=special_key, Body="v2")

for i in range(1, 1000):
    key = f"{root_key}__{i:>03d}"
    bucket.put_object(Key=key, Body=key)


# Logically delete two keys (puts delete markers on top of them).
bucket.Object(special_key).delete()
bucket.Object("file__998").delete()


for idx, v in enumerate(bucket.object_versions.all()):
    print(f"{idx:>03d}", v.key, v.version_id, v.size)

@Rakanhf Rakanhf changed the title feat: add object storage cleaner script and testing [R&D] feat: add object storage cleaner script and testing Jan 5, 2026
@suricactus

I am closing this, unless we want to keep it open for some reason?

@suricactus suricactus closed this Jan 28, 2026
