Component(s)
exporter/awss3
Is your feature request related to a problem? Please describe.
Current behavior
Currently the awss3exporter uses the collector's clock to determine the S3 object key, based on the `s3uploader.s3_partition_format` config (see `now` on line 88):
opentelemetry-collector-contrib/exporter/awss3exporter/internal/upload/writer.go, lines 86 to 92 at 3a7efac:

```go
uploadInput := &s3.PutObjectInput{
	Bucket:       aws.String(overrideBucket),
	Key:          aws.String(sw.builder.Build(now, overridePrefix)),
	Body:         content,
	StorageClass: sw.storageClass,
	ACL:          sw.acl,
}
```
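For illustration, the key derivation boils down to formatting the collector's wall-clock time. A minimal sketch of that behavior (not the exporter's actual code; the real builder parses the strftime-style tokens of `s3_partition_format`, while this sketch uses a Go time layout):

```go
package main

import (
	"fmt"
	"time"
)

// currentPartitionKey mimics the current behavior: the partition is derived
// from the collector's clock at upload time, ignoring the timestamps carried
// by the telemetry itself. Illustrative sketch only.
func currentPartitionKey() string {
	now := time.Now().UTC() // ingestion time
	return now.Format("2006/01/02")
}

func main() {
	// A record generated on 2026-01-28 but uploaded on 2026-02-03
	// ends up under the 2026/02/03/ prefix.
	fmt.Println(currentPartitionKey())
}
```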
Why it's problematic for my use case
I have a use case where the difference between the ingestion time and the timestamps of the ingested signals can be huge: the producer is a mobile app with disk buffering enabled. Sometimes my devices go offline or are turned off for longer periods of time and then transmit their telemetry data only a couple of days later.
The lack of any guarantee about ingestion time completely breaks my partitioning: I have no way to know which objects contain the telemetry for a chosen time period, since it could sit in any partition whose timestamp is later than the investigated interval. As a result, I cannot query my logs efficiently.
Describe the solution you'd like
I would expect the partition 2026/01/28/ to contain the data that was generated (not ingested) on 28 January.
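In terms of the sketch above, the desired behavior derives the key from the record's own timestamp (hypothetical helper, day granularity for brevity):

```go
package main

import (
	"fmt"
	"time"
)

// desiredPartitionKey derives the partition from the record's timestamp
// instead of the collector clock. Hypothetical sketch.
func desiredPartitionKey(recordTime time.Time) string {
	return recordTime.UTC().Format("2006/01/02")
}

func main() {
	generated := time.Date(2026, 1, 28, 23, 50, 0, 0, time.UTC)
	// Prints "2026/01/28", no matter when the upload happens.
	fmt.Println(desiredPartitionKey(generated))
}
```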
Describe alternatives you've considered
Proposal 1: take the first timestamp in the batch
This would be the least intrusive change, because the logs could keep being batched with batchperresourceattr the same way they currently are:
```go
Logs: batchperresourceattr.NewBatchPerResourceLogs(cfg.ResourceAttrsToS3.S3Prefix, logsExporter),
```
This approach would not be perfect: if I partition by day, for example, some batches might span midnight, and the records of such a batch that fall after midnight would be put in the wrong partition. However, the assumption is that timestamps within a batch are close to each other, so despite this imperfection the partitioning would work much better for my use case than the current behavior.
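A rough sketch of how the timestamp could be extracted for logs (firstLogTimestamp is a hypothetical helper, not exporter code; the traversal uses the standard plog API):

```go
package partitionsketch

import (
	"time"

	"go.opentelemetry.io/collector/pdata/plog"
)

// firstLogTimestamp returns the timestamp of the first log record in the
// batch that carries one, falling back to the collector clock so that
// batches without timestamps keep the current behavior.
func firstLogTimestamp(ld plog.Logs) time.Time {
	rls := ld.ResourceLogs()
	for i := 0; i < rls.Len(); i++ {
		sls := rls.At(i).ScopeLogs()
		for j := 0; j < sls.Len(); j++ {
			lrs := sls.At(j).LogRecords()
			for k := 0; k < lrs.Len(); k++ {
				if ts := lrs.At(k).Timestamp(); ts != 0 {
					return ts.AsTime()
				}
			}
		}
	}
	return time.Now() // fallback: ingestion time, as today
}
```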
Proposal 2: batch based on partitions
This would mean that instead of using the batchperresourceattr package, the exporter itself would group the incoming records by their target partition (derived from each record's timestamp) and upload every group under its own key, so that each record lands in the correct partition even when a batch spans a partition boundary.
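A sketch of that grouping step (hypothetical; the keys are day-granularity for brevity, while a real implementation would follow `s3_partition_format` and handle records without timestamps):

```go
package partitionsketch

import (
	"go.opentelemetry.io/collector/pdata/plog"
)

// splitLogsByPartition groups log records by their day-level partition key
// so that each group can be uploaded under the correct prefix. Hypothetical
// sketch; for brevity it re-creates the resource/scope wrappers per record.
func splitLogsByPartition(ld plog.Logs) map[string]plog.Logs {
	groups := map[string]plog.Logs{}
	rls := ld.ResourceLogs()
	for i := 0; i < rls.Len(); i++ {
		rl := rls.At(i)
		sls := rl.ScopeLogs()
		for j := 0; j < sls.Len(); j++ {
			sl := sls.At(j)
			lrs := sl.LogRecords()
			for k := 0; k < lrs.Len(); k++ {
				lr := lrs.At(k)
				key := lr.Timestamp().AsTime().UTC().Format("2006/01/02")
				out, ok := groups[key]
				if !ok {
					out = plog.NewLogs()
					groups[key] = out
				}
				orl := out.ResourceLogs().AppendEmpty()
				rl.Resource().CopyTo(orl.Resource())
				osl := orl.ScopeLogs().AppendEmpty()
				sl.Scope().CopyTo(osl.Scope())
				lr.CopyTo(osl.LogRecords().AppendEmpty())
			}
		}
	}
	return groups
}
```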
Backward compatibility in both cases
Regardless of the approach chosen, I'd introduce a new config parameter to make the new behavior explicitly configurable (e.g. `s3uploader.partition_by` with the values `ingestion` and `batch_timestamp`). IMO the new behavior (`batch_timestamp`) is the better choice for the default value, but I'm interested in hearing different opinions.
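The configuration could then look something like this (the `partition_by` key and its values are the hypothetical names proposed above; the other keys are existing awss3exporter options):

```yaml
exporters:
  awss3:
    s3uploader:
      s3_bucket: my-telemetry-bucket
      s3_partition_format: "%Y/%m/%d"
      # Hypothetical new option:
      #   ingestion       - current behavior, partition by the collector clock
      #   batch_timestamp - partition by the batch's record timestamps
      partition_by: batch_timestamp
```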
Additional context
I am trying to access the logs with Amazon Athena, but this issue prevents me from writing efficient queries restricted to a chosen time interval: partition pruning is useless when a record can land in any partition at or after its timestamp.
> [!TIP]
> React with 👍 to help prioritize this issue. Please use comments to provide useful context, avoiding `+1` or "me too", to help us triage it. Learn more here.