Partition based on signal timestamp, not ingestion time #45691

@bencehornak

Description

Component(s)

exporter/awss3

Is your feature request related to a problem? Please describe.

Current behavior

Currently the awss3exporter uses the collector's clock to determine the S3 object's key, based on the s3uploader.s3_partition_format config (see now on line 88):

uploadInput := &s3.PutObjectInput{
    Bucket:       aws.String(overrideBucket),
    Key:          aws.String(sw.builder.Build(now, overridePrefix)), // now is the collector's wall-clock time
    Body:         content,
    StorageClass: sw.storageClass,
    ACL:          sw.acl,
}
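
For reference, the partition format itself comes from the collector configuration. A minimal illustrative config (bucket name, region, prefix, and format are placeholders; s3_partition_format takes strftime-style placeholders):

exporters:
  awss3:
    s3uploader:
      region: us-east-1
      s3_bucket: my-telemetry-bucket
      s3_prefix: logs
      s3_partition_format: '%Y/%m/%d'

Whatever the format, the time substituted into it is always the collector's wall clock at upload time.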

Why it's problematic for my use case

I have a use case where the difference between the ingestion time and the timestamps of the ingested signals can be huge, because the producer is a mobile app with disk buffering enabled. Devices sometimes go offline or are turned off for longer periods of time and then transmit their telemetry only a couple of days later.

Because there are no guarantees about ingestion time, my partitions are effectively scrambled: I have no way of knowing which objects contain the telemetry for a chosen time period, since it could sit in any partition stamped after the investigated interval. As a result, I cannot query my logs efficiently.

Describe the solution you'd like

I would expect the partition 2026/01/28/ to contain the data that was generated (not ingested) on 28 January.

Describe alternatives you've considered

Proposal 1: take the first time-stamp in the batch

This would be the least intrusive change, because the logs could keep being batched with batchperresourceattr the same way they are today:

Logs: batchperresourceattr.NewBatchPerResourceLogs(cfg.ResourceAttrsToS3.S3Prefix, logsExporter),

This approach would not be perfect: if I partition by day, for example, some batches might span midnight, and the records falling after midnight would end up in the wrong partition. However, the assumption is that timestamps within a batch are close to each other, so despite this imperfection the partitioning would work much better for my use case than the current behavior.
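
A minimal sketch of Proposal 1, assuming the collector's pdata plog API; firstTimestamp is a hypothetical helper, and falling back to the collector clock for batches without timestamps is only one possible choice:

package awss3exporter

import (
    "time"

    "go.opentelemetry.io/collector/pdata/plog"
)

// firstTimestamp returns the timestamp of the first log record in the batch
// that carries one, falling back to the collector's clock (the current
// behavior) when no record does.
func firstTimestamp(ld plog.Logs) time.Time {
    rls := ld.ResourceLogs()
    for i := 0; i < rls.Len(); i++ {
        sls := rls.At(i).ScopeLogs()
        for j := 0; j < sls.Len(); j++ {
            lrs := sls.At(j).LogRecords()
            for k := 0; k < lrs.Len(); k++ {
                if ts := lrs.At(k).Timestamp(); ts != 0 {
                    return ts.AsTime()
                }
            }
        }
    }
    return time.Now() // no timestamps in the batch: keep today's behavior
}

The exporter would then call sw.builder.Build(firstTimestamp(ld), overridePrefix) instead of passing now.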

Proposal 2: batch based on partitions

This would mean that, instead of relying on the batchperresourceattr package, the exporter would split each incoming batch by the partition derived from every record's own timestamp, so that each uploaded object contains only records belonging to its partition. This avoids the midnight-spanning imperfection of Proposal 1, at the cost of a more invasive change.
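
To illustrate, a rough sketch of such a partition-based split, assuming daily partitions; splitLogsByDay is a hypothetical helper and the key format is illustrative:

package awss3exporter

import (
    "go.opentelemetry.io/collector/pdata/plog"
)

// splitLogsByDay regroups a batch so that each resulting plog.Logs only
// contains records whose timestamps fall on the same UTC day. The map is
// keyed by the partition, e.g. "2026/01/28".
func splitLogsByDay(ld plog.Logs) map[string]plog.Logs {
    out := map[string]plog.Logs{}
    rls := ld.ResourceLogs()
    for i := 0; i < rls.Len(); i++ {
        rl := rls.At(i)
        sls := rl.ScopeLogs()
        for j := 0; j < sls.Len(); j++ {
            sl := sls.At(j)
            lrs := sl.LogRecords()
            for k := 0; k < lrs.Len(); k++ {
                lr := lrs.At(k)
                day := lr.Timestamp().AsTime().UTC().Format("2006/01/02")
                part, ok := out[day]
                if !ok {
                    part = plog.NewLogs()
                    out[day] = part
                }
                // Copy the record along with its resource and scope so each
                // per-partition batch stays self-describing. A real
                // implementation would deduplicate resources and scopes.
                newRL := part.ResourceLogs().AppendEmpty()
                rl.Resource().CopyTo(newRL.Resource())
                newSL := newRL.ScopeLogs().AppendEmpty()
                sl.Scope().CopyTo(newSL.Scope())
                lr.CopyTo(newSL.LogRecords().AppendEmpty())
            }
        }
    }
    return out
}

Each entry of the returned map would then be uploaded under its own partition key.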

Backward compatibility in both cases

Regardless of the approach chosen, I'd introduce a new config parameter to make the new behavior transparently configurable (e.g. s3uploader.partition_by with the values ingestion and batch_timestamp). IMO the new behavior (batch_timestamp) is the better default, but I'm interested in hearing different opinions.
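
With the proposed setting, a configuration could look like this (partition_by is hypothetical, as proposed above):

exporters:
  awss3:
    s3uploader:
      s3_bucket: my-telemetry-bucket
      s3_partition_format: '%Y/%m/%d'
      partition_by: batch_timestamp # or: ingestion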

Additional context

I am trying to access the logs with Amazon Athena, but this issue prevents me from writing efficient queries scoped to a chosen time interval.
