Segment generation further #20

@digitaldogsbody

Description

Since the change from bucketing by prefix to bucketing by updated month, the total generation time has increased significantly. This is because some month buckets now contain 10m+ records.

This has several undesirable side-effects:

  • Generation takes 8+ hours, as opposed to the 2-3 it was taking before
    • Longer generation time means a greater chance of discrepancies if records are updated during the process
    • Potential for significant increase in time if one of the large months needs to be regenerated
  • Inefficient use of resources in the generator container
    • For the majority of the generation process, almost all the workers are sitting idle
  • Increased load and resource drain on OpenSearch
    • Each query has to load and sort the whole month to retrieve the subset. Whilst we are using the most efficient method available, this is still a large increase in resource usage

One potential approach to improving this is to split the bucketing further, either weekly or daily. This would substantially increase the number of jobs (between 4x and 31x depending on the split), but each job would be much smaller.

There are two main options for generating the new jobs:

  1. Have the initial OpenSearch aggregation request use smaller buckets
  2. Keep the current OS request and split each month into smaller jobs in the control thread
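For option 1, the change would be confined to the aggregation request itself. A minimal sketch of how the bucket size could be parameterised, assuming a `date_histogram` aggregation over an update-timestamp field (the field name `updated_at` is an assumption, as the real mapping may differ):

```python
# Sketch of option 1: request smaller buckets directly from OpenSearch.
# The field name "updated_at" is an assumption; the real mapping may differ.

def bucket_aggregation(interval: str) -> dict:
    """Build a date_histogram aggregation body for the given interval.

    interval is a calendar_interval such as "month", "week" or "day".
    """
    return {
        "size": 0,  # we only want the aggregation buckets, not hits
        "aggs": {
            "update_buckets": {
                "date_histogram": {
                    "field": "updated_at",   # assumed field name
                    "calendar_interval": interval,
                    "min_doc_count": 1,      # skip empty buckets
                }
            }
        },
    }
```

With opensearch-py this body would be passed to `client.search(index=..., body=...)`; switching the interval from `"month"` to `"week"` or `"day"` is then a one-line change, at the cost of many more buckets in the response.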

Both have pros and cons, and both will require a non-trivial amount of scaffolding, so it's hard to know which approach is best before implementation.
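For option 2, the control thread would take each month bucket from the existing aggregation and fan it out into smaller date-range jobs. A sketch, assuming each job is defined by a half-open date range (function names are illustrative, not from the codebase):

```python
# Sketch of option 2: keep the monthly OpenSearch aggregation and split
# each month into smaller date-range jobs in the control thread.
from datetime import date, timedelta

def split_month(year: int, month: int, days_per_job: int = 7):
    """Yield (start, end) date pairs covering the month, end exclusive.

    days_per_job=7 gives a weekly split; days_per_job=1 gives daily.
    The final slice is truncated at the month boundary.
    """
    start = date(year, month, 1)
    next_month = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    cursor = start
    while cursor < next_month:
        end = min(cursor + timedelta(days=days_per_job), next_month)
        yield (cursor, end)
        cursor = end
```

Each `(start, end)` pair would then become a `range` filter on the update timestamp in the worker's query, so no slice ever crosses a month boundary.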

Considerations:

  • There has to be a method for two workers to work on parts of the same month at once without clashing with each other (e.g. overwriting files), so some post-process harmonisation will be required (i.e. combining multiple CSVs)
  • Splitting the months ourselves could make a "hybrid" strategy possible: pick a size metric and then split each month into weeks or days (or leave it whole if it's small enough). This could be a large advantage for job efficiency but might increase the scaffolding requirements.
  • There's a potential footgun in splitting the records into batches of 10,000, because those boundaries will almost certainly fall between different sections of each month, and we really want to avoid having to move records around after generation in order to gzip them
  • Need to work out how to track/compute results for each month, and whether regeneration metrics should apply to the whole month or just to the individual slices
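The "hybrid" strategy above could be as simple as a per-month granularity decision based on the doc count already returned by the aggregation. A sketch, with placeholder thresholds that are not tuned values from the project:

```python
# Sketch of the "hybrid" strategy: choose a split granularity per month
# based on its record count. Both thresholds are placeholder assumptions.

MONTH_THRESHOLD = 1_000_000   # assumed: below this, leave the month whole
WEEK_THRESHOLD = 5_000_000    # assumed: below this, split into weeks

def choose_granularity(doc_count: int) -> str:
    """Return "month", "week" or "day" for a month bucket of doc_count records."""
    if doc_count < MONTH_THRESHOLD:
        return "month"
    if doc_count < WEEK_THRESHOLD:
        return "week"
    return "day"
```

The control thread would call this on each bucket's `doc_count` before deciding how many jobs to emit, so small months stay as single jobs and only the 10m+ months pay the daily-split overhead.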

Metadata

Labels

enhancement (New feature or request), investigation (Work that will require investigation before/during implementation)
