refactor: extractors can return multiple samples #17064

Open: wants to merge 4 commits into base: main

@trevorwhitney (Collaborator) commented Apr 8, 2025

What this PR does / why we need it:

This PR refactors the SampleExtractor interface so that Process can return multiple samples for a single log line. The refactor is a prerequisite for #17149, which allows the extractors in a multi-variant query to pre-process the common stages of an extraction pipeline (such as line/label filters, logfmt, etc.) and then run only the extraction-specific stage over the pre-processed log line. This should significantly reduce the amount of work done when processing a multi-variant query.
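
To make the shape of the change concrete, here is a minimal sketch of a `Process` method that returns a slice of samples rather than a single value. The `Sample` type, the simplified `Process` signature, and the `countAndBytes` extractor are all hypothetical and chosen for illustration; they are not the types or signatures used in this PR.

```go
package main

import "fmt"

// Sample is a hypothetical, simplified stand-in for one extracted sample.
type Sample struct {
	Variant string  // which variant of the query produced the value
	Value   float64 // extracted numeric value
}

// SampleExtractor is a hypothetical simplification of the refactored shape:
// Process returns a slice of samples instead of a single value.
type SampleExtractor interface {
	Process(ts int64, line []byte) ([]Sample, bool)
}

// countAndBytes emits two samples per line: a count of 1 and the line length.
type countAndBytes struct{}

func (countAndBytes) Process(_ int64, line []byte) ([]Sample, bool) {
	if len(line) == 0 {
		return nil, false // nothing to extract from an empty line
	}
	return []Sample{
		{Variant: "count", Value: 1},
		{Variant: "bytes", Value: float64(len(line))},
	}, true
}

func main() {
	var ex SampleExtractor = countAndBytes{}
	samples, ok := ex.Process(0, []byte("level=info msg=ok"))
	fmt.Println(ok, len(samples))
	for _, s := range samples {
		fmt.Printf("%s=%g\n", s.Variant, s.Value)
	}
}
```

With a slice return, a caller can iterate over however many samples one line yields without knowing how many variants sit behind the extractor.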

Originally, the multi-variant query work refactored sample evaluation to accept multiple extractors. This was the wrong approach, as it limits our ability to pre-process a log line. The extractors do not expose much about their internal pipelines, so it would be difficult to find common stages. Furthermore, an extractor does not know the context of the line being processed until the call to .Process(), which makes it hard for extractors to share per-line state.
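
The motivation can be sketched as a single extractor that owns both the common stage and the per-variant stages, so the shared work happens inside one `Process` call. This is only an illustration under assumed names: `multiExtractor`, `stage`, `Sample`, and `newExample` are hypothetical, and a plain substring filter stands in for the real common stages.

```go
package main

import (
	"bytes"
	"fmt"
)

// Sample is a hypothetical stand-in for one extracted sample.
type Sample struct {
	Variant string
	Value   float64
}

// stage extracts one variant's value from an already pre-processed line.
type stage struct {
	name    string
	extract func(line []byte) (float64, bool)
}

// multiExtractor (hypothetical) runs a shared filter, then each variant stage.
type multiExtractor struct {
	filter []byte  // common line filter shared by all variants
	stages []stage // extraction-specific stages, one per variant
}

func (m multiExtractor) Process(line []byte) ([]Sample, bool) {
	// Common stage: evaluated once per line, not once per variant.
	if !bytes.Contains(line, m.filter) {
		return nil, false
	}
	var out []Sample
	for _, st := range m.stages {
		if v, ok := st.extract(line); ok {
			out = append(out, Sample{Variant: st.name, Value: v})
		}
	}
	return out, len(out) > 0
}

// newExample builds an extractor with two illustrative variants.
func newExample() multiExtractor {
	return multiExtractor{
		filter: []byte("error"),
		stages: []stage{
			{name: "count", extract: func([]byte) (float64, bool) { return 1, true }},
			{name: "bytes", extract: func(l []byte) (float64, bool) { return float64(len(l)), true }},
		},
	}
}

func main() {
	m := newExample()
	for _, line := range []string{"error: disk full", "all good"} {
		samples, ok := m.Process([]byte(line))
		fmt.Println(line, ok, samples)
	}
}
```

Because the filter and the variant stages live in the same call, any per-line state they need is naturally shared, which is harder to arrange when sample evaluation drives several independent extractors.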

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@trevorwhitney trevorwhitney changed the title from "Refactor extractors multiple samples" to "refactor: extractors can return multiple samples" Apr 8, 2025
@trevorwhitney trevorwhitney marked this pull request as ready for review April 11, 2025 17:32
@trevorwhitney trevorwhitney requested a review from a team as a code owner April 11, 2025 17:32
commit 7e9be11
Author: Trevor Whitney <[email protected]>
Date:   Fri Apr 11 11:56:36 2025 -0600

    Squashed commit of the following:

    commit bd43313
    Author: Robert Fratto <[email protected]>
    Date:   Fri Apr 11 09:52:42 2025 -0400

        chore: align benchmark results between chunks and dataobjs (#17127)

    commit 88beefb
    Author: Robert Fratto <[email protected]>
    Date:   Fri Apr 11 08:09:33 2025 -0400

        fix(logql): Fix inconsistency with parsed field short circuiting (#17104)

        PR #8724 originally changed the behaviour of LogQL so that the first extracted
        field takes precedence over any later extracted field with the same name.

        However, this PR had two subtle issues due to the caching that parse stages
        perform for key lookups:

        * The precedence logic did not always apply to duplicate fields within the same
          parse stage (notably for logfmt).

        * The caching behaviour could incorrectly cause stages to permanently ignore
          fields for future log lines where that stage should otherwise perform
          extraction.

        Additionally, once structured metadata was introduced, it was incorrectly being
        flagged as an "extracted field." Combined with the caching behaviour described
        above, this means that parsed fields only take precedence over structured
        metadata if the first log line encountered by the engine doesn't have that
        field also set as structured metadata.

        This can be demonstrated with the following scenario:

        1. A log line from a stream that uses structured metadata for `trace_id` is
           encountered first. This (incorrectly) causes the trace_id field to be
           ignored in all parse stages.

        2. All later log lines from streams without structured metadata can no longer
           extract or filter on a parsed `trace_id` for the lifetime of the query, due
           to the caching from step 1.

        The fix is two-fold:

        * Decouple skipping previously extracted fields from the logic which caches
          interned keys.

        * Only mark parsed labels as extracted fields (previously all labels).

        Because the check for whether a field is already extracted is now called more
        times, the list of extracted fields has been updated to a map to allow for
        faster checks.

    commit 7312ccb
    Author: Karsten Jeschkies <[email protected]>
    Date:   Fri Apr 11 11:46:23 2025 +0200

        refactor(stringlabels): Support stringlabels distributor tests (#17123)

    commit 5bba0c2
    Author: Karsten Jeschkies <[email protected]>
    Date:   Fri Apr 11 11:06:04 2025 +0200

        refactor(stringlabels): Support stringlabels ingester tests (#17121)

    commit 8e42118
    Author: benclive <[email protected]>
    Date:   Fri Apr 11 09:47:52 2025 +0100

        chore: Collect basic stats from dataobj queries (#17111)

    commit d197cda
    Author: George Robinson <[email protected]>
    Date:   Fri Apr 11 09:33:36 2025 +0100

        feat: add enforceLimits to ingest_limits.go (#17117)

    commit b0931b1
    Author: George Robinson <[email protected]>
    Date:   Fri Apr 11 09:23:27 2025 +0100

        chore: change the result of exceedsLimits (#17112)

    commit bcceded
    Author: George Robinson <[email protected]>
    Date:   Fri Apr 11 09:09:12 2025 +0100

        chore: fix private interface had exported methods (#17115)

    commit 75593e0
    Author: Karsten Jeschkies <[email protected]>
    Date:   Fri Apr 11 07:41:52 2025 +0200

        refactor(stringlabels): Support stringlabels in loghttp, pattern and ruler tests. (#17102)

commit e616f89
Author: Trevor Whitney <[email protected]>
Date:   Thu Apr 10 14:58:00 2025 -0600

    fix: lint and format

commit e52bca1
Author: Trevor Whitney <[email protected]>
Date:   Thu Apr 10 12:54:32 2025 -0600

    fix: fix a few more signatures

commit 6c5cbde
Merge: b289991 c7ffeb5
Author: Trevor Whitney <[email protected]>
Date:   Thu Apr 10 12:11:49 2025 -0600

    Merge branch 'main' into refactor-extractors-multiple-samples

commit b289991
Merge: 9bc6909 2aed4c3
Author: Trevor Whitney <[email protected]>
Date:   Tue Apr 8 15:29:34 2025 -0600

    Merge branch 'main' into refactor-extractors-multiple-samples

commit 9bc6909
Author: Trevor Whitney <[email protected]>
Date:   Tue Apr 8 09:57:28 2025 -0600

    refactor: extractors can return multiple samples

commit 4c44821
Author: Trevor Whitney <[email protected]>
Date:   Mon Apr 7 11:28:28 2025 -0600

    fix: slices.Delete usage

commit b150156
Merge: b1b41c5 9e9f534
Author: Trevor Whitney <[email protected]>
Date:   Mon Apr 7 10:46:25 2025 -0600

    Merge branch 'main' into multi-variant-series-limits

commit b1b41c5
Author: Trevor Whitney <[email protected]>
Date:   Mon Apr 7 10:45:53 2025 -0600

    refactor: apply feedback from review

commit d668449
Merge: 63911a5 0a3230f
Author: Trevor Whitney <[email protected]>
Date:   Tue Apr 1 11:53:26 2025 -0600

    Merge branch 'main' into multi-variant-series-limits

commit 63911a5
Author: Trevor Whitney <[email protected]>
Date:   Tue Apr 1 11:48:30 2025 -0600

    perf: mutate response data in place

commit c16f134
Author: Trevor Whitney <[email protected]>
Date:   Thu Mar 27 17:23:13 2025 -0600

    feat: return varaints not exceeding limits

commit 127f29e
Author: Trevor Whitney <[email protected]>
Date:   Tue Mar 25 17:15:08 2025 -0600

    fix: lint and format

commit b534760
Merge: d45a53a c40053a
Author: Trevor Whitney <[email protected]>
Date:   Tue Mar 25 17:02:29 2025 -0600

    Merge branch 'main' into multi-variant-series-limits

commit d45a53a
Author: Trevor Whitney <[email protected]>
Date:   Tue Mar 25 17:01:31 2025 -0600

    feat: fix limit enforcement in step evaluator

commit 2202364
Merge: 8ffc96d 0497c6b
Author: Trevor Whitney <[email protected]>
Date:   Mon Mar 24 14:23:31 2025 -0600

    Merge branch 'main' into multi-variant-series-limits

commit 8ffc96d
Author: Trevor Whitney <[email protected]>
Date:   Mon Mar 24 14:16:29 2025 -0600

    chore: apply series limit per variant for MV queries
@trevorwhitney trevorwhitney force-pushed the refactor-extractors-multiple-samples branch from 7e9be11 to eb615eb Compare April 11, 2025 18:00