Skip to content

Conversation

@trevorwhitney
Copy link
Collaborator

What this PR does / why we need it:

Add a race tolerance to the query-tee. This allows us to handicap races in favor or Thor. If the v1 backend wins, but Thor finishes within the race tolerance of v1 (default is 100ms), then we call that a win for Thor. The logic here being that 100ms is undetectable by a user, so it helps us better measure if "thor is good enough" (to replace chunks) rather than just "is thor faster".

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

Copilot AI review requested due to automatic review settings December 12, 2025 20:24
@trevorwhitney trevorwhitney requested a review from a team as a code owner December 12, 2025 20:24
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a race tolerance feature to the query-tee tool to better measure if Thor (a backend) is "good enough" to replace the chunks backend. The feature introduces a configurable tolerance period (default 100ms) that handicaps races in favor of non-preferred backends. If the preferred backend (v1) wins a race but a non-preferred backend (Thor) finishes within the tolerance window, Thor is declared the winner. This helps determine if performance differences are negligible from a user perspective.

Key Changes:

  • Added RaceTolerance configuration field with a default of 100ms
  • Modified race logic to wait for the tolerance period when the preferred backend wins first, giving non-preferred backends a chance to win if they finish within tolerance
  • Refactored Do method by extracting helper functions (shouldSample, makeBackendRequests, finishRace) for improved code organization and readability

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tools/querytee/proxy.go Adds RaceTolerance configuration field and flag registration with 100ms default
tools/querytee/handler_factory.go Threads the RaceTolerance configuration through the handler factory
tools/querytee/fanout_handler.go Implements race tolerance logic in the Do method and refactors code into helper functions
tools/querytee/fanout_handler_test.go Updates existing race mode test to validate the new tolerance behavior
tools/querytee/proxy_endpoint.go Adds TODO comment about detecting Grafana as an issuer (unrelated to main changes)
Comments suppressed due to low confidence (2)

tools/querytee/fanout_handler_test.go:257

  • Missing test coverage for critical edge case: When race tolerance is enabled and the preferred backend wins, the code waits for the tolerance period to see if the non-preferred backend finishes. However, there's no test for the scenario where the non-preferred backend returns an error or fails within the tolerance period. This is critical because the current code would incorrectly return the failed response (see bug in lines 130-138). Add a test case that validates the behavior when the preferred backend succeeds quickly, but the non-preferred backend fails within the race tolerance window.
func TestFanOutHandler_Do_RaceModeReturnsNonPreferredIfWithinTolerance(t *testing.T) {
	backend1 := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		time.Sleep(10 * time.Millisecond)
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		_, err := w.Write([]byte(`{"status":"success","data":{"resultType":"streams","result":[{"stream":{"backend":"1"},"values":[["1000000000","log line 1"]]}]}}`))
		require.NoError(t, err)
	}))
	defer backend1.Close()

	backend2 := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		time.Sleep(50 * time.Millisecond)
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		_, err := w.Write([]byte(`{"status":"success","data":{"resultType":"streams","result":[{"stream":{"backend":"2"},"values":[["1000000000","log line 2"]]}]}}`))
		require.NoError(t, err)
	}))
	defer backend2.Close()

	backend1URL, _ := url.Parse(backend1.URL)
	backend2URL, _ := url.Parse(backend2.URL)

	backends := []*ProxyBackend{
		NewProxyBackend("backend-1", backend1URL, 5*time.Second, true),
		NewProxyBackend("backend-2", backend2URL, 5*time.Second, false),
	}

	handler := NewFanOutHandler(FanOutHandlerConfig{
		Backends:      backends,
		Codec:         queryrange.DefaultCodec,
		Logger:        log.NewNopLogger(),
		Metrics:       NewProxyMetrics(prometheus.NewRegistry()),
		RouteName:     "test_route",
		EnableRace:    true,
		RaceTolerance: 100 * time.Millisecond,
	})

	req := &queryrange.LokiRequest{
		Query:   `{app="test"}`,
		StartTs: time.Now().Add(-1 * time.Hour),
		EndTs:   time.Now(),
		Limit:   100,
	}

	ctx := user.InjectOrgID(context.Background(), "test-tenant")
	resp, err := handler.Do(ctx, req)
	require.NoError(t, err)
	require.NotNil(t, resp)

	lokiResp, ok := resp.(*queryrange.LokiResponse)
	require.True(t, ok)
	require.Equal(t, "success", lokiResp.Status)
	require.Len(t, lokiResp.Data.Result, 1)
	require.Contains(t, lokiResp.Data.Result[0].Labels, `backend="2"`)

	time.Sleep(100 * time.Millisecond)
}

tools/querytee/fanout_handler_test.go:257

  • Missing test coverage for the race tolerance timeout scenario. The test verifies that a non-preferred backend can win if it finishes within the tolerance period, but there's no test for when the non-preferred backend takes longer than the tolerance period. Add a test case where the preferred backend finishes first and the non-preferred backend takes longer than the race tolerance (e.g., preferred at 10ms, non-preferred at 150ms, tolerance at 100ms), to verify that the preferred backend is correctly declared the winner when the timeout expires.
func TestFanOutHandler_Do_RaceModeReturnsNonPreferredIfWithinTolerance(t *testing.T) {
	backend1 := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		time.Sleep(10 * time.Millisecond)
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		_, err := w.Write([]byte(`{"status":"success","data":{"resultType":"streams","result":[{"stream":{"backend":"1"},"values":[["1000000000","log line 1"]]}]}}`))
		require.NoError(t, err)
	}))
	defer backend1.Close()

	backend2 := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		time.Sleep(50 * time.Millisecond)
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		_, err := w.Write([]byte(`{"status":"success","data":{"resultType":"streams","result":[{"stream":{"backend":"2"},"values":[["1000000000","log line 2"]]}]}}`))
		require.NoError(t, err)
	}))
	defer backend2.Close()

	backend1URL, _ := url.Parse(backend1.URL)
	backend2URL, _ := url.Parse(backend2.URL)

	backends := []*ProxyBackend{
		NewProxyBackend("backend-1", backend1URL, 5*time.Second, true),
		NewProxyBackend("backend-2", backend2URL, 5*time.Second, false),
	}

	handler := NewFanOutHandler(FanOutHandlerConfig{
		Backends:      backends,
		Codec:         queryrange.DefaultCodec,
		Logger:        log.NewNopLogger(),
		Metrics:       NewProxyMetrics(prometheus.NewRegistry()),
		RouteName:     "test_route",
		EnableRace:    true,
		RaceTolerance: 100 * time.Millisecond,
	})

	req := &queryrange.LokiRequest{
		Query:   `{app="test"}`,
		StartTs: time.Now().Add(-1 * time.Hour),
		EndTs:   time.Now(),
		Limit:   100,
	}

	ctx := user.InjectOrgID(context.Background(), "test-tenant")
	resp, err := handler.Do(ctx, req)
	require.NoError(t, err)
	require.NotNil(t, resp)

	lokiResp, ok := resp.(*queryrange.LokiResponse)
	require.True(t, ok)
	require.Equal(t, "success", lokiResp.Status)
	require.Len(t, lokiResp.Data.Result, 1)
	require.Contains(t, lokiResp.Data.Result[0].Labels, `backend="2"`)

	time.Sleep(100 * time.Millisecond)
}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant