Skip to content

[+] implement parallel source discovery#1378

Open
pashagolub wants to merge 3 commits intomasterfrom
parallel-source-discovery
Open

[+] implement parallel source discovery#1378
pashagolub wants to merge 3 commits intomasterfrom
parallel-source-discovery

Conversation

@pashagolub
Copy link
Copy Markdown
Collaborator

@pashagolub pashagolub commented Apr 24, 2026

Improves dead-source handling with parallel resolution and instance_up=0 on discovery failure.

Sources.ResolveDatabases() previously resolved each source sequentially. A single slow or unresponsive source (e.g. a continuous-discovery endpoint behind a firewall) would block discovery of all subsequent sources for the full connection timeout duration.

Sources are now resolved concurrently using sync.WaitGroup.Go(). Results are collected into a pre-allocated indexed slice to preserve deterministic ordering. Per-source error logging with source name is included in the resolver itself.

When a SourcePostgresContinuous or SourcePatroni source fails to resolve any databases, LoadSources() now emits instance_up=0 to the configured sinks. This makes the failure visible in dashboards and alerting, consistent with how unreachable directly-monitored sources are handled.

@pashagolub pashagolub self-assigned this Apr 24, 2026
@pashagolub pashagolub requested a review from 0xgouda April 24, 2026 12:27
@pashagolub pashagolub added the sources What sources and in what way to monitor label Apr 24, 2026
@coveralls
Copy link
Copy Markdown

coveralls commented Apr 24, 2026

Coverage Report for CI Build 25027401738

Coverage decreased (-0.2%) to 83.177%

Details

  • Coverage decreased (-0.2%) from the base build.
  • Patch coverage: 23 uncovered changes across 2 files (5 of 28 lines covered, 17.86%).
  • No coverage regressions found.

Uncovered Changes

File Changed Covered %
internal/sources/resolver.go 21 0 0.0%
internal/reaper/reaper.go 6 4 66.67%

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 5338
Covered Lines: 4440
Line Coverage: 83.18%
Coverage Strength: 0.95 hits per line

💛 - Coveralls

@pashagolub pashagolub force-pushed the parallel-source-discovery branch from a3814f5 to a715672 Compare April 24, 2026 18:41
@0xgouda
Copy link
Copy Markdown
Collaborator

0xgouda commented Apr 27, 2026

Is this related to #1377?

@pashagolub
Copy link
Copy Markdown
Collaborator Author

Is this related to #1377?

In some way. I couldn't reproduce that issue but I found that misconfigured discovery sources could cause a huge delays.

Improves dead-source handling with parallel resolution and instance_up=0
on discovery failure.

`Sources.ResolveDatabases()` previously resolved each source
sequentially. A single slow or unresponsive source
(e.g. a continuous-discovery endpoint behind a firewall) would block
discovery of all subsequent sources for the full connection timeout
duration.

Sources are now resolved concurrently using `sync.WaitGroup.Go()`.
Results are collected into a pre-allocated indexed slice to preserve
deterministic ordering. Per-source error logging with source name is
included in the resolver itself.

When a `SourcePostgresContinuous` or `SourcePatroni` source fails to
resolve any databases, `LoadSources()` now emits `instance_up=0` to
the configured sinks. This makes the failure visible in dashboards and
alerting, consistent with how unreachable directly-monitored sources are
handled.
@0xgouda 0xgouda force-pushed the parallel-source-discovery branch from a715672 to 75ea9bd Compare April 28, 2026 00:40
Comment on lines +48 to +50
if onError != nil {
onError(srcs[i].Name)
}
Copy link
Copy Markdown
Collaborator

@0xgouda 0xgouda Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a concern here that can be reproduced with the following steps:

  1. Define a target that happens to be unreachable and is of the kind postgres-continuous-discovery
  2. pgwatch writes instance_up=0 for the target for a while with dbname = sourceName
  3. The target becomes alive
  4. pgwatch runs for a while and now writes the updated instance_up = 1, but with a new dbname = sourceName + _ + realDbname
  5. The target is down again and its instance_up = 0 is written with dbname = sourceName + _ + realDbname

So the full instance uptime history becomes a bit disconnected, with different dbname[s].

But generally, I think that's the best we can do, just wanted to note this behaviour.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! We could use source instead of dbname. This way we will know for sure at which point that happened

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain more?

Copy link
Copy Markdown
Collaborator Author

@pashagolub pashagolub Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// WriteInstanceDown writes instance_up = 0 metric to sinks for the given source
func (r *Reaper) WriteInstanceDown(md *sources.SourceConn) {
	r.measurementCh <- metrics.MeasurementEnvelope{
		DBName:     md.Name,
		MetricName: specialMetricInstanceUp,
		Data: metrics.Measurements{metrics.Measurement{
			metrics.EpochColumnName: time.Now().UnixNano(),
			"kind": string(md.Kind),
		//  ^^^^^^^^^^^^^^^^^^^^^^^
			specialMetricInstanceUp: 0},
		},
	}
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this way grafana could distinguish regular databases vs all other

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sources What sources and in what way to monitor

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants