[+] implement parallel source discovery #1378
Conversation
Coverage Report for CI Build 25027401738: coverage decreased (-0.2%) to 83.177%. No coverage regressions found.
💛 Coveralls
Force-pushed from a3814f5 to a715672.
Is this related to #1377?

In some way. I couldn't reproduce that issue, but I found that misconfigured discovery sources could cause huge delays.
Improves dead-source handling with parallel resolution and `instance_up=0` on discovery failure.

`Sources.ResolveDatabases()` previously resolved each source sequentially. A single slow or unresponsive source (e.g. a continuous-discovery endpoint behind a firewall) would block discovery of all subsequent sources for the full connection timeout duration.

Sources are now resolved concurrently using `sync.WaitGroup.Go()`. Results are collected into a pre-allocated indexed slice to preserve deterministic ordering. Per-source error logging with the source name is included in the resolver itself.

When a `SourcePostgresContinuous` or `SourcePatroni` source fails to resolve any databases, `LoadSources()` now emits `instance_up=0` to the configured sinks. This makes the failure visible in dashboards and alerting, consistent with how unreachable directly-monitored sources are handled.
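A minimal sketch of the concurrent resolution pattern described above. The `Source`/`SourceConn` types and `resolveOne` are simplified stand-ins, not pgwatch's real definitions; only the `sync.WaitGroup.Go()` fan-out (Go 1.25+) and the pre-allocated indexed slice reflect the actual change:

```go
package sources

import (
	"log/slog"
	"sync"
)

// Simplified stand-ins for pgwatch's real types.
type Source struct{ Name string }
type SourceConn struct{ Name string }
type Sources []Source

// resolveOne stands in for the real per-source resolver (e.g. querying a
// continuous-discovery endpoint); it may block for the full connection timeout.
func resolveOne(s Source) ([]SourceConn, error) {
	return []SourceConn{{Name: s.Name}}, nil
}

// ResolveDatabases resolves all sources concurrently, so a single slow or
// unreachable source no longer delays the others.
func (srcs Sources) ResolveDatabases() []SourceConn {
	resolved := make([][]SourceConn, len(srcs)) // indexed by source position for deterministic ordering
	var wg sync.WaitGroup
	for i := range srcs {
		wg.Go(func() { // sync.WaitGroup.Go, available since Go 1.25
			dbs, err := resolveOne(srcs[i])
			if err != nil {
				// per-source error logging with the source name
				slog.Error("could not resolve source", "source", srcs[i].Name, "error", err)
				return
			}
			resolved[i] = dbs
		})
	}
	wg.Wait()
	var out []SourceConn
	for _, dbs := range resolved { // flatten in the original source order
		out = append(out, dbs...)
	}
	return out
}
```

With this shape, a dead source costs at most one goroutine blocked for the timeout instead of serializing discovery of every source behind it.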
Force-pushed from a715672 to 75ea9bd.
```go
if onError != nil {
	onError(srcs[i].Name)
}
```
I have a concern here that can be reproduced with the following steps:

- Define a target that happens to be unreachable and is of the kind `postgres-continuous-discovery`
- pgwatch writes `instance_up = 0` for the target for a while, with `dbname = sourceName`
- The target becomes alive
- pgwatch runs for a while and now writes the updated `instance_up = 1`, but with a new `dbname = sourceName + "_" + realDbname`
- The target is down again and its `instance_up = 0` is written with `dbname = sourceName + "_" + realDbname`

So the full instance uptime history becomes a bit disconnected, with different dbname[s]. But generally, I think that's the best we can do, just wanted to note this behaviour.
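To illustrate with hypothetical names (source `mycluster`, real database `prod`), the resulting series would look roughly like this:

```
time  dbname          instance_up
t0    mycluster       0    (unreachable, discovery fails)
t1    mycluster_prod  1    (target came up, real dbname resolved)
t2    mycluster_prod  0    (target down again)
```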
Good catch! We could use `source` instead of `dbname`. This way we will know for sure at which point that happened.
```go
// WriteInstanceDown writes instance_up = 0 metric to sinks for the given source
func (r *Reaper) WriteInstanceDown(md *sources.SourceConn) {
	r.measurementCh <- metrics.MeasurementEnvelope{
		DBName:     md.Name,
		MetricName: specialMetricInstanceUp,
		Data: metrics.Measurements{metrics.Measurement{
			metrics.EpochColumnName: time.Now().UnixNano(),
			"kind":                  string(md.Kind),
			// ^^^^^^^^^^^^^^^^^^^^^^^
			specialMetricInstanceUp: 0},
		},
	}
}
```
This way Grafana could distinguish regular databases from all others.
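A minimal sketch of how this could be wired together, assuming the resolver's `onError` hook (shown in the diff above) is built inside `LoadSources()`. The function name `instanceDownOnResolveFailure`, the `conns` lookup, and the exact signatures are illustrative assumptions, not pgwatch's actual code:

```go
// instanceDownOnResolveFailure returns the callback the resolver invokes for
// every source it fails to resolve (hypothetical helper, see the onError hook
// in the diff above).
func (r *Reaper) instanceDownOnResolveFailure(conns []*sources.SourceConn) func(string) {
	return func(sourceName string) {
		for _, md := range conns {
			if md.Name != sourceName {
				continue
			}
			// Only continuous-discovery kinds get the synthetic reading here;
			// directly monitored sources already report instance_up when their
			// own connection check fails.
			if md.Kind == sources.SourcePostgresContinuous || md.Kind == sources.SourcePatroni {
				r.WriteInstanceDown(md) // DBName = source name itself (see discussion above)
			}
		}
	}
}
```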