Track in-transit items in pipeline#584
AltayAkkus wants to merge 2 commits into internetarchive:main
    PostprocessorRoutines.promGauge = globalPromStats.postprocessorRoutines
    FinisherRoutines.promGauge = globalPromStats.finisherRoutines

    PreprocessorInTransit.promGauge = globalPromStats.preprocessorInTransit
We need to wire in the Prometheus gauges at runtime. Hopefully this will not cause a race condition.
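One way to make the late wiring safe is sketched below. This is only an illustration under assumptions: `lateBoundCounter`, `Attach`, the `gauge` interface and `fakeGauge` are hypothetical names, not the types in this PR. The idea is to guard both the count and the gauge field with one mutex and to sync the gauge to the current count at the moment it is attached.

```go
package main

import (
	"fmt"
	"sync"
)

// gauge is a hypothetical stand-in for the part of prometheus.Gauge used here.
type gauge interface {
	Set(float64)
}

// lateBoundCounter is an illustrative type, not Zeno's GaugedCounter.
// The mutex protects both the count and the gauge field, so attaching the
// gauge while other goroutines call Add is not a data race.
type lateBoundCounter struct {
	mu    sync.Mutex
	count int64
	g     gauge // nil until metrics are wired in
}

// Add changes the count and mirrors it to the gauge if one is attached.
func (c *lateBoundCounter) Add(delta int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.count += delta
	if c.g != nil {
		c.g.Set(float64(c.count))
	}
}

// Attach wires the gauge in at runtime and syncs it to the current count.
func (c *lateBoundCounter) Attach(g gauge) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.g = g
	g.Set(float64(c.count))
}

// Value returns the current count.
func (c *lateBoundCounter) Value() int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.count
}

// fakeGauge records the last value it was given (for the example only).
type fakeGauge struct {
	mu  sync.Mutex
	val float64
}

func (f *fakeGauge) Set(v float64) { f.mu.Lock(); f.val = v; f.mu.Unlock() }
func (f *fakeGauge) Get() float64  { f.mu.Lock(); defer f.mu.Unlock(); return f.val }

func main() {
	c := &lateBoundCounter{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Add(1)
		}()
	}
	g := &fakeGauge{}
	c.Attach(g) // may run concurrently with the Add calls above
	wg.Wait()
	fmt.Println(c.Value(), g.Get())
}
```

If the gauges are always assigned before any worker goroutine starts, no locking is needed for the assignment itself; the sketch only covers the case where that ordering cannot be guaranteed.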
    github.com/inconshreveable/mousetrap v1.1.0 // indirect
    github.com/klauspost/compress v1.18.4 // indirect
    github.com/klauspost/cpuid/v2 v2.0.12 // indirect
    github.com/kylelemons/godebug v1.1.0 // indirect
Added because counter_test.go uses prometheus/testutil.
Codecov Report
❌ Patch coverage is …
Additional details and impacted files

    @@            Coverage Diff             @@
    ##             main     #584      +/-   ##
    ==========================================
    - Coverage   56.42%   56.32%   -0.11%
    ==========================================
      Files         133      133
      Lines        6747     6777      +30
    ==========================================
    + Hits         3807     3817      +10
    - Misses       2561     2587      +26
    + Partials      379      373       -6
Additional refactoring of stats.go
The same pattern I refactored for … I only refactored the Routine counters because their interfaces allow …
Additionally, some of these methods use … I think that we could add additional interfaces to the …
Identify components slowing down Zeno
Although we implemented the same endpoint as internetarchive/warcprox, we cannot really answer the question "which components are slowing down Zeno?" At least on my machine and with my workloads (crawling …

Proposal: Add processing timestamps to model

    type Item struct {
        id         string
        url        *URL
        seedVia    string
        status     ItemState
        source     ItemSource
        childrenMu sync.RWMutex
        children   []*Item
        parent     *Item
        err        error
        // new
        traceTime    bool                    // marks item for timestamping
        stageTimes   map[ItemState]time.Time // timestamps per stage
        stageTimesMu sync.RWMutex            // guards status and stageTimes in SetStatus
    }

    // old SetStatus
    // func (i *Item) SetStatus(status ItemState) { i.status = status }
    func (i *Item) SetStatus(status ItemState) {
        i.stageTimesMu.Lock()
        defer i.stageTimesMu.Unlock()
        // Always update state
        i.status = status
        // Only track timestamps if enabled
        if !i.traceTime {
            return
        }
        // Lazy init
        if i.stageTimes == nil {
            i.stageTimes = make(map[ItemState]time.Time)
        }
        // Only record the first time the item reaches this status
        if _, exists := i.stageTimes[status]; !exists {
            i.stageTimes[status] = time.Now()
        }
    }

To reduce load, we could control how many and which items are sampled (fractional, stochastic, etc.) by setting the traceTime flag. We could add a … Zeno/internal/pkg/stats/methods.go Lines 179 to 184 in 6e932c2
That would be quite an elegant solution for profiling Zeno's entire pipeline on actual workloads.
Refers to #471
@vbanos suggested tracking in-transit items per component to identify bottlenecks. We already expose similar metrics (*Routines) via Prometheus, but the current implementation requires separate increment/decrement methods per component and is somewhat clunky.
Zeno/internal/pkg/stats/methods.go
Lines 34 to 113 in 6e932c2
GaugedCounter
In the first commit I refactored the existing *Routines to use GaugedCounter, a thin wrapper around counter and prometheus.GaugeVec with Add/Done semantics (similar to a WaitGroup). Adding counters with GaugedCounter has far less friction than the previous approach.
In the second commit I added the in-transit stats (for example PreprocessorInTransit).
Each counter increments when an item enters a component and decrements when it leaves.
We can now monitor the queue pressure of each component in the pipeline 🥳
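For readers who have not opened the diff, the Add/Done pattern can be sketched roughly as follows. This is not the PR's GaugedCounter: `inTransit`, `memGauge`, `runStage` and the small `gauge` interface are hypothetical, chosen so that the example runs with the standard library only. In real code the gauge would presumably come from a prometheus.GaugeVec via WithLabelValues.

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

// gauge is a hypothetical subset of prometheus.Gauge, which offers Inc and Dec.
type gauge interface {
	Inc()
	Dec()
}

// memGauge is an in-memory gauge used only for this example.
type memGauge struct{ v atomic.Int64 }

func (m *memGauge) Inc() { m.v.Add(1) }
func (m *memGauge) Dec() { m.v.Add(-1) }

// inTransit is an illustrative Add/Done counter that keeps a local count and
// mirrors every change to a gauge.
type inTransit struct {
	n atomic.Int64
	g gauge
}

// Add marks one item as having entered the component.
func (c *inTransit) Add() { c.n.Add(1); c.g.Inc() }

// Done marks one item as having left the component.
func (c *inTransit) Done() { c.n.Add(-1); c.g.Dec() }

// Value returns the number of items currently inside the component.
func (c *inTransit) Value() int64 { return c.n.Load() }

// runStage counts an item in when it is received and out only after it has
// been handed to the next stage, so time spent waiting on a slow consumer
// still shows up as in-transit pressure.
func runStage(c *inTransit, in <-chan string, out chan<- string, work func(string) string) {
	for item := range in {
		c.Add()
		res := work(item)
		out <- res
		c.Done()
	}
	close(out)
}

func main() {
	c := &inTransit{g: &memGauge{}}
	in := make(chan string)
	out := make(chan string)
	go runStage(c, in, out, strings.ToUpper)
	go func() {
		for _, u := range []string{"a", "b"} {
			in <- u
		}
		close(in)
	}()
	for res := range out {
		fmt.Println(res, "in transit:", c.Value())
	}
}
```

A gauge that stays high for one component while its neighbours stay near zero points at that component as the bottleneck.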