
Track in-transit items in pipeline#584

Open
AltayAkkus wants to merge 2 commits into internetarchive:main from AltayAkkus:feat-itemflow

Conversation


@AltayAkkus AltayAkkus commented Mar 23, 2026

Refers to #471

@vbanos suggested tracking in-transit items per component to identify bottlenecks. We already expose similar `*Routines` metrics via Prometheus, but the current implementation requires separate increment/decrement methods per component and is somewhat clunky:

//////////////////////////
// PreprocessorRoutines //
//////////////////////////

// PreprocessorRoutinesIncr increments the PreprocessorRoutines counter by 1.
func PreprocessorRoutinesIncr() {
	globalStats.PreprocessorRoutines.incr(1)
	if globalPromStats != nil {
		globalPromStats.preprocessorRoutines.WithLabelValues(config.Get().JobPrometheus, hostname, version).Inc()
	}
}

// PreprocessorRoutinesDecr decrements the PreprocessorRoutines counter by 1.
func PreprocessorRoutinesDecr() {
	globalStats.PreprocessorRoutines.decr(1)
	if globalPromStats != nil {
		globalPromStats.preprocessorRoutines.WithLabelValues(config.Get().JobPrometheus, hostname, version).Dec()
	}
}

//////////////////////
// ArchiverRoutines //
//////////////////////

// ArchiverRoutinesIncr increments the ArchiverRoutines counter by 1.
func ArchiverRoutinesIncr() {
	globalStats.ArchiverRoutines.incr(1)
	if globalPromStats != nil {
		globalPromStats.archiverRoutines.WithLabelValues(config.Get().JobPrometheus, hostname, version).Inc()
	}
}

// ArchiverRoutinesDecr decrements the ArchiverRoutines counter by 1.
func ArchiverRoutinesDecr() {
	globalStats.ArchiverRoutines.decr(1)
	if globalPromStats != nil {
		globalPromStats.archiverRoutines.WithLabelValues(config.Get().JobPrometheus, hostname, version).Dec()
	}
}

///////////////////////////
// PostprocessorRoutines //
///////////////////////////

// PostprocessorRoutinesIncr increments the PostprocessorRoutines counter by 1.
func PostprocessorRoutinesIncr() {
	globalStats.PostprocessorRoutines.incr(1)
	if globalPromStats != nil {
		globalPromStats.postprocessorRoutines.WithLabelValues(config.Get().JobPrometheus, hostname, version).Inc()
	}
}

// PostprocessorRoutinesDecr decrements the PostprocessorRoutines counter by 1.
func PostprocessorRoutinesDecr() {
	globalStats.PostprocessorRoutines.decr(1)
	if globalPromStats != nil {
		globalPromStats.postprocessorRoutines.WithLabelValues(config.Get().JobPrometheus, hostname, version).Dec()
	}
}

//////////////////////
// FinisherRoutines //
//////////////////////

// FinisherRoutinesIncr increments the FinisherRoutines counter by 1.
func FinisherRoutinesIncr() {
	globalStats.FinisherRoutines.incr(1)
	if globalPromStats != nil {
		globalPromStats.finisherRoutines.WithLabelValues(config.Get().JobPrometheus, hostname, version).Inc()
	}
}

// FinisherRoutinesDecr decrements the FinisherRoutines counter by 1.
func FinisherRoutinesDecr() {
	globalStats.FinisherRoutines.decr(1)
	if globalPromStats != nil {
		globalPromStats.finisherRoutines.WithLabelValues(config.Get().JobPrometheus, hostname, version).Dec()
	}
}

GaugedCounter

In the first commit I refactored the existing *Routines to use GaugedCounter, a thin wrapper around counter and prometheus.GaugeVec with Add/Done semantics (similar to a sync.WaitGroup).
Adding counters with GaugedCounter involves far less friction than the previous approach.
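The GaugedCounter type itself isn't quoted in this thread; a minimal sketch of the Add/Done idea, with a local stand-in interface in place of the real *prometheus.GaugeVec, might look like:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// gauge is a stand-in for the subset of prometheus.Gauge this sketch needs
// (hypothetical; the real PR wires in a *prometheus.GaugeVec at runtime).
type gauge interface {
	Inc()
	Dec()
}

// GaugedCounter pairs an atomic counter with an optional Prometheus gauge,
// exposing WaitGroup-style Add/Done semantics.
type GaugedCounter struct {
	value     atomic.Int64
	promGauge gauge // may be nil until Prometheus is initialized
}

// Add increments the counter and mirrors the change to the gauge, if wired.
func (c *GaugedCounter) Add() {
	c.value.Add(1)
	if c.promGauge != nil {
		c.promGauge.Inc()
	}
}

// Done decrements the counter and mirrors the change to the gauge, if wired.
func (c *GaugedCounter) Done() {
	c.value.Add(-1)
	if c.promGauge != nil {
		c.promGauge.Dec()
	}
}

// Get returns the current value of the counter.
func (c *GaugedCounter) Get() int64 { return c.value.Load() }

func main() {
	var inTransit GaugedCounter
	inTransit.Add()
	inTransit.Add()
	inTransit.Done()
	fmt.Println(inTransit.Get()) // 1
}
```

The nil check mirrors the existing `if globalPromStats != nil` guard: the counter can exist and be used before the Prometheus gauge is wired in.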

In the second commit I added the stats

	PreprocessorInTransit  *GaugedCounter
	ArchiverInTransit      *GaugedCounter
	PostprocessorInTransit *GaugedCounter
	FinisherInTransit      *GaugedCounter

Each counter increments when an item enters a component and decrements when it leaves.

We can now monitor the queue pressure of each component in the pipeline 🥳

PostprocessorRoutines.promGauge = globalPromStats.postprocessorRoutines
FinisherRoutines.promGauge = globalPromStats.finisherRoutines

PreprocessorInTransit.promGauge = globalPromStats.preprocessorInTransit

We need to wire in the prometheus gauges at runtime. Hopefully this will not cause a race condition.

Comment thread: go.mod
github.com/inconshreveable/mousetrap v1.1.0 // indirect
github.com/klauspost/compress v1.18.4 // indirect
github.com/klauspost/cpuid/v2 v2.0.12 // indirect
github.com/kylelemons/godebug v1.1.0 // indirect

added due to prometheus/testutil in counter_test.go

@codecov-commenter

Codecov Report

❌ Patch coverage is 40.00000% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.32%. Comparing base (6e932c2) to head (6978f43).

Files with missing lines Patch % Lines
internal/pkg/stats/stats.go 25.00% 24 Missing ⚠️
internal/pkg/stats/prometheus.go 0.00% 20 Missing ⚠️
internal/pkg/postprocessor/postprocessor.go 66.66% 2 Missing ⚠️
internal/pkg/archiver/worker.go 80.00% 1 Missing ⚠️
internal/pkg/preprocessor/preprocessor.go 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #584      +/-   ##
==========================================
- Coverage   56.42%   56.32%   -0.11%     
==========================================
  Files         133      133              
  Lines        6747     6777      +30     
==========================================
+ Hits         3807     3817      +10     
- Misses       2561     2587      +26     
+ Partials      379      373       -6     
Flag Coverage Δ
e2etests 41.68% <32.50%> (-0.13%) ⬇️
unittests 29.20% <22.50%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown.


@AltayAkkus

Additional refactoring of stats.go

The same pattern I refactored for *Routines exists for many other stats:
methods.go is full of methods that mutate an atomic int and sync it with its corresponding Prometheus gauge
(WarcWritingQueueSizeSet, MeanHTTPRespTimeAdd, MeanProcessBodyTimeAdd, MeanWaitOnFeedbackTimeAdd, CFMitigatedIncr, and many more).

I only refactored the Routine counters because their interfaces allow Incr and Decr; the other counters

  1. only allow increment
    // SeencheckFailuresIncr increments the SeencheckFailures counter by 1.
    func SeencheckFailuresIncr() {
    	globalStats.SeencheckFailures.Add(1)
    	if globalPromStats != nil {
    		globalPromStats.seencheckFailures.WithLabelValues(config.Get().JobPrometheus, hostname, version).Inc()
    	}
    }

    // CFMitigatedIncr increments the CFMitigated counter by 1.
    func CFMitigatedIncr() {
    	globalStats.cfMitigated.Add(1)
    	if globalPromStats != nil {
    		globalPromStats.cfMitigated.WithLabelValues(config.Get().JobPrometheus, hostname, version).Inc()
    	}
    }

    // AkamaiMitigatedIncr increments the AkamaiMitigated counter by 1.
    func AkamaiMitigatedIncr() {
    	globalStats.akamaiMitigated.Add(1)
    	if globalPromStats != nil {
    		globalPromStats.akamaiMitigated.WithLabelValues(config.Get().JobPrometheus, hostname, version).Inc()
    	}
    }
  2. only allow set
WarcWritingQueueSizeSet
WARCDataTotalBytesSet
WARCCDXDedupeTotalBytesSet
WARCDoppelgangerDedupeTotalBytesSet
WARCLocalDedupeTotalBytesSet
WARCCDXDedupeTotalSet
WARCDoppelgangerDedupeTotalSet
WARCLocalDedupeTotalSet
  3. only allow add

    // MeanHTTPRespTimeAdd adds the given value to the MeanHTTPRespTime.
    func MeanHTTPRespTimeAdd(value time.Duration) {
    	globalStats.MeanHTTPResponseTime.add(uint64(value.Milliseconds()))
    	if globalPromStats != nil {
    		globalPromStats.meanHTTPRespTime.WithLabelValues(config.Get().JobPrometheus, hostname, version).Observe(float64(value))
    	}
    }

    //////////////////////////
    // MeanProcessBodyTime //
    //////////////////////////

    // MeanProcessBodyTimeAdd adds the given value to the MeanProcessBodyTime.
    func MeanProcessBodyTimeAdd(value time.Duration) {
    	globalStats.MeanProcessBodyTime.add(uint64(value.Milliseconds()))
    	if globalPromStats != nil {
    		globalPromStats.meanProcessBodyTime.WithLabelValues(config.Get().JobPrometheus, hostname, version).Observe(float64(value))
    	}
    }

    ////////////////////////////
    // MeanWaitOnFeedbackTime //
    ////////////////////////////

    // MeanWaitOnFeedbackTimeAdd adds the given value to the MeanWaitOnFeedbackTime.
    func MeanWaitOnFeedbackTimeAdd(value time.Duration) {
    	globalStats.MeanWaitOnFeedbackTime.add(uint64(value.Milliseconds()))
    	if globalPromStats != nil {
    		globalPromStats.meanWaitOnFeedbackTime.WithLabelValues(config.Get().JobPrometheus, hostname, version).Observe(float64(value))
    	}
    }

Additionally, some of these methods use prometheus.HistogramVec or prometheus.CounterVec.

I think we could add additional interfaces to counter.go so that we can remove most of these Java-esque getter/setter methods and improve metric collection in Zeno. @yzqzss @NGTmeaty thoughts?
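As a rough sketch of what such an interface could cover (the type name and the gauge stand-in are illustrative assumptions, not code from this PR), one type could absorb both the increment-only and set-only cases:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// gauge is a stand-in for the subset of prometheus.Gauge this sketch needs
// (hypothetical; a real version would mirror to a *prometheus.GaugeVec).
type gauge interface {
	Inc()
	Set(float64)
}

// GaugedValue is a hypothetical sibling of GaugedCounter covering the
// increment-only methods (e.g. SeencheckFailuresIncr) and the set-only
// methods (e.g. WarcWritingQueueSizeSet) with one type.
type GaugedValue struct {
	value     atomic.Int64
	promGauge gauge // may be nil until Prometheus is initialized
}

// Incr increments the value by 1 and mirrors it to the gauge, if wired.
func (v *GaugedValue) Incr() {
	v.value.Add(1)
	if v.promGauge != nil {
		v.promGauge.Inc()
	}
}

// Set overwrites the value and mirrors it to the gauge, if wired.
func (v *GaugedValue) Set(n int64) {
	v.value.Store(n)
	if v.promGauge != nil {
		v.promGauge.Set(float64(n))
	}
}

// Get returns the current value.
func (v *GaugedValue) Get() int64 { return v.value.Load() }

func main() {
	var queueSize GaugedValue
	queueSize.Incr()
	queueSize.Set(42)
	fmt.Println(queueSize.Get()) // 42
}
```

The HistogramVec-backed methods would still need a separate wrapper, since Observe records samples rather than mirroring a single value.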

@AltayAkkus

AltayAkkus commented Mar 23, 2026

Identify components slowing down Zeno

Although we implemented the same endpoint as internetarchive/warcprox, we cannot really answer the question "which components are slowing down Zeno?"

Every 0.5s: curl -s http://localhost:9090/metrics | grep in_transit              me.local: Mon Mar 23 23:21:29 2026
# HELP zeno_archiver_in_transit Number of items currently being processed by the archiver
# TYPE zeno_archiver_in_transit gauge
zeno_archiver_in_transit{hostname="me.local",project="testjob3",version="unknown_version"} 16
# HELP zeno_finisher_in_transit Number of items currently being processed by the finisher
# TYPE zeno_finisher_in_transit gauge
zeno_finisher_in_transit{hostname="me.local",project="testjob3",version="unknown_version"} 0
# HELP zeno_postprocessor_in_transit Number of items currently being processed by the postprocessor
# TYPE zeno_postprocessor_in_transit gauge
zeno_postprocessor_in_transit{hostname="me.local",project="testjob3",version="unknown_version"} 0
# HELP zeno_preprocessor_in_transit Number of items currently being processed by the preprocessor
# TYPE zeno_preprocessor_in_transit gauge
zeno_preprocessor_in_transit{hostname="me.local",project="testjob3",version="unknown_version"} 0

At least on my machine and with my workloads (crawling https://france.fr), you have to be very lucky or sample very aggressively to catch any in-transit items outside the archiver, simply because everything else is so much faster.

Proposal: Add processing timestamps to model

type Item struct {
	id         string
	url        *URL
	seedVia    string
	status     ItemState
	source     ItemSource

	childrenMu sync.RWMutex
	children   []*Item
	parent     *Item
	err        error

    // new
	traceTime    bool                    // marks item for timestamping
	stageTimes   map[ItemState]time.Time // timestamps per stage
	stageTimesMu sync.RWMutex
}

// old SetStatus
// func (i *Item) SetStatus(status ItemState) { i.status = status }
func (i *Item) SetStatus(status ItemState) { 
	i.stageTimesMu.Lock()
	defer i.stageTimesMu.Unlock()

	// Always update state
	i.status = status

	// Only track timestamps if enabled
	if !i.traceTime {
		return
	}

	// Lazy init 
	if i.stageTimes == nil {
		i.stageTimes = make(map[ItemState]time.Time)
	}

	// Only set if not already set
	if _, exists := i.stageTimes[status]; !exists {
		i.stageTimes[status] = time.Now()
	}
}

To reduce load, we could control how many (and which) items are sampled (fractional, stochastic, etc.) by setting the traceTime bool when creating the item, and collect the timestamps in the finisher.

We could add a prometheus.HistogramVec for each stage, let the finisher calculate the time deltas between the stages, and .Observe these processing times, as we already do for

func MeanHTTPRespTimeAdd(value time.Duration) {
	globalStats.MeanHTTPResponseTime.add(uint64(value.Milliseconds()))
	if globalPromStats != nil {
		globalPromStats.meanHTTPRespTime.WithLabelValues(config.Get().JobPrometheus, hostname, version).Observe(float64(value))
	}
}

That would be quite the elegant solution for profiling Zeno's entire pipeline on actual workloads.
Just give me a thumbs up and I'll come back to it :)

@AltayAkkus AltayAkkus marked this pull request as ready for review March 24, 2026 17:43