Description
Is there an existing issue for this?
- I have searched the existing issues
Feature description
After the most recent batch of performance fixes, discovery jobs that pull metrics from large AWS environments have very reasonable resource utilization 🎉. The current limitation that can still cause longer scrape times is that all ListMetrics calls must complete before any GetMetricData calls are made. Since the APIs have independent rate limits (ListMetrics is 25 TPS and GetMetricData is ~50 TPS), we can safely start calling GetMetricData before ListMetrics completes.
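To illustrate the idea, here is a minimal sketch of overlapping the two calls with a channel. It is not the exporter's code: `Metric`, `listMetrics`, `getMetricData`, and `pipeline` are hypothetical stand-ins, and the AWS calls are simulated in memory.

```go
package main

import (
	"fmt"
	"sync"
)

// Metric is a hypothetical stand-in for a CloudWatch metric identity.
type Metric struct{ Namespace, Name string }

// listMetrics simulates paginated ListMetrics calls. Each page is sent
// downstream as soon as it is available instead of being accumulated.
func listMetrics(pages [][]Metric, out chan<- []Metric) {
	defer close(out)
	for _, page := range pages {
		out <- page
	}
}

// getMetricData simulates one GetMetricData call for a batch of metrics.
func getMetricData(batch []Metric) []string {
	res := make([]string, 0, len(batch))
	for _, m := range batch {
		res = append(res, m.Namespace+"/"+m.Name)
	}
	return res
}

// pipeline fetches data for early pages while later pages are still
// being listed. The worker count is an assumption, not a tuned value.
func pipeline(pages [][]Metric, workers int) []string {
	listed := make(chan []Metric)
	results := make(chan []string)
	go listMetrics(pages, listed)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range listed {
				results <- getMetricData(batch)
			}
		}()
	}
	// Close results once every worker has drained the listed channel.
	go func() { wg.Wait(); close(results) }()

	var all []string
	for r := range results {
		all = append(all, r...)
	}
	return all
}

func main() {
	pages := [][]Metric{
		{{"AWS/EC2", "CPUUtilization"}, {"AWS/EC2", "NetworkIn"}},
		{{"AWS/RDS", "FreeableMemory"}},
	}
	fmt.Println(len(pipeline(pages, 2))) // prints 3
}
```

Because each API has its own rate limit, the producer and the workers could be throttled independently in a real implementation; this sketch omits rate limiting and error handling entirely.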
I think the most idiomatic Go way of going about this is to introduce channels to runDiscoveryJob. The challenge is that the current code is really complex and relatively untestable. I think it needs to be refactored first to dramatically reduce the risk of such an impactful change. I would like to start by decomposing the main steps of runDiscoveryJob into smaller composable/testable "dataflows", listed below:
- GetResources
- ListMetrics
- AssociateMetricsToResources
- GetMetricData
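As a rough sketch of what this decomposition could look like, the four flows can be expressed as plain functions that runDiscoveryJob wires together. All types and signatures here (`Flows`, `Resource`, `Metric`, `AssociatedMetric`, `Datum`, `associate`, `fakeFlows`) are hypothetical and heavily simplified; in particular, the association step is reduced to an ARN match for illustration.

```go
package main

import "fmt"

// Hypothetical, simplified data types for the sketch.
type Resource struct{ ARN string }
type Metric struct{ Name, ResourceARN string }
type AssociatedMetric struct {
	Resource Resource
	Metric   Metric
}
type Datum struct {
	ARN, Metric string
	Value       float64
}

// Flows holds one function per dataflow so each can be tested or faked alone.
type Flows struct {
	GetResources                func() ([]Resource, error)
	ListMetrics                 func() ([]Metric, error)
	AssociateMetricsToResources func([]Resource, []Metric) []AssociatedMetric
	GetMetricData               func([]AssociatedMetric) ([]Datum, error)
}

// runDiscoveryJob only moves data between flows; it holds no business logic.
func runDiscoveryJob(f Flows) ([]Datum, error) {
	resources, err := f.GetResources()
	if err != nil {
		return nil, fmt.Errorf("GetResources: %w", err)
	}
	metrics, err := f.ListMetrics()
	if err != nil {
		return nil, fmt.Errorf("ListMetrics: %w", err)
	}
	associated := f.AssociateMetricsToResources(resources, metrics)
	return f.GetMetricData(associated)
}

// associate keeps only metrics whose ResourceARN matches a discovered resource.
func associate(resources []Resource, metrics []Metric) []AssociatedMetric {
	byARN := make(map[string]Resource, len(resources))
	for _, r := range resources {
		byARN[r.ARN] = r
	}
	var out []AssociatedMetric
	for _, m := range metrics {
		if r, ok := byARN[m.ResourceARN]; ok {
			out = append(out, AssociatedMetric{Resource: r, Metric: m})
		}
	}
	return out
}

// fakeFlows builds in-memory flows, as a unit test might.
func fakeFlows() Flows {
	return Flows{
		GetResources: func() ([]Resource, error) {
			return []Resource{{ARN: "arn:a"}}, nil
		},
		ListMetrics: func() ([]Metric, error) {
			return []Metric{
				{Name: "CPUUtilization", ResourceARN: "arn:a"},
				{Name: "Orphan", ResourceARN: "arn:unknown"},
			}, nil
		},
		AssociateMetricsToResources: associate,
		GetMetricData: func(ams []AssociatedMetric) ([]Datum, error) {
			out := make([]Datum, 0, len(ams))
			for _, am := range ams {
				out = append(out, Datum{ARN: am.Resource.ARN, Metric: am.Metric.Name, Value: 1})
			}
			return out, nil
		},
	}
}

func main() {
	data, err := runDiscoveryJob(fakeFlows())
	fmt.Println(len(data), err)
}
```

With this shape, each flow can be covered by its own tests, and a separate test can check that runDiscoveryJob passes data between fakes correctly.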
At this point we should be able to have solid test coverage across the complex logic used by each flow, and confidence that runDiscoveryJob flows the data between them appropriately. After this, introducing channels should hopefully be as simple as introducing a new strategy for how runDiscoveryJob composes the flow of data, which can be gated behind a feature flag. This level of decoupling will make it much easier to test the complex cases channels require, like shutdown and error propagation.
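A minimal sketch of the strategy idea, assuming simplified string-based flows: the same two flows are composed either sequentially or through a channel, and a boolean stands in for the feature flag. `ListFn`, `GetFn`, `Strategy`, `pickStrategy`, `errFetch`, and `demo` are all hypothetical names. The channel-based variant uses context cancellation to shut down the lister when a fetch fails.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ListFn and GetFn are hypothetical, simplified flow signatures.
type ListFn func(ctx context.Context, emit func(metric string) error) error
type GetFn func(ctx context.Context, metric string) (string, error)

// Strategy decides how the flows are composed.
type Strategy func(ctx context.Context, list ListFn, get GetFn) ([]string, error)

// sequential lists everything first, then fetches (the current behavior).
func sequential(ctx context.Context, list ListFn, get GetFn) ([]string, error) {
	var metrics []string
	collect := func(m string) error { metrics = append(metrics, m); return nil }
	if err := list(ctx, collect); err != nil {
		return nil, err
	}
	out := make([]string, 0, len(metrics))
	for _, m := range metrics {
		d, err := get(ctx, m)
		if err != nil {
			return nil, err
		}
		out = append(out, d)
	}
	return out, nil
}

// streaming fetches while listing is still in progress.
func streaming(ctx context.Context, list ListFn, get GetFn) ([]string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // on early return this unblocks the lister goroutine
	metrics := make(chan string)
	listErr := make(chan error, 1)
	go func() {
		defer close(metrics)
		listErr <- list(ctx, func(m string) error {
			select {
			case metrics <- m:
				return nil
			case <-ctx.Done():
				return ctx.Err()
			}
		})
	}()
	var out []string
	for m := range metrics {
		d, err := get(ctx, m)
		if err != nil {
			return nil, err
		}
		out = append(out, d)
	}
	if err := <-listErr; err != nil {
		return nil, err
	}
	return out, nil
}

// pickStrategy stands in for the feature-flag check.
func pickStrategy(streamingEnabled bool) Strategy {
	if streamingEnabled {
		return streaming
	}
	return sequential
}

var errFetch = errors.New("fetch failed")

// demo runs the chosen strategy against fake flows; get fails on failOn.
func demo(streamingEnabled bool, failOn string) ([]string, error) {
	list := func(ctx context.Context, emit func(string) error) error {
		for _, m := range []string{"a", "b", "c"} {
			if err := emit(m); err != nil {
				return err
			}
		}
		return nil
	}
	get := func(ctx context.Context, m string) (string, error) {
		if m == failOn {
			return "", errFetch
		}
		return m + "-data", nil
	}
	return pickStrategy(streamingEnabled)(context.Background(), list, get)
}

func main() {
	for _, enabled := range []bool{false, true} {
		out, err := demo(enabled, "")
		fmt.Println(enabled, out, err)
	}
	_, err := demo(true, "b")
	fmt.Println(errors.Is(err, errFetch))
}
```

Since both strategies share the same flow signatures, one table of test cases can be run against each of them, which is where the shutdown and error propagation cases would live.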
If this pattern works out well, I think it should be adaptable to reduce the amount of copied code that CustomNamespace requires. A CustomNamespace job should be a composition of the ListMetrics and GetMetricData dataflows.
What might the configuration look like?
Ideally, no configuration changes are required