Description
Right now, to help with CI performance and reliability, our CI pipelines leverage the DevOps cache to cache the dependencies required by a CI job, reducing the number of calls made to central repositories to download dependencies. Historically, we saw general flakiness with central repositories, possibly due to throttling, given the number of CI jobs being run and how many dependencies have to be downloaded. As the repository has grown and the number of unique jobs has increased, we've begun seeing more and more calls to central repositories to download dependencies, as both the velocity of dependency changes and the number of unique cache refreshes have increased.
At the time of writing this issue, we have roughly 180 unique CI matrices, where each matrix runs (Linux, macOS, Windows) x (Java 8, Java 17) with an additional Java 11 run, and each of these runs uses its own unique cache. That works out to roughly 1260 caches, plus likely another few hundred for miscellaneous jobs, for a grand total of around 1500. Every time there is a change to dependencies, all of those caches have to be refreshed, which means a lot of calls to central repositories, reintroducing flakiness around these downloads.
Given that, we should look into redesigning caching to use fewer, larger caches that each contain all dependencies of the repository. Instead of having a cache that is specific to each job, we could reduce this to (Linux, macOS, Windows) x (Java 8, Java 11, Java 17), for a total of 9 caches (possibly fewer, as Java 11 runs are only done on Linux, but let's use 9 for a fuller comparison).
While the smaller caches mean quicker downloading, the sheer number of caches means much more data is being stored overall, as many dependencies are shared across most SDKs. Many caches now are around 150-200MB, so 1500 x 200MB is roughly 300GB of cache. The largest cache under the current smaller-cache design is roughly 600MB; extrapolating, a combined cache would be around 1-1.5GB, and with 9 caches the new design would only require around 10-15GB, hugely reducing storage requirements. Additionally, this design means the PR changing dependencies will repopulate the cache with the updated dependencies, so there won't be a requirement for other pipelines to update caches as well. The downside is that downloading the cache will take longer: 150-200MB usually took 5-10s and 600MB usually took 12-15s, so 1-1.5GB might take 30-45s. This may not matter, though, as downloading dependencies from central repositories takes much longer than downloading a cache.
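The consolidated design could be keyed per OS and Java version in the DevOps Cache task. A minimal sketch, assuming a Maven local repository; the `JavaVersion` and `MAVEN_CACHE_FOLDER` variable names are placeholders for illustration, not the pipeline's actual names:

```yaml
# One cache per (OS, Java version) instead of one per job:
# (Linux, macOS, Windows) x (Java 8, 11, 17) yields the 9 caches above.
- task: Cache@2
  inputs:
    key: 'maven | "$(Agent.OS)" | "$(JavaVersion)"'
    path: $(MAVEN_CACHE_FOLDER)
  displayName: 'Cache Maven local repository'
```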
- Create tooling that can download all dependencies of the Azure SDKs for Java.
- Update CI caching to use the new tooling speculatively.
- The DevOps Cache task offers a way to set a variable flag indicating whether there was a cache hit or miss, which can be used to determine if the full dependency download needs to be performed.
- Validate that pipeline performance remains roughly in line with the larger cache by comparing download performance in cache hit scenarios.
- Validate that pipelines not included in the initial job updating the cache don't update the cache themselves.
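The hit/miss flag mentioned above is the Cache task's `cacheHitVar` input, which could gate the full download. A sketch assuming a Maven local repository, with `mvn dependency:go-offline` standing in for the proposed tooling (the actual tooling, key layout, and variable names may differ):

```yaml
- task: Cache@2
  inputs:
    key: 'maven | "$(Agent.OS)" | "$(JavaVersion)"'
    path: $(MAVEN_CACHE_FOLDER)
    cacheHitVar: CACHE_RESTORED  # set to 'true' on a cache hit

# Only perform the full dependency download on a cache miss.
- script: mvn dependency:go-offline -Dmaven.repo.local=$(MAVEN_CACHE_FOLDER)
  condition: ne(variables.CACHE_RESTORED, 'true')
  displayName: 'Populate cache with all dependencies'
```

On a cache hit the download step is skipped entirely, which is what keeps cache-hit job performance roughly in line with today's per-job caches.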