Commit 42d13f4

docs(outage): resolved
Signed-off-by: Eric Bode <eric.bode@foundries.io>
1 parent da1b8cd commit 42d13f4

1 file changed

Lines changed: 22 additions & 0 deletions

outage/2025-06-12-gcs.md

@@ -10,6 +10,18 @@ We use CloudFlare to make sure end users can access our Web UI over at, amongst
 We use Google Cloud Storage to store all our data in the cloud, amongst others, artifacts from our CI.
 The latter is causing our core infrastructure to be down.
 
+### CloudFlare
+
+The root cause analysis for CloudFlare seems to be:
+
+> Cloudflare’s critical Workers KV service went offline due to an outage of a 3rd party service that is a key dependency.
+
+This could be because they run parts of their stack on Google Cloud Platform as well, or it could just be a coincidence.
+
+### Google
+
+Google has not released a root cause analysis report as of this time, but when they do it will be included here as well.
+
 ### Timeline of Events
 
 - **18:31 UTC** - It is noticed that CloudFlare has an incident
@@ -18,3 +30,13 @@ The latter is causing our core infrastructure to be down.
 - **19:13 UTC** - CloudFlare states that services are recovering
 - **19:30 UTC** - Google states that services have recovered except us-central1 region
 - **19:30 UTC** - Our CI status page is active again, service seems to be recovering
+- **19:45 UTC** - Temporary measure removed
+- **21:31 UTC** - CloudFlare reports being fully operational
+- **21:35 UTC** - Resiliency fix added to our CI codebase
+- **01:27 UTC** - Google reports being fully operational
+
+## Lessons Learned
+
+We identified a non-integral part of our CI codebase that could cause major fallout if a third-party service is down or suffers a partial outage.
+A fix is now in place to make us more resilient when a similar outage happens again. Work is also ongoing to scan all of our services for similar points of failure
+and address them accordingly.
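
The kind of fix described under Lessons Learned can be sketched roughly as follows. This is a minimal illustration only, not the actual change made to the CI codebase: `call_non_critical` and `upload_artifact_metadata` are hypothetical names, and the retry count is an assumption. The idea is that a non-integral step that depends on a third-party service is retried a bounded number of times, and if it keeps failing the job logs a warning and continues with a fallback value instead of aborting.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
log = logging.getLogger("ci")


def call_non_critical(
    step: Callable[[], T],
    fallback: Optional[T] = None,
    attempts: int = 2,
) -> Optional[T]:
    """Run a non-essential step; after repeated failures, return the fallback.

    Hypothetical helper: the exception is swallowed on purpose so that an
    outage of a third-party dependency cannot take the whole job down.
    """
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as exc:  # broad by design: any failure here is non-fatal
            log.warning(
                "non-critical step failed (attempt %d/%d): %s",
                attempt, attempts, exc,
            )
    return fallback


def upload_artifact_metadata() -> str:
    # Hypothetical stand-in for a call to a third-party service that is down.
    raise ConnectionError("storage backend unavailable")


# The job keeps going and records that the step was skipped.
result = call_non_critical(upload_artifact_metadata, fallback="skipped")
```

Steps that are integral to the job should still fail loudly; a wrapper like this only makes sense for work the job can finish without.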
