outage/2025-06-12-gcs.md
22 additions & 0 deletions
@@ -10,6 +10,18 @@ We use CloudFlare to make sure end users can access our Web UI over at, amongst
10
10
We use Google Cloud Storage to store all our data in the cloud, including artifacts from our CI.
11
11
The latter is causing our core infrastructure to be down.
12
12
13
+
### CloudFlare
14
+
15
+
According to CloudFlare, the root cause seems to be:
16
+
17
+
> Cloudflare’s critical Workers KV service went offline due to an outage of a 3rd party service that is a key dependency.
18
+
19
+
This could be because they run parts of their infrastructure on Google Cloud Platform as well, or it could be just a coincidence.
20
+
21
+
### Google
22
+
23
+
Google has not released a root cause analysis report at this time; when they do, it will be included here.
24
+
13
25
### Timeline of Events
14
26
15
27
- **18:31 UTC** - It is noticed that CloudFlare has an incident
@@ -18,3 +30,13 @@ The latter is causing our core infrastructure to be down.
18
30
- **19:13 UTC** - CloudFlare states that services are recovering
19
31
- **19:30 UTC** - Google states that services have recovered except in the us-central1 region
20
32
- **19:30 UTC** - Our CI status page is active again, service seems to be recovering
33
+
- **19:45 UTC** - Temporary measure removed
34
+
- **21:31 UTC** - CloudFlare reports being fully operational
35
+
- **21:35 UTC** - Fix for improved resiliency added to our CI codebase
36
+
- **01:27 UTC** - Google reports being fully operational
37
+
38
+
## Lessons Learned
39
+
40
+
We identified a non-integral part of our CI codebase that could cause major disruption and fallout if a third-party service is down or suffers a partial outage.
41
+
A fix is now in place to make us more resilient when a similar outage happens again. Work is also ongoing to scan all of our services for similar points of failure.
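
This report does not describe the actual fix, so the following is only a minimal sketch of the general idea: isolating a non-integral step so that a third-party failure cannot take down the integral work. All names here (`call_optional`, `upload_build_badge`, `run_pipeline`) are hypothetical and are not taken from our CI codebase.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
log = logging.getLogger("ci.optional")


def call_optional(step: Callable[[], T], fallback: Optional[T] = None) -> Optional[T]:
    """Run a non-integral step; on any failure, log it and return the fallback."""
    try:
        return step()
    except Exception as exc:  # deliberately broad: this step must never fail the build
        log.warning("optional step failed, continuing without it: %s", exc)
        return fallback


def upload_build_badge() -> str:
    """Hypothetical non-integral step that depends on a third-party service."""
    # Simulates the third-party service being down.
    raise ConnectionError("third-party storage unavailable")


def run_pipeline() -> dict:
    """Hypothetical pipeline: integral work first, optional extras guarded."""
    result = {"build": "ok"}  # stands in for the real, integral build work
    result["badge"] = call_optional(upload_build_badge, fallback="skipped")
    return result
```

With this shape, an outage in the optional dependency degrades the result (the badge is marked as skipped) instead of failing the whole run. A real implementation would likely also want timeouts on the third-party call, which this sketch leaves out.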