Degraded performance on Jan 9th, 2026 #144
bmesuere announced in Announcements
In the afternoon of Jan 9th 2026, Dodona experienced a disruption of service. This post mortem aims to analyze the root causes and the steps we are taking to prevent similar incidents in the future.
Incident Summary
Root Cause Analysis
Two separate issues contributed to the disruption.
Issue 1: workers overwhelmed by a spike in submissions
Submissions are processed by a pool of workers. The number of workers in this pool is adjusted dynamically based on the current load (reactive scaling) and on the predicted load (predictive scaling). Starting new workers takes a few minutes and happens in steps. Between scaling events, there is a cooldown period of 5 minutes to prevent excessive scaling. A minimum number of workers is always running to handle the base load.
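The interaction between stepwise scaling and the cooldown period can be sketched in a few lines of Python. This is a minimal illustration only, not the actual scaling configuration: the class name `ReactiveScaler`, the step size, the worker limits, and the `backlog_per_worker` threshold are all hypothetical. Only the 5-minute cooldown comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class ReactiveScaler:
    # All values except the cooldown are assumptions chosen for illustration.
    min_workers: int = 2
    max_workers: int = 10
    step: int = 2                  # workers added or removed per scaling event
    backlog_per_worker: int = 5    # scale up when the queue exceeds this per worker
    cooldown_s: float = 300.0      # 5-minute cooldown between scaling events
    workers: int = 2
    last_scale_at: float = float("-inf")

    def tick(self, now: float, queue_len: int) -> int:
        """Evaluate the load at time `now` (seconds) and return the worker count."""
        if now - self.last_scale_at < self.cooldown_s:
            return self.workers  # still cooling down: no scaling in either direction
        if queue_len > self.workers * self.backlog_per_worker:
            target = min(self.workers + self.step, self.max_workers)
        elif queue_len == 0:
            target = max(self.workers - self.step, self.min_workers)
        else:
            target = self.workers
        if target != self.workers:
            self.workers = target
            self.last_scale_at = now  # any scaling event restarts the cooldown
        return self.workers


scaler = ReactiveScaler(workers=4)
scaler.tick(0, 0)      # idle queue: scales down to 2 and starts the cooldown
scaler.tick(120, 50)   # spike arrives, but the cooldown blocks scaling: still 2
scaler.tick(300, 50)   # cooldown over: first scale-up step, now 4
```

In this sketch a scale-down shortly before a spike delays the first scale-up by up to a full cooldown period, and every later step waits for another one.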
At 13:23, submissions increased sharply when an exam with a large group of students started. Because our load prediction model cannot anticipate exams, no additional workers were started in advance. In addition, we had just scaled down to the minimum number of workers at 13:16. We also noticed a mismatch between the number of workers reported by Azure and the number of workers actually running, which meant that we had fewer operational workers than expected. As a result, the workers were quickly overwhelmed by the sudden spike in submissions, leading to long wait times for users.
Because we had just scaled down to the minimum number of workers, and because of the cooldown period, no new workers could be started until 13:21. At that time, additional workers were started, but they were insufficient to handle the backlog of submissions that had built up. After each cooldown period, more workers were started, reaching a maximum at 13:49.
The total queue length peaked at 13:32, with 191 submissions waiting to be processed. The entire backlog was cleared by 13:38, after which there were no more delays in judging submissions. Everything was resolved automatically, without manual intervention.
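The way a backlog builds up while capacity lags behind demand, and then drains once enough workers are running, can be illustrated with a small queue model. The function `simulate_backlog` and all numbers below are hypothetical. They are not measurements from the incident and only show the general shape of the effect.

```python
def simulate_backlog(arrivals, capacity):
    """Return the queue length at the end of each minute.

    arrivals[i]: submissions arriving during minute i
    capacity[i]: submissions the running workers can finish during minute i
    """
    queue = 0
    history = []
    for arrived, processed in zip(arrivals, capacity):
        # The queue grows by the arrivals, shrinks by the work done, never below 0.
        queue = max(0, queue + arrived - processed)
        history.append(queue)
    return history


# Hypothetical exam spike: 30 submissions/minute for 4 minutes, then 10/minute.
arrivals = [30, 30, 30, 30, 10, 10, 10, 10]
# Hypothetical capacity that only increases in steps, as workers come online.
capacity = [10, 10, 20, 20, 40, 40, 40, 40]

history = simulate_backlog(arrivals, capacity)
# history == [20, 40, 50, 60, 30, 0, 0, 0]
```

In this toy example the queue keeps growing even after the first capacity increase, because the extra workers still process fewer submissions than arrive. It only drains once capacity exceeds the arrival rate.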
Issue 2: Dodona was unresponsive due to high web server load
Around 13:15, Dodona became slow and eventually unresponsive due to a high CPU load on the web servers. This was caused by a combination of factors:
Dodona uses Cloudflare as a DDoS protection and caching layer. Requests are forwarded to our pool of web servers, which sit behind an Azure Load Balancer. The load balancer distributes incoming connections randomly over all active servers.
As soon as we noticed increased load and failing requests, we increased the number of web servers. This, however, did not have the expected effect. There was still a high CPU load on one of the web servers and only minimal load on the others. Dodona was still unresponsive for some users. After manually restarting Apache at 13:52 on the overloaded server, the situation improved significantly.
Around 14:30, we again observed the same situation: one web server was overloaded while the others were underutilized. The load balancer status correctly showed the overloaded server as "unhealthy", but we were still receiving requests on that server. We speculated that the issue might be due to the combination of Cloudflare using only a handful of long-lived connections to the load balancer, and the load balancer working on the transport (L4, i.e. connection) level and not on the application (L7, i.e. request) level. As a result, many of the requests from Cloudflare were being sent to the same web server, leading to overload.
To mitigate this, we disabled the HTTP/2 protocol between Cloudflare and our load balancer at 14:37. This forced Cloudflare to open a new connection for every request to the load balancer. While this is less efficient, it allowed the load balancer to distribute requests more evenly across all web servers. After this change, the CPU load on the web servers was more balanced, and Dodona became responsive again for all users.
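The suspected effect can be reproduced in a toy simulation. The sketch below compares connection-level (L4) balancing, where each connection is pinned to one server for its whole lifetime, with a choice made for every request. It is a simplified model with hypothetical names (`per_connection`, `per_request`) and made-up numbers. It does not model Cloudflare or Azure Load Balancer in any detail.

```python
import random
from collections import Counter


def per_connection(num_servers, num_connections, num_requests, rng):
    """L4-style: a server is picked once per connection, and every request
    multiplexed over that connection lands on the same server."""
    pinned = [rng.randrange(num_servers) for _ in range(num_connections)]
    hits = Counter()
    for _ in range(num_requests):
        conn = rng.randrange(num_connections)  # request reuses an existing connection
        hits[pinned[conn]] += 1
    return [hits[s] for s in range(num_servers)]


def per_request(num_servers, num_requests, rng):
    """One new connection per request, so a server is picked for every request."""
    hits = Counter(rng.randrange(num_servers) for _ in range(num_requests))
    return [hits[s] for s in range(num_servers)]


rng = random.Random(0)
# Assumed setup: 4 web servers, 2 long-lived connections, 10,000 requests.
few_connections = per_connection(4, 2, 10_000, rng)
many_connections = per_request(4, 10_000, rng)
print("pinned to 2 connections:", few_connections)
print("new connection per request:", many_connections)
```

With only a handful of connections, at most that many servers receive any traffic, no matter how many servers are added to the pool. This matches the observation that adding web servers did not help. Choosing a server per request spreads the load roughly evenly, at the cost of more connection setup.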
Average page load times were still elevated, but this was entirely due to the endpoint generating the large result pages.
Mitigation and Prevention
To prevent similar incidents in the future, we are taking or considering the following steps.
For issue 1:
For issue 2: