Degraded performance on Jan 9th, 2026 #144

bmesuere · 2026-01-12T10:00:22Z

bmesuere
Jan 12, 2026
Maintainer

On the afternoon of January 9th, 2026, Dodona experienced a service disruption. This post-mortem analyzes the root causes and describes the steps we are taking to prevent similar incidents in the future.

Incident Summary

Date and Time: January 9th, 2026, starting at approximately 13:15 Brussels time
Duration: Approximately 1.5 hours
Impact:
- Users initially had to wait longer than usual for submissions to be judged.
- Dodona itself was intermittently unresponsive during the incident.

Root Cause Analysis

Two separate issues contributed to the disruption.

Issue 1: workers overwhelmed by a spike in submissions

Submissions are processed by a pool of workers. The number of workers in this pool is adjusted dynamically based on the current load (reactive scaling) and on the predicted load (predictive scaling). Starting new workers takes a few minutes and happens in steps. Consecutive scaling events are separated by a cooldown period of 5 minutes to prevent excessive scaling. A minimum number of workers is always running to handle the base load.
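The effect of the cooldown can be illustrated with a toy model. This is only a sketch of reactive scaling with a 5-minute cooldown, not our actual Azure configuration: the `Scaler` class, the step size, and the minimum and maximum worker counts are all made-up values for illustration.

```python
from dataclasses import dataclass

COOLDOWN_MINUTES = 5  # from the post: 5 minutes between scaling events


@dataclass
class Scaler:
    """Toy reactive scaler; step size and limits are hypothetical values."""
    workers: int
    min_workers: int = 2   # hypothetical minimum
    max_workers: int = 10  # hypothetical maximum
    step: int = 2          # hypothetical number of workers changed per event
    last_scaled_at: int = -COOLDOWN_MINUTES  # minute of the last scaling event

    def tick(self, minute: int, queue_length: int) -> int:
        """Apply at most one scaling decision; return the worker count."""
        if minute - self.last_scaled_at < COOLDOWN_MINUTES:
            return self.workers  # still cooling down: the load signal is ignored
        if queue_length > 0 and self.workers < self.max_workers:
            self.workers = min(self.max_workers, self.workers + self.step)
            self.last_scaled_at = minute
        elif queue_length == 0 and self.workers > self.min_workers:
            self.workers = max(self.min_workers, self.workers - self.step)
            self.last_scaled_at = minute
        return self.workers


def simulate() -> list:
    """Scale down at minute 0, then face a backlog from minute 1 onwards."""
    scaler = Scaler(workers=4)
    history = [scaler.tick(0, queue_length=0)]  # idle: scale down to the minimum
    history += [scaler.tick(m, queue_length=50) for m in range(1, 11)]
    return history


print(simulate())
```

In this model, a scale-down right before a spike leaves the pool at its minimum for the full cooldown, and every later step up is again separated by 5 minutes.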

At 13:23, we witnessed an increase in submissions due to the start of an exam with a large group of students. Because our load prediction model cannot anticipate exams, no additional workers were started in advance. To make matters worse, we had just scaled down to the minimum number of workers at 13:16. We also noticed a mismatch between the number of workers reported by Azure and the number of workers actually running, which meant that we had fewer operational workers than expected. As a result, the workers were quickly overwhelmed by the sudden spike in submissions, leading to long wait times for users.

Because we had just scaled down to the minimum number of workers, and because of the cooldown period, no new workers could be started until 13:21. At that time, additional workers were started, but this was insufficient to handle the backlog of submissions that had built up. After each cooldown period, more workers were started, reaching a maximum at 13:49.

The queue peaked at 13:32, with 191 submissions waiting to be processed. The entire backlog was cleared by 13:38, after which there were no more delays in judging submissions. Everything was resolved automatically, without manual intervention.

Issue 2: Dodona was unresponsive due to high web server load

Around 13:15, Dodona became slow and eventually unresponsive due to a high CPU load on the web servers. This was caused by a combination of factors:

- Initially, a large number of users accessed Dodona simultaneously due to the start of the exam.
- Later, due to the delays in judging submissions (see Issue 1), users repeatedly refreshed their browsers to check the status of their submissions, which further increased the load on the web servers.
- The exercises used in the exam generated particularly large result pages. Generating these pages is the most CPU-intensive operation performed by the web servers.

Dodona uses Cloudflare as a DDoS protection and caching layer. Requests are forwarded to our pool of web servers, which sit behind an Azure Load Balancer. The load balancer distributes incoming connections randomly over all active servers.

As soon as we noticed increased load and failing requests, we increased the number of web servers. This, however, did not have the expected effect: there was still a high CPU load on one of the web servers and only minimal load on the others, and Dodona was still unresponsive for some users. After we manually restarted Apache on the overloaded server at 13:52, the situation improved significantly.

Around 14:30, we observed the same situation again: one web server was overloaded while the others were underutilized. The load balancer status correctly showed the overloaded server as "unhealthy", but that server was still receiving requests. We suspected that the issue was caused by the combination of Cloudflare using only a handful of long-lived connections to the load balancer, and the load balancer working at the transport (L4, i.e. connection) level rather than the application (L7, i.e. request) level. As a result, many of the requests from Cloudflare were being sent to the same web server, overloading it.

To mitigate this, we disabled the HTTP/2 protocol between Cloudflare and our load balancer at 14:37. This forced Cloudflare to open a new connection for every request to the load balancer. While this is less efficient, it allowed the load balancer to distribute requests more evenly across all web servers. After this change, the CPU load on the web servers was more balanced, and Dodona became responsive again for all users.
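The difference between the two modes can be sketched with a small simulation. This is a simplified model under stated assumptions, not a description of how Cloudflare or Azure Load Balancer work internally: the `distribute` function, the request count, the number of servers, and the number of connections are all hypothetical.

```python
import random
from collections import Counter
from typing import Optional


def distribute(requests: int, servers: int, connections: Optional[int],
               seed: int = 0) -> Counter:
    """Count requests per server when the balancer picks a server per connection.

    connections=None models one new connection per request.
    """
    rng = random.Random(seed)
    if connections is None:
        # Every request arrives on a fresh connection and gets its own random pick.
        return Counter(rng.randrange(servers) for _ in range(requests))
    # Each long-lived connection is pinned to one randomly chosen server.
    pinned = [rng.randrange(servers) for _ in range(connections)]
    # The proxy spreads its requests evenly over its few connections.
    return Counter(pinned[i % connections] for i in range(requests))


few_connections = distribute(requests=6000, servers=6, connections=3)
per_request = distribute(requests=6000, servers=6, connections=None)
print("servers used with 3 long-lived connections:", len(few_connections))
print("servers used with one connection per request:", len(per_request))
```

With only 3 connections, at most 3 of the 6 servers can ever receive traffic in this model, no matter how many servers are added. With one connection per request, the load spreads over all servers.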

Average page load times were still elevated, but this was entirely due to the endpoint generating the large result pages.

Mitigation and Prevention

To prevent similar incidents in the future, we are taking or considering the following steps.

For issue 1:

- Investigate the discrepancy between the reported and actual number of workers. This is likely due to a bug in Azure.
- Adjust the scaling strategy to potentially:
  - Increase the minimum number of workers.
  - Implement a more aggressive scaling strategy during sudden spikes in load.
- Investigate the possibility of pre-scaling workers based on known/reported exam schedules.
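As a rough illustration of the pre-scaling idea in the list above, the sketch below computes a target worker count from a list of announced exams. Nothing here exists in Dodona today: the `Exam` record, the `workers_needed` function, the 15-minute lead time, and the ratio of 50 students per worker are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import ceil


@dataclass
class Exam:
    """Hypothetical record of an announced exam."""
    start: datetime
    end: datetime
    students: int


def workers_needed(now: datetime, exams: list, base_workers: int = 2,
                   students_per_worker: int = 50,
                   lead_time: timedelta = timedelta(minutes=15)) -> int:
    """Return the base worker count plus extra workers for exams that are
    running or about to start within lead_time."""
    extra = 0
    for exam in exams:
        if exam.start - lead_time <= now <= exam.end:
            extra += ceil(exam.students / students_per_worker)
    return base_workers + extra


# Hypothetical exam with 300 students, for illustration only.
exams = [Exam(datetime(2026, 1, 9, 13, 15), datetime(2026, 1, 9, 15, 15), 300)]
print(workers_needed(datetime(2026, 1, 9, 13, 0), exams))
```

The lead time would have to cover the few minutes it takes to start new workers, so that the extra capacity is available when the exam begins rather than when the queue is already growing.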

For issue 2:

- Implement dynamic scaling of web servers.
- Investigate if we need a higher minimum number of web servers.
- Explore more advanced load balancing solutions that operate at the application level.
- Limit the size of the result pages generated by exercises or move the generation of these pages to the frontend.
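To make the application-level load balancing option above more concrete, the sketch below shows one common per-request strategy: send each request to the healthy server with the fewest requests in flight. This is a generic illustration, not a solution we have selected; the `Backend` class, the `pick_backend` function, and the server names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical view of a web server as seen by an L7 balancer."""
    name: str
    healthy: bool = True
    in_flight: int = 0  # requests currently being handled


def pick_backend(backends: list) -> Backend:
    """Choose a server per request: skip unhealthy ones, prefer the least busy."""
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        raise RuntimeError("no healthy backend available")
    return min(candidates, key=lambda b: b.in_flight)


pool = [
    Backend("web1", healthy=False, in_flight=0),  # overloaded, marked unhealthy
    Backend("web2", in_flight=3),
    Backend("web3", in_flight=1),
]
print(pick_backend(pool).name)
```

Because the decision is made for every request rather than for every connection, a server marked unhealthy stops receiving new requests immediately, even when the upstream proxy keeps its connections open.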
