Kernel: Trap/signal fixes #26491

implicitfield · 2025-12-19T20:47:24Z

Fixes #26493
Fixes #26488
Fixes (part of) #26489 (the kernel no longer panics, but you can still get dash to crash)

With this, we no longer perform any non-essential steps in ProcessorBase::exit_trap if we're exiting an exception handler and there's a higher level trap we can defer those steps to. Not sure if this is the best approach really (not that I thought of a better one either), but this does get rid of all the kernel panics seen in #26489.

The thread's lock is also no longer taken while (potentially) dispatching signals. AFAICT taking the scheduler lock (which is something that Thread::check_dispatch_pending_signal already does) should provide sufficient locking.

oskar-skog · 2025-12-20T20:41:12Z

It's still panicking on aarch64.

Or maybe something went wrong on my end with applying the patch?

implicitfield · 2025-12-20T23:11:12Z

So I also got this to panic on aarch64 some time ago, but only once (somehow this seems to be really uncommon).

The stack trace went directly from synchronous_current_elsp_elx -> irq_current_elsp_elx with no exception_common in between, not sure how that's possible since interrupts should be disabled before and after the call to exception_common. But it seems like the kernel ends up handling a random IRQ before the page fault, which breaks this again.

implicitfield · 2025-12-20T23:44:06Z

@spholz Given all the issues and additional complexity with the exit_trap-driven signal dispatching, should we just move that functionality back into the scheduler? Some testing shows that the testcase from #26469 passes if we temporarily set current_thread->current_trap() to &trap before the call to check_invoke_scheduler() in exit_trap() in the !trap.next_trap case. (In retrospect I would've probably done that instead if I'd thought of that solution two weeks ago...)

oskar-skog · 2025-12-21T01:16:00Z

So I also got this to panic on aarch64 some time ago, but only once (somehow this seems to be really uncommon).

The stack trace went directly from synchronous_current_elsp_elx -> irq_current_elsp_elx with no exception_common in between, not sure how that's possible since interrupts should be disabled before and after the call to exception_common. But it seems like the kernel ends up handling a random IRQ before the page fault, which breaks this again.

I tried again with default QEMU settings (I used the e1000 network card previously) but it still panics at a similar rate.
And there is an exception_common directly after synchronous_current_elsp_elx

[dash(3306:3306)]: ASSERTION FAILED: !g_scheduler_lock.is_locked_by_current_processor()
[dash(3306:3306)]: ./Kernel/Tasks/Thread.cpp:147 in Kernel::Thread::block_impl(const BlockTimeout&, Blocker&)::<lambda()>
[dash(3306:3306)]: KERNEL PANIC! :^(
[dash(3306:3306)]: Aborted
[dash(3306:3306)]: at ./Kernel/Library/Assertions.cpp:38 in void abort()
Kernel + 0x000000000033e240  Kernel::__panic(char const*, unsigned int, char const*) +0xc0
Kernel + 0x0000000000338d9c  abort +0x100
Kernel + 0x0000000000338c9c  abort +0x0
Kernel + 0x0000000000495d54  AK::Function<void ()>::CallableWrapper<Kernel::Thread::block_impl(Kernel::Thread::BlockTimeout const&, Kernel::Thread::Blocker&)::{lambda()#2}>::call() +0x2f4
Kernel + 0x000000000033a408  AK::Function<void ()>::operator()() const +0xc8
Kernel + 0x00000000004b6d80  AK::Function<void ()>::CallableWrapper<Kernel::TimerQueue::fire()::{lambda(Kernel::TimerQueue::Queue&)#1}::operator()(Kernel::TimerQueue::Queue&) const::{lambda()#1}>::call() +0x160
Kernel + 0x00000000000058d0  Kernel::DeferredCallPool::execute_pending() +0x120
Kernel + 0x0000000000007c74  Kernel::ProcessorBase::exit_trap(Kernel::TrapFrame&, bool) +0x94
Kernel + 0x00000000004e28fc  irq_current_elsp_elx +0x84
Kernel + 0x000000000031c0c8  Kernel::Memory::Region::handle_cow_fault(unsigned long) +0x128
Kernel + 0x000000000031f2b0  Kernel::Memory::Region::handle_fault(Kernel::PageFault const&) +0x990
Kernel + 0x0000000000305a3c  Kernel::Memory::MemoryManager::handle_page_fault(Kernel::PageFault const&) +0x550
Kernel + 0x0000000000005ed8  Kernel::PageFault::handle(Kernel::RegisterState&) +0x138
Kernel + 0x00000000004d3ca4  exception_common +0x224
Kernel + 0x00000000004e280c  synchronous_current_elsp_elx +0x84
Kernel + 0x000000000048ac28  Kernel::push_value_on_user_stack(unsigned long&, unsigned long) +0x28
Kernel + 0x00000000004982e4  Kernel::Thread::dispatch_signal(unsigned char) +0x618
Kernel + 0x0000000000499e34  Kernel::Thread::check_dispatch_pending_signal(Kernel::YieldBehavior) +0xd0
Kernel + 0x0000000000007c60  Kernel::ProcessorBase::exit_trap(Kernel::TrapFrame&, bool) +0x80
Kernel + 0x00000000004e2f8c  synchronous_lower_el +0x84

implicitfield · 2025-12-21T13:39:26Z

That's a longer stack trace than what I was seeing, but its got the same issue (IRQ while handling page fault). We could maybe just not enable interrupts during exception handling (though I think most OSes do enable those).
I wonder why this doesn't seem to have been happening when the scheduler was dispatching signals. Though the scheduler does enter the process's context before doing that, so maybe it just doesn't ever page fault in that case?

spholz · 2025-12-21T16:19:32Z

(disclaimer: I haven't looked at your code yet)

@spholz Given all the issues and additional complexity with the exit_trap-driven signal dispatching, should we just move that functionality back into the scheduler? Some testing shows that the testcase from #26469 passes if we temporarily set current_thread->current_trap() to &trap before the call to check_invoke_scheduler() in exit_trap() in the !trap.next_trap case. (In retrospect I would've probably done that instead if I'd thought of that solution two weeks ago...)

I still think the more correct solution is to do all of this in exit_trap. That also seems to be what other unixes do.
But if we're unable to get it working like this (or only with a lot of extra effort), I guess it could be fine to switch back to the old approach. Just setting the current trap temporarily at that location feels like a hack though. I guess we could maybe keep the current trap set as long as possible?

That's a longer stack trace than what I was seeing, but its got the same issue (IRQ while handling page fault). We could maybe just not enable interrupts during exception handling (though I think most OSes do enable those).

Interrupts should be enabled whenever possible (so including during exceptions) to not cause a high interrupt latency.

I wonder why this doesn't seem to have been happening when the scheduler was dispatching signals. Though the scheduler does enter the process's context before doing that, so maybe it just doesn't ever page fault in that case?

Not sure. But I think this still should happen with the old approach, it just happens to not work with this new approach somehow. In fact, I've had a similar panic backtrace on RISC-V before in #23387 (see the attached gdb backtrace screenshot).

Do you know what actually causes the assertion failure? Maybe we should try to investigate that first.

!g_scheduler_lock.is_locked_by_current_processor() usually fails if we blocked and therefore switched context during a recursive spinlock.

If the problem is really that we shouldn't page fault, we can ensure that the stack region is mapped before accessing it by calling the page fault handler. But I'd really like to avoid that if possible.

spholz · 2025-12-21T16:22:36Z

!g_scheduler_lock.is_locked_by_current_processor() usually fails if we blocked and therefore switched context during a recursive spinlock.

Oops, no that's wrong. I've misread that assertion, sorry. What I was referring to is the non-negated version of that assertion.

The scheduler lock is just held by the signal dispatching code and the code at the top of the stack doesn't expect it to be held.

implicitfield · 2025-12-21T17:51:37Z

Just setting the current trap temporarily at that location feels like a hack though. I guess we could maybe keep the current trap set as long as possible?

It kind of is, but we can't hold on to trap after returning from exit_trap because it won't be on the stack anymore.

Interrupts should be enabled whenever possible (so including during exceptions) to not cause a high interrupt latency.

Yeah, ideally we wouldn't unconditionally disable/never enable those if we can solve this without doing that.

Not sure. But I think this still should happen with the old approach, it just happens to not work with this new approach somehow. In fact, I've had a similar panic backtrace on RISC-V before in #23387 (see the attached gdb backtrace screenshot).

Apparently aarch64 just unconditionally enables interrupts in exception_common(). I guess we should only be doing that if they were enabled when the exception happened (which should mostly match riscv64 and x86_64). If the cause is always the same as in the above backtrace, then that should prevent this panic from happening.

If there's a higher level trap, then that should take care of signal dispatching, scheduling, etc. when it exits its own trap frame.

This isn't necessary and only leads to deadlocks with SMP.

implicitfield · 2025-12-22T00:59:41Z

That should take care of #26493 and the panic from #26489 on aarch64 (though let me know if either of those still happen).

While testing this, I noticed even with the latest changes from this PR, oksh seems to run the test case noticably slower compared to using the (revised) scheduler-driven signal dispatching (with that the runtime was about ~120-130ms and with this it varies much more with a range of mostly ~150-230ms on my aarch64 machine).

I haven't profiled this, but I guess this approach might be adding latency to signal dispatching since exit_trap() might reschedule to something else after the current thread has been primed to enter the signal handler. Dispatching signals after entering and exiting the scheduler breaks things (the shell), so that's not really an option unless we fix that somehow.

oskar-skog · 2025-12-22T20:33:48Z

I did some pretty thorough testing and confirm that there are no more panics.

nico · 2025-12-22T21:10:27Z

Kernel/Arch/aarch64/Interrupts.cpp


    Processor::disable_interrupts();
-    Processor::current().exit_trap(*trap_frame);
+    Processor::current().exit_trap(*trap_frame, true);


https://github.com/SerenityOS/serenity/blob/master/Documentation/CodingStyle.md#right-8

github-actions bot added the 👀 pr-needs-review PR needs review from a maintainer or community member label Dec 19, 2025

implicitfield mentioned this pull request Dec 19, 2025

Kernel: panic triggered by dash/oksh, and also coredumps from SIGCHLD #26489

Open

implicitfield added 4 commits December 22, 2025 02:26

Kernel: Don't run non-essential trap exit steps for low-level exceptions

b301ddf

If there's a higher level trap, then that should take care of signal dispatching, scheduling, etc. when it exits its own trap frame.

Kernel: Don't take the thread lock when dispatching signals

d45170c

This isn't necessary and only leads to deadlocks with SMP.

Kernel/aarch64: Only enable interrupts in trap handler if they were on

82ecc45

Kernel: Only dispatch signals to running threads

7f1796d

implicitfield force-pushed the signal-fixes branch from 472b64a to 7f1796d Compare December 22, 2025 00:45

oskar-skog mentioned this pull request Dec 22, 2025

Event loop? Suddenly can't start any programs on riscv64 #26499

Open

nico reviewed Dec 22, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Kernel: Trap/signal fixes #26491

Kernel: Trap/signal fixes #26491

implicitfield commented Dec 19, 2025 •

edited

Loading

Uh oh!

oskar-skog commented Dec 20, 2025

Uh oh!

implicitfield commented Dec 20, 2025

Uh oh!

implicitfield commented Dec 20, 2025

Uh oh!

oskar-skog commented Dec 21, 2025 •

edited

Loading

Uh oh!

implicitfield commented Dec 21, 2025

Uh oh!

spholz commented Dec 21, 2025

Uh oh!

spholz commented Dec 21, 2025 •

edited

Loading

Uh oh!

implicitfield commented Dec 21, 2025 •

edited

Loading

Uh oh!

implicitfield commented Dec 22, 2025 •

edited

Loading

Uh oh!

oskar-skog commented Dec 22, 2025

Uh oh!

nico Dec 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Kernel: Trap/signal fixes #26491

Are you sure you want to change the base?

Kernel: Trap/signal fixes #26491

Conversation

implicitfield commented Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

oskar-skog commented Dec 20, 2025

Uh oh!

implicitfield commented Dec 20, 2025

Uh oh!

implicitfield commented Dec 20, 2025

Uh oh!

oskar-skog commented Dec 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

implicitfield commented Dec 21, 2025

Uh oh!

spholz commented Dec 21, 2025

Uh oh!

spholz commented Dec 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

implicitfield commented Dec 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

implicitfield commented Dec 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

oskar-skog commented Dec 22, 2025

Uh oh!

nico Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

implicitfield commented Dec 19, 2025 •

edited

Loading

oskar-skog commented Dec 21, 2025 •

edited

Loading

spholz commented Dec 21, 2025 •

edited

Loading

implicitfield commented Dec 21, 2025 •

edited

Loading

implicitfield commented Dec 22, 2025 •

edited

Loading