[DP-108] Investigate why spawn is slow #206

Open
@qnikst

[Imported from JIRA. Reported by Niklas Hambuechen as DP-108 on 2015-03-15 02:48:31]
Copying my posts from the #haskell-distributed IRC channel:

spawn is very slow for me, even though I'm on localhost. Doing it in a loop takes almost 50 ms per spawn. Why would it be so high? I can't use spawnAsync in my case, but why would a spawn on localhost take this long in the first place? My ethernet latency is 0.5 ms and localhost latency is 0.1 ms, so that can't be it. CPU usage is low too.
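For anyone who wants to reproduce the per-spawn number, here is a minimal timing sketch using only base. It is not the code from the original measurement: `spawnStandIn` is a hypothetical placeholder (a 1 ms sleep), and `timeMs` and `mean` are helper names made up for this example. In a real test the placeholder would be replaced by the actual spawn call, lifted with `liftIO` where needed.

```haskell
import Control.Concurrent (threadDelay)
import Control.Monad (replicateM)
import GHC.Clock (getMonotonicTime)

-- | Run an action and return its wall-clock duration in milliseconds.
timeMs :: IO a -> IO Double
timeMs act = do
  t0 <- getMonotonicTime          -- monotonic clock, in seconds
  _  <- act
  t1 <- getMonotonicTime
  pure ((t1 - t0) * 1000)

-- | Arithmetic mean of a non-empty list of samples.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- | Hypothetical stand-in for the operation under test (here: sleep 1 ms).
-- Replace with the real spawn to measure it.
spawnStandIn :: IO ()
spawnStandIn = threadDelay 1000

main :: IO ()
main = do
  samples <- replicateM 100 (timeMs spawnStandIn)
  putStrLn ("mean ms per call: " ++ show (mean samples))
  putStrLn ("max ms per call:  " ++ show (maximum samples))
```

Printing the maximum next to the mean helps tell a uniformly slow call apart from a few large stalls.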

I have a suspicion: using strace -f -c -w on the node onto which I spawn the processes (a slave using simplelocalnet), I see 179596 calls to the select syscall. That doesn't seem right given that I only do 100 spawns and nothing else. Could it be that the master is sending a lot of small numbers, which the slave then receives one after the other? I think this is the only way to trigger so many selects, and I've seen that recvInt32 does exactly that (recv'ing 4 bytes at a time), and it does appear in my profiling output.
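To illustrate the alternative to one recv per integer, here is a small sketch that decodes many 32-bit integers from a single buffer that was read in one go. It is not the transport's actual code. `decodeInt32` and `decodeAll` are hypothetical names, and big-endian (network byte order) encoding is an assumption of this example. Whether the per-integer recv is really responsible for the select count above is unverified.

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftL, (.|.))
import Data.Int (Int32)
import Data.Word (Word32)

-- | Decode one big-endian 32-bit integer from the front of a buffer,
-- returning the value and the unconsumed remainder.
-- Returns Nothing if fewer than 4 bytes are available.
decodeInt32 :: BS.ByteString -> Maybe (Int32, BS.ByteString)
decodeInt32 buf
  | BS.length buf < 4 = Nothing
  | otherwise =
      let (hd, rest) = BS.splitAt 4 buf
          w = BS.foldl' (\acc b -> (acc `shiftL` 8) .|. fromIntegral b)
                        (0 :: Word32) hd
      in Just (fromIntegral w, rest)

-- | Decode as many integers as the buffer holds. Any trailing bytes
-- (fewer than 4) are returned so they can be prepended to the next chunk.
decodeAll :: BS.ByteString -> ([Int32], BS.ByteString)
decodeAll buf =
  case decodeInt32 buf of
    Nothing        -> ([], buf)
    Just (x, rest) ->
      let (xs, leftover) = decodeAll rest
      in (x : xs, leftover)

main :: IO ()
main = do
  -- One chunk, as if obtained from a single large recv call.
  let chunk = BS.pack [0, 0, 0, 1, 0, 0, 0, 2, 255, 255, 255, 255, 7]
  print (decodeAll chunk)
```

With this shape, the number of syscalls depends on the number of chunks read rather than on the number of integers in the stream.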

Further, the 50 ms that each spawn takes is suspiciously close to the 40 ms TCP ACK delay on Linux (I'm on Linux), as mentioned here: http://stackoverflow.com/a/2253620/263061.

I have found something else, though, that fixes the problem: setting +RTS -V0 on the slave reduces the time for each spawn to 3 ms. How can it have such a huge effect?

I can get the same good results with +RTS -C0.001. But why? This sets the context switch interval; if that has such a positive effect, doesn't that mean that there are other Haskell threads around that actually run and thus stop my recv/recv from immediately being scheduled again? Assume there's only one recv that I'm running: when a context switch interrupt interrupts the recv, the scheduler should see that there are no other Haskell threads to run and immediately go back into my recv. I can't see a reason why it should do anything other than run my recv ...

Also, setting +RTS -C to something very high does not make it slower than 50 ms per spawn. For example, +RTS -C1 does not make it take 1 second per spawn; it's still 50 ms.

Setting +RTS -N2/-N3/-N4 helps, too: I get down to 6 ms, compared to the 50 ms for -N1.
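When comparing runs with different -V, -C and -N settings, it can help to have the slave print the values the RTS actually picked up. The following is a minimal sketch using `GHC.RTS.Flags` and `getNumCapabilities` from base. The values are printed in the raw time units the RTS reports, without conversion. Depending on how the binary was linked, passing these flags may require `-rtsopts`, and -N requires `-threaded`.

```haskell
import Control.Concurrent (getNumCapabilities)
import GHC.RTS.Flags
  ( ConcFlags (..)
  , MiscFlags (..)
  , getConcFlags
  , getMiscFlags
  )

main :: IO ()
main = do
  caps <- getNumCapabilities   -- reflects +RTS -N
  conc <- getConcFlags         -- holds the context switch interval (+RTS -C)
  misc <- getMiscFlags         -- holds the timer tick interval (+RTS -V)
  putStrLn ("capabilities (-N):             " ++ show caps)
  putStrLn ("context switch interval (-C):  " ++ show (ctxtSwitchTime conc))
  putStrLn ("context switch ticks:          " ++ show (ctxtSwitchTicks conc))
  putStrLn ("tick interval (-V):            " ++ show (tickInterval misc))
```

Running it once with no RTS options and once with, for example, `+RTS -V0 -RTS` shows whether the setting reached the process being measured.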

nh2: Could it be that there are actually 2 recvs going on, but only one can be active at a time when running with -N1, so the system toggles between them at the context switch interval -C (which defaults to 20 ms), and two of these switches add up to the ~50 ms that I'm seeing?
