Skip to content

dce-umip-nemo.cc does not return #2

@tomhenderson

Description

@tomhenderson

In preparing for possible DCE 1.11 release on Ubuntu 16.04, I encountered the following problem with a umip example. The version 1.11 targets are presently pulling the tip of the development repositories.

$ git clone https://gitlab.com/tomhenderson/bake.git
$ cd bake
$ git checkout -b dce-1.11 origin/dce.111
$ export PATH=$PATH:`pwd`/build/bin
$ export DCE_PATH=`pwd`/build/bin_dce:`pwd`/build/sbin
$ export LD_LIBRARY_PATH=`pwd`/build/lib
$ export PYTHONPATH=`pwd`/build/lib
$ ./bake.py configure -e dce-linux-1.11 -e dce-umip-1.11 -e dce-quagga-1.11
$ ./bake.py download
$ ./bake.py build
$ cd source/ns-3-dce
$ ./waf --run dce-umip-nemo

and the simulation does not return.

The problem is isolated to the dce-umip-nemo.cc example. The simulation enters an infinite loop at time 280.025306697s and not reach the scheduled stop time of 300s. The following command will yield the following output:

$ NS_LOG="DceRrTaskScheduler:DceTaskManager" NS_GLOBAL_VALUE="RngRun=1" ./waf --run dce-umip-nemo

...
+280.025306697s 5 DceRrTaskScheduler:DequeueNext(0x10a7c10)
+280.025306697s 5 DceTaskManager:Wakeup(0x10ab0a0, 0x7f7a6c015ea0, 2)
+280.025306697s 5 DceRrTaskScheduler:Enqueue(0x10a7c10, 0x7f7a6c015ea0)
+280.025306697s 5 DceTaskManager:Yield(0x10ab0a0, 0x10964b0)
+280.025306697s 5 DceRrTaskScheduler:Enqueue(0x10a7c10, 0x10964b0)
+280.025306697s 5 DceTaskManager:Schedule(0x10ab0a0)
+280.025306697s 5 DceRrTaskScheduler:PeekNext(): [DEBUG] next=0x7f7a6c015ea0
+280.025306697s 5 DceTaskManager:Schedule(): [DEBUG] Leaving 0x10964b0, delay +0ns next = 0x7f7a6c015ea0 entering main
+280.025306697s 5 DceTaskManager:Schedule(): [DEBUG] Delayed schedule +0ns
+280.025306697s 5 DceTaskManager:GarbageCollectDeadTasks(0x10ab0a0)
+280.025306697s 5 DceTaskManager:Schedule(0x10ab0a0)
+280.025306697s 5 DceRrTaskScheduler:PeekNext(): [DEBUG] next=0x7f7a6c015ea0
+280.025306697s 5 DceTaskManager:Schedule(): [DEBUG] Leaving main, entering 0x7f7a6c015ea0
+280.025306697s 5 DceRrTaskScheduler:DequeueNext(0x10a7c10)
+280.025306697s 5 DceTaskManager:Sleep(0x10ab0a0, 0x7f7a6c015ea0)
+280.025306697s 5 DceTaskManager:Schedule(0x10ab0a0)
+280.025306697s 5 DceRrTaskScheduler:PeekNext(): [DEBUG] next=0x10964b0
+280.025306697s 5 DceTaskManager:Schedule(): [DEBUG] Leaving 0x7f7a6c015ea0, delay +0ns next = 0x10964b0 entering main
+280.025306697s 5 DceTaskManager:Schedule(): [DEBUG] Delayed schedule +0ns
+280.025306697s 5 DceTaskManager:GarbageCollectDeadTasks(0x10ab0a0)
+280.025306697s 5 DceTaskManager:Schedule(0x10ab0a0)
+280.025306697s 5 DceRrTaskScheduler:PeekNext(): [DEBUG] next=0x10964b0
+280.025306697s 5 DceTaskManager:Schedule(): [DEBUG] Leaving main, entering 0x10964b0
+280.025306697s 5 DceRrTaskScheduler:DequeueNext(0x10a7c10)
+280.025306697s 5 DceTaskManager:Wakeup(0x10ab0a0, 0x7f7a6c015ea0, 2)

(repeats endlessly)

It seems from this logging that the same task is dequeued from the queue and rescheduled, over and over.

The ns-3 simulation event that occurs at this time is WifiNetDevice::ForwardUp(). This simulation is sending some pings over wifi, and the receipt and a forward up of one of these packets to the Linux stack is putting it into this loop.

If I change the fiber manager (Pthread or Ucontext) it makes no difference.

However, if I change the ns-3 random number stream (i.e., the random seed), the behavior disappears. For instance, this completes successfully for me:

$ NS_GLOBAL_VALUE="RngRun=2" ./waf --run dce-umip-nemo

By changing the run number, the random variable streams are different, and this affects the mobility (random walk) and probably also some Wi-Fi contention resolution, such that the problematic event is not triggered. However, it may be the case that if we let the simulation run for longer time, it would get similarly stuck at a later time. I also experienced some successful runs with RngRun=1 so the problem does not seem deterministic.

Parth debugged this a bit and provided the following information (I am quoting below):

Based on what I could understand from the log file gdb.txt, it's a polling operation which is being enqueued every time.

As far as I understand the dce implementation, the function void LinuxSocketImpl::Poll () should be invoked with the returned event that occurred. But it seems like it's never being called. Now, I'm not very sure if the linux socket factory is being used in this case, but I'm just guessing based on what I currently understand.

Also, I tried to output the m_active list's size every time the scheduler enqueues or dequeues something, and I could see that initially the queue size would go up to 12 or so, but later on it would remain to max 2, but majority of the times it would move back and forth from 1 (dequeue-next) to 0 (enqueue). This is probably because we keep adding the polling job to the queue and keep dequeuing it again and again.

With the condition of RngRun=2 it does work without a gdb attached, but runs infinitely under the gdb, similar to how RngRun=1 works without the gdb. It's actually pretty tough to determine if the current gdb execution could be trusted.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions