Description
Hello,
While stress testing the mesh I've come across a problem where at some point (20 - 30 seconds) mesh nodes would start dropping out and no longer transmitting any data. My data cache size and handle cache size defines are both 8 at the time.
At the time the mesh is stalled, the call to data_entry_allocate() in handle_storage_info_set() was returning DATA_CACHE_ENTRY_INVALID, causing handle_storage_info_set() to return NRF_ERROR_NO_MEM (line 392).
At the time this problem was occuring, my caches looked like this:
m_handle_cache <array> 0x200036C8 handle_entry_t[8]
[0] <struct> 0x200036C8 handle_entry_t
handle 23261 0x200036C8 uint16_t
version 0 0x200036CA uint16_t
index_next 1 0x200036CC uint16_t
tx_event 0 0x200036CC uint16_t
index_prev 7 0x200036CE uint16_t
persistent 0 0x200036CE uint16_t
data_entry 8 0x200036D0 uint16_t
[1] <struct> 0x200036D2 handle_entry_t
handle 23393 0x200036D2 uint16_t
version 0 0x200036D4 uint16_t
index_next 2 0x200036D6 uint16_t
tx_event 0 0x200036D6 uint16_t
index_prev 0 0x200036D8 uint16_t
persistent 0 0x200036D8 uint16_t
data_entry 8 0x200036DA uint16_t
[2] <struct> 0x200036DC handle_entry_t
handle 23921 0x200036DC uint16_t
version 0 0x200036DE uint16_t
index_next 3 0x200036E0 uint16_t
tx_event 0 0x200036E0 uint16_t
index_prev 1 0x200036E2 uint16_t
persistent 0 0x200036E2 uint16_t
data_entry 8 0x200036E4 uint16_t
[3] <struct> 0x200036E6 handle_entry_t
handle 10470 0x200036E6 uint16_t
version 0 0x200036E8 uint16_t
index_next 4 0x200036EA uint16_t
tx_event 0 0x200036EA uint16_t
index_prev 2 0x200036EC uint16_t
persistent 0 0x200036EC uint16_t
data_entry 8 0x200036EE uint16_t
[4] <struct> 0x200036F0 handle_entry_t
handle 3 0x200036F0 uint16_t
version 0 0x200036F2 uint16_t
index_next 5 0x200036F4 uint16_t
tx_event 0 0x200036F4 uint16_t
index_prev 3 0x200036F6 uint16_t
persistent 0 0x200036F6 uint16_t
data_entry 8 0x200036F8 uint16_t
[5] <struct> 0x200036FA handle_entry_t
handle 23789 0x200036FA uint16_t
version 0 0x200036FC uint16_t
index_next 8 0x200036FE uint16_t
tx_event 0 0x200036FE uint16_t
index_prev 4 0x20003700 uint16_t
persistent 0 0x20003700 uint16_t
data_entry 8 0x20003702 uint16_t
[6] <struct> 0x20003704 handle_entry_t
handle 2 0x20003704 uint16_t
version 0 0x20003706 uint16_t
index_next 7 0x20003708 uint16_t
tx_event 0 0x20003708 uint16_t
index_prev 8 0x2000370A uint16_t
persistent 0 0x2000370A uint16_t
data_entry 8 0x2000370C uint16_t
[7] <struct> 0x2000370E handle_entry_t
handle 23830 0x2000370E uint16_t
version 0 0x20003710 uint16_t
index_next 0 0x20003712 uint16_t
tx_event 0 0x20003712 uint16_t
index_prev 6 0x20003714 uint16_t
persistent 0 0x20003714 uint16_t
data_entry 8 0x20003716 uint16_t
m_data_cache <array> 0x200034CC data_entry_t[8]
[0] <struct> 0x200034CC data_entry_t
trickle <struct> 0x200034CC trickle_t
p_packet 0x00000000 0x200034DC mesh_packet_t *
[1] <struct> 0x200034E0 data_entry_t
trickle <struct> 0x200034E0 trickle_t
p_packet 0x00000000 0x200034F0 mesh_packet_t *
[2] <struct> 0x200034F4 data_entry_t
trickle <struct> 0x200034F4 trickle_t
p_packet 0x00000000 0x20003504 mesh_packet_t *
[3] <struct> 0x20003508 data_entry_t
trickle <struct> 0x20003508 trickle_t
p_packet 0x00000000 0x20003518 mesh_packet_t *
[4] <struct> 0x2000351C data_entry_t
trickle <struct> 0x2000351C trickle_t
p_packet 0x00000000 0x2000352C mesh_packet_t *
[5] <struct> 0x20003530 data_entry_t
trickle <struct> 0x20003530 trickle_t
p_packet 0x00000000 0x20003540 mesh_packet_t *
[6] <struct> 0x20003544 data_entry_t
trickle <struct> 0x20003544 trickle_t
p_packet 0x00000000 0x20003554 mesh_packet_t *
[7] <struct> 0x20003558 data_entry_t
trickle <struct> 0x20003558 trickle_t
p_packet 0x00000000 0x20003568 mesh_packet_t *
As you can see, the data cache is empty and no handle caches are referencing the data cache.
In data_entry_allocate() there is a loop at the top checking for unused data cache entries, but this loop stops after RBC_MESH_DATA_CACHE_ENTRIES because the variable 'allocated' is static and never reset back to 0.
So, after the data cache has been filled RBC_MESH_DATA_CACHE_ENTRIES times, it starts looking for unused handle cache entries to free them, but as you can see in my case, the handle cache is unlinked and the data cache is free, so nothing happens and the mesh can no longer send any packets.
My fix for now (which seems to work perfectly) is to change this line in data_entry_allocate from:
for (uint32_t i = allocated; i < RBC_MESH_DATA_CACHE_ENTRIES; ++i)
to
for (uint32_t i = 0; i < RBC_MESH_DATA_CACHE_ENTRIES; ++i)
disregarding the 'allocated' variable and making use of the initial loop to check for 'free' data cache entries again. The 'allocated' variable seems to be useless, because the data cache already has a way of being marked as 'in use' or 'free'.
Please share your opinion on this fix, perhaps I'm missing something.