Description
This is a weird one, so strap in.
I've been using cesium-unreal with large city tile sets. My goal is to sample data from these tile sets.
For this sampling, a couple of scene captures fly around the city using predefined paths capturing images and other data.
After the sampling gets into a part of the city with a lot of complex geometry (a park with lots of trees), there is a steep increase in the memory consumption of the engine & Cesium. So steep, in fact, that my program would crash with an out-of-memory error shortly after, because consumption jumped from 7GB to 190GB.
This was incredibly hard to debug, because there wasn't a stack trace or any meaningful information in the logs, and this only happened a couple of hours into the sampling.
I finally was able to fix it (I think) by setting `tileCacheUnloadTimeLimit` to 0 in https://github.com/CesiumGS/cesium-unreal/blob/ue4-main/Source/CesiumRuntime/Private/Cesium3DTileset.cpp#L989
Like I said, I don't have a lot of debug/crash information, so it's also possible that this crash is not related to `tileCacheUnloadTimeLimit` or Cesium at all. But after setting it to 0, I never encountered the issue again. In addition, this never happened before 1.19 (but my code/sample also wasn't the same back then).
So here is my theory of why this might be happening:
- The SceneCaptures used for sampling reach a location in the tileset with a lot of complex geometry (the tileset has to load geometry, textures, collisions, ...), leading to a spike in memory consumption.
- After leaving that area, the complex geometry is still in the cache; in addition, a new part of the tileset has to be rendered.
- The 5ms specified by default for `tileCacheUnloadTimeLimit` is not enough time to unload all the unused data.
- Sampling continues & more data accumulates in the cache, increasing RAM consumption.
- The system begins to thrash (the CPU spends a lot of time moving pages between physical memory and swap, etc.).
- The timer that tracks the time budget in https://github.com/CesiumGS/cesium-native/blob/main/Cesium3DTilesSelection/src/Tileset.cpp#L1414 uses chrono's system clock, not the actual time spent on cleaning up. It's quite possible that the main thread handling the unload is getting pre-empted a lot, because the CPU is occupied with handling the sudden increase in memory consumption (thrashing). This in turn means that the actual time spent unloading might be a lot shorter than the default 5ms (the system clock tracks wall-clock time, not time spent computing); see the sketch after this list.
- The cache unloading is unable to catch up; while the sampling continues and more tile data keeps accumulating in the cache, the system becomes more and more unresponsive.
- Unreal/Cesium reaches >100GB memory consumption.
- Windows kills the program.
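To make the clock problem concrete, here is a minimal sketch of the failure mode, assuming a budgeted unload loop like the one linked above (`cacheOverBudget` and `unloadOneTile` are hypothetical stand-ins, not cesium-native functions):

```cpp
#include <chrono>

// Hypothetical stand-ins for the real cache bookkeeping.
bool cacheOverBudget() { return false; }
void unloadOneTile() {}

// A wall-clock budget measures elapsed real time. If the thread is
// descheduled for 4 of the 5ms (e.g. because the OS is busy paging),
// the budget is exhausted after ~1ms of actual unloading work.
void unloadCacheWallClockBudget(double timeBudgetMs) {
  const auto start = std::chrono::system_clock::now();
  while (cacheOverBudget()) {
    unloadOneTile();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::system_clock::now() - start;
    if (elapsed.count() >= timeBudgetMs)
      break; // may trigger after almost no real work under thrashing
  }
}
```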
Even if the sequence of events I described here is not what causes the issue, I still think the `chrono::system_clock` issue should be addressed, because it is bound to lead to unexpected behaviour: the time specified in `tileCacheUnloadTimeLimit` is not guaranteed to actually be spent unloading. Unlucky scheduling, high system load, or an unresponsive system might leave the function with a lot less computation time than specified.
I can think of a couple of fixes:
- Track the actual CPU time spent by the unload function (there is `boost::chrono::thread_clock`, for example); see the sketch after this list.
- Use a different metric, e.g. how many bytes have been unloaded.
- Force the unloading to continue if the cache has overflowed (e.g. to 2x its size).
- At the very least, expose `tileCacheUnloadTimeLimit` to cesium-unreal, so you can work around any issues caused by this (probably also expose `mainThreadLoadingTimeLimit`).
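To illustrate, here is a rough sketch of how the first and third ideas could be combined. This is not a patch against the actual cesium-native code; `cachedBytes` and `unloadOneTile` are hypothetical stand-ins, only `boost::chrono::thread_clock` is real:

```cpp
#include <boost/chrono.hpp>
#include <boost/chrono/thread_clock.hpp>
#include <cstdint>

// Hypothetical stand-ins for the real cache bookkeeping.
std::int64_t cachedBytes() { return 0; }
void unloadOneTile() {}

void unloadCacheCpuTimeBudget(double timeBudgetMs,
                              std::int64_t maximumCachedBytes) {
  const auto start = boost::chrono::thread_clock::now();
  while (cachedBytes() > maximumCachedBytes) {
    unloadOneTile();

    // Safety valve: once the cache has overflowed to 2x its configured
    // size, keep unloading regardless of the time budget.
    if (cachedBytes() > 2 * maximumCachedBytes)
      continue;

    // thread_clock only advances while this thread is actually running,
    // so preemption and paging no longer eat into the budget.
    const boost::chrono::duration<double, boost::milli> spent =
        boost::chrono::thread_clock::now() - start;
    if (spent.count() >= timeBudgetMs)
      break;
  }
}
```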
Also, it should probably be documented that `maximumCachedBytes` is actually no longer a hard limit if `tileCacheUnloadTimeLimit != 0`. This is valuable information, especially for memory-constrained systems.
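For reference, the workaround boils down to something like this, assuming the `TilesetOptions` fields as I read them in the cesium-native sources used by 1.19 (check against your version):

```cpp
#include <Cesium3DTilesSelection/TilesetOptions.h> // header path may differ

Cesium3DTilesSelection::TilesetOptions makeMemoryConstrainedOptions() {
  Cesium3DTilesSelection::TilesetOptions options;
  // Pick whatever fits your system; with a non-zero unload time limit
  // this is only a soft target, not a hard cap.
  options.maximumCachedBytes = 512 * 1024 * 1024;
  // 0 disables the per-frame unload time budget, which (as far as I can
  // tell) makes maximumCachedBytes a hard limit again: the cache is
  // trimmed back under the limit in one go.
  options.tileCacheUnloadTimeLimit = 0;
  return options;
}
```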
I would also be open to doing the PR myself once you decide whether this is something you want to address, and how.