Description
I believe I have encountered another really hard to isolate/reproduce Pallene bug. This is going to be a terrible bug report because it has been really difficult to isolate and reproduce, and I so far have only encountered it as a piece of a much larger program.
But I want to report this bug so it is at least known and on the radar.
If I compile Lua with -DLUAI_ASSERT, then I (sometimes) get the following assertion failure:
lua: lapi.c:1015: lua_callk: Assertion `((nresults) == (-1) || (L->ci->top - L->top >= (nresults) - (nargs))) && "results from function overflow current stack size"' failed.
Aborted (core dumped)
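To make the assertion easier to reason about, here is a minimal model of the condition it tests, written in plain Lua. This is only a sketch: the real check is a C macro inside the Lua API, and the names `has_room_for_results` and `free_slots` are made up for illustration (`free_slots` stands in for `L->ci->top - L->top`).

```lua
-- Plain-Lua model of the stack-room check behind the assertion.
-- Illustrative only: the real check lives in C inside lua_callk.
local LUA_MULTRET = -1

-- free_slots stands in for (L->ci->top - L->top): how many slots remain
-- between the current stack top and the limit of the current call frame.
local function has_room_for_results(free_slots, nargs, nresults)
  if nresults == LUA_MULTRET then
    return true -- variable number of results: the check does not apply
  end
  -- The call pops nargs arguments and pushes nresults results.
  return free_slots >= nresults - nargs
end

print(has_room_for_results(0, 4, 1))   -- true:  0 >= 1 - 4
print(has_room_for_results(-4, 4, 1))  -- false: -4 < -3
```

For a call with 4 arguments and 1 result, the condition can only fail when `free_slots` is below -3, which would mean the stack top was already past the frame's limit before the call. I don't know which call actually trips the assertion, so treat the 4/1 numbers as an example only.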
I only get this sometimes, using the same data inputs. Also, I have only triggered this (thus far) when using LuaLanes. I've been scrutinizing both Lanes and the other modules I use, but I don't think they are the problem. I think the Lanes connection is a red herring. At best, Lanes might be stressing the system in such a way that it makes the bug materialize more easily.
But this bug can go for weeks without happening, and then all of a sudden, every single run crashes.
If I run pallenec --emit-lua on my entire project, I never get any crashes/assertions, so that's why I believe this is a Pallene bug.
I've been trying to isolate this for a few months now by moving groups of Pallene functions that I considered suspect over to Lua.
I think I have finally isolated things down just enough that it might be just small enough to start a discussion.
I will be attaching some files. But as a high level summary, the code when it was crashing seemed to be coming from the function:
function m.GenerateBackTestPlotForChartJS(backTestPlot:BackTestPlot, src_file_path:string, tgt_file_path:string, file_basename:string, wants_monolithic:boolean, use_remote_deps:boolean) : (string, string)
It used to be a big giant function, so I've been refactoring and breaking it down for the past few weeks.
There are now 6 helper function calls:
Generate_JSArrayData_1_*
Generate_JSArrayData_2_*
Generate_JSArrayData_3_*
Generate_JSArrayData_4_*
Generate_JSArrayData_5_*
Generate_JSArrayData_6_*
My crashes stopped for about 2 weeks after this refactoring, so I thought maybe it was the size of the original function that was causing the problem, but the crashes returned.
So continuing on, I started moving the 6 helper functions back and forth between Pallene and Lua. I noticed that keeping 1-4 in Pallene didn't seem to affect crashes, but 5 and 6 seemed to crash often.
I also noticed there was another commonality between 5-6, and a difference between 1-4:
Functions 5 & 6 both call another local helper function, which calls out to externally imported Lua functions (table.concat in this case).
This helper function is:
local function GenerateCodeForChartJSSeries(series_plot:Series, var_name_prefix:string, index:integer) : (string, string, string)
and in the middle/end of the function, there is an indirect call to table.concat():
local ret_data_decl = g_luaFuncs_concat(js_data_array_decl, "", 1, #js_data_array_decl)
So my current thinking is that this external call may be responsible for the bug. To try to test this theory:
I moved the following to Lua:
m.GenerateBackTestPlotForChartJS
Generate_JSArrayData_5_*
Generate_JSArrayData_6_*
But I modified 5 & 6 so their calls to GenerateCodeForChartJSSeries go to Pallene. This continued to crash.
Then I modified GenerateCodeForChartJSSeries so it no longer calls table.concat. Originally this helper would return the concatenated string as one of its return values. I changed it to skip the table.concat() and instead return the array of strings.
Then back in 5 & 6, I call table.concat() in Lua on the now returned array of strings to get back to where I was.
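The shape of that workaround, as a minimal sketch in plain Lua. The helper body here is a made-up stand-in (`generate_series_fragments` and its contents are hypothetical, not the code from the attachments); the point is only where the concat happens.

```lua
-- Stand-in for the Pallene helper after the change: it returns the array
-- of string fragments instead of concatenating them itself.
-- (Hypothetical body; the real helper is GenerateCodeForChartJSSeries.)
local function generate_series_fragments(var_name_prefix, index)
  local parts = {}
  parts[#parts + 1] = "var "
  parts[#parts + 1] = var_name_prefix .. "_" .. index
  parts[#parts + 1] = " = [];"
  return parts
end

-- Lua-side caller (the role helpers 5 & 6 now play): the concat happens
-- here, in Lua, rather than inside the Pallene function.
local parts = generate_series_fragments("series", 1)
local decl = table.concat(parts, "", 1, #parts)
print(decl)  -- var series_1 = [];
```

The trade-off is that the Pallene function no longer makes the external call at all, which sidesteps the suspected code path but does not explain it.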
Thus far, my crashes have disappeared. (But as I warned, this crash can be really random, so I can't prove whether this change is a real fix or just a coincidence.)
Also, I did worry maybe table.concat was secretly returning more values than I expected, so I did actually write a wrapper around table.concat to make sure it only ever returns 1 value. Despite my wrapper, I would still get crashes. So I don't think this is the problem.
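For reference, a wrapper of that kind can be as small as the sketch below. This is not the exact wrapper from my project; `concat_single` is an illustrative name.

```lua
-- Wrapper that guarantees exactly one return value from table.concat.
-- Assigning to a local first truncates the call to a single result.
local function concat_single(list, sep, i, j)
  local result = table.concat(list, sep, i, j)
  return result
end

-- select("#", ...) counts how many values the call actually returned.
print(select("#", concat_single({ "a", "b", "c" }, "", 1, 3)))  -- 1
print(concat_single({ "a", "b", "c" }, "", 1, 3))               -- abc
```

Wrapping the call in parentheses, as in `return (table.concat(list, sep, i, j))`, has the same truncating effect.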
But for the moment, my theory is that Pallene is incorrectly generating code relating to the calling of external functions (i.e. table.concat), or has something to do with string code in general.
I have lots of places in my codebase that do call external functions in Pallene, and they haven't crashed like this. But the big difference with this module for me, is that this is one of the few areas where I do a lot of string handling in Pallene. Most of my other modules are centered around floats and integers, and not strings.
That said, these functions discussed here are one minor hotspot in my program. While working around this by converting to Lua, my run times for a particular dataset went from about 25 seconds to 28 seconds (~12% slower). So originally I wasn't sure if Pallene would give me any benefit for this section of code, but it turned out to give a pretty good boost when it doesn't crash.
Attached are the immediate before and after versions of the code discussed here, with the table.concat workaround. This is just the Pallene code. I am not including the rest of the program because I think it would be too much. (If you really need it, I can provide it.) So there is no test driver program to run/crash. However, you might be able to compile it and look at the Pallene-generated code, and maybe something will reveal itself to you.
chart_pln_suspectconcat_before.pln.txt
chart_pln_workaround_after.pln.txt