Skip to content

Optimise accesses to the Table name -> Table id map #7118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

RobinMorisset
Copy link
Contributor

Currently these accesses are guarded by (striped) read-write locks.

With this patch, there are still (regular) locks for protecting writing threads from each other, but threads that only need to read that map (so the overwhelming majority) no longer need to use any lock. Instead we now rely on lightweight atomic synchronisation, and on the thread_progress functionality for memory reclaimation.

One ugly hack is that if we add 1 table, then remove it, the corresponding bucket may go

  • inline, empty
  • inline, with a value
  • outlined vector without a value

Once we have a filled the inline vector, we can only transition to an outlined vector, and never go back to the inline state. This saves us from an ABA hazard:

  • Reader sees inline vector of size 1
  • Reader reads the name NameA that corresponds to table IdA, sees it matches what it looked for
  • Writer deletes (or rename the table), moving us to an empty vector
  • Writer inserts another table in that bucket, with a different name: NameB -> IdB
  • Reader reads IdB, concluding that NameA -> IdB. As long as we stick to outline vectors and always allocate a fresh one when deleting tables, we don't have that risk, as the vectors only get freed (and thus the adress may get reused by another vector) once all readers have made thread progress (and thus are no longer in the middle of an operation).

One consequence of this hack, is that we must lie to the memory counting system, pretending to have an outlined vector at all times, even when we only have the initial inlined vector. Otherwise, the tests detect the memory increase in inline (1 element) -> outlined (0 element) -> outlined (1 element), and report a memory leak.

This patch was motivated by noticing that heavyweight services can spend several % of CPU time in db_get_table_aux. A trial in production confirmed that this patch roughly halves the time spent in that function.

@github-actions
Copy link
Contributor

github-actions bot commented Apr 13, 2023

CT Test Results

    4 files    225 suites   1h 26m 33s ⏱️
3 474 tests 3 380 ✅  94 💤 0 ❌
4 901 runs  4 782 ✅ 119 💤 0 ❌

Results for commit 5d58cad.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

@rickard-green rickard-green added the team:VM Assigned to OTP team VM label Apr 17, 2023
@RobinMorisset
Copy link
Contributor Author

ping @sverker

@sverker
Copy link
Contributor

sverker commented Oct 4, 2023

ping @sverker

I'm sorry. I will try to get time to look at this again. I think it demands a proper review. Lock-less stuff are tricky.

db_lock(tb, kind);
if (name_lck)
erts_rwmtx_runlock(name_lck);
db_lock(tb, kind);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here in db_get_table_aux() when the table is locked we should double check the name of the table to be the expected (if we did lookup by name) and bail out with badarg if not. That would probably mean we got raced by a rename operation.
This will also make the ABA problem go away. And with this I think we could just clear an inline_entry in remove_named_tab() without having to allocate a new empty one. Remove in an outline bucket with more than one entry is still more complicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried getting this to work and I'm hitting a small issue.
The lock is not always taken in db_get_table_aux(): if what == DB_READ_TBL_STRUCT, then there is an early return before the lock is taken. The double-check might still work, but I'm struggling to figure out how to make it safe, and ensure that the table does not change under us. Apart from that I've got a patch with the relevant changes (removing the hack for size, and doing an atomic write of 0 to the inline_entry size when removing the only table in it (rather than allocating an empty outlined entry).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should double check the name after locking the table. Otherwise the table op may be seen as happening after the table was renamed.

ets:rename(aaa, bbb),
ets:insert(bbb, {key, bbb})

A concurrent process doing ets:lookup(aaa, key) should never get {key,bbb} if it never existed in the tabled when it was named aaa.

For DB_READ_TBL_STRUCT it's enough to do the check without lock. Just to make sure the table got the correct name now. DB_READ_TBL_STRUCT is only used by yielding ets:insert as a first cheap lookup of the table. It will lock the table later and then also double check the name. See comment in function ets_insert_2_list_lock_tbl.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, I added it to the comment (and moved the double-check to after taking the lock).

@sverker sverker added the waiting waiting for changes/input from author label Oct 30, 2023
@lpgauth
Copy link
Contributor

lpgauth commented Jan 19, 2024

ping @RobinMorisset

Would love to see this merged :)

@RobinMorisset
Copy link
Contributor Author

oops, sorry for forgetting about that. I'm a bit busy this week, but I'll try to get this ready for merging next week.

@lpgauth
Copy link
Contributor

lpgauth commented Feb 8, 2024

@RobinMorisset would be nice to have in OTP 27 :)

@RobinMorisset RobinMorisset force-pushed the ets_table_names_upstream branch from 9047110 to 0f96524 Compare February 20, 2024 10:54
@RobinMorisset
Copy link
Contributor Author

This new version passes tests, so it should not be completely broken, but I've put the double-checking before the locking because the lock is not always taken (see my comment), so I'm not very confident yet that this last change is correct.

RobinMorisset and others added 2 commits February 22, 2024 04:28
Currently these accesses are guarded by (striped) read-write locks.

With this patch, there are still (regular) locks for protecting writing threads from each other,
but threads that only need to read that map (so the overwhelming majority) no longer need to use any lock.
Instead we now rely on lightweight atomic synchronisation, and on the thread_progress functionality for
memory reclaimation.

One ugly hack is that if we add 1 table, then remove it, the corresponding
bucket may go
- inline, empty
- inline, with a value
- outlined vector without a value
Once we have a filled the inline vector, we can only transition to an
outlined vector, and never go back to the inline state. This saves us
from an ABA hazard:
- Reader sees inline vector of size 1
- Reader reads the name NameA that corresponds to table IdA, sees it matches what it looked for
- Writer deletes (or rename the table), moving us to an empty vector
- Writer inserts another table in that bucket, with a different name:
NameB -> IdB
- Reader reads IdB, concluding that NameA -> IdB.
As long as we stick to outline vectors and always allocate a fresh one
when deleting tables, we don't have that risk, as the vectors only get
freed (and thus the adress may get reused by another vector) once all
readers have made thread progress (and thus are no longer in the middle
of an operation).

One consequence of this hack, is that we must lie to the memory counting
system, pretending to have an outlined vector at all times, even when we
only have the initial inlined vector. Otherwise, the tests detect the
memory increase in inline (1 element) -> outlined (0 element) ->
outlined (1 element), and report a memory leak.

This patch was motivated by noticing that heavyweight
services can spend several % of CPU time in db_get_table_aux.
A trial in production confirmed that this patch roughly
halves the time spent in that function.
…), and avoid the hack where we make empty outline vectors.
@RobinMorisset RobinMorisset force-pushed the ets_table_names_upstream branch from 6f1dfd3 to da8337a Compare February 22, 2024 15:40
Comment on lines 950 to 953
void schedule_meta_name_tab_entries_for_deletion(struct meta_name_tab_entries *entries)
{
char* ptr = (char *) entries;
ptr -= sizeof(ErtsThrPrgrLaterOp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using raw char pointer arithmetic with sizeof(ErtsThrPrgrLaterOp)

I think you should introduce a struct meta_name_tab_entries_allocated, or whatever you want to call it.

struct meta_name_tab_entries_allocated {
    ErtsThrPrgrLaterOp later_op;
    struct meta_name_tab_entries data;
};

and then use macro ErtsContainerStruct(entries, struct meta_name_tab_entries_allocated, data) to find the start of it.

and entries = &allocated->data to go the other way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, I did not know about ErtsContainerStruct and it is indeed much cleaner.

Comment on lines 833 to 835
if (what == DB_READ_TBL_STRUCT) {
if (name_lck)
erts_rwmtx_runlock(name_lck);
return tb;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you must check the name for DB_READ_TBL_STRUCT as well. Otherwise we may start to operate on a completely random wrong table which feels a bit scary. At least I think it can result in incorrect errors if the random table has a different keypos for example.

Comment on lines 855 to 859
if (ERTS_UNLIKELY(tb->common.the_name != id)) {
*freason_p = BADARG | EXF_HAS_EXT_INFO;
p->fvalue = EXI_ID;
tb = NULL;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if block should end with return NULL, otherwise it will crash down where it checks tb->common.status.

I assume that means this code has never been executed. It's quite difficult to provoke the necessary race condition, but I think we should at least try with an evil test case that concurrently loops around both ets:rename and maybe ets:lookup and ets:insert. Instrumenting the code with a strategic erts_thr_yield() could even make it practically possible for the test case to provoke the race.

if (entries->data[i].name_atom == id) {
    erts_thr_yield();
    tb = entries->data[i].tb;
    break;
}

@RobinMorisset
Copy link
Contributor Author

I've checked and the new test catches that bug (triggers the segfault, which disappears with the fix). Thanks again for the careful review!

@sverker
Copy link
Contributor

sverker commented Feb 27, 2024

Did you get the test case to trigger the race as it is, without erts_thr_yield and with only 1000 iterations?

@RobinMorisset
Copy link
Contributor Author

I got it to trigger the race as it is, but I had 1M iterations at the time. I reduced the number of iterations because it takes forever to run otherwise now that the race is fixed.

@RobinMorisset
Copy link
Contributor Author

Testing some more, 1k iterations does not trigger the bug, but 10k iterations is enough to trigger it reliably (2 out of 2 runs).
But once I fix the bug, 10k iterations takes forever (to the point I'm wondering whether it is triggering some kind of deadlock or livelock or something).

@RobinMorisset
Copy link
Contributor Author

I found the issue: I was missing a db_unlock in the double-check branch!
The test now runs fine (and very fast) with the fix, and immediately blows up if I re-introduce the segfault in that branch, confirming that it is being tested.

@sverker
Copy link
Contributor

sverker commented Feb 27, 2024

You can also use undocumented spawn_opt option {scheduler,N} to make sure writer and reader are spawned on different schedulers.

@sverker
Copy link
Contributor

sverker commented Feb 27, 2024

You should always also run debug built emulator when developing. That will detect the missing unlock.

@RobinMorisset
Copy link
Contributor Author

You should always also run debug built emulator when developing. That will detect the missing unlock.

How do you ensure that the debug version is used by the tests? I've just been following https://github.com/erlang/otp/blob/master/HOWTO/TESTING.md and doing

make stdlib_test ARGS="-suite ets_SUITE -case racy_rename"

@sverker
Copy link
Contributor

sverker commented Feb 27, 2024

I don't know if there is a way to select emulator type when running tests like that.

I use the old way by releasing the tests:
https://github.com/erlang/otp/blob/master/HOWTO/TESTING.md#releasing-tests

Then I can start whatever erl I want and run tests with ts:run.

$ERL_TOP/bin/cerl -debug
1> ts:install().
2> ts:run(stdlib,ets_SUITE,racy_rename,[batch]).

@RobinMorisset
Copy link
Contributor Author

I've just tried that, and ts:install() fails with ** exception error: undefined function ts:install/0.
Is there an additional command I should run to bring ts in scope?

@dgud
Copy link
Contributor

dgud commented Feb 27, 2024

You need to stand in the test_server directory.

The more modern variant is using common test: (though you will still need test_server in the path).

cd  TEST_INSTALL_PATH/test/app/
erl -pa /TEST_INSTALL_PATH/test/test_server -sname foo EXTRA_ARGS
> ct:run_test([{suite, ets_SUITE}, {testcase,racy_rename}]).

@RobinMorisset
Copy link
Contributor Author

What is TEST_INSTALL_PATH supposed to be?
I tried doing $ERL_TOP/bin/cerl -debug -pa ../../tests/test_server in $ERL_TOP/release/tests/stdlib_test and then calling ct:run_test([{suite, ets_SUITE}, {testcase,racy_rename}]). in the resulting shell, but it gave me an error message "Failed to start CTH, see the CT Log for details" (with no information on how to find that log) and "*** FAILED {ets_SUITE,init_per_suite} ***".
I also tried ts:install(), and it apparently found it, but it gave another error message: {error,no_configure_script}

Sorry for all the questions, I generally find the test and build system rather mystifying and I'd like to figure it out for future PRs.

@dgud
Copy link
Contributor

dgud commented Feb 27, 2024

-pa ../../tests/test_server This needs to be an absolute path for a to me unknown reason.

@RobinMorisset
Copy link
Contributor Author

It worked! Thanks for the help.
Unfortunately, the race that reproduces every few thousand iterations in release mode stubbornly refuses to reproduce in debug mode, even with 1M iterations. Still it is good to know how to run tests in debug mode.

@sverker
Copy link
Contributor

sverker commented Feb 27, 2024

It worked! Thanks for the help. Unfortunately, the race that reproduces every few thousand iterations in release mode stubbornly refuses to reproduce in debug mode, even with 1M iterations. Still, it is good to know how to run tests in debug mode.

That is not surprising. The debug VM does a lot of extra checks that can radically change the timing.

@garazdawi
Copy link
Contributor

make TYPE=debug
make stdlib_test TYPE=debug ARGS="-suite ets_SUITE -case racy_rename"

as per https://github.com/erlang/otp/blob/master/HOWTO/DEVELOPMENT.md#types-and-flavors

@lpgauth
Copy link
Contributor

lpgauth commented Jul 12, 2024

@RobinMorisset bumping this is case you have time to complete this PR :) thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team:VM Assigned to OTP team VM waiting waiting for changes/input from author
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants