How to run the code? #4

@dongzhangqi7


When I run the code following README.md, the compute node gets stuck at "Wait for thread start".
Memory Node: ./Server 19843 10 1 (I do not actually understand what NODEID means.)
Compute Node: ./db_bench --benchmarks=fillrandom,readrandom,readrandom,readrandomwriterandom --threads=16 --value_size=400 --num=100000000 --bloom_bits=10 --readwritepercent=5 --compute_node_id=0 --fixed_compute_shards_num=0

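The full log is as follows (the first run with --threads=16 ends in a segmentation fault; the second run with --threads=1 hangs):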
Mark: valgrind socket info1
searching for IB devices in host
found 2 device(s)
device not specified, using first one found: mlx5_1
New MR was registered with addr=0x7fcca32bf010, lkey=0xc09fc, rkey=0xc09fc, flags=0x7, size=10240000, total registered size is 0
New MR was registered with addr=0x7fcca28fa010, lkey=0xc0c08, rkey=0xc0c08, flags=0x7, size=10240000, total registered size is 10240000
SST buffer, send&receive buffer were registered with a
maximum outstanding wr number is32768
maximum query pair number is262144
maximum completion queue number is16777216
maximum memory region number is16777216
maximum memory region size is18446744073709551615
Success to connect to 10.10.10.10
TCP connection was established
connect to node id 0QP was created, QP number=0x957

Local LID = 0xffff
total bytes: 23read byte: 23Remote QP number = 0xa4c
Remote LID = 0xffff
Remote GID =fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
QP 0x7fcc9c003648 state was change to RTS
total bytes: 1read byte: 1Finish the connection with node 0
New MR was registered with addr=0x7fcc5bfff010, lkey=0x8055a, rkey=0x8055a, flags=0x7, size=1073741824, total registered size is 20480000
dLSM: version 1.22
Date: Tue Dec 12 03:00:43 2023
Start to sync options
client handling thread
CPU: 112 * Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
CPUCache:
Keys: 16 bytes each
Values: 400 bytes each (200 bytes after compression)
Entries: 100000000
RawSize: 39672.9 MB (estimated)
FileSize: 20599.4 MB (estimated)
WARNING: Snappy compression is not enabled

DBImpl start
New MR was registered with addr=0x7fcc59ffe010, lkey=0x9b2b0, rkey=0x9b2b0, flags=0x7, size=33554432, total registered size is 1094221824
Memory used up, Initially, allocate new one, memory pool is Version_edit, total memory this pool is 1
number 0 got bad completion with status: 0xc, vendor syndrome: 0x81
failed to poll send for remote memory register
communication thread created
DBImpl finished
corrupt message from client.DBImpl deallocated
Total number of entries within the cahce is 0DBImpl start
number 0 got bad completion with status: 0x5, vendor syndrome: 0xf9
failed to poll send for remote memory register
communication thread created
DBImpl finished
validation write finished
start front-end threads
Wait for thread start
total bytes: 1read byte: 1sync wait time is 128830Threads start to run
New MR was registered with addr=0x7fcc0bfff010, lkey=0x8065d, rkey=0x8065d, flags=0x7, size=1073741824, total registered size is 1127776256
Memory used up, Initially, allocate new one, memory pool is FlushBuffer, total memory this pool is 1
New MR was registered with addr=0x7fcbb3fff010, lkey=0x8045c, rkey=0x8045c, flags=0x7, size=1073741824, total registered size is 2201518080
Memory used up, Initially, allocate new one, memory pool is IndexChunk, total memory this pool is 1
New MR was registered with addr=0x7fcb5bfff010, lkey=0x8085f, rkey=0x8085f, flags=0x7, size=1073741824, total registered size is 3275259904
Memory used up, Initially, allocate new one, memory pool is FilterChunk, total memory this pool is 1
number 0 got bad completion with status: 0x5, vendor syndrome: 0xf9
failed to poll send for remote memory register
number 0 got bad completion with status: 0x5, vendor syndrome: 0xf9
failed to poll send for remote memory register
Segmentation fault (core dumped)
root@liu-9:~/dLSM/build# ./db_bench --benchmarks=fillrandom,readrandom,readrandom,readrandomwriterandom --threads=1 --value_size=400 --num=100000000 --bloom_bits=10 --readwritepercent=5 --compute_node_id=0 --fixed_compute_shards_num=0
Mark: valgrind socket info1
searching for IB devices in host
found 2 device(s)
device not specified, using first one found: mlx5_1
New MR was registered with addr=0x7f886fb63010, lkey=0xc2120, rkey=0xc2120, flags=0x7, size=10240000, total registered size is 0
New MR was registered with addr=0x7f886f19e010, lkey=0xc2d2c, rkey=0xc2d2c, flags=0x7, size=10240000, total registered size is 10240000
SST buffer, send&receive buffer were registered with a
maximum outstanding wr number is32768
maximum query pair number is262144
maximum completion queue number is16777216
maximum memory region number is16777216
maximum memory region size is18446744073709551615
Success to connect to 10.10.10.10
TCP connection was established
connect to node id 0QP was created, QP number=0x958

Local LID = 0xffff
total bytes: 23read byte: 23Remote QP number = 0xa4d
Remote LID = 0xffff
Remote GID =fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
QP 0x7f8868003648 state was change to RTS
total bytes: 1read byte: 1Finish the connection with node 0
New MR was registered with addr=0x7f8827fff010, lkey=0x8075e, rkey=0x8075e, flags=0x7, size=1073741824, total registered size is 20480000
dLSM: version 1.22
Date: Tue Dec 12 03:04:40 2023
Start to sync options
client handling thread
CPU: 112 * Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
CPUCache:
Keys: 16 bytes each
Values: 400 bytes each (200 bytes after compression)
Entries: 100000000
RawSize: 39672.9 MB (estimated)
FileSize: 20599.4 MB (estimated)
WARNING: Snappy compression is not enabled

DBImpl start
New MR was registered with addr=0x7f886c59c010, lkey=0x9b8cc, rkey=0x9b8cc, flags=0x7, size=33554432, total registered size is 1094221824
Memory used up, Initially, allocate new one, memory pool is Version_edit, total memory this pool is 1
number 0 got bad completion with status: 0xc, vendor syndrome: 0x81
failed to poll send for remote memory register
communication thread created
DBImpl finished
corrupt message from client.DBImpl deallocated
Total number of entries within the cahce is 0DBImpl start
number 0 got bad completion with status: 0x5, vendor syndrome: 0xf9
failed to poll send for remote memory register
communication thread created
DBImpl finished
validation write finished
start front-end threads
Wait for thread start
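
For reference, the two "bad completion" statuses in the log can be decoded by hand. The following is a minimal sketch, assuming the logged `status` is the raw `ibv_wc_status` value from libibverbs; the helper name `wc_status_name` is hypothetical and not part of dLSM, and only the first 14 enum values are listed.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical helper (not part of dLSM): maps a raw work-completion status
// to its name in the ibv_wc_status enum from <infiniband/verbs.h>.
// Only the first 14 values are covered; anything else yields "unknown".
constexpr std::string_view wc_status_name(std::uint32_t status) {
    constexpr std::string_view names[] = {
        "IBV_WC_SUCCESS",            // 0x0
        "IBV_WC_LOC_LEN_ERR",        // 0x1
        "IBV_WC_LOC_QP_OP_ERR",      // 0x2
        "IBV_WC_LOC_EEC_OP_ERR",     // 0x3
        "IBV_WC_LOC_PROT_ERR",       // 0x4
        "IBV_WC_WR_FLUSH_ERR",       // 0x5
        "IBV_WC_MW_BIND_ERR",        // 0x6
        "IBV_WC_BAD_RESP_ERR",       // 0x7
        "IBV_WC_LOC_ACCESS_ERR",     // 0x8
        "IBV_WC_REM_INV_REQ_ERR",    // 0x9
        "IBV_WC_REM_ACCESS_ERR",     // 0xa
        "IBV_WC_REM_OP_ERR",         // 0xb
        "IBV_WC_RETRY_EXC_ERR",      // 0xc
        "IBV_WC_RNR_RETRY_EXC_ERR",  // 0xd
    };
    constexpr std::uint32_t count = sizeof(names) / sizeof(names[0]);
    return status < count ? names[status] : std::string_view("unknown");
}
```

Under that assumption, `status: 0xc` reads as `IBV_WC_RETRY_EXC_ERR` (transport retry counter exceeded, i.e. the remote side never acknowledged the request), and the later `status: 0x5` reads as `IBV_WC_WR_FLUSH_ERR`, which is normally reported for work requests flushed after the QP has already entered the error state. In code linked against libibverbs, `ibv_wc_status_str()` returns a description of the status directly.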

Do you have any suggestions?
I would appreciate it very much.
