Memory corruption during CPU-intensive work #93

@yazun

Description

After roughly 10 hours of fairly intensive, mostly in-memory data crunching (50-60% CPU load, near-zero I/O load), we see a crash and a core dump as below:

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: dr3_ops_cs36 surveys 192.168.168.154(34674) REMOTE SUBPLAN (coord4:1'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000008aff06 in CopyDataRowTupleToSlot (slot=slot@entry=0x1bcdda0, combiner=<optimized out>) at execRemote.c:1843
1843    execRemote.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.26-23.el7.x86_64 glibc-2.17-307.el7.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-46.el7.x86_64 libcom_err-1.42.9-17.el7.x86_64 libselinux-2.5-15.el7.x86_64 libxml2-2.9.1-6.el7.4.x86_64 nspr-4.21.0-1.el7.x86_64 nss-3.44.0-7.el7_7.x86_64 nss-softokn-freebl-3.44.0-8.el7_7.x86_64 nss-util-3.44.0-4.el7_7.x86_64 openldap-2.4.44-21.el7_6.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00000000008aff06 in CopyDataRowTupleToSlot (slot=slot@entry=0x1bcdda0, combiner=<optimized out>) at execRemote.c:1843
#1  0x00000000008b3e72 in FetchTuple (combiner=combiner@entry=0x1bcd7e0) at execRemote.c:2144
#2  0x00000000008bd728 in ExecRemoteSubplan (pstate=0x1bcd7e0) at execRemote.c:10744
#3  0x0000000000904acc in ExecProcNode (node=0x1bcd7e0) at ../../../src/include/executor/executor.h:273
#4  fetch_input_tuple (aggstate=aggstate@entry=0x1bcd018) at nodeAgg.c:725
#5  0x000000000091354d in agg_retrieve_direct (aggstate=<optimized out>) at nodeAgg.c:3312
#6  ExecAgg (pstate=<optimized out>) at nodeAgg.c:3022
#7  0x0000000000906672 in ExecProcNode (node=0x1bcd018) at ../../../src/include/executor/executor.h:273
#8  ExecMaterial (pstate=0x1bccca8) at nodeMaterial.c:134
#9  0x000000000091cd7c in ExecProcNode (node=0x1bccca8) at ../../../src/include/executor/executor.h:273
#10 ExecNestLoop (pstate=0x1bbb020) at nodeNestloop.c:170
#11 0x00000000009480df in ExecProcNode (node=0x1bbb020) at ../../../src/include/executor/executor.h:273
#12 ExecutePlan (execute_once=<optimized out>, dest=0x1898f18, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1bbb020, estate=0x1bb9c08) at execMain.c:1955
#13 standard_ExecutorRun (queryDesc=0x19e6c18, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:465
#14 0x00000000006d034e in AdvanceProducingPortal (portal=portal@entry=0x19e3398, can_wait=can_wait@entry=0 '\000') at pquery.c:2592
#15 0x00000000006d2f27 in PortalRun (portal=0x19e3398, count=<optimized out>, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x19968e8, altdest=0x19968e8, completionTag=0x7ffe096d9730 "") at pquery.c:1344
#16 0x0000000000705d53 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x19964d8 "p_7_4a39_3_137b0456") at postgres.c:2958
#17 PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:5507
#18 0x000000000079c4ed in BackendRun (port=0x18898b0) at postmaster.c:4979
#19 BackendStartup (port=0x18898b0) at postmaster.c:4651
#20 ServerLoop () at postmaster.c:1956
#21 0x000000000079d366 in PostmasterMain (argc=5, argv=<optimized out>) at postmaster.c:1564
#22 0x0000000000497c53 in main (argc=5, argv=0x1855680) at main.c:228
(gdb)

It has already happened twice, so it seems like a highly probable scenario; it happens with no RAM pressure at all.

The offending access seems to come from a corrupted memory chunk (currentRow is clearly a garbage pointer). The offending line:

datarow = (RemoteDataRow) palloc(sizeof(RemoteDataRowData) + combiner->currentRow->msglen);
(gdb) p combiner->currentRow->msglen
value has been optimized out
(gdb) up
#1  0x00000000008b3e72 in FetchTuple (combiner=combiner@entry=0x1bcd7e0) at execRemote.c:2144
2144    in execRemote.c
(gdb) p *combiner
$4 = {ss = {ps = {type = T_RemoteSubplanState, plan = 0x188d7d8, state = 0x1bb9c08, ExecProcNode = 0x8bd660 <ExecRemoteSubplan>, ExecProcNodeReal = 0x8bd660 <ExecRemoteSubplan>, instrument = 0x0, worker_instrument = 0x0, qual = 0x0, lefttree = 0x0, righttree = 0x0, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0, ps_ResultTupleSlot = 0x1bcdda0, ps_ExprContext = 0x1c92218, ps_ProjInfo = 0x0, skip_data_mask_check = 0 '\000', audit_fga_qual = 0x0}, ss_currentRelation = 0x0,
    ss_currentScanDesc = 0x0, ss_ScanTupleSlot = 0x0, ss_currentMaskDesc = 0x0, inited = 0 '\000'}, node_count = 0, connections = 0x1bce528, conn_count = 1, current_conn = 0, current_conn_rows_consumed = 1, combine_type = COMBINE_TYPE_NONE, command_complete_count = 11, request_type = REQUEST_TYPE_QUERY, tuple_desc = 0x0, description_count = 0, copy_in_count = 0, copy_out_count = 0, copy_file = 0x0, processed = 0, errorCode = "\000\000\000\000", errorMessage = 0x0,
  errorDetail = 0x0, errorHint = 0x0, returning_node = 0, currentRow = 0xf3, rowBuffer = 0x7f7f988e05f8, tapenodes = 0x0, tapemarks = 0x7f7f988e07c8, prerowBuffers = 0x0, dataRowBuffer = 0x0, dataRowMemSize = 0x7f7f988e0898, nDataRows = 0x0, tmpslot = 0x0, errorNode = 0x0, backend_pid = 0, is_abort = 0 '\000', merge_sort = 0 '\000', extended_query = 1 '\001', probing_primary = 0 '\000', tuplesortstate = 0x0, remoteCopyType = REMOTE_COPY_NONE, tuplestorestate = 0x0,
  cursor = 0x7f7f98bc0fe8 "p_7_4a39_2_137b044d", update_cursor = 0x0, cursor_count = 12, cursor_connections = 0x7f7f988e01d8, recv_node_count = 12, recv_tuples = 0, recv_total_time = -1, DML_processed = 0, conns = 0x0, ccount = 0, recv_datarows = 0}
(gdb) p combiner->currentRow->msglen
Cannot access memory at address 0xf7
(gdb) p *combiner->currentRow
Cannot access memory at address 0xf3
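
For context, currentRow above holds 0xf3, which is not a valid heap address, so the pointer itself appears to have been overwritten rather than the row payload. Below is a rough sketch of what a defensive guard around the quoted allocation could look like; the names RemoteDataRow, RemoteDataRowData, currentRow and msglen come from the backtrace and the quoted line, while MAX_SANE_DATAROW_LEN and the error message are invented for illustration and are not the project's actual code:

/* Hypothetical guard around the allocation at execRemote.c:1843.
 * MAX_SANE_DATAROW_LEN and the error text are illustrative assumptions. */
#define MAX_SANE_DATAROW_LEN (512 * 1024 * 1024)    /* assumed upper bound */

if (combiner->currentRow == NULL ||
    combiner->currentRow->msglen < 0 ||
    combiner->currentRow->msglen > MAX_SANE_DATAROW_LEN)
    elog(ERROR, "corrupted remote data row (msglen out of range)");

datarow = (RemoteDataRow) palloc(sizeof(RemoteDataRowData) + combiner->currentRow->msglen);

A guard like this can only turn a NULL pointer or a wild msglen into a clean error; it cannot catch a non-NULL garbage pointer such as 0xf3 before it is dereferenced. Rebuilding with --enable-cassert (which enables memory context checking and clobbering of freed memory) or running the backend under Valgrind would likely surface the corruption much closer to where the pointer is actually being overwritten.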

Any idea if this could be fixed?

The queries are all similar and involve index lookups, a q3c index, and a lateral join with aggregates inside the lateral.
