Description
An OEM customer would like to know why the error message below appears when loading the model "DeepSeek-R1-Distill-Qwen-7B" with "tp=2", and how to fix the problem.
[16004.940106] {468}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[16004.940109] {468}[Hardware Error]: It has been corrected by h/w and requires no further action
[16004.940111] {468}[Hardware Error]: event severity: corrected
[16004.940112] {468}[Hardware Error]: Error 0, type: corrected
[16004.940114] {468}[Hardware Error]: section_type: PCIe error
[16004.940115] {468}[Hardware Error]: port_type: 5, upstream switch port
[16004.940117] {468}[Hardware Error]: version: 3.0
[16004.940118] {468}[Hardware Error]: command: 0x0147, status: 0x0010
[16004.940120] {468}[Hardware Error]: device_id: 0000:1a:00.0
[16004.940121] {468}[Hardware Error]: slot: 0
[16004.940123] {468}[Hardware Error]: secondary_bus: 0x1b
[16004.940124] {468}[Hardware Error]: vendor_id: 0x11f8, device_id: 0x5000
[16004.940125] {468}[Hardware Error]: class_code: 060400
[16004.940126] {468}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x000b
[16004.940128] {468}[Hardware Error]: aer_cor_status: 0x00002000, aer_cor_mask: 0x00000000
[16004.940129] {468}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00180000
[16004.940131] {468}[Hardware Error]: aer_uncor_severity: 0x00463030
[16004.940132] {468}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
.
.
.
According to the triage result, the same model with tp=1 or tp=4 produces no error message and runs without problems.
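For reference while triaging, the AER register values in the log can be decoded bit by bit. The following is a minimal sketch, not a diagnosis: the bit names are taken from the standard PCIe AER capability layout, and the `decode` helper and table names are hypothetical, introduced only for illustration. The input values are copied from the log above.

```python
# Bit positions follow the standard PCIe AER capability register layout.
# COR_BITS, UNCOR_BITS and decode() are illustrative names, not part of any tool.
COR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "Replay Number Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode(value, names):
    """Return the names of all bits set in a 32-bit AER register value."""
    return [
        names.get(bit, f"bit {bit} (unknown/reserved)")
        for bit in range(32)
        if value & (1 << bit)
    ]


# Values copied from the log for device 0000:1a:00.0
print("aer_cor_status:    ", decode(0x00002000, COR_BITS))
print("aer_uncor_status:  ", decode(0x00100000, UNCOR_BITS))
print("aer_uncor_mask:    ", decode(0x00180000, UNCOR_BITS))
print("aer_uncor_severity:", decode(0x00463030, UNCOR_BITS))
```

Under this decoding, the correctable status shows Advisory Non-Fatal Error and the uncorrectable status shows Unsupported Request, which is not marked fatal in the severity register. That matches the "corrected" severity in the log, but it does not by itself explain why the event occurs only with tp=2.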