@mohanasv2 mohanasv2 commented Jan 27, 2026

AMD Venice MCA phase-1 patches for VeLinux (6.6 kernel)

  1. x86/mce: Don't remove sysfs if thresholding sysfs init fails
  2. x86/mce: Remove old CMCI storm mitigation code
  3. x86/mce: Add per-bank CMCI storm mitigation
  4. x86/mce: Ensure user polling settings are honored when restarting timer
  5. x86/mce/amd: Add default names for MCA banks and blocks
  6. x86/mce/amd: Fix threshold limit reset
  7. x86/mce/amd: Rename threshold restart function
  8. x86/mce/amd: Remove return value for mce_threshold_{create,remove}_device()
  9. x86/mce/amd: Remove smca_banks_map
  10. x86/mce/amd: Remove shared threshold bank plumbing
  11. x86/mce/amd: Put list_head in threshold_bank
  12. x86/mce: Remove __mcheck_cpu_init_early()
  13. x86/mce: Set CR4.MCE last during init
  14. x86/mce: Define BSP-only init
  15. x86/mce: Define BSP-only SMCA init
  16. x86/MCE/AMD: Split amd_mce_is_memory_error()
  17. x86/mce: Define amd_mce_usable_address()
  18. x86/mce: Cleanup mce_usable_address()
  19. x86/mce: Remove redundant check from mce_device_create()
  20. x86/mce: Dynamically size space for machine check records
  21. x86/mce: Clean up TP_printk() output line of the 'mce_record' tracepoint
  22. tracing: Add the ::ppin field to the mce_record tracepoint
  23. tracing: Add the ::microcode field to the mce_record tracepoint
  24. x86/mce: Switch to new Intel CPU model defines
  25. x86/mce: Remove unused variable and return value in machine_check_poll()
  26. x86/mce: Rename mce_setup() to mce_prep_record()
  27. x86/mce: Define mce_prep_record() helpers for common and per-CPU fields
  28. x86/mce: Use mce_prep_record() helpers for apei_smca_report_x86_error()
  29. x86/mce: Add wrapper for struct mce to export vendor specific info
  30. x86/mce: Make several functions return bool
  31. x86/mce: Make four functions return bool
  32. x86/mce: Break up __mcheck_cpu_apply_quirks()
  33. x86/mce: Do 'UNKNOWN' vendor check early
  34. x86/mce: Cleanup bank processing on init
  35. x86/cpu/intel: Replace PAT erratum model/family magic numbers with symbolic IFM references
  36. x86/mce: Convert family/model mixed checks to VFM-based checks
  37. x86/mce: Separate global and per-CPU quirks
  38. x86/mce: Move machine_check_poll() status checks to helper functions
  39. x86/MCE/AMD: Add support for new MCA_SYND{1,2} registers
  40. x86/msr: Rename 'mce_rdmsrl()' to 'mce_rdmsrq()'
  41. x86/msr: Rename 'mce_wrmsrl()' to 'mce_wrmsrq()'
  42. x86/mce: Add a clear_bank() helper

The Venice MCA backport to VeLinux includes 50+ patches, planned for integration in two phases. Phase 1 consists of 42 patches, which have been submitted as part of this PR. The remaining patches will be submitted in the next phase.

This patch series contains 42 commits that modernize and refactor x86 Machine Check Exception (MCE) handling across Intel and AMD platforms. The changes improve error reporting, tracepoint exposure, helper abstractions, vendor-specific quirk handling, and initialization robustness, while aligning with upstream kernel conventions in naming, readability, and sysfs infrastructure.

  1. Key Bug Fixes
      - Bank initialization cleanup: Unified bank preparation into __mcheck_cpu_init_prepare_banks(), removing redundant flags and ensuring vendor settings apply before polling.
      - Return type consistency: Converted multiple functions to return bool instead of 0/1 for clarity and correctness.
      - Redundant checks removed: Eliminated unnecessary MCA support checks in mce_device_create().
      - Dead code removal: Removed the unused error_seen variable and the return value of machine_check_poll(), which nothing checks after the CMCI storm rework.

  2. Feature Additions
      - New AMD registers: Added support for MCA_SYND1 and MCA_SYND2 on Zen4 systems, exporting supplemental error info (e.g., FRU text).
      - Tracepoint extensions:
        - Added ::microcode field to record active microcode revision.
        - Added ::ppin field to expose Protected Processor Inventory Number.
        - Cleaned up TP_printk() output for better readability.
      - Wrapper struct: Introduced mce_hw_err to encapsulate struct mce, preventing UAPI bloat and enabling vendor-specific extensions.
      - Dynamic buffer sizing: Allocated machine check record space based on CPU count, scaling beyond the historical fixed buffer.

  3. Logic & Performance Improvements
      - Helper abstractions:
        - Split mce_prep_record() into common and per-CPU helpers.
        - Defined clear_bank() and status-check helpers for vendor-specific actions.
        - Split amd_mce_is_memory_error() into legacy and SMCA-specific helpers.
      - Quirk handling:
        - Separated global vs per-CPU quirks.
        - Moved “UNKNOWN vendor” check to BSP-only init.
        - Broke up __mcheck_cpu_apply_quirks() into vendor-specific helpers.
      - Naming consistency:
        - Renamed mce_setup() → mce_prep_record().
        - Renamed MSR accessors mce_wrmsrl() → mce_wrmsrq() and mce_rdmsrl() → mce_rdmsrq().
      - Intel errata handling: Replaced magic family/model numbers with symbolic IFM macros for PAT erratum checks.

  4. Robustness & Safety
      - Initialization resilience:
        - BSP-only SMCA init ensures handlers are set once per system.
        - Vendor-specific quirks applied consistently across CPUs.
      - Error address usability:
        - Defined amd_mce_usable_address() for AMD-specific validation.
        - Cleaned up mce_usable_address() with Intel-specific helpers.

Unit Test:

NOTE: The test cases listed below will work only after both Phase 1 and Phase 2 MCA patches are merged. Therefore, these test cases should be executed only after all Venice MCA patches have been integrated.

  1. Without patches:
      
      dmesg | grep -i mce
          [ 0.000000] Linux version 6.6.95-base-mce+ (amd@host) (gcc (Debian 12.2.0-14+deb12u1) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) [Intel-SIG] backporting KVM: x86: Advertise AVX10.1 CPUID to userspace #59 SMP PREEMPT_DYNAMIC Mon Jan 12 16:22:36 IST 2026
          [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.6.95-base-mce+ root=UUID=7ee00f0d-484a-4038-bdea-ab9e659efaa5 ro quiet
          [ 0.024602] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.6.95-base-mce+ root=UUID=7ee00f0d-484a-4038-bdea-ab9e659efaa5 ro quiet
          [ 0.843764] BOOT_IMAGE=/boot/vmlinuz-6.6.95-base-mce+
          [ 1.009275] usb usb1: Manufacturer: Linux 6.6.95-base-mce+ ehci_hcd
          [ 1.011461] usb usb2: Manufacturer: Linux 6.6.95-base-mce+ xhci-hcd
          [ 1.011585] usb usb3: Manufacturer: Linux 6.6.95-base-mce+ xhci-hcd
          [ 1.012162] usb usb4: Manufacturer: Linux 6.6.95-base-mce+ xhci-hcd
          [ 1.012250] usb usb5: Manufacturer: Linux 6.6.95-base-mce+ xhci-hcd
          [ 1.013054] usb usb6: Manufacturer: Linux 6.6.95-base-mce+ xhci-hcd
          [ 1.013129] usb usb7: Manufacturer: Linux 6.6.95-base-mce+ xhci-hcd
          [ 2.213820] MCE: In-kernel MCE decoding enabled.

  2. With patches:

      dmesg | grep -i mce
          [ 0.000000] Linux version 6.6.95-mce-full+ (amd@host) (gcc (Debian 12.2.0-14+deb12u1) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) Ve5.15 brbe #57 SMP PREEMPT_DYNAMIC Mon Jan 12 15:16:20 IST 2026
          [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.6.95-mce-full+ root=UUID=7ee00f0d-484a-4038-bdea-ab9e659efaa5 ro quiet
          [ 0.024368] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.6.95-mce-full+ root=UUID=7ee00f0d-484a-4038-bdea-ab9e659efaa5 ro quiet
          [ 0.533450] mce: HEST corrected error threshold limit: 10
          [ 0.844504] BOOT_IMAGE=/boot/vmlinuz-6.6.95-mce-full+
          [ 1.009294] usb usb1: Manufacturer: Linux 6.6.95-mce-full+ ehci_hcd
          [ 1.011542] usb usb2: Manufacturer: Linux 6.6.95-mce-full+ xhci-hcd
          [ 1.011663] usb usb3: Manufacturer: Linux 6.6.95-mce-full+ xhci-hcd
          [ 1.012220] usb usb4: Manufacturer: Linux 6.6.95-mce-full+ xhci-hcd
          [ 1.012309] usb usb5: Manufacturer: Linux 6.6.95-mce-full+ xhci-hcd
          [ 1.013122] usb usb6: Manufacturer: Linux 6.6.95-mce-full+ xhci-hcd
          [ 1.013202] usb usb7: Manufacturer: Linux 6.6.95-mce-full+ xhci-hcd
          [ 2.066804] MCE: In-kernel MCE decoding enabled.

yghannam and others added 30 commits January 27, 2026 10:59
commit 4c113a5b28bfd589e2010b5fc8867578b0135ed7 upstream

Currently, the MCE subsystem sysfs interface will be removed if the
thresholding sysfs interface fails to be created. A common failure is due to
new MCA bank types that are not recognized and don't have a short name set.

The MCA thresholding feature is optional and should not break the common MCE
sysfs interface. Also, new MCA bank types are occasionally introduced, and
updates will be needed to recognize them. But likewise, this should not break
the common sysfs interface.

Keep the MCE sysfs interface regardless of the status of the thresholding
sysfs interface.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-1-236dd74f645f@amd.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 3ed57b4 upstream

When a "storm" of corrected machine check interrupts (CMCI) is detected
this code mitigates by disabling CMCI interrupt signalling from all of
the banks owned by the CPU that saw the storm.

There are problems with this approach:

1) It is very coarse grained. In all likelihood only one of the banks
   was generating the interrupts, but CMCI is disabled for all.  This
   means Linux may delay seeing and processing errors logged from other
   banks.

2) Although CMCI stands for Corrected Machine Check Interrupt, it is
   also used to signal when an uncorrected error is logged. This is
   a problem because these errors should be handled in a timely manner.

Delete all this code in preparation for a finer grained solution.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20231115195450.12963-2-tony.luck@intel.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 7eae17c upstream

This is the core functionality to track CMCI storms at the machine check
bank granularity. Subsequent patches will add the vendor specific hooks
to supply input to the storm detection and take actions on the start/end
of a storm.

machine_check_poll() is called both by the CMCI interrupt code, and for
periodic polls from a timer. Add a hook in this routine to maintain
a bitmap history for each bank showing whether the bank logged a
corrected error or not each time it is polled.

In normal operation the interval between polls of these banks determines
how far to shift the history. The 64 bit width corresponds to about one
second.

When a storm is observed a CPU vendor specific action is taken to reduce
or stop CMCI from the bank that is the source of the storm.  The bank is
added to the bitmap of banks for this CPU to poll. The polling rate is
increased to once per second.  During a storm each bit in the history
indicates the status of the bank each time it is polled. Thus the
history covers just over a minute.

Declare a storm for that bank if the number of corrected interrupts seen
in that history is above some threshold (defined as 5 in this series,
could be tuned later if there is data to suggest a better value).

A storm on a bank ends if enough consecutive polls of the bank show no
corrected errors (defined as 30, may also change). That calls the CPU
vendor specific function to revert to normal operational mode, and
changes the polling rate back to the default.

  [ bp: Massage. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231115195450.12963-3-tony.luck@intel.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
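
Illustration for the per-bank storm tracking described in the commit above. This is a user-space Python model, not the kernel code. It assumes one history bit is shifted in per poll (the kernel shifts by elapsed time in normal operation), treats "above the threshold" as "at least 5 set bits", and clears the history when a storm ends. The class and constant names are hypothetical.

```python
HISTORY_BITS = 64            # history width, from the commit message
STORM_BEGIN_THRESHOLD = 5    # errors in the history that start a storm
STORM_END_CLEAN_POLLS = 30   # consecutive clean polls that end a storm


class BankStormTracker:
    """Toy model of CMCI storm tracking for a single MCA bank."""

    def __init__(self):
        self.history = 0       # bit 0 holds the most recent poll result
        self.in_storm = False

    def record_poll(self, saw_corrected_error):
        """Record one poll result; return True while the bank is in a storm."""
        mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(saw_corrected_error)) & mask

        if self.in_storm:
            recent = (1 << STORM_END_CLEAN_POLLS) - 1
            if (self.history & recent) == 0:
                # Enough clean polls in a row: leave storm mode.
                # Clearing the history is a modelling choice (assumption).
                self.in_storm = False
                self.history = 0
        elif bin(self.history).count("1") >= STORM_BEGIN_THRESHOLD:
            self.in_storm = True
        return self.in_storm
```

In this model a bank enters storm mode on the fifth logged error within the history window and leaves it only after 30 clean polls in a row. Other banks on the same CPU are unaffected because each bank has its own tracker.
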
commit 00c092de6f28ebd32208aef83b02d61af2229b60 upstream

Users can disable MCA polling by setting the "ignore_ce" parameter or by
setting "check_interval=0". This tells the kernel to *not* start the MCE
timer on a CPU.

If the user did not disable CMCI, then storms can occur. When these
happen, the MCE timer will be started with a fixed interval. After the
storm subsides, the timer's next interval is set to check_interval.

This disregards the user's input through "ignore_ce" and
"check_interval". Furthermore, if "check_interval=0", then the new timer
will run faster than expected.

Create a new helper to check these conditions and use it when a CMCI
storm ends.

  [ bp: Massage. ]

Fixes: 7eae17c ("x86/mce: Add per-bank CMCI storm mitigation")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-2-236dd74f645f@amd.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
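
Illustration for the timer fix described in the commit above. This is a minimal Python sketch of the decision the new helper makes, assuming that "ignore_ce" is a boolean and that "check_interval" is a number of seconds where 0 means polling is disabled. The function names are hypothetical and do not match the kernel helper.

```python
def polling_allowed(ignore_ce, check_interval):
    """True only if the user has not disabled MCA polling."""
    return (not ignore_ce) and check_interval > 0


def interval_after_storm(ignore_ce, check_interval):
    """Interval to program when a CMCI storm ends, or None to leave the
    timer stopped because the user disabled polling."""
    if not polling_allowed(ignore_ce, check_interval):
        return None
    return check_interval
```

With this check, "check_interval=0" no longer causes a timer to be rearmed after a storm, and "ignore_ce" keeps the timer off regardless of the configured interval.
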
commit d66e1e90b16055d2f0ee76e5384e3f119c3c2773 upstream

Ensure that sysfs init doesn't fail for new/unrecognized bank types or if
a bank has additional blocks available.

Most MCA banks have a single thresholding block, so the block takes the same
name as the bank.

Unified Memory Controllers (UMCs) are a special case where there are two
blocks and each has a unique name.

However, the microarchitecture allows for five blocks. Any new MCA bank types
with more than one block will be missing names for the extra blocks. The MCE
sysfs will fail to initialize in this case.

Fixes: 87a6d40 ("x86/mce/AMD: Update sysfs bank names for SMCA systems")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-3-236dd74f645f@amd.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
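
Illustration for the default-name fallback described in the commit above. This Python sketch only models the lookup order: known UMC block names first, then a generic block name, then a generic bank name. The table contents are an illustrative subset, and the "th_block_N" / "th_bank_N" formats are assumptions made for this example.

```python
# Illustrative subset of known names (assumption, not the full kernel tables).
KNOWN_BANK_NAMES = {"LS": "load_store", "UMC": "umc"}
UMC_BLOCK_NAMES = ["dram_ecc", "misc_umc"]


def sysfs_name(bank_type, bank, block):
    """Pick a sysfs name for a thresholding block, never failing.

    bank_type is None for an unrecognized bank type.
    """
    if bank_type == "UMC" and block < len(UMC_BLOCK_NAMES):
        return UMC_BLOCK_NAMES[block]
    if block > 0:
        # Extra block without a dedicated name: use a default.
        return f"th_block_{block}"
    # Block 0 takes the bank name, or a default for unknown bank types.
    return KNOWN_BANK_NAMES.get(bank_type, f"th_bank_{bank}")
```

Because every path returns a string, sysfs creation in this model cannot fail for lack of a name, which is the property the commit is after.
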
commit 5f6e3b720694ad771911f637a51930f511427ce1 upstream

The MCA threshold limit must be reset after servicing the interrupt.

Currently, the restart function doesn't have an explicit check for this.  It
makes some assumptions based on the current limit and what's in the registers.
These assumptions don't always hold, so the limit won't be reset in some
cases.

Make the reset condition explicit. Either an interrupt/overflow has occurred
or the bank is being initialized.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-4-236dd74f645f@amd.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 9af8b441cf6953f683b825fbf241a979ea7521e8 upstream

The threshold restart function operates per block rather than per bank, so
rename it for clarity.

No functional changes.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-5-236dd74f645f@amd.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 4d2161b9e8ba64076f520ec2f00eefb00722c15e upstream

The return values are not checked, so set return type to 'void'.

Also, move function declarations to internal.h, since these functions are
only used within the MCE subsystem.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-6-236dd74f645f@amd.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit b249288abde5190bb113ea5acef8af4ceac4957c upstream

The MCx_MISC0[BlkPtr] field was used on legacy systems to hold a register
offset for the next MCx_MISC* register. In this way, an implementation-specific
number of registers can be discovered at runtime.

The MCAX/SMCA register space simplifies this by always including the
MCx_MISC[1-4] registers. The MCx_MISC0[BlkPtr] field is used to indicate
(true/false) whether any MCx_MISC[1-4] registers are present.

Currently, MCx_MISC0[BlkPtr] is checked early and cached to be used during
sysfs init later. This is unnecessary as the MCx_MISC0 register is read again
later anyway.

Remove the smca_banks_map variable as it is effectively redundant, and use
a direct register/bit check instead.

  [ bp: Zap smca_get_block_address() too. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250825-wip-mca-updates-v5-3-865768a2eef8@amd.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit d35fb3121a36170bba951c529847a630440e4174 upstream

Legacy AMD systems include an integrated Northbridge that is represented
by MCA bank 4. This is the only non-core MCA bank in legacy systems. The
Northbridge is physically shared by all the CPUs within an AMD "Node".

However, in practice the "shared" MCA bank can only be managed by a
single CPU within that AMD Node. This is known as the "Node Base Core"
(NBC). For example, only the NBC will be able to read the MCA bank 4
registers; they will be Read-as-Zero for other CPUs. Also, the MCA
Thresholding interrupt will only signal the NBC; the other CPUs will not
receive it. This is enforced by hardware, and it should not be managed by
software.

The current AMD Thresholding code attempts to deal with the "shared" MCA
bank by micromanaging the bank's sysfs kobjects. However, this does not
follow the intended kobject use cases. It is also fragile, and it has
caused bugs in the past.

Modern AMD systems do not need this shared MCA bank support, and it
should not be needed on legacy systems either.

Remove the shared threshold bank code. Also, move the threshold struct
definitions to mce/amd.c, since they are no longer needed in amd_nb.c.

[Backport Changes]

1. In arch/x86/include/asm/amd_nb.h, the upstream patch removes the
refcount.h include, but this header is already removed in the current
source tree. Therefore, the removal step was skipped since the expected
change is already reflected in the existing code.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241206161210.163701-2-yazen.ghannam@amd.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit c4bac5c640e3782bf30c07c4d82042d0202fe224 upstream

The threshold_bank structure is a container for one or more threshold_block
structures. Currently, the container has a single pointer to the 'first'
threshold_block structure which then has a linked list of the remaining
threshold_block structures.

This results in an extra level of indirection where the 'first' block is
checked before iterating over the remaining blocks.

Remove the indirection by including the head of the block list in the
threshold_bank structure which already acts as a container for all the bank's
thresholding blocks.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-8-236dd74f645f@amd.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
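
Illustration for the data-structure change described in the commit above. This Python sketch contrasts the two layouts. The class and field names are hypothetical, and Python lists stand in for the kernel's linked list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    index: int
    # Old layout only: the 'first' block carries the list of the others.
    siblings: List["Block"] = field(default_factory=list)


@dataclass
class OldBank:
    first: Optional[Block] = None   # pointer to the 'first' block


@dataclass
class NewBank:
    blocks: List[Block] = field(default_factory=list)  # list head in the bank


def old_block_indexes(bank):
    """Old layout: check the first block, then walk its siblings."""
    if bank.first is None:
        return []
    return [bank.first.index] + [b.index for b in bank.first.siblings]


def new_block_indexes(bank):
    """New layout: a single loop over the bank's own list."""
    return [b.index for b in bank.blocks]
```

Both functions visit the same blocks. The new layout drops the special case for the first block.
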
commit 9f34032ec0deef58bd0eb7475f1981adfa998648 upstream

The __mcheck_cpu_init_early() function was introduced so that some
vendor-specific features are detected before the first MCA polling event done
in __mcheck_cpu_init_generic().

Currently, __mcheck_cpu_init_early() is only used on AMD-based systems and
additional code will be needed to support various system configurations.

However, the current and future vendor-specific code should be done during
vendor init. This keeps all the vendor code in a common location and
simplifies the generic init flow.

Move all the __mcheck_cpu_init_early() code into mce_amd_feature_init().

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250825-wip-mca-updates-v5-6-865768a2eef8@amd.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit cfffcf97997bd35f4a59e035523d1762568bdbad upstream

Set the CR4.MCE bit as the last step during init. This brings the MCA
init order closer to what is described in the x86 docs.

x86 docs:
  AMD		Intel
  		MCG_CTL
  MCA_CONFIG	MCG_EXT_CTL
  MCi_CTL	MCi_CTL
  MCG_CTL
  CR4.MCE	CR4.MCE

Current Linux:
  AMD		Intel
  CR4.MCE	CR4.MCE
  MCG_CTL	MCG_CTL
  MCA_CONFIG	MCG_EXT_CTL
  MCi_CTL	MCi_CTL

Updated Linux:
  AMD		Intel
  MCG_CTL	MCG_CTL
  MCA_CONFIG	MCG_EXT_CTL
  MCi_CTL	MCi_CTL
  CR4.MCE	CR4.MCE

The new init flow will match Intel's docs, but there will still be a
mismatch for AMD regarding MCG_CTL. However, there is no known issue with this
ordering, so leave it for now.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/20250908-wip-mca-updates-v6-0-eef5d6c74b9c@amd.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 669ce4984b729ad5b4c6249d4a8721ae52398bfb upstream

Currently, MCA initialization is executed identically on each CPU as
they are brought online. However, a number of MCA initialization tasks
only need to be done once.

Define a function to collect all 'global' init tasks and call this from
the BSP only. Start with CPU features.

[Backport Changes]

1. In file arch/x86/kernel/cpu/mce/core.c, within the newly added function
mca_bsp_init(), the call to rdmsrq() was replaced with the existing
equivalent call rdmsrl() because the upstream commit c435e608cf59f that
globally renamed rdmsrl() to rdmsrq() is not available yet in the current
source tree.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250908-wip-mca-updates-v6-0-eef5d6c74b9c@amd.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
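
Illustration for the BSP-only init split described in the commit above. This Python sketch only counts how often each kind of init runs. The names are hypothetical, and the real call sites in the kernel boot flow are not modelled.

```python
class McaState:
    """Counts init work so the split can be observed."""

    def __init__(self):
        self.global_init_runs = 0
        self.per_cpu_init_runs = 0


def bsp_init(state):
    """'Global' tasks, such as CPU feature detection: needed once."""
    state.global_init_runs += 1


def cpu_init(state, cpu):
    """Tasks that must still run on every CPU as it comes online."""
    state.per_cpu_init_runs += 1


def bring_up(state, num_cpus):
    bsp_init(state)                 # called from the boot CPU only
    for cpu in range(num_cpus):
        cpu_init(state, cpu)
```

For a 4-CPU system this model runs the global tasks once and the per-CPU tasks four times, instead of repeating everything on each CPU.
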
commit c6e465b8d45a1bc717d196ee769ee5a9060de8e2 upstream

Currently, on AMD systems, MCA interrupt handler functions are set during CPU
init. However, the functions only need to be set once for the whole system.

Assign the handlers only during BSP init. Do so only for SMCA systems to
maintain the old behavior for legacy systems.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/20250908-wip-mca-updates-v6-0-eef5d6c74b9c@amd.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 495a91d upstream

Define helper functions for legacy and SMCA systems in order to reuse
individual checks in later changes.

Describe what each function is checking for, and correct the XEC bitmask
for SMCA.

No functional change intended.

  [ bp: Use "else" in amd_mce_is_memory_error() to make the conditional
    balanced, for readability. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613141142.36801-2-yazen.ghannam@amd.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 48da1ad upstream

Currently, all valid MCA_ADDR values are assumed to be usable on AMD
systems. However, this is not correct in most cases. Notifiers expecting
usable addresses may then operate on inappropriate values.

Define a helper function to do AMD-specific checks for a usable memory
address. List out all known cases.

  [ bp: Tone down the capitalized words. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230613141142.36801-3-yazen.ghannam@amd.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 1bae0cf upstream

Move Intel-specific checks into a helper function.

Explicitly use "bool" for return type.

No functional change intended.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230613141142.36801-4-yazen.ghannam@amd.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 612905e upstream

mce_device_create() is called only from mce_cpu_online() which in turn
will be called iff MCA support is available. That is, at the time of
mce_device_create() call it's guaranteed that MCA support is available.
No need to duplicate this check so remove it.

  [ bp: Massage commit message. ]

Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231107165529.407349-1-nik.borisov@suse.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 108c649 upstream

Systems with a large number of CPUs may generate a large number of
machine check records when things go seriously wrong. But Linux has
a fixed-size buffer that can only capture a few dozen errors.

Allocate space based on the number of CPUs (with a minimum value based
on the historical fixed buffer that could store 80 records).

  [ bp: Rename local var from tmpp to something more telling: gpool. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Avadhut Naik <avadhut.naik@amd.com>
Link: https://lore.kernel.org/r/20240307192704.37213-1-tony.luck@intel.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
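
Illustration for the sizing rule described in the commit above. The minimum of 80 records comes from the commit message. The per-CPU factor of 2 is an assumption made for this Python sketch, and the names are hypothetical.

```python
MIN_RECORDS = 80       # historical fixed-buffer capacity (commit message)
RECORDS_PER_CPU = 2    # assumption for this example


def record_pool_entries(num_cpus):
    """Number of machine check records to reserve space for."""
    return max(MIN_RECORDS, num_cpus * RECORDS_PER_CPU)
```

Under these assumptions small systems keep the historical capacity, for example `record_pool_entries(8)` is 80, while a 512-CPU system gets room for 1024 records.
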
commit ac5e80e upstream

 - Only capitalize entries where that makes sense
 - Print separate values separately
 - Rename 'PROCESSOR' to vendor & CPUID

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Avadhut Naik <avadhut.naik@amd.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Link: https://lore.kernel.org/r/ZgZpn/zbCJWYdL5y@gmail.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 9843064 upstream

Machine Check Error information from 'struct mce' is exposed to userspace
through the mce_record tracepoint.

Currently, however, the PPIN (Protected Processor Inventory Number) field
of 'struct mce' is not exposed.

Add a PPIN field to the tracepoint as it provides a unique identifier for
the system (or socket in case of multi-socket systems) on which the MCE
has been received.

Also, add a comment explaining the kind of information that can be and
should be added to the tracepoint.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240401171455.1737976-2-avadhut.naik@amd.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 186d7ef upstream

Currently, the microcode field (Microcode Revision) of 'struct mce' is not
exposed to userspace through the mce_record tracepoint.

Knowing the microcode version on which the MCE was received is critical
information for debugging. If the version is not recorded, later attempts
to acquire the version might result in discrepancies since it can be
changed at runtime.

Add microcode version to the tracepoint to prevent ambiguity over
the active version on the system when the MCE was received.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240401171455.1737976-3-avadhut.naik@amd.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 4a5f2dd upstream

New CPU #defines encode vendor and family as well as model.

  [ bp: Squash *three* mce patches into one, fold in fix:
    https://lore.kernel.org/r/20240429022051.63360-1-tony.luck@intel.com ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240424181511.41772-1-tony.luck%40intel.com
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 5b9d292 upstream

The recent CMCI storm handling rework removed the last case that checks
the return value of machine_check_poll().

Therefore the "error_seen" variable is no longer used, so remove it.

Fixes: 3ed57b4 ("x86/mce: Remove old CMCI storm mitigation code")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240523155641.2805411-3-yazen.ghannam@amd.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 5ad21a2 upstream

There is no MCE "setup" done in mce_setup(). Rather, this function initializes
and prepares an MCE record.

Rename the function to highlight what it does.

No functional change is intended.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20240730182958.4117158-2-yazen.ghannam@amd.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit f9bbb8a upstream

Generally, MCA information for an error is gathered on the CPU that
reported the error. In this case, CPU-specific information from the
running CPU will be correct.

However, this will be incorrect if the MCA information is gathered while
running on a CPU that didn't report the error. One example is creating
an MCA record using mce_prep_record() for errors reported from ACPI.

Split mce_prep_record() so that there is a helper function to gather
common, i.e. not CPU-specific, information and another helper for
CPU-specific information.

Leave mce_prep_record() defined as-is for the common case when running
on the reporting CPU.

Get MCG_CAP in the global helper even though the register is per-CPU.
This value is not already cached per-CPU like other values. And it does
not assist with any per-CPU decoding or handling.
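
The split described above can be sketched in user-space C. This is only an illustration under stated assumptions: struct mce_sketch, cpu_table and the prep_record_*() names are hypothetical stand-ins, not the kernel's definitions, and the timestamp and MCG_CAP value are passed in as plain arguments instead of being read from hardware.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for the kernel's struct mce. */
struct mce_sketch {
	uint64_t mcgcap;    /* global capability value */
	uint64_t time;      /* timestamp, filled by the common helper */
	uint32_t cpu;       /* logical CPU that reported the error */
	uint32_t apicid;    /* per-CPU identifier */
	uint32_t microcode; /* per-CPU microcode revision */
};

/* Hypothetical per-CPU table standing in for cached per-CPU data. */
struct cpu_info_sketch {
	uint32_t apicid;
	uint32_t microcode;
};

static const struct cpu_info_sketch cpu_table[] = {
	{ .apicid = 0x00, .microcode = 0x100 },
	{ .apicid = 0x02, .microcode = 0x101 },
};

/* Fields that do not depend on which CPU reported the error. */
static void prep_record_common(struct mce_sketch *m, uint64_t now,
			       uint64_t mcgcap)
{
	memset(m, 0, sizeof(*m));
	m->time = now;
	m->mcgcap = mcgcap;
}

/* Fields that must come from the reporting CPU, not the running one. */
static void prep_record_per_cpu(unsigned int cpu, struct mce_sketch *m)
{
	m->cpu = cpu;
	m->apicid = cpu_table[cpu].apicid;
	m->microcode = cpu_table[cpu].microcode;
}

/* Common case: the running CPU is also the reporting CPU. */
static void prep_record(struct mce_sketch *m, unsigned int this_cpu,
			uint64_t now, uint64_t mcgcap)
{
	prep_record_common(m, now, mcgcap);
	prep_record_per_cpu(this_cpu, m);
}
```

A caller that runs on a different CPU than the one that reported the error would call the two helpers separately and pass the reporting CPU's number to the per-CPU helper.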

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20240730182958.4117158-3-yazen.ghannam@amd.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 793aa4b upstream

Current AMD systems can report MCA errors using the ACPI Boot Error
Record Table (BERT). The BERT entries for MCA errors will be an x86
Common Platform Error Record (CPER) with an MSR register context that
matches the MCAX/SMCA register space.

However, the BERT will not necessarily be processed on the CPU that
reported the MCA errors. Therefore, the correct CPU number needs to be
determined and the information saved in struct mce.

Use the newly defined mce_prep_record_*() helpers to get the correct
data.

Also, add an explicit check to verify that a valid CPU number was found
from the APIC ID search.
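
A rough user-space sketch of the lookup and the explicit validity check follows. The APIC ID table, NO_CPU and both function names are hypothetical and exist only for illustration; the kernel walks its own CPU data instead.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_CPU (-1)

/* Hypothetical map from logical CPU number to APIC ID. */
static const uint32_t apicid_of_cpu[] = { 0x00, 0x02, 0x04, 0x06 };

/* Search every known CPU for a matching APIC ID. */
static int cpu_from_apicid(uint32_t apicid)
{
	size_t nr = sizeof(apicid_of_cpu) / sizeof(apicid_of_cpu[0]);

	for (size_t cpu = 0; cpu < nr; cpu++) {
		if (apicid_of_cpu[cpu] == apicid)
			return (int)cpu;
	}
	return NO_CPU;
}

/*
 * Returns 0 and stores the reporting CPU on success, or -1 when the
 * APIC ID from the error record matches no known CPU.
 */
static int report_error_sketch(uint32_t lapic_id, unsigned int *cpu_out)
{
	int cpu = cpu_from_apicid(lapic_id);

	if (cpu == NO_CPU)
		return -1; /* explicit check: do not build a record for a bogus CPU */

	*cpu_out = (unsigned int)cpu;
	return 0;
}
```

Only after this check passes would the per-CPU record fields be filled in for the CPU that was found.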

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20240730182958.4117158-4-yazen.ghannam@amd.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Abhishek Rajput <Abhishek.Rajput@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 750fd23926f1507cc826b5a4fdd4bfc7283e7723 upstream

Currently, exporting additional machine check error information
involves adding new fields at the end of struct mce. This additional
information can then be consumed through mcelog or the tracepoint.

However, as new MSRs are being added (and will be added in the future)
by CPU vendors on their newer CPUs with additional machine check error
information to be exported, the size of struct mce will balloon on some
CPUs, unnecessarily, since those fields are vendor-specific. Moreover,
different CPU vendors may export the additional information in varying
sizes.

The problem particularly intensifies since struct mce is exposed to
userspace as part of UAPI. Its bloating through vendor-specific data
should be avoided to limit the information being sent out to userspace.

Add a new structure mce_hw_err to wrap the existing struct mce. This
prevents struct mce from ballooning, since vendor-specific data, if any,
can now be exported through a union within the wrapper structure and
through __dynamic_array in the mce_record tracepoint.

Furthermore, new internal kernel fields can be added to the wrapper
struct without impacting the user space API.
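
The wrapper idea can be sketched as follows. All type and field names here are hypothetical stand-ins chosen for illustration (including the two vendor fields inside the union); the point is only that the inner UAPI-style struct keeps its layout while extra data lives next to it.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the userspace-visible record; its layout must not grow. */
struct mce_uapi_sketch {
	uint64_t status;
	uint64_t addr;
	uint64_t synd;
};

/* Kernel-internal wrapper: vendor data goes into the union, not the record. */
struct mce_hw_err_sketch {
	struct mce_uapi_sketch m;
	union {
		struct {
			uint64_t synd1; /* hypothetical vendor-specific fields */
			uint64_t synd2;
		} amd;
	} vendor;
};

/* Recover the wrapper from a pointer to the embedded record. */
static struct mce_hw_err_sketch *to_hw_err_sketch(struct mce_uapi_sketch *m)
{
	return (struct mce_hw_err_sketch *)
		((char *)m - offsetof(struct mce_hw_err_sketch, m));
}
```

Code that only receives a pointer to the embedded record can still reach the vendor data through the container lookup, while consumers of the record itself never see the extra fields.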

  [ bp: Restore reverse x-mas tree order of function vars declarations. ]

[Backport Changes]

1. In arch/x86/kernel/cpu/mce/core.c, the hunk in mce_panic() deviates
from upstream because of line number changes. Upstream merge commit
b4442ca removed the declaration of struct page *p from the top of the
function and moved it inside the if condition
(if (final && (final->status & MCI_STATUS_ADDRV))). That commit is not
backported here because it would introduce additional dependencies.

Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20241022194158.110073-2-avadhut.naik@amd.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit c845cb8dbd2e1a804babfd13648026c3a7cfbc0b upstream

Make several functions that return 0 or 1 return a boolean value for
better readability.

No functional changes are intended.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20241212140103.66964-2-qiuxu.zhuo@intel.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
qzhuo2 and others added 12 commits January 27, 2026 10:59
commit c46945c9cac8437a674edb9d8fbe71511fb4acee upstream

Make those functions whose callers only care about success or failure return
a boolean value for better readability. Also, update the call sites
accordingly as the polarities of all the return values have been flipped.
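
The polarity flip can be illustrated with a minimal sketch. The function names and the "zero banks means failure" rule are hypothetical; only the calling convention matters.

```c
#include <stdbool.h>

/* Old convention: 0 means success, nonzero means failure. */
static int cap_init_old(unsigned int banks)
{
	return banks == 0 ? -1 : 0;
}

/* New convention: true means success, false means failure. */
static bool cap_init_new(unsigned int banks)
{
	return banks != 0;
}

/* Old call site: a nonzero return is the error path. */
static bool setup_old(unsigned int banks)
{
	if (cap_init_old(banks))
		return false;
	return true;
}

/* New call site: the test is negated because the polarity flipped. */
static bool setup_new(unsigned int banks)
{
	if (!cap_init_new(banks))
		return false;
	return true;
}
```

Both call sites behave identically for every input, which is what "no functional changes" means here: the return type and the test change together.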

No functional changes.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20241212140103.66964-4-qiuxu.zhuo@intel.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 51a12c28bb9a043e9444db5bd214b00ec161a639 upstream

Split each vendor specific part into its own helper function.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20241212140103.66964-5-qiuxu.zhuo@intel.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit a46b2bbe1e36e7faab5010f68324b7d191c5c09f upstream

The 'UNKNOWN' vendor check is handled as a quirk that is run on each
online CPU. However, all CPUs are expected to have the same vendor.

Move the 'UNKNOWN' vendor check to the BSP-only init so it is done early
and once. Remove the unnecessary return value from the quirks check.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250908-wip-mca-updates-v6-0-eef5d6c74b9c@amd.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 0f134c53246366c00664b640f9edc9be5db255b3 upstream

Unify the bank preparation into __mcheck_cpu_init_clear_banks(), rename that
function to what it does now - prepares banks. Do this so that generic and
vendor banks init goes first so that settings done during that init can take
effect before the first bank polling takes place.

Move __mcheck_cpu_check_banks() into __mcheck_cpu_init_prepare_banks() as it
already loops over the banks.

The MCP_DONTLOG flag is no longer needed, since the MCA polling function is
now called only if boot-time logging should be done.

[Backport Changes]

1. In file arch/x86/kernel/cpu/mce/core.c, within the function
__mcheck_cpu_check_banks(), the call to wrmsrq() was replaced with the
existing equivalent call wrmsrl() because the upstream commit
78255eb239733 that globally renamed wrmsrl() to wrmsrq() is not available
yet in the current source tree.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250825-wip-mca-updates-v5-5-865768a2eef8@amd.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
…mbolic IFM references

commit fd82221 upstream

There's an erratum that prevents the PAT from working correctly:

   https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-specification-update.pdf
   # Document 316515 Version 010

The kernel currently disables PAT support on those CPUs, but it
does it with some magic numbers.

Replace the magic numbers with the new "IFM" macros.

Make the check refer to the last affected CPU (INTEL_CORE_YONAH)
rather than the first fixed one. This makes it easier to find the
documentation of the erratum since Intel documents where it is
broken and not where it is fixed.

I don't think the Pentium Pro (or Pentium II) is actually affected.
But the old check included them, so it can't hurt to keep doing the
same.  I'm also not completely sure about the "Pentium M" CPUs
(models 0x9 and 0xd).  But, again, they were included in the
old checks and were close Pentium III derivatives, so are likely
affected.

While we're at it, revise the comment to refer to the erratum by name
and make sure it quotes the language from the actual errata doc.
That should make it easier to find in the future when the URL
inevitably changes.

Why bother with this in the first place? It actually gets rid of one
of the very few remaining direct references to c->x86{,_model}.

No change in functionality intended.
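
To show why a packed vendor/family/model value makes such range checks compact, here is a small sketch. The bit layout, the macro names and the model number used for the lower bound are assumptions made for this illustration, not the kernel's definitions, and only the family 6 part of the range is modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed packing: vendor in bits 16-23, family in 8-15, model in 0-7. */
#define VFM_SKETCH(vendor, family, model) \
	(((uint32_t)(vendor) << 16) | ((uint32_t)(family) << 8) | (uint32_t)(model))

#define VENDOR_INTEL_SKETCH 0
#define IFM_SKETCH(family, model) \
	VFM_SKETCH(VENDOR_INTEL_SKETCH, family, model)

/* Hypothetical symbolic names replacing magic family/model numbers. */
#define SKETCH_PENTIUM_PRO IFM_SKETCH(6, 0x01)
#define SKETCH_CORE_YONAH  IFM_SKETCH(6, 0x0E)

/* One ordered comparison covers family and model at the same time. */
static bool in_affected_family6_range(uint32_t vfm)
{
	return vfm >= SKETCH_PENTIUM_PRO && vfm <= SKETCH_CORE_YONAH;
}
```

Because the family sits in higher bits than the model, a single numeric range naturally ends at the last affected model, which matches the idea of naming the last broken CPU instead of the first fixed one.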

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Link: https://lore.kernel.org/r/20240829220042.1007820-1-dave.hansen@linux.intel.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 359d7a98e3e3f88dbf45411427b284bb3bbbaea5 upstream

Convert family/model mixed checks to VFM-based checks to make the code
more compact. Simplify.

  [ bp: Drop the "what" from the commit message - it should be visible from
    the diff alone. ]

Suggested-by: Sohil Mehta <sohil.mehta@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20241212140103.66964-6-qiuxu.zhuo@intel.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 7eee1e92684507f64ec6a75fecbd27e37174b888 upstream

Many quirks are global configuration settings and a handful apply to
each CPU.

Move the per-CPU quirks to vendor init to execute them on each online
CPU. Set the global quirks during BSP-only init so they're only executed
once and early.
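
The intended division of work can be sketched like this. Every name below is hypothetical, and the counters exist only so the sketch can show how often each kind of quirk runs.

```c
#include <stdbool.h>

/* Hypothetical global configuration touched by global quirks. */
struct mca_cfg_sketch {
	bool bootlog;
};

static struct mca_cfg_sketch cfg_sketch = { .bootlog = true };
static unsigned int global_quirk_runs;
static unsigned int per_cpu_quirk_runs;

/* Global quirks: run once, early, on the boot CPU only. */
static void bsp_init_sketch(void)
{
	cfg_sketch.bootlog = false;
	global_quirk_runs++;
}

/* Per-CPU quirks: run as part of vendor init on each online CPU. */
static void vendor_init_sketch(unsigned int cpu, bool *quirk_applied)
{
	quirk_applied[cpu] = true;
	per_cpu_quirk_runs++;
}

/* Boot flow: one BSP-only pass, then one vendor init per CPU. */
static void bring_up_sketch(unsigned int nr_cpus, bool *quirk_applied)
{
	bsp_init_sketch();
	for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
		vendor_init_sketch(cpu, quirk_applied);
}
```

With four CPUs, the global setting is written once while the per-CPU quirk is applied four times.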

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250908-wip-mca-updates-v6-0-eef5d6c74b9c@amd.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 91af6842e9945d064401ed2d6e91539a619760d1 upstream

There are a number of generic and vendor-specific status checks in
machine_check_poll(). These are used to determine if an error should be
skipped.

Move these into helper functions. Future vendor-specific checks will be
added to the helpers.
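
One possible shape for such helpers is sketched below. The function names, the vendor enum and the exact skip rules are illustrative assumptions, not the kernel's logic; the bit positions for the valid and uncorrected flags are likewise assumptions of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed status bits for this sketch. */
#define STATUS_VAL_SKETCH (1ULL << 63) /* record is valid */
#define STATUS_UC_SKETCH  (1ULL << 61) /* error is uncorrected */

enum vendor_sketch { VENDOR_GENERIC_SKETCH, VENDOR_A_SKETCH };

/* Generic checks shared by all vendors. */
static bool generic_should_log(uint64_t status, bool log_uncorrected)
{
	if (!(status & STATUS_VAL_SKETCH))
		return false; /* nothing logged in this bank */
	if ((status & STATUS_UC_SKETCH) && !log_uncorrected)
		return false; /* left for the exception handler */
	return true;
}

/* Placeholder for vendor-specific checks added later. */
static bool vendor_should_log(enum vendor_sketch vendor, uint64_t status)
{
	(void)vendor;
	(void)status;
	return true;
}

/* The polling loop asks a single question per bank. */
static bool should_log_poll_error(enum vendor_sketch vendor, uint64_t status,
				  bool log_uncorrected)
{
	return generic_should_log(status, log_uncorrected) &&
	       vendor_should_log(vendor, status);
}
```

Keeping the decision in helpers means a new vendor rule changes only the vendor helper, and the polling loop itself stays untouched.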

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250908-wip-mca-updates-v6-0-eef5d6c74b9c@amd.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit d4fca1358ea9096f2f6ed942e2cb3a820073dfc1 upstream

Starting with Zen4, AMD's Scalable MCA systems incorporate two new registers:
MCA_SYND1 and MCA_SYND2.

These registers will include supplemental error information in addition to the
existing MCA_SYND register. The data within these registers is considered
valid if MCA_STATUS[SyndV] is set.

Userspace error decoding tools like rasdaemon gather related hardware error
information through the tracepoints.

Therefore, export these two registers through the mce_record tracepoint so
that tools like rasdaemon can parse them and output the supplemental error
information like FRU text contained in them.
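
A consumer-side sketch of the validity rule: the supplemental values are used only when the SyndV bit is set in the status value. The bit position, the struct and the function name are assumptions made for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the SyndV bit in MCA_STATUS for this sketch. */
#define STATUS_SYNDV_SKETCH (1ULL << 53)

/* Hypothetical holder for the two supplemental syndrome values. */
struct synd_sketch {
	uint64_t synd1;
	uint64_t synd2;
	bool valid;
};

/* Accept the raw register values only when SyndV says they are valid. */
static struct synd_sketch read_supplemental_synd(uint64_t status,
						 uint64_t raw_synd1,
						 uint64_t raw_synd2)
{
	struct synd_sketch s = { 0, 0, false };

	if (status & STATUS_SYNDV_SKETCH) {
		s.synd1 = raw_synd1;
		s.synd2 = raw_synd2;
		s.valid = true;
	}
	return s;
}
```

A decoding tool would apply the same gate before interpreting the two values, for example before treating them as FRU text.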

  [ bp: Massage. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20241022194158.110073-4-avadhut.naik@amd.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit ebe29309c4d2821d5fdccd5393eba9c77540e260 upstream

Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 8e44e83f57c3289a41507eb79a315400629978ae upstream

Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>
commit 5c6f123c419b6e20f84ac1683089a52f449273aa upstream

Add a helper at the end of the MCA polling function to collect vendor and/or
feature actions.

Start with a basic skeleton for now. Actions for AMD thresholding and deferred
errors will be added later.

  [ bp: Drop the obvious comment too. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/20250908-wip-mca-updates-v6-0-eef5d6c74b9c@amd.com
Signed-off-by: Rahul Kumar <Kumar.Rahul2@amd.com>
Signed-off-by: mohanasv2 <mohanasv@amd.com>