153 lines
6.1 KiB
ReStructuredText
153 lines
6.1 KiB
ReStructuredText
.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later
|
|
|
|
==========================
|
|
EDAC Memory Repair Control
|
|
==========================
|
|
|
|
Copyright (c) 2024-2025 HiSilicon Limited.
|
|
|
|
:Author: Shiju Jose <shiju.jose@huawei.com>
|
|
:License: The GNU Free Documentation License, Version 1.2 without
|
|
Invariant Sections, Front-Cover Texts nor Back-Cover Texts.
|
|
(dual licensed under the GPL v2)
|
|
:Original Reviewers:
|
|
|
|
- Written for: 6.15
|
|
|
|
Introduction
|
|
------------
|
|
|
|
Some memory devices support repair operations to address issues in their
|
|
memory media. Post Package Repair (PPR) and memory sparing are examples of
|
|
such features.
|
|
|
|
Post Package Repair (PPR)
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Post Package Repair is a maintenance operation which requests the memory
|
|
device to perform repair operation on its media. It is a memory self-healing
|
|
feature that fixes a failing memory location by replacing it with a spare row
|
|
in a DRAM device.
|
|
|
|
For example, a CXL memory device with DRAM components that support PPR
|
|
features implements maintenance operations. DRAM components support those
|
|
types of PPR functions:
|
|
|
|
- hard PPR, for a permanent row repair, and
|
|
- soft PPR, for a temporary row repair.
|
|
|
|
Soft PPR is much faster than hard PPR, but the repair is lost after a power
|
|
cycle.
|
|
|
|
The data may not be retained and memory requests may not be correctly
|
|
processed during a repair operation. In such case, the repair operation should
|
|
not be executed at runtime.
|
|
|
|
For example, for CXL memory devices, see CXL spec rev 3.1 [1]_ sections
|
|
8.2.9.7.1.1 PPR Maintenance Operations, 8.2.9.7.1.2 sPPR Maintenance Operation
|
|
and 8.2.9.7.1.3 hPPR Maintenance Operation for more details.
|
|
|
|
Memory Sparing
|
|
~~~~~~~~~~~~~~
|
|
|
|
Memory sparing is a repair function that replaces a portion of memory with
|
|
a portion of functional memory at a particular granularity. Memory
|
|
sparing has cacheline/row/bank/rank sparing granularities. For example, in
|
|
rank memory-sparing mode, one memory rank serves as a spare for other ranks on
|
|
the same channel in case they fail.
|
|
|
|
The spare rank is held in reserve and not used as active memory until
|
|
a failure is indicated, with reserved capacity subtracted from the total
|
|
available memory in the system.
|
|
|
|
After an error threshold is surpassed in a system protected by memory sparing,
|
|
the content of a failing rank of DIMMs is copied to the spare rank. The
|
|
failing rank is then taken offline and the spare rank placed online for use as
|
|
active memory in place of the failed rank.
|
|
|
|
For example, CXL memory devices can support various subclasses for sparing
|
|
operation vary in terms of the scope of the sparing being performed.
|
|
|
|
Cacheline sparing subclass refers to a sparing action that can replace a full
|
|
cacheline. Row sparing is provided as an alternative to PPR sparing functions
|
|
and its scope is that of a single DDR row. Bank sparing allows an entire bank
|
|
to be replaced. Rank sparing is defined as an operation in which an entire DDR
|
|
rank is replaced.
|
|
|
|
See CXL spec 3.1 [1]_ section 8.2.9.7.1.4 Memory Sparing Maintenance
|
|
Operations for more details.
|
|
|
|
.. [1] https://computeexpresslink.org/cxl-specification/
|
|
|
|
Use cases of generic memory repair features control
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
1. The soft PPR, hard PPR and memory-sparing features share similar control
|
|
attributes. Therefore, there is a need for a standardized, generic sysfs
|
|
repair control that is exposed to userspace and used by administrators,
|
|
scripts and tools.
|
|
|
|
2. When a CXL device detects an error in a memory component, it informs the
|
|
host of the need for a repair maintenance operation by using an event
|
|
record where the "maintenance needed" flag is set. The event record
|
|
specifies the device physical address (DPA) and attributes of the memory
|
|
that requires repair. The kernel reports the corresponding CXL general
|
|
media or DRAM trace event to userspace, and userspace tools (e.g.
|
|
rasdaemon) initiate a repair maintenance operation in response to the
|
|
device request using the sysfs repair control.
|
|
|
|
3. Userspace tools, such as rasdaemon, request a repair operation on a memory
|
|
region when maintenance need flag set or an uncorrected memory error or
|
|
excess of corrected memory errors above a threshold value is reported or an
|
|
exceed corrected errors threshold flag set for that memory.
|
|
|
|
4. Multiple PPR/sparing instances may be present per memory device.
|
|
|
|
5. Drivers should enforce that live repair is safe. In systems where memory
|
|
mapping functions can change between boots, one approach to this is to log
|
|
memory errors seen on this boot against which to check live memory repair
|
|
requests.
|
|
|
|
The File System
|
|
---------------
|
|
|
|
The control attributes of a registered memory repair instance could be
|
|
accessed in the /sys/bus/edac/devices/<dev-name>/mem_repairX/
|
|
|
|
sysfs
|
|
-----
|
|
|
|
Sysfs files are documented in
|
|
`Documentation/ABI/testing/sysfs-edac-memory-repair`.
|
|
|
|
Examples
|
|
--------
|
|
|
|
The memory repair usage takes the form shown in this example:
|
|
|
|
1. CXL memory sparing
|
|
|
|
Memory sparing is defined as a repair function that replaces a portion of
|
|
memory with a portion of functional memory at that same DPA. The subclass
|
|
for this operation, cacheline/row/bank/rank sparing, vary in terms of the
|
|
scope of the sparing being performed.
|
|
|
|
Memory sparing maintenance operations may be supported by CXL devices that
|
|
implement CXL.mem protocol. A sparing maintenance operation requests the
|
|
CXL device to perform a repair operation on its media. For example, a CXL
|
|
device with DRAM components that support memory sparing features may
|
|
implement sparing maintenance operations.
|
|
|
|
2. CXL memory Soft Post Package Repair (sPPR)
|
|
|
|
Post Package Repair (PPR) maintenance operations may be supported by CXL
|
|
devices that implement CXL.mem protocol. A PPR maintenance operation
|
|
requests the CXL device to perform a repair operation on its media.
|
|
For example, a CXL device with DRAM components that support PPR features
|
|
may implement PPR Maintenance operations. Soft PPR (sPPR) is a temporary
|
|
row repair. Soft PPR may be faster, but the repair is lost with a power
|
|
cycle.
|
|
|
|
Sysfs files for memory repair are documented in
|
|
`Documentation/ABI/testing/sysfs-edac-memory-repair`
|