New feature: support memory row CE threshold policy #150
Please add a Signed-off-by line to the patches you submit (see https://www.kernel.org/doc/html/latest/process/submitting-patches.html#developer-s-certificate-of-origin-1-1).
- If both row isolation and page isolation are enabled, page isolation is automatically
disabled by default.
I would expect a warning so that users know row isolation has disabled page isolation. Also, I don't see which part of the code actually checks this.
ras-page-isolation.h
Outdated
	unsigned long count;
};


Nitpick: just one blank line.
Modified
ras-page-isolation.h
Outdated
	int count;
	time_t start;
};
#define ROW_LOCATION_FIELDS_NUM (DSM_FIELD_NUM > APEI_FIELD_NUM ? DSM_FIELD_NUM : APEI_FIELD_NUM)
Nitpick: please add a blank line after struct definitions.
Modified
@@ -102,6 +121,11 @@ static void page_offline_init(void)
		offline = OFFLINE_ACCOUNT;
	}

	if (row_offline_action != OFFLINE_OFF) {
		log(TERM, LOG_INFO, "row threshold is open, so turn off page threshold\n");
- If both row isolation and page isolation are enabled, page isolation is automatically disabled by default.
I would expect a warning so that users know row isolation has disabled page isolation. Also, I don't see which part of the code actually checks this.
I checked here.
Already added.
Please rebase on top of upstream and check for compilation errors and warnings. At least here, it does not build with gcc 14.1.1:
|
809760a to 92fa716
- Introduction:
  Identify memory row faults in memory CE faults and isolate the physical memory pages where row faults occur. This method can effectively prevent CE storms or memory UCE faults caused by memory row failures.
- Implementation:
  The system counts the number of CE faults in the same memory row within a specified period. If the number of CE faults exceeds the configured threshold, the system considers that the memory row may fail and isolates all physical pages recorded in the memory row.

Notes:
1. This function is disabled by default. You can enable it by configuring the 'ROW_CE_ACTION' field in the '/etc/sysconfig/rasdaemon' configuration file.
2. If both row isolation and page isolation are enabled, page isolation is automatically disabled by default.
3. If the number of fault times in the DIMM CE fault information received by the rasdaemon is 0, the BIOS does not correctly parse the number of fault times when parsing the fault information. When a fault occurs, the rasdaemon process considers that the number of faults is 1 by default, which is the same as the kernel process.

Signed-off-by: zhuofeng <[email protected]>
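For readers skimming the thread, the counting rule in the commit message can be sketched roughly as follows. This is only an illustration and not the code from the patch: the names `row_counter` and `row_record_ce`, the fixed-window reset, and the strict "greater than" comparison are all assumptions made for the example. It also shows note 3, where a reported count of 0 is treated as one error.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical per-row state; the names are illustrative, not rasdaemon's. */
struct row_counter {
	unsigned long count;	/* CEs accumulated in the current period */
	time_t start;		/* start of the current period */
};

/*
 * Record one CE report for a row. Returns 1 when the accumulated count
 * exceeds the threshold (the row should then be isolated), 0 otherwise.
 * Assumption: a simple fixed window of `cycle` seconds.
 */
static int row_record_ce(struct row_counter *rc, unsigned long reported,
			 time_t now, unsigned long threshold, time_t cycle)
{
	assert(rc != NULL);

	/* Note 3: a reported count of 0 is treated as a single error. */
	if (reported == 0)
		reported = 1;

	/* Start a new period once the configured cycle has elapsed. */
	if (now - rc->start >= cycle) {
		rc->count = 0;
		rc->start = now;
	}

	rc->count += reported;
	return rc->count > threshold;
}
```

A caller would keep one such counter per memory row and, once the function returns 1, offline every page recorded for that row.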
Modified
Merged, thanks!