New feature: support memory row CE threshold policy #150

Closed
zhuofeng6

  • Introduction: Identify memory row faults among memory CE (corrected error) faults and isolate the physical memory pages where the row faults occur. This can prevent CE storms or memory UCE (uncorrected error) faults caused by memory row failures.

  • Implementation: The system counts the CE faults that occur in the same memory row within a specified period. If the count exceeds the configured threshold, the system treats the row as likely to fail and isolates all physical pages recorded for that row.

Notes:

  1. This function is disabled by default. You can enable it by configuring the 'ROW_CE_ACTION' field in the '/etc/sysconfig/rasdaemon' configuration file.
  2. If both row isolation and page isolation are enabled, page isolation is automatically disabled by default.
  3. If the CE count in the DIMM CE fault information received by rasdaemon is 0, the BIOS did not report the count correctly when parsing the fault information. In that case rasdaemon treats the count as 1 by default, which matches the kernel's behavior.
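
For readers who want the counting rule spelled out, here is a minimal sketch of the policy described above. It is not the code from this patch: `struct row_record`, `struct row_policy`, `normalize_ce_count()` and `row_account_ce()` are hypothetical names, and the sketch assumes the window is measured in seconds and that "exceeds" means strictly greater than the threshold.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-row record; field names are illustrative only. */
struct row_record {
	unsigned long count;	/* CEs accounted in the current window */
	time_t start;		/* start time of the current window */
};

/* Hypothetical policy: isolate when more than `threshold` CEs
 * are seen in one row within `cycle` seconds. */
struct row_policy {
	unsigned long threshold;
	time_t cycle;
};

/* Note 3: a reported CE count of 0 is treated as 1. */
static unsigned long normalize_ce_count(unsigned long reported)
{
	return reported ? reported : 1;
}

/*
 * Account one CE event against a row. Returns true when the row's
 * count exceeds the threshold within the current window, which is
 * the point where the row's recorded pages would be isolated.
 */
static bool row_account_ce(struct row_record *row,
			   const struct row_policy *policy,
			   unsigned long reported, time_t now)
{
	if (now - row->start > policy->cycle) {
		/* The window has expired: start counting again. */
		row->count = 0;
		row->start = now;
	}

	row->count += normalize_ce_count(reported);

	return row->count > policy->threshold;
}
```

With a threshold of 3 and a 10-second window, the fourth CE in the same row inside the window is the one that makes `row_account_ce()` return true; a CE that arrives after the window has expired starts a fresh count.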

@mchehab
Owner

mchehab commented Jun 11, 2024

Please add a Signed-off-by for the patches you submit.

(please see: https://www.kernel.org/doc/html/latest/process/submitting-patches.html#developer-s-certificate-of-origin-1-1)

Owner

@mchehab mchehab left a comment

  1. If both row isolation and page isolation are enabled, page isolation is automatically
    disabled by default.

I would expect a warning so that users know row isolation disabled page isolation. Also, I don't see what part of the code actually checks it.

unsigned long count;
};


Owner

Nitpick: just one blank line.

Author

Modified

int count;
time_t start;
};
#define ROW_LOCATION_FIELDS_NUM (DSM_FIELD_NUM > APEI_FIELD_NUM ? DSM_FIELD_NUM : APEI_FIELD_NUM)
Owner

Nitpick: please add a blank line after struct definitions.

Author

Modified

@@ -102,6 +121,11 @@ static void page_offline_init(void)
offline = OFFLINE_ACCOUNT;
}

if (row_offline_action != OFFLINE_OFF) {
log(TERM, LOG_INFO, "row threshold is open, so turn off page threshold\n");
Author

  1. If both row isolation and page isolation are enabled, page isolation is automatically
    disabled by default.

I would expect a warning so that users know row isolation disabled page isolation. Also, I don't see what part of the code actually checks it.

The check is done here.
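
To illustrate the behaviour being discussed (warn, then turn page isolation off when row isolation is on), here is a small sketch. It is not the patch's code: the enum name `offline_action` and the helper `disable_page_isolation_if_row_enabled()` are made up, only `OFFLINE_OFF` and `OFFLINE_ACCOUNT` are taken from the hunk above, and the sketch writes to stderr instead of using rasdaemon's `log()` macro. It also only warns when both features are actually enabled, which is a variation on the check in the hunk.

```c
#include <stdbool.h>
#include <stdio.h>

/* Only OFFLINE_OFF and OFFLINE_ACCOUNT appear in the hunk above;
 * the enum name is hypothetical. */
enum offline_action { OFFLINE_OFF, OFFLINE_ACCOUNT };

/*
 * If row isolation is enabled while page isolation is also enabled,
 * warn the user and switch page isolation off.
 * Returns true when page isolation was switched off by this call.
 */
static bool disable_page_isolation_if_row_enabled(enum offline_action row_action,
						  enum offline_action *page_action)
{
	if (row_action == OFFLINE_OFF || *page_action == OFFLINE_OFF)
		return false;

	fprintf(stderr,
		"warning: row CE isolation is enabled, page CE isolation is turned off\n");
	*page_action = OFFLINE_OFF;

	return true;
}
```

Returning a flag keeps the decision testable on its own, separate from whatever logging facility prints the warning.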

@zhuofeng6
Author

Please add a Signed-off-by for the patches you submit.

(please see: https://www.kernel.org/doc/html/latest/process/submitting-patches.html#developer-s-certificate-of-origin-1-1)

Already added

@mchehab
Owner

mchehab commented Jul 16, 2024

Please rebase on top of upstream, and check for compilation errors and warnings.

At least here, it is not building with gcc 14.1.1:

In file included from ras-page-isolation.c:24:
ras-page-isolation.h:100:48: warning: comparison between 'enum dsm_location_field_index' and 'enum apei_location_field_index' [-Wenum-compare]
  100 | #define ROW_LOCATION_FIELDS_NUM (DSM_FIELD_NUM > APEI_FIELD_NUM ? DSM_FIELD_NUM : APEI_FIELD_NUM)
      |                                                ^
ras-page-isolation.h:105:49: note: in expansion of macro 'ROW_LOCATION_FIELDS_NUM'
  105 |         int                     location_fields[ROW_LOCATION_FIELDS_NUM];
      |                                                 ^~~~~~~~~~~~~~~~~~~~~~~
ras-page-isolation.c: In function 'row_isolation_init':
ras-page-isolation.c:276:9: error: too few arguments to function 'parse_env_string'
  276 |         parse_env_string(&row_threshold, threshold_string);
      |         ^~~~~~~~~~~~~~~~
ras-page-isolation.c:198:13: note: declared here
  198 | static void parse_env_string(struct isolation *config, char *str, unsigned int size)
      |             ^~~~~~~~~~~~~~~~
ras-page-isolation.c:277:9: error: too few arguments to function 'parse_env_string'
  277 |         parse_env_string(&row_cycle, cycle_string);
      |         ^~~~~~~~~~~~~~~~
ras-page-isolation.c:198:13: note: declared here
  198 | static void parse_env_string(struct isolation *config, char *str, unsigned int size)
      |             ^~~~~~~~~~~~~~~~
ras-page-isolation.c: In function 'row_record_copy':
ras-page-isolation.h:100:48: warning: comparison between 'enum dsm_location_field_index' and 'enum apei_location_field_index' [-Wenum-compare]
  100 | #define ROW_LOCATION_FIELDS_NUM (DSM_FIELD_NUM > APEI_FIELD_NUM ? DSM_FIELD_NUM : APEI_FIELD_NUM)
      |                                                ^
ras-page-isolation.c:501:29: note: in expansion of macro 'ROW_LOCATION_FIELDS_NUM'
  501 |         for (int i = 0; i < ROW_LOCATION_FIELDS_NUM; i++) {
      |                             ^~~~~~~~~~~~~~~~~~~~~~~
make[2]: *** [Makefile:1256: rasdaemon-ras-page-isolation.o] Error 1
make[2]: *** Waiting for unfinished jobs....
In file included from ras-mc-handler.c:29:
ras-page-isolation.h:100:48: warning: comparison between 'enum dsm_location_field_index' and 'enum apei_location_field_index' [-Wenum-compare]
  100 | #define ROW_LOCATION_FIELDS_NUM (DSM_FIELD_NUM > APEI_FIELD_NUM ? DSM_FIELD_NUM : APEI_FIELD_NUM)
      |                                                ^
ras-page-isolation.h:105:49: note: in expansion of macro 'ROW_LOCATION_FIELDS_NUM'
  105 |         int                     location_fields[ROW_LOCATION_FIELDS_NUM];
      |                                                 ^~~~~~~~~~~~~~~~~~~~~~~
In file included from ras-events.c:45:
ras-page-isolation.h:100:48: warning: comparison between 'enum dsm_location_field_index' and 'enum apei_location_field_index' [-Wenum-compare]
  100 | #define ROW_LOCATION_FIELDS_NUM (DSM_FIELD_NUM > APEI_FIELD_NUM ? DSM_FIELD_NUM : APEI_FIELD_NUM)
      |                                                ^
ras-page-isolation.h:105:49: note: in expansion of macro 'ROW_LOCATION_FIELDS_NUM'
  105 |         int                     location_fields[ROW_LOCATION_FIELDS_NUM];
      |                                                 ^~~~~~~~~~~~~~~~~~~~~~~
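
For reference, the -Wenum-compare warning comes from comparing enumerators of two unrelated enum types inside ROW_LOCATION_FIELDS_NUM. One common way to avoid it is to cast both sides to int, which is sketched below. This is only an illustration and not necessarily how the patch was fixed: the enumerators other than DSM_FIELD_NUM and APEI_FIELD_NUM are hypothetical stand-ins, and `struct row_location` is a made-up name. The two errors are a separate matter: the log shows that parse_env_string() takes a third `size` argument, so the calls at lines 276 and 277 need one; what value to pass is not shown in this thread.

```c
/* Hypothetical stand-ins for the two location-field enums; the real
 * enumerators in ras-page-isolation.h are not shown in this thread. */
enum dsm_location_field_index { DSM_A, DSM_B, DSM_C, DSM_FIELD_NUM };
enum apei_location_field_index { APEI_A, APEI_B, APEI_FIELD_NUM };

/* Cast every operand to int so the comparison is between plain
 * integers instead of two different enum types. The result is still
 * an integer constant expression, so it can size an array. */
#define ROW_LOCATION_FIELDS_NUM \
	((int)DSM_FIELD_NUM > (int)APEI_FIELD_NUM ? \
	 (int)DSM_FIELD_NUM : (int)APEI_FIELD_NUM)

/* Hypothetical structure using the macro as an array size. */
struct row_location {
	int location_fields[ROW_LOCATION_FIELDS_NUM];
};
```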

@zhuofeng6 zhuofeng6 force-pushed the ce_row branch 4 times, most recently from 809760a to 92fa716 Compare July 28, 2024 03:45
- Introduction: Identify memory row faults in memory CE faults and
isolate the physical memory pages where row faults occur. This method
can effectively prevent CE storms or memory UCE faults caused by memory
row failures.

- Implementation: The system counts the number of CE faults in the same
memory row within a specified period. If the number of CE faults exceeds
the configured threshold, the system considers that the memory row may
fail and isolates all physical pages recorded in the memory row.

Notes:
1. This function is disabled by default. You can enable it by
configuring the 'ROW_CE_ACTION' field in the '/etc/sysconfig/rasdaemon' configuration file.
2. If both row isolation and page isolation are enabled, page isolation is automatically
disabled by default.
3. If the CE count in the DIMM CE fault information received by rasdaemon is 0, the BIOS
did not report the count correctly when parsing the fault information. In that case
rasdaemon treats the count as 1 by default, which matches the kernel's behavior.

Signed-off-by: zhuofeng <[email protected]>
@zhuofeng6
Author

Please rebase on top of upstream, and check for compilation errors and warnings.

Modified

@mchehab
Owner

mchehab commented Nov 18, 2024

Merged, thanks!

@mchehab mchehab closed this Nov 18, 2024