ipmitool to be used with AER only on Ampere Altra platforms #56

Open
wants to merge 61 commits into
base: master
Commits
43a78ea
ipmitool to be used with AER only on Ampere Altra platforms
RocheWilliam Nov 9, 2021
f3f4d9c
Enrich ras_report_aer_ipmi_init() comments.
RocheWilliam Dec 6, 2021
b14b901
rasdaemon: fix compile against musl libc
stintel Sep 1, 2021
fc3f8de
rasdaemon: ras-mc-ctl: Fix script to parse dimm sizes
muralimk-amd Jul 27, 2021
c66be8e
Update ras-mc-ctl manpage to match current options
justinvreeland Nov 3, 2021
ae89390
add labels for asrock x570 motherboard
stevenj Dec 7, 2021
a7068c7
rasdaemon.service.in: comment out syslog.target
evils Dec 11, 2021
8c5d37d
rasdaemon: Fix the issue of sprintf data type mismatch in uuid_le()
Oct 20, 2021
3b6f56b
rasdaemon: Fix the issue of command option -r for hip08
Oct 20, 2021
81d0497
rasdaemon: Fix some print format issues for hisi common error section
Oct 20, 2021
6d29781
rasdaemon: Add some modules supported by hisi common error section
Oct 20, 2021
4067e28
Makefile.am: clean output from misc/*.in
mchehab Apr 1, 2022
4a57930
misc/rasdaemon.spec.in: fix some issues on it
mchehab Apr 1, 2022
5b75569
Bump version to 0.6.8
mchehab Apr 1, 2022
29820f5
rasdaemon: use the new block_rq_error tracepoint
yang-shi Apr 4, 2022
d4d734d
libtrace: Use XSI version of strerror_r on non glibc systems
kraj Aug 31, 2022
f71ecea
rasdaemon: ras-report: fix possible but unlikely file descriptor leak
aristeu Jan 19, 2023
632c15f
rasdaemon: mce-amd-smca: properly limit bank types
aristeu Jan 19, 2023
fe52450
rasdaemon: ras-memory-failure-handler: handle localtime() failure cor…
aristeu Jan 19, 2023
4e35e7c
rasdaemon: Support cpu fault isolation for corrected errors
Lostwayzxc Feb 23, 2022
28fd2bb
rasdaemon: Support cpu fault isolation for recoverable errors
Lostwayzxc Feb 23, 2022
c5c0796
rasdaemon: Modify recording Hisilicon common error data
shijujose4 Mar 2, 2022
b00b8d2
rasdaemon: ras-mc-ctl: Modify error statistics for HiSilicon KunPeng9…
shijujose4 Feb 24, 2022
ceac760
rasdaemon: ras-mc-ctl: Reformat error info of the HiSilicon Kunpeng920
shijujose4 Mar 5, 2022
96a9942
rasdaemon: ras-mc-ctl: Add printing usage if necessary parameters are…
shijujose4 Mar 5, 2022
01c2453
rasdaemon: ras-mc-ctl: Add support to display the HiSilicon vendor er…
shijujose4 Mar 5, 2022
db4bf0a
rasdaemon: ras-mc-ctl: Relocate reading and display Kunpeng920 errors…
shijujose4 Mar 7, 2022
9e10a2d
rasdaemon: ras-mc-ctl: Updated HiSilicon platform name
shijujose4 Apr 28, 2022
b05a49d
rasdaemon: Fix for a memory out-of-bounds issue and optimized code to…
shijujose4 Apr 28, 2022
d6cecbe
rasdaemon: Add four modules supported by HiSilicon common section
Oct 31, 2022
82a0a7f
labels/asus: add ASUS TUF GAMING B450-PLUS II
dgcampea Dec 19, 2022
ccfb4bf
configure.ac: fix bashisms
thesamesam Dec 29, 2022
49e6791
INSTALL: update from latest version of it
mchehab Jan 21, 2023
a2f8b73
.gitignore: add the auto-generated "compile" file
mchehab Jan 21, 2023
b6992fa
Bump version to 0.7.0
mchehab Jan 21, 2023
768bdb4
Add a release workflow
mchehab Jan 21, 2023
106ef7b
on_tag.yml: use a different approach to upload artifact
mchehab Jan 22, 2023
a65f9be
Convert to use libtraceevent
mchehab Jan 21, 2023
c14a515
Adjust indentations
mchehab Jan 21, 2023
6b51a2e
Remove the old libtrace
mchehab Jan 21, 2023
973bd64
ci.yml: add libtraceevent-dev dependency
mchehab Jan 21, 2023
83f2312
configure.ac: get rid of obsolete macros
mchehab Jan 21, 2023
0b242e6
Makefile.am: enable all options on make distcheck
mchehab Jan 21, 2023
20dd76a
README: Update instructions about how to contribute
mchehab Jan 23, 2023
6ad5c83
rasdaemon: Fix poll() on per_cpu trace_pipe_raw blocks indefinitely
shijujose4 Feb 4, 2023
d79c076
Makefile.am: fix mock build target
mchehab Feb 18, 2023
bb63ff9
misc/rasdaemon.spec.in: add libtraceevent requirement
mchehab Feb 18, 2023
eea1e38
Convert README to markdown format
mchehab Feb 18, 2023
aaccd37
labels/asrock: add X399D8A-2T
tictooc Feb 11, 2023
2647640
Bump version to 0.8.0
mchehab Feb 18, 2023
73cd1b1
ChangeLog: do some minor updates
mchehab Feb 18, 2023
be4ec11
ci.yml: fix workflow to build rasdaemon
mchehab Feb 18, 2023
9809066
Fix create release workflow
mchehab Feb 18, 2023
48a6d13
configure.ac: fix bashisms
thesamesam Feb 19, 2023
d7583fb
rasdaemon: fix table create if some cpus are offline
shijujose4 Mar 5, 2023
02e4960
ras-mc-ctl: add option to exclude old events from reports
m-sundman Apr 20, 2023
ac737e5
rasdaemon: Move definition for BIT and BIT_ULL to a common file
shijujose4 Jan 16, 2023
87c6384
rasdaemon: Add support for the CXL poison events
shijujose4 Mar 31, 2023
9ffa38e
rasdaemon: Add support for the CXL AER uncorrectable errors
shijujose4 Mar 17, 2023
a9049ef
rasdaemon: Add support for the CXL AER correctable errors
shijujose4 Mar 17, 2023
7ca3126
ipmitool to be used with AER only on Ampere Altra platforms
RocheWilliam Nov 9, 2021
127 changes: 98 additions & 29 deletions ras-aer-handler.c
@@ -25,6 +25,11 @@
#include "ras-logger.h"
#include "bitfield.h"
#include "ras-report.h"
#ifdef HAVE_AMP_NS_DECODE
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/utsname.h>
#endif

/* bit field meaning for correctable error */
static const char *aer_cor_errors[32] = {
@@ -52,6 +57,89 @@ static const char *aer_uncor_errors[32] = {
[20] = "Unsupported Request",
};

#ifdef HAVE_AMP_NS_DECODE
#define IPMITOOL_CMD "/usr/bin/ipmitool"
#define DMIDECODE_CMD "/usr/sbin/dmidecode"
static bool ampere_ipmitool = false;

static void ras_report_aer_ipmi_init(void)
{
struct utsname unm;
struct stat st;
int rc;

/*
* Verify on startup if we are on an Ampere Altra or Altra Max
* platform, to set the use of ipmitool (if installed).
* This depends on BIOS implementation to provide the CPU information.
* If the BIOS doesn't provide it or gives a different string, the
* ipmitool use will be disabled.
*/
if (stat(IPMITOOL_CMD, &st) != 0)
return;

if ((uname(&unm) != 0) || (strncmp(unm.machine, "aarch64", 8) != 0))
return;

/* prefer dmidecode (if installed) as only lscpu newer than 2.37 gets dmi info */
if (stat(DMIDECODE_CMD, &st) == 0)
rc = system(DMIDECODE_CMD" -t 4 | /usr/bin/grep "
"'Ampere(R) Altra(R)' > /dev/null");
else
rc = system("/usr/bin/lscpu | /usr/bin/grep "
"'Ampere(R) Altra(R)' > /dev/null");
if (rc == -1 || !WIFEXITED(rc) || WEXITSTATUS(rc))
return;

ampere_ipmitool = true;
}

static void ras_report_aer_ipmi(int severity_val, struct ras_aer_event *ev)
{
char ipmi_add_sel[114];
uint8_t sel_data[5];
int seg, bus, dev, fn, rc;

if (!ampere_ipmitool)
return;

/*
* Get the PCIe AER error source seg/bus/dev/fn and save it into
* the BMC OEM SEL. "ipmitool raw 0x0a 0x44" is the IPMI Add SEL
* Entry command; see IPMI specification chapter 31.6. 0xcd3a
* is the manufacturer ID (Ampere), byte 12 is the sensor number
* (CE is 0xBF, UE is 0xCA), bytes 13~14 are the segment number,
* byte 15 is the bus number, byte 16[7:3] is the device number,
* and byte 16[2:0] is the function number.
*/

switch (severity_val) {
case HW_EVENT_AER_UNCORRECTED_NON_FATAL:
case HW_EVENT_AER_UNCORRECTED_FATAL:
sel_data[0] = 0xca;
break;
case HW_EVENT_AER_CORRECTED:
default:
sel_data[0] = 0xbf;
}

sscanf(ev->dev_name, "%x:%x:%x.%x", &seg, &bus, &dev, &fn);

sel_data[1] = seg & 0xff;
sel_data[2] = (seg & 0xff00) >> 8;
sel_data[3] = bus;
sel_data[4] = (((dev & 0x1f) << 3) | (fn & 0x7));

sprintf(ipmi_add_sel, IPMITOOL_CMD
" raw 0x0a 0x44 0x00 0x00 0xc0 0x00 0x00 0x00 0x00 0x3a 0xcd 0x00 0xc0 0x%02x 0x%02x 0x%02x 0x%02x 0x%02x",
sel_data[0], sel_data[1], sel_data[2], sel_data[3], sel_data[4]);

rc = system(ipmi_add_sel);
if (rc == -1 || !WIFEXITED(rc) || WEXITSTATUS(rc))
log(TERM, LOG_ERR, "ipmitool command failed [%d]", rc);
}
#endif
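
Review note: the byte layout described in the comment of ras_report_aer_ipmi() can be exercised in isolation. The sketch below is not part of this PR; `pack_pci_location()` and `format_add_sel()` are hypothetical helper names used only for illustration. It assumes the same "seg:bus:dev.fn" device-name format and the same raw command string as the patch, and differs from it in two ways: it checks the sscanf() match count before using the parsed values, and it uses snprintf() so a too-small buffer is reported instead of overrun.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not in the PR): parse "seg:bus:dev.fn" and pack
 * it into the four location bytes that follow the sensor number.
 * Returns 0 on success, -1 if the name does not match the format.
 */
static int pack_pci_location(const char *dev_name, uint8_t out[4])
{
	unsigned int seg, bus, dev, fn;

	if (sscanf(dev_name, "%x:%x:%x.%x", &seg, &bus, &dev, &fn) != 4)
		return -1;

	out[0] = seg & 0xff;                        /* segment, low byte  */
	out[1] = (seg >> 8) & 0xff;                 /* segment, high byte */
	out[2] = bus & 0xff;                        /* bus number         */
	out[3] = ((dev & 0x1f) << 3) | (fn & 0x7);  /* dev[7:3], fn[2:0]  */
	return 0;
}

/*
 * Hypothetical helper (not in the PR): build the ipmitool command line.
 * Returns 0 on success, -1 if the output would not fit in buf.
 */
static int format_add_sel(char *buf, size_t len, uint8_t sensor,
			  const uint8_t loc[4])
{
	int n = snprintf(buf, len,
			 "/usr/bin/ipmitool raw 0x0a 0x44 0x00 0x00 0xc0 "
			 "0x00 0x00 0x00 0x00 0x3a 0xcd 0x00 0xc0 "
			 "0x%02x 0x%02x 0x%02x 0x%02x 0x%02x",
			 sensor, loc[0], loc[1], loc[2], loc[3]);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With these helpers, a device name such as "000d:03:1f.7" packs to the bytes 0x0d 0x00 0x03 0xff, and a malformed name is rejected before any command is built.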

#define BUF_LEN 1024

int ras_aer_event_handler(struct trace_seq *s,
@@ -67,9 +155,6 @@ int ras_aer_event_handler(struct trace_seq *s,
struct tm *tm;
struct ras_aer_event ev;
char buf[BUF_LEN];
char ipmi_add_sel[105];
uint8_t sel_data[5];
int seg, bus, dev, fn;

/*
* Newer kernels (3.10-rc1 or upper) provide an uptime clock.
@@ -132,24 +217,20 @@ int ras_aer_event_handler(struct trace_seq *s,
switch (severity_val) {
case HW_EVENT_AER_UNCORRECTED_NON_FATAL:
ev.error_type = "Uncorrected (Non-Fatal)";
sel_data[0] = 0xca;
break;
case HW_EVENT_AER_UNCORRECTED_FATAL:
ev.error_type = "Uncorrected (Fatal)";
sel_data[0] = 0xca;
break;
case HW_EVENT_AER_CORRECTED:
ev.error_type = "Corrected";
sel_data[0] = 0xbf;
break;
default:
ev.error_type = "Unknown severity";
sel_data[0] = 0xbf;
}
trace_seq_puts(s, ev.error_type);

/* Insert data into the SGBD */
#ifdef HAVE_SQLITE3
/* Insert data into the SGBD */
ras_store_aer_event(ras, &ev);
#endif

@@ -159,28 +240,16 @@ int ras_aer_event_handler(struct trace_seq *s,
#endif

#ifdef HAVE_AMP_NS_DECODE
/*
* Get PCIe AER error source seg/bus/dev/fn and save it into
* BMC OEM SEL, ipmitool raw 0x0a 0x44 is IPMI command-Add SEL
* entry, please refer IPMI specificaiton chapter 31.6. 0xcd3a
* is manufactuer ID(ampere),byte 12 is sensor num(CE is 0xBF,
* UE is 0xCA), byte 13~14 is segment number, byte 15 is bus
* number, byte 16[7:3] is device number, byte 16[2:0] is
* function number
*/
sscanf(ev.dev_name, "%x:%x:%x.%x", &seg, &bus, &dev, &fn);

sel_data[1] = seg & 0xff;
sel_data[2] = (seg & 0xff00) >> 8;
sel_data[3] = bus;
sel_data[4] = (((dev & 0x1f) << 3) | (fn & 0x7));

sprintf(ipmi_add_sel,
"ipmitool raw 0x0a 0x44 0x00 0x00 0xc0 0x00 0x00 0x00 0x00 0x3a 0xcd 0x00 0xc0 0x%02x 0x%02x 0x%02x 0x%02x 0x%02x",
sel_data[0], sel_data[1], sel_data[2], sel_data[3], sel_data[4]);

system(ipmi_add_sel);
/* Give a chance to report the AER error through IPMI */
ras_report_aer_ipmi(severity_val, &ev);
#endif

return 0;
}

void ras_aer_handler_init(void)
{
#ifdef HAVE_AMP_NS_DECODE
ras_report_aer_ipmi_init();
#endif
}
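
Review note: ras_report_aer_ipmi_init() and ras_report_aer_ipmi() both apply the same three-part test to the value returned by system(). A minimal sketch of that test as a single predicate follows; `command_succeeded()` is a hypothetical name, not something this PR defines. It also includes <sys/wait.h> explicitly, which is the POSIX header that declares WIFEXITED and WEXITSTATUS.

```c
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (not in the PR): interpret the status returned
 * by system(). The command counts as successful only if system()
 * itself did not fail (-1), the child exited normally rather than
 * being killed by a signal, and its exit code was 0.
 */
static int command_succeeded(int status)
{
	return status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

For the grep pipelines used during init, this is what distinguishes "pattern found" (exit code 0) from "pattern not found" or "command could not run" (non-zero exit code or abnormal termination).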
1 change: 1 addition & 0 deletions ras-aer-handler.h
@@ -26,4 +26,5 @@ int ras_aer_event_handler(struct trace_seq *s,
struct pevent_record *record,
struct event_format *event, void *context);

void ras_aer_handler_init(void);
#endif
1 change: 1 addition & 0 deletions ras-events.c
@@ -824,6 +824,7 @@ int handle_ras_events(int record_events)
"ras", "mc_event");

#ifdef HAVE_AER
ras_aer_handler_init();
rc = add_event_handler(ras, pevent, page_size, "ras", "aer_event",
ras_aer_event_handler, NULL, AER_EVENT);
if (!rc)