-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ipmitool to be used with AER only on Ampere Altra platforms #56
base: master
Are you sure you want to change the base?
Conversation
When monitoring AER errors, verify if we are on an Altra platform to use ipmitool when dealing with this type of errors. Needs both --enable-aer and --enable-amp-ns-decode Fixes: 738bafa ("Add error handling for Ampere-specific errors.") Signed-off-by: William Roche <[email protected]>
The new functionalities added for Ampere platforms with commit 738bafa ("Add error handling for Ampere-specific errors.") introduced 2 problems:
The first problem is the most important to solve. To do so, we introduce an initialization function in the AER code to check if ipmitool should be used (present and on the appropriate platform), this function task is to set a boolean flag indicating if ipmitool needs to be used. |
Take Jason Tian's feedback into account to provide more details about the conditions allowing the use of ipmitool on Altra or Altra Max platforms. Signed-off-by: William Roche <[email protected]>
Feedback provided by Jason about ras-aer-handler.c changes:
The new commit "Enrich ras_report_aer_ipmi_init() comments." I just added is a suggestion to address that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code can identify different platform from Ampere, I am ok to this change.
This pull request is still pending, and I'm available to give any detail if needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple of things that require changes. Basically:
- dependencies from command line tools should be set via
./configure
; - don't rely on lscpu/dmidecode/grep - location of such tools are system-dependent and the additional complexity under
configure.ac
would just make harder to compile it. Instead, just open the needed file(s) from /sys/class/dmi/id/ and/or read the contents of /proc/cpuinfo - seeking for the corresponding CPU model, board name, etc.
980e51e
to
4e1fffb
Compare
Fix the following compile errors that occurs when building against musl: ras-events.c: In function 'read_ras_event_all_cpus': ras-events.c:366:16: error: 'PATH_MAX' undeclared (first use in this function) 366 | char pipe_raw[PATH_MAX]; | ^~~~~~~~ ras-events.c: In function 'handle_ras_events_cpu': ras-events.c:564:16: error: 'PATH_MAX' undeclared (first use in this function) 564 | char pipe_raw[PATH_MAX]; | Signed-off-by: Stijn Tintel <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Removes trailing spaces at the end of a line from file location and fixes --layout option to parse dimm nodes to get the size of each dimm from ras-mc-ctl. Issue is reported mchehab#43 Where '> ras-mc-ctl --layout' reports all 0s With this change the layout option prints the correct dimm sizes > sudo ras-mc-ctl --layout +-----------------------------------------------+ | mc0 | | csrow0 | csrow1 | csrow2 | csrow3 | ----------+-----------------------------------------------+ ... channel7: | 16384 MB | 0 MB | 0 MB | 0 MB | channel6: | 16384 MB | 0 MB | 0 MB | 0 MB | ... ----------+-----------------------------------------------+ Signed-off-by: Muralidhara M K <[email protected]> Signed-off-by: Naveen Krishna Chatradhi <[email protected]> Cc: Yazen Ghannam <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]> Link: https://lkml.kernel.org/r/[email protected]/
Signed-off-by: Justin Vreeland <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Steven Johnson <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
syslog is only used when the daemon runs in backround mode this service is configured to run in foreground mode Signed-off-by: Mauro Carvalho Chehab <[email protected]>
The data type of sprintf called in the function uuid_le() is mismatch. Arm64 compiler force it to unsigned char by default, and can work normally. But if someone compile it with the option -fsigned-char, the function can't work correctly. Signed-off-by: Xiaofei Tan <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
It will record event even the option -r is not provided for hip08. It is not right, and fix it. Signed-off-by: Xiaofei Tan <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
It is not right to use '%d' to print uint8_t and uint16_t, although there is no function issue. Change to use '%hhu' and '%hu' separately. Signed-off-by: Xiaofei Tan <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Add some modules supported by hisi common error section. Besides, HHA is the module for some old platform, and it takes the same place of MATA, so remove it. Signed-off-by: Xiaofei Tan <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Cleanup files that are generated at build time from the *.in input files. Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Sort of sync this file from Fedora's upstream, addressing some bugs sysfsdir install bugs. After such change, the main difference would be that, in Fedora, it uses different config settings, depending at the architecture: -%configure --enable-all --with-sysconfdefdir=%{_sysconfdir}/sysconfig +%ifarch %{arm} aarch64 +%configure --enable-sqlite3 --enable-aer --enable-mce --enable-extlog --enable-devlink --enable-diskerror --enable-abrt-report --enable-non-standard --enable-arm --enable-hisi-ns-decode --with-sysconfdefdir=%{_sysconfdir}/sysconfig +%else +%configure --enable-sqlite3 --enable-aer --enable-mce --enable-extlog --enable-devlink --enable-diskerror --enable-abrt-report --with-sysconfdefdir=%{_sysconfdir}/sysconfig +%endif Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Since Linux 5.18-rc1 a new block tracepoint called block_rq_error is available for tracing disk error events dedicatedly. Currently rasdaemon is using block_rq_complete which also traces successful cases. It incurs excessive tracing logs and somehow overhead since the event is triggered quite often. Use the new tracepoint for disk error reporting, and the new trace point has the same format as block_rq_complete. Signed-off-by: Yang Shi <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
The version used is glibc specific therefore make it so and provide a fallback for non-glibc systems Signed-off-by: Khem Raj <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Found with covscan. Signed-off-by: Aristeu Rozanski <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Found with covscan. Signed-off-by: Aristeu Rozanski <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
…rectly We could just have an empty string but keeping the format could prevent issues if someone is actually parsing this. Found with covscan. v2: fixed the timestamp as pointed by Robert Elliott Signed-off-by: Aristeu Rozanski <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
When the corrected errors exceed the set limit in cycle, try to offline the related cpu core. Signed-off-by: Shengwei Luo <[email protected]> Signed-off-by: Junchong Pan <[email protected]> Signed-off-by: Lei Feng <[email protected]> Signed-off-by: Xiaofei Tan <[email protected]> Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
When the recoverable errors in cpu core occurred, try to offline the related cpu core. Signed-off-by: Shengwei Luo <[email protected]> Signed-off-by: Junchong Pan <[email protected]> Signed-off-by: Lei Feng <[email protected]> Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
autoreconf is producing a compile file. Ignore it on git status. Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Should be auto-filling the release information and upload a source distro package tarball. Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Use my own upload release asset logic, as it is known to work already on ZBar. Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Rasdaemon used for a long time an early version of this library, with the code embedded directly into its code. The rationale is that the library was not officially released on that time, but this has long changed. So, instead, just use the library directly. Signed-off-by: Mauro Carvalho Chehab <[email protected]>
With the function rename due to the usage of libtraceevent library, adjust some indentations. No functional changes. Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Now that rasdaemon is using the libtraceevent library, we can get rid of our own fork. Signed-off-by: Mauro Carvalho Chehab <[email protected]>
This is needed to build newest version of rasdaemon. Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Use autoupdate 2.71, in order to get rid of obsoleted macros. Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Ensure that all modules are enabled on "make distcheck". Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Nowadays, we're only using github in practice for development. Let it clearer at the documentation. Signed-off-by: Mauro Carvalho Chehab <[email protected]>
The error events are not received in the rasdaemon since kernel 6.1-rc6. This issue is firstly detected and reported, when testing the CXL error events in the rasdaemon. Debugging showed, poll() on trace_pipe_raw in the ras-events.c do not return and this issue is seen after the commit 42fb0a1e84ff525ebe560e2baf9451ab69127e2b ("tracing/ring-buffer: Have polling block on watermark"). This issue is also verified using a test application for poll() and select() on per_cpu trace_pipe_raw. There is also a bug reported on this issue, https://lore.kernel.org/all/[email protected]/ This issue occurs for the per_cpu case, which calls the ring_buffer_poll_wait(), in kernel/trace/ring_buffer.c, with the buffer_percent > 0 and then wait until the percentage of pages are available. The default value set for the buffer_percent is 50 in the kernel/trace/trace.c. However poll() does not return even met the percentage of pages condition. As a fix, rasdaemon set buffer_percent as 0 through the /sys/kernel/debug/tracing/instances/rasdaemon/buffer_percent, then the task will wake up as soon as data is added to any of the specific cpu buffer and poll() on per_cpu/cpuX/trace_pipe_raw does not block indefinitely. Dependency on the kernel fix commit 3e46d910d8acf94e5360126593b68bf4fee4c4a1("tracing: Fix poll() and select() do not work on per_cpu trace_pipe and trace_pipe_raw") Signed-off-by: Shiju Jose <[email protected]>
Mock now makes mandatory to add the install dir, otherwise it refuses to build. So, add it. Signed-off-by: Mauro Carvalho Chehab <[email protected]>
As we're not not bunding libtraceevent inside RASdaemon, packaging it now requires it. Signed-off-by: Mauro Carvalho Chehab <[email protected]>
That allows git??b to better parse it. Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
It is missing an entry about new labels. Also, version is at the wrong place. Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
make dist-bzip2 requires configure to work, which, in turn, depends on having some tools installed. Signed-off-by: Mauro Carvalho Chehab <[email protected]>
configure scripts need to be runnable with a POSIX-compliant /bin/sh. On many (but not all!) systems, /bin/sh is provided by Bash, so errors like this aren't spotted. Notably Debian defaults to /bin/sh provided by dash which doesn't tolerate such bashisms as '=='. This retains compatibility with bash. Fixes configure warnings/errors like: ``` checking for libtraceevent... yes ./configure: 13430: test: x: unexpected operator ./configure: 13439: test: x: unexpected operator ``` Signed-off-by: Sam James <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Fix for regression in ras_mc_create_table() if some cpus are offline at the system start Issue: Regression in the ras_mc_create_table() if some of the cpus are offline at the system start when run the rasdaemon. This issue is reproducible in ras_mc_create_table() with decode and record non-standard events and reproducible sometimes with ras_mc_create_table() for the standard events. Also in the multi thread way, there is memory leak in ras_mc_event_opendb() as struct sqlite3_priv *priv and sqlite3 *db allocated/initialized per thread, but stored in the common struct ras_events ras in pthread data, which is shared across the threads. Reason: when the system starts with some of the cpus offline and then run the rasdaemon, read_ras_event_all_cpus() exit with error and switch to the multi thread way. However read() in read_ras_event() return error in threads for each of the offline CPUs and does clean up including calling ras_mc_event_closedb(). Since the 'struct ras_events ras' passed in the pthread_data to each of the threads is common, struct sqlite3_priv *priv and sqlite3 *db allocated/ initialized per thread and stored in the common 'struct ras_events ras', are getting overwritten in each ras_mc_event_opendb()(which called from pthread per cpu), result memory leak. Also when ras_mc_event_closedb() is called in the above error case from the threads corresponding to the offline cpus, close the sqlite3 *db and free sqlite3_priv *priv stored in the common 'struct ras_events ras', result regression when accessing priv->db in the ras_mc_create_table() from another context later. Solution: In ras_mc_event_opendb(), allocate struct sqlite3_priv *priv, init sqlite3 *db and create tables common for the threads with shared 'struct ras_events ras' based on a reference count and free them in the same way. Also protect critical code ras_mc_event_opendb() and ras_mc_event_closedb() using mutex in the multi thread case from any regression caused by the thread pre-emption. Reported-by: Lei Feng <[email protected]> Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Move definition for BIT() and BIT_ULL() to the common file ras-record.h Signed-off-by: Shiju Jose <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Dave Jiang <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Add support to log and record the CXL poison events. The corresponding Kernel patches here: https://lore.kernel.org/linux-cxl/[email protected]/ Presently for logging only, could be extended for the policy based recovery action for the frequent poison events depending on the above kernel patches. Signed-off-by: Shiju Jose <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Add support to log and record the CXL AER uncorrectable errors. The corresponding Kernel patches are here: https://lore.kernel.org/linux-cxl/166974401763.1608150.5424589924034481387.stgit@djiang5-desk3.ch.intel.com/T/#t https://lore.kernel.org/lkml/[email protected]/T/ It was found that the header log data to be converted to the big-endian format to correctly store in the SQLite DB likely because the SQLite database seems uses the big-endian storage. Signed-off-by: Shiju Jose <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Dave Jiang <[email protected]># Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Add support to log and record the CXL AER correctable errors. The corresponding Kernel patches are here: https://lore.kernel.org/linux-cxl/166974401763.1608150.5424589924034481387.stgit@djiang5-desk3.ch.intel.com/T/#t https://lore.kernel.org/linux-cxl/[email protected]/T/#t Signed-off-by: Shiju Jose <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Dave Jiang <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
When monitoring AER errors, verify if we are on an Altra platform to use ipmitool when dealing with this type of errors. Needs both --enable-aer and --enable-amp-ns-decode Fixes: 738bafa ("Add error handling for Ampere-specific errors.") Signed-off-by: William Roche <[email protected]>
Sorry that it took me so long to come back with an updated version of this fix. Please let me know how you would prefer to proceed with this request ? |
When monitoring AER errors, verify if we are on an Altra platform
to use ipmitool when dealing with this type of errors.
Needs both --enable-aer and --enable-amp-ns-decode
Fixes: 738bafa ("Add error handling for Ampere-specific errors.")
Signed-off-by: William Roche [email protected]