Caplin Platform Diagnostics

Caplin Platform Diagnostics is a collection of Bash scripts that collect diagnostics on a running or crashed Caplin Platform component.

The scripts automate a series of common Linux diagnostic commands that Caplin Support ask customers to run when raising a support request (see Send diagnostic information to Caplin Support on the Caplin website).

Caplin Platform Diagnostics is made available under an MIT licence.

Requirements

The Caplin Platform Diagnostics scripts have the following requirements:

  • CentOS/RHEL 6 or 7
  • GNU Debugger: $ sudo yum install gdb
  • Red Hat OpenJDK 8 (full JDK, not just JRE): $ sudo yum install java-1.8.0-openjdk-devel

Installation

Copy (or symlink) the two diagnostic scripts, caplin-process-diagnostics.sh and caplin-corefile-diagnostics.sh, to a directory on your executable path, for example ~/bin/ or /usr/local/bin/.
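
The installation step can be sketched as a small shell function. This is an illustration only: the function name install_diagnostics and both directory arguments are assumptions, not part of the Caplin scripts.

```shell
# Symlink both diagnostics scripts from a source directory into a directory
# on your PATH. Pass an absolute source path so that the links resolve.
# The function name and both arguments are illustrative assumptions.
install_diagnostics() {
  local src_dir="$1"
  local dest_dir="$2"
  local script
  mkdir -p "$dest_dir" || return 1
  for script in caplin-process-diagnostics.sh caplin-corefile-diagnostics.sh; do
    # -s creates a symbolic link; -f replaces a link that already exists.
    ln -sf "$src_dir/$script" "$dest_dir/$script" || return 1
  done
}
```

For example, `install_diagnostics "$PWD" "$HOME/bin"` links the scripts from the current directory into ~/bin/.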

Quick start

To run diagnostics on a process, follow the steps below:

  1. Install dependencies, if not already installed:

    $ sudo yum install gdb java-1.8.0-openjdk-devel
    
  2. Run the script below as the same user as the target process (run time: 20 seconds):

    $ caplin-process-diagnostics.sh <pid>
    

    For full details and options, see Running diagnostics on a process.

  3. Upload the generated tar file and any log files requested by Caplin Support to Caplin's File Upload Facility.

To run diagnostics on a core-file, follow the steps below:

  1. Install dependencies, if not already installed:

    $ sudo yum install gdb
    
  2. Run the script below:

    $ caplin-corefile-diagnostics.sh <corefile>
    

    For full details and options, see Running diagnostics on a core file.

  3. Upload the generated tar file and any log files requested by Caplin Support to Caplin's File Upload Facility.

Running diagnostics on a core file

The caplin-corefile-diagnostics.sh script collates diagnostics for a core file dumped by a crashed Caplin Platform component.

The diagnostics collated include all the files Caplin Support require to analyse the core file: the component binary, the core file, and all shared libraries referenced in the core file. For the full list of information collated, see Information collated, below.

Run this script on the crashed component's host, or, if this is not possible, on an identically configured host (same operating system and Java versions).

After running the script, log in to Caplin's secure File Upload Facility and upload the following files:

  • Tar archive generated by the caplin-corefile-diagnostics.sh script
  • Java virtual machine log and error files (if available):
    • HotSpot JVM error file (hs_err_pid<process-id>.log)
    • Heap dump file (var/java_pid<process-id>.hprof)
    • Garbage collection log (var/gc.log)
  • Caplin log files for the period of the incident
  • Caplin configuration files

Requirements

This script has the following requirements:

  • CentOS/RHEL 6 or 7
  • GNU Debugger (gdb RPM package)
  • Write permission to the current directory
  • Run on the crashed component's host or, if this is not possible, on an identically configured host (same operating system and Java versions)

Usage

Syntax: caplin-corefile-diagnostics.sh core [binary]

  • core: path to the core file dumped by the crashed process.
  • binary: path to the crashed process's binary. Defaults to the path of the binary recorded in the core file.

Run as: any user

Runtime: < 1 minute

Output: diagnostics-<hostname>-<binary>-<core-file>-<timestamp>.tar.gz

Information collated

This script collates the following information:

| Diagnostic | Dependencies | User |
| --- | --- | --- |
| /etc/os-release | - | - |
| /etc/redhat-release | - | - |
| /etc/security/limits.conf | - | - |
| /etc/security/limits.d/* | - | - |
| ulimit -aS output | - | - |
| ulimit -aH output | - | - |
| uname -a output | - | - |
| df output for binary's 'var' directory | - | - |
| Caplin dfw versions output | Binary is in a DFW | - |
| Core file | - | - |
| Core file backtrace | gdb RPM package | - |
| Core file libraries | gdb RPM package | - |
| Component binary | - | - |
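
For orientation, the core file backtrace and library list can be approximated by hand with GDB's batch mode. The sketch below is an assumption about roughly equivalent commands, not the script's actual implementation; the function names core_backtrace and core_libraries are hypothetical.

```shell
# Print a backtrace of every thread recorded in a core file.
# Usage: core_backtrace <binary> <core>
core_backtrace() {
  command -v gdb >/dev/null 2>&1 || { echo "gdb not found" >&2; return 127; }
  # --batch runs the given commands and exits without an interactive prompt.
  gdb --batch -ex 'thread apply all bt' "$1" "$2"
}

# List the shared libraries referenced by a core file.
# Usage: core_libraries <binary> <core>
core_libraries() {
  command -v gdb >/dev/null 2>&1 || { echo "gdb not found" >&2; return 127; }
  gdb --batch -ex 'info sharedlibrary' "$1" "$2"
}
```

For example, `core_backtrace bin/rttpd core.4972 > core.4972.backtrace.out` writes the backtraces to a file.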

Example

The example below collates diagnostics for a core file, core.4972, dumped by a Liberator binary, rttpd:

$ ./caplin-corefile-diagnostics.sh ~/dfw1/servers/Liberator/core.4972

Caplin Core-file Diagnostics
============================

Host:            server1
Core:            /home/caplin/dfw1/servers/Liberator/core.4972
Binary:          /home/caplin/dfw1/servers/Liberator/bin/rttpd
GDB installed:   1
Script temp dir: diagnostics-server1-rttpd-core.4972-20190916104354

Recording /etc/os-release
Recording /etc/redhat-release
Recording 'uname -a' output
Recording 'df' output for /home/caplin/dfw1/servers/Liberator/var
Recording 'dfw versions' output
Getting thread backtraces from core.4972
Getting list of libraries referenced by core.4972
Copying libraries referenced by core.4972

DONE

Files collected:

  core.4972
  core.4972.backtrace.out
  core.4972.libs.tar
  dfw-versions.out
  diagnostics.log
  libs-list.out
  libs-list.txt
  os-release
  redhat-release
  rttpd
  uname.out

Archiving files to diagnostics-server1-rttpd-core.4972-20190916104354.tar.gz

Please login to https://www.caplin.com/account/uploads
and upload the archive to Caplin Support.

Running diagnostics on a process

The caplin-process-diagnostics.sh script collates diagnostics for a process without terminating the process.

The script's run time is 20 seconds for the default set of diagnostics. Optional diagnostics take longer, and their run time varies. For example, the run time for the optional GDB core dump (--gcore) depends on the size of the target process in memory, and on the host's disk I/O and CPU performance.

For the full list of information collated, see Information collated, below.

After running the script, log in to Caplin's secure File Upload Facility and upload the following files:

  • Tar archive generated by the caplin-process-diagnostics.sh script
  • Java virtual machine log files (if available):
    • Garbage collection log (var/gc.log)
  • Caplin log files for the period of the incident
  • Caplin configuration files

Requirements

The main dependency is the GNU Debugger (gdb package). This is required for generating stack traces and a core dump.

If any requirements are missing when you run the script, the script lists the missing dependencies and asks if you wish to continue. If you choose to continue, the script skips any diagnostics with missing dependencies.

All diagnostics:

  • CentOS/RHEL 6 or 7
  • Write permission to the current directory

GDB core dump and backtrace diagnostics:

  • gdb RPM package
  • Free disk space greater than the process's virtual memory
  • CentOS/RHEL 7: SELinux boolean deny_ptrace set to off (if SELinux is enabled and enforcing).
  • CentOS/RHEL 7: Yama kernel module sysctl setting kernel.yama.ptrace_scope set to 0, 1, or 2.

JVM diagnostics:

  • java-1.8.0-openjdk-devel RPM package. This package installs the full JDK, which includes the jcmd diagnostic tool.

Optional strace diagnostic:

  • strace RPM package. Only required if requested by Caplin Support.
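
If you want to check for the required commands before running the script, a minimal sketch follows. The function name missing_commands is hypothetical, and the script performs its own checks, so this is a convenience only.

```shell
# Print the name of each command that is not available on PATH.
# Usage: missing_commands <command>...
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}
```

For example, `missing_commands gdb jcmd jstat strace` prints nothing when all four commands are installed.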

Usage

Syntax: caplin-process-diagnostics.sh [options] pid

  • pid: process identifier of the running component
  • Options:
    • --gcore: include the optional GDB core file dump diagnostic. Only include this diagnostic when requested by Caplin Support.
    • --jvm-heap: include the optional JVM heap dump diagnostic. Halts the JVM temporarily for the duration of the diagnostic. Only include this diagnostic when requested by Caplin Support.
    • --jvm-class-histogram: include the optional JVM class histogram diagnostic. Halts the JVM temporarily for the duration of the diagnostic. Only include this diagnostic when requested by Caplin Support.
    • --strace: include the optional strace diagnostic. Only include this diagnostic when requested by Caplin Support.
    • --help: display help and exit
    • --version: display version and exit

Run as:

  • CentOS/RHEL 6: the process's user
  • CentOS/RHEL 7:
    • kernel.yama.ptrace_scope=0: the process's user
    • kernel.yama.ptrace_scope=1: root (required for core dump, thread backtraces, and strace)
    • kernel.yama.ptrace_scope=2: root (required for core dump, thread backtraces, and strace)
    • kernel.yama.ptrace_scope=3: the process's user (core dump, thread backtraces, and strace prohibited for all users)
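
The mapping above can be expressed as a small lookup, sketched below. The function name run_as_for_ptrace_scope is hypothetical; the script itself advises you when root privileges are required.

```shell
# Map a kernel.yama.ptrace_scope value to the user that should run the script.
# Usage: run_as_for_ptrace_scope <scope>
run_as_for_ptrace_scope() {
  case "$1" in
    0)   echo "process user" ;;
    1|2) echo "root" ;;
    3)   echo "process user (ptrace-based diagnostics prohibited)" ;;
    *)   echo "unknown ptrace_scope: $1" >&2; return 1 ;;
  esac
}
```

On a host with the Yama module loaded, the current value can be read with `sysctl -n kernel.yama.ptrace_scope` and passed to the function.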

Runtime: 20s for the default set of diagnostics

Output: diagnostics-<hostname>-<binary>-<pid>-<timestamp>.tar.gz
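
To locate the archive from another script, the name can be rebuilt from its parts. The sketch below assumes the timestamp format YYYYmmddHHMMSS, inferred from the example output in this document; the function name diagnostics_archive_name is hypothetical.

```shell
# Build the archive name from its parts.
# $1 hostname, $2 binary name, $3 process ID, $4 timestamp (assumed YYYYmmddHHMMSS)
diagnostics_archive_name() {
  printf 'diagnostics-%s-%s-%s-%s.tar.gz\n' "$1" "$2" "$3" "$4"
}
```

Before uploading, you can list an archive's contents with `tar -tzf <archive>`.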

Information collated

Default diagnostics:

| Diagnostic | Dependencies | User |
| --- | --- | --- |
| /etc/os-release | - | - |
| /etc/redhat-release | - | - |
| uname -a output | - | - |
| /proc/sys/kernel/core_pattern | - | - |
| /proc/sys/kernel/core_uses_pid | - | - |
| /proc/<pid>/limits | - | - |
| /etc/security/limits.conf | - | - |
| /etc/security/limits.d/* | - | - |
| top output for the system (5 seconds) | - | - |
| top output for the process (5 seconds) | - | - |
| df output for the process's <working-dir>/var directory | - | - |
| free output | - | - |
| vmstat output (5 seconds) | - | - |
| Caplin dfw info output | Process binary is in a DFW | - |
| Caplin dfw status output | Process binary is in a DFW | - |
| Caplin dfw versions output | Process binary is in a DFW | - |
| JVM jcmd <pid> Thread.print output | jcmd JDK command | Note 1 |
| JVM jcmd <pid> GC.heap_info output | jcmd JDK command | Note 1 |
| JVM jcmd <pid> VM.system_properties output | jcmd JDK command | Note 1 |
| JVM jcmd <pid> VM.flags output | jcmd JDK command | Note 1 |
| JVM jcmd <pid> PerfCounter.print output | jcmd JDK command | Note 1 |
| JVM jstat -gc <pid> output | jstat JDK command | Note 1 |
| JVM jstat -gcutil <pid> output | jstat JDK command | Note 1 |
| GDB thread backtrace | gdb RPM package | Note 2 |
| Process binary | - | - |

Optional diagnostics (only enable if requested by Caplin Support):

| Diagnostic | Dependencies | User |
| --- | --- | --- |
| GDB core-file dump, backtrace, and libraries | gdb RPM package | Note 2 |
| JVM jcmd <pid> GC.heap_dump output | jcmd JDK command | Note 1 |
| JVM jcmd <pid> GC.class_histogram output | jcmd JDK command | Note 1 |
| strace output (system-call logging) | strace RPM package | Note 2 |

Note 1: JVM diagnostics must be run as the process's user. If you run the script as root, then the script uses sudo to run the JVM diagnostics as the process's user.

Note 2: GDB thread backtraces, GDB core dump, and strace can be run as the process's user, unless prohibited by the Yama kernel module (introduced in CentOS/RHEL 7). The script will advise you if root privileges are required to run these diagnostics.
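
Note 1 can be illustrated with a sketch that runs a jcmd command as the owner of the target JVM. This is an assumption about one way to do it on Linux, not the script's actual code; the function names process_uid and jcmd_as_owner are hypothetical.

```shell
# Print the numeric user ID that owns a running process (reads Linux /proc).
# Usage: process_uid <pid>
process_uid() {
  stat -c %u "/proc/$1"
}

# Run a jcmd diagnostic command against a JVM as the JVM's owner.
# Usage: jcmd_as_owner <pid> <jcmd-command>
jcmd_as_owner() {
  local pid="$1"
  local uid
  shift
  uid=$(process_uid "$pid") || return 1
  if [ "$(id -u)" = "$uid" ]; then
    # Already running as the process's user.
    jcmd "$pid" "$@"
  else
    # sudo accepts a numeric user ID when it is prefixed with '#'.
    sudo -u "#$uid" jcmd "$pid" "$@"
  fi
}
```

For example, `jcmd_as_owner 4972 Thread.print` prints the JVM's thread dump.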

Performance impact

The default set of diagnostics includes only one diagnostic that directly impacts the performance of the target process:

  • GDB thread backtrace: the target process is halted temporarily for less than 1 second for each backtrace.
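
The example output later in this document shows three backtraces taken one second apart. A repeated capture of that kind can be sketched as below; the helper name repeat_with_pause is hypothetical and the script's own loop may differ.

```shell
# Run a command a fixed number of times, pausing between runs.
# Usage: repeat_with_pause <count> <pause-seconds> <command> [args...]
repeat_with_pause() {
  local count="$1"
  local pause="$2"
  local i=1
  shift 2
  while [ "$i" -le "$count" ]; do
    "$@" || return $?
    # Pause between runs, but not after the final run.
    [ "$i" -lt "$count" ] && sleep "$pause"
    i=$((i + 1))
  done
}
```

For example, `repeat_with_pause 3 1 gdb --batch -p 4972 -ex 'thread apply all bt'` attaches GDB to process 4972 three times, halting the process briefly each time.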

The optional diagnostics have a potentially greater performance impact and should only be enabled when requested by Caplin Support:

  • GDB core dump: the target process is halted temporarily for the time it takes the gcore command to write the process's virtual memory to a core file. The execution time is determined by the size of the process's virtual memory (ps -o vsz= -q <process-id>) and the host's CPU and I/O performance.

  • strace: slows performance of the target process for the duration of the diagnostic (40 seconds).

  • JVM heap dump: halts the JVM temporarily for the duration of the diagnostic.

  • JVM class histogram: halts the JVM temporarily for the duration of the diagnostic.
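
Before enabling the GDB core dump, you may want to confirm that free disk space exceeds the process's virtual memory. The sketch below is one way to compare the two values; all three function names are hypothetical, and it uses `ps -p` rather than `ps -q` for wider compatibility.

```shell
# Succeed if available space exceeds the process's virtual memory size.
# $1 virtual memory size in KiB, $2 available disk space in KiB
has_space_for_core() {
  [ "$2" -gt "$1" ]
}

# Print the virtual memory size of a process in KiB.
process_vsz_kb() {
  ps -o vsz= -p "$1" | tr -d ' '
}

# Print the space available on a directory's file system in KiB.
available_kb() {
  # -P forces one line per file system; column 4 is the available space.
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}
```

For example, `has_space_for_core "$(process_vsz_kb 4972)" "$(available_kb .)" || echo "Not enough disk space for a core dump"`.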

Example

The example below collates diagnostics for a Liberator running as process 4972:

$ ./caplin-process-diagnostics.sh 4972

Caplin Process Diagnostics
==========================

Process ID:      4972
Process binary:  /home/caplin/dfw1/kits/Liberator/Liberator-7.1.9-313149/bin/rttpd

Script user:     same user as process 4972
Script temp dir: ./diagnostics-server1-rttpd-4972-20190916102608

Recording /etc/redhat-release
Recording 'uname -a' output
Recording /proc/sys/kernel/core_pattern
Recording /proc/sys/kernel/core_uses_pid
Recording /proc/4972/limits
Recording 'top' output (5 seconds)
Recording 'top' output for process 4972 (5 seconds)
Recording process 4972 limits (/proc/4972/limits)
Recording 'df' output for /home/caplin/dfw1/servers/Liberator/var
Recording 'free' output
Recording 'vmstat' output (5 seconds)
Recording 'dfw info' output
Recording 'dfw status' output
Recording 'dfw versions' output
1/3: Dumping GDB thread backtraces for process 4972
  Sleeping for 1 second...
2/3: Dumping GDB thread backtraces for process 4972
  Sleeping for 1 second...
3/3: Dumping GDB thread backtraces for process 4972
1/3: Dumping JVM stack trace for process 4972
  Sleeping for 1 second...
2/3: Dumping JVM stack trace for process 4972
  Sleeping for 1 second...
3/3: Dumping JVM stack trace for process 4972
Recording JVM heap info
Recording JVM properties
Recording JVM flags
Recording JVM performance counters
Recording JVM jstat GC output

DONE

Files collected:
  df.out
  dfw-info.out
  dfw-status.out
  dfw-versions.out
  diagnostics.log
  free.out
  jvm-flags
  jvm-heapinfo
  jvm-jstat-gc
  jvm-jstat-gcutil
  jvm-perfcounter
  jvm-props
  jvm-stacktrace-20190916102810.out
  jvm-stacktrace-20190916102811.out
  jvm-stacktrace-20190916102812.out
  proc-4972-limits
  proc-sys-kernel-core_pattern
  proc-sys-kernel-core_uses_pid
  redhat-release
  rttpd-backtrace-20190916102806.out
  rttpd-backtrace-20190916102808.out
  rttpd-backtrace-20190916102809.out
  top-4972.out
  top.out
  uname.out
  vmstat.out

Archiving files to diagnostics-server1-rttpd-4972-20190916102608.tar.gz

Please login to https://www.caplin.com/account/uploads
and upload the archive to Caplin Support.