-
Notifications
You must be signed in to change notification settings - Fork 35
450.obdiag‐analyze‐guide
本文适用于独立部署 obdiag 的场景,使用 obdiag analyze
命令可帮助 OceanBase 数据库相关的诊断信息分析,目前支持对 OceanBase 的日志进行分析,找出发生过的错误信息。
如果是 OBD 部署的集群,可直接通过 OBD 侧诊断命令组对所选集群进行信息收集,详细命令可参考 诊断工具命令组。
obdiag analyze <analyze type> [options]
analyze type
包含如下:
- log:一键分析 OceanBase 的日志。
- flt_trace: 一键全链路诊断。
使用该命令可以一键在线分析 OceanBase 集群的日志,或者通过 --files
开启离线分析模式。
- 本文所指的在线分析指的是 OceanBase 集群在线运行状态,日志分布在各个 OBServer 节点上。
- 本文所指的离线分析模式是
--files
参数传递下,可以分析已经收集到 obdiag 部署机器上的 OBServer 日志。 - 需要确保已经在 obdiag 配置文件
config.yml
中配置好需要收集节点的登录信息。相关的详细配置介绍,参见 obdiag 配置。
obdiag analyze log [options]
选项说明如下:
选项名 | 是否必选 | 数据类型 | 默认值 | 说明 |
---|---|---|---|---|
--from | 否 | string | 默认为空 | 日志收集的开始时间,格式为: yyyy-mm-dd hh:mm:ss 。 |
--to | 否 | string | 默认为空 | 日志收集的结束时间,格式为: yyyy-mm-dd hh:mm:ss 。 |
--since | 否 | string | 默认为空 | 日志收集最近的某段时间,格式为: \<n> <m|h|d> ,其中,n 表示待输入的时间数字,m 表示分钟,h 表示小时,d 表示天,例如 30m 表示分析最近 30 分钟的日志。 |
--scope | 否 | string | all | 选择分析的 OceanBase 集群日志类型,可配置值为 observer、election、rootservice、all 。 |
--grep | 否 | string | 默认为空 | 选择查询的关键字。 |
--store_dir | 否 | string | 默认为命令执行的当前路径 | 存储结果的本地路径。 |
--log_level | 否 | string | WARN | 选择分析的 OceanBase 集群日志等级,可配置的日志等级从低到高依次为 DEBUG, TRACE, INFO, WDIAG, WARN, EDIAG, ERROR 。分析的时候会分析包含选定等级以及以上等级的日志 |
--files | 否 | string | 默认为空 |
--files 是离线分析日志文件模式,一旦指定 --files 则进入离线分析日志模式,需要传递本 OceanBase 日志或者日志路径,离线分析模式下 --from 、--to 、--since 、--ob_install_dir 设置将不需要设置 |
-c | 否 | string | ~/.obdiag/config.yml |
配置文件路径 |
-
在线分析日志示例:
obdiag analyze log --from "2023-10-08 10:25:00" --to "2023-10-08 11:30:00" ... FileListInfo: +----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Node | LogList | +================+=======================================================================================================================================================================================================================+ | xx.xx.xx.xx | ['observer.log.20231008104204260', 'observer.log.20231008111305072', 'observer.log.20231008114410668', 'observer.log.wf.20231008104204260', 'observer.log.wf.20231008111305072', 'observer.log.wf.20231008114410668'] | +----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ... Analyze OceanBase Online Log Summary: +----------------+-----------+------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------+---------+ | Node | Status | FileName | ErrorCode | Message | Count | +================+===========+==============================================================================+=============+===============================================================================================================================+=========+ | xx.xx.xx.xx | Completed | analyze_pack_20231008171201/xx_xx_xx_xx/observer.log.20231008104204260 | -5006 | You have an error in your SQL syntax; check the manual that corresponds to your OceanBase version for the right syntax to use | 2 | +----------------+-----------+------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------+---------+ | xx.xx.xx.xx | Completed | analyze_pack_20231008171201/xx_xx_xx_xx/observer.log.20231008111305072 | -5006 | You have an error in your SQL syntax; check the manual that corresponds to your OceanBase version for the right syntax to use | 8 | +----------------+-----------+------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------+---------+ | xx.xx.xx.xx | Completed | analyze_pack_20231008171201/xx_xx_xx_xx/observer.log.20231008114410668 | -5006 | You have an error in your SQL syntax; check the manual that corresponds to your OceanBase version for the right syntax to use | 10 | +----------------+-----------+------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------+---------+ | xx.xx.xx.xx | Completed | analyze_pack_20231008171201/xx_xx_xx_xx/observer.log.20231008114410668 | -4009 | IO error | 20 | +----------------+-----------+------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------+---------+ For more details, please run cmd 'cat analyze_pack_20231008171201/result_details.txt'
-
快捷分析最近一段时间的日志:
# 在线分析最近一小时的日志,该指令执行的时候会从远程主机上拉取最近一小时的日志进行分析,诊断出出现过的错误 obdiag analyze log --since 1h # 在线分析最近 30 分钟的日志,该指令执行的时候会从远程主机上拉取最近30分钟的日志进行分析,诊断出出现过的错误 obdiag analyze log --since 30m
-
-
离线分析日志示例
$ ls -lh test/ -rw-r--r-- 1 admin staff 256M Oct 8 17:24 observer.log.20231008104204260 -rw-r--r-- 1 admin staff 256M Oct 8 17:24 observer.log.20231008111305072 -rw-r--r-- 1 admin staff 256M Oct 8 17:24 observer.log.20231008114410668 -rw-r--r-- 1 admin staff 18K Oct 8 17:24 observer.log.wf.20231008104204260 -rw-r--r-- 1 admin staff 19K Oct 8 17:24 observer.log.wf.20231008111305072 -rw-r--r-- 1 admin staff 18K Oct 8 17:24 observer.log.wf.20231008114410668 $ obdiag analyze log --files test/ Analyze OceanBase Offline Log Summary: +-----------+-----------+-----------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------+---------+ | Node | Status | FileName | ErrorCode | Message | Count | +===========+===========+=======================================================================+=============+===============================================================================================================================+=========+ | 127.0.0.1 | Completed | analyze_pack_20231008172144/127_0_0_1_/observer.log.20231008104204260 | -5006 | You have an error in your SQL syntax; check the manual that corresponds to your OceanBase version for the right syntax to use | 2 | +-----------+-----------+-----------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------+---------+ | 127.0.0.1 | Completed | analyze_pack_20231008172144/127_0_0_1_/observer.log.20231008111305072 | -5006 | You have an error in your SQL syntax; check the manual that corresponds to your OceanBase version for the right syntax to use | 8 | +-----------+-----------+-----------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------+---------+ | 127.0.0.1 | Completed | analyze_pack_20231008172144/127_0_0_1_/observer.log.20231008114410668 | -5006 | You have an error in your SQL syntax; check the manual that corresponds to your OceanBase version for the right syntax to use | 10 | +-----------+-----------+-----------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------+---------+ | 127.0.0.1 | Completed | analyze_pack_20231008172144/127_0_0_1_/observer.log.20231008114410668 | -4009 | IO error | 20 | +-----------+-----------+-----------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------+---------+ For more details, please run cmd 'cat analyze_pack_20231008172144/result_details.txt'
-
离线分析指定 observer 日志:
# 离线分析指定的日志文件 obdiag analyze log --files observer.log.20230831142211247
-
全链路诊断是什么:OceanBase 数据库是分布式数据库,因此调用链路复杂,当出现超时问题的时,往往无法快速定位是OceanBase 内部组件或是网络的问题,运维人员只能根据经验和 observer 日志进行分析。OB内核在4.0新增了trace.log日志,可以用于分析全链路诊断。全链路有两条路径,一条是从应用通过客户端(JDBC 或 OCI 等)下发请求给 ODP(代理服务器)访问 OBServer,访问结果返回给应用;另一条是从应用通过客户端(JDBC 或 OCI 等)直接访问 OBServer,访问结果返回给应用。全链路诊断是对全链路所有组件进行问题定位的诊断。
obdiag全链路诊断做了什么:
┌────────┐ ┌─────────────────────────────┐ ┌────────────────────────────┐
│ 服务器1 │------>│ 搜索并过滤flt_trace_id相关日志 │------>│ 回传过滤后的日志到obdiag的节点 │---┐
└────────┘ └─────────────────────────────┘ └────────────────────────────┘ │
┌────────┐ ┌─────────────────────────────┐ ┌────────────────────────────┐ │ ┌────────────────────────────────┐
│ 服务器2 │------>│ 搜索并过滤flt_trace_id相关日志 │------>│ 回传过滤后的日志到obdiag的节点 │---┼--->│ 按照父子关系聚合各节点日志输出链路树 │
└────────┘ └─────────────────────────────┘ └────────────────────────────┘ │ └────────────────────────────────┘
┌────────┐ ┌─────────────────────────────┐ ┌────────────────────────────┐ │
│ 服务器N │------>│ 搜索并过滤flt_trace_id相关日志 │------>│ 回传过滤后的日志到obdiag的节点 │---┘
└────────┘ └─────────────────────────────┘ └────────────────────────────┘
在gv$ob_sql_audit中,如果有明确的SQL语句可以通过通过query_sql查到疑似慢sql的 flt_trace_id, 例如:
OceanBase(root@test)>select query_sql, flt_trace_id from oceanbase.gv$ob_sql_audit where query_sql like 'select @@version_comment limit 1';
+----------------------------------+--------------------------------------+
| query_sql | flt_trace_id |
+----------------------------------+--------------------------------------+
| select @@version_comment limit 1 | 00060aa3-d607-f5f2-328b-388e17f687cb |
+----------------------------------+--------------------------------------+
1 row in set (0.001 sec)
其中flt_trace_id为00060aa3-d607-f5f2-328b-388e17f687cb.
或者你也可从obproxy、oceanbase的trace.log日志中找到flt_trace_id
head trace.log
[2023-12-07 22:20:07.242229] [489640][T1_L0_G0][T1][YF2A0BA2DA7E-00060BEC28627BEF-0-0] {"trace_id":"00060bec-275e-9832-e730-7c129f2182ac","name":"close_das_task","id":"00060bec-2a20-bf9e-56c9-724cb467f859","start_ts":1701958807240606,"end_ts":1701958807240607,"parent_id":"00060bec-2a20-bb5f-e03a-5da01aa3308b","is_follow":false}
其中 00060bec-275e-9832-e730-7c129f2182ac
就是其flt_trace_id。
需要确保已经在 obdiag 配置文件 config.yml
中配置好需要收集节点的登录信息。相关的详细配置介绍,参见 obdiag 配置。
obdiag config -hxx.xx.xx.xx -uroot@sys -p***** -P2881
obdiag analyze flt_trace [options]
选项说明如下:
选项名 | 是否必选 | 数据类型 | 默认值 | 说明 |
---|---|---|---|---|
--flt_trace_id | 是 | string | flt_trace_id | flt_trace_id从gv$ob_sql_audit中查看flt_trace_id或者直接通过trace.log查找 |
--store_dir | 否 | string | 默认为命令执行的当前路径 | 存储结果的本地路径。 |
--files | 否 | string | 默认为空 |
--files 是离线分析日志文件模式,一旦指定 --files 则进入离线分析日志模式 |
--top | 否 | string | 5 | 展示在全链路诊断结果中耗时最长的Top N叶子节点 |
--recursion | 否 | string | 8 | 最大递归的全链路诊断的层数 |
--output | 否 | string | 60 | 打印到屏幕上的结果行数,全量结果会完整的存储到结果文件中 |
-c | 否 | string | ~/.obdiag/config.yml |
配置文件路径 |
例子:
$ obdiag analyze flt_trace --flt_trace_id 000605b1-28bb-c15f-8ba0-1206bcc08aa3
root node id: 000605b1-28bb-c15f-8ba0-1206bcc08aa3
TOP time-consuming leaf span:
+---+----------------------------------+-------------+---------------------+
| ID| Span Name | Elapsed Time| NODE |
+---+----------------------------------+-------------+---------------------+
| 18| px_task | 2.758 ms | OBSERVER(xx.xx.xx.1)|
| 5 | pc_get_plan | 52 μs | OBSERVER(xx.xx.xx.1)|
| 16| do_local_das_task | 45 μs | OBSERVER(xx.xx.xx.1)|
| 10| do_local_das_task | 17 μs | OBSERVER(xx.xx.xx.1)|
| 17| close_das_task | 14 μs | OBSERVER(xx.xx.xx.1)|
+---+----------------------------------+-------------+---------------------+
Tags & Logs:
-------------------------------------
18 - px_task Elapsed: 2.758 ms
NODE:OBSERVER(xx.xx.xx.1)
tags: [{'group_id': 0}, {'qc_id': 1}, {'sqc_id': 0}, {'dfo_id': 1}, {'task_id': 1}]
5 - pc_get_plan Elapsed: 52 μs
NODE:OBSERVER(xx.xx.xx.1)
16 - do_local_das_task Elapsed: 45 μs
NODE:OBSERVER(xx.xx.xx.3)
10 - do_local_das_task Elapsed: 17 μs
NODE:OBSERVER(xx.xx.xx.1)
17 - close_das_task Elapsed: 14 μs
NODE:OBSERVER(xx.xx.xx.3)
Details:
+---+----------------------------------+-------------+---------------------+
| ID| Span Name | Elapsed Time| NODE |
+---+----------------------------------+-------------+---------------------+
| 1 | TRACE | - | - |
| 2 | └─com_query_process | 5.351 ms | OBPROXY(xx.xx.xx.1) |
| 3 | └─mpquery_single_stmt | 5.333 ms | OBSERVER(xx.xx.xx.1)|
| 4 | ├─sql_compile | 107 μs | OBSERVER(xx.xx.xx.1)|
| 5 | │ └─pc_get_plan | 52 μs | OBSERVER(xx.xx.xx.1)|
| 6 | └─sql_execute | 5.147 ms | OBSERVER(xx.xx.xx.1)|
| 7 | ├─open | 87 μs | OBSERVER(xx.xx.xx.1)|
| 8 | ├─response_result | 4.945 ms | OBSERVER(xx.xx.xx.1)|
| 9 | │ ├─px_schedule | 2.465 ms | OBSERVER(xx.xx.xx.1)|
| 10| │ │ ├─do_local_das_task | 17 μs | OBSERVER(xx.xx.xx.1)|
| 11| │ │ ├─px_task | 2.339 ms | OBSERVER(xx.xx.xx.2)|
| 12| │ │ │ ├─do_local_das_task | 54 μs | OBSERVER(xx.xx.xx.2)|
| 13| │ │ │ └─close_das_task | 22 μs | OBSERVER(xx.xx.xx.2)|
| 14| │ │ ├─do_local_das_task | 11 μs | OBSERVER(xx.xx.xx.1)|
| 15| │ │ ├─px_task | 2.834 ms | OBSERVER(xx.xx.xx.3)|
| 16| │ │ │ ├─do_local_das_task | 45 μs | OBSERVER(xx.xx.xx.3)|
| 17| │ │ │ └─close_das_task | 14 μs | OBSERVER(xx.xx.xx.3)|
| 18| │ │ └─px_task | 2.758 ms | OBSERVER(xx.xx.xx.1)|
| 19| │ ├─px_schedule | 1 μs | OBSERVER(xx.xx.xx.1)|
| 20| │ └─px_schedule | 1 μs | OBSERVER(xx.xx.xx.1)|
| ..| ...... | ... | ...... |
+---+----------------------------------+-------------+---------------------+
For more details, please run cmd ' cat analyze_flt_result/000605b1-28bb-c15f-8ba0-1206bcc08aa3.txt '
查看详情:
$ cat analyze_flt_result/000605b1-28bb-c15f-8ba0-1206bcc08aa3.txt
root node id: 000605b1-28bb-c15f-8ba0-1206bcc08aa3
TOP time-consuming leaf span:
+---+----------------------------------+-------------+---------------------+
| ID| Span Name | Elapsed Time| NODE |
+---+----------------------------------+-------------+---------------------+
| 18| px_task | 2.758 ms | OBSERVER(xx.xx.xx.1)|
| 5 | pc_get_plan | 52 μs | OBSERVER(xx.xx.xx.1)|
| 16| do_local_das_task | 45 μs | OBSERVER(xx.xx.xx.1)|
| 10| do_local_das_task | 17 μs | OBSERVER(xx.xx.xx.1)|
| 17| close_das_task | 14 μs | OBSERVER(xx.xx.xx.1)|
+---+----------------------------------+-------------+---------------------+
Tags & Logs:
-------------------------------------
18 - px_task Elapsed: 2.758 ms
NODE:OBSERVER(xx.xx.xx.1)
tags: [{'group_id': 0}, {'qc_id': 1}, {'sqc_id': 0}, {'dfo_id': 1}, {'task_id': 1}]
5 - pc_get_plan Elapsed: 52 μs
NODE:OBSERVER(xx.xx.xx.1)
16 - do_local_das_task Elapsed: 45 μs
NODE:OBSERVER(xx.xx.xx.3)
10 - do_local_das_task Elapsed: 17 μs
NODE:OBSERVER(xx.xx.xx.1)
17 - close_das_task Elapsed: 14 μs
NODE:OBSERVER(xx.xx.xx.3)
Details:
+---+----------------------------------+-------------+---------------------+
| ID| Span Name | Elapsed Time| NODE |
+---+----------------------------------+-------------+---------------------+
| 1 | TRACE | - | - |
| 2 | └─com_query_process | 5.351 ms | OBPROXY(xx.xx.xx.1) |
| 3 | └─mpquery_single_stmt | 5.333 ms | OBSERVER(xx.xx.xx.1)|
| 4 | ├─sql_compile | 107 μs | OBSERVER(xx.xx.xx.1)|
| 5 | │ └─pc_get_plan | 52 μs | OBSERVER(xx.xx.xx.1)|
| 6 | └─sql_execute | 5.147 ms | OBSERVER(xx.xx.xx.1)|
| 7 | ├─open | 87 μs | OBSERVER(xx.xx.xx.1)|
| 8 | ├─response_result | 4.945 ms | OBSERVER(xx.xx.xx.1)|
| 9 | │ ├─px_schedule | 2.465 ms | OBSERVER(xx.xx.xx.1)|
| 10| │ │ ├─do_local_das_task | 17 μs | OBSERVER(xx.xx.xx.1)|
| 11| │ │ ├─px_task | 2.339 ms | OBSERVER(xx.xx.xx.2)|
| 12| │ │ │ ├─do_local_das_task | 54 μs | OBSERVER(xx.xx.xx.2)|
| 13| │ │ │ └─close_das_task | 22 μs | OBSERVER(xx.xx.xx.2)|
| 14| │ │ ├─do_local_das_task | 11 μs | OBSERVER(xx.xx.xx.1)|
| 15| │ │ ├─px_task | 2.834 ms | OBSERVER(xx.xx.xx.3)|
| 16| │ │ │ ├─do_local_das_task | 45 μs | OBSERVER(xx.xx.xx.3)|
| 17| │ │ │ └─close_das_task | 14 μs | OBSERVER(xx.xx.xx.3)|
| 18| │ │ └─px_task | 2.758 ms | OBSERVER(xx.xx.xx.1)|
| 19| │ ├─px_schedule | 1 μs | OBSERVER(xx.xx.xx.1)|
| 20| │ └─px_schedule | 1 μs | OBSERVER(xx.xx.xx.1)|
| 21| └─close | 70 μs | OBSERVER(xx.xx.xx.1)|
| 22| └─end_transaction | 3 μs | OBSERVER(xx.xx.xx.1)|
+---+----------------------------------+-------------+---------------------+
Tags & Logs:
-------------------------------------
1 -
2 - com_query_process Elapsed: 5.351 ms
NODE:OBPROXY(xx.xx.xx.1)
tags: [{'sess_id': 3221487633}, {'action_name': ''}, {'module_name': ''}, {'client_info': ''}, {'receive_ts': 1695108311007659}, {'log_trace_id': 'YA9257F000001-000605B0441954BC-0-0'}]
3 - mpquery_single_stmt Elapsed: 5.333 ms
NODE:OBSERVER(xx.xx.xx.1)
4 - sql_compile Elapsed: 107 μs
NODE:OBSERVER(xx.xx.xx.1)
tags: [{'sql_text': 'select /*+parallel(2)*/ count(1) from t1 tt1, t1 tt2'}, {'sql_id': '797B7202BA69D4C2C77C12BFADDC19DC'}, {'database_id': 201001}, {'plan_hash': 150629045171310866}, {'hit_plan': True}]
5 - pc_get_plan Elapsed: 52 μs
NODE:OBSERVER(xx.xx.xx.1)
6 - sql_execute Elapsed: 5.147 ms
NODE:OBSERVER(xx.xx.xx.1)
7 - open Elapsed: 87 μs
NODE:OBSERVER(xx.xx.xx.1)
8 - response_result Elapsed: 4.945 ms
NODE:OBSERVER(xx.xx.xx.1)
9 - px_schedule Elapsed: 2.465 ms
NODE:OBSERVER(xx.xx.xx.1)
tags: [{'used_worker_cnt': 0}, {'qc_id': 1}, {'dfo_id': 2147483647}, {'used_worker_cnt': 0}, {'qc_id': 1}, {'dfo_id': 1}]
10 - do_local_das_task Elapsed: 17 μs
NODE:OBSERVER(xx.xx.xx.1)
11 - px_task Elapsed: 2.339 ms
NODE:OBSERVER(xx.xx.xx.2)
tags: [{'group_id': 0}, {'qc_id': 1}, {'sqc_id': 0}, {'dfo_id': 0}, {'task_id': 0}]
12 - do_local_das_task Elapsed: 54 μs
NODE:OBSERVER(xx.xx.xx.2)
13 - close_das_task Elapsed: 22 μs
NODE:OBSERVER(xx.xx.xx.2)
14 - do_local_das_task Elapsed: 11 μs
NODE:OBSERVER(xx.xx.xx.1)
15 - px_task Elapsed: 2.834 ms
NODE:OBSERVER(xx.xx.xx.3)
tags: [{'group_id': 0}, {'qc_id': 1}, {'sqc_id': 0}, {'dfo_id': 1}, {'task_id': 0}]
16 - do_local_das_task Elapsed: 45 μs
NODE:OBSERVER(xx.xx.xx.3)
17 - close_das_task Elapsed: 14 μs
NODE:OBSERVER(xx.xx.xx.3)
18 - px_task Elapsed: 2.758 ms
NODE:OBSERVER(xx.xx.xx.1)
tags: [{'group_id': 0}, {'qc_id': 1}, {'sqc_id': 0}, {'dfo_id': 1}, {'task_id': 1}]
19 - px_schedule Elapsed: 1 μs
NODE:OBSERVER(xx.xx.xx.1)
20 - px_schedule Elapsed: 1 μs
NODE:OBSERVER(xx.xx.xx.1)
21 - close Elapsed: 70 μs
NODE:OBSERVER(xx.xx.xx.1)
22 - end_transaction Elapsed: 3 μs
NODE:OBSERVER(xx.xx.xx.1)
tags: [{'trans_id': 0}]
OceanBase是原生分布式数据库系统,故障根因分析通常是比较繁琐的,因为涉及的因素可能有很多,如机器环境、配置参数、运行负载等等。专家在排查问题的时候需要获取大量的信息来分析故障,如何高效的获取故障场景下分散在各个节点的信息,挖掘出其中的关联性,帮助用户自助诊断问题便是obdiag的价值。