Add RSS statistics and timings collecting #298
Conversation
@avtikhon asked me for a suggestion on how to better track tarantool processes in test-run, so I created a simple sampling infrastructure for test-run as an example: 6e73ac4. It is not tested thoroughly, but at least I looked over the logs for different types of tests and it seems to track everything I would expect. Sadly, the register-unregister logs are not written to

diff --git a/lib/sampler.py b/lib/sampler.py
index 8bcc6ca..95dcec0 100644
--- a/lib/sampler.py
+++ b/lib/sampler.py
@@ -2,7 +2,7 @@ import os
import sys
import time
-from lib.colorer import color_log
+from lib.colorer import color_stdout as color_log
from lib.colorer import qa_notice
from lib.utils import format_process
The patchset is in good shape, but I would like to spin for a while around code readability. It is better to keep it good from scratch rather than fix it afterwards. I left my comments above.
Track tarantool and unit test executables that are run using test-run with metainformation: worker, test, test configuration and server name. Add a function that will be called every 0.1 seconds for each tracked process. The implementation tracks non-default servers and re-registers default servers that execute several tests (the 'core = tarantool' case). Part of #277
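The registration and periodic-callback idea from the commit message above can be sketched as follows. This is an illustrative simplification, not the actual test-run API: the real sampler passes more metainformation (worker, test configuration) and is driven by the dispatcher's timeout handling.

```python
class Sampler:
    """Minimal sketch: track processes and invoke a callback for each
    tracked process on a fixed interval (0.1 s in the patch)."""

    def __init__(self, sample_interval=0.1):
        self.sample_interval = sample_interval
        self._processes = dict()  # pid -> metainformation

    def register_process(self, pid, task_id, server_name):
        # Called when a tarantool or unit test executable is started;
        # a default server is re-registered for each new test it runs.
        self._processes[pid] = {'task_id': task_id, 'server': server_name}

    def unregister_process(self, pid):
        self._processes.pop(pid, None)

    def sample(self, callback):
        # Invoked roughly every sample_interval seconds by the event loop.
        for pid, meta in list(self._processes.items()):
            callback(pid, meta)


sampler = Sampler()
sampler.register_process(1234, 'app/sample.test.lua', 'default')
seen = []
sampler.sample(lambda pid, meta: seen.append((pid, meta['task_id'])))
print(seen)  # -> [(1234, 'app/sample.test.lua')]
```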
Pushed several fixups (mostly around naming and wording). Please apply them if you don't object. The fixup commits are arranged to ease squashing. Patches:

commit 38067df314b47d8c542f8ccf6453a128f3b2d6f2
Author: Alexander Turenko <[email protected]>
Date: Sun May 23 03:23:54 2021 +0300
FIXUP: Actualize comment in print_statistics()
diff --git a/listeners.py b/listeners.py
index 619121e..88d5bb4 100644
--- a/listeners.py
+++ b/listeners.py
@@ -104,13 +104,13 @@ class StatisticsWatcher(BaseWatcher):
fd.close()
def print_statistics(self):
+ """Print statistics and results of testing."""
# Prepare standalone subpath '<vardir>/statistics' for statistics files.
stats_dir = os.path.join(Options().args.vardir, 'statistics')
safe_makedirs(stats_dir)
self.print_rss_summary(stats_dir)
- """Returns are there failed tasks."""
if self.stats:
color_stdout('Statistics:\n', schema='test_var')
for short_status, cnt in self.stats.items():

commit ef322a34fc1fc089e21c104504391135551cc578
Author: Alexander Turenko <[email protected]>
Date: Sun May 23 02:27:47 2021 +0300
FIXUP: sampler: don't use private API
diff --git a/lib/sampler.py b/lib/sampler.py
index 92cbee3..ca977da 100644
--- a/lib/sampler.py
+++ b/lib/sampler.py
@@ -44,6 +44,10 @@ class SamplerWatcher(object):
def process_timeout(self, delta_seconds):
self._wakeup()
+ @property
+ def sample_interval(self):
+ return self._sample_interval
+
def _wakeup(self):
"""Invoke Sampler.sample() if enough time elapsed since
the previous call.
@@ -94,7 +98,7 @@ class Sampler:
@property
def sample_interval(self):
- return self._watcher._sample_interval
+ return self._watcher.sample_interval
@property
def watcher(self):

commit db21cb9fffa19d36bc7f1758ea084c3605de4255
Author: Alexander Turenko <[email protected]>
Date: Sun May 23 02:31:53 2021 +0300
FIXUP: sampler: fixups for is_enabled
* Use proc_stat_rss_supported() directly: the access is not frequent, we
don't need to cache the value in the instance's field.
* Convert from a method to a property (it is already accessed as a
property in dispatcher).
diff --git a/lib/sampler.py b/lib/sampler.py
index ca977da..753ef5b 100644
--- a/lib/sampler.py
+++ b/lib/sampler.py
@@ -79,7 +79,6 @@ class Sampler:
self._processes = dict()
self._rss_summary = dict()
- self.proc_stat_rss_supported = proc_stat_rss_supported()
def set_queue(self, queue, worker_id, worker_name):
# Called from a worker process (_run_worker()).
@@ -107,8 +106,9 @@ class Sampler:
'main test-run process')
return self._watcher
+ @property
def is_enabled(self):
- return self.proc_stat_rss_supported
+ return proc_stat_rss_supported()
def register_process(self, pid, task_id, server_name, worker_id=None,
worker_name=None):
diff --git a/listeners.py b/listeners.py
index 88d5bb4..d5ab0df 100644
--- a/listeners.py
+++ b/listeners.py
@@ -65,7 +65,7 @@ class StatisticsWatcher(BaseWatcher):
# RSS.
def print_rss_summary(self, stats_dir):
- if not self._sampler.is_enabled():
+ if not self._sampler.is_enabled:
return
rss_summary = self._sampler.rss_summary

commit 9dff43866ffff152119998046e0e4021f7583841
Author: Alexander Turenko <[email protected]>
Date: Sun May 23 03:13:02 2021 +0300
FIXUP: Adjust wording in memory statistics report
diff --git a/listeners.py b/listeners.py
index d5ab0df..f4bd9b7 100644
--- a/listeners.py
+++ b/listeners.py
@@ -73,9 +73,7 @@ class StatisticsWatcher(BaseWatcher):
# Print to stdout RSS statistics for all failed tasks.
if self.failed_tasks:
- color_stdout('RSS of the failed tasks in Mb:', schema='info')
- color_stdout(' (may skip tests run faster than {} sec):\n'.format(
- self._sampler.sample_interval), schema='info')
+ color_stdout('Occupied memory in failed tests (RSS, Mb):\n', schema='info')
for task in self.failed_tasks:
task_id = task[0]
if task_id in rss_summary:
@@ -83,17 +81,22 @@ class StatisticsWatcher(BaseWatcher):
self.prettify_task_name(task_id).ljust(self.field_size),
self.get_long_mark(task_id)),
schema='info')
+ color_stdout('\n')
# Print to stdout RSS statistics for some number of most it used tasks.
- color_stdout('Up to {} most RSS used tasks in Mb'.format(top_rss), schema='info')
- color_stdout(' (may skip tests run faster than {} sec):\n'.format(
- self._sampler.sample_interval), schema='info')
+ color_stdout('Top {} tests by occupied memory (RSS, Mb):\n'.format(
+ top_rss), schema='info')
results_sorted = sorted(rss_summary.items(), key=lambda x: x[1], reverse=True)
for task_id, rss in results_sorted[:top_rss]:
color_stdout('* %6.1f %s %s\n' % (float(rss) / 1024,
self.prettify_task_name(task_id).ljust(self.field_size),
self.get_long_mark(task_id)), schema='info')
- color_stdout('-' * 41, "\n", schema='separator')
+ color_stdout('\n')
+
+ color_stdout('(Tests quicker than {} seconds may be missed.)\n'.format(
+ self._sampler.sample_interval), schema='info')
+
+ color_stdout('-' * 81, "\n", schema='separator')
# Print RSS statistics to '<vardir>/statistics/rss.log' file.
filepath = os.path.join(stats_dir, 'rss.log')

commit 4eadef6f502d0df71e2e5c5afcaeb440d7b51908
Author: Alexander Turenko <[email protected]>
Date: Sun May 23 03:15:25 2021 +0300
FIXUP: sampler: adjust wording in rss_summary comment
diff --git a/lib/sampler.py b/lib/sampler.py
index 753ef5b..eb83d89 100644
--- a/lib/sampler.py
+++ b/lib/sampler.py
@@ -89,10 +89,7 @@ class Sampler:
@property
def rss_summary(self):
- """Return dictionary of collected maximum RSS values per each
- run task, set with task_id keys.
- """
-
+ """Task ID to maximum RSS mapping."""
return self._rss_summary
@property

commit 1df77515ec8fd2c75845236eb63dfa778256e1b2
Author: Alexander Turenko <[email protected]>
Date: Sun May 23 03:20:45 2021 +0300
FIXUP: Adjust wording in duration statistics
diff --git a/listeners.py b/listeners.py
index 231ea5c..d90c070 100644
--- a/listeners.py
+++ b/listeners.py
@@ -115,7 +115,8 @@ class StatisticsWatcher(BaseWatcher):
# Print to stdout durations for all failed tasks.
if self.failed_tasks:
- color_stdout('Duration of the failed tasks:\n', schema='info')
+ color_stdout('Duration of failed tests (seconds):\n',
+ schema='info')
for task in self.failed_tasks:
task_id = task[0]
if task_id in self.tasks_durations:
@@ -123,16 +124,18 @@ class StatisticsWatcher(BaseWatcher):
self.prettify_task_name(task_id).ljust(self.field_size),
self.get_long_mark(task_id)),
schema='info')
+ color_stdout('\n')
# Print to stdout durations for some number of most long tasks.
- color_stdout('Up to {} most long tasks in seconds:\n'.format(top_durations),
+ color_stdout('Top {} longest tests (seconds):\n'.format(top_durations),
schema='info')
results_sorted = sorted(self.tasks_durations.items(), key=lambda x: x[1], reverse=True)
for task_id, duration in results_sorted[:top_durations]:
color_stdout('* %6.2f %s %s\n' % (duration,
self.prettify_task_name(task_id).ljust(self.field_size),
self.get_long_mark(task_id)), schema='info')
- color_stdout('-' * 41, "\n", schema='separator')
+
+ color_stdout('-' * 81, "\n", schema='separator')
# Print durations statistics to '<vardir>/statistics/duration.log' file.
filepath = os.path.join(stats_dir, 'duration.log')

commit 44322a8ff1615c530f227461835bd2639a4217b7
Author: Alexander Turenko <[email protected]>
Date: Sun May 23 03:29:47 2021 +0300
FIXUP: Adjust wording: tasks_durations -> duration_stats
diff --git a/listeners.py b/listeners.py
index d90c070..107e57e 100644
--- a/listeners.py
+++ b/listeners.py
@@ -36,7 +36,7 @@ class StatisticsWatcher(BaseWatcher):
self.stats = dict()
self.field_size = 60
self._sampler = sampler
- self.tasks_durations = dict()
+ self.duration_stats = dict()
self.failed_tasks = []
self.get_logfile = get_logfile
self.long_tasks = set()
@@ -58,7 +58,7 @@ class StatisticsWatcher(BaseWatcher):
obj.result_checksum,
obj.show_reproduce_content))
- self.tasks_durations[obj.task_id] = obj.duration
+ self.duration_stats[obj.task_id] = obj.duration
def get_long_mark(self, task):
return '(long)' if task in self.long_tasks else ''
@@ -119,8 +119,8 @@ class StatisticsWatcher(BaseWatcher):
schema='info')
for task in self.failed_tasks:
task_id = task[0]
- if task_id in self.tasks_durations:
- color_stdout('* %6.2f %s %s\n' % (self.tasks_durations[task_id],
+ if task_id in self.duration_stats:
+ color_stdout('* %6.2f %s %s\n' % (self.duration_stats[task_id],
self.prettify_task_name(task_id).ljust(self.field_size),
self.get_long_mark(task_id)),
schema='info')
@@ -129,7 +129,7 @@ class StatisticsWatcher(BaseWatcher):
# Print to stdout durations for some number of most long tasks.
color_stdout('Top {} longest tests (seconds):\n'.format(top_durations),
schema='info')
- results_sorted = sorted(self.tasks_durations.items(), key=lambda x: x[1], reverse=True)
+ results_sorted = sorted(self.duration_stats.items(), key=lambda x: x[1], reverse=True)
for task_id, duration in results_sorted[:top_durations]:
color_stdout('* %6.2f %s %s\n' % (duration,
self.prettify_task_name(task_id).ljust(self.field_size),
@@ -137,12 +137,12 @@ class StatisticsWatcher(BaseWatcher):
color_stdout('-' * 81, "\n", schema='separator')
- # Print durations statistics to '<vardir>/statistics/duration.log' file.
+ # Print duration statistics to '<vardir>/statistics/duration.log' file.
filepath = os.path.join(stats_dir, 'duration.log')
fd = open(filepath, 'w')
- for task_id in self.tasks_durations:
+ for task_id in self.duration_stats:
fd.write("{} {}\n".format(self.prettify_task_name(task_id),
- self.tasks_durations[task_id]))
+ self.duration_stats[task_id]))
fd.close()
def print_statistics(self):
Found that some tests may fail due to lack of memory. Mostly it happens in CI on remote hosts. To be able to collect memory usage statistics, decided to add an RSS collecting routine, get_proc_stat_rss(), which parses /proc/&lt;worker pid&gt;/status for the 'VmRSS' value, the size of the resident memory portions. It consists of the three following parts (VmRSS = RssAnon + RssFile + RssShmem) [1]:

RssAnon - size of resident anonymous memory
RssFile - size of resident file mappings
RssShmem - size of resident shmem memory (includes SysV shm, mapping of tmpfs and shared anonymous mappings)

Decided that the best way for CI is not to run this RSS collecting routine on each command sent from test tasks, but to run it with a 0.1 second delay after the test task has started, to collect the task's maximum RSS value during its run. This delay is used to run routines in the 'SamplerWatcher' listener. Also found that a delay of 0.1 sec is completely enough to catch an RSS increase, based on the following check:

tarantool> require('clock').bench(function() local t = {} for i = 1, 1024^2 * 100 do t[i] = true end end)

which showed that 100 Mb of data is allocated in:

- 3.153877479 seconds on a CI test host
- 0.54504489 seconds on a fast local host

The main idea is to check all processes a test depends on at some point in time and keep the maximum RSS value reached by them. For this, the '_sample_process()' routine gets RSS for each currently alive process, and the '_sample()' routine sums RSS over each task's alive processes and checks whether this value is bigger than the one previously saved for the current task. Both routines are in the 'Sampler' class, which is called by the 'process_timeout()' routine from the 'SamplerWatcher' listener.

Also added the print_statistics() routine in the 'StatisticsWatcher' listener, which prints statistics to stdout after testing. It is used to print RSS usage for failed tasks and for up to 10 tasks that used the most RSS.
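The VmRSS parsing described above can be sketched like this. The helper names follow the commit message (get_proc_stat_rss); the exact signatures in test-run may differ, and /proc parsing only works on Linux, so the parsing logic is separated out and exercised on sample file content.

```python
def parse_vmrss_kb(status_text):
    """Return the VmRSS value in KiB parsed from the content of a
    /proc/<pid>/status file, or 0 if the field is absent."""
    for line in status_text.splitlines():
        if line.startswith('VmRSS:'):
            # Line format: 'VmRSS:\t  123456 kB'
            return int(line.split()[1])
    return 0


def get_proc_stat_rss(pid):
    """Sketch of the routine from the commit message: read RSS for a
    live process; return 0 if it already exited or /proc is missing."""
    try:
        with open('/proc/{}/status'.format(pid)) as f:
            return parse_vmrss_kb(f.read())
    except (OSError, IOError):
        return 0


example = 'Name:\ttarantool\nVmRSS:\t  102400 kB\nVmSwap:\t       0 kB\n'
print(parse_vmrss_kb(example))  # -> 102400
```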
Due to the 0.1 sec sampler delay, tests that run faster may not get RSS values at all and won't be printed. Created a new 'statistics' subdirectory in the 'vardir' path to save statistics files. The current patch uses it to save the 'rss.log' file with RSS values per tested task in the format:

&lt;test task name&gt; &lt;maximum RSS value&gt;

Closes #277

[1]: https://www.kernel.org/doc/html/latest/filesystems/proc.html
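The per-task maximum-RSS bookkeeping and the rss.log line format described above amount to the following sketch (illustrative helper name, not the actual test-run code):

```python
def update_rss_summary(rss_summary, task_id, task_rss):
    """Keep only the largest RSS value (KiB) ever observed per task,
    as _sample() does when comparing against the saved maximum."""
    if task_rss > rss_summary.get(task_id, 0):
        rss_summary[task_id] = task_rss


summary = {}
update_rss_summary(summary, 'box/tx.test.lua', 51200)
update_rss_summary(summary, 'box/tx.test.lua', 40960)  # smaller: ignored
# rss.log format: '<test task name> <maximum RSS value>'
lines = ['{} {}'.format(task, rss) for task, rss in summary.items()]
print(lines)  # -> ['box/tx.test.lua 51200']
```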
Decided to collect test run durations in a standalone file and print them to stdout after testing has finished. Durations for failed tasks and for up to 10 of the longest tasks are printed to stdout. For duration collecting, the 'StatisticsWatcher' listener is used, with the following routines:

process_result() - uses the 'WorkerTaskResult' queue message to collect failed tasks and the final duration of finished tasks. The duration measuring routine is added in the worker running the tests; its values are passed through the 'WorkerTaskResult' queue message via 'result_queue' into the listener.

print_statistics() - prints statistics to stdout after testing: durations for failed tasks and for up to 10 of the longest tasks.

We use the standalone 'statistics' directory in the 'vardir' path to save the 'duration.log' file with the duration of each tested task in the format:

&lt;test task name&gt; &lt;float duration in secs with 2 digits after point&gt;

Closes #286
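The "top N longest tasks" reporting described above boils down to a sort over the collected duration mapping; a minimal sketch (illustrative names, mirroring the `results_sorted` logic visible in the diffs):

```python
def top_longest(duration_stats, limit=10):
    """Return up to `limit` (task_id, duration) pairs, longest first,
    as print_statistics() does for the 'Top N longest tests' report."""
    return sorted(duration_stats.items(), key=lambda x: x[1],
                  reverse=True)[:limit]


durations = {'a.test.lua': 1.5, 'b.test.lua': 12.34, 'c.test.lua': 0.07}
for task_id, duration in top_longest(durations, limit=2):
    # Same '%6.2f' formatting as the report: 2 digits after the point.
    print('* %6.2f %s' % (duration, task_id))
```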
All patches LGTM, squashed it, thank you.
LGTM.
Updated the test-run submodule in tarantool in the following commits: 2.9.0-35-g213f480e7, 2.8.1-19-g67ff7cc5f, 2.7.2-15-gc35655113, 1.10.10-8-gfb804dcca.