
Add RSS statistics and timings collecting #298

Merged
merged 3 commits into master on May 23, 2021

Conversation

avtikhon
Contributor

@avtikhon avtikhon commented May 3, 2021

Patch set consists of 3 commits:

  1. sampler: add simple sampling infrastructure

Track tarantool and unit test executables that are run using test-run
with metainformation: worker, test, test configuration and server name.

Add a function that will be called each 0.1 second for each tracked
process.

The implementation tracks non-default servers and re-registers default
servers that execute several tests (the 'core = tarantool' case).

Part of #277
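The sampling infrastructure described above can be sketched as follows. This is a minimal illustration with hypothetical names (`register_process`, `_sample_process`, the metainformation layout); the actual code lives in test-run's lib/sampler.py.

```python
import time


class Sampler:
    """Sketch of a process registry polled each 0.1 second.

    Names and fields here are illustrative, not the real test-run API.
    """

    def __init__(self, interval=0.1):
        self.interval = interval
        self._processes = {}  # pid -> metainformation
        self._last_sample = 0

    def register_process(self, pid, task_id, server_name):
        # Re-registering an already known pid just updates its
        # metainformation: this covers a default server that executes
        # several tests (the 'core = tarantool' case).
        self._processes[pid] = {'task_id': task_id, 'server': server_name}

    def unregister_process(self, pid):
        self._processes.pop(pid, None)

    def sample(self):
        # Called from the event loop; do nothing if less than
        # 'interval' seconds passed since the previous sampling.
        now = time.time()
        if now - self._last_sample < self.interval:
            return
        self._last_sample = now
        for pid, meta in list(self._processes.items()):
            self._sample_process(pid, meta)

    def _sample_process(self, pid, meta):
        pass  # per-process data collection goes here
```

Re-registration simply overwrites the stored metainformation, so a long-lived default server is always attributed to the test it is currently running.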

  2. Add RSS statistics collecting

Found that some tests may fail due to lack of memory; mostly it happens
in CI on remote hosts. To collect memory usage statistics, decided to
add an RSS collecting routine get_proc_stat_rss(), which parses the
file:

      /proc/<worker pid>/status

for the RSS value 'VmRSS', the resident set size of the process. It
consists of the following three parts (VmRSS = RssAnon + RssFile +
RssShmem) [1]:

      RssAnon - size of resident anonymous memory
      RssFile - size of resident file mappings
      RssShmem - size of resident shmem memory (includes SysV shm, mapping of
                 tmpfs and shared anonymous mappings)
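A parser for this 'VmRSS' field might look like the sketch below. The function name follows the commit message; the exact error handling and the choice to return 0 for dead or unsupported processes are assumptions.

```python
def get_proc_stat_rss(pid):
    """Return VmRSS in KiB read from /proc/<pid>/status, or 0 if the
    value is unavailable.

    The kernel reports the value in lines like 'VmRSS:   1234 kB'.
    """
    rss = 0
    try:
        with open('/proc/%d/status' % pid) as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    rss = int(line.split()[1])
                    break
    except OSError:
        # The process exited or /proc is not available (e.g. macOS).
        pass
    return rss
```

Reading /proc is Linux-specific, which is why the real routine's availability is checked before enabling RSS collection at all.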

Decided that the best approach for CI is not to run this RSS collecting
routine for each command sent from test tasks, but to run it with a
0.1 second delay after the test task has started, collecting the task's
maximum RSS value during its run. This delay is used to run routines in
the 'SamplerWatcher' listener. Also found that a delay of 0.1 sec is
enough to catch RSS growth, as verified with the check:

      tarantool> require('clock').bench(function() local t = {} for i = 1, 1024^2 * 100 do t[i] = true end end)

which showed that 100 MB of data is allocated within seconds:

      - on CI test host: 3.153877479
      - on local fast host: 0.54504489

The main idea is to check all processes belonging to a test at some
point in time and track the maximum total RSS value they reach. For
this, the '_sample_process()' routine gets the RSS of each currently
alive process, and the '_sample()' routine sums the RSS of all alive
processes of a task and checks whether this sum is bigger than the
value previously saved for the current task. Both routines are in the
'Sampler' class, which is invoked by the 'process_timeout()' routine of
the 'SamplerWatcher' listener.
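The maximum-tracking logic can be illustrated with a small standalone function. Names and the signature are illustrative, not the actual test-run API; 'get_rss' stands in for a /proc-based RSS reader.

```python
def update_rss_summary(processes, get_rss, rss_summary):
    """One sampling pass.

    'processes' maps pid -> task_id; 'get_rss' returns the RSS of a
    pid; 'rss_summary' maps task_id -> maximum observed total RSS.
    """
    # Sum the RSS of all alive processes of each task.
    totals = {}
    for pid, task_id in processes.items():
        totals[task_id] = totals.get(task_id, 0) + get_rss(pid)
    # Keep the per-task maximum across sampling passes.
    for task_id, total in totals.items():
        if total > rss_summary.get(task_id, 0):
            rss_summary[task_id] = total
    return rss_summary
```

Because only the maximum of the summed values survives, a short RSS spike is recorded as long as it lasts at least one sampling interval.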

Also used the print_statistics() routine in the 'StatisticsWatcher'
listener, which prints statistics to stdout after testing. It is used
to print RSS usage for failed tasks and for up to 10 tasks with the
highest RSS usage. Created a new 'statistics' subdirectory in the
'vardir' path to save statistics files. The current patch uses it to
save there an 'rss.log' file with RSS values per tested task in the
format:

      <test task name> <maximum RSS value>

Closes #277

[1]: https://www.kernel.org/doc/html/latest/filesystems/proc.html
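Writing the 'rss.log' file in the described format could be sketched as below. The helper name is hypothetical, and Python 3's `os.makedirs(..., exist_ok=True)` stands in for the safe_makedirs() helper seen in the fixup diffs.

```python
import os


def write_rss_log(stats_dir, rss_summary):
    """Write '<stats_dir>/rss.log' with one
    '<test task name> <maximum RSS value>' line per task."""
    os.makedirs(stats_dir, exist_ok=True)
    with open(os.path.join(stats_dir, 'rss.log'), 'w') as f:
        for task_id, rss in rss_summary.items():
            f.write('{} {}\n'.format(task_id, rss))
```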

  3. Add tests duration collecting

Decided to collect test run durations in a standalone file and print
them to stdout after testing finishes. The stdout report shows
durations for failed tasks and for up to 10 longest tasks.

For duration collecting, used the 'StatisticsWatcher' listener, which
has the following routines:

      process_result()
        Uses the 'WorkerTaskResult' queue message to collect tasks that
        failed and final durations for finished tasks. The duration
        collecting routine is added in the worker running the tests; its
        values are passed via 'result_queue' through the 'WorkerTaskResult'
        queue message into the listener.

      print_statistics() - prints statistics to stdout after testing.
        Prints durations for failed tasks and for up to 10 longest tasks.
        We use a standalone 'statistics' directory in the 'vardir' path to
        save the 'duration.log' file with the duration of each tested task
        in the format:

          <test task name> <float duration in secs with 2 digits after point>

Closes #286
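The report described above (failed tasks first, then the longest ones) can be sketched as a standalone function. The names and the return-a-list-of-lines shape are illustrative, not the actual StatisticsWatcher code, which prints colored output directly.

```python
def top_duration_report(duration_stats, failed_task_ids, top=10):
    """Build duration report lines: failed tasks first, then the 'top'
    longest tasks, with durations formatted to two decimal places."""
    lines = []
    for task_id in failed_task_ids:
        if task_id in duration_stats:
            lines.append('* %6.2f %s' % (duration_stats[task_id], task_id))
    lines.append('Top {} longest tests (seconds):'.format(top))
    # Sort by duration, longest first, and keep only the top entries.
    longest = sorted(duration_stats.items(), key=lambda x: x[1],
                     reverse=True)
    for task_id, duration in longest[:top]:
        lines.append('* %6.2f %s' % (duration, task_id))
    return lines
```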

@avtikhon avtikhon requested review from Totktonada and kyukhin May 3, 2021 04:46
@avtikhon avtikhon self-assigned this May 3, 2021
@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch from 69133e9 to 09ba06e on May 3, 2021 05:18
@avtikhon avtikhon added the teamQ label May 4, 2021
@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch from 09ba06e to dd49db0 on May 4, 2021 13:35
@avtikhon avtikhon changed the title Add RSS statistics collecting Add RSS statistics and timings collecting May 4, 2021
@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch 10 times, most recently from 68547be to ae7cddb on May 5, 2021 13:26
@Totktonada
Member

@avtikhon asked me for a suggestion on how to better track tarantool processes in test-run, so I created a simple sampling infrastructure for test-run as an example: 6e73ac4.

It is not tested thoroughly, but at least I looked over the logs for different types of tests, and it seems to track everything I would expect.

Sadly, the register-unregister logs are not written to test/var/log/ due to #247, but we can print them to the terminal for debugging purposes this way:

diff --git a/lib/sampler.py b/lib/sampler.py
index 8bcc6ca..95dcec0 100644
--- a/lib/sampler.py
+++ b/lib/sampler.py
@@ -2,7 +2,7 @@ import os
 import sys
 import time
 
-from lib.colorer import color_log
+from lib.colorer import color_stdout as color_log
 from lib.colorer import qa_notice
 from lib.utils import format_process

@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch 3 times, most recently from fd9f9bf to 07f37da on May 11, 2021 06:15
@Totktonada
Member

The patchset is in good shape, but I would like to spin for a while on code readability. It is better to keep it good from scratch rather than fix it afterwards. I left my comments above.

@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch 3 times, most recently from eadd7bf to 06c4d5c on May 21, 2021 05:51
Track tarantool and unit test executables that are run using test-run
with metainformation: worker, test, test configuration and server name.

Add a function that will be called each 0.1 second for each tracked
process.

The implementation tracks non-default servers and re-registers default
servers that execute several tests (the 'core = tarantool' case).

Part of #277
@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch from 06c4d5c to f058f1f on May 21, 2021 06:32
@avtikhon avtikhon requested a review from Totktonada May 21, 2021 06:33
@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch 3 times, most recently from 202bce8 to d7afe4a on May 22, 2021 16:58
@avtikhon avtikhon requested a review from Totktonada May 22, 2021 17:02
@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch from d7afe4a to d965925 on May 22, 2021 17:17
@Totktonada Totktonada force-pushed the avtikhon/gh-qa-98-rss-new branch from d965925 to 44322a8 on May 23, 2021 00:47
@Totktonada
Member

Pushed several fixups (mostly around naming and wording). Please apply them if you don't object. The fixup commits are arranged to ease squashing.

Patches.
commit 38067df314b47d8c542f8ccf6453a128f3b2d6f2
Author: Alexander Turenko <[email protected]>
Date:   Sun May 23 03:23:54 2021 +0300

    FIXUP: Actualize comment in print_statistics()

diff --git a/listeners.py b/listeners.py
index 619121e..88d5bb4 100644
--- a/listeners.py
+++ b/listeners.py
@@ -104,13 +104,13 @@ class StatisticsWatcher(BaseWatcher):
         fd.close()
 
     def print_statistics(self):
+        """Print statistics and results of testing."""
         # Prepare standalone subpath '<vardir>/statistics' for statistics files.
         stats_dir = os.path.join(Options().args.vardir, 'statistics')
         safe_makedirs(stats_dir)
 
         self.print_rss_summary(stats_dir)
 
-        """Returns are there failed tasks."""
         if self.stats:
             color_stdout('Statistics:\n', schema='test_var')
         for short_status, cnt in self.stats.items():
commit ef322a34fc1fc089e21c104504391135551cc578
Author: Alexander Turenko <[email protected]>
Date:   Sun May 23 02:27:47 2021 +0300

    FIXUP: sampler: don't use private API

diff --git a/lib/sampler.py b/lib/sampler.py
index 92cbee3..ca977da 100644
--- a/lib/sampler.py
+++ b/lib/sampler.py
@@ -44,6 +44,10 @@ class SamplerWatcher(object):
     def process_timeout(self, delta_seconds):
         self._wakeup()
 
+    @property
+    def sample_interval(self):
+        return self._sample_interval
+
     def _wakeup(self):
         """Invoke Sampler.sample() if enough time elapsed since
            the previous call.
@@ -94,7 +98,7 @@ class Sampler:
 
     @property
     def sample_interval(self):
-        return self._watcher._sample_interval
+        return self._watcher.sample_interval
 
     @property
     def watcher(self):
commit db21cb9fffa19d36bc7f1758ea084c3605de4255
Author: Alexander Turenko <[email protected]>
Date:   Sun May 23 02:31:53 2021 +0300

    FIXUP: sampler: fixups for is_enabled
    
    * Use proc_stat_rss_supported() directly: the access is not frequent, we
      don't need to cache the value in the instance's field.
    * Convert from a method to a property (it is already accessed as a
      property in dispatcher).

diff --git a/lib/sampler.py b/lib/sampler.py
index ca977da..753ef5b 100644
--- a/lib/sampler.py
+++ b/lib/sampler.py
@@ -79,7 +79,6 @@ class Sampler:
 
         self._processes = dict()
         self._rss_summary = dict()
-        self.proc_stat_rss_supported = proc_stat_rss_supported()
 
     def set_queue(self, queue, worker_id, worker_name):
         # Called from a worker process (_run_worker()).
@@ -107,8 +106,9 @@ class Sampler:
                                'main test-run process')
         return self._watcher
 
+    @property
     def is_enabled(self):
-        return self.proc_stat_rss_supported
+        return proc_stat_rss_supported()
 
     def register_process(self, pid, task_id, server_name, worker_id=None,
                          worker_name=None):
diff --git a/listeners.py b/listeners.py
index 88d5bb4..d5ab0df 100644
--- a/listeners.py
+++ b/listeners.py
@@ -65,7 +65,7 @@ class StatisticsWatcher(BaseWatcher):
 
     # RSS.
     def print_rss_summary(self, stats_dir):
-        if not self._sampler.is_enabled():
+        if not self._sampler.is_enabled:
             return
 
         rss_summary = self._sampler.rss_summary
commit 9dff43866ffff152119998046e0e4021f7583841
Author: Alexander Turenko <[email protected]>
Date:   Sun May 23 03:13:02 2021 +0300

    FIXUP: Adjust wording in memory statistics report

diff --git a/listeners.py b/listeners.py
index d5ab0df..f4bd9b7 100644
--- a/listeners.py
+++ b/listeners.py
@@ -73,9 +73,7 @@ class StatisticsWatcher(BaseWatcher):
 
         # Print to stdout RSS statistics for all failed tasks.
         if self.failed_tasks:
-            color_stdout('RSS of the failed tasks in Mb:', schema='info')
-            color_stdout(' (may skip tests run faster than {} sec):\n'.format(
-                         self._sampler.sample_interval), schema='info')
+            color_stdout('Occupied memory in failed tests (RSS, Mb):\n', schema='info')
             for task in self.failed_tasks:
                 task_id = task[0]
                 if task_id in rss_summary:
@@ -83,17 +81,22 @@ class StatisticsWatcher(BaseWatcher):
                                  self.prettify_task_name(task_id).ljust(self.field_size),
                                  self.get_long_mark(task_id)),
                                  schema='info')
+            color_stdout('\n')
 
         # Print to stdout RSS statistics for some number of most it used tasks.
-        color_stdout('Up to {} most RSS used tasks in Mb'.format(top_rss), schema='info')
-        color_stdout(' (may skip tests run faster than {} sec):\n'.format(
-                     self._sampler.sample_interval), schema='info')
+        color_stdout('Top {} tests by occupied memory (RSS, Mb):\n'.format(
+                     top_rss), schema='info')
         results_sorted = sorted(rss_summary.items(), key=lambda x: x[1], reverse=True)
         for task_id, rss in results_sorted[:top_rss]:
             color_stdout('* %6.1f %s %s\n' % (float(rss) / 1024,
                          self.prettify_task_name(task_id).ljust(self.field_size),
                          self.get_long_mark(task_id)), schema='info')
-        color_stdout('-' * 41, "\n", schema='separator')
+        color_stdout('\n')
+
+        color_stdout('(Tests quicker than {} seconds may be missed.)\n'.format(
+                     self._sampler.sample_interval), schema='info')
+
+        color_stdout('-' * 81, "\n", schema='separator')
 
         # Print RSS statistics to '<vardir>/statistics/rss.log' file.
         filepath = os.path.join(stats_dir, 'rss.log')
commit 4eadef6f502d0df71e2e5c5afcaeb440d7b51908
Author: Alexander Turenko <[email protected]>
Date:   Sun May 23 03:15:25 2021 +0300

    FIXUP: sampler: adjust wording in rss_summary comment

diff --git a/lib/sampler.py b/lib/sampler.py
index 753ef5b..eb83d89 100644
--- a/lib/sampler.py
+++ b/lib/sampler.py
@@ -89,10 +89,7 @@ class Sampler:
 
     @property
     def rss_summary(self):
-        """Return dictionary of collected maximum RSS values per each
-           run task, set with task_id keys.
-        """
-
+        """Task ID to maximum RSS mapping."""
         return self._rss_summary
 
     @property
commit 1df77515ec8fd2c75845236eb63dfa778256e1b2
Author: Alexander Turenko <[email protected]>
Date:   Sun May 23 03:20:45 2021 +0300

    FIXUP: Adjust wording in duration statistics

diff --git a/listeners.py b/listeners.py
index 231ea5c..d90c070 100644
--- a/listeners.py
+++ b/listeners.py
@@ -115,7 +115,8 @@ class StatisticsWatcher(BaseWatcher):
 
         # Print to stdout durations for all failed tasks.
         if self.failed_tasks:
-            color_stdout('Duration of the failed tasks:\n', schema='info')
+            color_stdout('Duration of failed tests (seconds):\n',
+                         schema='info')
             for task in self.failed_tasks:
                 task_id = task[0]
                 if task_id in self.tasks_durations:
@@ -123,16 +124,18 @@ class StatisticsWatcher(BaseWatcher):
                                  self.prettify_task_name(task_id).ljust(self.field_size),
                                  self.get_long_mark(task_id)),
                                  schema='info')
+            color_stdout('\n')
 
         # Print to stdout durations for some number of most long tasks.
-        color_stdout('Up to {} most long tasks in seconds:\n'.format(top_durations),
+        color_stdout('Top {} longest tests (seconds):\n'.format(top_durations),
                      schema='info')
         results_sorted = sorted(self.tasks_durations.items(), key=lambda x: x[1], reverse=True)
         for task_id, duration in results_sorted[:top_durations]:
             color_stdout('* %6.2f %s %s\n' % (duration,
                          self.prettify_task_name(task_id).ljust(self.field_size),
                          self.get_long_mark(task_id)), schema='info')
-        color_stdout('-' * 41, "\n", schema='separator')
+
+        color_stdout('-' * 81, "\n", schema='separator')
 
         # Print durations statistics to '<vardir>/statistics/duration.log' file.
         filepath = os.path.join(stats_dir, 'duration.log')
commit 44322a8ff1615c530f227461835bd2639a4217b7
Author: Alexander Turenko <[email protected]>
Date:   Sun May 23 03:29:47 2021 +0300

    FIXUP: Adjust wording: tasks_durations -> duration_stats

diff --git a/listeners.py b/listeners.py
index d90c070..107e57e 100644
--- a/listeners.py
+++ b/listeners.py
@@ -36,7 +36,7 @@ class StatisticsWatcher(BaseWatcher):
         self.stats = dict()
         self.field_size = 60
         self._sampler = sampler
-        self.tasks_durations = dict()
+        self.duration_stats = dict()
         self.failed_tasks = []
         self.get_logfile = get_logfile
         self.long_tasks = set()
@@ -58,7 +58,7 @@ class StatisticsWatcher(BaseWatcher):
                                       obj.result_checksum,
                                       obj.show_reproduce_content))
 
-        self.tasks_durations[obj.task_id] = obj.duration
+        self.duration_stats[obj.task_id] = obj.duration
 
     def get_long_mark(self, task):
         return '(long)' if task in self.long_tasks else ''
@@ -119,8 +119,8 @@ class StatisticsWatcher(BaseWatcher):
                          schema='info')
             for task in self.failed_tasks:
                 task_id = task[0]
-                if task_id in self.tasks_durations:
-                    color_stdout('* %6.2f %s %s\n' % (self.tasks_durations[task_id],
+                if task_id in self.duration_stats:
+                    color_stdout('* %6.2f %s %s\n' % (self.duration_stats[task_id],
                                  self.prettify_task_name(task_id).ljust(self.field_size),
                                  self.get_long_mark(task_id)),
                                  schema='info')
@@ -129,7 +129,7 @@ class StatisticsWatcher(BaseWatcher):
         # Print to stdout durations for some number of most long tasks.
         color_stdout('Top {} longest tests (seconds):\n'.format(top_durations),
                      schema='info')
-        results_sorted = sorted(self.tasks_durations.items(), key=lambda x: x[1], reverse=True)
+        results_sorted = sorted(self.duration_stats.items(), key=lambda x: x[1], reverse=True)
         for task_id, duration in results_sorted[:top_durations]:
             color_stdout('* %6.2f %s %s\n' % (duration,
                          self.prettify_task_name(task_id).ljust(self.field_size),
@@ -137,12 +137,12 @@ class StatisticsWatcher(BaseWatcher):
 
         color_stdout('-' * 81, "\n", schema='separator')
 
-        # Print durations statistics to '<vardir>/statistics/duration.log' file.
+        # Print duration statistics to '<vardir>/statistics/duration.log' file.
         filepath = os.path.join(stats_dir, 'duration.log')
         fd = open(filepath, 'w')
-        for task_id in self.tasks_durations:
+        for task_id in self.duration_stats:
             fd.write("{} {}\n".format(self.prettify_task_name(task_id),
-                                      self.tasks_durations[task_id]))
+                                      self.duration_stats[task_id]))
         fd.close()
 
     def print_statistics(self):

avtikhon added 2 commits May 23, 2021 04:25
Found that some tests may fail due to lack of memory; mostly it happens
in CI on remote hosts. To collect memory usage statistics, decided to
add an RSS collecting routine get_proc_stat_rss(), which parses the
file:

  /proc/<worker pid>/status

for the RSS value 'VmRSS', the resident set size of the process. It
consists of the following three parts (VmRSS = RssAnon + RssFile +
RssShmem) [1]:

  RssAnon - size of resident anonymous memory
  RssFile - size of resident file mappings
  RssShmem - size of resident shmem memory (includes SysV shm, mapping of
             tmpfs and shared anonymous mappings)

Decided that the best approach for CI is not to run this RSS collecting
routine for each command sent from test tasks, but to run it with a
0.1 second delay after the test task has started, collecting the task's
maximum RSS value during its run. This delay is used to run routines in
the 'SamplerWatcher' listener. Also found that a delay of 0.1 sec is
enough to catch RSS growth, as verified with the check:

  tarantool> require('clock').bench(function() local t = {} for i = 1, 1024^2 * 100 do t[i] = true end end)

which showed that 100 MB of data is allocated within seconds:
  - on CI test host: 3.153877479
  - on local fast host: 0.54504489

The main idea is to check all processes belonging to a test at some
point in time and track the maximum total RSS value they reach. For
this, the '_sample_process()' routine gets the RSS of each currently
alive process, and the '_sample()' routine sums the RSS of all alive
processes of a task and checks whether this sum is bigger than the
value previously saved for the current task. Both routines are in the
'Sampler' class, which is invoked by the 'process_timeout()' routine of
the 'SamplerWatcher' listener.

Also used the print_statistics() routine in the 'StatisticsWatcher'
listener, which prints statistics to stdout after testing. It is used
to print RSS usage for failed tasks and for up to 10 tasks with the
highest RSS usage. Since the sampler collects RSS values with a 0.1 sec
delay, tests that run faster may have no RSS values at all and won't be
printed.

Created a new 'statistics' subdirectory in the 'vardir' path to save
statistics files. The current patch uses it to save there an 'rss.log'
file with RSS values per tested task in the format:

  <test task name> <maximum RSS value>

Closes #277

[1]: https://www.kernel.org/doc/html/latest/filesystems/proc.html
Decided to collect test run durations in a standalone file and print
them to stdout after testing finishes. The stdout report shows
durations for failed tasks and for up to 10 longest tasks.

For duration collecting, used the 'StatisticsWatcher' listener, which
has the following routines:

  process_result()
    Uses the 'WorkerTaskResult' queue message to collect tasks that
    failed and final durations for finished tasks. The duration
    collecting routine is added in the worker running the tests; its
    values are passed via 'result_queue' through the 'WorkerTaskResult'
    queue message into the listener.

  print_statistics() - prints statistics to stdout after testing.
    Prints durations for failed tasks and for up to 10 longest tasks.
    We use a standalone 'statistics' directory in the 'vardir' path to
    save the 'duration.log' file with the duration of each tested task
    in the format:

      <test task name> <float duration in secs with 2 digits after point>

Closes #286
@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch from 44322a8 to b80762a on May 23, 2021 04:28
@avtikhon
Contributor Author

Pushed several fixups (mostly around naming and wording). Please apply them if you don't object. The fixup commits are arranged to ease squashing.

All patches LGTM; squashed them, thank you.

Member

@Totktonada Totktonada left a comment


LGTM.

@Totktonada Totktonada merged commit 0fe05d2 into master May 23, 2021
@Totktonada Totktonada deleted the avtikhon/gh-qa-98-rss-new branch May 23, 2021 09:55
@Totktonada
Copy link
Member

Updated the test-run submodule in tarantool in the following commits: 2.9.0-35-g213f480e7, 2.8.1-19-g67ff7cc5f, 2.7.2-15-gc35655113, 1.10.10-8-gfb804dcca.

Successfully merging this pull request may close these issues:

- Collect tests timings
- Add memory usage profiling for running tests