
Add RSS statistics and timings collecting #298

Merged
merged 3 commits into master on May 23, 2021

Conversation

avtikhon
Contributor

@avtikhon avtikhon commented May 3, 2021

Patch set consists of 3 commits:

  1. sampler: add simple sampling infrastructure

Track tarantool and unit test executables that are run using test-run
with metainformation: worker, test, test configuration and server name.

Add a function that will be called each 0.1 second for each tracked
process.

The implementation tracks non-default servers and re-registers default
servers that execute several tests (the 'core = tarantool' case).

Part of #277
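The sampling infrastructure described above can be sketched as follows. This is a minimal illustration with hypothetical names (`register_process`, `_sample_process`, the metainformation layout); the actual code lives in test-run's lib/sampler.py.

```python
import time


class Sampler:
    """Sketch of a process registry polled each 0.1 second.

    Names and fields here are illustrative, not the real test-run API.
    """

    def __init__(self, interval=0.1):
        self.interval = interval
        self._processes = {}  # pid -> metainformation
        self._last_sample = 0

    def register_process(self, pid, task_id, server_name):
        # Re-registering an already known pid just updates its
        # metainformation: this covers a default server that executes
        # several tests (the 'core = tarantool' case).
        self._processes[pid] = {'task_id': task_id, 'server': server_name}

    def unregister_process(self, pid):
        self._processes.pop(pid, None)

    def sample(self):
        # Called from the event loop; do nothing if less than
        # 'interval' seconds passed since the previous sampling.
        now = time.time()
        if now - self._last_sample < self.interval:
            return
        self._last_sample = now
        for pid, meta in list(self._processes.items()):
            self._sample_process(pid, meta)

    def _sample_process(self, pid, meta):
        pass  # per-process data collection goes here
```

Re-registration simply overwrites the stored metainformation, so a long-lived default server is always attributed to the test it is currently running.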

  2. Add RSS statistics collecting

Found that some tests may fail due to lack of memory; mostly it happens
in CI on remote hosts. To collect memory usage statistics, decided to
add an RSS collecting routine get_proc_stat_rss(), which parses the
file:

      /proc/<worker pid>/status

for the RSS value 'VmRSS', the resident set size of the process. It
consists of the following three parts (VmRSS = RssAnon + RssFile +
RssShmem) [1]:

      RssAnon - size of resident anonymous memory
      RssFile - size of resident file mappings
      RssShmem - size of resident shmem memory (includes SysV shm, mapping of
                 tmpfs and shared anonymous mappings)
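A parser for this 'VmRSS' field might look like the sketch below. The function name follows the commit message; the exact error handling and the choice to return 0 for dead or unsupported processes are assumptions.

```python
def get_proc_stat_rss(pid):
    """Return VmRSS in KiB read from /proc/<pid>/status, or 0 if the
    value is unavailable.

    The kernel reports the value in lines like 'VmRSS:   1234 kB'.
    """
    rss = 0
    try:
        with open('/proc/%d/status' % pid) as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    rss = int(line.split()[1])
                    break
    except OSError:
        # The process exited or /proc is not available (e.g. macOS).
        pass
    return rss
```

Reading /proc is Linux-specific, which is why the real routine's availability is checked before enabling RSS collection at all.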

Decided that the best approach for CI is not to run this RSS collecting
routine for each command sent from test tasks, but to run it with a
0.1 second delay after the test task has started, collecting the task's
maximum RSS value during its run. This delay is used to run routines in
the 'SamplerWatcher' listener. Also found that a delay of 0.1 sec is
enough to catch RSS growth, as verified with the check:

      tarantool> require('clock').bench(function() local t = {} for i = 1, 1024^2 * 100 do t[i] = true end end)

which showed that 100 MB of data is allocated within seconds:

      - on CI test host: 3.153877479
      - on local fast host: 0.54504489

The main idea is to check all processes belonging to a test at some
point in time and track the maximum total RSS value they reach. For
this, the '_sample_process()' routine gets the RSS of each currently
alive process, and the '_sample()' routine sums the RSS of all alive
processes of a task and checks whether this sum is bigger than the
value previously saved for the current task. Both routines are in the
'Sampler' class, which is invoked by the 'process_timeout()' routine of
the 'SamplerWatcher' listener.
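The maximum-tracking logic can be illustrated with a small standalone function. Names and the signature are illustrative, not the actual test-run API; 'get_rss' stands in for a /proc-based RSS reader.

```python
def update_rss_summary(processes, get_rss, rss_summary):
    """One sampling pass.

    'processes' maps pid -> task_id; 'get_rss' returns the RSS of a
    pid; 'rss_summary' maps task_id -> maximum observed total RSS.
    """
    # Sum the RSS of all alive processes of each task.
    totals = {}
    for pid, task_id in processes.items():
        totals[task_id] = totals.get(task_id, 0) + get_rss(pid)
    # Keep the per-task maximum across sampling passes.
    for task_id, total in totals.items():
        if total > rss_summary.get(task_id, 0):
            rss_summary[task_id] = total
    return rss_summary
```

Because only the maximum of the summed values survives, a short RSS spike is recorded as long as it lasts at least one sampling interval.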

Also used the print_statistics() routine in the 'StatisticsWatcher'
listener, which prints statistics to stdout after testing. It is used
to print RSS usage for failed tasks and for up to 10 tasks with the
highest RSS usage. Created a new 'statistics' subdirectory in the
'vardir' path to save statistics files. The current patch uses it to
save there an 'rss.log' file with RSS values per tested task in the
format:

      <test task name> <maximum RSS value>

Closes #277

[1]: https://www.kernel.org/doc/html/latest/filesystems/proc.html
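Writing the 'rss.log' file in the described format could be sketched as below. The helper name is hypothetical, and Python 3's `os.makedirs(..., exist_ok=True)` stands in for the safe_makedirs() helper seen in the fixup diffs.

```python
import os


def write_rss_log(stats_dir, rss_summary):
    """Write '<stats_dir>/rss.log' with one
    '<test task name> <maximum RSS value>' line per task."""
    os.makedirs(stats_dir, exist_ok=True)
    with open(os.path.join(stats_dir, 'rss.log'), 'w') as f:
        for task_id, rss in rss_summary.items():
            f.write('{} {}\n'.format(task_id, rss))
```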

  3. Add tests duration collecting

Decided to collect test run durations in a standalone file and print
them to stdout after testing finishes. The stdout report shows
durations for failed tasks and for up to 10 longest tasks.

For duration collecting, used the 'StatisticsWatcher' listener, which
has the following routines:

      process_result()
        Uses the 'WorkerTaskResult' queue message to collect tasks that
        failed and final durations for finished tasks. The duration
        collecting routine is added in the worker running the tests; its
        values are passed via 'result_queue' through the 'WorkerTaskResult'
        queue message into the listener.

      print_statistics() - prints statistics to stdout after testing.
        Prints durations for failed tasks and for up to 10 longest tasks.
        We use a standalone 'statistics' directory in the 'vardir' path to
        save the 'duration.log' file with the duration of each tested task
        in the format:

          <test task name> <float duration in secs with 2 digits after point>

Closes #286
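The report described above (failed tasks first, then the longest ones) can be sketched as a standalone function. The names and the return-a-list-of-lines shape are illustrative, not the actual StatisticsWatcher code, which prints colored output directly.

```python
def top_duration_report(duration_stats, failed_task_ids, top=10):
    """Build duration report lines: failed tasks first, then the 'top'
    longest tasks, with durations formatted to two decimal places."""
    lines = []
    for task_id in failed_task_ids:
        if task_id in duration_stats:
            lines.append('* %6.2f %s' % (duration_stats[task_id], task_id))
    lines.append('Top {} longest tests (seconds):'.format(top))
    # Sort by duration, longest first, and keep only the top entries.
    longest = sorted(duration_stats.items(), key=lambda x: x[1],
                     reverse=True)
    for task_id, duration in longest[:top]:
        lines.append('* %6.2f %s' % (duration, task_id))
    return lines
```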

@avtikhon avtikhon requested review from Totktonada and kyukhin May 3, 2021 04:46
@avtikhon avtikhon self-assigned this May 3, 2021
@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch from 69133e9 to 09ba06e on May 3, 2021 05:18
@avtikhon avtikhon added the teamQ label May 4, 2021
@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch from 09ba06e to dd49db0 on May 4, 2021 13:35
@avtikhon avtikhon changed the title Add RSS statistics collecting Add RSS statistics and timings collecting May 4, 2021
@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch 10 times, most recently from 68547be to ae7cddb on May 5, 2021 13:26
@Totktonada
Member

@avtikhon asked me for a suggestion on how to better track tarantool processes in test-run, so I created a simple sampling infrastructure for test-run as an example: 6e73ac4.

It is not tested thoroughly, but at least I looked over the logs for different types of tests, and it seems to track everything I would expect.

Sadly, the register-unregister logs are not written to test/var/log/ due to #247, but we can print them to the terminal for debugging purposes this way:

diff --git a/lib/sampler.py b/lib/sampler.py
index 8bcc6ca..95dcec0 100644
--- a/lib/sampler.py
+++ b/lib/sampler.py
@@ -2,7 +2,7 @@ import os
 import sys
 import time
 
-from lib.colorer import color_log
+from lib.colorer import color_stdout as color_log
 from lib.colorer import qa_notice
 from lib.utils import format_process

@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch 3 times, most recently from fd9f9bf to 07f37da on May 11, 2021 06:15
@Totktonada
Member

The patchset is in good shape, but I would like to spin for a while on code readability. It is better to keep it good from scratch rather than fix it afterwards. I left my comments above.

@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch 3 times, most recently from eadd7bf to 06c4d5c on May 21, 2021 05:51
Track tarantool and unit test executables that are run using test-run
with metainformation: worker, test, test configuration and server name.

Add a function that will be called each 0.1 second for each tracked
process.

The implementation tracks non-default servers and re-registers default
servers that execute several tests (the 'core = tarantool' case).

Part of #277
@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch from 06c4d5c to f058f1f on May 21, 2021 06:32
@avtikhon avtikhon requested a review from Totktonada May 21, 2021 06:33
@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch 3 times, most recently from 202bce8 to d7afe4a on May 22, 2021 16:58
@avtikhon avtikhon requested a review from Totktonada May 22, 2021 17:02
@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch from d7afe4a to d965925 on May 22, 2021 17:17
@Totktonada Totktonada force-pushed the avtikhon/gh-qa-98-rss-new branch from d965925 to 44322a8 on May 23, 2021 00:47
@Totktonada
Member

Pushed several fixups (mostly around naming and wording). Please apply them if you don't object. The fixup commits are arranged to ease squashing.

Patches.
commit 38067df314b47d8c542f8ccf6453a128f3b2d6f2
Author: Alexander Turenko <[email protected]>
Date:   Sun May 23 03:23:54 2021 +0300

    FIXUP: Actualize comment in print_statistics()

diff --git a/listeners.py b/listeners.py
index 619121e..88d5bb4 100644
--- a/listeners.py
+++ b/listeners.py
@@ -104,13 +104,13 @@ class StatisticsWatcher(BaseWatcher):
         fd.close()
 
     def print_statistics(self):
+        """Print statistics and results of testing."""
         # Prepare standalone subpath '<vardir>/statistics' for statistics files.
         stats_dir = os.path.join(Options().args.vardir, 'statistics')
         safe_makedirs(stats_dir)
 
         self.print_rss_summary(stats_dir)
 
-        """Returns are there failed tasks."""
         if self.stats:
             color_stdout('Statistics:\n', schema='test_var')
         for short_status, cnt in self.stats.items():
commit ef322a34fc1fc089e21c104504391135551cc578
Author: Alexander Turenko <[email protected]>
Date:   Sun May 23 02:27:47 2021 +0300

    FIXUP: sampler: don't use private API

diff --git a/lib/sampler.py b/lib/sampler.py
index 92cbee3..ca977da 100644
--- a/lib/sampler.py
+++ b/lib/sampler.py
@@ -44,6 +44,10 @@ class SamplerWatcher(object):
     def process_timeout(self, delta_seconds):
         self._wakeup()
 
+    @property
+    def sample_interval(self):
+        return self._sample_interval
+
     def _wakeup(self):
         """Invoke Sampler.sample() if enough time elapsed since
            the previous call.
@@ -94,7 +98,7 @@ class Sampler:
 
     @property
     def sample_interval(self):
-        return self._watcher._sample_interval
+        return self._watcher.sample_interval
 
     @property
     def watcher(self):
commit db21cb9fffa19d36bc7f1758ea084c3605de4255
Author: Alexander Turenko <[email protected]>
Date:   Sun May 23 02:31:53 2021 +0300

    FIXUP: sampler: fixups for is_enabled
    
    * Use proc_stat_rss_supported() directly: the access is not frequent, we
      don't need to cache the value in the instance's field.
    * Convert from a method to a property (it is already accessed as a
      property in dispatcher).

diff --git a/lib/sampler.py b/lib/sampler.py
index ca977da..753ef5b 100644
--- a/lib/sampler.py
+++ b/lib/sampler.py
@@ -79,7 +79,6 @@ class Sampler:
 
         self._processes = dict()
         self._rss_summary = dict()
-        self.proc_stat_rss_supported = proc_stat_rss_supported()
 
     def set_queue(self, queue, worker_id, worker_name):
         # Called from a worker process (_run_worker()).
@@ -107,8 +106,9 @@ class Sampler:
                                'main test-run process')
         return self._watcher
 
+    @property
     def is_enabled(self):
-        return self.proc_stat_rss_supported
+        return proc_stat_rss_supported()
 
     def register_process(self, pid, task_id, server_name, worker_id=None,
                          worker_name=None):
diff --git a/listeners.py b/listeners.py
index 88d5bb4..d5ab0df 100644
--- a/listeners.py
+++ b/listeners.py
@@ -65,7 +65,7 @@ class StatisticsWatcher(BaseWatcher):
 
     # RSS.
     def print_rss_summary(self, stats_dir):
-        if not self._sampler.is_enabled():
+        if not self._sampler.is_enabled:
             return
 
         rss_summary = self._sampler.rss_summary
commit 9dff43866ffff152119998046e0e4021f7583841
Author: Alexander Turenko <[email protected]>
Date:   Sun May 23 03:13:02 2021 +0300

    FIXUP: Adjust wording in memory statistics report

diff --git a/listeners.py b/listeners.py
index d5ab0df..f4bd9b7 100644
--- a/listeners.py
+++ b/listeners.py
@@ -73,9 +73,7 @@ class StatisticsWatcher(BaseWatcher):
 
         # Print to stdout RSS statistics for all failed tasks.
         if self.failed_tasks:
-            color_stdout('RSS of the failed tasks in Mb:', schema='info')
-            color_stdout(' (may skip tests run faster than {} sec):\n'.format(
-                         self._sampler.sample_interval), schema='info')
+            color_stdout('Occupied memory in failed tests (RSS, Mb):\n', schema='info')
             for task in self.failed_tasks:
                 task_id = task[0]
                 if task_id in rss_summary:
@@ -83,17 +81,22 @@ class StatisticsWatcher(BaseWatcher):
                                  self.prettify_task_name(task_id).ljust(self.field_size),
                                  self.get_long_mark(task_id)),
                                  schema='info')
+            color_stdout('\n')
 
         # Print to stdout RSS statistics for some number of most it used tasks.
-        color_stdout('Up to {} most RSS used tasks in Mb'.format(top_rss), schema='info')
-        color_stdout(' (may skip tests run faster than {} sec):\n'.format(
-                     self._sampler.sample_interval), schema='info')
+        color_stdout('Top {} tests by occupied memory (RSS, Mb):\n'.format(
+                     top_rss), schema='info')
         results_sorted = sorted(rss_summary.items(), key=lambda x: x[1], reverse=True)
         for task_id, rss in results_sorted[:top_rss]:
             color_stdout('* %6.1f %s %s\n' % (float(rss) / 1024,
                          self.prettify_task_name(task_id).ljust(self.field_size),
                          self.get_long_mark(task_id)), schema='info')
-        color_stdout('-' * 41, "\n", schema='separator')
+        color_stdout('\n')
+
+        color_stdout('(Tests quicker than {} seconds may be missed.)\n'.format(
+                     self._sampler.sample_interval), schema='info')
+
+        color_stdout('-' * 81, "\n", schema='separator')
 
         # Print RSS statistics to '<vardir>/statistics/rss.log' file.
         filepath = os.path.join(stats_dir, 'rss.log')
commit 4eadef6f502d0df71e2e5c5afcaeb440d7b51908
Author: Alexander Turenko <[email protected]>
Date:   Sun May 23 03:15:25 2021 +0300

    FIXUP: sampler: adjust wording in rss_summary comment

diff --git a/lib/sampler.py b/lib/sampler.py
index 753ef5b..eb83d89 100644
--- a/lib/sampler.py
+++ b/lib/sampler.py
@@ -89,10 +89,7 @@ class Sampler:
 
     @property
     def rss_summary(self):
-        """Return dictionary of collected maximum RSS values per each
-           run task, set with task_id keys.
-        """
-
+        """Task ID to maximum RSS mapping."""
         return self._rss_summary
 
     @property
commit 1df77515ec8fd2c75845236eb63dfa778256e1b2
Author: Alexander Turenko <[email protected]>
Date:   Sun May 23 03:20:45 2021 +0300

    FIXUP: Adjust wording in duration statistics

diff --git a/listeners.py b/listeners.py
index 231ea5c..d90c070 100644
--- a/listeners.py
+++ b/listeners.py
@@ -115,7 +115,8 @@ class StatisticsWatcher(BaseWatcher):
 
         # Print to stdout durations for all failed tasks.
         if self.failed_tasks:
-            color_stdout('Duration of the failed tasks:\n', schema='info')
+            color_stdout('Duration of failed tests (seconds):\n',
+                         schema='info')
             for task in self.failed_tasks:
                 task_id = task[0]
                 if task_id in self.tasks_durations:
@@ -123,16 +124,18 @@ class StatisticsWatcher(BaseWatcher):
                                  self.prettify_task_name(task_id).ljust(self.field_size),
                                  self.get_long_mark(task_id)),
                                  schema='info')
+            color_stdout('\n')
 
         # Print to stdout durations for some number of most long tasks.
-        color_stdout('Up to {} most long tasks in seconds:\n'.format(top_durations),
+        color_stdout('Top {} longest tests (seconds):\n'.format(top_durations),
                      schema='info')
         results_sorted = sorted(self.tasks_durations.items(), key=lambda x: x[1], reverse=True)
         for task_id, duration in results_sorted[:top_durations]:
             color_stdout('* %6.2f %s %s\n' % (duration,
                          self.prettify_task_name(task_id).ljust(self.field_size),
                          self.get_long_mark(task_id)), schema='info')
-        color_stdout('-' * 41, "\n", schema='separator')
+
+        color_stdout('-' * 81, "\n", schema='separator')
 
         # Print durations statistics to '<vardir>/statistics/duration.log' file.
         filepath = os.path.join(stats_dir, 'duration.log')
commit 44322a8ff1615c530f227461835bd2639a4217b7
Author: Alexander Turenko <[email protected]>
Date:   Sun May 23 03:29:47 2021 +0300

    FIXUP: Adjust wording: tasks_durations -> duration_stats

diff --git a/listeners.py b/listeners.py
index d90c070..107e57e 100644
--- a/listeners.py
+++ b/listeners.py
@@ -36,7 +36,7 @@ class StatisticsWatcher(BaseWatcher):
         self.stats = dict()
         self.field_size = 60
         self._sampler = sampler
-        self.tasks_durations = dict()
+        self.duration_stats = dict()
         self.failed_tasks = []
         self.get_logfile = get_logfile
         self.long_tasks = set()
@@ -58,7 +58,7 @@ class StatisticsWatcher(BaseWatcher):
                                       obj.result_checksum,
                                       obj.show_reproduce_content))
 
-        self.tasks_durations[obj.task_id] = obj.duration
+        self.duration_stats[obj.task_id] = obj.duration
 
     def get_long_mark(self, task):
         return '(long)' if task in self.long_tasks else ''
@@ -119,8 +119,8 @@ class StatisticsWatcher(BaseWatcher):
                          schema='info')
             for task in self.failed_tasks:
                 task_id = task[0]
-                if task_id in self.tasks_durations:
-                    color_stdout('* %6.2f %s %s\n' % (self.tasks_durations[task_id],
+                if task_id in self.duration_stats:
+                    color_stdout('* %6.2f %s %s\n' % (self.duration_stats[task_id],
                                  self.prettify_task_name(task_id).ljust(self.field_size),
                                  self.get_long_mark(task_id)),
                                  schema='info')
@@ -129,7 +129,7 @@ class StatisticsWatcher(BaseWatcher):
         # Print to stdout durations for some number of most long tasks.
         color_stdout('Top {} longest tests (seconds):\n'.format(top_durations),
                      schema='info')
-        results_sorted = sorted(self.tasks_durations.items(), key=lambda x: x[1], reverse=True)
+        results_sorted = sorted(self.duration_stats.items(), key=lambda x: x[1], reverse=True)
         for task_id, duration in results_sorted[:top_durations]:
             color_stdout('* %6.2f %s %s\n' % (duration,
                          self.prettify_task_name(task_id).ljust(self.field_size),
@@ -137,12 +137,12 @@ class StatisticsWatcher(BaseWatcher):
 
         color_stdout('-' * 81, "\n", schema='separator')
 
-        # Print durations statistics to '<vardir>/statistics/duration.log' file.
+        # Print duration statistics to '<vardir>/statistics/duration.log' file.
         filepath = os.path.join(stats_dir, 'duration.log')
         fd = open(filepath, 'w')
-        for task_id in self.tasks_durations:
+        for task_id in self.duration_stats:
             fd.write("{} {}\n".format(self.prettify_task_name(task_id),
-                                      self.tasks_durations[task_id]))
+                                      self.duration_stats[task_id]))
         fd.close()
 
     def print_statistics(self):

avtikhon added 2 commits May 23, 2021 04:25
Found that some tests may fail due to lack of memory; mostly it happens
in CI on remote hosts. To collect memory usage statistics, decided to
add an RSS collecting routine get_proc_stat_rss(), which parses the
file:

  /proc/<worker pid>/status

for the RSS value 'VmRSS', the resident set size of the process. It
consists of the following three parts (VmRSS = RssAnon + RssFile +
RssShmem) [1]:

  RssAnon - size of resident anonymous memory
  RssFile - size of resident file mappings
  RssShmem - size of resident shmem memory (includes SysV shm, mapping of
             tmpfs and shared anonymous mappings)

Decided that the best approach for CI is not to run this RSS collecting
routine for each command sent from test tasks, but to run it with a
0.1 second delay after the test task has started, collecting the task's
maximum RSS value during its run. This delay is used to run routines in
the 'SamplerWatcher' listener. Also found that a delay of 0.1 sec is
enough to catch RSS growth, as verified with the check:

  tarantool> require('clock').bench(function() local t = {} for i = 1, 1024^2 * 100 do t[i] = true end end)

which showed that 100 MB of data is allocated within seconds:
  - on CI test host: 3.153877479
  - on local fast host: 0.54504489

The main idea is to check all processes belonging to a test at some
point in time and track the maximum total RSS value they reach. For
this, the '_sample_process()' routine gets the RSS of each currently
alive process, and the '_sample()' routine sums the RSS of all alive
processes of a task and checks whether this sum is bigger than the
value previously saved for the current task. Both routines are in the
'Sampler' class, which is invoked by the 'process_timeout()' routine of
the 'SamplerWatcher' listener.

Also used the print_statistics() routine in the 'StatisticsWatcher'
listener, which prints statistics to stdout after testing. It is used
to print RSS usage for failed tasks and for up to 10 tasks with the
highest RSS usage. Since the sampler collects RSS values with a 0.1 sec
delay, tests that run faster may have no RSS values at all and won't be
printed.

Created a new 'statistics' subdirectory in the 'vardir' path to save
statistics files. The current patch uses it to save there an 'rss.log'
file with RSS values per tested task in the format:

  <test task name> <maximum RSS value>

Closes #277

[1]: https://www.kernel.org/doc/html/latest/filesystems/proc.html
Decided to collect test run durations in a standalone file and print
them to stdout after testing finishes. The stdout report shows
durations for failed tasks and for up to 10 longest tasks.

For duration collecting, used the 'StatisticsWatcher' listener, which
has the following routines:

  process_result()
    Uses the 'WorkerTaskResult' queue message to collect tasks that
    failed and final durations for finished tasks. The duration
    collecting routine is added in the worker running the tests; its
    values are passed via 'result_queue' through the 'WorkerTaskResult'
    queue message into the listener.

  print_statistics() - prints statistics to stdout after testing.
    Prints durations for failed tasks and for up to 10 longest tasks.
    We use a standalone 'statistics' directory in the 'vardir' path to
    save the 'duration.log' file with the duration of each tested task
    in the format:

      <test task name> <float duration in secs with 2 digits after point>

Closes #286
@avtikhon avtikhon force-pushed the avtikhon/gh-qa-98-rss-new branch from 44322a8 to b80762a on May 23, 2021 04:28
@avtikhon
Contributor Author

Pushed several fixups (mostly around naming and wording). Please apply them if you don't object. The fixup commits are arranged to ease squashing.

All patches LGTM; squashed them, thank you.

Member

@Totktonada Totktonada left a comment


LGTM.

@Totktonada Totktonada merged commit 0fe05d2 into master May 23, 2021
@Totktonada Totktonada deleted the avtikhon/gh-qa-98-rss-new branch May 23, 2021 09:55
@Totktonada
Copy link
Member

Updated the test-run submodule in tarantool in the following commits: 2.9.0-35-g213f480e7, 2.8.1-19-g67ff7cc5f, 2.7.2-15-gc35655113, 1.10.10-8-gfb804dcca.

Successfully merging this pull request may close these issues:

- Collect tests timings
- Add memory usage profiling for running tests