Hanging training jobs (old version of pipeline?!?) #68

jfri3d · 2019-11-06T15:18:13Z

What is the current behaviour?

Hanging training jobs due to an old version of the Training Pipeline.

What is the expected behaviour?

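Not this! Each deployment should create exactly one training job, and no job should be left hanging in JOB_RUNNING.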

How to reproduce? (e.g. logs, minimal example, etc...)

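Deploying a training job occasionally creates multiple jobs, apparently due to some "lag" between versions of the Training Pipeline. In the listing below, jobs 5 to 7 are stuck in JOB_RUNNING with unknown duration, and each was started 14 to 17 seconds before a job that did finish (jobs 0, 1 and 3 respectively).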

$ kaos train list

+--------------------------------------------------------------------------------------------------+
|                                             TRAINING                                             |
+-----+----------+----------+----------------------------------+---------------------+-------------+
| ind | duration | hyperopt |              job_id              |       started       |    state    |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  0  |    97    |  False   | 862e2de2a8c3424e8b39839831040a95 | 2019-11-06 13:36:24 | JOB_SUCCESS |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  1  |    24    |  False   | 1d35a8a88af04fb4b5e8a8c4087e0271 | 2019-11-06 13:21:04 | JOB_FAILURE |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  2  |    10    |  False   | 4c1bc7a26f5b4556bfa4ecf6bac60b1f | 2019-11-06 13:18:38 | JOB_SUCCESS |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  3  |    23    |  False   | 135aefb208c8410ab5381b78547b36b1 | 2019-11-06 12:48:15 | JOB_SUCCESS |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  4  |    1     |  False   | 77e866c4c3dc49109cbce71f45b8d0e3 | 2019-11-06 11:44:49 | JOB_FAILURE |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  5  |    ?     |    ?     | c57d7ce324e74efebc7a77cdc129b41c | 2019-11-06 13:36:10 | JOB_RUNNING |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  6  |    ?     |    ?     | 820ebfb0e4594001ad701c9c75415ec8 | 2019-11-06 13:20:48 | JOB_RUNNING |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  7  |    ?     |    ?     | 308f317cf37d47df9b23e123f6210cb6 | 2019-11-06 12:47:58 | JOB_RUNNING |
+-----+----------+----------+----------------------------------+---------------------+-------------+
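The pairing between hanging and finished jobs can be checked mechanically. The following is a minimal sketch, not part of kaos: it only parses the text printed by `kaos train list` and matches each JOB_RUNNING row with a finished job started shortly after it. The function names and the 60-second window are assumptions chosen for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches one data row of the table: ind | duration | hyperopt | job_id | started | state
ROW = re.compile(
    r"\|\s*\d+\s*\|[^|]*\|[^|]*\|\s*(\w+)\s*\|\s*([\d-]+ [\d:]+)\s*\|\s*(\w+)\s*\|"
)


def parse_rows(table_text):
    """Return (job_id, started, state) for every data row in the table text."""
    rows = []
    for line in table_text.splitlines():
        m = ROW.match(line.strip())
        if m:
            job_id, started, state = m.groups()
            rows.append(
                (job_id, datetime.strptime(started, "%Y-%m-%d %H:%M:%S"), state)
            )
    return rows


def pair_hanging_jobs(rows, window=timedelta(seconds=60)):
    """Pair each JOB_RUNNING row with finished jobs started within `window` after it."""
    running = [r for r in rows if r[2] == "JOB_RUNNING"]
    finished = [r for r in rows if r[2] != "JOB_RUNNING"]
    pairs = []
    for job_id, started, _ in running:
        for other_id, other_started, _ in finished:
            gap = other_started - started
            if timedelta(0) <= gap <= window:
                pairs.append((job_id, other_id, gap.total_seconds()))
    return pairs


# Two rows copied from the listing above (ind 0 and ind 5).
SAMPLE = """
|  0  |    97    |  False   | 862e2de2a8c3424e8b39839831040a95 | 2019-11-06 13:36:24 | JOB_SUCCESS |
|  5  |    ?     |    ?     | c57d7ce324e74efebc7a77cdc129b41c | 2019-11-06 13:36:10 | JOB_RUNNING |
"""

if __name__ == "__main__":
    for hanging, finished, gap in pair_hanging_jobs(parse_rows(SAMPLE)):
        print(f"{hanging} hangs; {finished} started {gap:.0f}s later")
```

Run against the full listing, this kind of check would pair jobs 5, 6 and 7 with jobs 0, 1 and 3, which is consistent with one deployment producing two jobs.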

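Logs are also included. Requesting the logs of hanging job 308f317cf37d47df9b23e123f6210cb6 shows it being skipped because it uses old pipeline version 5, while job 135aefb208c8410ab5381b78547b36b1 is processed instead:

$ kaos train logs -j 308f317cf37d47df9b23e123f6210cb6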

[2019-11-06 12:48:06] skipping job 77e866c4c3dc49109cbce71f45b8d0e3 as it is already in state JOB_FAILURE
[2019-11-06 12:48:36] skipping job 308f317cf37d47df9b23e123f6210cb6 as it uses old pipeline version 5
[2019-11-06 12:48:36] processing job 135aefb208c8410ab5381b78547b36b1
[2019-11-06 12:48:36] blocking on parent commit "f2f542a48f3b49e6926e31c93669a6d1" before writing to output commit "2178eb2a76834f23a6b6208f3217e7dd"
[2019-11-06 12:48:36] starting to download data
[2019-11-06 12:48:36] finished downloading data after 508.2152ms
[2019-11-06 12:48:36] beginning to run user code
[2019-11-06 12:48:38] Hello worldddddd!
[2019-11-06 12:48:38] We are training! Are we?
[2019-11-06 12:48:38] cwd        /opt/program
[2019-11-06 12:48:38] basedir    /opt/program
[2019-11-06 12:48:38] os.listdir()       ['stargazers', 'dist', 'stargazers.egg-info', 'build', 'train', 'README.md', 'requirements.txt', '.DS_Store', 'setup.py']
[2019-11-06 12:48:38] /pfs/hyper         ['params.null']
[2019-11-06 12:48:38] /pfs       ['data', 'build-train', 'hyper', '.scratch', 'out']
[2019-11-06 12:48:38] finished running user code after 1.8068292s
[2019-11-06 12:48:38] starting to upload output
[2019-11-06 12:48:38] finished uploading output after 13.5419ms
[2019-11-06 12:48:38] starting to merge chunk
[2019-11-06 12:48:38] finished merging chunk after 481.2µs
[2019-11-06 12:48:38] starting to merge output
[2019-11-06 12:48:38] finished merging output after 19.8903ms
[2019-11-06 12:48:38] job "135aefb208c8410ab5381b78547b36b1" put in terminal state "JOB_SUCCESS"; cancelling
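The relevant line in these logs is the "uses old pipeline version" skip. As a rough sketch for spotting affected jobs in longer logs, the snippet below scans log text for that message. It assumes the message format stays exactly as shown above; the function name is hypothetical and nothing here calls kaos itself.

```python
import re

# Matches lines such as:
# [2019-11-06 12:48:36] skipping job <id> as it uses old pipeline version 5
SKIP_OLD = re.compile(
    r"^\[(?P<ts>[^\]]+)\] skipping job (?P<job>\w+) "
    r"as it uses old pipeline version (?P<version>\d+)$"
)


def jobs_skipped_for_old_pipeline(log_text):
    """Return (job_id, pipeline_version, timestamp) for every old-version skip line."""
    hits = []
    for line in log_text.splitlines():
        m = SKIP_OLD.match(line.strip())
        if m:
            hits.append((m["job"], int(m["version"]), m["ts"]))
    return hits


# Three lines copied from the logs above.
SAMPLE_LOG = """
[2019-11-06 12:48:06] skipping job 77e866c4c3dc49109cbce71f45b8d0e3 as it is already in state JOB_FAILURE
[2019-11-06 12:48:36] skipping job 308f317cf37d47df9b23e123f6210cb6 as it uses old pipeline version 5
[2019-11-06 12:48:36] processing job 135aefb208c8410ab5381b78547b36b1
"""

if __name__ == "__main__":
    for job, version, ts in jobs_skipped_for_old_pipeline(SAMPLE_LOG):
        print(f"{ts}: job {job} skipped (pipeline version {version})")
```

Jobs reported by this scan that still show JOB_RUNNING in `kaos train list` are candidates for the hang described in this issue.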

Context (Environment)

LOCAL

jfri3d added the bug label on Nov 6, 2019
