You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
aos train logs -j 308f317cf37d47df9b23e123f6210cb6
[2019-11-06 12:48:06] skipping job 77e866c4c3dc49109cbce71f45b8d0e3 as it is already in state JOB_FAILURE
[2019-11-06 12:48:36] skipping job 308f317cf37d47df9b23e123f6210cb6 as it uses old pipeline version 5
[2019-11-06 12:48:36] processing job 135aefb208c8410ab5381b78547b36b1
[2019-11-06 12:48:36] blocking on parent commit "f2f542a48f3b49e6926e31c93669a6d1" before writing to output commit "2178eb2a76834f23a6b6208f3217e7dd"
[2019-11-06 12:48:36] starting to download data
[2019-11-06 12:48:36] finished downloading data after 508.2152ms
[2019-11-06 12:48:36] beginning to run user code
[2019-11-06 12:48:38] Hello worldddddd!
[2019-11-06 12:48:38] We are training! Are we?
[2019-11-06 12:48:38] cwd /opt/program
[2019-11-06 12:48:38] basedir /opt/program
[2019-11-06 12:48:38] os.listdir() ['stargazers', 'dist', 'stargazers.egg-info', 'build', 'train', 'README.md', 'requirements.txt', '.DS_Store', 'setup.py']
[2019-11-06 12:48:38] /pfs/hyper ['params.null']
[2019-11-06 12:48:38] /pfs ['data', 'build-train', 'hyper', '.scratch', 'out']
[2019-11-06 12:48:38] finished running user code after 1.8068292s
[2019-11-06 12:48:38] starting to upload output
[2019-11-06 12:48:38] finished uploading output after 13.5419ms
[2019-11-06 12:48:38] starting to merge chunk
[2019-11-06 12:48:38] finished merging chunk after 481.2µs
[2019-11-06 12:48:38] starting to merge output
[2019-11-06 12:48:38] finished merging output after 19.8903ms
[2019-11-06 12:48:38] job "135aefb208c8410ab5381b78547b36b1" put in terminal state "JOB_SUCCESS"; cancelling
Context (Environment)
LOCAL
The text was updated successfully, but these errors were encountered:
What is the current behaviour?
Hanging training jobs due to an old version of the Training Pipeline.
What is the expected behaviour?
Not this!
How to reproduce? (e.g. logs, minimal example, etc...)
Deploying a training job occasionally deploys multiple jobs due to some "lag" between versions of the Training Pipeline.
Logs are also included:
Context (Environment)
LOCAL
The text was updated successfully, but these errors were encountered: