Investigate logging and errors for CWL workflows #227

Closed
LucaCinquini opened this issue Oct 15, 2024 · 2 comments
Labels: enhancement, U-SPS

@LucaCinquini
Collaborator

Create a simple CWL workflow composed of mock stage-in + process + stage-out steps.
Investigate:
o How to return standard output and error in the Airflow UI
o What happens if one of the steps fails - is all standard output and error lost?

@nikki-t
Collaborator

nikki-t commented Oct 15, 2024

I created a nested workflow called cwl_dag_data.cwl which calls cwl_dag_stage_in.cwl, cwl_dag_process.cwl, and cwl_dag_stage_out.cwl.
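
For readers who want to picture the nesting, here is a minimal sketch of what such a parent workflow could look like. It builds the document as a Python dict and serializes it to JSON, which CWL accepts as an alternative to YAML. The three step file names and the `message` input come from this thread; the port names (`staged`, `processed`, `result`), the `cwlVersion`, and the requirement entry are assumptions, not the contents of the actual files.

```python
import json


def build_nested_workflow() -> dict:
    """Return a CWL Workflow document that chains three step files."""
    return {
        "cwlVersion": "v1.2",  # assumption: the real files may target another version
        "class": "Workflow",
        # Assumption: only required if the step files are themselves Workflows.
        "requirements": {"SubworkflowFeatureRequirement": {}},
        "inputs": {"message": "string"},
        "outputs": {
            "result": {"type": "string", "outputSource": "stage_out/result"},
        },
        "steps": {
            "stage_in": {
                "run": "cwl_dag_stage_in.cwl",
                "in": {"message": "message"},
                "out": ["staged"],  # hypothetical port name
            },
            "process": {
                "run": "cwl_dag_process.cwl",
                "in": {"message": "stage_in/staged"},
                "out": ["processed"],  # hypothetical port name
            },
            "stage_out": {
                "run": "cwl_dag_stage_out.cwl",
                "in": {"message": "process/processed"},
                "out": ["result"],  # hypothetical port name
            },
        },
    }


def step_order(workflow: dict) -> list:
    """Order steps by following each step's upstream source (linear chains only)."""
    steps = workflow["steps"]
    # Map each step to the step it reads from, or None if it reads a workflow input.
    upstream = {}
    for name, step in steps.items():
        sources = [src.split("/")[0] for src in step["in"].values() if "/" in src]
        upstream[name] = sources[0] if sources else None
    ordered = []
    current = None
    while len(ordered) < len(steps):
        current = [n for n, up in upstream.items() if up == current][0]
        ordered.append(current)
    return ordered


if __name__ == "__main__":
    print(json.dumps(build_nested_workflow(), indent=2))
    print(step_order(build_nested_workflow()))
```

The `step_order` helper is only there to show that the data dependencies (`stage_in/staged`, `process/processed`) are what chain the three steps together.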

Test 1) Normal operations: I ran cwl_dag_data.cwl using the CWL DAG created by cwl_dag.py, passing in {"message": "data"} as input.

  • The logs did not show any of the echo statements I had.
  • I was capturing stdout using this documentation.

Test 2) Normal operations: I ran cwl_dag_data.cwl using the CWL DAG created by cwl_dag.py, passing in {"message": "data"} as input, but this time did not capture stdout.

  • The logs did show my echo statements.

Sample logs:

[2024-10-15, 21:33:30 UTC] {pod_manager.py:472} INFO - [base] INFO [job stage_in] Data to stage in: data --> staged in
[2024-10-15, 21:33:30 UTC] {pod_manager.py:472} INFO - [base] INFO [job process] Data to process: data --> staged in
[2024-10-15, 21:33:31 UTC] {pod_manager.py:472} INFO - [base] INFO [job stage_out] Data to stage out: data --> staged in --> processed
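
The difference between Tests 1 and 2 is consistent with ordinary stream redirection: once a step's stdout is sent to a file, the parent process (and therefore the task log) no longer sees it. The Python sketch below illustrates that general principle with `subprocess`. It is not how cwltool or Airflow are implemented, and the helper name `run_step` and the file name `step_stdout.txt` are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional


def run_step(message: str, capture_to: Optional[Path] = None) -> str:
    """Run a mock step that echoes a message; return what the parent stream sees."""
    cmd = [sys.executable, "-c", f"print('Data to stage in: ' + {message!r})"]
    if capture_to is None:
        # stdout flows back to the parent, like a step whose output reaches the task log.
        result = subprocess.run(cmd, stdout=subprocess.PIPE, text=True, check=True)
        return result.stdout
    # stdout is redirected to a file, like a CWL tool that sets the `stdout` field.
    with capture_to.open("w") as handle:
        subprocess.run(cmd, stdout=handle, text=True, check=True)
    return ""  # nothing reaches the parent stream


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "step_stdout.txt"
        print(repr(run_step("data")))          # the echo is visible to the parent
        print(repr(run_step("data", target)))  # empty: the echo went to the file
        print(repr(target.read_text()))        # the echo is in the file instead
```

In the redirected case the message is not lost, it simply lives in the file rather than in the stream that the log collector reads.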

Test 3) Error operations: I modified cwl_dag_stage_out.cwl so that it threw an error during execution.

  • The logs did capture stderr, but the cause of the error was obscured: the job instead failed on not locating the output file, which was never created because the error was raised before the file was written. This may be an artifact of the toy example I put together; code running in a Docker container might surface the appropriate error.
  • The stack trace shown comes from the point where the Airflow code base encountered the error while running the CWL DAG.
  • Confirmed stage in and process task stdout was captured.
  • Standard error and standard output are not lost and remain present in the logs.

Sample logs:

[2024-10-15, 21:44:08 UTC] {pod_manager.py:472} INFO - [base] ERROR [job stage_out] Job error: 
[2024-10-15, 21:44:08 UTC] {pod_manager.py:472} INFO - [base] ("Error collecting output for parameter 'stage_out_file': https://raw.githubusercontent.com/unity-sds/unity-sps-workflows/refs/heads/227-investigate-cwl/demos/cwl_dag_data_stage_out.cwl:27:7: Did not find output file with glob pattern: ['stage_out.txt'].", {})
[2024-10-15, 21:44:12 UTC] {taskinstance.py:3301} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 767, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 733, in _execute_callable
    return ExecutionCallableRunner(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/baseoperator.py", line 406, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 592, in execute
    return self.execute_sync(context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 642, in execute_sync
    self.cleanup(
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 912, in cleanup
    raise AirflowException(
airflow.exceptions.AirflowException: Pod cwl-task-pod-xo3tp7hf returned a failure.

@LucaCinquini - Is this what you had in mind? Should we test anything else?

@LucaCinquini
Collaborator Author

@nikki-t: this is very useful. I think the result of this investigation can be summarized as follows (please correct me if I am wrong):

o) CWL will either capture stdout and stderr in specific files if coded to do so, or it will echo them to the default Unix stdout and stderr streams. When running CWL through Airflow as we do, the stdout and stderr streams become the task logs, which is exactly what we want. So, in general, we should instruct users to NOT capture stdout and stderr as files.

o) The job log files are permanently stored in S3, where users can retrieve them long term if they want (see exhibit #1).

o) In case of error, the error message is indeed captured in the log (see exhibit #2), although it is mixed in with other CWL error messages.

In short, I think the current behavior of SPP/Airflow with respect to logs and errors is what we want. Thanks for taking the time to conduct this experiment.
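
Since the job-level error is mixed in with other messages, a small filter can make it easier to spot. The Python sketch below pulls the per-job lines out of task log text. The `[base] LEVEL [job NAME]` layout is inferred only from the sample logs in this thread, so the pattern is an assumption rather than a documented format, and the names `JOB_LINE`, `JobMessage`, and `job_messages` are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed layout, inferred from the sample logs: "... [base] LEVEL [job NAME] text"
JOB_LINE = re.compile(r"\[base\] (?P<level>[A-Z]+) \[job (?P<job>[^\]]+)\]\s*(?P<text>.*)$")


class JobMessage(NamedTuple):
    level: str
    job: str
    text: str


def job_messages(log_lines, level=None):
    """Yield per-job CWL messages found in Airflow task log lines."""
    for line in log_lines:
        match = JOB_LINE.search(line)
        if match is None:
            continue  # Airflow-only lines, tracebacks, and so on
        message = JobMessage(match["level"], match["job"], match["text"].strip())
        if level is None or message.level == level:
            yield message


# Lines copied from the sample logs above.
SAMPLE = [
    "[2024-10-15, 21:33:30 UTC] {pod_manager.py:472} INFO - [base] INFO [job stage_in] Data to stage in: data --> staged in",
    "[2024-10-15, 21:44:08 UTC] {pod_manager.py:472} INFO - [base] ERROR [job stage_out] Job error: ",
    "[2024-10-15, 21:44:12 UTC] {taskinstance.py:3301} ERROR - Task failed with exception",
]

if __name__ == "__main__":
    for message in job_messages(SAMPLE, level="ERROR"):
        print(f"{message.job}: {message.text}")
```

Note that in the sample the detail of the failure appears on the line after `Job error:`, so a filter like this only points at the failing job; the following log lines still need to be read for the cause.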

[Exhibits, images not reproduced: Screenshot 2024-10-17 at 09 06 43; Screenshot 2024-10-17 at 09 02 03]

@github-project-automation github-project-automation bot moved this from In Progress to Done in Unity Project Board Oct 17, 2024