
Partitioning script does not complete, ends up producing mostly *.mms.data visibility files #63

Open
Sam-Legodi opened this issue Mar 19, 2024 · 2 comments


@Sam-Legodi (Collaborator)

I have been trying to use the master version of the pipeline (via "source /idia/software/pipelines/master/setup.sh"), but I keep hitting errors, mostly at the partitioning stage. So far I've run the pipeline on several similar observations with slightly varying results. The common symptom is that partitioning leaves behind *.mms.data visibility directories instead of *.mms files in the sub-SPW subdirectories specified by the "spw" parameter in the config file. Partitioning sometimes succeeds and sometimes doesn't, which has left me confused and unable to find a workaround. The partitioning error logs report errors like the following:

"
2024-03-18 14:10:13 INFO msmetadata_cmpt.cc::open Performing internal consistency checks on /idia/raw/meerkat-cal/EXT-20210318-RT-01/1691421383/1691421383_sdp_l0.ms...
2024-03-18 14:10:16 INFO MSMetaData::_computeScanAndSubScanProperties Computing scan and subscan properties...
2024-03-18 14:10:22 INFO mstransform::::casa ##########################################
2024-03-18 14:10:22 INFO mstransform::::casa ##### Begin Task: mstransform #####
2024-03-18 14:10:22 INFO mstransform::::casa mstransform( vis='/idia/raw/meerkat-cal/EXT-20210318-RT-01/1691421383/1691421383_sdp_l0.ms', outputvis='1691421383_sdp_l0.1299~1350MHz.mms', createmms=True, separationaxis='scan', numsubms=62, tileshape=[0], field='', spw='*:1299~1350MHz', scan='', antenna='*&', correlation='', timerange='', intent='', array='', uvrange='', observation='', feed='', datacolumn='DATA', realmodelcol=False, keepflags=True, usewtspectrum=True, combinespws=False, chanaverage=False, chanbin=1, hanning=False, regridms=False, mode='channel', nchan=-1, start=0, width=1, nspw=1, interpolation='linear', phasecenter='', restfreq='', outframe='', veltype='radio', preaverage=False, timeaverage=False, timebin='0s', timespan='', maxuvwdistance=0.0, docallib=False, callib='', douvcontsub=False, fitspw='', fitorder=0, want_cont=False, denoising_lib=True, nthreads=4, niter=1, disableparallel=False, ddistart=-1, taql='', monolithic_processing=False, reindex=True )
2024-03-18 14:10:23 INFO ParallelDataHelper::::casa Analyzing MS for partitioning
2024-03-18 14:36:52 INFO ParallelDataHelper::::casa 15 subMSs failed to be created. This is not an error, if due to selection when creating a Multi-MS
2024-03-18 14:36:52 WARN ParallelDataHelper::go::casa Error post processing MMS results /idia/raw/meerkat-cal/EXT-20210318-RT-01/1691421383/1691421383_sdp_l0.ms: [Errno 39] Directory not empty: '/idia/projects/meerkat-cal/process/Sam/1691421383_run14/1299~1350MHz/1691421383_sdp_l0.1299~1350MHz.mms.data/1691421383_sdp_l0.1299~1350MHz.mms.0007.ms' -> '/idia/projects/meerkat-cal/process/Sam/1691421383_run14/1299~1350MHz/1691421383_sdp_l0.1299~1350MHz.mms.data/1691421383_sdp_l0.1299~1350MHz.mms.0000.ms'
2024-03-18 14:36:52 INFO mstransform::::casa Task mstransform complete. Start time: 2024-03-18 16:10:22.129315 End time: 2024-03-18 16:36:52.293223
2024-03-18 14:36:52 INFO mstransform::::casa ##### End Task: mstransform #####
2024-03-18 14:36:52 INFO mstransform::::casa ##########################################
"

Other error messages I've seen are similar to the following:
"
2024-03-18 14:10:30 SEVERE mstransform::::casa::MPIServer-4 Task mstransform raised an exception of class OSError with the following message: Output MS /idia/projects/meerkat-cal/process/Sam/1691421383_run14/880~933MHz/1691421383_sdp_l0.880~933MHz.mms.data/1691421383_sdp_l0.880~933MHz.mms.0003.ms already exists - will not overwrite it.
"

Another thing I've noticed is that the "vis" parameter in the sub-SPW subdirectory config files is set to the raw-data visibility file, which is not writable. Is this normal behaviour? Before submitting my jobs, I have been manually changing these to the name of the .mms file for each specific sub-SPW directory (roughly as in the sketch below). I've also run the pipeline without doing this, and have seen cases where some of the scripts try to write to the raw-data .ms file, which is not supposed to happen.
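This is roughly the manual workaround, shown as a sketch only: it assumes the sub-SPW configs are INI-style with the "vis" key under a [data] section, and the glob patterns are hypothetical stand-ins for my actual layout.

```python
# Sketch of the manual workaround: point `vis` in each sub-SPW config at the
# local .mms instead of the read-only raw .ms. The [data]/vis layout and the
# glob patterns are assumptions, not confirmed pipeline conventions.
import configparser
import glob
import os

for cfg_path in glob.glob('*MHz/*config.txt'):        # hypothetical naming
    spw_dir = os.path.dirname(cfg_path)
    mms = sorted(glob.glob(os.path.join(spw_dir, '*.mms')))
    if not mms:
        continue                                      # partition not done yet
    cfg = configparser.ConfigParser()
    cfg.read(cfg_path)
    cfg['data']['vis'] = "'{}'".format(os.path.basename(mms[0]))
    with open(cfg_path, 'w') as f:
        cfg.write(f)
```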

I've attached a typical config file I use for your perusal, if needed.
1691421383_run14-default_config.txt

@Jordatious (Collaborator)

Hi @Sam-Legodi, that's strange. Are you trying to run the pipeline in a directory in which it was already run? You should not need to overwrite the vis parameter inside the SPW directories, as that is done by the pipeline itself after the partition step (although inside the hidden config file - .config.tmp). Can you try running it again without overwriting this parameter?
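A quick way to confirm what the pipeline wrote is to print the vis value from each hidden config, e.g. with the sketch below (it assumes INI-style configs with vis under a [data] section; adjust if the key lives elsewhere):

```python
# Quick check (sketch): print the `vis` value the pipeline wrote into each
# hidden .config.tmp after partitioning. Assumes INI-style configs with
# `vis` under a [data] section.
import configparser
import glob

for cfg_path in sorted(glob.glob('*MHz/.config.tmp')):
    cfg = configparser.ConfigParser()
    cfg.read(cfg_path)
    print(cfg_path, '->', cfg.get('data', 'vis', fallback='<missing>'))
```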

@Sam-Legodi (Collaborator, Author) commented Mar 22, 2024

Hi @Jordatious. I've run the pipeline both with and without changing the "vis" parameter manually, and I still see behaviour similar to what I described above. For some reason partitioning doesn't always finish: it errors out with the errors above, even though some of the partitioned data is produced successfully. Could my resource request in the config be causing an issue? The "slurm" section of my main config files generally looks something like this:

[slurm]
nodes = 1
ntasks_per_node = 8
plane = 1
mem = 232
partition = 'Main'
exclude = ''
time = '12:00:00'   # sometimes '24:00:00'
submit = True
container = '/idia/software/containers/casa-6.5.0-modular.sif'
mpi_wrapper = 'mpirun'
name = '1383b'
dependencies = ''
account = 'b53-meerkat-cal-ag'
reservation = ''
modules = ['openmpi/4.0.3']

...
...
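For what it's worth, that request works out to about 29 GB per MPI task; a trivial sanity check:

```python
# Sanity check (sketch): memory available to each MPI task under the
# [slurm] request above. Numbers come straight from that section.
nodes, ntasks_per_node, mem_gb = 1, 8, 232
print(f"{mem_gb / (nodes * ntasks_per_node):.1f} GB per task")  # 29.0 GB
```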

Running "$./findErrors.sh " in the working directory of one of my last attempts to run the master version pipeline gave:
...
...

-----------------------------------------------------------------------------------------------
SPW #11: /idia/projects/meerkat-cal/process/Sam/1691421383_run14b/1630~1680MHz
logs/1383bvalidate_input-9365675.casa  logs/1383bvalidate_input-9365675.err  logs/1383bvalidate_input-9365675.mpi  logs/1383bvalidate_input-9365675.out
2024-03-15 14:44:43,958 ERROR: Exception found in previous pipeline job, which set "continue=False" in [run] section of ".config.tmp". Skipping ".config.tmp".
srun: error: compute-203: task 0: Exited with exit code 1
logs/*9365676* logs don't exist (yet)
logs/*9365677* logs don't exist (yet)
-----------------------------------------------------------------------------------------------
All SPWs: /idia/projects/meerkat-cal/process/Sam/1691421383_run14b
logs/1383bpartition-9365638_0.err  logs/1383bpartition-9365638_10.err  logs/1383bpartition-9365638_3.err  logs/1383bpartition-9365638_5.err  logs/1383bpartition-9365638_7.err	logs/1383bpartition-9365638_9.err
logs/1383bpartition-9365638_0.out  logs/1383bpartition-9365638_10.out  logs/1383bpartition-9365638_3.out  logs/1383bpartition-9365638_5.out  logs/1383bpartition-9365638_7.out	logs/1383bpartition-9365638_9.out
logs/1383bpartition-9365638_1.err  logs/1383bpartition-9365638_2.err   logs/1383bpartition-9365638_4.err  logs/1383bpartition-9365638_6.err  logs/1383bpartition-9365638_8.err
logs/1383bpartition-9365638_1.out  logs/1383bpartition-9365638_2.out   logs/1383bpartition-9365638_4.out  logs/1383bpartition-9365638_6.out  logs/1383bpartition-9365638_8.out
2024-03-15 14:38:00,668 ERROR: Exception found in previous pipeline job, which set "continue=False" in [run] section of ".config.tmp". Skipping ".config.tmp".
2024-03-15 15:11:16	WARN	ParallelDataHelper::go::casa	Error post processing MMS results /idia/raw/meerkat-cal/EXT-20210318-RT-01/1691421383/1691421383_sdp_l0.ms: [Errno 39] Directory not empty: '/idia/projects/meerkat-cal/process/Sam/1691421383_run14b/960~1010MHz/1691421383_sdp_l0.960~1010MHz.mms.data/1691421383_sdp_l0.960~1010MHz.mms.0003.ms' -> '/idia/projects/meerkat-cal/process/Sam/1691421383_run14b/960~1010MHz/1691421383_sdp_l0.960~1010MHz.mms.data/1691421383_sdp_l0.960~1010MHz.mms.0000.ms'
2024-03-15 14:44:39,172 ERROR: Exception found in previous pipeline job, which set "continue=False" in [run] section of ".config.tmp". Skipping ".config.tmp".
2024-03-15 15:08:57	WARN	ParallelDataHelper::go::casa	Error post processing MMS results /idia/raw/meerkat-cal/EXT-20210318-RT-01/1691421383/1691421383_sdp_l0.ms: [Errno 39] Directory not empty: '/idia/projects/meerkat-cal/process/Sam/1691421383_run14b/1010~1060MHz/1691421383_sdp_l0.1010~1060MHz.mms.data/1691421383_sdp_l0.1010~1060MHz.mms.0007.ms' -> '/idia/projects/meerkat-cal/process/Sam/1691421383_run14b/1010~1060MHz/1691421383_sdp_l0.1010~1060MHz.mms.data/1691421383_sdp_l0.1010~1060MHz.mms.0000.ms'
2024-03-15 15:06:38	WARN	ParallelDataHelper::go::casa	Error post processing MMS results /idia/raw/meerkat-cal/EXT-20210318-RT-01/1691421383/1691421383_sdp_l0.ms: [Errno 39] Directory not empty: '/idia/projects/meerkat-cal/process/Sam/1691421383_run14b/1060~1110MHz/1691421383_sdp_l0.1060~1110MHz.mms.data/1691421383_sdp_l0.1060~1110MHz.mms.0007.ms' -> '/idia/projects/meerkat-cal/process/Sam/1691421383_run14b/1060~1110MHz/1691421383_sdp_l0.1060~1110MHz.mms.data/1691421383_sdp_l0.1060~1110MHz.mms.0000.ms'
2024-03-15 15:06:38	WARN	ParallelDataHelper::go::casa	Error post processing MMS results /idia/raw/meerkat-cal/EXT-20210318-RT-01/1691421383/1691421383_sdp_l0.ms: [Errno 39] Directory not empty: '/idia/projects/meerkat-cal/process/Sam/1691421383_run14b/1110~1163MHz/1691421383_sdp_l0.1110~1163MHz.mms.data/1691421383_sdp_l0.1110~1163MHz.mms.0007.ms' -> '/idia/projects/meerkat-cal/process/Sam/1691421383_run14b/1110~1163MHz/1691421383_sdp_l0.1110~1163MHz.mms.data/1691421383_sdp_l0.1110~1163MHz.mms.0000.ms'
terminate called after throwing an instance of 'casa6core::AipsError'
[compute-249:25907] *** End of error message ***
2024-03-15 14:44:21,865 ERROR: Exception found in previous pipeline job, which set "continue=False" in [run] section of ".config.tmp". Skipping ".config.tmp".

...
...
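Given the "Directory not empty" and "already exists - will not overwrite it" errors, it looks like leftover subMSs from a previous attempt may be tripping up the post-processing step. Below is a sketch of how one might clear the partial outputs before rerunning; the '*MHz' glob is specific to my directory layout, so double-check the paths before deleting anything.

```python
# Cleanup sketch: remove partial Multi-MS outputs left by a failed partition
# run so a rerun starts clean. The '*MHz' glob matches my sub-SPW directory
# layout; verify the matched paths carefully before actually deleting.
import glob
import shutil

for leftover in sorted(glob.glob('*MHz/*.mms*')):   # .mms and .mms.data dirs
    print('removing', leftover)
    shutil.rmtree(leftover)
```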
