Nemomodel gives me OverflowError: integer does not fit in 'int' #73

Open
Saladino93 opened this issue Jul 8, 2024 · 6 comments


@Saladino93

Hi all. I am a new user running on Perlmutter.

On running

srun -u -l -n 64 nemoModel "/pscratch/sd/o/omard/FGSIMS_OUT/agora/${nemo_run}/${nemo_run}_optimalCatalog.fits" $mask $beam "/pscratch/sd/o/omard/FGSIMS_OUT/agora/${nemo_run}/nemomodel_${freq}_snr4.fits" --min-snr 4.0 --freq $freq -M -n

(Note that I added the --min-snr argument by hand.)

I get

54:   File "mpi4py/MPI/Comm.pyx", line 1406, in mpi4py.MPI.Comm.send
54:   File "mpi4py/MPI/msgpickle.pxi", line 211, in mpi4py.MPI.PyMPI_send
54:   File "mpi4py/MPI/msgpickle.pxi", line 147, in mpi4py.MPI.pickle_dump
54:   File "mpi4py/MPI/msgbuffer.pxi", line 50, in mpi4py.MPI.downcast
54: OverflowError: integer 3566595060 does not fit in 'int'

even though the preceding log lines show that the image itself completed:

54: ... rank 54 image complete (took 1895.205 sec)
54: ... rank = 54 sending sky model image

Any ideas how to debug this? I thought it might be related to my survey mask, but I still keep getting this even after reducing the area.

Thanks in advance.
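
For reference, the number in the error can be checked against the range of a C int. The sketch below assumes that 'int' is a signed 32-bit integer and that the value mpi4py tries to downcast is the size in bytes of the pickled message, which is what the `pickle_dump` → `downcast` frames suggest. The helper name `fits_in_c_int` is made up for illustration.

```python
# Assumption: the C 'int' in the error message is a signed 32-bit integer.
INT_MAX = 2**31 - 1  # 2147483647


def fits_in_c_int(n: int) -> bool:
    """Return True if n can be stored in a signed 32-bit C int."""
    return -2**31 <= n <= INT_MAX


reported = 3566595060  # value copied from the OverflowError above

print("fits in int:", fits_in_c_int(reported))
print("size in GiB: %.2f" % (reported / 2**30))
```

If that reading is right, the message is roughly 3.3 GiB, which is above the 2 GiB that a 32-bit count can describe.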

@mattyowl
Collaborator

mattyowl commented Jul 8, 2024

Hi - I'm assuming you're running the 'dev' branch? If so, this would probably be due to me trying to save memory, which didn't work out (caused more problems than it solved), and so I fixed this at the weekend. So I think if you just pull from 'dev', this should go away. Please let me know if not.

@Saladino93
Author

I installed through pip. Let me see if using the 'dev' branch improves the situation. Thanks.

@mattyowl
Collaborator

mattyowl commented Jul 8, 2024

Ok - it's unlikely to be what I said then, but I'm not sure what the issue would be without more info. Maybe you could post the whole traceback?

@Saladino93
Author

Saladino93 commented Jul 8, 2024

Indeed, I ran without the memory-saving hack (the one that converts the model to np.float16). I am now running with it and waiting for the results.

This is what I get from my previous pip installation:

19: Traceback (most recent call last):
19:   File "/global/homes/o/omard/.conda/envs/act/bin/nemoModel", line 240, in <module>
19:     comm.send(modelImage, dest = 0)
19:   File "mpi4py/MPI/Comm.pyx", line 1406, in mpi4py.MPI.Comm.send
19:   File "mpi4py/MPI/msgpickle.pxi", line 211, in mpi4py.MPI.PyMPI_send
19:   File "mpi4py/MPI/msgpickle.pxi", line 147, in mpi4py.MPI.pickle_dump
19:   File "mpi4py/MPI/msgbuffer.pxi", line 50, in mpi4py.MPI.downcast
19: OverflowError: integer 3566595060 does not fit in 'int'

(Note that I cloned Perlmutter's mpi4py environment.)
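
One possible way to stay under the limit in the `comm.send(modelImage, dest = 0)` call shown in the traceback is to send the image in several smaller pieces. The sketch below only computes the index ranges for such pieces. It is not what nemoModel does; the helper `chunk_slices`, the float32 element size, and the safety margin of half the int range are all assumptions made for illustration.

```python
# Assumption: message sizes are limited by a signed 32-bit C int.
INT_MAX = 2**31 - 1


def chunk_slices(n_items, itemsize, max_bytes=INT_MAX // 2):
    """Return (start, stop) index pairs so that each chunk holds at most max_bytes."""
    per_chunk = max(1, max_bytes // itemsize)
    return [(start, min(start + per_chunk, n_items))
            for start in range(0, n_items, per_chunk)]


# Hypothetical example: a float32 image whose raw data is about as large
# as the 3566595060-byte payload reported in the traceback.
n_items = 3566595060 // 4
slices = chunk_slices(n_items, itemsize=4)

print("number of chunks:", len(slices))
print("first chunk:", slices[0])
print("last chunk:", slices[-1])
```

Each (start, stop) pair could then be used to slice the flattened image on the sending rank, with the receiving rank reassembling the pieces in the same order.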

@Saladino93
Author

OK, I actually managed to get it to run by doing

print("Saving memory by converting to float16 before applying pixel window function...")
modelMap=np.float16(modelMap) #NOTE: this is a bit of a hack to save memory

The total file size is 3.2 GB. Does this make sense to you?

I am not sure if this is due to some limitation on Perlmutter (doubt it), mpi4py, or something else (perhaps I ran my initial PS search wrongly...).
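
A rough estimate of why the float16 conversion might avoid the error: casting to a 2-byte element type shrinks the data by a factor of two or four, depending on the original type. The sketch below assumes the original array was float32 or float64 (the thread does not say which), ignores the small pickle overhead, and uses a made-up helper name, `rescaled_bytes`.

```python
import struct

# Assumption: the limit is that of a signed 32-bit C int.
INT_MAX = 2**31 - 1
payload = 3566595060  # bytes reported in the OverflowError


def rescaled_bytes(nbytes, old_fmt, new_fmt):
    """Approximate payload size after casting every element from old_fmt to new_fmt.

    Formats are struct codes: 'e' = float16, 'f' = float32, 'd' = float64.
    """
    return nbytes * struct.calcsize(new_fmt) // struct.calcsize(old_fmt)


for old_fmt, name in (("f", "float32"), ("d", "float64")):
    new_size = rescaled_bytes(payload, old_fmt, "e")
    verdict = "fits" if new_size <= INT_MAX else "too large"
    print(f"{name} -> float16: {new_size} bytes ({verdict})")
```

Under either assumption the converted payload would drop below the 32-bit limit, which would be consistent with the run succeeding after the cast.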

@mattyowl
Collaborator

mattyowl commented Jul 8, 2024

That's a mystery to me, because I've taken that out as I mentioned above. I don't think I've managed to get the OverflowError you've been getting, running on the sims I've been making or the real data.
