Distribution of detector blocks across MPI processes #334

Open · wants to merge 6 commits into master
Conversation

@anand-avinash (Contributor) commented Oct 9, 2024

This pull request is related to discussion #325.

Here I have added a new method _make_detector_blocks() to the Observation class. It takes a list of detectors and divides them into groups as specified by the det_blocks_attributes parameter passed to the Observation class. The groups of detectors are saved in the detector_blocks variable of the Observation class. For instance, with det_blocks_attributes = ["wafer", "pixel"], the detector groups are built so that each group contains detectors with the same wafer and pixel attributes.
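
For reference, here is a minimal sketch of how such attribute-based grouping could work; the function name and details below are only illustrative, not necessarily the exact code in this PR:

from collections import defaultdict

def make_detector_blocks(detectors, det_blocks_attributes):
    # Detectors sharing the same values of the requested attributes
    # (e.g. the same wafer and pixel) end up in the same block
    blocks = defaultdict(list)
    for det in detectors:
        key = tuple(getattr(det, attr) for attr in det_blocks_attributes)
        blocks[key].append(det)
    return list(blocks.values())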

Once the detectors are divided into groups, the method sets the number of detector blocks, and the number of time blocks is computed accordingly (n_blocks_time = comm.size // n_blocks_det). Given the structure of the code base, this scheme was easier to implement than I expected, and it doesn't break any other functionality.

On the other hand, this implementation can lead to load imbalance across MPI processes. Consider a simulation with 2 MPI processes and 4 detectors. If the detectors end up in two groups, with 3 and 1 elements respectively, one MPI process will have to work with 3 times more data than the other.

Another caveat is related to the det_idx member of the Observation class. Previously, it corresponded to the global list of detectors passed to the Observation class. With the introduction of detector blocks, this is no longer true. As a simple solution, I have introduced a member variable detectors_global, which is the global list of detectors sorted according to the detector groups; det_idx now corresponds to the indices of the detectors in detectors_global. But then we end up with three variables, detectors (the list of detectors used to initialize the Observation class), Observation.detector_blocks, and Observation.detectors_global, all storing the same data in different ways.
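
To make the relation between these three variables explicit, here is a rough sketch (illustrative only, not the exact code in this PR):

# detector_blocks: the groups produced by _make_detector_blocks()
# detectors_global: the same detectors, flattened block by block
detectors_global = [det for block in detector_blocks for det in block]

# On each MPI process, det_idx then holds the indices (into
# detectors_global) of the detectors assigned to that process.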

Once the documentation is added, this pull request should be ready to be merged. Please let me know if you have other solutions.

@ziotom78 (Member) left a comment

Excellent work, @anand-avinash. Thanks a lot! We can merge this once the documentation is updated and tests are added. Thank you for duly annotating all the types.

Great work!

Resolved review comments (outdated) on: litebird_sim/detectors.py, litebird_sim/distribute.py, litebird_sim/observations.py (4 threads), litebird_sim/simulations.py
@ziotom78 self-requested a review on October 16, 2024, 10:08
@ziotom78 (Member) left a comment

Thanks for having addressed the review comments!

@anand-avinash (Contributor, Author) commented

I believe this PR is ready to be merged, but there are a few minor issues that need discussion. The implementation of attribute-based detector blocks works well in the full simulation pipeline when comm.size equals n_blocks_det * n_blocks_time. However, we encounter errors when comm.size is larger, as this leaves some MPI processes unused. The Observation objects on the unused processes are still queried in several places, which leads to the errors. When det_blocks_attributes is provided, n_blocks_time is computed at runtime from comm.size (n_blocks_time = comm.size // n_blocks_det), so we cannot avoid this situation unless we run the simulations with a well-balanced set of detectors.
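
As a concrete illustration of this situation (the numbers below are chosen only for the example):

comm_size = 5
n_blocks_det = 2                            # number of detector groups
n_blocks_time = comm_size // n_blocks_det   # -> 2
used = n_blocks_det * n_blocks_time         # -> 4 processes hold detectors and TOD
unused = comm_size - used                   # -> 1 process is left with no data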

Nonetheless, I tried to resolve some of the errors, which I am documenting here.

The first error I encountered was:

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/BrahMap/test_examples/cross_talk/detector_blocks_full_example.py", line 127, in <module>
    binned_maps = lbs.make_binned_map(nside, sim.observations)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/binner.py", line 325, in make_binned_map
    detector_mask_list = _build_mask_detector_split(detector_split, obs_list)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/common.py", line 515, in _build_mask_detector_split
    detector_mask.append(np.ones(cur_obs.n_detectors, dtype=bool))
                                 ^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/observations.py", line 212, in n_detectors
    return len(self.det_idx)
           ^^^^^^^^^^^^^^^^^
TypeError: object of type 'NoneType' has no len()

which I solved simply by adding a None check in the n_detectors property:

if self.det_idx is None:
    return 0
else:
    return len(self.det_idx)
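
In context, the patched property in litebird_sim/observations.py would look roughly like this (a sketch based on the traceback above, not the verbatim code):

@property
def n_detectors(self):
    # Processes that own no detectors have det_idx set to None
    if self.det_idx is None:
        return 0
    else:
        return len(self.det_idx)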


The next error I encountered was

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/BrahMap/test_examples/cross_talk/detector_blocks_full_example.py", line 127, in <module>
    binned_maps = lbs.make_binned_map(nside, sim.observations)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/binner.py", line 329, in make_binned_map
    nobs_matrix = _build_nobs_matrix(
                  ^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/binner.py", line 209, in _build_nobs_matrix
    cur_weights = get_map_making_weights(cur_obs, check=True)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/common.py", line 107, in get_map_making_weights
    weights = observations.sampling_rate_hz * observations.net_ukrts**2
                                              ~~~~~~~~~~~~~~~~~~~~~~^^~
TypeError: unsupported operand type(s) for ** or pow(): 'NoneType' and 'int'

It appears because setattr_det_global() sets the detector attributes to None on the processes that are unused. I was able to fix it by setting the numerical attributes to 0 instead of None:

# Note: this requires "import numbers" at the top of the module
if not is_in_grid:  # The process does not own any detector (and TOD)
    null_det = DetectorInfo()
    attribute = getattr(null_det, name, None)
    value = 0 if isinstance(attribute, numbers.Number) else None
    setattr(self, name, value)
    return


The next error I encountered is

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/BrahMap/test_examples/cross_talk/detector_blocks_full_example.py", line 127, in <module>
    binned_maps = lbs.make_binned_map(nside, sim.observations)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/binner.py", line 329, in make_binned_map
    nobs_matrix = _build_nobs_matrix(
                  ^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/binner.py", line 224, in _build_nobs_matrix
    pixidx_all, polang_all = _compute_pixel_indices(
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/common.py", line 222, in _compute_pixel_indices
    del curr_pointings_det
        ^^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'curr_pointings_det' where it is not associated with a value

It can be solved by replacing line 222 of litebird_sim/mapmaking/common.py with the following:

        try:
            del curr_pointings_det
        except UnboundLocalError:
            pass

The next one I encountered while calling the destriper:

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/destriper.py", line 1619, in make_destriped_map
    len(params.samples_per_baseline)
TypeError: object of type 'int' has no len()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/BrahMap/test_examples/cross_talk/detector_blocks_full_example.py", line 131, in <module>
    destriped_maps = lbs.make_destriped_map(nside, sim.observations)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/destriper.py", line 1630, in make_destriped_map
    baseline_lengths_list = [
                            ^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/destriper.py", line 1631, in <listcomp>
    split_items_evenly(
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/destriper.py", line 101, in split_items_evenly
    assert sub_n < n, "sub_n={0} is not smaller than n={1}".format(sub_n, n)
           ^^^^^^^^^
AssertionError: sub_n=100 is not smaller than n=0

This one seems a bit less trivial to solve: as the assertion shows, split_items_evenly() ends up being called with n=0 on the processes that hold no data.

In the end, we may face errors elsewhere as well, in code paths I have not tested. This brings me to the question: should we worry about fixing this issue, or should we leave it to the user to supply a well-balanced set of detectors?

@ziotom78 (Member) commented

In the end, we may face errors elsewhere as well, in code paths I have not tested. This brings me to the question: should we worry about fixing this issue, or should we leave it to the user to supply a well-balanced set of detectors?

The simplest solution would probably be to check in Simulation.__init__ and crash if the number of MPI processes is too large. I'm not sure that this is the best option, though; we should ask the opinion of the people in the Simulation Production Team.
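
A minimal sketch of what such a guard could look like, assuming the block counts are already known when the check runs (names are illustrative):

# Hypothetical check, e.g. in Simulation.__init__ or wherever the
# block layout is decided
if comm.size > n_blocks_det * n_blocks_time:
    raise ValueError(
        f"{comm.size} MPI processes were requested, but only "
        f"{n_blocks_det * n_blocks_time} can be used with "
        f"n_blocks_det={n_blocks_det} and n_blocks_time={n_blocks_time}"
    )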
