Distribution of detector blocks across MPI processes #334

Open · wants to merge 6 commits into master
Conversation

@anand-avinash (Contributor) commented Oct 9, 2024

This pull request is related to discussion #325.

Here I have added a new method _make_detector_blocks() to the Observation class. It takes a list of detectors and divides them into groups as specified by the det_blocks_attributes parameter passed to the Observation class. The groups of detectors are saved in the detector_blocks variable of the Observation class. For instance, with det_blocks_attributes = ["wafer", "pixel"], the detector groups are built so that each group contains detectors with the same wafer and pixel attributes.
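
For reference, here is a minimal sketch of how such attribute-based grouping could work; the function name and details below are only illustrative, not necessarily the exact code in this PR:

from collections import defaultdict

def make_detector_blocks(detectors, det_blocks_attributes):
    # Detectors sharing the same values of the requested attributes
    # (e.g. the same wafer and pixel) end up in the same block
    blocks = defaultdict(list)
    for det in detectors:
        key = tuple(getattr(det, attr) for attr in det_blocks_attributes)
        blocks[key].append(det)
    return list(blocks.values())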

Once the detectors are divided into groups, the method sets the number of detector blocks, and the number of time blocks is computed accordingly (n_blocks_time = comm.size // n_blocks_det). Given the structure of the code base, this scheme was easier to implement than I expected, and it doesn't break any other functionality.

On the other hand, this implementation can lead to load imbalance across MPI processes. Consider a simulation with 2 MPI processes and 4 detectors. If the detectors end up in two groups, with 3 and 1 elements respectively, one MPI process will have to work with 3 times more data than the other.

Another caveat is related to the det_idx member of the Observation class. Previously, it corresponded to the global list of detectors passed to the Observation class. With the introduction of detector blocks, this is no longer true. As a simple solution, I have introduced a member variable detectors_global, which is the global list of detectors sorted according to the detector groups; det_idx now corresponds to the indices of the detectors in detectors_global. But then we end up with three variables, detectors (the list of detectors used to initialize the Observation class), Observation.detector_blocks, and Observation.detectors_global, all storing the same data in different ways.
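
To make the relation between these three variables explicit, here is a rough sketch (illustrative only, not the exact code in this PR):

# detector_blocks: the groups produced by _make_detector_blocks()
# detectors_global: the same detectors, flattened block by block
detectors_global = [det for block in detector_blocks for det in block]

# On each MPI process, det_idx then holds the indices (into
# detectors_global) of the detectors assigned to that process.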

Once the documentation is added, this pull request should be ready to be merged. Please let me know if you have other solutions.

@ziotom78 (Member) left a comment

Excellent work, @anand-avinash. Thanks a lot! We can merge this once the documentation is updated and tests are added. Thank you for duly annotating all the types.

Great work!

Resolved review comments (outdated) on: litebird_sim/detectors.py, litebird_sim/distribute.py, litebird_sim/observations.py (4 threads), litebird_sim/simulations.py
@ziotom78 self-requested a review on October 16, 2024, 10:08
@ziotom78 (Member) left a comment

Thanks for having addressed the review comments!

@anand-avinash (Contributor, Author) commented

I believe this PR is ready to be merged, but there are a few minor issues that need discussion. The implementation of attribute-based detector blocks works well in the full simulation pipeline when comm.size equals n_blocks_det * n_blocks_time. However, we encounter errors when comm.size is larger, as this leaves some MPI processes unused. The Observation objects on the unused processes are still queried in several places, which leads to the errors. When det_blocks_attributes is provided, n_blocks_time is computed at runtime from comm.size (n_blocks_time = comm.size // n_blocks_det), so we cannot avoid this situation unless we run the simulations with a well-balanced set of detectors.
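
As a concrete illustration of this situation (the numbers below are chosen only for the example):

comm_size = 5
n_blocks_det = 2                            # number of detector groups
n_blocks_time = comm_size // n_blocks_det   # -> 2
used = n_blocks_det * n_blocks_time         # -> 4 processes hold detectors and TOD
unused = comm_size - used                   # -> 1 process is left with no data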

Nonetheless, I tried to resolve some of the errors, which I am documenting here.

The first error I encountered was:

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/BrahMap/test_examples/cross_talk/detector_blocks_full_example.py", line 127, in <module>
    binned_maps = lbs.make_binned_map(nside, sim.observations)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/binner.py", line 325, in make_binned_map
    detector_mask_list = _build_mask_detector_split(detector_split, obs_list)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/common.py", line 515, in _build_mask_detector_split
    detector_mask.append(np.ones(cur_obs.n_detectors, dtype=bool))
                                 ^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/observations.py", line 212, in n_detectors
    return len(self.det_idx)
           ^^^^^^^^^^^^^^^^^
TypeError: object of type 'NoneType' has no len()

which I solved simply by adding a None check in the n_detectors property:

if self.det_idx is None:
    return 0
else:
    return len(self.det_idx)
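
In context, the patched property in litebird_sim/observations.py would look roughly like this (a sketch based on the traceback above, not the verbatim code):

@property
def n_detectors(self):
    # Processes that own no detectors have det_idx set to None
    if self.det_idx is None:
        return 0
    else:
        return len(self.det_idx)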


The next error I encountered was

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/BrahMap/test_examples/cross_talk/detector_blocks_full_example.py", line 127, in <module>
    binned_maps = lbs.make_binned_map(nside, sim.observations)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/binner.py", line 329, in make_binned_map
    nobs_matrix = _build_nobs_matrix(
                  ^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/binner.py", line 209, in _build_nobs_matrix
    cur_weights = get_map_making_weights(cur_obs, check=True)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/common.py", line 107, in get_map_making_weights
    weights = observations.sampling_rate_hz * observations.net_ukrts**2
                                              ~~~~~~~~~~~~~~~~~~~~~~^^~
TypeError: unsupported operand type(s) for ** or pow(): 'NoneType' and 'int'

It appears because setattr_det_global() sets the detector attributes to None on the processes that are unused. I was able to fix it by setting the numerical attributes to 0 instead of None:

# Note: this requires "import numbers" at the top of the module
if not is_in_grid:  # The process does not own any detector (and TOD)
    null_det = DetectorInfo()
    attribute = getattr(null_det, name, None)
    value = 0 if isinstance(attribute, numbers.Number) else None
    setattr(self, name, value)
    return


The next error I encountered is

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/BrahMap/test_examples/cross_talk/detector_blocks_full_example.py", line 127, in <module>
    binned_maps = lbs.make_binned_map(nside, sim.observations)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/binner.py", line 329, in make_binned_map
    nobs_matrix = _build_nobs_matrix(
                  ^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/binner.py", line 224, in _build_nobs_matrix
    pixidx_all, polang_all = _compute_pixel_indices(
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/common.py", line 222, in _compute_pixel_indices
    del curr_pointings_det
        ^^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'curr_pointings_det' where it is not associated with a value

It can be solved by replacing line 222 of litebird_sim/mapmaking/common.py with the following:

        try:
            del curr_pointings_det
        except UnboundLocalError:
            pass

The next one I encountered while calling the destriper:

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/destriper.py", line 1619, in make_destriped_map
    len(params.samples_per_baseline)
TypeError: object of type 'int' has no len()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/BrahMap/test_examples/cross_talk/detector_blocks_full_example.py", line 131, in <module>
    destriped_maps = lbs.make_destriped_map(nside, sim.observations)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/destriper.py", line 1630, in make_destriped_map
    baseline_lengths_list = [
                            ^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/destriper.py", line 1631, in <listcomp>
    split_items_evenly(
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/destriper.py", line 101, in split_items_evenly
    assert sub_n < n, "sub_n={0} is not smaller than n={1}".format(sub_n, n)
           ^^^^^^^^^
AssertionError: sub_n=100 is not smaller than n=0

This one seems a bit less trivial to solve: as the assertion shows, split_items_evenly() ends up being called with n=0 on the processes that hold no data.

In the end, we may face errors elsewhere as well, in code paths I have not tested. This brings me to the question: should we worry about fixing this issue, or should we leave it to the user to supply a well-balanced set of detectors?

@ziotom78 (Member) commented

In the end, we may face errors elsewhere as well, in code paths I have not tested. This brings me to the question: should we worry about fixing this issue, or should we leave it to the user to supply a well-balanced set of detectors?

The simplest solution would probably be to check in Simulation.__init__ and crash if the number of MPI processes is too large. I'm not sure that this is the best option, though; we should ask the opinion of the people in the Simulation Production Team.
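
A minimal sketch of what such a guard could look like, assuming the block counts are already known when the check runs (names are illustrative):

# Hypothetical check, e.g. in Simulation.__init__ or wherever the
# block layout is decided
if comm.size > n_blocks_det * n_blocks_time:
    raise ValueError(
        f"{comm.size} MPI processes were requested, but only "
        f"{n_blocks_det * n_blocks_time} can be used with "
        f"n_blocks_det={n_blocks_det} and n_blocks_time={n_blocks_time}"
    )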
