Skip to content

Commit

Permalink
Merge pull request #14 from SWIFTSIM/iterator
Browse files Browse the repository at this point in the history
Efficient iteration over SWIFTGalaxy objects
  • Loading branch information
kyleaoman authored Oct 18, 2024
2 parents 42e7e95 + 9001630 commit 2eb0e78
Show file tree
Hide file tree
Showing 22 changed files with 5,649 additions and 767 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,9 @@ jobs:
- uses: actions/checkout@v4
- name: Run Black
uses: psf/black@stable
with:
options: "--check --verbose"
jupyter: true
- name: Set up Python environment
uses: actions/setup-python@v5
with:
Expand All @@ -29,9 +32,12 @@ jobs:
run: mypy --install-types --non-interactive
- name: Run mypy
run: mypy

- name: Install numpydoc
run: pip install numpydoc
- name: Run docstring validation
run: ./run_validate_docstrings

build:
tests:
needs: lint
runs-on: ubuntu-latest
strategy:
Expand Down
4 changes: 3 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -5,4 +5,6 @@
*~
\#*
/docs/build
*.hdf5
*.hdf5
/examples/.ipynb_checkpoints/
*.ewah
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ Documentation contents
additional_coordinates/index
halo_catalogues/index
masking/index
iterating/index
modules/index

Citing SWIFTGalaxy
Expand Down
128 changes: 128 additions & 0 deletions docs/source/iterating/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,128 @@
Iterating SWIFTGalaxies
=======================

When similar operations need to be applied to multiple :class:`~swiftgalaxy.reader.SWIFTGalaxy` objects it often makes sense to iterate over them (possibly in parallel, see below). The :class:`~swiftgalaxy.iterator.SWIFTGalaxies` class aims to make this convenient and efficient, mostly by internally managing the order of iteration to avoid re-reading data from disk as much as possible - most often when two objects of interest share a common top-level cell in a SWIFT simulation.

Setting up a :class:`~swiftgalaxy.iterator.SWIFTGalaxies` instance is deliberately designed to be as similar to creating a :class:`~swiftgalaxy.reader.SWIFTGalaxy` as possible. There are two differences:

- The halo catalogue helper class is initialized with a list of targets instead of a single target identifier.
- To minimize I/O operations, the :class:`~swiftgalaxy.iterator.SWIFTGalaxies` class needs to know what particle data fields will be accessed, using the ``preload`` argument.

The :class:`~swiftgalaxy.iterator.SWIFTGalaxies` class otherwise broadly accepts the same initialization arguments as the :class:`~swiftgalaxy.reader.SWIFTGalaxy` class, such as ``auto_recentre``.

It is crucial to know that the order of iteration used by a :class:`~swiftgalaxy.iterator.SWIFTGalaxies` is internally optimized; see the `order of iteration`_ section below.

Multiple-target halo catalogues
-------------------------------

The halo catalogue helper classes (such as :class:`~swiftgalaxy.halo_catalogues.SOAP`, :class:`~swiftgalaxy.halo_catalogues.Velociraptor`, :class:`~swiftgalaxy.halo_catalogues.Caesar`, ...) can be initialized with a list of targets instead of a single target identifier (:mod:`numpy` arrays or :mod:`unyt` arrays are also OK). This results in some new behaviour:
- When a halo catalogue is created with multiple targets, accessing attributes of the object (such as via the :class:`~swiftgalaxy.reader.SWIFTGalaxy` ``halo_catalogue`` attribute) will return the relevant property of all targets in the catalogue. For example:

.. code-block:: python
soap = SOAP("my_soap_catalogue.hdf5", soap_index=[11, 22, 33])
m200 = soap.spherical_overdensity_200_crit.total_mass
sets ``m200`` to something like ``cosmo_array([1.2e+12, 3.4e+12, 5.6e+12], dtype=float32, units='1.98841586e+30*kg', comoving=False)``. Keep in mind that the format of halo catalogue values depends on the halo catalogue, keeping formats that should already be familiar to users of each catalogue type.

Once iteration over the targets contained in a :class:`~swiftgalaxy.iterator.SWIFTGalaxies` begins, this behaviour changes: during an iteration step, the catalogue is "masked" down to a single object and behaves just like it would for a single-target :class:`~swiftgalaxy.reader.SWIFTGalaxy`:

.. code-block:: python
sgs = SWIFTGalaxies("my_snapshot.hdf5", soap)
for sg in sgs:
m200 = sg.halo_catalogue.spherical_overdensity_200_crit.total_mass
Then on each iteration the value of ``m200`` will look similar to ``cosmo_array([1.2e+12], dtype=float32, units='1.98841586e+30*kg', comoving=False)``.

Preloading particle data fields
-------------------------------

Unfortunately :class:`~swiftgalaxy.iterator.SWIFTGalaxies` is not prescient (or not yet sufficiently introspective...) to know what data will need to be accessed during iteration. However, to keep I/O operations to a minimum, relevant data needs to be pre-loaded for all targets that share a top level cell (or group of cells) in a SWIFT snapshot. The :class:`~swiftgalaxy.iterator.SWIFTGalaxies` class therefore accepts an argument at initialization called ``preload`` where the data to be used can be specified with a convenient syntax mimicking the attribute lookup syntax from the :class:`~swiftsimio.reader.SWIFTDataset`. The set of fields to pre-load is expected as a :obj:`set`, but other collections (like :obj:`list`) that can be cast to a :obj:`set` are also acceptable. As a schematic example:

.. code-block:: python
SWIFTGalaxies(
"my_snapshot.hdf5",
SOAP(
"my_soap_catalogue.hdf5",
soap_index=[11, 22, 33],
)
preload={
"gas.coordinates",
"dark_matter.masses",
"stars.element_abundances.carbon",
"black_holes.velocities",
}
)
If nothing is specified in the ``preload`` argument, a warning will be generated as a reminder that some fields probably should be specified here. In this case the :class:`~swiftgalaxy.iterator.SWIFTGalaxies` can still be used but will be likely to duplicate I/O operations, so this warning should not be ignored. Similarly, if particle data field is accessed and it has not been pre-loaded, a warning will be generated as a hint that this field should probably be added to the ``preload`` set; these warnings should similarly not be ignored.

Order of iteration
------------------

.. warning::
The most important thing to remember when using :class:`~swiftgalaxy.iterator.SWIFTGalaxies` is that it determines the best order to iterate over your chosen galaxies itself to minimize I/O operations. Be careful not to assume that the iteration is in the order of the target identifiers that you passed to the halo catalogue helper class (such as :class:`~swiftgalaxy.halo_catalogues.SOAP`).

The main purpose of the :class:`~swiftgalaxy.iterator.SWIFTGalaxies` class is to determine an order to iterate through the list of target objects without duplicating I/O operations more than necessary.

In brief, the class evaluates two iteration schemes at initialization and chooses the one that will be most efficient. The first scheme is the "sparse" solution. This determines which top-level cells need to be read for each target galaxy and groups targets together when they share a common region. The second scheme is the "dense" solution. This determines the largest region needed by any single target and then covers the simulation volume in regions of that size on a grid, and also on a second grid offset by half a grid spacing in each dimension from the first. Targets are then assigned to the region with the closest centre (guaranteeing that they fit in their assigned region). Any regions without targets in them are discarded. This second scheme is most often optimal when there are many closely packed targets that often overlap the boundaries of top-level cells, such as when iterating over all galaxies in a simulation volume.

The iteration order of the targets is available from the :attr:`~swiftgalaxy.iterator.SWIFTGalaxies.iteration_order` property. However, if obtaining ordered results is required, using the :func:`~swiftgalaxy.iterator.SWIFTGalaxies.map` method is usually a more convenient approach than e.g. sorting the output of iteration in the optimal order.

Map
...

The :func:`swiftgalaxy.iterator.SWIFTGalaxies.map` function can be used to apply a function to the collection of :class:`~swiftgalaxy.reader.SWIFTGalaxy`'s represented by the :class:`~swiftgalaxy.iterator.SWIFTGalaxies` object (one at a time), and obtain the results. As a simple example:

.. code-block:: python
from swiftgalaxy import SWIFTGalaxies, SOAP
# define the function that we will apply to each SWIFTGalaxy object:
def dm_median_position(sg):
return np.median(sg.dark_matter.coordinates, axis=0)
sgs = SWIFTGalaxies(
"my_snapshot.hdf5",
SOAP(
"my_soap.hdf5",
soap_index=[11, 22, 33],
preload={"dark_matter.coordinates"},
),
)
my_result = sgs.map(dm_median_position)
The ``my_result`` variable will be a list containing the result of the ``dm_median_position`` function applied to the galaxies at ``soap_index=11``, ``22`` and ``33`` in that order. If the list ``[22, 33, 11]`` was instead passed to the :class:`~swiftgalaxy.halo_catalogues.SOAP` (or other halo catalogue class from :mod:`swiftgalaxy`), the output order would change correspondingly (even though internally the galaxies are most likely iterated over in the same order in both cases). Using :func:`~swiftgalaxy.iterator.SWIFTGalaxies.map` is in general not equivalent to:

.. code-block:: python
my_result = [dm_median_position(sg) for sg in sgs]
Relying on the iteration order ``for sg in sgs`` should be avoided if the order of iteration or output matters, but can be appropriate if order is unimportant (such as a function or operation that produces an output file for each input :class:`~swiftgalaxy.reader.SWIFTGalaxy` named according to its unique identifier).

The :func:`~swiftgalaxy.iterator.SWIFTGalaxies.map` function can also accept additional argument values and/or keyword argument values. For example, the following code:

.. code-block:: python
sg1 = SWIFTGalaxy(...)
sg2 = SWIFTGalaxy(...)
my_result = [my_func(sg1, 123, extra_data=456), my_func(sg2, 789, extra_data=None)]
Can be more succinctly written as (and will run more efficiently as):

.. code-block:: python
sgs = SWIFTGalaxies(...) # contains the same galaxies as sg1 and sg2, in that order
my_result = sgs.map(
my_func,
args=[(123, ), (789, )],
kwargs=[dict(extra_data=456), dict(extra_data=None)]
)
Notice especially that the argument lists are bundled in :obj:`tuple`'s (of one element, in this case). The comma in ``(123, )`` is therefore not optional. If the function accepted two arguments they could be passed as something like ``args=[(123, "abc"), (456, "def")]``. Additional keyword arguments can similarly be added by adding additional :obj:`dict` entries.

Parallel iteration
------------------

There is an obvious opportunity to support iterating over :class:`~swiftgalaxy.reader.SWIFTGalaxy` objects in parallel through the :class:`~swiftgalaxy.iterator.SWIFTGalaxies` interface. The initial release of the :class:`~swiftgalaxy.iterator.SWIFTGalaxies` feature has omitted this to focus on a working serial implementation first. Tools for parallel analysis are planned for a future release.
7 changes: 7 additions & 0 deletions docs/source/modules/swiftgalaxy.iterator.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
swiftgalaxy.iterator module
===========================

.. automodule:: swiftgalaxy.iterator
:members:
:show-inheritance:
:inherited-members:
2 changes: 1 addition & 1 deletion docs/source/modules/swiftgalaxy.reader.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,4 +5,4 @@ swiftgalaxy.reader module
:members:
:show-inheritance:
:inherited-members:
:private-members: _SWIFTParticleDatasetHelper, _SWIFTNamedColumnDatasetHelper, _CoordinateHelper
:private-members: _SWIFTGroupDatasetHelper, _SWIFTNamedColumnDatasetHelper, _CoordinateHelper
1 change: 1 addition & 0 deletions docs/source/modules/swiftgalaxy.rst
Original file line number Diff line number Diff line change
Expand Up @@ -16,3 +16,4 @@ Submodules
swiftgalaxy.halo_catalogues
swiftgalaxy.masks
swiftgalaxy.reader
swiftgalaxy.iterator
272 changes: 197 additions & 75 deletions examples/SWIFTGalaxy_Colibre_QuickStart.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion mypy.ini
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
[mypy]
ignore_missing_imports = True
modules = swiftgalaxy.reader, swiftgalaxy.halo_catalogues, swiftgalaxy.masks
modules = swiftgalaxy.reader, swiftgalaxy.halo_catalogues, swiftgalaxy.masks, swiftgalaxy.iterator
37 changes: 35 additions & 2 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "swiftgalaxy"
version="1.1.0"
version="1.2.0"
authors = [
{ name="Kyle Oman", email="[email protected]" },
]
Expand All @@ -30,4 +30,37 @@ dependencies = [
"Documentation" = "https://swiftgalaxy.readthedocs.io/en/latest"

[project.optional-dependencies]
velociraptor = ["velociraptor"]
velociraptor = ["velociraptor"]

[tool.numpydoc_validation]
checks = [
"GL01",
"GL02",
"GL03",
"GL05",
"GL06",
"GL07",
"GL08",
"GL10",
"SS01",
"SS02",
"SS03",
"SS04",
"PR01",
"PR02",
"PR03",
"PR04",
"PR05",
"PR06",
"PR07",
"PR08",
"PR10",
"RT03",
"RT04",
"RT05",
]
exclude = [
'\.__init__$',
"__init__",
"__version__",
]
1 change: 1 addition & 0 deletions run_validate_docstrings
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
python -m numpydoc lint swiftgalaxy**/*.py
1 change: 1 addition & 0 deletions swiftgalaxy/__init__.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
from .reader import SWIFTGalaxy
from .iterator import SWIFTGalaxies
from .halo_catalogues import SOAP, Velociraptor, Caesar, Standalone
from .masks import MaskCollection
from .__version__ import __version__
Expand Down
2 changes: 1 addition & 1 deletion swiftgalaxy/__version__.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
__version__ = "1.1.0"
__version__ = "1.2.0"
Loading

0 comments on commit 2eb0e78

Please sign in to comment.