Updating to v0.7-dev

LLNL · Oct 31, 2024 · 4e8cfa3 · 4e8cfa3
2 parents 3abb92f + 1869500
commit 4e8cfa3
Show file tree

Hide file tree

Showing 97 changed files with 3,718 additions and 2,474 deletions.
diff --git a/.github/workflows/ci-test.yaml b/.github/workflows/ci-test.yaml
@@ -14,15 +14,23 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        os: [ubuntu-latest, ubuntu-22.04]
-        gcc-version: [11, 12]
+        os: [ubuntu-24.04, ubuntu-22.04]
+        gcc-version: [11, 12, 13, 14]
         mpi-type: [mpich, openmpi]
         exclude:
-          - os: ubuntu-latest
-            gcc-version: 8
+          - os: ubuntu-22.04
+            gcc-version: 13
+          - os: ubuntu-22.04
+            gcc-version: 14
+          - os: ubuntu-24.04
+            mpi-type: mpich
     runs-on: ${{ matrix.os }}
     steps:
       - uses: actions/checkout@v4
+      - name: Update apt
+        run: |
+          sudo add-apt-repository -y universe
+          sudo apt-get update
       - name: Cache boost
         uses: actions/cache@v4
         id: cache-boost
@@ -49,8 +57,8 @@ jobs:
         if: matrix.mpi-type == 'openmpi'
         run: sudo apt-get install openmpi-bin libopenmpi-dev
       - name: Install GCC-${{ matrix.gcc-version }}
-        if: matrix.gcc-version == '8'
-        run: sudo apt-get install gcc-8 g++-8
+        if: (matrix.gcc-version == '11' && matrix.os == 'ubuntu-24.04')
+        run: sudo apt-get install gcc-11 g++-11
       - name: Make
         run: |
           echo Run 'make'

diff --git a/.gitignore b/.gitignore
@@ -2,4 +2,5 @@
 *#*
 build*
 .vscode*
-.idea*
+.idea*
+.cache/
diff --git a/.readthedocs.yaml b/.readthedocs.yaml
@@ -2,7 +2,7 @@ version: 2
 
 # Set OS and Python versions
 build:
-  os: ubuntu-22.04
+  os: ubuntu-24.04
   tools:
     python: "3.12"
 
@@ -13,4 +13,4 @@ python:
 
 # Change the location of the configuration file
 sphinx:
-  configuration: docs/rtd/conf.py
+  configuration: docs/rtd/conf.py
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -144,7 +144,7 @@ endif ()
 #
 include(FindArrowParquet)
 option(YGM_REQUIRE_ARROW_PARQUET "YGM requires Apache Arrow Parquet." OFF)
-find_arrow_parquet()
+find_or_install_arrow_parquet()
 
 #
 # Create the YGM target library

diff --git a/Readme.md b/Readme.md
@@ -1,33 +1,70 @@
-# What is YGM?
-
-YGM is an asynchronous communication library designed for irregular communication patterns. It is built on a
-communicator abstraction, much like MPI, but communication is handled asynchronously and is initiated by senders without
-any interaction with receivers. YGM features
-* **Message buffering** - Increases application throughput.
-* **Fire-and-Forget RPC Semantics** - A sender provides the function and function arguments for execution on a specified
-  destination rank through an `async` call. This function will complete on the destination rank at an unspecified time
-  in the future, but YGM does not explicitly make the sender aware of this completion.
-* **Storage Containers** - YGM provides a collection of distributed storage containers with asynchronous
-  interfaces, used for many common distributed memory operations. Containers are designed to partition data, allowing
-insertions to occur from any rank. Data is accessed through collective `for_all` operations that execute a user-provided
-function on every stored object, or, when a particular piece of data's location is known, `visit`-type operations that
-perform a user-provided function only on the desired data. These containers are found
-[here](/include/ygm/container/).
-
-# Getting Started
+## What is YGM?
+
+YGM is an asynchronous communication library written in C++ and designed for high-performance computing (HPC) use cases featuring 
+irregular communication patterns. YGM includes a collection of
+distributed-memory storage containers designed to express common algorithmic and data-munging tasks. These containers
+automatically partition data, allowing insertions and, with most containers, processing of individual elements to be
+initiated from any runninng YGM process.
+
+Underlying YGM's containers is a communicator abstraction. This communicator asynchronously sends messages spawned by
+senders with receivers needing no knowledge of incoming messages prior to their arrival. YGM communications take the
+form of *active messages*; each message contains a function object to execute (often in the form of C++ lambdas), data
+and/or pointers to data for this function to execute on, and a destination process for the message to be executed at.
+
+YGM also includes a set of I/O primitives for parsing collections of input documents in parallel as independent lines of
+text and streaming output lines to
+large numbers of destination files. Current parsing functionality supports reading input as CSV, ndjson, and
+unstructured lines of data.
+
+## General YGM Operations
+
+YGM is built on its ability to communicate active messages asynchronously between running processes. This does not
+capture every operation that can be useful, for instance collective operations are still widely needed. YGM uses
+prefixes on function names to distinguish their behaviors in terms of the processes involved. These prefixes are:
+   * `async_`: Asynchronous operation initiated on a single process. The execution of the underlying function may
+     occur on a remote process.
+   * `local_`: Function performs only local operations on data of the current process. In uses within YGM containers
+     with partitioning schemes that determine item ownership, care must be taken to ensure the process a `local_`
+     operation is called from aligns with the item's owner. For instance, calling `ygm::container::map::local_insert`
+     will store an item on the process where the call is made, but the `ygm::container::map` may not be able to look
+     up this location if it is on the wrong process.
+   * No Prefix: Collective operation that must be called from all processes.
+
+The primary workhorse functions in YGM fall into the two categories of `async_` and `for_all` operations. In an
+`async_` operation, a lambda is asynchronously sent to a (potentially) remote process for execution. In many cases
+with YGM containers, the lambda being executed is not provided by the user and is instead part of the function itself,
+e.g. `async_insert` calls on most containers. A `for_all` operation is a collective operation in which a lambda is 
+executed locally on every process while iterating over all locally held items of some YGM object. The items iterated
+over can be items in a YGM container, items coming from a map, filter, or flatten applied to a container, or all lines
+in a collection of files in a YGM I/O parser.
+
+### Lambda Capture Rules
+Certain `async_` and `for_all` operations require users to provide lambdas as part of their executions. The lambdas
+that can be accepted by these two classes of functions follow different rules pertaining to the capturing of variables:
+   * `async_` calls cannot capture (most) variables in lambdas. Variables necessary for lambda execution must be
+     provided as arguments to the `async_` call. In the event that the data for the lambda resides on the remote
+     process the lambda will execute on, a `ygm::ygm_ptr` should be passed as an argument to the `async_`.
+   * `for_all` calls assume lambdas take only the arguments inherently provided by the YGM object being iterated over.
+     All other necessary variables *must* be captured. The types of arguments provided to the lambda can be identified
+     by the `for_all_args` type within the YGM object.
+
+These differences in behavior arise from the distinction that `async_` lambdas may execute on a remote process, while
+`for_all` lambdas are guaranteed to execute locally to a process. In the case of `async_` operations, the lambda and
+all arguments must be serialized for communication, but C++ does not provide a method for inspection of variables
+captured in the closure of a lambda. In the case of `for_all` operations, the execution is equivalent to calling
+[`std::for_each`](https://en.cppreference.com/w/cpp/algorithm/for_each) on entire collection of items held locally.
 
 ## Requirements
-* C++17 - GCC versions 8, 9 and 10 are tested. Your mileage may vary with other compilers.
+* C++20 - GCC versions 11 and 12 are tested. Your mileage may vary with other compilers.
 * [Cereal](https://github.com/USCiLab/cereal) - C++ serialization library
 * MPI
 * Optionally, Boost 1.77 to enable Boost.JSON support.  
 
-
 ## Using YGM with CMake
 YGM is a header-only library that is easy to incorporate into a project through CMake. Adding the following to
 CMakeLists.txt will install YGM and its dependencies as part of your project:
 ```
-set(DESIRED_YGM_VERSION 0.4)
+set(DESIRED_YGM_VERSION 0.6)
 find_package(ygm ${DESIRED_YGM_VERSION} CONFIG)
 if (NOT ygm_FOUND)
     FetchContent_Declare(
@@ -52,62 +89,6 @@ else ()
 endif ()
 ```
 
-# Anatomy of a YGM Program
-Here we will walk through a basic "hello world" YGM program. The [examples directory](/examples/) contains several other
-examples, including many using YGM's storage containers.
-
-To begin, headers for a YGM communicator are needed
-``` C++
-#include <ygm/comm.hpp>
-```
-
-At the beginning of the program, a YGM communicator must be constructed. It will be given `argc` and `argv` like
-`MPI_Init`, and it has an optional third argument that specifies the aggregate size (in bytes) allowed for all send
-buffers before YGM begins flushing sends. Here, we will make a buffer with 32MB of aggregate send buffer space.
-``` C++
-ygm::comm world(&argc, &argv, 32*1024*1024);
-```
-
-Next, we need a lambda to send through YGM. We'll do a simple hello\_world type of lambda.
-``` C++
-auto hello_world_lambda = [](const std::string &name) {
-	std::cout << "Hello " << name << std::endl;
-};
-```
-
-Finally, we use this lambda inside of our `async` calls. In this case, we will have rank 0 send a message to rank 1,
-telling it to greet the world
-``` C++
-if (world.rank0()) {
-	world.async(1, hello_world_lambda, std::string("world"));
-}
-```
-
-The full, compilable version of this example is found [here](/examples/hello_world.cpp). Running it prints a single
-"Hello world".
-
-# Potential Pitfalls
-
-## Allowed Lambdas
-There are two distinct classes of lambdas that can be given to YGM: *remote lambdas* and *local lambdas*, each of which
-has different requirements.
-
-### Remote Lambdas
-A *remote lambda* is any lambda that may potentially be executed on a different rank. These lambdas are identified as
-being those given to a `ygm::comm` or any of the storage containers through a function prefixed by `async_`.
-
-The defining feature of remote lambdas is they **must not** capture any variables; all variables must be provided as
-arguments. This limitation is due to the lack of
-ability for YGM to inspect and extract these arguments when serializing messages to be sent to other ranks.
-
-### Local Lambdas
-A *local lambda* is any lambda that is guaranteed not to be sent to a remote rank. These lambdas are identified as being
-those given to a `for_all` operation on a storage container.
-
-The defining feature of local lambdas is that all arguments besides what is stored in the container must be captured.
-Internally, these lambdas may be given to a [`std::for_each`](https://en.cppreference.com/w/cpp/algorithm/for_each) that
-iterates over the container's elements stored locally on each rank.
-
 # License
 YGM is distributed under the MIT license.
-Original file line number
+Diff line change
@@ Expand Up / @@ -2,4 +2,5 @@ @@
     *#*
     build*
     .vscode*
-    .idea*
+    .idea*
+    .cache/