URL: https://github.com/cyberaide/poster-summit-cylon
Authors: Arup Sarkar, Niranda Perera, Gregor von Laszewski, Mills Staylor, Geoffrey C. Fox
The proliferation of sensors, internet-connected devices, and social media has exposed us to an unprecedented volume of data. Scientific data, sourced from various channels, has become increasingly complex, with diverse attributes, high dimensions, and intricate variable interconnections. Managing, structuring, and preparing such data is essential for applying deep learning, the dominant approach in large-scale data science. However, this process often becomes a bottleneck, compounded by challenges in transferring data for model training. These issues impact scientific domains such as genomics, climate modeling, accelerator physics, astronomy, and neuroscience, with data sizes ranging from 200GB to 10PB. To address these challenges on high-performance computing (HPC) platforms, we propose integrating scalable runtime tools with data frameworks; parallel computing and distributed communication protocols play a pivotal role in this integration.
Our goal is to create an integrated approach deployable across clouds, supercomputers, and HPC platforms, accommodating GPUs alongside CPUs and supporting heterogeneous federated distributed systems. By combining RADICAL-Pilot and Cylon, we create a heterogeneous runtime environment for scalable compute- and data-intensive workloads, including data-parallel (MPI) jobs. Our design includes multiple masters and thousands of workers with function-based task scheduling and resource allocation, simplifying job execution. Optimizing heterogeneous systems with compiler-based technologies such as MLIR also holds promise. Our approach excels on scientific and engineering research HPC systems, scaling toward exascale, and performs robustly on cloud infrastructures. This dual capability fosters collaboration and innovation within the open-source scientific research community.
- Website: https://cylondata.org/
- Poster pptx: https://github.com/cyberaide/poster-summit-cylon/raw/main/vonLaszewski-heterogeneous-data-pipeline-2.pptx
- Poster PDF: https://github.com/cyberaide/poster-summit-cylon/blob/main/vonLaszewski-heterogeneous-data-pipeline-2.pdf
Abstract: The surge in sensors, internet-linked devices, and social media has led to an unprecedented influx of data. Scientific data, sourced from various outlets, has grown more intricate, with diverse attributes, high dimensions, and complex interrelationships among variables. The process of managing, structuring, and preparing this data, which is essential for applying deep learning (the prevailing approach in large-scale data science), can become a bottleneck. Transferring data across systems for model training presents further challenges. These issues affect scientific domains such as genomics, climate modeling, accelerator physics, astronomy, and neuroscience. For instance, genomics generates over 200GB of data per genome sequencing, while climate simulations yield up to 10PB, necessitating more efficient data analysis techniques. One approach to addressing these challenges on modern high-performance computing (HPC) platforms is to integrate scalable runtime tools with data frameworks, in which parallel computing and distributed communication protocols play a crucial role.
Google Pathways offers a comparable distributed execution environment for its LLMs and deep learning models, though it remains proprietary. Establishing a common data processing pathway with a diverse pipeline presents technical hurdles. RADICAL-Pilot, a Python runtime engine, efficiently manages various workloads on HPC machines. Our prior work, Cylon, is a high-performance distributed-memory data-parallel library. Combining these components is intricate due to differences in system architectures, technologies, programming languages, and communities.
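To make the data-parallel model concrete, here is a minimal sketch of a local versus distributed join using Cylon's Python bindings (pycylon), following the style of the public pycylon examples; the random column values are purely illustrative.

```python
import random

from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig

# Two dataframes with random integer columns; column 0 is the join key.
df1 = DataFrame([random.sample(range(10, 100), 50),
                 random.sample(range(10, 100), 50)])
df2 = DataFrame([random.sample(range(10, 100), 50),
                 random.sample(range(10, 100), 50)])

# Local (single-process) join.
df_local = df1.join(other=df2, on=[0])
print(df_local)

# Distributed join: each MPI rank holds a partition of the data, and
# Cylon shuffles rows by key across ranks before joining.
env = CylonEnv(config=MPIConfig())
df_dist = df1.join(other=df2, on=[0], env=env)
print(df_dist)

env.finalize()
```

Launched under MPI (e.g., `mpirun -n 4 python join_example.py`), the same dataframe call becomes a distributed operation; without the `env` argument it remains a local join.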
Our goal is to create a seamlessly integrated approach deployable on clouds, supercomputers, and HPC platforms. This approach also supports heterogeneous federated distributed systems and accommodates accelerators such as GPUs alongside CPUs. Employing RADICAL-Pilot to encapsulate Cylon or other deep learning frameworks establishes a heterogeneous runtime environment capable of managing scalable compute- and data-intensive workloads, including islands of data-parallel (MPI) jobs. Pathways illustrates the need to support workflows that link deep learning components with the pre- and post-processing data engineering steps. The proposed design consists of multiple masters and thousands of workers with function-based task scheduling and resource allocation. The independent underlying jobs do not need to concern themselves with allocating resources or releasing them when a task finishes. This lets homogeneous tasks execute on a set of allocated nodes, lets multiple heterogeneous data pipelines run within the same allocation, and leverages hyperparameter parallelism alongside distributed operation. Optimizing heterogeneous systems with compiler-based technologies such as MLIR shows promise. Our approach aims to excel on both scientific and engineering research HPC systems, scaling up to exascale performance, while also performing robustly on the cloud infrastructures commonly used in commercial and distributed applications. This dual capability serves as a gateway to foster collaboration and innovation within the open-source scientific research community.
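As an illustration of the pilot-based design, the sketch below uses RADICAL-Pilot's Python API to acquire one resource allocation (the pilot) and schedule several MPI tasks into it without returning to the batch system. It assumes a recent RADICAL-Pilot release where `ranks` denotes MPI ranks per task; the resource label, core counts, and the `cylon_join.py` driver script are placeholders, not part of the poster artifacts.

```python
import radical.pilot as rp

# A session groups one pilot manager and one task manager.
session = rp.Session()
pmgr = rp.PilotManager(session=session)
tmgr = rp.TaskManager(session=session)

# Acquire a block of resources once ("the pilot"); tasks are then
# scheduled onto it directly, bypassing the batch queue per task.
pd = rp.PilotDescription({'resource': 'local.localhost',  # placeholder target
                          'runtime': 30,                  # minutes
                          'cores': 32})
pilot = pmgr.submit_pilots(pd)
tmgr.add_pilots(pilot)

# Each task is an island of data-parallel (MPI) work, e.g. a Cylon job.
tds = []
for _ in range(4):
    td = rp.TaskDescription()
    td.executable = 'python3'
    td.arguments = ['cylon_join.py']  # hypothetical Cylon driver script
    td.ranks = 8                      # MPI ranks per task
    tds.append(td)

tmgr.submit_tasks(tds)
tmgr.wait_tasks()
session.close()
```

In the multi-master design described above, several such managers can operate concurrently over partitions of one allocation, so heterogeneous pipelines share the same run.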