-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathintro.tex
222 lines (206 loc) · 12.1 KB
/
intro.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
\chapter{INTRODUCTION AND CONTRIBUTIONS}
\label{chp:intro}
\blfootnote{Portions of this chapter have been submitted to:
C.~W. Smith, B.~Granzow, G.~Diamond, D.~A. Ibanez, O.~Sahni, K.~E. Jansen
\emph{et~al.}, ``In-memory integration of existing software components for
parallel adaptive unstructured mesh workflows,'' submitted for publication.}
Unstructured mesh methods, like finite elements~\cite{hughes2012finite} or
finite volumes~\cite{abgrallFiniteVolume2016}, support the
effective analysis of complex physical behaviors modeled by partial differential
equations over general three-dimensional domains.
The most reliable and efficient methods apply adaptive procedures with \textit{a
posteriori} error estimators that indicate where and how the mesh is to be
modified.
Although adaptive meshes can have two to three orders of magnitude fewer
elements than a more uniform mesh for the same level of accuracy, there are many
complex simulations where the meshes required are so large that they can only be
solved on massively parallel systems.
The parallel simulations of interest are defined by a series of procedures which
we refer to as a workflow.
Fig.~\ref{fig:workflow} provides a high-level overview of this sequence.
Starting at the top-left the problem definition is specified on the
computational domain, typically a CAD model.
Next, automated mesh generation procedures driven by the problem definition and
optional mesh controls produce a spatial discretization of the computational
domain, a mesh.
With the mesh, the associated CAD model, and the problem definition, an analysis
is executed until some criterion is met that indicates adaptation is required.
In a workflow using mesh adaptation procedures this criterion is most effective
when it is based on the discretization error~\cite{hughes2012finite}.
Following the analysis, the adaptation procedure is executed to reduce the error
by locally refining and coarsening the mesh~\cite{li20053d,alauzet2006parallel}.
During these mesh modification operations local transfer
procedures~\cite{ibanezthesis} are executed to maintain an accurate
distribution of physical quantities of interest (e.g., the velocity of a fluid
or the displacement of a solid).
Once an adapted mesh has been created, the analysis procedure is executed again.
This solve-adapt cycle is repeated until a stopping criterion is met, such as a
pre-defined number of time/load steps.
After the last cycle completes, post-processing tools are executed to extract
spatial and temporal characteristics of the physical quantities of interest.
\begin{figure} \centering
\includegraphics[width=.8\textwidth]{presentation/figs/SimulationBasedEngineeringWorkflow2.eps}
\caption{
The series of steps in an adaptive unstructured mesh-based workflow and
(bottom) a sequence of adapted meshes and (top) the corresponding solution
fields for an adaptive manifold flow
simulation~\cite{ovcharenko2013parallel}.
}
\label{fig:workflow}
\end{figure}
The time spent running a parallel workflow is dominated by the repeated
execution of PDE analysis and adaptation procedures.
Efficiency of the analysis is maintained by redistributing the mesh after
adaptation procedures have non-uniformly refined and coarsened it.
Given an existing mesh distribution across the processes running on a parallel system,
a partition, dynamic load balancing methods determine which mesh elements should
be moved between processes to reduce the imbalance and communication costs.
For part counts numbering in the tens of thousands, multi-level methods
operating on the dual graph of the mesh are sufficient~\cite{karypis1999parallel},
but beyond this concurrency level these methods often fail due to memory usage
that increases significantly with process count~\cite{harlacherMortonSFCvsParmetis2012}.
The goal of \textbf{Par}titioning using \textbf{M}esh \textbf{A}djacencies (ParMA)
is to perform efficient, multi-criteria, dynamic load balancing of
unstructured meshes by directly using the existing mesh adjacency information.
Results will demonstrate the ability of ParMA to dynamically re-balance meshes
for multiple criteria with billions of mesh regions on over one million
processors.
Thus, ParMA addresses a key limitation to workflow scalability by enabling
efficient transformation of data within the computationally dominant analysis
procedures.
Within the solve-adapt cycle are transfers of mesh and field data between
analysis and adaptation procedures (depicted by the pair of bold arrows in
Fig.~\ref{fig:workflow}).
In massively parallel simulations requiring frequent adaptation, the reading and
writing of files between the execution of analysis and adaptation procedures
introduces significant computational overheads.
To avoid shared and contentious parallel filesystems, our in-memory coupling
approaches using data streams and APIs provide scalable and efficient transfers
of data by using node-local memory.
Thus, in-memory transfer approaches address a second key limitation to parallel
workflow scalability.
\section{Contributions}
The work of this thesis develops load balancing and in-memory workflow construction
methods for conformal 2D and 3D unstructured meshes composed of quadrilaterals,
triangles, tetrahedra, prisms, hexahedra, and pyramids.
A conformal mesh is one in which the intersection of two elements is a lower
order mesh entity shared by both (i.e., a face for two regions, an edge for two faces,
and a vertex for two edges).
\subsection{Dynamic Partition Improvement}
ParMA provides dynamic load balancing methods that compliment existing graph and
geometric partitioners to create mesh partitions with billions of elements
on millions of processors that are tailored to the needs of a given application.
Partition improvement results at various per part element counts on over one
million processors are discussed.
\subimport{parmaimprovement/}{parmaContributions}
\subsection{In-memory Component Coupling}
Critical to the construction of parallel workflows is the ability to couple existing
pieces of software.
We define bulk and atomic level couplings implemented using API- and data
stream-based approaches.
These in-memory couplings are applied to a monolithic, mixed C/C++ and FORTRAN,
computational fluid dynamics (CFD) analysis code, a C++, hp-adaptive, finite
element framework for linear accelerator frequency analysis, and a C++
multi-physics framework.
Scaling results up to 16,384 processes are provided for the coupling of the
massively parallel PHASTA CFD analysis code with mesh adaptation, and the memory
overhead for the linear accelerator framework when coupled with mesh adaptation
is studied.
\subsection{Parallel Workflows for Industrial Applications}
In addition to the three in-memory workflows discussed above, we also construct
parallel workflows for three industrial applications.
The first workflow demonstrates an adaptive, multi-phase flow simulation using an
in-memory coupling to a closed-source, serial procedure provided by the
industrial partner.
The second and third workflows focus on reducing the time engineers and analysts
have to spend setting up problems and running jobs by applying automation and
abstraction techniques.
\section{Thesis Organization}
The thesis is organized into the following chapters.
\begin{itemize}
\item Chapter~\ref{chp:ptnImprovement} details the ParMA load balancing
software and its support for extreme scale workflows.
\item Chapter~\ref{chp:workflow} discusses the construction of parallel,
unstructured mesh-based, in-memory workflows and the components that define
them.
\item Chapter~\ref{chp:inmem} describes the in-memory coupling for
three unstructured mesh-based workflows and examines their performance
relative to file-based approaches on massively parallel systems.
\item Chapter~\ref{chp:hpcnyWorkflows} discusses three industrial workflows:
one using in-memory coupling techniques, and two others using automation
technologies.
\item Chapter~\ref{chp:conclusion} summarizes the work and discusses some
possible future efforts.
\end{itemize}
\section{Terminology and Notation}
\begin{tabular}{l|p{10cm}}
2D,3D & two- and three-dimensional.\\
petascale,exascale & a computer system capable of executing $10^{15}$
and $10^{18}$ floating-point operations
per second, respectively.\\
CFD & computational fluid dynamics.\\
CAE & computer-aided engineering.\\
CAD & computer-aided design/drafting.\\
ISV & independent software vendor; typically a for-profit
organization with a closed-source product.\\
I/O & input and output.\\
API & application program interface.\\
CPU & central processing unit; typically a socketed device
on the motherboard with multiple, independent, out-of-order
processing units.\\
(CPU) core & a processing unit within a CPU.\\
GPU & graphical processing unit; typically a bus-attached
device with thousands of group-synchronized, simple,
in-order processing units.\\
$O(1)$ & an operation that executes in constant time.\\
workflow & the sequence of steps to set up and execute a
simulation.\\
(mesh) part & a set of mesh elements and their closure assigned
to a given process.\\
(mesh) partition & the set of parts forming a distributed mesh.\\
Ki & suffix to denote $2^{10}$. So, for example,
16Ki is equal to $16*2^{10} = 16384$.\\
$l.N$ & denotes the $N$th line of an Algorithm or
Listing.\\
\texttt{code} & C/C++/Fortran is written with a fixed-pitch font.
For example, the function \texttt{printf(...)} is
declared in \texttt{stdio.h}; the ellipsis
represent omitted arguments.\\
\end{tabular}
\newpage
The following notation describes the topological entities of geometric models,
meshes, and partitions, the associations between the topological entities, and
the distribution of mesh entities in a partitioned mesh~\cite{ibanez2016pumi}.
\\
\begin{tabular}{l|p{10cm}}
$V$ & the model, $V$ $\in$ \{$G$, $P$, $M$\}, where
$G$ is the geometric model, $P$
is the partition model, and $M$
is the mesh.
\\
$\{V^d\}$ & the set of dimension $d$ entities in model $V$.\\
$V^d_i$ & the $i^{th}$ entity of dimension $d$ in model
$V$.
$d=0$ for vertex, $d=1$ for edge, $d=2$ for
face, and $d=3$ for region.\\
\{$M^d_i$\{$M^q$\}\} & a set of mesh entities of dimension $q$ that
are adjacent to $M^d_i$. For instance,
\{$M^1_3$\{$M^3$\}\} is a set of mesh regions
adjacent to mesh edge $M^1_3$. \\
$M^{d}_i \sqsubset G^{q}_j$ & the geometric classification indicating the
unique association of mesh entity $M^{d}_i$
with geometric model entity
$G^{q}_j$, $d \le q$.\\
$M^{d}_i \sqsubset P^{q}_j$ & the partition classification indicating the
unique association of mesh entity $M^{d}_i$
with partition model entity
$P^{q}_j$, $d \le q$.\\
$w_p(M^d)$ & the weight of mesh entities $M^d$ on part $p$.\\
$I^d_p$ & the weighted imbalance of dimension $d$ mesh entities
on part $p$;
$w_p(M^d_i) / avg(w_{q=0..N-1}(M^d_i))$
where $N$ is the total number of parts in the
mesh.\\
$I^d$ & the maximum imbalance of entity dimension $d$;
$max(I^d_{p=0..N-1})$.
\end{tabular}