\chapter{Computing Roles and Collaborative Projects}
\label{appx:es-comp}
\section{Roles}
\label{appx:comp-roles}
This appendix lists computing roles for \dshort{dune} derived from a comparison with existing similar roles on the \dword{lhcb} experiment at \dshort{cern}. \dword{lhcb} is similar in size and data volumes to \dword{dune}.
\begin{description}
\item {Distributed Computing Development and Maintenance - 5.0 FTE}
This role includes oversight of all software engineering and development activities for packages needed to operate on distributed computing resources. The role requires a good understanding of the distributed computing infrastructure used by \dword{dune} as well as the \dword{dune} computing model.
\item {Software and Computing Infrastructure Development and Maintenance - 6.0 FTE}
This role includes software engineering, development, and maintenance for central services operated by \dword{dune} to support software and computing activities of the project.
\item {Database Design and Maintenance - 0.5 FTE}
This role includes designing, maintaining, and scaling databases for
tasks within \dword{dune}.
\item {Data Preservation Development - 0.5 FTE}
This role includes activities related to reproducibility of analysis as well as data preservation, which requires expert knowledge of analysis and the computing model.
\item {Application Managers and Librarians - 2.0 FTE}
Application managers handle software applications for data processing, simulation, and analysis, and also coordinate activities in the areas of development, release preparation, and deployment of software package releases needed by \dword{dune}. Librarians organize the overall setup of software packages needed for releases.
\item {Central Services Manager and Operators - 1.5 FTE}
The central services manager and operators are responsible for the central infrastructure and services of \dword{dune} distributed computing. This includes coordination with the host laboratory for services provided to \dword{dune}.
\item {Distributed Production Manager - 0.5 FTE}
Distributed production managers are responsible for the setup, launch, monitoring, and completion of processing campaigns executed on distributed computing resources for the experiment. These campaigns include data processing, \dword{mc} simulation, and working group productions.
\item {Distributed Data Manager - 0.5 FTE}
The distributed data manager is responsible for operational interactions with distributed computing disk and tape resources. The role includes but is not limited to helping to establish new storage areas and data replication, deletion, and movement.
\item {Distributed Workload Manager - 0.5 FTE}
The distributed workload manager is responsible for operational interactions with distributed computing resources. The role includes activities such as helping to establish grid and cloud sites.
\item {Computing Shift Leaders - 1.4 FTE}
The shift leader is responsible for the experiment's distributed computing operations for a week-long period running from Monday through the following Sunday; the 1.4 FTE allocation reflects the weekend coverage. Shift leaders chair regular operations meetings during their week and attend general \dword{dune} operations meetings as appropriate.
\item {Distributed Computing Resource Contacts - 0.5 FTE}
Distributed computing resource contacts serve as the primary liaison between the \dword{dune} distributed computing operations team and the operators of large (Tier-1) sites and regional federations. They interact directly with the computing shift leaders at operations meetings.
\item {User Support - 1.0 FTE}
User support (software infrastructure, applications, and distributed computing) underpins all user activities of the \dword{dune} computing project.
User support personnel respond to questions from users on mailing lists, Slack-style chat systems, and/or ticketing systems, and are responsible for documenting solutions in knowledge bases and wikis.
\item {Resource Board Chair - 0.1 FTE}
This role is responsible for chairing quarterly meetings of the Computing Resource Board, which includes representatives from the various national funding agencies that support \dword{dune}, to discuss funding for and delivery of the computing resources required for successful processing and exploitation of \dword{dune} data.
\item {Computing Coordination - 2.0 FTE}
Coordinators oversee management of the computing project.
\end{description}
%%%%%%%%%%%%%%%%%%%%
\section{Specific Collaborative Computing Projects}
\label{ch:exec-comp-gov-coop}
The \dword{hep} computing community has come together to form the \dword{hsf}~\cite{Alves:2017she}, which, through working groups, workshops, and white papers, guides the next generation of shared \dword{hep} software.
The \dword{dune} experiment's time scale, particularly the planning and evaluation phase, is almost ideal for allowing the \dword{hsf} to develop effective contributions. Our overall strategy for computing infrastructure is to carefully evaluate existing and proposed field-wide solutions, to participate in useful designs, and to build our own solutions only where common solutions do not fit and additional joint development is not feasible. This section describes some of these common activities.
\subsection{LArSoft for Event Reconstruction}
Several neutrino experiments using the \dword{lartpc} technology share the \dword{larsoft}~\cite{Snider:2017wjd} reconstruction package. \dword{microboone}, \dword{sbnd}, \dword{dune}, and others share in developing a common core software framework that can be customized for each experiment. This software suite and earlier efforts in other experiments made the rapid reconstruction of the \dword{pdsp} data possible. \dword{dune} will contribute heavily to the future evolution of this package, in particular, by introducing full multi-threading to allow parallel reconstruction of parts of large events, thus anticipating the very large events expected from the full detector.
\subsection{WLCG/OSG and the HEP Software Foundation}
The \dword{wlcg} organization~\cite{Bird:2014ctt}, which combines the resource and infrastructure missions of the \dword{lhc} experiments, has proposed a governance structure called \dword{sci} that splits out dedicated resources for \dword{lhc} experiments from the general middleware infrastructure used to access those resources. In a white paper submitted to the European Strategy Group in December 2018~\cite{bib:BirdEUStrategy}, a formal \dword{sci} organization is proposed. Many other experiments worldwide are already using this structure. As part of the formal transition to \dword{sci}, the \dword{dune} collaboration was provisionally invited to join the \dword{wlcg} management board as an observer and to participate in the Grid Deployment Board and task forces. Our participation allows us to contribute to the technical decisions on global computing infrastructure while also contributing to that infrastructure. Many of these contributions involve the broader \dword{hep} Software Foundation efforts.
Areas of collaboration are described in the following sections.
\subsubsection{Rucio Development and Extension}
\dword{rucio}~\cite{Barisits:2019fyl} is a data management system originally developed by the \dword{atlas} collaboration and is now an open-source project. \dword{dune} has chosen to use \dword{rucio} for large-scale data movement. Over the short term, it is combined with the \dword{sam} data catalog used by \dword{fnal} experiments. \dword{dune} collaborators at \dword{fnal} and in the UK are actively collaborating on the \dword{rucio} project, adding value for both \dword{dune} and the wider community.
Besides \dword{dune}, the global \dword{rucio} team now includes \dword{fnal} and \dword{bnl} staff, \dword{cms} collaborators, and the core \dword{atlas} developers who initially wrote \dword{rucio}. \dword{dune} \dword{csc} members have begun collaborating on several projects: (1) making object stores (such as Amazon S3 and compatible utilities) work with \dword{rucio} (a large object store in the UK exists for which \dword{dune} has a sizable allocation); (2) monitoring and administering the \dword{rucio} system, leveraging the Landscape system at \dword{fnal}; and (3) designing a data description engine that can be used to replace the \dword{sam} system we now use.
\dword{rucio} has already proved to be a powerful and useful tool for moving defined datasets from point A to point B. It appears to offer a good solution for file localization, but it lacks the detailed tools for data description and granular dataset definition available in the \dword{sam} system. The rapidly varying conditions in the test beam have shown that we need a sophisticated data description database interfaced to \dword{rucio}'s location functions.
Efficient integration of caching into the \dword{rucio} model will be an important component for \dword{dune} unless we can afford to have most data on disk to avoid staging. The dCache model, a caching front end for a tape store, is used in most \dword{fnal} experiments. In contrast, \dword{lhc} experiments such as \dword{atlas} and \dword{cms} work with disk storage and tape storage that are independent of each other.
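As an illustration of the rule-based replication that \dword{rucio} provides, the sketch below uses the standard \dword{rucio} Python client to group files into a dataset and request replicas of it at a class of sites. It is a minimal sketch only: a configured and authenticated client is assumed, and the scope, dataset, file, and RSE-expression names are placeholders rather than actual \dword{dune} conventions.
\begin{verbatim}
from rucio.client import Client   # standard Rucio Python client

client = Client()  # assumes a configured rucio.cfg and valid credentials

# Group already-registered files into a named dataset
# (scope, dataset, and file names are placeholders)
client.add_dataset(scope='protodune-sp', name='np04_raw_run5387')
client.attach_dids(scope='protodune-sp', name='np04_raw_run5387',
                   dids=[{'scope': 'protodune-sp',
                          'name': 'np04_raw_run5387_0001.root'}])

# Ask Rucio to maintain two replicas of the dataset at sites matching
# an RSE (storage element) expression; Rucio schedules the transfers.
client.add_replication_rule(
    dids=[{'scope': 'protodune-sp', 'name': 'np04_raw_run5387'}],
    copies=2,
    rse_expression='tier=1')
\end{verbatim}
Note that such a rule says nothing about run conditions, beam configuration, or data quality; that is the kind of information the \dword{sam}-style data description layer must continue to provide.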
\subsubsection{Testing New Storage Technologies and Interfaces}
The larger \dword{hep} community~\cite{Berzano:2018xaa} currently has a \dword{doma} task force in which several \dword{dune} collaborators participate. It includes groups working on authorization, caching, third-party copy, hierarchical storage, and quality of service. All are of interest to \dword{dune} because they will determine the long-term standards for common computing infrastructure in the field. Authorization is of particular interest; it is covered in Section~\ref{ch-comp-auth}.
\subsubsection{Data Management and Retention Policy Development}
A data life cycle is built into the \dword{dune} data model: obsolete samples (old simulations, histograms, and old commissioning data) need not be maintained indefinitely.
We are organizing the structure of the lower storage tiers so that the various retention types are stored separately, allowing easy deletion when necessary.
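One way such retention types could be expressed, assuming the same \dword{rucio} client as above, is through rule lifetimes: short-lived samples carry an expiring replication rule, while data to be kept indefinitely are protected by rules with no lifetime. The sketch below is illustrative only; the dataset and storage element names are placeholders.
\begin{verbatim}
from rucio.client import Client

client = Client()  # assumes a configured and authenticated client

# Working-group output: a single replica whose rule expires after
# ~90 days, after which the replica becomes eligible for deletion.
client.add_replication_rule(
    dids=[{'scope': 'protodune-sp', 'name': 'wg_pid_study_v3'}],
    copies=1,
    rse_expression='SCRATCH_DISK',
    lifetime=90 * 24 * 3600)  # lifetime in seconds

# Raw data: two replicas with no lifetime, i.e., kept until the
# rule is removed deliberately.
client.add_replication_rule(
    dids=[{'scope': 'protodune-sp', 'name': 'np04_raw_run5387'}],
    copies=2,
    rse_expression='tier=1')
\end{verbatim}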
\subsubsection{Authentication and Authorization Security and Interoperability}\label{ch-comp-auth}
Within the next few years, we expect the global \dword{hep} community to significantly change its methods of authentication and authorization for computing and storage.
Over that period, \dword{dune} must collaborate with the USA and European \dword{hep} computing communities on improved authentication methods that will allow secure but transparent access to storage and other resources such as logbooks and code repositories. The current model requires individuals to be authenticated through different mechanisms for access to USA and European resources.
Current efforts to expand the trust realm between \dword{cern} and \dword{fnal} should allow a single sign-on for access to the two laboratories.
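One direction under discussion in the community is token-based authorization, in which a short-lived bearer token, rather than a personal grid certificate, carries the user's rights. The sketch below shows what token-based access to a storage endpoint could look like; the endpoint URL and token location are hypothetical.
\begin{verbatim}
import requests  # generic HTTP client, used here for illustration

# Read a short-lived bearer token obtained from a single sign-on
# service (the path is a placeholder).
with open('/tmp/dune_bearer_token') as f:
    token = f.read().strip()

# The token itself carries the authorization; no certificate or
# site-specific account is needed (the URL is hypothetical).
response = requests.get(
    'https://storage.example.org/dune/np04/raw/run5387_0001.root',
    headers={'Authorization': 'Bearer ' + token})
response.raise_for_status()
\end{verbatim}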
\subsection{Evaluations of Other Important Infrastructure}
The \dword{dune} \dword{csc} is still evaluating some major infrastructure components, notably databases and workflow management systems.
The \dword{fnal} \textit{conditions} database is being used for the first run of \dword{protodune}, but the Belle II~\cite{Ritter:2018jxh} system supported by \dword{bnl} is being considered for subsequent runs~\cite{Laycock:2019ynk}.
We are evaluating \dword{dirac}~\cite{Falabella:2016waj} as a workflow management tool and plan to investigate PanDA~\cite{Megino:2017ywl} as well, to compare against the current GlideinWMS, HTCondor, and POMS solution that was used successfully for the 2018 \dword{protodune} campaigns.
Both \dword{dirac} and PanDA are already being integrated with \dword{rucio} and are used by several \dword{lhc} and non-\dword{lhc} experiments.
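For illustration, the sketch below shows the kind of job description that \dword{dirac} accepts through its Python API; the executable, file names, and job name are placeholders, and the initialization details vary between \dword{dirac} releases.
\begin{verbatim}
# Classic DIRAC client initialization (details differ by release)
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

# Describe a single reconstruction job (names are placeholders)
job = Job()
job.setName('np04_reco_test')
job.setInputSandbox(['reco.sh'])
job.setExecutable('reco.sh', arguments='np04_raw_run5387_0001.root')

# Submit to the workload management system and print the result,
# which includes the assigned job identifier.
result = Dirac().submitJob(job)
print(result)
\end{verbatim}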