colorSchema | favicon | color | layout | routerMode | title | theme | neversink_string |
---|---|---|---|---|---|---|---|
light | /public/images/diracx-logo-square.png | orange-light | cover | hash | The neXt Dirac incarnation | neversink | DiracX CHEP |
Federico Stagni
October 23rd 2024, CHEP 2024
This is the story of why and how we decided to take a successful project and rewrite its code from scratch
layout: iframe-left title: DIRAC url: https://dirac.readthedocs.io/en/latest/ class: DIRAC slide_info: false color: gray-light
:: title ::
:: content ::
A few real-life examples, also reported at this conference:
- LHCb stores the metadata and provenance of every produced file in a LHCb-specific database (with an Oracle backend)
- --> see talk in Track 3 on Monday and poster #461 on LbMCSubmit
- Belle2 is a HEP experiment that uses Rucio as its data management solution.
- CTAO has radically different requirements (compared to HEP experiments) on how to process its data.
- HERD is an astronomy and particle astrophysics experiment using dHTC for data management.
- EGI uses DIRAC as WMS, and EGI-CheckIn as an identity provider. It hosts, among others, WeNMR (structural biology and life science).
:: title ::
:: content ::
%%{init: {'theme': 'base', 'timeline': {'disableMulticolor': true}}}%%
timeline
section LHCb software
around 2000 : MC production system: bash scripts running at production sites
2002 : DIRAC2 <br> Rewritten in Python, using xml-rpc, interfacing to EDG
Data Challenge 04 : First successful grid usage ever.
: First use of pilot jobs based WMS
2006-2007 : DIRAC3<br> Full rewriting, development of the DISET protocol -- still in use today!
: the current DIRAC framework is still based on this work
section Open sourced, wider adoption
2008 : Large-ish reshuffling to become multi-VO
: LHCbDIRAC extension separated from core DIRAC code
2009 : CLIC community adopts DIRAC
2011 : France-Grilles is the first multi-VO DIRAC installation
2012 : Belle2, BES3, CTA adopt DIRAC
2021 : Python3 full support
:: title ::
- Pull model based on Pilot jobs
- Also "Push" solution for HPCs that do not support pilots (because of limited internet access).
- Will integrate CWL (Common Workflow Language) as a way of defining jobs (replacing JDL) --> see poster #217
:: content ::
%%{init: { 'theme': 'default' }}%%
flowchart LR;
Jobs["`Users see only **Jobs**`"]
A@{ shape: sl-rect, label: "APIs" }
WMS[("`**Workload
Management
System**`")]
style WMS fill:#bbf
HPC["`High
Performance
Computers`"]
style HPC fill:#A145
clusters["`Computer clusters`"]
style clusters fill:#A145
Grid_Nodes["Grid"]
Pilots["`**Pilots**
administer computing slots, and match (pull) jobs`"]
style HTCondorCE fill:#F23A
style ARC-AREX fill:#F23A
style libcloud fill:#F23A
style SSH fill:#F23A
style Grid_Nodes fill:#A145
style Iaas:Clouds fill:#A145
style HTCondor fill:#F26
style SLURM fill:#F26
style Jobs fill:#FFF
style Pilots fill:#FFF
A-->|jobs|WMS
WMS-->|pilots|libcloud
WMS-->|pilots|HTCondorCE
WMS-->|pilots|ARC-AREX
WMS-. jobs .->HPC
WMS-->|pilots|SSH
libcloud-->|VMs starting pilots|Iaas:Clouds
HTCondorCE-->Grid_Nodes
ARC-AREX-->Grid_Nodes
ARC-AREX-->HPC
SSH-->|pilots|SLURM
SSH-->|pilots|HTCondor
SSH-->|pilots|clusters
SLURM-->HPC
SLURM-->clusters
HTCondor-->clusters
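The pull model above can be illustrated with a toy matcher. This is only a minimal sketch and not DIRAC code: `Job`, `Pilot`, `TaskQueue` and the tag-based matching rule are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    required_tags: frozenset = frozenset()  # hypothetical requirements, e.g. {"gpu"}


@dataclass
class Pilot:
    site: str
    tags: frozenset = frozenset()  # what the computing slot offers


@dataclass
class TaskQueue:
    """Toy stand-in for the WMS: jobs wait here until a pilot asks for one."""
    waiting: list = field(default_factory=list)

    def submit(self, job: Job) -> None:
        self.waiting.append(job)

    def match(self, pilot: Pilot):
        """Called by a pilot (pull): return the first job the slot can run."""
        for job in self.waiting:
            if job.required_tags <= pilot.tags:
                self.waiting.remove(job)
                return job
        return None  # nothing suitable: the pilot would simply exit


queue = TaskQueue()
queue.submit(Job(1, frozenset({"gpu"})))
queue.submit(Job(2))

cpu_pilot = Pilot("SiteA")
matched = queue.match(cpu_pilot)
print(matched.job_id)  # the CPU-only slot skips job 1 and takes job 2
```

The point of the design is that the job is bound to a resource only once a pilot is already running on it, so a job never waits in a site queue it cannot use.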
:: title ::
It’s about files: placing, replicating, removing files
- there are LFNs (logical file names)
- LFNs are registered in catalog(s)
- where are the LFNs? (in the DIRAC File Catalog (DFC), or in Rucio)
- where are their metadata? (in the DFC, or in the LHCb Bookkeeping, or in AMGA)
- LFNs may have PFNs (physical file names), stored in SEs (Storage Elements), that can be accessed with several protocols.
:: content ::
%%{init: { 'theme': 'default' }}%%
flowchart LR;
A@{ shape: sl-rect, label: "APIs" }
DMS[("`**Data
Management
System**`")]
style DMS fill:#bbf
FC[["`**Catalog**`"]]
style FC fill:#bbf
StorageBase[["`**Storage Base**`"]]
style StorageBase fill:#bbf
DFC[("`DIRAC
File
Catalog`")]
Rucio[("Rucio")]
style Rucio fill:#6001
TS[("`DIRAC
Transformation
System`")]
WebDav@{ shape: lin-cyl, label: "WebDav (http)" }
XRootD@{ shape: lin-cyl, label: "XRootD" }
style WebDav fill:#F23A
style XRootD fill:#F23A
A-->DMS
DMS-->FC
DMS-->StorageBase
FC-->DFC
FC-->Rucio
FC-->TS
StorageBase-->WebDav
StorageBase-->XRootD
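The relation between LFNs, catalogs, SEs and PFNs can be sketched with a small in-memory structure. This is an illustration only: the `FileCatalog` class, the SE names and the URLs are hypothetical and do not reflect the DFC or Rucio APIs.

```python
from dataclasses import dataclass, field


@dataclass
class FileCatalog:
    """Toy catalog: maps each LFN to the SEs holding a replica, and the PFN there."""
    replicas: dict = field(default_factory=dict)

    def register(self, lfn: str, se: str, pfn: str) -> None:
        self.replicas.setdefault(lfn, {})[se] = pfn

    def remove_replica(self, lfn: str, se: str) -> None:
        self.replicas.get(lfn, {}).pop(se, None)

    def get_replicas(self, lfn: str) -> dict:
        return dict(self.replicas.get(lfn, {}))


catalog = FileCatalog()
lfn = "/myvo/user/data/file.root"  # hypothetical LFN

# One logical file, two physical copies reachable with different protocols
catalog.register(lfn, "SE-A", "root://se-a.example.org//store/myvo/user/data/file.root")
catalog.register(lfn, "SE-B", "https://se-b.example.org/store/myvo/user/data/file.root")

catalog.remove_replica(lfn, "SE-A")
print(sorted(catalog.get_replicas(lfn)))  # only the SE-B replica is left
```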
:: title ::
- A Data Processing transformation (e.g. Simulation, Merge, DataReconstruction...) creates jobs in the WMS (and re-submits them if needed, eventually destroys them).
- A Data Manipulation transformation replicates, or removes, data from storage elements.
:: content ::
The Transformation System is used to automate common tasks related to production activities. It can handle thousands of productions, millions of files and jobs.
%%{init: { 'theme': 'default' }}%%
flowchart LR;
TS[("`**Transformation
System**`")]
style TS fill:#bbf
WMS[("`**Workload
Management
System**`")]
style WMS fill:#bba
RMS[("`**Request
Management
System**`")]
style RMS fill:#bba
DMS[("`**Data
Management
System**`")]
style DMS fill:#bba
PM@{ shape: sl-rect, label: "Productions Management" }
DM@{ shape: sl-rect, label: "DataSets Management" }
PM-->|Productions Definitions|TS
DM-->|DataSets Operations|TS
TS-->|Jobs|WMS
TS-->|Data Operations|RMS
RMS-->DMS
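As a rough illustration of what a transformation does, the sketch below groups input files into tasks and tags each task with the system that would receive it. The `Transformation` class, its `group_size` parameter and the file names are assumptions made for this example, not the actual Transformation System interface.

```python
from dataclasses import dataclass


@dataclass
class Transformation:
    name: str
    kind: str        # "processing" -> jobs for the WMS, "manipulation" -> operations for the RMS
    group_size: int  # hypothetical: number of input files per task

    def make_tasks(self, lfns):
        """Split the input files into tasks of at most group_size files."""
        target = "WMS" if self.kind == "processing" else "RMS"
        return [
            {
                "transformation": self.name,
                "target": target,
                "inputs": lfns[i:i + self.group_size],
            }
            for i in range(0, len(lfns), self.group_size)
        ]


files = [f"/myvo/prod/file_{i}.dst" for i in range(5)]
merge = Transformation("Merge", "processing", group_size=2)
tasks = merge.make_tasks(files)
print(len(tasks), tasks[-1]["inputs"])  # 3 tasks, the last one holds the leftover file
```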
:: title ::
:: left ::
DIRAC also provides a WebApp:
:: right ::
Dashboards can be created within the DIRAC Web App:
and/or in Grafana:
:: title ::
:: content ::
- DIRAC is written in Python 3 (the Pilot still supports Python 2.7)
- Services are exposed at URLs like `dips://box.some.where:9132/WorkloadManagement/`, where `dips` stands for "DIRAC protocol"
- The DIRAC framework also provides "Agents" (~ cron jobs) and "Executors" (~ task execution) to animate the system
- As backends, and are supported (for different purposes)
- The Web App is implemented using ExtJS, and fully custom Python "bindings"
- For its internal AuthN/Z, DIRAC understands certificates and proxies
- VOMS (Virtual Organization Membership Service) is effectively a hard DIRAC dependency
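To make the structure of a service URL concrete, here is a small sketch that splits the example URL with the Python standard library. Reading the path component as the name of the DIRAC system is an interpretation of the example, not a statement about how DIRAC parses these URLs internally.

```python
from urllib.parse import urlsplit

url = "dips://box.some.where:9132/WorkloadManagement/"
parts = urlsplit(url)

# Assumption: the path names the DIRAC system the service belongs to
system = parts.path.strip("/")

print(parts.scheme, parts.hostname, parts.port, system)
```

The custom `dips` scheme is exactly why a dedicated DIRAC client is needed: generic HTTP tooling does not speak it.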
What is the best way to keep up with these trends? Can we do it within the current framework?
:: title ::
:: left ::
You authenticate with an external "Identity provider":
For authorization purposes you are using tokens everywhere:
:: right ::
(Nicely documented) REST APIs are a de-facto standard:
```bash
# "get a tag" from GitHub
curl -L \
  -H "Accept: application/vnd.github+json" \
  -H "Authorization: Bearer <YOUR-TOKEN>" \
  -H "X-GitHub-Api-Version: 2022-11-28" \
  https://api.github.com/repos/OWNER/REPO/git/tags/TAG_SHA
```
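The same bearer-token request can be prepared from Python with the standard library alone. This is a sketch: `build_tag_request` is a hypothetical helper, the placeholders are kept as in the curl example, and the request is built but deliberately not sent.

```python
import urllib.request


def build_tag_request(owner: str, repo: str, tag_sha: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) the authenticated GET request."""
    url = f"https://api.github.com/repos/{owner}/{repo}/git/tags/{tag_sha}"
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",  # the token is all the server needs to authorize us
            "X-GitHub-Api-Version": "2022-11-28",
        },
        method="GET",
    )


request = build_tag_request("OWNER", "REPO", "TAG_SHA", "<YOUR-TOKEN>")
print(request.get_method(), request.full_url)

# Sending it needs network access and a real token:
#   with urllib.request.urlopen(request) as response:
#       print(response.read())
```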
What is the best way to implement these recommendations? Can we do it within the current framework?
:: title ::
:: content ::
- VOMS (Virtual Organization Membership Service) has been, for many years, a de-facto standard for community management
- it issues VOMS proxies ("short" certificates)
- Outside of WLCG and EGI, proxies are not a thing
- --> There are new Identity Providers delivering tokens instead of proxies
In this conference:
- WLCG transition from X.509 to Tokens: Progress and Outlook
- CMS Token Transition
- Fermilab's Transition to Token Authentication
Easy to test (which will make it easier to code), but also modern, fun, and accessible to new developers.
We need to ensure business continuity.
:: title ::
:: left ::
Ease of use, including ease of access
Fast and responsive interfaces
Scalable and flexible
:: right ::
Ease of installation and update
Up-to-date documentation
Clear configuration
Ready-to-use dashboards
:: title ::
:: content ::
- complex, with a high entry barrier
- somewhat cumbersome deployment
- late on “standards”
- No http services
- No tokens
- Old monitoring
- "old"-ish design (RPC, "cron" agents...)
- not very developer-friendly: rather un-appealing/confusing, especially for new (and young) developers
- multi-VO, but not designed to be so from the beginning
- a custom interface is needed to interact with a running DIRAC instance
- meaning that you need to install a DIRAC client for interacting with DIRAC
:: title ::
:: content ::
- A cloud native app
- Multi-VO from the get-go
- Standards-based
layout: iframe-right title: Web API url: https://diracx-cert.app.cern.ch/api/docs class: webAPI slide_info: false color: gray-light align: lm
DIRAC Web APIs with
-
Nicely documented by
- --> this is what you see on the right
- Follows the specification, with the (Python) client generated by AutoREST.
:: title ::
:: content ::
- Logging in (using the diracx cli):
```bash
❯ dirac login gridpp
Logging in with scopes: ['vo:gridpp']
Now go to: https://diracx-cert.app.cern.ch/api/auth/device?user_code=XYZXYZXYZ
...Saved credentials to /home/fstagni/.cache/diracx/credentials.json
Login successful!
```
- Submitting a job (using Python requests):
```python
import requests

# `jdl` holds the job description(s) to submit and is defined elsewhere
requests.post(
    "https://diracx-cert.app.cern.ch/api/jobs/",
    headers={
        "accept": "application/json",
        "Authorization": "Bearer eyJhbG...",
        "Content-Type": "application/json",
    },
    json=jdl,
)
```
- Getting its status (using curl):
```bash
curl -X 'GET' \
'https://diracx-cert.app.cern.ch/api/jobs/status?job_ids=8971' \
-H 'accept: application/json' \
-H 'Authorization: Bearer eyJhbG...' | jq
```
```json
{
"8971": {
"Status": "Done",
"MinorStatus": "Execution Complete",
"ApplicationStatus": "Unknown"
}
}
```
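A client would typically parse that JSON and decide whether the job has reached a final state. The sketch below works on the response shown above; the `FINAL_STATES` set and the `summarise` helper are assumptions made for the example, and the authoritative list of states lives in DIRAC/DiracX.

```python
import json

# Response body copied from the status example above
body = """
{
  "8971": {
    "Status": "Done",
    "MinorStatus": "Execution Complete",
    "ApplicationStatus": "Unknown"
  }
}
"""

# Assumption: states treated as final for this illustration
FINAL_STATES = {"Done", "Failed", "Killed"}


def summarise(raw: str) -> dict:
    """Map each job ID to (status, is_final)."""
    statuses = json.loads(raw)
    return {
        int(job_id): (info["Status"], info["Status"] in FINAL_STATES)
        for job_id, info in statuses.items()
    }


print(summarise(body))
```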
layout: iframe-left title: WebApp url: https://diracx-cert.app.cern.ch class: webapp slide_info: false color: gray-light align: lm
We are also rewriting the Web App from scratch.
Software stack:
- NextJS
- Material UI
- TypeScript
:: title ::
:: left ::
Kubernetes:
- Standard way to define a distributed system
- Separates infrastructure from applications
- "Please IT department(/cloud provider) run this for me"
Helm gives the ability:
- to parameterise
- to distribute a Kubernetes config
:: right ::
- DiracX Helm chart
- If your institution provides a Kubernetes service: use it
- If you work with public clouds: use their container services
- Alternatively, follow these k3s instructions
- Used for:
- DiracX testing (GitHub Actions)
- Local development instance
- Running a demo instance
- Running the test instance you saw in the previous slides
- Soon: running production instances
"OK, but there are several communities using DIRAC right now. How do they migrate?"
:: title ::
Services of DIRAC v9 and DiracX will need to live together for some time
:: content ::
1. DIRAC and DiracX share the databases
2. A legacy adaptor moves traffic from DIRAC to DiracX services
3. DIRAC services can be removed
:: title ::
:: content ::
By now, we know that it is sometimes necessary to extend all Dirac(X) components.
DiracX aims to provide an easy way to do so.
```toml
# entrypoints in pyproject.toml
[project.entry-points."diracx.db.sql"]
AuthDB = "diracx.db.sql:AuthDB"
JobDB = "<extension>.db.sql:ExtendedJobDB"
```
layout: quote color: sky-light quotesize: text-m authorsize: text-s author: 'Again, some of you out there'
"You have shown tokens-based authorizations for DiracX. But the Grid still uses proxies.
VOMS is alive!"
:: title ::
:: content ::
- Identity (community membership): "in transition"
- Submitting pilots: the computing elements currently prefer tokens
- Data access: at least in WLCG, proxies. One day, it will be tokens
- Verifying a user's identity (internally to Dirac):
- DiracX uses only tokens (link to security model)
- DIRAC uses only X509 proxies and certificates to verify identities
:: title ::
:: left ::
DiracX: Authorization with "standard" Authorization Code Flow redirecting to IdP
%%{init: { 'theme': 'forest' }}%%
sequenceDiagram
create actor U as User
create participant DiracX
U->>DiracX: Login
DiracX->>U: Redirect
create participant External_IdP
U->>External_IdP:
destroy External_IdP
External_IdP->>DiracX: ID token
DiracX->>U: DiracX token
:: right ::
DIRAC+DiracX: working with proxies and tokens
%%{init: { 'theme': 'forest' }}%%
sequenceDiagram
create actor U as User
create participant dirac-proxy-init
U->>dirac-proxy-init:
create participant VOMS
dirac-proxy-init->>VOMS:
destroy VOMS
VOMS->>dirac-proxy-init: VOMS proxy
create participant DIRAC
dirac-proxy-init->>DIRAC: exchange proxy for token
destroy DIRAC
DIRAC->>dirac-proxy-init: DiracX token
dirac-proxy-init->>U: proxy+token bundle
U->>DIRAC_service: proxy
U->>DiracX: token
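The last two arrows of the second diagram amount to picking the right credential for each kind of service. A minimal sketch of that choice, where `CredentialBundle` and `credential_for` are hypothetical names and the credentials are placeholder strings:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CredentialBundle:
    proxy: str  # VOMS proxy (placeholder string here)
    token: str  # DiracX token (placeholder string here)


def credential_for(bundle: CredentialBundle, service_kind: str) -> tuple:
    """Legacy DIRAC services get the proxy, DiracX services get the token."""
    if service_kind == "DIRAC":
        return ("proxy", bundle.proxy)
    if service_kind == "DiracX":
        return ("token", bundle.token)
    raise ValueError(f"unknown service kind: {service_kind!r}")


bundle = CredentialBundle(proxy="<proxy>", token="<token>")
print(credential_for(bundle, "DIRAC")[0], credential_for(bundle, "DiracX")[0])
```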
:: title ::
:: content ::
:: title ::
:: content ::
%%{init: { 'logLevel': 'debug', 'theme': 'base', 'timeline': {'disableMulticolor': true}}}%%
timeline
May 2022 : DIRAC v8.0
Oct 2023 : EOL DIRAC v7.3
: First DiracX demo
Q4 2024 : DIRAC v9.0.0a30
: DiracX v0.0.1a19
Q1 2025 : DIRAC v9.0
: DiracX v0.1
: can start using DiracX services
:: title ::
:: content ::
- code (github.com/DIRACGrid)
- tests: as you could see, we have a somewhat open test deployment infrastructure. Try something out, and let us know!
Run the demo (on your laptop):
```bash
git clone https://github.com/DIRACGrid/diracx-charts
diracx-charts/run_demo.sh # this is run for each and every commit in GitHub Actions
```
- mattermost: https://mattermost.web.cern.ch/diracx/
- meetings: (almost) every week on Thursday morning (CET)
- hackathons: we have been holding 2-day DiracX hackathons every quarter, at CERN
- workshops: once per year, more or less
:: title ::
:: left ::
:: right ::
- DiracX is "the neXt Dirac incarnation", ensuring the future of the widely used Dirac
- We are rewriting the code, but it is still Dirac that you love!
- DiracX will ease the interoperability with Rucio and/or dask and/or any other tool out there
- DiracX will still have the Data Management part, but its Workload Management functionalities will come first
- The first DiracX release will soon be here
- It will live together with DIRAC v9 for a while, until it replaces it completely
Christophe Haen CERN, LHCb
Natthan Pigoux LUPM (FR), CTA
Cedric Serfon Brookhaven National Laboratory (US), Belle2
Ryunosuke O'Neil CERN, LHCb
Jorge Lisa Laborda Univ. of Valencia and CSIC (ES), LHCb
Daniela Bauer Imperial College (UK), GridPP
Simon Fayer Imperial College (UK), GridPP
Janusz Martyniak Imperial College (UK), GridPP
Bertrand Rigaud IN2P3 (FR)
Luisa Arrabito LUPM (FR), CTA
Xiaomei Zhang Beijing, Inst. High Energy Phys. (CN), Juno
André Sailer CERN
Andrei Tsaregorodtsev CPPM (FR), EGI and LHCb
:: title ::
or just click here (for DiracX web) and here (for the Web API docs)
:: content ::
WebApp:
WebAPI:
:: title ::
Q/A
:: content ::
- I am using {Rucio|dask|another_tool}. I could use DiracX as WMS but do not want to fiddle with DIRAC
--> It will probably be possible, but we do not know when.
- What is in a DiracX token (is it "special")?
--> It carries the dirac_properties (which are the same as in current DIRAC)
- What did you use to make these slides?
--> slidev with neversink theme. Diagrams with mermaid
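To see what "carrying the dirac_properties" could look like, the sketch below builds a fake, unsigned token and decodes its payload. It assumes the token is a JWT; the `vo` claim, the `NormalUser` value and both helper functions are illustrative assumptions, and only the `dirac_properties` name comes from the answer above. The signature is not verified, so this is suitable for inspection only.

```python
import base64
import json


def b64url_encode(data: bytes) -> str:
    """Base64url without padding, as used for JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def decode_payload(token: str) -> dict:
    """Decode the payload of a JWT WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


# Fake token built only for this demo: header.payload.signature
header = b64url_encode(json.dumps({"alg": "none", "typ": "JWT"}).encode())
payload = b64url_encode(
    json.dumps({"vo": "gridpp", "dirac_properties": ["NormalUser"]}).encode()
)
fake_token = f"{header}.{payload}.signature"

claims = decode_payload(fake_token)
print(claims["dirac_properties"])
```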
:: title ::
:: content ::
- we use GitHub Actions "massively"
- our integration tests create a "grid-in-a-box":
- run DIRAC and DiracX servers, including databases
- run ancillary services (e.g. IdP, CA)
- authenticate, submit pilots, match and run jobs, upload files, etc