Commit
give up on openml.org (#16)
* add dataset directly to git repo

* get data from the local copy, not openml.org
jpivarski authored Jan 24, 2025
1 parent 9453667 commit 7c653ec
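
The change repeated across the notebooks in this commit is to read one local table and separate the label column from the feature columns. A minimal sketch of that split on a small in-memory DataFrame follows. It is an illustration, not the book's code: the two feature names and all values are made up, and only `jet_type` comes from the diff. It also shows that `columns[:-1]` (as used in `24-convolutional.md`) matches `drop("jet_type", axis=1)` only while `jet_type` is the last column.

```python
import pandas as pd

# Stand-in for pd.read_parquet("data/hls4ml_lhc_jets_hlf.parquet").
# The feature names and all values here are hypothetical.
table = pd.DataFrame(
    {
        "feature_a": [0.1, 0.2, 0.3],
        "feature_b": [1.0, 2.0, 3.0],
        "jet_type": ["g", "q", "t"],
    }
)

# Same split as in the diff: everything except the label is a feature.
features = table.drop("jet_type", axis=1)
targets = table["jet_type"]

# Slicing off the last column gives the same names only while
# "jet_type" is stored as the final column.
feature_names_by_drop = list(features.columns)
feature_names_by_slice = list(table.columns[:-1])
```

Dropping by name is the more robust of the two, since it does not depend on column order in the stored file.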
Showing 7 changed files with 11 additions and 23 deletions.
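
Because the notebooks now open `data/hls4ml_lhc_jets_hlf.parquet` by relative path, the read succeeds only when the working directory is the one that contains `data/`. One way to fail early with a readable message is sketched below. `local_dataset_path` is a hypothetical helper for illustration and is not something this commit adds.

```python
from pathlib import Path


def local_dataset_path(relative="data/hls4ml_lhc_jets_hlf.parquet", base=None):
    """Return the absolute path of the bundled dataset, or raise early.

    Hypothetical helper: `base` defaults to the current working directory,
    which is where a bare relative path would be resolved anyway.
    """
    base_dir = Path.cwd() if base is None else Path(base)
    path = (base_dir / relative).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"expected local dataset at {path}")
    return path
```

The returned path could then be passed to `pd.read_parquet` in place of the bare string.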
5 changes: 0 additions & 5 deletions .github/workflows/deploy.yml
@@ -48,11 +48,6 @@ jobs:
cache-environment: true
post-cleanup: "all"

-# Preload the main project data
-- name: Preload main project data
-run: |
-python -c 'import sklearn.datasets; d = sklearn.datasets.fetch_openml("hls4ml_lhc_jets_hlf"); d["data"], d["target"]'
# Build the book
- name: Build the book
run: |
9 changes: 3 additions & 6 deletions deep-learning-intro-for-hep/20-main-project.md
@@ -34,7 +34,6 @@ import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
-import sklearn.datasets
import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader, random_split
@@ -48,12 +47,10 @@ The data comes from an online catalog: [hls4ml_lhc_jets_hlf](https://openml.org/

The full description is online, with references to the paper in which it was published.

-Scikit-Learn has a tool for downloading it, which takes a minute or two.

```{code-cell} ipython3
-hls4ml_lhc_jets_hlf = sklearn.datasets.fetch_openml("hls4ml_lhc_jets_hlf")
-features, targets = hls4ml_lhc_jets_hlf["data"], hls4ml_lhc_jets_hlf["target"]
+hls4ml_lhc_jets_hlf = pd.read_parquet("data/hls4ml_lhc_jets_hlf.parquet")
+features = hls4ml_lhc_jets_hlf.drop("jet_type", axis=1)
+targets = hls4ml_lhc_jets_hlf["jet_type"]
```

View the features (16 numerical properties of jets) as a Pandas DataFrame:
7 changes: 3 additions & 4 deletions deep-learning-intro-for-hep/21-main-project-solutions.md
@@ -26,7 +26,6 @@ import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
-import sklearn.datasets
import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader, random_split
@@ -90,9 +89,9 @@ expected_ROC = np.array([
## Step 1: download and understand the data

```{code-cell} ipython3
-hls4ml_lhc_jets_hlf = sklearn.datasets.fetch_openml("hls4ml_lhc_jets_hlf")
-features, targets = hls4ml_lhc_jets_hlf["data"], hls4ml_lhc_jets_hlf["target"]
+hls4ml_lhc_jets_hlf = pd.read_parquet("data/hls4ml_lhc_jets_hlf.parquet")
+features = hls4ml_lhc_jets_hlf.drop("jet_type", axis=1)
+targets = hls4ml_lhc_jets_hlf["jet_type"]
```

## Step 2: split the data into training, validation, and test samples
9 changes: 3 additions & 6 deletions deep-learning-intro-for-hep/23-autoencoders.md
@@ -45,7 +45,6 @@ import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
-import sklearn.datasets
import torch
from torch import nn, optim
```
@@ -57,12 +56,10 @@ from torch import nn, optim
Let's use the jet data from the main project.

```{code-cell} ipython3
-hls4ml_lhc_jets_hlf = sklearn.datasets.fetch_openml("hls4ml_lhc_jets_hlf")
+hls4ml_lhc_jets_hlf = pd.read_parquet("data/hls4ml_lhc_jets_hlf.parquet")
features_unnormalized = torch.tensor(
-hls4ml_lhc_jets_hlf["data"].values, dtype=torch.float32,
+hls4ml_lhc_jets_hlf.drop("jet_type", axis=1).values, dtype=torch.float32
)
features = (features_unnormalized - features_unnormalized.mean(axis=0)) / features_unnormalized.std(axis=0)
```

@@ -189,7 +186,7 @@ The exact distribution isn't meaningful (and it would change if we used a differ
How well do these clumps correspond to the known jet sources?

```{code-cell} ipython3
-hidden_truth = hls4ml_lhc_jets_hlf["target"].values
+hidden_truth = hls4ml_lhc_jets_hlf["jet_type"].values
```

```{code-cell} ipython3
3 changes: 1 addition & 2 deletions deep-learning-intro-for-hep/24-convolutional.md
@@ -30,7 +30,6 @@ import matplotlib as mpl
import matplotlib.pyplot as plt
import h5py
-import sklearn.datasets
import torch
from torch import nn, optim
```
@@ -42,7 +41,7 @@ from torch import nn, optim
The jet dataset that you used for your [main project](20-main-project.md) is based on 16 hand-crafted features:

```{code-cell} ipython3
-list(sklearn.datasets.fetch_openml("hls4ml_lhc_jets_hlf")["data"].columns)
+list(pd.read_parquet("data/hls4ml_lhc_jets_hlf.parquet").columns[:-1])
```

Suppose we didn't know that these are a useful way to characterize jet substructure, or suppose that there are better ways not listed here (very plausible!). A model trained on these 16 features wouldn't have as much discriminating power as it could.
Binary file not shown.
1 change: 1 addition & 0 deletions environment.yml
@@ -16,6 +16,7 @@ dependencies:
- pandas
- iminuit
- scikit-learn
+- fastparquet
- pytorch-cpu # this is `torch` in pip

# used in very few sections (optional)