Fix data oob notebook #636

Merged
merged 43 commits into from
Jan 15, 2025

43 commits
aa6f4fa
WIP: data-oob adaptation to pydvl.valuation
schroedk Sep 16, 2024
fc0f16d
Fix warning message
mdbenito Nov 13, 2024
d76dd02
Missing type
mdbenito Nov 13, 2024
37dba9d
OOB notebook: fix text, update code to new api, use joblib
mdbenito Nov 13, 2024
aae6b3b
Move legacy code outside core dist
mdbenito Nov 13, 2024
ce42b0c
Add deletion note
mdbenito Nov 13, 2024
6ae19eb
Minor text tweaks to nb
mdbenito Nov 14, 2024
3555960
Fix warning message
mdbenito Nov 14, 2024
7266c03
Export both variances and counts to dataframe
mdbenito Nov 14, 2024
c037e5a
Add BaggingModel type and predicate
mdbenito Nov 14, 2024
2bdbea2
Revert typo leading to bogus stderr computation
mdbenito Nov 14, 2024
9d0ff80
Simplify type check and improve tests
mdbenito Nov 14, 2024
b754227
Remove support for implicit training of bagging models in data_oob
mdbenito Nov 14, 2024
2d9ce58
Missing type
mdbenito Nov 14, 2024
d49f6a3
Fix plotting for oob
mdbenito Nov 16, 2024
01628ff
Fix naming of OOB scoring function
mdbenito Nov 16, 2024
1b3794f
Ensure sorting of values prior to plotting
mdbenito Nov 19, 2024
611ae47
Require sklearn >= 1.3 for TargetEncoder
mdbenito Nov 19, 2024
cfcd148
Update tests for ValuationResult
mdbenito Nov 19, 2024
f53e8a8
WIP
mdbenito Nov 22, 2024
df5023e
WIP2
mdbenito Nov 29, 2024
6a0fd35
Move requirement to correct file
mdbenito Jan 14, 2025
fcf504c
Merge develop into feature/619-oob-valuation
mdbenito Jan 14, 2025
cda1f99
Fixes for new dataset interface
mdbenito Jan 14, 2025
9f82e8e
Merge branch 'feature/619-oob-valuation' into feature/619/temp
mdbenito Jan 14, 2025
59ddcc6
Fixes for new dataset interface
mdbenito Jan 15, 2025
fa0db1b
Import future annotations
mdbenito Jan 15, 2025
f8994f3
Fix dtype of names in results
mdbenito Jan 15, 2025
5ec3a76
Comments
mdbenito Jan 15, 2025
c6648f3
Docstrings
mdbenito Jan 15, 2025
727f5a4
Fix bogus sorting of data in plot_ci_array
mdbenito Jan 15, 2025
943d688
Remove unnecessary type hint using private import
mdbenito Jan 15, 2025
2209ba7
Copy names and ignore type
mdbenito Jan 15, 2025
6cdd207
Extend accepted types for dataset slices
mdbenito Jan 15, 2025
41ced28
Cleanup
mdbenito Jan 15, 2025
00a0f34
Finish notebook and analysis
mdbenito Jan 15, 2025
5cbee9c
Fix doc
mdbenito Jan 15, 2025
27224ee
Update CHANGELOG.md
mdbenito Jan 15, 2025
70d0956
Fix docs build
mdbenito Jan 15, 2025
705713e
Revert fe03352bb56f29ae4f4d9628bbb6fa2818324458
mdbenito Jan 15, 2025
20863ed
Move test to its correct location
mdbenito Jan 15, 2025
12c7534
Faster CI exec of data oob notebook, and a missing import
mdbenito Jan 15, 2025
61a6c15
Adapt notebook to new ValuationResult
mdbenito Jan 15, 2025
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,8 @@

### Added

- Added `run_removal_experiment` for easy removal experiments
[PR #636](https://github.com/aai-institute/pyDVL/pull/636)
- Refactor Classwise Shapley valuation with the interfaces and sampler
architecture [PR #616](https://github.com/aai-institute/pyDVL/pull/616)
- Refactor KNN Shapley values with the new sampler architecture
@@ -41,6 +43,8 @@

### Fixed

- Fixed the analysis of the adult dataset in the Data-OOB notebook
[PR #636](https://github.com/aai-institute/pyDVL/pull/636)
- Replace `np.float_` with `np.float64` and `np.alltrue` with `np.all`,
as the old aliases are removed in NumPy 2.0
[PR #604](https://github.com/aai-institute/pyDVL/pull/604)
@@ -67,6 +71,8 @@
- Dropped black, isort and pylint from the CI pipeline, in favour of ruff
[PR #633](https://github.com/aai-institute/pyDVL/pull/633)
- **Breaking Changes**
- Changed `DataOOBValuation` to only accept bagged models
[PR #636](https://github.com/aai-institute/pyDVL/pull/636)
- Dropped support for python 3.8 after EOL
[PR #633](https://github.com/aai-institute/pyDVL/pull/633)
- Rename parameter `hessian_regularization` of `DirectInfluence`
155 changes: 155 additions & 0 deletions docs/value/data-oob.md
@@ -0,0 +1,155 @@
---
title: Data-OOB
---

# Data valuation for bagged models with Data-OOB

Data-OOB [@kwon_dataoob_2023] is a method for valuing data used to train bagged
models. It defines value as the out-of-bag (OOB) performance estimate for the
model, thereby overcoming the computational bottleneck of Shapley-based data
valuation methods: instead of fitting a large number of models to accurately
estimate marginal contributions, Data-OOB evaluates each weak learner in the
ensemble on the samples it has not seen during training, and averages the
scores across all weak learners.

More precisely, for a bagging model with $B$ estimators $\hat{f}_b, b \in [B]$,
we define $w_{bj}$ as the number of times that the $j$-th sample is in the
training set of the $b$-th estimator. For a **fixed** choice of bootstrapped
training sets, the Data-OOB value of sample $(x_i, y_i)$ is defined as:

$$ \psi_i := \frac{\sum_{b=1}^{B}\mathbb{1}(w_{bi}=0)T(y_i,
\hat{f}_b(x_i))}{\sum_{b=1}^{B} \mathbb{1} (w_{bi}=0)},
$$

where $T: Y \times Y \rightarrow \mathbb{R}$ is a score function that represents
the goodness of weak learner $\hat{f}_b$ at the $i$-th datum $(x_i, y_i)$.

$\psi$ can therefore be interpreted as a per-sample partition of the standard
OOB error estimate for a bagging model, which is: $\frac{1}{n} \sum_{i=1}^n
\psi_i$.
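
To make the formula concrete, here is a minimal NumPy sketch of the
computation, independent of pyDVL's actual implementation. The arrays `w` (the
counts $w_{bi}$) and `scores` (the values $T(y_i, \hat{f}_b(x_i))$) are assumed
given:

```python
import numpy as np


def data_oob_values(w: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Compute psi_i for every sample.

    w: (B, n) array with w[b, i] the number of times sample i appears in
        the training set of estimator b.
    scores: (B, n) array with scores[b, i] = T(y_i, f_b(x_i)).
    Returns an (n,) array of psi_i, with NaN where no estimator has
    sample i out-of-bag.
    """
    oob = w == 0  # indicator 1(w_bi = 0)
    counts = oob.sum(axis=0)  # denominator: number of times i is out-of-bag
    sums = np.where(oob, scores, 0.0).sum(axis=0)  # numerator
    return np.divide(sums, counts, out=np.full_like(sums, np.nan),
                     where=counts > 0)
```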

## Computing values

The main class is
[DataOOBValuation][pydvl.valuation.methods.data_oob.DataOOBValuation]. It takes
a *fitted* bagged model and uses data precomputed during training to calculate
the values. It is therefore very fast, and can be used to value large datasets.

This is how you would use it with a [[RandomForestClassifier]]:

```python
from sklearn.ensemble import RandomForestClassifier
from pydvl.valuation import DataOOBValuation, Dataset

train, test = Dataset(...), Dataset(...)
model = RandomForestClassifier(...)
model.fit(*train.data())
valuation = DataOOBValuation(model)
valuation.fit(train)
values = valuation.values()
```

`values` is then a [ValuationResult][pydvl.valuation.result.ValuationResult] to
be used for data inspection, cleaning, etc.
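
For inspection it is convenient to export the result to a pandas dataframe. A
small sketch, continuing the snippet above; the exact column names (e.g. for
the variances and counts of the values) may depend on the pyDVL version:

```python
df = values.to_dataframe()
# Sort the training points by value, lowest first:
print(df.sort_values(by=df.columns[0]).head())
```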

Data-OOB is not limited to sklearn's [[RandomForest]], but can be used with
any bagging model that defines the attribute `estimators_` after fitting and
makes the list of bootstrapped samples available in some way. This includes
[[BaggingRegressor]], [[BaggingClassifier]], [[ExtraTreesClassifier]],
[[ExtraTreesRegressor]] and [[IsolationForest]].

## Bagging arbitrary models

Through [[BaggingClassifier]] and [[BaggingRegressor]], one can compute values
for any model that can be bagged. Bagging is not always beneficial in itself,
and there are cases where it can be detrimental. However, for data valuation we
are not interested in the performance of the bagged model, but in the values it
produces, which can then be used to work on the original model and data.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

from pydvl.valuation import DataOOBValuation, Dataset

train, test = Dataset(...), Dataset(...)
model = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=10),
    n_estimators=20,
)
model.fit(*train.data())
valuation = DataOOBValuation(model)
valuation.fit(train)
values = valuation.values()
values.sort()
low_values = values[: int(0.05 * len(train))]  # select the lowest 5%

# Inspect the data with the lowest values:
...
```

### Off-topic: When not to use bagging as the main model

Here are some guidelines for when bagging might unnecessarily increase
computational cost, or even be detrimental:

1. **Low-variance models**: Models like linear regression, support vector
machines, or other inherently stable algorithms typically have low variance.
However, even these models can benefit from bagging in certain scenarios,
particularly with noisy data or when there are influential outliers.

2. **When the model is already highly regularized**: If a model is regularized
(e.g., Lasso, Ridge, or Elastic Net), it is already tuned to avoid
overfitting and reduce variance, so bagging might not provide much of a
benefit for its high computational cost.

3. **When data is limited**: Bagging works by creating multiple subsets of the
data via bootstrapping. If the dataset is too small, the bootstrap samples
might overlap significantly or exclude important patterns, reducing the
effectiveness of bagging.

4. **When features are highly correlated**: If features are highly correlated,
the individual models trained on different bootstrap samples may end up being
too similar.

5. **For inherently stable models**: Models that are naturally resistant to
changes in the training data (like nearest neighbors) may not benefit
significantly from bagging's variance reduction properties.

6. **When interpretability is critical**: Bagging produces an ensemble of
models, which makes the overall model less interpretable compared to a single
model. There are, however, many techniques to maintain interpretability, like
partial dependence plots.

7. **When the bias-variance trade-off favors bias reduction**: If the model's
error is primarily due to bias rather than variance, techniques that address
bias (like boosting) might be more appropriate than bagging.

## Transferring values

As with any other valuation method, you can transfer the values to a different
model, and given the efficiency of Data-OOB, this can be done very quickly. A
simple workflow is to compute values using a random forest, then use them to
inspect the data and clean it, and finally train a more complex model on the
cleaned data. Whether this is a valid idea or not will depend on the specific
dataset.

```python
...
```
...
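
A minimal sketch of this workflow, in the placeholder style of the snippets
above. The final model and the 5% cutoff are illustrative choices, and slicing
a `Dataset` with an array of indices is assumed to be supported:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

from pydvl.valuation import DataOOBValuation, Dataset

train, test = Dataset(...), Dataset(...)

# 1. Cheap valuation with a random forest
forest = RandomForestClassifier(...)
forest.fit(*train.data())
valuation = DataOOBValuation(forest)
valuation.fit(train)
values = valuation.values()
values.sort()

# 2. Drop the lowest-valued 5% of the training points
keep = values.indices[int(0.05 * len(train)):]
clean_train = train[keep]  # assumed: Dataset supports indexing by array

# 3. Train the final, more expensive model on the cleaned data
model = GradientBoostingClassifier()
model.fit(*clean_train.data())
```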

## A comment about sampling

One might fear that there is a problem here, because the computation of the
value $\psi_i$ requires at least some bootstrap samples *not* to include the
$i$-th sample. But this is rarely an issue, and the probability of it happening
is easily computed: for a training set of size $n$ and bootstrap sample size
$m \le n$, the probability that index $i$ is not included in a single bootstrap
sample is $\prod_{j=1}^m \mathbb{P}(i \text{ is not drawn at pos. } j) =
(1 - 1/n)^m$, i.e. in each of the $m$ draws the index is not picked (for $m=n$
this converges to $1/e \approx 0.368$). The problematic event is that a point
appears in *all* $B$ bootstrap samples, so that no OOB estimate exists for it.
This happens with probability $(1 - (1 - 1/n)^m)^B$, which for $m=n$ is roughly
$0.632^B$, vanishingly small already for moderate $B$.

Incidentally, the same computation yields the expected number of unique indices
in a bootstrap sample of size $m$: each of the $n$ indices appears with
probability $1 - (1 - 1/n)^m$, so the expectation is $n (1 - (1 - 1/n)^m)$,
approximately $0.632\,n$ for $m = n$.
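
These quantities are easy to check numerically; a small sanity-check script
(the values of $n$, $m$ and $B$ are arbitrary):

```python
n = 1_000  # training set size
m = n      # bootstrap sample size
B = 100    # number of estimators in the ensemble

p_oob_once = (1 - 1 / n) ** m           # P(i excluded from one sample), ~0.3677
p_never_oob = (1 - p_oob_once) ** B     # P(i in all B samples), ~1.2e-20
expected_unique = n * (1 - p_oob_once)  # unique indices per sample, ~632.3

print(p_oob_once, p_never_oob, expected_unique)
```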

5 changes: 3 additions & 2 deletions mkdocs.yml
@@ -20,8 +20,9 @@ nav:
- value/index.md
- Shapley values: value/shapley.md
- Semi-values: value/semi-values.md
- The Core: value/the-core.md
- Class-wise Shapley: value/classwise-shapley.md
- Least Core: value/the-core.md
- Data-OOB: value/data-oob.md
- The Influence Function:
- influence/index.md
- Influence Function Model: influence/influence_function_model.md
@@ -32,9 +33,9 @@
- Shapley values: examples/shapley_basic_spotify.ipynb
- KNN Shapley: examples/shapley_knn_flowers.ipynb
- Data utility learning: examples/shapley_utility_learning.ipynb
- Banzhaf semivalues: examples/msr_banzhaf_digits.ipynb
- Least Core: examples/least_core_basic.ipynb
- Data OOB: examples/data_oob.ipynb
- Banzhaf Semivalues: examples/msr_banzhaf_digits.ipynb
- Influence Function:
- For CNNs: examples/influence_imagenet.ipynb
- For mislabeled data: examples/influence_synthetic.ipynb