Feature/explain pred #158

Merged
merged 21 commits on Jan 18, 2024
Changes from 16 commits
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -10,10 +10,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added

- Added `plot.state` function in pylace to render PCC states
- Added `analysis.explain_prediction` in pylace to explain predictions
- Added `plot.prediction_explanation` in pylace to render prediction explanations
- Added `analysis.held_out_uncertainty` in pylace
- Added `analysis.attributable_[neglogp | inconsistency | uncertainty]` in pylace to quantify the amount of surprisal (neglogp), inconsistency, and uncertainty attributable to other features
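
The entries above only name the new entry points. As a rough orientation, the sketch below shows how they might be called; the argument names, the `given` parameter, and the value passed to `plot.prediction_explanation` are assumptions for illustration, not the documented pylace signatures, so consult the pylace API reference for the real interface.

```python
from lace import analysis, examples, plot

satellites = examples.Satellites()

# Hypothetical call: the target column and `given` evidence are assumed
# parameters; the real signature may differ.
explanation = analysis.explain_prediction(
    satellites,
    "Period_minutes",
    given={"Class_of_Orbit": "GEO"},
)

# Hypothetical rendering call, assumed to accept whatever
# `explain_prediction` returns.
plot.prediction_explanation(explanation)
```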

### Changed

- Updated all packages to have the correct SPDX for the Business Source License
- Removed internal implementation of `logsumexp` in favor of `rv::misc::logsumexp`
- Updated to rv 0.16.2
- Impute and prediction uncertainty are the mean total variation distance between each state's distribution and the average distribution divided by the potential max: `(n-1) / n`, where `n` is the number of states. This normalization is meant to ensure that the interpretation is the same regardless of the number of states -- zero is lowest, one is highest.
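
For concreteness, here is a minimal sketch of that normalization for discrete (categorical) state-level distributions; `normalized_uncertainty` is a hypothetical helper written for this note, not the lace implementation.

```python
import numpy as np

def normalized_uncertainty(state_dists: np.ndarray) -> float:
    """Mean total variation distance (TVD) between each state's distribution
    and the average distribution, divided by the potential max (n - 1) / n."""
    n = state_dists.shape[0]
    avg = state_dists.mean(axis=0)
    # TVD between two discrete distributions is half their L1 distance
    tvds = 0.5 * np.abs(state_dists - avg).sum(axis=1)
    return float(tvds.mean() / ((n - 1) / n))

# Three states' predictive distributions over the same two categories:
# 0 means all states agree, 1 means maximal disagreement.
print(normalized_uncertainty(np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]])))
```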

### Fixed

6 changes: 6 additions & 0 deletions book/src/pcc/html/sats-high-unc.html

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions book/src/pcc/html/sats-low-unc.html

Large diffs are not rendered by default.

45 changes: 42 additions & 3 deletions book/src/pcc/pred.md
@@ -18,10 +18,49 @@ Determining how certain the model is in its ability to capture a prediction is d

Mathematically, uncertainty is formalized as the Jensen-Shannon divergence (JSD) between the state-level predictive distributions. Uncertainty ranges from 0 to 1: 0 means there is only one way to model the prediction, and 1 means there are many ways to model the prediction and they all completely disagree.
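
As a point of reference, the generalized Jensen-Shannon divergence between the n state-level predictive distributions with uniform weights can be written as below; this explicit form is spelled out here for illustration and is an assumption, not a formula quoted from lace's documentation.

$$
\mathrm{JSD}(p_1, \ldots, p_n) = H\!\left(\frac{1}{n}\sum_{i=1}^{n} p_i\right) - \frac{1}{n}\sum_{i=1}^{n} H(p_i)
$$

Here H denotes (differential) entropy. The quantity is bounded above by the log of the number of states, so normalizing by that bound keeps the reported uncertainty between 0 and 1.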

![Prediction uncertainty in unimodal data](prediction-uncertainty.png)

**Above.** Prediction uncertainty when predicting *Period_minutes* of a satellite in the satellites data set. Note that the uncertainty value here is driven mostly by the differing variances of the state-level predictive distributions.

<div class=tabbed-blocks>

```python
from lace import examples, plot

satellites = examples.Satellites()

plot.prediction_uncertainty(
    satellites,
    "Period_minutes",
    given={"Class_of_Orbit": "GEO"},
)
```
</div>

{{#include html/sats-low-unc.html}}

**Above.** Prediction uncertainty when predicting *Period_minutes* of a geosynchronous satellite in the satellites dataset. Uncertainty is low. Though the state distributions differ slightly in their variance, they're relatively close, with similar means.

To visualize a higher-uncertainty prediction, we'll use `given` conditions from a record with a known data entry error.

<div class=tabbed-blocks>

```python
import pandas as pd

given = satellites["Intelsat 902", :].to_dicts()[0]

# remove all missing data
given = {k: v for k, v in given.items() if not pd.isnull(v)}

# remove the index and the target value
_ = given.pop("index")
_ = given.pop("Period_minutes")

plot.prediction_uncertainty(
    satellites,
    "Period_minutes",
    given=given,
)
```
</div>

{{#include html/sats-high-unc.html}}

**Above.** Prediction uncertainty when predicting *Period_minutes* of Intelsat 902. Though the mean predictive distribution (black line) has a relatively low variance, there is a lot of disagreement between some of the samples, leading to high epistemic uncertainty.

Certain ignorance is when the model has zero data by which to make a prediction and instead falls back to the prior distribution. This is rare, but when it happens it will be apparent. To keep the model as general as possible, the priors for a column's component distributions are generally much broader than the predictive distribution, so if you see a predictive distribution that is senselessly wide and does not look like the marginal distribution of that variable (which should follow the histogram of the data), you have certain ignorance. The fix is to fill in the data for items similar to the one you are predicting.
Binary file removed book/src/pcc/prediction-uncertainty.png
Binary file not shown.