Merge pull request #158 from promised-ai/feature/explain-pred
Feature/explain pred
BaxterEaves authored Jan 18, 2024
2 parents 82940ee + 0220d5b commit edc1ac3
Showing 47 changed files with 2,080 additions and 1,461 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -10,10 +10,18 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added

- Added `plot.state` function in pylace to render PCC states
- Added `analysis.explain_prediction` in pylace to explain predictions
- Added `plot.prediction_explanation` in pylace to render prediction explanations
- Added `analysis.held_out_uncertainty` in pylace
- Added `analysis.attributable_[neglogp | inconsistency | uncertainty]` in pylace to quantify the amount of surprisal (neglogp), inconsistency, and uncertainty attributable to other features

### Changed

- Updated all packages to have the correct SPDX for the Business Source License
- Removed internal implementation of `logsumexp` in favor of `rv::misc::logsumexp`
- Update to rv 0.16.2
- Impute and prediction uncertainty are the mean total variation distance between each state's distribution and the average distribution divided by the potential max: `(n-1) / n`, where `n` is the number of states. This normalization is meant to ensure that the interpretation is the same regardless of the number of states -- zero is lowest, one is highest.
- Bump min rust version to `1.62` to support `f64::total_cmp`
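
The normalization described in the uncertainty change above can be sketched as a stand-alone computation. This is only an illustration with categorical state distributions, using the formula stated in the changelog entry (mean total variation distance to the average distribution, divided by `(n - 1) / n`); it is not pylace's internal code:

```python
import numpy as np

def normalized_uncertainty(state_dists):
    """Mean total variation distance (TVD) between each state's
    distribution and the average distribution, divided by the
    maximum possible value (n - 1) / n, where n is the number
    of states. Result lies in [0, 1]."""
    P = np.asarray(state_dists, dtype=float)  # (n_states, n_categories)
    n = P.shape[0]
    avg = P.mean(axis=0)
    tvd = 0.5 * np.abs(P - avg).sum(axis=1)   # TVD of each state vs. average
    return tvd.mean() / ((n - 1) / n)

# Identical states: zero uncertainty
print(normalized_uncertainty([[0.5, 0.5], [0.5, 0.5]]))  # 0.0
# States in complete disagreement: maximum uncertainty
print(normalized_uncertainty([[1.0, 0.0], [0.0, 1.0]]))  # 1.0
```

With this normalization the endpoints read the same regardless of how many states are in the ensemble: zero means the states agree exactly, one means maximal disagreement.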

### Fixed

2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
COUCHBASE BUSINESS SOURCE LICENSE AGREEMENT
REDPOLL BUSINESS SOURCE LICENSE AGREEMENT

Business Source License 1.1

4 changes: 2 additions & 2 deletions book/lace_preprocess_mdbook_yaml/Cargo.toml
@@ -16,8 +16,8 @@ path = "src/main.rs"
anyhow = "1.0"
clap = "4.2"
env_logger = "0.10"
lace_codebook = { path = "../../lace/lace_codebook", version = "0.4.0" }
lace_stats = { path = "../../lace/lace_stats", version = "0.2.0" }
lace_codebook = { path = "../../lace/lace_codebook", version = "0.5.0" }
lace_stats = { path = "../../lace/lace_stats", version = "0.2.1" }
log = "0.4"
mdbook = "0.4"
pulldown-cmark = { version = "0.9", default-features = false }
6 changes: 6 additions & 0 deletions book/src/pcc/html/sats-high-unc.html

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions book/src/pcc/html/sats-low-unc.html

Large diffs are not rendered by default.

46 changes: 43 additions & 3 deletions book/src/pcc/pred.md
@@ -18,10 +18,50 @@ Determining how certain the model is in its ability to capture a prediction is d

Mathematically, uncertainty is formalized as the Jensen-Shannon divergence (JSD) between the state-level predictive distributions. Uncertainty goes from 0 to 1, 0 meaning that there is only one way to model a prediction, and 1 meaning that there are many ways to model a prediction and they all completely disagree.
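
The generalized Jensen-Shannon divergence described above can be illustrated with a small stand-alone sketch. This assumes discrete state-level predictive distributions and normalizes by the maximum value, `log(n)`; it is an illustration of the concept, not Lace's internal implementation:

```python
import numpy as np

def js_divergence(dists):
    """Generalized Jensen-Shannon divergence between n discrete
    distributions, normalized to [0, 1] by its maximum, log(n)."""
    P = np.asarray(dists, dtype=float)
    n = P.shape[0]

    def entropy(p):
        p = p[p > 0]  # treat 0 * log(0) as 0
        return -(p * np.log(p)).sum()

    jsd = entropy(P.mean(axis=0)) - np.mean([entropy(p) for p in P])
    return jsd / np.log(n)

# Identical state distributions: only one way to model the prediction
print(js_divergence([[0.5, 0.5], [0.5, 0.5]]))  # 0.0
# Completely disagreeing states: maximal uncertainty
print(js_divergence([[1.0, 0.0], [0.0, 1.0]]))  # 1.0
```

The divergence is the entropy of the average distribution minus the average entropy of the individual distributions, so it is zero exactly when the states agree.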

![Prediction uncertainty in unimodal data](prediction-uncertainty.png)

**Above.** Prediction uncertainty when predicting *Period_minutes* of a satellite in the satellites data set. Note that the uncertainty value here is driven mostly by the differing variances of the state-level predictive distributions.

Certain ignorance is when the model has zero data by which to make a prediction and instead falls back to the prior distribution. This is rare, but when it happens it will be apparent. To be as general as possible, the priors for a column's component distributions are generally much broader than the predictive distribution, so if you see a predictive distribution that is senselessly wide and does not look like the marginal distribution of that variable (which should follow the histogram of the data), you have certain ignorance. The fix is to fill in the data for items similar to the one you are predicting.

<div class=tabbed-blocks>

```python
import pandas as pd
from lace import examples, plot

satellites = examples.Satellites()

plot.prediction_uncertainty(
    satellites,
    "Period_minutes",
    given={"Class_of_Orbit": "GEO"},
)
```
</div>

{{#include html/sats-low-unc.html}}

**Above.** Prediction uncertainty when predicting *Period_minutes* of a geosynchronous satellite in the satellites dataset. Uncertainty is low. Though the state distributions differ slightly in their variance, they are relatively close, with similar means.

To visualize a higher-uncertainty prediction, we'll use `given` conditions from a record with a known data entry error.

<div class=tabbed-blocks>

```python
given = satellites["Intelsat 902", :].to_dicts()[0]

# remove all missing data
given = {k: v for k, v in given.items() if not pd.isnull(v)}

# remove the index and the target value
_ = given.pop("index")
_ = given.pop("Period_minutes")

plot.prediction_uncertainty(
    satellites,
    "Period_minutes",
    given=given,
)
```
</div>

{{#include html/sats-high-unc.html}}

**Above.** Prediction uncertainty when predicting *Period_minutes* of Intelsat 902. Though the mean predictive distribution (black line) has a relatively low variance, there is a lot of disagreement between some of the samples, leading to high epistemic uncertainty.

Certain ignorance is when the model has zero data by which to make a prediction and instead falls back to the prior distribution. This is rare, but when it happens it will be apparent. To be as general as possible, the priors for a column's component distributions are generally much broader than the predictive distribution, so if you see a predictive distribution that is senselessly wide and does not look like the marginal distribution of that variable (which should follow the histogram of the data), you have certain ignorance. The fix is to fill in the data for items similar to the one you are predicting.
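
A rough screen for this situation can be sketched as follows. The helper below is hypothetical (not part of Lace), and the threshold is an arbitrary assumption: it simply flags a predictive distribution that is far wider than the variable's marginal distribution, which is the symptom described above.

```python
import numpy as np

def possible_certain_ignorance(pred_samples, marginal_samples, ratio=3.0):
    """Heuristic (not part of Lace): flag a prediction whose predictive
    distribution is much wider than the marginal, suggesting the model
    fell back to a broad prior. `ratio` is an arbitrary threshold."""
    return np.std(pred_samples) > ratio * np.std(marginal_samples)

rng = np.random.default_rng(0)
marginal = rng.normal(100.0, 10.0, size=1000)    # data-driven spread
prediction = rng.normal(100.0, 80.0, size=1000)  # senselessly wide

print(possible_certain_ignorance(prediction, marginal))  # True
```

In practice you would compare samples simulated from the predictive distribution against the observed values of the column; a flagged prediction is a cue to fill in data for similar items.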
Binary file removed book/src/pcc/prediction-uncertainty.png
Binary file not shown.
22 changes: 11 additions & 11 deletions book/src/workflow/analysis.md
@@ -84,7 +84,7 @@ animals.predict("swims")
animals.predict(
"swims",
&Given::<usize>::Nothing,
Some(PredictUncertaintyType::JsDivergence),
true,
None,
);
```
@@ -94,7 +94,7 @@ animals.predict(
Which outputs

```python
(0, 0.008287057807910558)
(0, 0.04384630488890182)
```

The first number is the prediction. Lace predicts that *an* animal does not
@@ -121,7 +121,7 @@ animals.predict(
&Given::Conditions(vec![
("flippers", Datum::Categorical(lace::Category::U8(1)))
]),
Some(PredictUncertaintyType::JsDivergence),
true,
None,
);
```
@@ -130,10 +130,10 @@ animals.predict(
Output:

```python
(1, 0.05008037071634858)
(1, 0.09588592928237495)
```

The uncertainty is higher, but still quite low.
The uncertainty is a little higher, but still quite low.

Let's add some more conditions that are indicative of a swimming animal and see
how that affects the uncertainty.
@@ -151,7 +151,7 @@ animals.predict(
("flippers", Datum::Categorical(lace::Category::U8(1))),
("water", Datum::Categorical(lace::Category::U8(1))),
]),
Some(PredictUncertaintyType::JsDivergence),
true,
None,
);
```
@@ -160,10 +160,10 @@
Output:

```python
(1, 0.05116664361415335)
(1, 0.06761776764962134)
```

The uncertainty is basically the same.
The uncertainty is a bit lower now that we've added swim-consistent evidence.

How about we try to mess with Lace? Let's try to confuse it by asking it to
predict whether an animal with flippers that does not go in the water swims.
@@ -181,7 +181,7 @@ animals.predict(
("flippers", Datum::Categorical(lace::Category::U8(1))),
("water", Datum::Categorical(lace::Category::U8(0))),
]),
Some(PredictUncertaintyType::JsDivergence),
true,
None,
);
```
@@ -190,14 +190,14 @@
Output:

```python
(0, 0.32863593091906085)
(0, 0.36077426258767503)
```

The uncertainty is really high! We've successfully confused Lace.

## Evaluating likelihoods

Let's compute the likemportlihood to see what is going on
Let's compute the likelihood to see what is going on

<div class=tabbed-blocks>

