Merge pull request #158 from promised-ai/feature/explain-pred
Feature/explain pred
BaxterEaves authored Jan 18, 2024
2 parents 82940ee + 0220d5b commit edc1ac3
Showing 47 changed files with 2,080 additions and 1,461 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -10,10 +10,18 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added

- Added `plot.state` function in pylace to render PCC states
- Added `analysis.explain_prediction` in pylace to explain predictions
- Added `plot.prediction_explanation` in pylace to render prediction explanations
- Added `analysis.held_out_uncertainty` in pylace
- Added `analysis.attributable_[neglogp | inconsistency | uncertainty]` in pylace to quantify the amount of surprisal (neglogp), inconsistency, and uncertainty attributable to other features

### Changed

- Updated all packages to have the correct SPDX for the Business Source License
- Removed internal implementation of `logsumexp` in favor of `rv::misc::logsumexp`
- Update to rv 0.16.2
- Impute and prediction uncertainty are the mean total variation distance between each state's distribution and the average distribution divided by the potential max: `(n-1) / n`, where `n` is the number of states. This normalization is meant to ensure that the interpretation is the same regardless of the number of states -- zero is lowest, one is highest.
- Bump min rust version to `1.62` to support `f64::total_cmp`
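
The normalization described in the uncertainty change above can be sketched as a stand-alone computation. This is only an illustration with categorical state distributions, using the formula stated in the changelog entry (mean total variation distance to the average distribution, divided by `(n - 1) / n`); it is not pylace's internal code:

```python
import numpy as np

def normalized_uncertainty(state_dists):
    """Mean total variation distance (TVD) between each state's
    distribution and the average distribution, divided by the
    maximum possible value (n - 1) / n, where n is the number
    of states. Result lies in [0, 1]."""
    P = np.asarray(state_dists, dtype=float)  # (n_states, n_categories)
    n = P.shape[0]
    avg = P.mean(axis=0)
    tvd = 0.5 * np.abs(P - avg).sum(axis=1)   # TVD of each state vs. average
    return tvd.mean() / ((n - 1) / n)

# Identical states: zero uncertainty
print(normalized_uncertainty([[0.5, 0.5], [0.5, 0.5]]))  # 0.0
# States in complete disagreement: maximum uncertainty
print(normalized_uncertainty([[1.0, 0.0], [0.0, 1.0]]))  # 1.0
```

With this normalization the endpoints read the same regardless of how many states are in the ensemble: zero means the states agree exactly, one means maximal disagreement.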

### Fixed

2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
COUCHBASE BUSINESS SOURCE LICENSE AGREEMENT
REDPOLL BUSINESS SOURCE LICENSE AGREEMENT

Business Source License 1.1

4 changes: 2 additions & 2 deletions book/lace_preprocess_mdbook_yaml/Cargo.toml
@@ -16,8 +16,8 @@ path = "src/main.rs"
anyhow = "1.0"
clap = "4.2"
env_logger = "0.10"
lace_codebook = { path = "../../lace/lace_codebook", version = "0.4.0" }
lace_stats = { path = "../../lace/lace_stats", version = "0.2.0" }
lace_codebook = { path = "../../lace/lace_codebook", version = "0.5.0" }
lace_stats = { path = "../../lace/lace_stats", version = "0.2.1" }
log = "0.4"
mdbook = "0.4"
pulldown-cmark = { version = "0.9", default-features = false }
6 changes: 6 additions & 0 deletions book/src/pcc/html/sats-high-unc.html

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions book/src/pcc/html/sats-low-unc.html

Large diffs are not rendered by default.

46 changes: 43 additions & 3 deletions book/src/pcc/pred.md
@@ -18,10 +18,50 @@ Determining how certain the model is in its ability to capture a prediction is d

Mathematically, uncertainty is formalized as the Jensen-Shannon divergence (JSD) between the state-level predictive distributions. Uncertainty goes from 0 to 1, 0 meaning that there is only one way to model a prediction, and 1 meaning that there are many ways to model a prediction and they all completely disagree.
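
The generalized Jensen-Shannon divergence described above can be illustrated with a small stand-alone sketch. This assumes discrete state-level predictive distributions and normalizes by the maximum value, `log(n)`; it is an illustration of the concept, not Lace's internal implementation:

```python
import numpy as np

def js_divergence(dists):
    """Generalized Jensen-Shannon divergence between n discrete
    distributions, normalized to [0, 1] by its maximum, log(n)."""
    P = np.asarray(dists, dtype=float)
    n = P.shape[0]

    def entropy(p):
        p = p[p > 0]  # treat 0 * log(0) as 0
        return -(p * np.log(p)).sum()

    jsd = entropy(P.mean(axis=0)) - np.mean([entropy(p) for p in P])
    return jsd / np.log(n)

# Identical state distributions: only one way to model the prediction
print(js_divergence([[0.5, 0.5], [0.5, 0.5]]))  # 0.0
# Completely disagreeing states: maximal uncertainty
print(js_divergence([[1.0, 0.0], [0.0, 1.0]]))  # 1.0
```

The divergence is the entropy of the average distribution minus the average entropy of the individual distributions, so it is zero exactly when the states agree.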

![Prediction uncertainty in unimodal data](prediction-uncertainty.png)

**Above.** Prediction uncertainty when predicting *Period_minutes* of a satellite in the satellites data set. Note that the uncertainty value here is driven mostly by the differing variances of the state-level predictive distributions.

Certain ignorance is when the model has zero data by which to make a prediction and instead falls back to the prior distribution. This is rare, but when it happens it will be apparent. To be as general as possible, the priors for a column's component distributions are generally much broader than the predictive distribution, so if you see a predictive distribution that is senselessly wide and does not look like the marginal distribution of that variable (which should follow the histogram of the data), you have certain ignorance. The fix is to fill in the data for items similar to the one you are predicting.

<div class=tabbed-blocks>

```python
import pandas as pd
from lace import examples, plot

satellites = examples.Satellites()

plot.prediction_uncertainty(
    satellites,
    "Period_minutes",
    given={"Class_of_Orbit": "GEO"},
)
```
</div>

{{#include html/sats-low-unc.html}}

**Above.** Prediction uncertainty when predicting *Period_minutes* of a geosynchronous satellite in the satellites dataset. Uncertainty is low. Though the state distributions differ slightly in their variance, they are relatively close, with similar means.

To visualize a higher-uncertainty prediction, we'll use `given` conditions from a record with a known data entry error.

<div class=tabbed-blocks>

```python
given = satellites["Intelsat 902", :].to_dicts()[0]

# remove all missing data
given = {k: v for k, v in given.items() if not pd.isnull(v)}

# remove the index and the target value
_ = given.pop("index")
_ = given.pop("Period_minutes")

plot.prediction_uncertainty(
    satellites,
    "Period_minutes",
    given=given,
)
```
</div>

{{#include html/sats-high-unc.html}}

**Above.** Prediction uncertainty when predicting *Period_minutes* of Intelsat 902. Though the mean predictive distribution (black line) has a relatively low variance, there is a lot of disagreement between some of the samples, leading to high epistemic uncertainty.

Certain ignorance is when the model has zero data by which to make a prediction and instead falls back to the prior distribution. This is rare, but when it happens it will be apparent. To be as general as possible, the priors for a column's component distributions are generally much broader than the predictive distribution, so if you see a predictive distribution that is senselessly wide and does not look like the marginal distribution of that variable (which should follow the histogram of the data), you have certain ignorance. The fix is to fill in the data for items similar to the one you are predicting.
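
A rough screen for this situation can be sketched as follows. The helper below is hypothetical (not part of Lace), and the threshold is an arbitrary assumption: it simply flags a predictive distribution that is far wider than the variable's marginal distribution, which is the symptom described above.

```python
import numpy as np

def possible_certain_ignorance(pred_samples, marginal_samples, ratio=3.0):
    """Heuristic (not part of Lace): flag a prediction whose predictive
    distribution is much wider than the marginal, suggesting the model
    fell back to a broad prior. `ratio` is an arbitrary threshold."""
    return np.std(pred_samples) > ratio * np.std(marginal_samples)

rng = np.random.default_rng(0)
marginal = rng.normal(100.0, 10.0, size=1000)    # data-driven spread
prediction = rng.normal(100.0, 80.0, size=1000)  # senselessly wide

print(possible_certain_ignorance(prediction, marginal))  # True
```

In practice you would compare samples simulated from the predictive distribution against the observed values of the column; a flagged prediction is a cue to fill in data for similar items.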
Binary file removed book/src/pcc/prediction-uncertainty.png
Binary file not shown.
22 changes: 11 additions & 11 deletions book/src/workflow/analysis.md
@@ -84,7 +84,7 @@ animals.predict("swims")
animals.predict(
"swims",
&Given::<usize>::Nothing,
Some(PredictUncertaintyType::JsDivergence),
true,
None,
);
```
@@ -94,7 +94,7 @@ animals.predict(
Which outputs

```python
(0, 0.008287057807910558)
(0, 0.04384630488890182)
```

The first number is the prediction. Lace predicts that *an* animal does not
@@ -121,7 +121,7 @@ animals.predict(
&Given::Conditions(vec![
("flippers", Datum::Categorical(lace::Category::U8(1)))
]),
Some(PredictUncertaintyType::JsDivergence),
true,
None,
);
```
@@ -130,10 +130,10 @@ animals.predict(
Output:

```python
(1, 0.05008037071634858)
(1, 0.09588592928237495)
```

The uncertainty is higher, but still quite low.
The uncertainty is a little higher, but still quite low.

Let's add some more conditions that are indicative of a swimming animal and see
how that affects the uncertainty.
@@ -151,7 +151,7 @@ animals.predict(
("flippers", Datum::Categorical(lace::Category::U8(1))),
("water", Datum::Categorical(lace::Category::U8(1))),
]),
Some(PredictUncertaintyType::JsDivergence),
true,
None,
);
```
@@ -160,10 +160,10 @@
Output:

```python
(1, 0.05116664361415335)
(1, 0.06761776764962134)
```

The uncertainty is basically the same.
The uncertainty is a bit lower now that we've added swim-consistent evidence.

How about we try to mess with Lace? Let's try to confuse it by asking it to
predict whether an animal with flippers that does not go in the water swims.
@@ -181,7 +181,7 @@ animals.predict(
("flippers", Datum::Categorical(lace::Category::U8(1))),
("water", Datum::Categorical(lace::Category::U8(0))),
]),
Some(PredictUncertaintyType::JsDivergence),
true,
None,
);
```
@@ -190,14 +190,14 @@
Output:

```python
(0, 0.32863593091906085)
(0, 0.36077426258767503)
```

The uncertainty is really high! We've successfully confused Lace.

## Evaluating likelihoods

Let's compute the likemportlihood to see what is going on
Let's compute the likelihood to see what is going on

<div class=tabbed-blocks>

