Docs/2023 09 28 updates and fixes #134

Merged (3 commits) on Sep 29, 2023
101 changes: 89 additions & 12 deletions README.md

@@ -68,6 +68,26 @@ distribution over your dataset, which enables users to...

and more, all in one place, without any explicit model building.

```python
import pandas as pd
import lace

# Create an engine from a dataframe
df = pd.read_csv("animals.csv", index_col=0)
engine = lace.Engine.from_df(df)

# Fit a model to the dataframe over 5000 steps of the fitting procedure
engine.update(5000)

# Show the statistical structure of the data -- which features are likely
# dependent on (predictive of) each other
engine.clustermap("depprob", zmin=0, zmax=1)
```

![Animals dataset dependence probability](assets/animals-depprob.png)



## The Problem

The goal of lace is to fill some of the massive chasm between standard machine
@@ -105,36 +125,62 @@ themselves from scratch, meaning they must know (or at least guess) the model.
PPL users must also know how to specify such a model in a way that is
compatible with the underlying inference procedure.

-### Who should not use lace
+### Example use cases

- **Combine data sources and understand how they interact.** For example, we
  may wish to predict cognitive decline from demographics, survey or task
  performance, EKG data, and other clinical data. Combined, this data would
  typically be very sparse (most patients will not have all fields filled
  in), and it is difficult to know how to explicitly model the interaction of
  these data layers. In Lace, we would just concatenate the layers and run
  them through.

  > **Reviewer (Contributor):** "Concatenate the layers and run them
  > through" -- I don't understand this; it sounds too simple, like how could
  > it be that easy? Maybe a rewording that explains why you can't do that
  > yourself? Or maybe I'm not getting it.
  >
  > **Author (Contributor):** It is that easy 😉
- **Understanding the amount and causes of uncertainty over time.** For
example, a farmer may wish to understand the likelihood of achieving a
specific yield over the growing season. As the season progresses, new
weather data can be added to the prediction in the form of conditions.
Uncertainty can be visualized as variance in the prediction, disagreement
between posterior samples, or multi-modality in the predictive distribution
(see [this blog post](https://redpoll.ai/blog/ml-uncertainty/) for more
information on uncertainty).
- **Data quality control.** Use `surprisal` to find anomalous data in the table
and use `-logp` to identify anomalies before they enter the table. Because
Lace creates a model of the data, we can also contrive methods to find data
that are *inconsistent* with that model, which we have used to good effect
in error finding.
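To make the `surprisal` idea concrete: surprisal is just negative log-likelihood, -log p(x), so values the model considers improbable score high. Below is a minimal self-contained sketch under an assumed Gaussian model (illustrative only -- this is not the Lace API, which scores values under its own learned model of the table):

```python
import math

def gaussian_surprisal(x: float, mean: float, std: float) -> float:
    """Surprisal -log p(x) under a normal model; high values flag anomalies."""
    log_p = -0.5 * math.log(2 * math.pi * std**2) - (x - mean) ** 2 / (2 * std**2)
    return -log_p

# Values the model finds improbable get large surprisal scores
data = [9.8, 10.1, 10.0, 9.9, 42.0]
scores = [gaussian_surprisal(x, mean=10.0, std=0.2) for x in data]
anomalies = [x for x, s in zip(data, scores) if s > 10.0]
print(anomalies)  # → [42.0]
```

In Lace itself the model is learned from the table, so no distribution has to be specified by hand.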

### Who should not use Lace

There are a number of use cases for which Lace is not suited:

- Non-tabular data such as images and text
- Highly optimizing specific predictions
  - Lace would rather over-generalize than overfit.


## Quick start

-Install the CLI and pylace (requires [rust and
-cargo](https://www.rust-lang.org/tools/install))
+### Installation
+
+Lace requires rust.

> **Reviewer (Contributor):** pylace shouldn't require Rust if you install it
> with pip.

+To install the CLI:

```console
$ cargo install --locked lace
-$ pip install py-lace
```

-First, use the CLI to fit a model to your data
+To install pylace

```console
-$ lace run --csv satellites.csv -n 5000 -s 32 --seed 1337 satellites.lace
+$ pip install pylace
```

-Then load the model and start asking questions
+### Examples
+
+Lace comes with two pre-fit example data sets: Satellites and Animals.

```python
->>> from lace import Engine
->>> engine = Engine(metadata='satellites.lace')
+>>> from lace.examples import Satellites
+>>> engine = Satellites()

# Predict the class of orbit given the satellite has a 75-minute
# orbital period and that it has a missing value of geosynchronous
@@ -176,9 +222,13 @@ And similarly in rust:

```rust,noplayground
use lace::prelude::*;
+use lace::examples::Example;

fn main() {
-    let mut engine = Engine::load("satellites.lace").unrwap();
+    // In rust, you can create an Engine or an Oracle. The Oracle is an
+    // immutable version of an Engine; it has the same inference functions as
+    // the Engine, but you cannot train or edit data.
+    let mut engine = Example::Satellites.engine().unwrap();

// Predict the class of orbit given the satellite has a 75-minute
// orbital period and that it has a missing value of geosynchronous
@@ -196,6 +246,33 @@ fn main() {
}
```

### Fitting a model

To fit a model to your own data, you can use the CLI:

```console
$ lace run --csv my-data.csv -n 1000 my-data.lace
```

...or initialize an engine from a file or dataframe.

```python
>>> import pandas as pd # Lace supports polars as well
>>> from lace import Engine
>>> engine = Engine.from_df(pd.read_csv("my-data.csv", index_col=0))
>>> engine.update(1_000)
>>> engine.save("my-data.lace")
```

You can monitor the progress of training using diagnostic plots:

```python
>>> from lace.plot import diagnostics
>>> diagnostics(engine)
```

![Animals MCMC convergence](assets/animals-convergence.png)

## License

Lace is licensed under Server Side Public License (SSPL), which is a copyleft
Binary file added assets/animals-convergence.png
Binary file added assets/animals-depprob.png
Binary file added assets/sats-depprob.png
Binary file added assets/sats-period-uncertainty.png
2 changes: 1 addition & 1 deletion book/src/appendix/references.md

@@ -51,4 +51,4 @@ and examples, see Mansinghka et al [^pcc-jmlr].
[^pcc-jmlr]: Mansinghka, V., Shafto, P., Jonas, E., Petschulat, C., Gasner,
M., & Tenenbaum, J. B. (2016). Crosscat: A fully bayesian nonparametric
method for analyzing heterogeneous, high dimensional data.
-[(PDF)](jmlr.org/papers/volume17/11-392/11-392.pdf)
+[(PDF)](https://jmlr.org/papers/volume17/11-392/11-392.pdf)
2 changes: 1 addition & 1 deletion book/src/appendix/stats-primer.md

@@ -76,5 +76,5 @@ The CRP metaphor works like this: you are on your lunch break and, as one often

where \\(z_i\\) is the table of customer i, \\(n_k\\) is the number of customers currently seated at table \\(k\\), and \\(N_{-i}\\) is the total number of seated customers, not including customer i (who is still deciding where to sit).

-Under the CRP formalism, we make inferences about what datum belongs to which category. The weight vector is implicit. That's it. For information on how inference is done in DPMMs check out the [literature recommendations](#literature-recommendations).
+Under the CRP formalism, we make inferences about what datum belongs to which category. The weight vector is implicit. That's it. For information on how inference is done in DPMMs check out the [literature recommendations](stats-primer.md).
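The seating rule above is simple enough to simulate directly. The sketch below (illustrative only, not part of the Lace codebase) seats each customer at an existing table \\(k\\) with probability proportional to \\(n_k\\) and at a new table with probability proportional to \\(\alpha\\), each normalized by \\(N_{-i} + \alpha\\):

```python
import random

def crp_seating(n_customers: int, alpha: float, seed: int = 0) -> list[int]:
    """Simulate table assignments z_i under the Chinese Restaurant Process."""
    rng = random.Random(seed)
    counts: list[int] = []      # counts[k] = n_k, customers at table k
    assignment: list[int] = []  # assignment[i] = z_i
    for i in range(n_customers):
        # Existing table k has weight n_k; a new table has weight alpha.
        # N_{-i} is simply i: everyone seated so far.
        r = rng.random() * (i + alpha)
        acc = 0.0
        for k, w in enumerate(counts + [alpha]):
            acc += w
            if r < acc:
                break
        if k == len(counts):
            counts.append(1)    # open a new table
        else:
            counts[k] += 1
        assignment.append(k)
    return assignment

tables = crp_seating(100, alpha=1.0)
# Larger alpha tends to produce more occupied tables.
```

This is the generative view; inference in a DPMM runs it in reverse, resampling each \\(z_i\\) given the assignments of the other data.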

Binary file added book/src/assets/animals-convergence.png
Binary file added book/src/assets/animals-depprob.png
2 changes: 1 addition & 1 deletion book/src/workflow/workflow.md

@@ -29,7 +29,7 @@ Open the model in lace
```python
import lace

-engine = lace.Engine(metadata='metadata.lace')
+engine = lace.Engine.load('metadata.lace')
```

```rust,noplayground
2 changes: 1 addition & 1 deletion book/theme/index.hbs

@@ -133,7 +133,7 @@
<i class="fa fa-paint-brush"></i>
</button>
<ul id="theme-list" class="theme-popup" aria-label="Themes" role="menu">
-<!-- <li role="none"><button role="menuitem" class="theme" id="light">{{ theme_option "Light" }}</button></li> -->
+<li role="none"><button role="menuitem" class="theme" id="light">{{ theme_option "Light" }}</button></li>
<!-- <li role="none"><button role="menuitem" class="theme" id="rust">{{ theme_option "Rust" }}</button></li> -->
<!-- <li role="none"><button role="menuitem" class="theme" id="coal">{{ theme_option "Coal" }}</button></li> -->
<!-- <li role="none"><button role="menuitem" class="theme" id="navy">{{ theme_option "Navy" }}</button></li> -->