Initial commit

yariv · yariv · commit 75120e6cd03d · 2024-11-05T11:32:41.000-08:00
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
@@ -0,0 +1,80 @@
+# Code of Conduct
+
+## Our Pledge
+
+In the interest of fostering an open and welcoming environment, we as
+contributors and maintainers pledge to make participation in our project and
+our community a harassment-free experience for everyone, regardless of age, body
+size, disability, ethnicity, sex characteristics, gender identity and expression,
+level of experience, education, socio-economic status, nationality, personal
+appearance, race, religion, or sexual identity and orientation.
+
+## Our Standards
+
+Examples of behavior that contributes to creating a positive environment
+include:
+
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+
+Examples of unacceptable behavior by participants include:
+
+* The use of sexualized language or imagery and unwelcome sexual attention or
+advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic
+address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+professional setting
+
+## Our Responsibilities
+
+Project maintainers are responsible for clarifying the standards of acceptable
+behavior and are expected to take appropriate and fair corrective action in
+response to any instances of unacceptable behavior.
+
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviors that they deem inappropriate,
+threatening, offensive, or harmful.
+
+## Scope
+
+This Code of Conduct applies within all project spaces, and it also applies when
+an individual is representing the project or its community in public spaces.
+Examples of representing a project or community include using an official
+project e-mail address, posting via an official social media account, or acting
+as an appointed representative at an online or offline event. Representation of
+a project may be further defined and clarified by project maintainers.
+
+This Code of Conduct also applies outside the project spaces when there is a
+reasonable belief that an individual's behavior may have a negative impact on
+the project or its community.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported by contacting the project team at <opensource-conduct@meta.com>. All
+complaints will be reviewed and investigated and will result in a response that
+is deemed necessary and appropriate to the circumstances. The project team is
+obligated to maintain confidentiality with regard to the reporter of an incident.
+Further details of specific enforcement policies may be posted separately.
+
+Project maintainers who do not follow or enforce the Code of Conduct in good
+faith may face temporary or permanent repercussions as determined by other
+members of the project's leadership.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+[homepage]: https://www.contributor-covenant.org
+
+For answers to common questions about this code of conduct, see
+https://www.contributor-covenant.org/faq
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,32 @@
+# Contributing to Conjugate Estimators
+
+## Our Development Process
+This project is for demonstration purposes only. It's not planned to undergo active development.
+
+## Pull Requests
+We actively welcome your pull requests.
+
+1. Fork the repo and create your branch from `main`.
+2. If you've added code that should be tested, add tests.
+3. If you've changed APIs, update the documentation.
+4. Ensure the test suite passes.
+5. Make sure your code lints.
+6. If you haven't already, complete the Contributor License Agreement ("CLA").
+
+## Contributor License Agreement ("CLA")
+In order to accept your pull request, we need you to submit a CLA. You only need
+to do this once to work on any of Meta's open source projects.
+
+Complete your CLA here: <https://code.facebook.com/cla>
+
+## Issues
+We use GitHub issues to track public bugs. Please ensure your description is
+clear and has sufficient instructions to be able to reproduce the issue.
+
+Meta has a [bounty program](https://www.facebook.com/whitehat/) for the safe
+disclosure of security bugs. In those cases, please go through the process
+outlined on that page and do not file a public issue.
+
+## License
+By contributing to this repository, you agree that your contributions will be licensed
+under the LICENSE file in the root directory of this source tree.
diff --git a/LICENSE.md b/LICENSE.md
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) Meta Platforms, Inc. and affiliates.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,40 @@
+
+# A demonstration of the conjugate beta estimator statistical algorithm
+
+This repo contains a reference implementation of a statistical algorithm called
+Conjugate Beta Estimator (CBE) for computing
+confidence intervals (CIs) for population means using (weighted) sample means and
+potentially noisy labels.
+
+The basic formula for CBE is
+
+    alpha = mu*n + alpha_prior
+    beta = (1-mu)*n + beta_prior
+    ci = [ppf(0.025, alpha, beta), ppf(0.975, alpha, beta)]
+
+where mu is the mean of the (weighted) sample labels and n is the sample size in bits.
+
+Both mu and n can be adjusted to account for label noise.
+
+mu should be adjusted using the Rogan-Gladen (RG) estimator for the sample mean:
+
+    rg(mu, sensitivity, specificity) = (mu + specificity - 1) / (sensitivity + specificity - 1)
+
+See: https://en.wikipedia.org/wiki/Beth_Gladen.
+
+n should be adjusted using the following formula:
+
+    num_bits_per_label = (1 - entropy((sensitivity + specificity) / 2))
+    n_modified = num_bits_per_label * n
+
+The reason for the num_bits_per_label formula is that the rg formula is increasingly unstable when
+the mean of sensitivity and specificity approaches 0.5 (the max entropy value). The rg formula
+is maximally stable when sensitivity = specificity = 1, which is the case when labels are perfectly
+accurate. Therefore, we want the CI derived from the Beta distribution to grow wider as
+(sensitivity + specificity)/2 approaches 0.5.
+
+
+See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.
+
+## License
+Conjugate Estimators is MIT licensed, as found in the LICENSE file.
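
The label-noise adjustments described in the README can be sketched with the standard library alone. The helper names below (`binary_entropy`, `rogan_gladen`, `effective_sample_size`) and the example inputs are illustrative assumptions, not part of `cbe.py`. The Beta quantile step is left out because it needs `scipy.stats.beta.ppf`.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable; taken as 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def rogan_gladen(mu: float, sensitivity: float, specificity: float) -> float:
    """Correct an observed positive rate for known label noise."""
    denom = sensitivity + specificity - 1
    if abs(denom) < 1e-12:
        raise ValueError("sensitivity + specificity must differ from 1")
    return (mu + specificity - 1) / denom


def effective_sample_size(n: float, sensitivity: float, specificity: float) -> float:
    """Scale n by the bits of information carried by each noisy label."""
    bits_per_label = 1 - binary_entropy((sensitivity + specificity) / 2)
    return bits_per_label * n


# Hypothetical inputs: an observed (noisy) mean of 0.2 from 100 samples.
sensitivity, specificity = 0.85, 0.93
mu_adjusted = rogan_gladen(0.2, sensitivity, specificity)
n_adjusted = effective_sample_size(100, sensitivity, specificity)

# Beta parameters with no prior, following the basic CBE formula.
alpha = mu_adjusted * n_adjusted
beta_param = (1 - mu_adjusted) * n_adjusted
print(mu_adjusted, n_adjusted, alpha, beta_param)
```

With these inputs the adjusted sample size is roughly half of the raw one, so the resulting Beta distribution is noticeably wider than it would be with perfect labels. The RG-adjusted mean can fall outside [0, 1] for small samples; a caller may want to clip it before computing Beta parameters.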
diff --git a/cbe.py b/cbe.py
@@ -0,0 +1,156 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+
+import numpy as np
+import pandas as pd
+import plotly.graph_objects as go
+from plotly import express as px
+from scipy.stats import beta, gamma
+
+
+##### Initialize constants
+
+# fix the random seed for reproducibility
+np.random.seed(0)
+
+# How many rows are in the population table
+population_size = 1000
+
+# What % of population rows will get sampled
+sample_rate = 0.01
+
+# The number of population rows that get sampled in each trial
+sample_size = int(population_size * sample_rate)  # == 10, but overridden below
+sample_size = 100
+
+# The beta parameters that are used for sampling ground truth labels
+# for each row.
+beta_prior_negative = 50
+beta_prior_positive = 10
+
+
+###### Create the population
+
+# Sample the p(positive) score for each population row
+pop_positive_probs = np.random.beta(
+    beta_prior_positive, beta_prior_negative, size=(population_size)
+)
+
+# Sample a ground truth label for each population row
+pop_labels = np.random.rand((pop_positive_probs.size)) < pop_positive_probs
+pop_mean = pop_labels.mean()
+print("Population mean label:", pop_mean)
+
+num_trials = 100
+
+# Add some noise to the probabilities to simulate importance sampling with an imperfect classifier
+noise_mean = 0
+noise_std_dev = 0.1
+classifier_preds = (pop_positive_probs) + np.random.normal(
+    noise_mean, noise_std_dev, pop_positive_probs.shape
+)
+classifier_preds = np.clip(classifier_preds, 0, 1) + 1e-9
+
+
+# take the sqrt because that's what we do today
+# TODO verify necessity
+sample_weights = np.sqrt(classifier_preds)
+
+pop_sampling_probs = sample_weights / sample_weights.sum()
+
+
+sampled_indices = np.random.choice(
+    np.arange(population_size),
+    size=(num_trials, sample_size),
+    replace=True,
+    p=pop_sampling_probs,
+)
+
+# the probability each sample is picked
+sample_probs = pop_sampling_probs[sampled_indices]
+sample_labels = pop_labels[sampled_indices]
+sample_rep_weights = (sample_size / sample_probs) / (sample_size / sample_probs).sum(
+    -1, keepdims=True
+)
+
+
+# This function implements the conjugate beta estimator algorithm.
+# Its formula is
+#   alpha = mu*n + alpha_prior
+#   beta = (1-mu)*n + beta_prior
+#   ci = [ppf(0.025, alpha, beta), ppf(0.975, alpha, beta)]
+# where mu is the mean of the (weighted) sample labels and n is the sample size in bits
+# (see below).
+def cbe(sample_mean, sample_size):
+    alpha_param = sample_mean * sample_size + 1e-9  # optional: add an alpha prior
+    beta_param = (1 - sample_mean) * sample_size + 1e-9  # optional: add a beta prior
+    lower_bound = beta.ppf(0.025, alpha_param, beta_param)
+    median = beta.ppf(0.5, alpha_param, beta_param)
+    upper_bound = beta.ppf(0.975, alpha_param, beta_param)
+    return lower_bound.mean(), median.mean(), upper_bound.mean()
+
+
+### Perfect labels section
+sample_mean = (sample_labels * sample_rep_weights).sum(-1)
+res = cbe(sample_mean, sample_size)
+print("CI with perfect labels:", res)
+
+
+### Noisy labels section
+
+# Generate random noise
+rns = np.random.uniform(0, 1, size=sample_labels.shape)
+noisy_labels = sample_labels.copy().astype(int)
+
+sensitivity = 0.85
+specificity = 0.93
+
+# Flip negative samples to positive with probability 1 - specificity (false positives)
+noisy_labels[sample_labels == 0] += rns[sample_labels == 0] > specificity
+
+# Flip positive samples to negative with probability 1 - sensitivity (false negatives)
+noisy_labels[sample_labels == 1] = (
+    noisy_labels[sample_labels == 1]
+    + (rns[sample_labels == 1] > sensitivity).astype(int)
+) % 2
+
+
+# Rogan-Gladen estimator for the sample mean. See: https://en.wikipedia.org/wiki/Beth_Gladen
+# To derive it, start from the equation below, where o is the observed mean, p is the true mean,
+# tpr = sensitivity is the true positive rate, and fpr = 1 - specificity is the false positive rate.
+#   o = tpr*p + fpr*(1-p)
+#   o = tpr*p + fpr - fpr*p
+#   p*(tpr - fpr) = o - fpr
+#   p = (o - fpr)/(tpr - fpr) = (o + specificity - 1)/(sensitivity + specificity - 1)
+def rg(mean):
+    return (mean + specificity - 1) / (sensitivity + specificity - 1)
+
+
+noisy_mean = (noisy_labels * sample_rep_weights).sum(-1)
+print("Noisy mean:", noisy_mean.mean())
+res = cbe(noisy_mean, sample_size)
+print("CI with noisy labels:", res)
+
+rg_mean = rg((noisy_mean))
+print("RG sample mean:", rg_mean.mean())
+
+res = cbe(rg_mean, sample_size)
+print("CI with RG mean and standard sample size", res)
+
+
+def entropy(val):
+    return -val * np.log2(val) - (1 - val) * np.log2(1 - val)
+
+
+# When the mean of sensitivity and specificity approaches 0.5, the denominator
+# in the Rogan-Gladen formula, (mean + specificity - 1)/(sensitivity + specificity - 1),
+# approaches 0. This makes the RG mean estimate highly sensitive to small
+# fluctuations in sensitivity and/or specificity. To counter this instability, we want
+# to widen the CI. The entropy in this case is 1 (because entropy(0.5) = 1).
+# In the formula below, an entropy value of 1 leads to 0 bits per sample, which
+# amounts to having no samples when those samples are basically random noise.
+# However, as the mean of sensitivity and specificity approaches 1, the entropy
+# approaches 0, leading to about 1 bit per sample, which is the behavior
+# of the original CBE algorithm.
+num_bits = (1 - entropy((sensitivity + specificity) / 2)) * sample_size
+res = cbe(rg_mean, num_bits)
+print("CI with RG mean and entropy weighted sample size:", res)
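
As a sanity check on the derivation used by `rg`, a small simulation can confirm that the Rogan-Gladen correction recovers the true mean when positives are kept with probability `sensitivity` and negatives are flipped with probability `1 - specificity`. This is a standard-library sketch and not part of `cbe.py`; `add_label_noise`, the 0.2 prevalence, and the population of 200,000 are illustrative assumptions.

```python
import random


def add_label_noise(labels, sensitivity, specificity, rng):
    """Return noisy copies of boolean labels under the given error rates."""
    noisy = []
    for label in labels:
        if label:
            # A true positive is observed as positive with probability `sensitivity`.
            noisy.append(rng.random() < sensitivity)
        else:
            # A true negative is observed as positive with probability 1 - specificity.
            noisy.append(rng.random() >= specificity)
    return noisy


def rogan_gladen(observed_mean, sensitivity, specificity):
    """Invert o = tpr*p + fpr*(1-p) with tpr = sensitivity, fpr = 1 - specificity."""
    return (observed_mean + specificity - 1) / (sensitivity + specificity - 1)


rng = random.Random(0)  # fixed seed for reproducibility
true_labels = [rng.random() < 0.2 for _ in range(200_000)]
noisy_labels = add_label_noise(true_labels, 0.85, 0.93, rng)

true_mean = sum(true_labels) / len(true_labels)
noisy_mean = sum(noisy_labels) / len(noisy_labels)
rg_mean = rogan_gladen(noisy_mean, 0.85, 0.93)
print("true:", true_mean, "noisy:", noisy_mean, "rg:", rg_mean)
```

The noisy mean is biased upward because false positives among the many negatives outnumber the missed positives, while the RG-corrected mean should land close to the true mean. If the noise were injected with the two rates swapped, the same correction would no longer match and the estimate would stay biased.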