
Commit 75120e6

Initial commit
0 parents  commit 75120e6

File tree

5 files changed: +329 -0 lines changed

CODE_OF_CONDUCT.md

+80
@@ -0,0 +1,80 @@
# Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
  advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
  address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.

This Code of Conduct also applies outside the project spaces when there is a
reasonable belief that an individual's behavior may have a negative impact on
the project or its community.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <[email protected]>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq

CONTRIBUTING.md

+32
@@ -0,0 +1,32 @@
# Contributing to Conjugate Estimators

## Our Development Process
This project is for demonstration purposes only. It's not planned to undergo active development.

## Pull Requests
We actively welcome your pull requests.

1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").

## Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Meta's open source projects.

Complete your CLA here: <https://code.facebook.com/cla>

## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.

Meta has a [bounty program](https://www.facebook.com/whitehat/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.

## License
By contributing to this repository, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.

LICENSE.md

+21
@@ -0,0 +1,21 @@
MIT License

Copyright (c) Meta Platforms, Inc. and affiliates.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

+40
@@ -0,0 +1,40 @@

# A demonstration of the conjugate beta estimator statistical algorithm

This repo contains a reference implementation of a statistical algorithm called
the Conjugate Beta Estimator (CBE) for computing confidence intervals (CIs) for
population means from (weighted) sample means and potentially noisy labels.

The basic formula for CBE is

    alpha = mu*n + alpha_prior
    beta = (1-mu)*n + beta_prior
    ci = [ppf(0.025, alpha, beta), ppf(0.975, alpha, beta)]

where mu is the mean of the (weighted) sample labels and n is the sample size in bits (see below).
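
A condensed sketch of this computation, mirroring the cbe() helper in cbe.py (the tiny
default priors and the central 95% interval follow that script; the example numbers are
illustrative):

    from scipy.stats import beta

    def cbe_ci(mu, n, alpha_prior=1e-9, beta_prior=1e-9):
        # Fold the (weighted) sample mean mu and the sample size n into Beta parameters.
        a = mu * n + alpha_prior
        b = (1 - mu) * n + beta_prior
        # Central 95% credible interval from the Beta distribution.
        return beta.ppf(0.025, a, b), beta.ppf(0.975, a, b)

    print(cbe_ci(mu=0.17, n=100))  # e.g. a weighted sample mean of 0.17 over 100 labels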

Both mu and n can be adjusted to account for label noise.

mu should be adjusted using the Rogan-Gladen (RG) estimator for the sample mean:

    rg(mu, sensitivity, specificity) = (mu + specificity - 1) / (sensitivity + specificity - 1)

See: https://en.wikipedia.org/wiki/Beth_Gladen.
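
In code the correction is a one-liner, mirroring the rg() helper in cbe.py (the
sensitivity/specificity values below are the ones used in the demo script; the observed
mean is illustrative):

    def rg(mu, sensitivity, specificity):
        # Invert o = tpr*p + fpr*(1-p), with tpr = sensitivity and fpr = 1 - specificity,
        # to recover the true mean p from the observed (noisy) mean o.
        return (mu + specificity - 1) / (sensitivity + specificity - 1)

    print(rg(0.20, sensitivity=0.85, specificity=0.93))  # corrected mean estimate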

n should be adjusted using the following formula:

    num_bits_per_label = 1 - entropy((sensitivity + specificity) / 2)
    n_modified = num_bits_per_label * n

The reason for the num_bits_per_label formula is that the RG formula becomes increasingly
unstable as the mean of sensitivity and specificity approaches 0.5 (the maximum-entropy
value). The RG formula is maximally stable when sensitivity = specificity = 1, which is the
case when labels are perfectly accurate. Therefore, we want the CI derived from the Beta
distribution to grow wider as (sensitivity + specificity)/2 approaches 0.5.
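
A minimal sketch of this adjustment (the entropy() helper mirrors the one in cbe.py; the
numbers are illustrative):

    import numpy as np

    def entropy(p):
        # Binary entropy of a Bernoulli(p) variable, in bits.
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    sensitivity, specificity = 0.85, 0.93
    n = 100
    num_bits_per_label = 1 - entropy((sensitivity + specificity) / 2)
    n_modified = num_bits_per_label * n  # effective sample size shrinks as labels get noisier
    print(n_modified)

Feeding n_modified, together with the RG-corrected mean, into the CBE formula above widens
the CI as label quality degrades.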

See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.

## License
Conjugate Estimators is MIT licensed, as found in the LICENSE file.

cbe.py

+156
@@ -0,0 +1,156 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.

import numpy as np
from scipy.stats import beta


##### Initialize constants

# fix the random seed for reproducibility
np.random.seed(0)

# How many rows are in the population table
population_size = 1000

# What % of population rows will get sampled
sample_rate = 0.01

# The number of population rows that get sampled
sample_size = int(population_size * sample_rate)
# NOTE: the line below overrides the computed value above with a fixed sample size
sample_size = 100

# The beta parameters that are used for sampling ground truth labels
# for each row.
beta_prior_negative = 50
beta_prior_positive = 10


###### Create the population

# Sample the p(positive) score for each population row
pop_positive_probs = np.random.beta(
    beta_prior_positive, beta_prior_negative, size=population_size
)

# Sample a ground truth label for each population row
pop_labels = np.random.rand(pop_positive_probs.size) < pop_positive_probs
pop_mean = pop_labels.mean()
print("Population mean label:", pop_mean)

num_trials = 100

# Add some noise to the probabilities to simulate importance sampling with an imperfect classifier
noise_mean = 0
noise_std_dev = 0.1
classifier_preds = pop_positive_probs + np.random.normal(
    noise_mean, noise_std_dev, pop_positive_probs.shape
)
classifier_preds = np.clip(classifier_preds, 0, 1) + 1e-9


# take the sqrt because that's what we do today
# TODO verify necessity
sample_weights = np.sqrt(classifier_preds)

pop_sampling_probs = sample_weights / sample_weights.sum()


sampled_indices = np.random.choice(
    np.arange(population_size),
    size=(num_trials, sample_size),
    replace=True,
    p=pop_sampling_probs,
)

# the probability each sample is picked
sample_probs = pop_sampling_probs[sampled_indices]
sample_labels = pop_labels[sampled_indices]
sample_rep_weights = (sample_size / sample_probs) / (sample_size / sample_probs).sum(
    -1, keepdims=True
)


# This function implements the conjugate beta estimator algorithm.
# Its formula is
#   alpha = mu*n + alpha_prior
#   beta = (1-mu)*n + beta_prior
#   ci = [ppf(0.025, alpha, beta), ppf(0.975, alpha, beta)]
# where mu is the mean of the (weighted) sample labels and n is the sample size in bits
# (see below).
def cbe(sample_mean, sample_size):
    alpha_param = sample_mean * sample_size + 1e-9  # optional: add an alpha prior
    beta_param = (1 - sample_mean) * sample_size + 1e-9  # optional: add a beta prior
    lower_bound = beta.ppf(0.025, alpha_param, beta_param)
    median = beta.ppf(0.5, alpha_param, beta_param)
    upper_bound = beta.ppf(0.975, alpha_param, beta_param)
    return lower_bound.mean(), median.mean(), upper_bound.mean()


### Perfect labels section
sample_mean = (sample_labels * sample_rep_weights).sum(-1)
res = cbe(sample_mean, sample_size)
print("CI with perfect labels:", res)


### Noisy labels section

# Generate random noise
rns = np.random.uniform(0, 1, size=sample_labels.shape)
noisy_labels = sample_labels.copy().astype(int)

sensitivity = 0.85
specificity = 0.93

# Flip some negative samples to positive with probability 1 - specificity
# (the false positive rate of the simulated labeler)
noisy_labels[sample_labels == 0] += rns[sample_labels == 0] > specificity

# Flip some positive samples to negative with probability 1 - sensitivity
# (the false negative rate of the simulated labeler)
noisy_labels[sample_labels == 1] = (
    noisy_labels[sample_labels == 1]
    + (rns[sample_labels == 1] > sensitivity).astype(int)
) % 2


# Rogan-Gladen estimator for the sample mean. See: https://en.wikipedia.org/wiki/Beth_Gladen
# To derive it, we start with the following equation, where o is the observed mean, p is the true mean,
# and tpr and fpr are the true positive and false positive rates.
#   o = tpr*p + fpr*(1-p)
#   o = tpr*p + fpr - fpr*p
#   p*(tpr - fpr) = o - fpr
#   p = (o - fpr)/(tpr - fpr)
def rg(mean):
    return (mean + specificity - 1) / (sensitivity + specificity - 1)


noisy_mean = (noisy_labels * sample_rep_weights).sum(-1)
print("Noisy mean:", noisy_mean.mean())
res = cbe(noisy_mean, sample_size)
print("CI with noisy labels:", res)

rg_mean = rg(noisy_mean)
print("RG sample mean:", rg_mean.mean())

res = cbe(rg_mean, sample_size)
print("CI with RG mean and standard sample size:", res)


def entropy(val):
    return -val * np.log2(val) - (1 - val) * np.log2(1 - val)


# When the mean of sensitivity and specificity approaches 0.5, the denominator of the
# Rogan-Gladen formula (mean + specificity - 1)/(sensitivity + specificity - 1),
# i.e. (sensitivity + specificity - 1), approaches 0. This causes the RG mean estimate
# to be highly unstable to small fluctuations in sensitivity and/or specificity.
# To counter this instability, we want to widen the CI. The entropy in this case is 1
# (because entropy(0.5) = 1). In the formula below, an entropy value of 1 leads to
# 0 bits per sample, which amounts to having no samples when those samples are
# basically random noise. However, as the mean of sensitivity and specificity
# approaches 1, the entropy approaches 0, leading to about 1 bit per sample,
# which is the behavior of the original CBE algorithm.
num_bits = (1 - entropy((sensitivity + specificity) / 2)) * sample_size
res = cbe(rg_mean, num_bits)
print("CI with RG mean and entropy weighted sample size:", res)
