Add KS tests for weighted sampling #1530

dhardy · 2024-11-18T19:55:47Z

Added a CHANGELOG.md entry

Motivation

Some of these are non-trivial distributions we didn't really test before.

To validate solution of #1476.

Details

Single-element weighted sampling is simple enough.

fn choose_two_iterator is also simple enough: there are no weights, so we can just assign each pair of results a unique index in the list of 100 * 99 / 2 possibilities (nothing that we sort pairs since the order of chosen elements is not specified).

fn choose_two_weighted_indexed gets a bit more complicated; I choose to approach it by building a table for the CDF of size num*num including impossible variants. Most of the tests don't pass, so there must be a mistake here.

Aside: using let key = rng.random::<f64>().ln() / weight; (src/seq/index.rs:392) may help with #1476 but does not fix the above.

Some failures

dhardy · 2024-11-19T09:31:07Z

I can confirm that choose_multiple_weighted has a significant problem, since sampling two elements from 0, 1, 2 with weights 1, 1/2, 1/3 a million times and sorting yields 532298 counts of (0, 1), 338524 counts of (0, 2) and 129178 counts of (1, 2). (Unlike #1476, this example does not require very small weights.)

This is sampling without replacement, so expected samples are:

(0,1) or (1, 0): 531818
(0, 2) or (2, 0): 339393
(1, 2) or (2, 1): 128788

…ust-random#1476 Also improves choose_two_weighted_indexed time by 23% (excluding new test)

Approx 2% improvement to tests sampling 2 of 100 elements

This results in approx 18% faster tests choosing 2-in-100 items

dhardy · 2024-11-19T11:04:01Z

I fixed my calculation of the CDF, found a variant which failed like #1476, fixed this by taking the logarithm of keys, and applied some optimisation to the Efraimidis-Spirakis algorithm.

benjamin-lieser

Looks correct. We might want to test how the performance is for big amount

distr_test/tests/weighted.rs

benjamin-lieser · 2024-11-25T11:06:51Z

distr_test/tests/weighted.rs

+}
+
+#[test]
+fn choose_two_weighted_indexed() {


This is probably more complex than needed, but looks correct.
It's probably worth implementing chi squared at some point, but this should also be quite sensitive.

This is probably more complex than needed, but looks correct.

You mean the use of an Adapter? Yes, but I'd sooner do this than revise the KS test API (which is well adapted for other usages).

It's probably worth implementing chi squared at some point, but this should also be quite sensitive.

A fair point.

I mean using KS for these distributions (chi squared would be more straight forward), Adapter I think is fine.

src/seq/index.rs

dhardy · 2024-11-26T10:13:08Z

@benjamin-lieser I addressed most of your feedback, using BinaryHeap as suggested. Best-of-three runs before this change (a039a7f):

running 3 tests
test choose_weighted_indexed ... ok
test choose_one_weighted_indexed ... ok
test choose_two_weighted_indexed ... ok

test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 6 filtered out; finished in 3.69s

Best-of-three after (831353f):

running 3 tests
test choose_weighted_indexed ... ok
test choose_one_weighted_indexed ... ok
test choose_two_weighted_indexed ... ok

test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 6 filtered out; finished in 3.98s

Yes, that's a regression, though given that we're using amount ∊ {1, 2} these tests aren't very representative benchmarks. I agree that in theory this approach should scale better to large amounts.

dhardy · 2024-11-26T10:26:46Z

I added a benchmark (using weights of 1.0 everywhere, which is probably the worst case for the jumps in the A-ExpJ algorithm.

Edit: I removed reports of outliers (4-19% so it's clear my machine is not ideally configured for benchmarks, but still good enough for our conclusions).

Before this PR (0ff946c; 554d331 is similar):

seq_slice_choose_multiple_weighted_1_of_1000
                        time:   [2.8688 µs 2.8842 µs 2.9014 µs]
seq_slice_choose_multiple_weighted_950_of_1000
                        time:   [5.0881 µs 5.1248 µs 5.1672 µs]
seq_slice_choose_multiple_weighted_10_of_100
                        time:   [611.72 ns 614.87 ns 618.17 ns]
seq_slice_choose_multiple_weighted_90_of_100
                        time:   [695.34 ns 699.24 ns 703.74 ns]

At d645952 (the first change to fn sample_efraimidis_spirakis, which is just taking the logarithm of the key to fix #1476):

seq_slice_choose_multiple_weighted_1_of_1000
                        time:   [6.7046 µs 6.7319 µs 6.7612 µs]
seq_slice_choose_multiple_weighted_950_of_1000
                        time:   [8.9848 µs 9.0293 µs 9.0784 µs]
seq_slice_choose_multiple_weighted_10_of_100
                        time:   [1.0019 µs 1.0062 µs 1.0108 µs]
seq_slice_choose_multiple_weighted_90_of_100
                        time:   [1.0887 µs 1.0953 µs 1.1020 µs]

At b806b29 ("keep at most amount candidates", using sort_unstable at each addition beyond the first amount):

seq_slice_choose_multiple_weighted_1_of_1000
                        time:   [5.8209 µs 5.8382 µs 5.8574 µs]
seq_slice_choose_multiple_weighted_950_of_1000
                        time:   [4.7545 ms 4.7821 ms 4.8133 ms]
seq_slice_choose_multiple_weighted_10_of_100
                        time:   [2.0914 µs 2.0966 µs 2.1028 µs]
seq_slice_choose_multiple_weighted_90_of_100
                        time:   [33.773 µs 33.887 µs 34.015 µs]

At a039a7f (0f662b1 is similar):

seq_slice_choose_multiple_weighted_1_of_1000
                        time:   [1.0601 µs 1.0626 µs 1.0652 µs]
seq_slice_choose_multiple_weighted_950_of_1000
                        time:   [450.93 µs 451.58 µs 452.26 µs]
seq_slice_choose_multiple_weighted_10_of_100
                        time:   [1.7520 µs 1.7559 µs 1.7602 µs]
seq_slice_choose_multiple_weighted_90_of_100
                        time:   [7.6607 µs 7.6817 µs 7.7069 µs]

At 831353f:

seq_slice_choose_multiple_weighted_1_of_1000
                        time:   [1.0498 µs 1.0509 µs 1.0523 µs]
seq_slice_choose_multiple_weighted_950_of_1000
                        time:   [29.491 µs 29.545 µs 29.618 µs]
seq_slice_choose_multiple_weighted_10_of_100
                        time:   [1.7660 µs 1.7695 µs 1.7738 µs]
seq_slice_choose_multiple_weighted_90_of_100
                        time:   [2.9548 µs 2.9593 µs 2.9649 µs]

That's not exactly what I expected (my previous changes were guided by the time to run the KS test).

According to the benchmarks we would be better off with the code of d645952, however one thing not represented here is memory usage: that old version of fn sample_efraimidis_spirakis stores length number of elements while more recent commits store only amount elements.

I think therefore we should keep using this latest version.

benjamin-lieser

I agree that we should use this version. The other one might be faster in our benchmarks where length and amount are not too large, but it is not true to the idea of the algorithm and users might expect it to have the correct memory footprint.

…t-random#1530) - Extra testing for weighted sampling - Fix IndexedRandom::choose_multiple_weighted with very small keys - Use A-ExpJ algorithm with BinaryHeap for better performance with large length / amount

dhardy added 10 commits November 18, 2024 17:06

Add KS test for WeightedIndex

2e28810

Add KS test for WeightedAliasIndex

eb1836b

Add KS test for WeightedTreeIndex

a8ce256

Add KS test for IndexedRandom::choose_weighted

9e03a15

Add KS test for IndexedRandom::choose_multiple_weighted (one element)

4b0a296

Add KS test for IndexedRandom::choose_multiple_weighted (two elements)

2584212

Some failures

Add KS test for IteratorRandom::choose

336ddbe

Add KS test for IteratorRandom::choose_stable

2ef212b

Add KS test for IteratorRandom::choose_multiple_fill (two elements)

a0908bc

Fix cdf for choose_two_weighted_indexed

865aba2

dhardy requested a review from benjamin-lieser November 18, 2024 19:55

dhardy added 5 commits November 19, 2024 10:09

More complex test for choose_multiple_weighted

16a16c6

Fix calculated CDF in choose_two_weighted_indexed

554d331

Test and fix choose_multiple_weighted with very small probabilities: r…

d645952

…ust-random#1476 Also improves choose_two_weighted_indexed time by 23% (excluding new test)

sample_efraimidis_spirakis: keep at most amount candidates

b806b29

Approx 2% improvement to tests sampling 2 of 100 elements

sample_efraimidis_spirakis: use algorithm A-ExpJ

0f662b1

This results in approx 18% faster tests choosing 2-in-100 items

dhardy marked this pull request as ready for review November 19, 2024 11:02

dhardy added 2 commits November 19, 2024 11:04

Rustfmt

51c6fdd

Clippy

a1f61ae

dhardy mentioned this pull request Nov 23, 2024

choose_multiple_weighted returns unexpect probability of result #1476

Closed

Temp: extra test

1c147a4

dhardy mentioned this pull request Nov 23, 2024

Prepare 0.9.0-beta.0 #1535

Merged

1 task

benjamin-lieser reviewed Nov 25, 2024

View reviewed changes

dhardy added 3 commits November 25, 2024 17:05

Review feedback

ddb8f5f

Remove extra test

a039a7f

fn sample_efraimidis_spirakis: use BinaryHeap

831353f

benjamin-lieser approved these changes Nov 26, 2024

View reviewed changes

dhardy merged commit bf9d429 into rust-random:master Nov 26, 2024
16 checks passed

dhardy added a commit to dhardy/rand that referenced this pull request Nov 26, 2024

Add extra benchmark missing from rust-random#1530

7813eb0

dhardy added a commit to dhardy/rand that referenced this pull request Nov 26, 2024

CHANGELOG for rust-random#1530

0993ba6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add KS tests for weighted sampling #1530

Add KS tests for weighted sampling #1530

dhardy commented Nov 18, 2024

dhardy commented Nov 19, 2024 •

edited

Loading

dhardy commented Nov 19, 2024

benjamin-lieser left a comment

benjamin-lieser Nov 25, 2024

dhardy Nov 25, 2024

benjamin-lieser Nov 25, 2024

dhardy commented Nov 26, 2024

dhardy commented Nov 26, 2024 •

edited

Loading

benjamin-lieser left a comment •

edited

Loading

Add KS tests for weighted sampling #1530

Add KS tests for weighted sampling #1530

Conversation

dhardy commented Nov 18, 2024

Motivation

Details

dhardy commented Nov 19, 2024 • edited Loading

dhardy commented Nov 19, 2024

benjamin-lieser left a comment

Choose a reason for hiding this comment

benjamin-lieser Nov 25, 2024

Choose a reason for hiding this comment

dhardy Nov 25, 2024

Choose a reason for hiding this comment

benjamin-lieser Nov 25, 2024

Choose a reason for hiding this comment

dhardy commented Nov 26, 2024

dhardy commented Nov 26, 2024 • edited Loading

benjamin-lieser left a comment • edited Loading

Choose a reason for hiding this comment

dhardy commented Nov 19, 2024 •

edited

Loading

dhardy commented Nov 26, 2024 •

edited

Loading

benjamin-lieser left a comment •

edited

Loading