provide feedback on `max_iter` for `extend_haplotypes` #2988

petrelharp · 2024-09-17T23:36:29Z

After #2938 goes in, we need to do something so that the user knows whether the algorithm terminated because it ran out of things to do or because it hit `max_iter`. We should also give guidance in the docs about what `max_iter` should be. I propose throwing a warning if `max_iter` is reached; we need to do some experiments to determine a good suggested value.
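A minimal sketch of the proposed warning, written as a generic fixed-point loop rather than the actual `extend_haplotypes` code. The names `iterate_until_stable`, `step`, and `halve_toward_zero` are hypothetical and exist only for illustration; `step` is assumed to return the updated state together with the number of changes it made.

```python
import warnings


def iterate_until_stable(state, step, max_iter=10):
    """Apply step until it reports no changes or max_iter is reached.

    step(state) must return (new_state, num_changes).
    Hypothetical helper for illustration; not part of tskit.
    """
    for _ in range(max_iter):
        state, num_changes = step(state)
        if num_changes == 0:
            # Nothing left to do: the loop stopped on its own.
            return state
    # The loop was cut off before an iteration with zero changes was seen.
    warnings.warn(
        f"stopped after max_iter={max_iter} iterations without confirming "
        "convergence; running again may change the result",
        RuntimeWarning,
        stacklevel=2,
    )
    return state


def halve_toward_zero(x):
    # Toy step: integer-halve x and report one change while x is nonzero.
    return (x // 2, 1 if x != 0 else 0)
```

With this toy step, `iterate_until_stable(8, halve_toward_zero)` finishes quietly, while passing `max_iter=2` returns a partial result and emits a `RuntimeWarning`.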


jeromekelleher · 2024-09-18T09:49:22Z

Return a dataclass consisting of the tree sequence and algorithm convergence details? Relying on side channels for info like this is tricky; I'd suggest something like:

@dataclasses.dataclass
class ExtendHaplotypesResult:
    tree_sequence: TreeSequence
    iterations: int
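To illustrate how a caller might consume such a result object, here is a self-contained sketch. It deliberately avoids tskit so it runs on its own: `IterationResult`, its `converged` field, and the `run` helper are assumptions added for illustration and are not part of the suggestion above or of any released API.

```python
import dataclasses
from typing import Any, Callable, Tuple


@dataclasses.dataclass(frozen=True)
class IterationResult:
    # Hypothetical stand-in for a result type; `value` plays the role
    # that `tree_sequence` would play in the suggestion above.
    value: Any
    iterations: int
    converged: bool


def run(state: Any, step: Callable[[Any], Tuple[Any, int]], max_iter: int = 10):
    """Iterate step and report how many iterations ran and whether it converged."""
    iterations = 0
    converged = False
    while iterations < max_iter and not converged:
        state, num_changes = step(state)
        iterations += 1
        # Converged means an iteration completed without changing anything.
        converged = num_changes == 0
    return IterationResult(value=state, iterations=iterations, converged=converged)


def toy_step(x):
    # Toy step: integer-halve x and report one change while x is nonzero.
    return (x // 2, 1 if x != 0 else 0)
```

The trade-off discussed in this thread is visible here: the caller gets convergence details directly, but every call site has to unpack `result.value` instead of receiving the transformed object.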

petrelharp · 2024-10-21T20:02:27Z

@hfr1tz3 has done some experiments here:

- Running 1e4 samples on 50 Mb took roughly 10 minutes per iteration.
- Nearly all the edges are added in the first iteration: about 99% in the first iteration, 1% in the second, and just a handful (around 3) in the remaining iterations.
- It always terminated within 5 iterations.
- This was confirmed on many more replicates with fewer samples.

Proposal is to (a) set the default to 10, and (b) say in the docstring that if speed is a concern, setting it to 1 or 2 should get nearly everything, and that it is possible, though unlikely, that it hasn't converged by 10; re-running will verify.
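The "re-running will verify" check can be sketched generically: apply the transformation once more and see whether the output differs from the input. `is_fixed_point` and `shrink` are hypothetical names, and `shrink` is only a toy stand-in for an iterative method with an iteration cap; how equality would be tested for real tree sequences is left open here.

```python
def is_fixed_point(value, transform):
    """Return True if applying transform once more leaves value unchanged."""
    return transform(value) == value


def shrink(x, max_iter=10):
    # Toy stand-in for a capped iterative method: halve x until it
    # reaches zero or the iteration cap is hit.
    for _ in range(max_iter):
        if x == 0:
            break
        x //= 2
    return x


# A run with the default cap reaches the fixed point, so re-running is a no-op.
done = shrink(1000)
# A run with a small cap stops early, so re-running changes the result.
early = shrink(1000, max_iter=3)
```

Here `is_fixed_point(done, shrink)` is true and `is_fixed_point(early, shrink)` is false, which is the signal that the capped run had not converged.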

So: no need for a dataclass.

jeromekelleher · 2024-10-22T08:33:05Z

Can you imagine this changing in future, e.g., using a slightly different algorithm? What if we returned those numbers of edges per iteration in the dataclass? I worry that people won't get these nuances unless the data is readily available in the result.

petrelharp · 2024-10-22T16:31:08Z

I can imagine it, but I don't think it's terribly likely? If we did want to do that, we'd probably give the method a new name and deprecate the old one? I'm proposing this because the numbers seem very clear that users can ignore this detail 98% of the time, and the note about speeding it up by setting `max_iter=1` will get the remaining 2%. So, I'd rather keep this as a simple "ts transformation" method, like the others, that returns a modified ts.

jeromekelleher · 2024-10-22T19:05:01Z

Ah ok, I forgot this had been released and documented already. Ignore what I said, then.
