Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

provide feedback on max_iter for extend_haplotypes #2988

Open
petrelharp opened this issue Sep 17, 2024 · 5 comments
Open

provide feedback on max_iter for extend_haplotypes #2988

petrelharp opened this issue Sep 17, 2024 · 5 comments

Comments

@petrelharp
Copy link
Contributor

After #2938 goes in, we need to do something so that the user knows whether the algorithm terminated because of running out of things to do or because it ran into max_iter. And, give guidance in the docs about what max_iter should be. I propose throwing a warning if max_iter is reached; we need to do some experiments to determine a good suggested value.

@jeromekelleher
Copy link
Member

Return a dataclass consisting of the tree sequence and algorithm convergence details? Relying on side channels for info like this is tricky, I'd suggest something like

@dataclasses.dataclass
class ExtendHaplotypesResult:
     tree_sequence: TreeSequence
     iterations: int

@petrelharp
Copy link
Contributor Author

petrelharp commented Oct 21, 2024

@hfr1tz3 has done some experiments here:

  1. doing 1e4 samples on 50Mb was taking like 10min per iteration
  2. nearly all the edges are added in the first iteration - like 99% on first iteration; 1% on second iteration; just a handful (like 3) in remaining iterations
  3. it always terminates within 5 iterations
  4. this is confirmed on many more reps of fewer samples

Proposal is to (a) set default to 10; and say in docstring that if speed is a concern, setting to 1 or 2 should get nearly everything; and it is possible though unlikely that it hasn't converged by 10 - re-running will verify.

So: no need for a dataclass.

@jeromekelleher
Copy link
Member

Can you imagine this changing in future, e.g, using a slightly different algorithm? What if we returned those numbers of edges per iteration in the dataclass? It worry that people won't get these nuances unless the data is readily available in the result

@petrelharp
Copy link
Contributor Author

I can imagine it, but I don't think it's terribly likely? If we did want to do that, we'd probably give the method a new name and deprecate the old one? I'm proposing this because the numbers seem very clear that users can ignore this detail 98% of the time, and the note about speeding it up by setting max_iter=1 will get the remaining 2%. So, I'd rather keep this as a simple "ts tranformation" method, like the others, that returns a modified ts.

@jeromekelleher
Copy link
Member

Ah ok, I forgot this had been released and documented already. Ignore what I said then

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants