Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove edge table sorting if possible #401

Open
bhaller opened this issue Oct 23, 2023 · 7 comments
Open

remove edge table sorting if possible #401

bhaller opened this issue Oct 23, 2023 · 7 comments
Labels
enhancement long-term trees related to tree-seq, tskit, etc.

Comments

@bhaller
Copy link
Contributor

bhaller commented Oct 23, 2023

@molpopgen has written here https://tskit-dev.slack.com/archives/C01JKJC5Y9G/p1698082329001319?thread_ts=1697956055.069829&cid=C01JKJC5Y9G that:

Edge sorting is totally unnecessary and is a big performance hit (even using the "bookmark"). fwdpyh11 has used this approach for several years now.

This would be good to merge into SLiM as well. I gather some work in tskit is needed to enable it (unless we hack SLiM's copy of tskit). I don't know how much work all of this is, as I don't really know what exactly Kevin is doing. But @petrelharp this would be a good thing for us to work on together at some point, I think. Perhaps soon, if you think it would be easy. Getting rid of the edge table sort in SLiM entirely would be very nice – much better than trying to parallelize that sort (which does not have very good performance).

@bhaller bhaller added enhancement trees related to tree-seq, tskit, etc. long-term labels Oct 23, 2023
@petrelharp
Copy link
Collaborator

Edges need to go into the edge table ordered by parent's birth time (and with edges contiguous by parent). So, to do this we need to do something like:

  • buffer each parents' edges together somewhere so then can be written to the table all at once after they die
  • only write these for a given parent p to the edge table at a point when there are no living individuals who were born earlier than p (since if there are still individuals older than p alive, these could still produce edges that would come earlier than p's edges)
  • do something special when reading in a tree sequence, since newly produced edges aren't guaranteed to occur later than edges already in the tables
  • somehow arrange that simplify doesn't remove any individuals (or nodes) whose edges haven't been written to the table yet (since a no-longer-alive parent whose edges aren't yet written won't have any recorded offspring in the table, and so would be simplified away)

This is do-able, although the details of the last point aren't clear to me. However, it would totally break if there's something like a "dummy individual" that doesn't ever reproduce or die - at least, it would prevent simplification from ever happening. (And, I think such individuals occur in some simulations?) For the same reason, the interaction between long-lived individuals and simplification interval aren't clear.

@petrelharp
Copy link
Collaborator

IIRC, Kevin implemented a buffering scheme, but did not have to deal with (a) reading in a tree sequence and having already-alive individuals, or (b) particularly long-lived individuals.

@petrelharp
Copy link
Collaborator

Kevin points out that the sorting of edges (by child ID and left endpoint) within each contiguous block of edges-from-the-same-parent isn't used by simplify and so might be removed; however, this is orthogonal to the points above (which have to do with sorting-by-parent).

@bhaller
Copy link
Contributor Author

bhaller commented Oct 25, 2023

Great, thanks. Not sure what I think about long-lived individuals; yes, they do occur in some designs (Daphnia resting eggs, etc.). Anyhow, this is not something that I want to attack right now, but perhaps the next time you and I are in the same spacetime vicinity, we might try to do it? Seemed good to have a placeholder to keep it in our thoughts. Marked "long-term" for now.

@molpopgen
Copy link
Contributor

molpopgen commented Oct 25, 2023

IIRC, Kevin implemented a buffering scheme, but did not have to deal with (a) reading in a tree sequence and having already-alive individuals, or (b) particularly long-lived individuals.

This isn't quite right -- overlapping generations is easy to handle, as I've mentioned. There are simple O(N) steps required. The links in the various tskit issues contain the details.

Edit: reading in a tree sequence also doesn't change things -- the setup steps related to overlapping generations can be handled right after reading in, etc..

@petrelharp
Copy link
Collaborator

Ah, yes - as I think you've noticed, I wasn't saying you didn't have to deal with "having already-alive individuals", I was saying you didn't have to deal with "reading in a tree sequence that referred to already-alive individuals". I guess the "setup steps" you refer to would be taking some of the edges out of the tree sequence and putting them in the buffer; that would require bookeeping but not be too bad.

For future reference, the "various tskit issues" are referenced here tskit-dev/tskit#2751 (right?)

@molpopgen
Copy link
Contributor

yes, that's the relevant tskit issue. There's another that I need to open re: the wrong edge criteria are being used for validation (full sort output rather than what simplification actually requires).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement long-term trees related to tree-seq, tskit, etc.
Projects
None yet
Development

No branches or pull requests

3 participants