-
Hi @nbesteban, welcome 👋 That's not a trivial operation, I think, and there would be a bunch of open research-level questions on how to merge two inferred ARGs. In practice, I think you'd be better off matching your new samples against the existing trees to give you a first approximation of what the joint thing would be. @hyanwong, do we have any documented approaches for this sort of thing?
-
Hi @nbesteban - here's a demo of adding extra samples to an existing inferred tree sequence (here I simulated one for testing). Note that to do it "properly" we should also solve tskit-dev/tsinfer#916

```python
import json

import numpy as np
import msprime
import tsinfer
import tskit


def modify_sample_data(main_sd, extra_sd):
    # Modify extra_sd to remove sites not in main_sd, and add sites that are
    # only in main_sd (filled with the ancestral state).
    modified_sd = tsinfer.SampleData(sequence_length=extra_sd.sequence_length)
    modified_sd.individuals_metadata_schema = extra_sd.individuals_metadata_schema
    modified_sd.populations_metadata_schema = extra_sd.populations_metadata_schema
    for p in extra_sd.populations():
        modified_sd.add_population(metadata=p.metadata)
    for i in extra_sd.individuals():
        modified_sd.add_individual(
            ploidy=len(i.samples),
            metadata=i.metadata,
            population=i.population,
            location=i.location,
            time=i.time,
            flags=i.flags)
    print(
        "Omitting the following site positions, present in the extra data but absent from the main data:",
        extra_sd.sites_position[:][
            ~np.isin(extra_sd.sites_position[:], main_sd.sites_position[:])
        ]
    )
    for s in main_sd.sites():
        site_index = np.where(s.position == extra_sd.sites_position[:])[0]
        assert len(site_index) <= 1
        if len(site_index) == 1:
            alleles = extra_sd.sites_alleles[site_index][0]
            # TODO: we assume that the encoding of alleles, e.g. A->0, T->1, is the
            # same in the main_sd and the extra_sd, but this may not be the case for
            # VCF-derived files. In that case we would need to re-code the extra_sd
            # genotypes so that the alleles match. For the moment, just fail if this
            # is not the case.
            assert s.alleles == tuple(alleles)  # Careful: what if the extra data is all derived states?
            site_index = site_index[0]
            modified_sd.add_site(
                position=s.position,
                genotypes=extra_sd.sites_genotypes[site_index],
                alleles=alleles,
                metadata=extra_sd.sites_metadata[site_index],
            )
        else:
            modified_sd.add_site(
                position=s.position,
                genotypes=np.full_like(extra_sd.sites_genotypes[0], s.ancestral_allele),
                alleles=s.alleles,
            )
    modified_sd.finalise()
    assert np.all(modified_sd.sites_position[:] == main_sd.sites_position[:])
    return modified_sd


def add_edges_from_previous_ts(tables, prev_inferred_ts):
    # Get nodes and edges from prev_inferred_ts that were added in the
    # match_samples phase (i.e. sample nodes and PC nodes above them)
    # and add them into the provided tables instance.
    orig_samples = np.where(tables.nodes.flags & tskit.NODE_IS_SAMPLE)[0]
    node_map = {}
    individual_map = {tskit.NULL: tskit.NULL}
    sample_match_nodes = set()
    nodes_above_samples = set(
        prev_inferred_ts.edges_parent[
            np.isin(prev_inferred_ts.edges_child, prev_inferred_ts.samples())
        ]
    )
    # TODO: also add populations from the prev_inferred_ts
    for nd in prev_inferred_ts.nodes():
        if nd.is_sample():
            if nd.individual != tskit.NULL:
                if nd.individual not in individual_map:
                    individual_map[nd.individual] = tables.individuals.append(
                        prev_inferred_ts.individual(nd.individual)
                    )
            node_map[nd.id] = tables.nodes.append(
                nd.replace(individual=individual_map[nd.individual])
            )
            sample_match_nodes.add(nd.id)
            continue
        if nd.metadata:
            md = json.loads(nd.metadata.decode())
            if "ancestor_data_id" in md:
                node_map[nd.id] = md["ancestor_data_id"]
                continue
        # If we are here, this is a node that is neither a sample nor a normal ancestor
        assert nd.flags & tsinfer.NODE_IS_PC_ANCESTOR
        # Hack to check if this PC node is already in the ancestors_ts:
        # see https://github.com/tskit-dev/tsinfer/issues/916 for a proper solution
        if nd.id in nodes_above_samples:
            node_map[nd.id] = tables.nodes.append(nd)
            sample_match_nodes.add(nd.id)
    for e in prev_inferred_ts.edges():
        if e.parent in sample_match_nodes or e.child in sample_match_nodes:
            tables.edges.add_row(
                left=e.left,
                right=e.right,
                child=node_map[e.child],
                parent=node_map[e.parent])
    tables.sort()
    # Simplify, but keep unary nodes etc., and put the original samples at the end.
    # Note that the original individuals will still be at the start of the individuals table.
    samples = np.concatenate(
        ([node_map[s] for s in prev_inferred_ts.samples()], orig_samples)
    )
    assert np.all(tables.nodes.flags[samples] & tskit.NODE_IS_SAMPLE != 0)
    tables.simplify(
        samples, keep_unary=True, filter_sites=False,
        filter_individuals=False, filter_populations=False)
    new_ts = tables.tree_sequence()
    # Check we have attached all the samples effectively
    # (the hack above means we could perhaps have missed some intermediate PC nodes)
    for tree in new_ts.trees():
        if tree.num_edges > 0:
            if tree.num_roots != 1:
                raise RuntimeError("Missed some PC nodes")
    return new_ts


# SIMULATE SOME DATA, AND SPLIT INTO 2 SETS
main_data_size = 40  # num individuals in the main ts
extra_data_size = 5  # num individuals we are going to add
tot_n = main_data_size + extra_data_size
tmp_ts = msprime.sim_mutations(
    msprime.sim_ancestry(
        tot_n, population_size=1e4, sequence_length=1e5, recombination_rate=1e-8),
    rate=1e-8,
)
main_sample_data = tsinfer.SampleData.from_tree_sequence(
    tmp_ts.simplify(tmp_ts.samples()[0:(main_data_size * 2)])
)
print(main_sample_data.num_sites, "sites in the main datafile")
extra_sample_data = tsinfer.SampleData.from_tree_sequence(
    tmp_ts.simplify(tmp_ts.samples()[(main_data_size * 2):])
)
print(extra_sample_data.num_sites, "sites in the extra datafile")
assert extra_sample_data.num_individuals == extra_data_size

# CREATE ANCESTORS AND MAIN TREE SEQS
main_ancestors_ts = tsinfer.match_ancestors(
    main_sample_data, tsinfer.generate_ancestors(main_sample_data))
main_ts = tsinfer.match_samples(main_sample_data, main_ancestors_ts)

extra_sample_data = modify_sample_data(main_sample_data, extra_sample_data)
ts_partial = tsinfer.match_samples(
    extra_sample_data, main_ancestors_ts, post_process=False)
# Trim down to the first and last site
ts_partial = ts_partial.keep_intervals(
    [[ts_partial.site(0).position, ts_partial.site(-1).position + 1]], simplify=False)
full_ts = add_edges_from_previous_ts(
    tables=ts_partial.dump_tables(),
    prev_inferred_ts=main_ts,
)
print(
    "Done: added",
    full_ts.num_samples - main_ts.num_samples,
    "new samples to the `main_ts` tree seq, using the ancestors_ts"
)
```
-
Just "thinking out loud" here, after chatting to Duncan and Jerome. Programs like Singer perform a single-sample threading operation like the code I posted as another answer, but remove polytomies by creating new attachment positions (i.e. new ancestors) during threading. This would be one avenue (but depends on the order of threaded samples). As with the code I posted, it would probably work fine if the number of new sample genomes was small. Another option would be to infer a tree sequence separately for the new samples, and try to merge the two together. Although this was what @nbesteban was originally asking about, and by going site-by-site it would be possible to equate ancestors (i.e. nodes) between the tree sequences, I suspect if would be tricky to ensure there were no loops in the resulting ARG. Additionally, I think this is not really making best use of the data. In the answer/discussion below, I'll focus on a more tsinfer-like approach, suitable for a larger number of new samples, and which treats the entire batch of new samples at once (advantage = no order dependency). In particular, if you have a new set of sequences to add to an existing inferred TS, I can imagine each variable site in the new data could fall into one of 4 categories
If the site is of type 3, nothing more needs to be done. If it is of type 4, I don't think we can easily change the ancestors without re-running the entire inference (but we might hope that getting the time order right using tsdate would make us robust to changes in frequency order). It would be a research question as to how much this affects inference: it's closely related to how well we do at inference when we subsample the data, something that @duncanMR is investigating.

I haven't thought much about what we do for sites of type 2 without re-running the entire inference. I guess for triallelic sites we could simply mark the 3rd allele as missing.

Sites of type 1 require algorithmic consideration. It would be possible to generate new ancestors for these sites, and probably (relatively) easy to figure out their relative time order compared to the other ancestors. It would be a lot of work if we had to re-do the matching of existing ancestors to incorporate these, but I think we could reasonably assume that existing nodes in the tree sequence wouldn't copy from them anyway (since sites of type 1a represent "new" variation, and for singleton->doubleton sites, we can simply re-match the original singleton sample). So "all" we would need to do is to gradually match the new ancestors, forwards in time, against a tree sequence comprised of a combination of the previous TS (truncated at the ancestor's time) with any more ancient newly-generated ancestors. I think this is algorithmically possible, but would require substantial reworking of the match_ancestors algorithm.
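As a toy illustration of the four site categories (entirely hypothetical data, not tsinfer code): given the alleles and derived-allele counts at each position in the main and extra datasets, a site can be classified by comparing positions, allele sets, and derived-allele frequency ranks:

```python
import numpy as np

# Hypothetical toy data: position -> (alleles, derived-allele count in that dataset)
main_n, extra_n = 80, 10  # haploid sample sizes
main = {100: (("A", "T"), 40), 200: (("G", "C"), 8),
        250: (("T", "C"), 30), 300: (("C", "A"), 10)}
extra = {100: (("A", "T", "G"), 12),  # third allele not seen in the main data
         200: (("G", "C"), 6),
         250: (("T", "C"), 4),
         300: (("C", "A"), 1),
         400: (("T", "G"), 3)}        # position absent from the main data

site_type = {}
for pos, (alleles, _) in extra.items():
    if pos not in main:
        site_type[pos] = 1  # new variation: would need a brand-new ancestor
    elif set(alleles) - set(main[pos][0]):
        site_type[pos] = 2  # extra allele at an existing site (e.g. triallelic)
    else:
        site_type[pos] = 3  # shared site, same alleles (possibly type 4, below)

# A shared site whose derived-allele frequency *rank* changes once the new
# samples are pooled is type 4: its ancestor's inferred time order may change.
shared = sorted(p for p, t in site_type.items() if t == 3)
def ranks(freqs):
    return np.argsort(np.argsort(freqs))
old = ranks([main[p][1] / main_n for p in shared])
new = ranks([(main[p][1] + extra[p][1]) / (main_n + extra_n) for p in shared])
for p, r_old, r_new in zip(shared, old, new):
    if r_old != r_new:
        site_type[p] = 4

print(site_type)  # e.g. {100: 2, 200: 4, 250: 3, 300: 4, 400: 1}
```

Only types 3 (nothing to do) and 2 (e.g. mark the extra allele as missing) are cheap to handle; types 1 and 4 are where the algorithmic work described above comes in.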
-
Hi everyone,
With the dated tree sequences used in the unified genealogy available on the Zenodo website, I'm considering adding a dated tree sequence from a new population, to see where its samples will be situated and to facilitate further analyses. Is it possible to merge it with the existing dated tree sequence of the unified genealogy? If so, is there already a script that can facilitate this?
Thank you so much!