-
Hi @nbesteban, welcome 👋 That's not a trivial operation, I think, and there would be a bunch of open research-level questions on how to merge two inferred ARGs. In practice, I think you'd be better off matching your new samples against the existing trees to give you a first approximation of what the joint thing would be. @hyanwong, do we have any documented approaches for this sort of thing?
-
Hi @nbesteban - here's a demo of adding extra samples to an existing inferred tree sequence (here I simulated one for testing). Note that to do it "properly" we should also solve tskit-dev/tsinfer#916

```python
import json

import numpy as np
import msprime
import tsinfer
import tskit


def modify_sample_data(main_sd, extra_sd):
    # Modify extra_sd to remove sites not in main_sd, and add sites that are
    # only in main_sd (filled with the ancestral state).
    modified_sd = tsinfer.SampleData(sequence_length=extra_sd.sequence_length)
    modified_sd.individuals_metadata_schema = extra_sd.individuals_metadata_schema
    modified_sd.populations_metadata_schema = extra_sd.populations_metadata_schema
    for p in extra_sd.populations():
        modified_sd.add_population(metadata=p.metadata)
    for i in extra_sd.individuals():
        modified_sd.add_individual(
            ploidy=len(i.samples),
            metadata=i.metadata,
            population=i.population,
            location=i.location,
            time=i.time,
            flags=i.flags)
    print(
        "Omitting the following site positions, present in the extra data but absent from the main data:",
        extra_sd.sites_position[:][
            ~np.isin(extra_sd.sites_position[:], main_sd.sites_position[:])
        ]
    )
    for s in main_sd.sites():
        site_index = np.where(s.position == extra_sd.sites_position[:])[0]
        assert len(site_index) <= 1
        if len(site_index) == 1:
            alleles = extra_sd.sites_alleles[site_index][0]
            # TODO: we assume that the encoding of alleles, e.g. A->0, T->1, is the
            # same in the main_sd and the extra_sd, but this may not be the case for
            # VCF-derived files. In that case we would need to re-code the extra_sd
            # genotypes so that the alleles match. For the moment, just fail if this
            # is not the case.
            assert s.alleles == tuple(alleles)  # Careful: what if the extra data is all derived states?
            site_index = site_index[0]
            modified_sd.add_site(
                position=s.position,
                genotypes=extra_sd.sites_genotypes[site_index],
                alleles=alleles,
                metadata=extra_sd.sites_metadata[site_index],
            )
        else:
            modified_sd.add_site(
                position=s.position,
                genotypes=np.full_like(extra_sd.sites_genotypes[0], s.ancestral_allele),
                alleles=s.alleles,
            )
    modified_sd.finalise()
    assert np.all(modified_sd.sites_position[:] == main_sd.sites_position[:])
    return modified_sd


def add_edges_from_previous_ts(tables, prev_inferred_ts):
    # Get nodes and edges from prev_inferred_ts that were added in the
    # match_samples phase (i.e. sample nodes and PC nodes above them)
    # and add them into the provided tables instance.
    orig_samples = np.where(tables.nodes.flags & tskit.NODE_IS_SAMPLE)[0]
    node_map = {}
    individual_map = {tskit.NULL: tskit.NULL}
    sample_match_nodes = set()
    nodes_above_samples = set(
        prev_inferred_ts.edges_parent[
            np.isin(prev_inferred_ts.edges_child, prev_inferred_ts.samples())
        ]
    )
    # TODO: also add populations from the prev_inferred_ts
    for nd in prev_inferred_ts.nodes():
        if nd.is_sample():
            if nd.individual != tskit.NULL:
                if nd.individual not in individual_map:
                    individual_map[nd.individual] = tables.individuals.append(
                        prev_inferred_ts.individual(nd.individual)
                    )
            node_map[nd.id] = tables.nodes.append(
                nd.replace(individual=individual_map[nd.individual])
            )
            sample_match_nodes.add(nd.id)
            continue
        if nd.metadata:
            md = json.loads(nd.metadata.decode())
            if "ancestor_data_id" in md:
                node_map[nd.id] = md["ancestor_data_id"]
                continue
        # If we are here, this is a node that is neither a sample nor a normal ancestor
        assert nd.flags & tsinfer.NODE_IS_PC_ANCESTOR
        # Hack to check if this PC node is already in the ancestors_ts:
        # see https://github.com/tskit-dev/tsinfer/issues/916 for a proper solution
        if nd.id in nodes_above_samples:
            node_map[nd.id] = tables.nodes.append(nd)
            sample_match_nodes.add(nd.id)
    for e in prev_inferred_ts.edges():
        if e.parent in sample_match_nodes or e.child in sample_match_nodes:
            tables.edges.add_row(
                left=e.left,
                right=e.right,
                child=node_map[e.child],
                parent=node_map[e.parent])
    tables.sort()
    # Simplify, but keep unary nodes etc., and put the original samples at the end.
    # Note that the original individuals will still be at the start of the individuals table.
    samples = np.concatenate(
        ([node_map[s] for s in prev_inferred_ts.samples()], orig_samples)
    )
    assert np.all(tables.nodes.flags[samples] & tskit.NODE_IS_SAMPLE != 0)
    tables.simplify(
        samples, keep_unary=True, filter_sites=False,
        filter_individuals=False, filter_populations=False)
    new_ts = tables.tree_sequence()
    # Check we have attached all the samples effectively
    # (the hack above means we could perhaps have missed some intermediate PC nodes)
    for tree in new_ts.trees():
        if tree.num_edges > 0:
            if tree.num_roots != 1:
                raise RuntimeError("Missed some PC nodes")
    return new_ts


# SIMULATE SOME DATA, AND SPLIT INTO 2 SETS
main_data_size = 40  # num individuals in the main ts
extra_data_size = 5  # num individuals we are going to add
tot_n = main_data_size + extra_data_size
tmp_ts = msprime.sim_mutations(
    msprime.sim_ancestry(
        tot_n, population_size=1e4, sequence_length=1e5, recombination_rate=1e-8),
    rate=1e-8,
)
main_sample_data = tsinfer.SampleData.from_tree_sequence(
    tmp_ts.simplify(tmp_ts.samples()[0:(main_data_size * 2)])
)
print(main_sample_data.num_sites, "sites in the main datafile")
extra_sample_data = tsinfer.SampleData.from_tree_sequence(
    tmp_ts.simplify(tmp_ts.samples()[(main_data_size * 2):])
)
print(extra_sample_data.num_sites, "sites in the extra datafile")
assert extra_sample_data.num_individuals == extra_data_size

# CREATE ANCESTORS AND MAIN TREE SEQS
main_ancestors_ts = tsinfer.match_ancestors(
    main_sample_data, tsinfer.generate_ancestors(main_sample_data))
main_ts = tsinfer.match_samples(main_sample_data, main_ancestors_ts)

extra_sample_data = modify_sample_data(main_sample_data, extra_sample_data)
ts_partial = tsinfer.match_samples(
    extra_sample_data, main_ancestors_ts, post_process=False)
# Trim down to the first and last site
ts_partial = ts_partial.keep_intervals(
    [[ts_partial.site(0).position, ts_partial.site(-1).position + 1]], simplify=False)
full_ts = add_edges_from_previous_ts(
    tables=ts_partial.dump_tables(),
    prev_inferred_ts=main_ts,
)
print(
    "Done: added",
    full_ts.num_samples - main_ts.num_samples,
    "new samples to the `main_ts` tree seq, using the ancestors_ts"
)
```
-
Just "thinking out loud" here, after chatting to Duncan and Jerome. Programs like Singer perform a single-sample threading operation like the code I posted as another answer, but remove polytomies by creating new attachment positions (i.e. new ancestors) during threading. This would be one avenue (but depends on the order of threaded samples). As with the code I posted, it would probably work fine if the number of new sample genomes was small. Another option would be to infer a tree sequence separately for the new samples, and try to merge the two together. Although this was what @nbesteban was originally asking about, and by going site-by-site it would be possible to equate ancestors (i.e. nodes) between the tree sequences, I suspect if would be tricky to ensure there were no loops in the resulting ARG. Additionally, I think this is not really making best use of the data. In the answer/discussion below, I'll focus on a more tsinfer-like approach, suitable for a larger number of new samples, and which treats the entire batch of new samples at once (advantage = no order dependency). In particular, if you have a new set of sequences to add to an existing inferred TS, I can imagine each variable site in the new data could fall into one of 4 categories
If the site is of type 3, nothing more needs to be done. If it is of type 4, I don't think we can easily change the ancestors without re-running the entire inference (but we might hope that getting the time order right using tsdate would make us robust to changes in frequency order). It would be a research question as to how much this affects inference: it's closely related to how well we do at inference when we subsample the data, something that @duncanMR is investigating.

I haven't thought much about what we do for sites of type 2 without re-running the entire inference. I guess for triallelic sites we could simply mark the 3rd allele as missing.

Sites of type 1 require algorithmic consideration. It would be possible to generate new ancestors for these sites, and probably (relatively) easy to figure out their relative time order compared to the other ancestors. It would be a lot of work if we had to re-do the matching of existing ancestors to incorporate these, but I think we could reasonably assume that existing nodes in the tree sequence wouldn't copy from them anyway (since sites of type 1a represent "new" variation, and for singleton->doubleton sites, we can simply re-match the original singleton sample). So "all" we would need to do is to gradually match the new ancestors, forwards in time, against a tree sequence comprised of a combination of the previous TS (truncated at the ancestor's time) with any more ancient newly-generated ancestors. I think this is algorithmically possible, but would require substantial reworking of the match_ancestors algorithm.
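As a toy illustration of the four site categories (entirely hypothetical data, not tsinfer code): given the alleles and derived-allele counts at each position in the main and extra datasets, a site can be classified by comparing positions, allele sets, and derived-allele frequency ranks:

```python
import numpy as np

# Hypothetical toy data: position -> (alleles, derived-allele count in that dataset)
main_n, extra_n = 80, 10  # haploid sample sizes
main = {100: (("A", "T"), 40), 200: (("G", "C"), 8),
        250: (("T", "C"), 30), 300: (("C", "A"), 10)}
extra = {100: (("A", "T", "G"), 12),  # third allele not seen in the main data
         200: (("G", "C"), 6),
         250: (("T", "C"), 4),
         300: (("C", "A"), 1),
         400: (("T", "G"), 3)}        # position absent from the main data

site_type = {}
for pos, (alleles, _) in extra.items():
    if pos not in main:
        site_type[pos] = 1  # new variation: would need a brand-new ancestor
    elif set(alleles) - set(main[pos][0]):
        site_type[pos] = 2  # extra allele at an existing site (e.g. triallelic)
    else:
        site_type[pos] = 3  # shared site, same alleles (possibly type 4, below)

# A shared site whose derived-allele frequency *rank* changes once the new
# samples are pooled is type 4: its ancestor's inferred time order may change.
shared = sorted(p for p, t in site_type.items() if t == 3)
def ranks(freqs):
    return np.argsort(np.argsort(freqs))
old = ranks([main[p][1] / main_n for p in shared])
new = ranks([(main[p][1] + extra[p][1]) / (main_n + extra_n) for p in shared])
for p, r_old, r_new in zip(shared, old, new):
    if r_old != r_new:
        site_type[p] = 4

print(site_type)  # e.g. {100: 2, 200: 4, 250: 3, 300: 4, 400: 1}
```

Only types 3 (nothing to do) and 2 (e.g. mark the extra allele as missing) are cheap to handle; types 1 and 4 are where the algorithmic work described above comes in.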
-
Hi everyone,
With the dated tree sequences used in the unified genealogy available on the Zenodo website, I'm considering adding a dated tree sequence from a new population, to see where its samples will be situated and to facilitate further analyses. Is it possible to merge it with the existing dated tree sequence of the unified genealogy? If so, is there already a script that can facilitate this?
Thank you so much!