function(s) to identify and remove individuals based on GTscore conScore #30

Open
2 tasks
krshedd opened this issue Sep 7, 2023 · 15 comments

@krshedd
Collaborator

krshedd commented Sep 7, 2023

We currently do not have any functions to remove contaminated individuals based on GTscore conScore. I propose that we create 2 functions:

  • find_ind_con - reads in conScore from GTscore singleSNP_sampleSummary.txt file(s), plots density distribution of conScore or heterozygosity vs. conScore similar to GTscore SampleSummaryPlots.pdf output, and outputs modified version of singleSNP_sampleSummary.txt; user inspects plot(s) to determine conScore cutoff.
  • remove_ind_con - takes the output from find_ind_con, together with a conScore cutoff, and removes individuals above that threshold; the threshold may be specific to a given GT-seq panel.
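A minimal sketch of the two-step idea above, written in Python purely for illustration (the real functions would be R code in GCLr). The column names `sample` and `conScore`, the tab delimiter, and the treatment of `NA` are assumptions about the sample summary file, not confirmed details of GTscore output; the plotting step is left out.

```python
import csv
import io


def find_ind_con(summary_text, id_col="sample", score_col="conScore"):
    """Parse tab-delimited sample-summary text into {sample_id: conScore}.

    Column names are assumptions; rows with a missing score are skipped.
    """
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    scores = {}
    for row in reader:
        value = row[score_col]
        if value in ("", "NA"):
            continue  # no contamination score available for this sample
        scores[row[id_col]] = float(value)
    return scores


def remove_ind_con(scores, cutoff):
    """Split samples into those kept (score <= cutoff) and those removed."""
    kept = {sid: s for sid, s in scores.items() if s <= cutoff}
    removed = sorted(sid for sid in scores if sid not in kept)
    return kept, removed


# Made-up example data, only to show the call pattern.
example = (
    "sample\tGenotypeRate\tconScore\n"
    "fish1\t0.95\t0.05\n"
    "fish2\t0.92\t0.45\n"
    "fish3\t0.60\tNA\n"
)
kept, removed = remove_ind_con(find_ind_con(example), cutoff=0.3)
print("removed:", removed)
```

The cutoff is passed in explicitly because, as noted above, it would be chosen by the user after inspecting the plots and may differ between panels.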

The idea would be for these 2 new functions to become part of our standard QA process along with remove_ind_miss_loci, dupcheck_within_silly, remove_dups, find_alt_species, and remove_alt_species. Previously, using TaqMan, contaminated individuals would likely be no-called for enough SNPs that they'd drop out with remove_ind_miss_loci, but that is not necessarily the case with GT-seq.

Open to other ideas, but @csjalbert and I can work on these when we analyze C015 SEAK coho baseline.

@krshedd krshedd added the enhancement New feature or request label Sep 7, 2023
@awbarclay
Collaborator

I like the idea of removing individuals based on their conScore. However, I think it would be better to have the conScore imported into Loki so people don't have to search for sample summary text files in order to remove contaminated individuals. If the contamination scores can be imported from Loki into R along with the genotypes, we could modify Loki2R to import the scores and produce conScore density distribution plots when type = "GTSNP". We could then use the remove_ind_con function to remove contaminated individuals. Of course, we'd have to check with Eric to see if this is possible. What do you think?

@krshedd
Collaborator Author

krshedd commented Sep 7, 2023

@awbarclay, I'd considered having conScore imported into LOKI as well; however, it gets a bit tricky since it would be tied to both the fish and the lab project. A fish could have multiple conScores if it was genotyped on more than one GT-seq project (i.e., re-runs, different locus panels, etc.). This would not be a problem if we were only pulling genotypes by lab project, but it breaks down if you want to pull genotypes by a vector of locusnames that span multiple GT-seq projects. I'm open to suggestions here, but it gets a bit complicated.

@awbarclay
Collaborator

After talking this over with @krshedd, we think it would be great if contaminated fish could be given "0" genotypes before they are imported. That way, the fish will be removed using GCLr::remove_ind_miss_loci(). The lab staff would have to "no call" the fish before importing the genotypes. To make their lives easier, this will require functions similar to the ones that @krshedd suggested above, to determine a threshold and give contaminated fish "0" scores for all loci. Lab staff are already "no calling" fish for chip projects, so it wouldn't be much different. @csjalbert, is this something that could be implemented in the future?

@tylerdann

tylerdann commented Sep 7, 2023 via email

@hahoyt

hahoyt commented Sep 7, 2023

My 2 cents: if a contaminated fish is very likely to already be < 80% successful and will be removed due to that rule, what is the benefit of loading it all up in Loki with 0/0 calls? Additionally, if it is all 0/0 in Loki, the assumption is going to be that it failed and maybe just needs to be rerun, vs. a crappy success rate fish that we therefore know is crappy.

@krshedd
Collaborator Author

krshedd commented Sep 7, 2023

@hahoyt there are instances where fish can have a high conScore, but still have a high genotype rate. This plot is an example from K205 - Unuk Chinook 2021 tGMR. The fish in the upper right-hand corner have high contamination and high genotyping success, thus the only way to remove them would be with the GTscore conScore. Think of these as fish that lab staff would have no-called in the past because the VIC/FAM plots were too "fluffy". So far the consensus idea that Andy and I have been mulling would be to add some code to the pipeline post-GTscore, but pre-LOKI import, that would remove these. Definitely want to get @csjalbert's thoughts on this since he knows the pipeline better than all of us.

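[Image: example plot from K205 - Unuk Chinook 2021 tGMR; fish in the upper right-hand corner have both high contamination and high genotyping success]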

@hahoyt

hahoyt commented Sep 7, 2023

Oh, I see. So we can't assume a contaminated fish (based on conScore) will have a < 80% success rate. Also, I think that when uSATs are scored and there is a clearly contaminated fish, the team no-calls it for all markers. So this wouldn't be any different than that. Cool.

@krshedd
Collaborator Author

krshedd commented Sep 7, 2023

Exactly, same as how we do SNPs on chips and uSATs, get rid of fish with junk/contaminated genotypes before they go into LOKI.

@hahoyt

hahoyt commented Sep 7, 2023

Not all SNPs on chips are 0/0'd out for a fish with contamination. They will have more 0/0's because of the fluffiness, but the genotypers don't select the fish for all markers and 0/0 it out. Or at least, we never have.

@krshedd
Collaborator Author

krshedd commented Sep 7, 2023

Right, sorry for adding confusion. The point is those 0/0s for SNPs on chips (sounds like a tasty snack?) likely push the fish <80% genotyping success, so they drop out in downstream analyses. That is not the case with GTscore, hence my desire to do something with conScore.

@hahoyt

hahoyt commented Sep 7, 2023

Roger that. :-)

@csjalbert
Collaborator

I agree that it makes sense to deal with these contaminated samples. This seems like something that could be implemented in the GTscore pipeline. It could be as simple as a script that runs post-GTscore; that's how the genotype rate plots work. That said, a few questions to make sure I'm understanding correctly:

  1. Would we just change the LOKI file or is the idea to have this act on other outputs?
      • If just LOKI, we'd need to be careful using any rubias, etc., that comes from the pipeline as they wouldn't necessarily be the same data as what ends up in LOKI.
      • This is basically the same as when we rescore IDFG299 - no records of the changes in any of the normal outputs.
      • If this is automated, we could include the list of contaminated fish in a blacklist that is used to filter the rubias/genepop exports.
  2. I'm unclear if the lab (or P/Ls) would select their cut offs and 0 out fish for each project or are you thinking some sort of standardized values that automatically apply (e.g., 0.8 genotype rate and 0.3 contamination) for all projects?
  3. Tying into 2 above, if it involves human review, then it probably wouldn't need to be part of the pipeline, but a few extra GCLr functions that someone runs after the data is transferred from the server..?
      • This seems like what @krshedd described in the first post.
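For the blacklist idea in question 1, a rough sketch of filtering a tabular export might look like the following. It is Python for illustration only, the function name is hypothetical, and it assumes a comma-delimited table with one row per fish and an `indiv` identifier column (as in the rubias input format); a genepop file is not a plain table and would need its own handling.

```python
import csv
import io


def filter_blacklisted(table_text, blacklist, id_col="indiv"):
    """Return CSV text with every row whose id is in `blacklist` dropped."""
    reader = csv.DictReader(io.StringIO(table_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[id_col] not in blacklist:
            writer.writerow(row)  # keep fish that are not flagged as contaminated
    return out.getvalue()


# Made-up rows, only to show the call pattern.
export = "indiv,collection,locus1\nA,X,1\nB,X,2\n"
print(filter_blacklisted(export, blacklist={"B"}))
```

Using one shared blacklist for every export would keep the rubias file and the LOKI file in agreement about which fish were dropped.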

@krshedd
Collaborator Author

krshedd commented Sep 8, 2023

Thanks @csjalbert for the clarifying questions and forgive my lack of a detailed understanding of the order of operations for different pipeline steps.

  1. Only the final LOKI input file needs to be changed. This could be done once everything is off the server and transferred onto the V: drive. We could rename the LOKI file generated by the pipeline on the server as preliminary or something so it is clear that contaminated fish haven't been removed yet. We could also include a README.txt or something to clarify that all the other GTscore associated files (genepop, rubias, etc.) on the V: drive are raw, and include contaminated fish.
  2. I think it will require some user intervention. My naive thought was to just have an R function that uses the _singleSNP_sampleSummary.txt files to plot a density distribution of conScore (or a plotly version of heterozygosity vs. conScore like what is already included in _SampleSummaryPlots.pdf) so lab staff could identify a threshold. Then run another function that uses that threshold to remove contaminated fish from the final LOKI input file.
  3. Correct, since this requires human intervention, this would all occur on the V: drive, after files have been transferred from the server.
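As a rough illustration of the review step in point 2, the sketch below bins conScore values and prints a text histogram that a reviewer could scan for a natural break. It is Python for illustration only (the actual function would be R, likely with a proper density plot), and the function name and bin width are arbitrary assumptions.

```python
import math


def conscore_histogram(scores, bin_width=0.1):
    """Count samples per conScore bin.

    Returns a list of (bin_start, count) covering every bin from 0 up to
    the bin holding the largest score, including empty bins.
    """
    if not scores:
        return []
    counts = {}
    for s in scores:
        # Round before flooring to avoid floating-point edge effects.
        idx = math.floor(round(s / bin_width, 9))
        counts[idx] = counts.get(idx, 0) + 1
    return [
        (round(i * bin_width, 10), counts.get(i, 0))
        for i in range(max(counts) + 1)
    ]


# Made-up scores, only to show the call pattern.
for start, count in conscore_histogram([0.02, 0.03, 0.12, 0.41]):
    print(f"{start:4.2f} | {'#' * count}")
```

A gap of empty bins between the bulk of the samples and a small high-conScore group is the kind of pattern that would suggest where to set the threshold.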

Does that make sense? Anything major I'm missing?

@csjalbert
Collaborator

csjalbert commented Sep 8, 2023

@krshedd this makes sense to me. I don't see a way around human review on a project-by-project basis, so it makes sense to set this up on V:, where lab staff have easy access. The only additional comment is that I will not split LOKI files on the server. We can take care of the split on V: after the contaminated fish have been removed. I suppose this would be a 3rd function, one that we may not even need once we test the new importer.

  • loki_splits.r - split the filtered LOKI file into 60 MB chunks. Make sure to name it "LOKI_split_x_filtered" or something like that so it's clear this is not the raw LOKI file.

  • Not sure how to write out certain size CSVs in R - any ideas?

  • Perhaps quick fix is to split by 500k lines or some random amount that is under our file limit.
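One possible answer to the size question, sketched in Python for illustration only: rather than guessing a line count, track the encoded byte size of each row and start a new chunk when the next row would push the chunk over the limit. The function name and the choice to repeat the header in every chunk are assumptions; the same logic should carry over to R.

```python
def split_csv_by_size(lines, max_bytes):
    """Split CSV lines (header first) into chunks of at most max_bytes each.

    Each line is expected to include its trailing newline. The header is
    repeated at the top of every chunk.
    """
    header, rows = lines[0], lines[1:]
    header_size = len(header.encode("utf-8"))
    chunks = []
    current, size = [header], header_size
    for row in rows:
        row_size = len(row.encode("utf-8"))
        if header_size + row_size > max_bytes:
            raise ValueError("max_bytes cannot hold the header plus one row")
        if size + row_size > max_bytes:
            chunks.append(current)  # current chunk is full, start a new one
            current, size = [header], header_size
        current.append(row)
        size += row_size
    if len(current) > 1:
        chunks.append(current)
    return chunks


# Made-up data, only to show the call pattern.
lines = ["id,gt\n"] + [f"{c},1\n" for c in "abcde"]
for i, chunk in enumerate(split_csv_by_size(lines, max_bytes=14), start=1):
    print(f"LOKI_split_{i}_filtered: {len(chunk) - 1} rows")
```

Each chunk could then be written out under a name such as "LOKI_split_x_filtered", with `max_bytes` set to whatever the importer accepts.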

I'll work on these functions soon and let you know what I can come up with.

@csjalbert
Collaborator

Just a note that apparently 60 MB files no longer work with our importer. split_gtscore_loki_import.R is a new function that splits LOKI files into usable chunks, and it should be used with any contamination score function(s).


5 participants