generate_sets #174

Open
rdmorin opened this issue Mar 11, 2021 · 1 comment
Labels
enhancement New feature or request priority.high

Comments

@rdmorin
Collaborator

rdmorin commented Mar 11, 2021

We currently have a function in oncopipe (generate_pairs) that handles the complex task of matching tumours with their matched normal samples. I think we need a similar function in oncopipe that automatically constructs more complex sample sets with other groupings (not always 1:1). One use case I've encountered, for patients with more than one tumour, is running some analyses/tools on ALL the tumour bams (or tumour_mafs etc.), so we need to know the sample ID of every sample that exists for that patient and have them grouped properly. Something along the lines of:

op.generate_sets(SAMPLES, sample_types=('tumour_genome','normal_genome'), grouping='patient')

This could return a data frame with one column per tumour genome that exists (named by the corresponding time point from a time_point column), or perhaps a single tumour_genome column that contains a list of genomes where more than one exists.

Another use case:
op.generate_sets(SAMPLES, sample_types=('tumour_genome','tumour_mrna'), grouping='tumour_sample')

This would return a data frame with one column for the tumour genome and another for the RNA-seq sample (or samples) for that patient. In this case, I've indicated that grouping would be at the level of the sample rather than the patient, so multiple time points would still end up in separate rows.
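To make the two use cases above concrete, here is a rough pandas sketch of the list-valued variant. It is not an existing oncopipe function: the name generate_sets_sketch, the combined sample_type column, and the biopsy_id column (standing in for whatever identifies a tumour sample) are all assumptions for illustration, and grouping takes a column name directly rather than 'patient' or 'tumour_sample'.

```python
import pandas as pd


def generate_sets_sketch(samples, sample_types, grouping="patient_id"):
    """Hypothetical sketch: one row per group, one list-valued column per sample type."""
    # Assumed columns: sample_id, sample_type, plus the grouping column.
    subset = samples[samples["sample_type"].isin(sample_types)]
    # Collect the sample IDs of each type within each group, then pivot types to columns.
    sets = (
        subset.groupby([grouping, "sample_type"])["sample_id"]
        .agg(list)
        .unstack("sample_type")
        .reindex(columns=list(sample_types))
    )
    # Groups lacking a given type get an empty list instead of NaN.
    for sample_type in sample_types:
        sets[sample_type] = sets[sample_type].map(
            lambda ids: ids if isinstance(ids, list) else []
        )
    return sets.reset_index()


# Toy samples table (made-up IDs, not real data).
SAMPLES = pd.DataFrame(
    {
        "sample_id": ["P1-T1-dna", "P1-T1-rna", "P1-T2-dna", "P1-N", "P2-T1-rna"],
        "patient_id": ["P1", "P1", "P1", "P1", "P2"],
        "biopsy_id": ["P1-A", "P1-A", "P1-B", "P1-N", "P2-A"],
        "sample_type": [
            "tumour_genome",
            "tumour_mrna",
            "tumour_genome",
            "normal_genome",
            "tumour_mrna",
        ],
    }
)

# Use case 1: all tumour and normal genomes per patient.
by_patient = generate_sets_sketch(
    SAMPLES, ("tumour_genome", "normal_genome"), grouping="patient_id"
)

# Use case 2: genome and RNA-seq grouped per tumour sample, so time points stay in separate rows.
by_sample = generate_sets_sketch(
    SAMPLES, ("tumour_genome", "tumour_mrna"), grouping="biopsy_id"
)
```

In this sketch a group that lacks one of the requested types still gets a row, with an empty list in that column, so the caller decides whether to keep or drop incomplete sets.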

@rdmorin rdmorin added the enhancement New feature or request label Mar 11, 2021
@lkhilton
Member

@oncogenomics I've made a handy reprex for you to mess around with to get the gist of what we're trying to accomplish. You should have permissions to be able to run this directly:

cd /projects/rmorin_scratch/Laura_temp/oncopipe_sandbox
snakemake -np -s test_generate_pairs.smk all

Essentially, we have results files generated by retrieving wildcards from a "runs" table, which is itself generated from the input samples table (the maf files in my example fit this description). We might also have input files that aren't paired. We want to be able to easily write rules that take outputs from different pipelines and join them on patient_id and/or surgical number.

The issue with the way I've done it is that not every RNAseq sample has a genome, so some of the rules generated on the dry run have an input bam but no input maf files. Ideally the generate_sets function would handle this and only set mRNA bam files that ALSO have one or more genome maf files as targets.
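One way the filtering described above could work is an inner join between the mRNA samples and the genome runs table, which drops any RNA-seq sample with no matching genome. This is only a sketch with made-up tables: the column names (sample_id, tumour_sample_id, patient_id) and the IDs are assumptions, not the contents of the sandbox.

```python
import pandas as pd

# Hypothetical mRNA samples; P2 has RNA-seq but no genome.
rna_samples = pd.DataFrame(
    {
        "sample_id": ["P1-rna", "P2-rna"],
        "patient_id": ["P1", "P2"],
    }
)

# Hypothetical genome runs table (one row per tumour genome with a maf).
genome_runs = pd.DataFrame(
    {
        "tumour_sample_id": ["P1-T1", "P1-T2"],
        "patient_id": ["P1", "P1"],
    }
)

# The inner join keeps only mRNA samples whose patient has at least one genome run.
paired = rna_samples.merge(genome_runs, on="patient_id", how="inner")

# One row per mRNA sample, with the list of genome samples whose mafs it needs.
targets = (
    paired.groupby(["patient_id", "sample_id"])["tumour_sample_id"]
    .agg(list)
    .reset_index(name="genome_sample_ids")
)
```

Building the target list from targets rather than from the full mRNA table would mean the dry run never produces a rule with a bam but no maf inputs.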


4 participants