Support pre-loaded annotation data structures #87
Comments
Interesting idea, I could have a think about how to implement it. The core of scPipe was originally pure C++, which is why everything tended to be file-based. Fundamentally what we get out of the annotation is an
The lazier solution could be just to write a wrapper using tempfiles to do exactly what you described, but in the background and automatically cleaned up. I'd imagine it's much slower and has a bit more disk usage overhead, but that is potentially negligible in the context of the remainder of the pipeline.
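For illustration, the tempfile wrapper idea could look roughly like the sketch below. It is written in plain C++17 rather than R so that it stands alone, and it is not scPipe code: `SafRow`, `TempFile`, `parse_in_memory` and `count_records_from_file` are hypothetical names, and the "parser" is only a stand-in for whatever file-based routine already exists. The point is the shape: dump the in-memory records to a temporary file, hand the path to the existing parser, and let a scope guard delete the file afterwards.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// One SAF-style annotation record (hypothetical in-memory layout).
struct SafRow {
    std::string gene_id;
    std::string chr;
    long start;
    long end;
    char strand;
};

// RAII guard: picks a path in the system temp directory and deletes the
// file, if it exists, when the guard goes out of scope.
class TempFile {
public:
    explicit TempFile(const std::string& suffix) {
        static unsigned counter = 0;
        const auto stamp = std::chrono::steady_clock::now().time_since_epoch().count();
        path_ = fs::temp_directory_path() /
                ("annotation_" + std::to_string(stamp) + "_" +
                 std::to_string(counter++) + suffix);
    }
    ~TempFile() {
        std::error_code ec;  // non-throwing overload: never throw from a destructor
        fs::remove(path_, ec);
    }
    TempFile(const TempFile&) = delete;
    TempFile& operator=(const TempFile&) = delete;
    const fs::path& path() const { return path_; }

private:
    fs::path path_;
};

// Stand-in for an existing file-based parser (assumption): it just counts
// the data lines after the header.
std::size_t count_records_from_file(const fs::path& path) {
    std::ifstream in(path);
    std::string line;
    std::size_t n = 0;
    bool header = true;
    while (std::getline(in, line)) {
        if (header) { header = false; continue; }
        if (!line.empty()) ++n;
    }
    return n;
}

// Write the rows to a temp file, run the file-based parser on it, and let
// the guard remove the file when this function returns or throws.
template <typename Parser>
auto parse_in_memory(const std::vector<SafRow>& rows, Parser parser) {
    TempFile tmp(".saf");
    {
        std::ofstream out(tmp.path());
        if (!out) throw std::runtime_error("cannot open temporary annotation file");
        out << "GeneID\tChr\tStart\tEnd\tStrand\n";
        for (const SafRow& r : rows) {
            out << r.gene_id << '\t' << r.chr << '\t' << r.start << '\t'
                << r.end << '\t' << r.strand << '\n';
        }
    }  // stream is flushed and closed here, before the parser reads the file
    return parser(tmp.path());
}

// Usage: the caller never sees a file, and none is left behind.
void example_usage() {
    const std::vector<SafRow> rows = {
        {"geneA", "chr1", 100, 200, '+'},
        {"geneA", "chr1", 300, 400, '+'},
        {"geneB", "chr2", 50, 80, '-'},
    };
    fs::path seen;
    const std::size_t n = parse_in_memory(rows, [&](const fs::path& p) {
        seen = p;
        return count_records_from_file(p);
    });
    assert(n == 3);
    assert(!fs::exists(seen));
}
```

The extra cost is one write and one read of the annotation, which fits the expectation above that the overhead is small relative to the rest of the pipeline.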
I imagine that this would actually be fairly easy, if you're willing to take I/O out of C++. Specifically, you can
It is true that reading the annotation file into R will increase memory usage; but you already have to load the entire file into memory anyway to create your equivalent C++ structure, and any memory overhead from annotation is likely to be of fixed size and negligible (compared to the sequencing data itself). In my packages, I use R to do all of my I/O except for things related to heavy lifting from the BAM file. This is one considerable advantage in writing R packages versus writing standalone C++ binaries.
+1 for looking into dumping responsibility for file parsing onto rtracklayer.
Yes. I didn't have a clear overall design at first. I wrote the C++ part first as a stand-alone command-line tool and decided to build an R package later, so the design is not very R-centric. The file-based functions are inconvenient in R, and I should add those features. BTW, what are your thoughts on Rcpp conversion vs. direct SEXP? I feel you use both.
Use Rcpp. I switched a few years ago and I haven't looked back. From my perspective:
There's probably a whole bunch of other things I've taken for granted. Long story short, for R package development, Rcpp has definitely been a plus. Or a plus-plus, so to speak.
Had a look using ENSEMBL
GENCODE
RefSeq
Many columns are hidden for brevity, but take my word that they don't contain anything useful for our purposes. Primarily, our goal is to have an exon-gene relationship where the gene is defined by the appropriate ID type. For GENCODE this is trivial, as each entry of type
If you know any easy way out of this mess I'd love to hear it. Otherwise the plan would be to write the annotation-source-specific parsing code in R to obtain a 4-column data frame of
I never have this problem when I'm using GTF files for
If you must take some parent-based format as input, I'd suggest breaking up the logic as follows:
So yes, your plan seems appropriate, though it seems like you chose a difficult format to start with.
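As a rough illustration of what resolving a parent-based format involves (this is not the breakdown suggested in the comment above, which is not reproduced here), the sketch below maps each exon to a gene by following Parent links upward. It is plain C++ for self-containment. `Feature`, `ExonGene` and `resolve_exon_genes` are hypothetical names, and it assumes each feature has at most one parent and that gene-level features have type `gene`. Real files may list several parents or use other gene-level type names, in which case the type test would need adjusting.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal view of a GFF3-like record (hypothetical layout).
struct Feature {
    std::string id;      // value of the ID attribute (may be empty)
    std::string type;    // e.g. "gene", "mRNA", "exon"
    std::string parent;  // value of the Parent attribute (empty if none)
    long start;
    long end;
};

// One row of the exon-to-gene table we want to end up with.
struct ExonGene {
    std::string gene_id;
    long start;
    long end;
};

std::vector<ExonGene> resolve_exon_genes(const std::vector<Feature>& features) {
    // Step 1: index every feature that has an ID.
    std::unordered_map<std::string, const Feature*> by_id;
    for (const Feature& f : features) {
        if (!f.id.empty()) by_id.emplace(f.id, &f);
    }

    // Step 2: for each exon, walk up the Parent chain until a gene is found.
    std::vector<ExonGene> out;
    for (const Feature& f : features) {
        if (f.type != "exon") continue;
        const Feature* cur = &f;
        std::size_t hops = 0;  // bounds the walk so a cyclic file cannot hang us
        while (cur != nullptr && cur->type != "gene" && hops < features.size()) {
            const auto it = by_id.find(cur->parent);
            cur = (it == by_id.end()) ? nullptr : it->second;
            ++hops;
        }
        // Exons whose chain never reaches a gene are silently dropped here.
        if (cur != nullptr && cur->type == "gene") {
            out.push_back({cur->id, f.start, f.end});
        }
    }
    return out;
}

// Usage: two exons resolve through their transcript, one is orphaned.
void example_usage() {
    const std::vector<Feature> features = {
        {"gene1", "gene", "", 100, 900},
        {"tx1", "mRNA", "gene1", 100, 900},
        {"ex1", "exon", "tx1", 100, 200},
        {"ex2", "exon", "tx1", 300, 400},
        {"ex3", "exon", "missing_tx", 500, 600},
    };
    const std::vector<ExonGene> table = resolve_exon_genes(features);
    assert(table.size() == 2);
    assert(table[0].gene_id == "gene1");
    assert(table[1].start == 300);
}
```

Whether orphaned exons should be dropped, reported, or treated as an error is a design choice the sketch does not settle.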
Some support incoming. The goal is to generate a SAF-style data.table (http://bioinf.wehi.edu.au/featureCounts/). I've written up helper functions to import different annotation formats; the major upside of using
So the strategy is to extract all entries with
In C++, code has been written to accept SAF as an Rcpp::DataFrame and construct the necessary annotation object. What is left to do is:
In my branch it's now possible to use a single GRanges object as annotation. I don't consider the feature quite complete yet. Additional work:
I may also need to look into GRangesList and greater integration with TxDb (which I have no experience with).
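To make the SAF-to-annotation step mentioned earlier in this thread more concrete, here is a minimal sketch of building a per-chromosome lookup from SAF-style columns. It is not the code in the branch: `SafTable`, `Exon`, `Annotation`, `build_annotation` and `genes_at` are hypothetical names. Plain `std::vector` columns stand in for what would be the columns of an `Rcpp::DataFrame` so that the example compiles without R, and coordinates are treated as closed intervals, which is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <tuple>
#include <vector>

// Column-oriented table, mirroring a data frame with SAF columns.
struct SafTable {
    std::vector<std::string> gene_id;
    std::vector<std::string> chr;
    std::vector<long> start;
    std::vector<long> end;
    std::vector<char> strand;
};

struct Exon {
    long start;
    long end;
    char strand;
    std::string gene_id;
};

// Chromosome name -> exons sorted by (start, end).
using Annotation = std::map<std::string, std::vector<Exon>>;

Annotation build_annotation(const SafTable& saf) {
    const std::size_t n = saf.gene_id.size();
    if (saf.chr.size() != n || saf.start.size() != n ||
        saf.end.size() != n || saf.strand.size() != n) {
        throw std::invalid_argument("SAF columns must have equal length");
    }
    Annotation ann;
    for (std::size_t i = 0; i < n; ++i) {
        if (saf.start[i] > saf.end[i]) {
            throw std::invalid_argument("SAF row has start > end");
        }
        ann[saf.chr[i]].push_back(
            {saf.start[i], saf.end[i], saf.strand[i], saf.gene_id[i]});
    }
    for (auto& entry : ann) {
        std::sort(entry.second.begin(), entry.second.end(),
                  [](const Exon& a, const Exon& b) {
                      return std::tie(a.start, a.end) < std::tie(b.start, b.end);
                  });
    }
    return ann;
}

// Gene IDs whose exons cover `pos` on `chr`, without duplicates.
std::vector<std::string> genes_at(const Annotation& ann,
                                  const std::string& chr, long pos) {
    std::vector<std::string> hits;
    const auto it = ann.find(chr);
    if (it == ann.end()) return hits;
    for (const Exon& e : it->second) {
        if (e.start > pos) break;  // sorted by start: no later exon can cover pos
        if (pos <= e.end &&
            std::find(hits.begin(), hits.end(), e.gene_id) == hits.end()) {
            hits.push_back(e.gene_id);
        }
    }
    return hits;
}

// Usage: rows arrive unsorted and are grouped and ordered per chromosome.
void example_usage() {
    SafTable saf;
    saf.gene_id = {"geneB", "geneA", "geneA"};
    saf.chr = {"chr1", "chr1", "chr2"};
    saf.start = {500, 100, 10};
    saf.end = {600, 200, 20};
    saf.strand = {'-', '+', '+'};

    const Annotation ann = build_annotation(saf);
    assert(ann.size() == 2);
    assert(ann.at("chr1").front().gene_id == "geneA");
    assert(genes_at(ann, "chr1", 150) == std::vector<std::string>{"geneA"});
    assert(genes_at(ann, "chr1", 300).empty());
}
```

Because the input is already a table in memory, annotation edits such as filtering features or adding custom ones can happen before this step without a round trip through a file.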
Using sc_exon_mapping as an example: it would be convenient to be able to supply a GRanges object directly as annofn=, especially in situations where some editing of the annotation files is necessary, e.g., to filter out undesirable features or to add custom features. Currently I've been loading files in, editing the resulting GRanges, then saving them back to a new GFF3 file via rtracklayer. I'd like to skip the last step.
The same request applies to sc_demultiplex and sc_gene_counting, which seem to take the path of the annotation file but not a data.frame or DataFrame or something in-memory. It is often the case that I need to load such files in anyway because they are not exactly formatted as required for these functions.