-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kallisto gene vs transcript count #434
Comments
I'm sorry, but I'm not going to look through many lines of unformatted code to identify why you're observing a discrepancy. A few things to keep in mind: 1) Line numbers (and mtx row/col numbers) are 1-indexed while EC/transcript identifiers are 0-indexed. 2) Equivalence classes will NOT be the same between two different kallisto runs, leading to differences. |
Dear Delaney, Thanks. Best regards, |
Yes, I've been able to recapitulate my desired counts between TCCs and genes on my end. If you want, you can run the bustools commands on your BUS file, and manually run bustools count to get TCC matrices and gene count matrices from the same BUS file. (You can even download bustools from source and edit the bustools_count.cpp C++ file to manually output things). |
Just linking the original issue here for reference pachterlab/gget_examples#6 |
Dear Laura,
There is a discrepancy between the gene count and transcript count from the count matrices generated by Kallisto. I generated the gene count matrix and tcc matrix from the same set of fastq files. Subsequently, I compared the gene count vs the total of transcript count. I used only the equivalence classes which map only to the one gene of interest. I found that these counts do not match. Surprisingly, even when the transcript count, based on the equivalence classes which only map to the gene of interest, is greater than zero, the gene count for the corresponding cells is zero.
The details are as follows:
Software versions:
Kb_python 0.28.2
Kallisto 0.48
Count Matrix Commands:
The code reads a list of fastq files and creates a count matrix of either genes or tcc associated with each specimen and saves the file with the specimen prefix.
Gene count matrix :
system(paste("kb count -i index.idx -g t2g.txt -x 10xv2 -t 2 --overwrite", paste(fastq_fns[(2i-1):(2i)], collapse = " "), sep = " "), intern = TRUE)
my_files <- list.files(path = "./counts_unfiltered",pattern = "^cells_x_genes")
file.copy(from = paste0("./counts_unfiltered/", my_files), to = paste0("./cell_out/unfcounts/",specimens[2*i],"_", my_files))
Tcc_count matrix:
system(paste("kb count -i index.idx -g t2g.txt -x 10xv2 -t2 --tcc --overwrite", paste(fastq_fns[(2i-1):(2i)], collapse = " "), sep = " "), intern = TRUE)
my_files <- list.files(path = "./counts_unfiltered",pattern = "^cells_x_tcc")
file.copy(from = paste0("./counts_unfiltered/", my_files), to=paste0("./cell_out/unfcounts_tcc/",specimendata$specimenID[i],"_", my_files))
For the purpose of comparison of a particular gene vs tcc count, we only included the equivalent classes which had one gene mapping only.
#read the gene count matrix and identify rows in the gene count matrix matching to BIN1 gene
lib <- lib <- specimendata$specimenID[i]
count_file <- paste0(lib,"_cells_x_genes")
rowbin1 <- which(grepl(pattern = "^ENSG00000136717",rownames(res_matx))) #row 7729
res_mat <- read_count_output(dir = "./cell_out/unfcounts_tcc/", name = count_file, tcc = TRUE)
totalgene <- res_mat[rowbin1_tcc,]
#read the tcc count matrix, identify rows in the t2g file matching BIN1 and use only those rows for counting BIN1 transcripts
lib <- lib <- specimendata$specimenID[i]
count_file <- paste0(lib,"_cells_x_tcc")
rowbin1_tcc <- which(grepl(pattern = "\tBIN1$",t2g$V1)) #rows 28581 to 28594
res_mat <- read_count_output(dir = "./cell_out/unfcounts_tcc/", name = count_file, tcc = TRUE)
res_matx_tcc <- res_mat[rowbin1_tcc,]
#find total of 14 BIN1 transcripts for each cell
totalbin1 <- colSums(as.matrix(res_matx_tcc))
#compare res_matx and res_matx_tcc
table(totalgene,totalbin1)
totalbin1 (sum of count of 14 transcripts)
0 1 2 3 4 5 6
0 6877 933 151 29 15 1 0
1 1164 174 30 6 1 0 0
2 182 32 3 0 0 0 1
3 26 3 0 1 0 0 0
4 3 0 0 0 0 0 0
Please note above that when total of transcripts is >=1, the gene count is still predominantly zero.
Thanks.
Best regards,
Rajesh
The text was updated successfully, but these errors were encountered: