GC + mappability corrections #4

tlesluyes · 2020-01-23T12:05:12Z

Hi. I’m not 100% sure whether this is an actual issue but I have some questions regarding the gc and mappability corrections.

The first question is: why don’t you consider using a single loess correction instead of two? That would be a single loess(reads~gc+map,data=whatever) instead of two loess(reads~covariate,data=whatever). Are there any statistical arguments for or against doing this?
https://github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R#L208
The second question is about the correction factors that are computed right after the segmentation: https://github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R#L115

So, those factors are computed early but applied later, during covariate corrections. The first covariate is GC, where you: 1) subsample bins to get 50,000 of them (that is x2s), 2) correct those with pre-computed factors, 3) try to fit a loess regression on those, 4) try to fit a loess regression on the entire dataset (that is x2) if the previous one fails and 5) apply correction factors for GC. Then the same process is performed for the mappability. Is this correct?
Does that imply that only subsampled bins are actually corrected for pre-computed factors at: https://github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R#L206
I ask because this seems to be performed only on the subset (x2s) and not on the whole dataset (x2). So, if the regression fails at that stage, then the fallback regression on the entire dataset uses non-corrected read values because x2 has not been adjusted. If things go really wrong, then both fits will fail for both covariates and coverages will never be adjusted with these factors. But what if things go right: are coverage values actually corrected twice? This correction is inside the loop that iterates over the covariates, so it seems like it’s applied every time. Am I reading this correctly, and is this intended? Would it make sense to adjust read coverages for the entire dataset with these pre-computed factors first (a single time) and then correct for covariates (regardless of whether you use one or two corrections, and of whether the regression fails on the subsampled regions)?
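
To make the proposed ordering concrete, here is a minimal sketch of it. It is not fragCounter's code and is written in Python rather than R. The names `correct_coverage` and `binned_median_fit` are hypothetical, a per-bin median smoother stands in for loess, and applying the pre-computed factors by division is an assumption made for illustration.

```python
import random
from statistics import median


def binned_median_fit(cov, reads, n_bins=10):
    """Stand-in for loess (an assumption): median of reads per equal-width covariate bin."""
    lo, hi = min(cov), max(cov)
    if hi == lo:
        raise ValueError("covariate has no spread; cannot fit")
    width = (hi - lo) / n_bins
    groups = {}
    for c, r in zip(cov, reads):
        idx = min(int((c - lo) / width), n_bins - 1)
        groups.setdefault(idx, []).append(r)
    medians = {i: median(v) for i, v in groups.items()}

    def predict(c):
        idx = min(max(int((c - lo) / width), 0), n_bins - 1)
        # Use the nearest populated bin if this one received no points.
        nearest = min(medians, key=lambda i: abs(i - idx))
        return medians[nearest]

    return predict


def correct_coverage(bins, factors, covariates=("gc", "map"), n_sub=50000, seed=0):
    """Apply pre-computed factors once to every bin, then correct each covariate in turn."""
    rng = random.Random(seed)
    # Step 1: factors are applied a single time, outside the covariate loop
    # (division is an assumption about how the factors are used).
    reads = [b["reads"] / f for b, f in zip(bins, factors)]
    for name in covariates:
        cov = [b[name] for b in bins]
        idx = rng.sample(range(len(bins)), min(n_sub, len(bins)))
        try:
            # Step 2: fit on the subsample first.
            predict = binned_median_fit([cov[i] for i in idx], [reads[i] for i in idx])
        except ValueError:
            # Step 3: fall back to the full data, which has already been adjusted.
            # A failure here is not caught and propagates to the caller.
            predict = binned_median_fit(cov, reads)
        # Step 4: divide out the fitted covariate effect; non-positive fits give NaN.
        fitted = [predict(c) for c in cov]
        reads = [r / p if p > 0 else float("nan") for r, p in zip(reads, fitted)]
    return reads
```

With this ordering the fallback fit sees the same adjusted values as the subsample fit, and the factors cannot be applied once per covariate.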


imielinski · 2020-01-23T13:11:42Z

Thanks - it’s been a while since this code was actually written, so I will try my best.

The iterative loess code is actually adapted from Gavin Ha’s HMMcopy package. Agreed, multivariate loess might be more appropriate, though I imagine sparsity would be the main issue, i.e. each multi-dimensional bin might have too few data points. Loess can get unstable in my experience and can then give very odd results. Doing this kind of iterative “marginal” correction is perhaps safer.

But short answer: yes, the numbers first get adjusted for GC, then the adjusted numbers get re-adjusted for mappability. In my experience the mappability correction does not do much.

I agree that the tryCatch is a bit funky; it would be better for the code to just fall on its sword if there is an error.

BTW I’m loving the detailed and thoughtful code review – can we hire you???
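
The sparsity concern can be illustrated with a rough count, sketched here in Python. This is only an analogy: loess works on local neighbourhoods rather than fixed bins, the function name `bin_occupancy` and the choice of 100 bins per axis are hypothetical, and the covariates are drawn uniformly and independently, which real GC and mappability values are not. The 50,000 points match the subsample size mentioned in the issue.

```python
import random


def bin_occupancy(n_points=50000, bins_per_axis=100, seed=0):
    """Compare how many points land in 1-D bins versus 2-D cells.

    Covariates are simulated as independent uniform draws on [0, 1);
    this is an assumption for illustration only.
    """
    rng = random.Random(seed)
    k = bins_per_axis
    counts_1d = [0] * k
    counts_2d = [0] * (k * k)
    for _ in range(n_points):
        gc_bin = min(int(rng.random() * k), k - 1)
        map_bin = min(int(rng.random() * k), k - 1)
        counts_1d[gc_bin] += 1
        counts_2d[gc_bin * k + map_bin] += 1
    return {
        "mean_per_1d_bin": n_points / k,
        "mean_per_2d_cell": n_points / (k * k),
        "min_per_1d_bin": min(counts_1d),
        "empty_2d_cells": sum(1 for c in counts_2d if c == 0),
    }
```

With these settings the average is 500 points per one-dimensional bin but only 5 per two-dimensional cell, so a joint fit has far less data to work with locally. Correlated covariates would leave even more cells empty.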


tlesluyes · 2020-01-24T15:30:03Z

No problem, happy to help! I wanted to have a clear understanding of what fragCounter and Dryclean actually do, as I have a strong interest in cleaning signal from CNAs. I'm glad my review will somewhat improve those tools. :)

PS: I really enjoy my current position so I'm afraid you cannot hire me (yet?) ;)

BW
