(feat) initial centroid implementation for tdf #156

Draft
wants to merge 1 commit into
base: master

jspaezp
Contributor

@jspaezp jspaezp commented Oct 21, 2024

What is this?

This implements a very simple centroiding strategy for TDF (Bruker) data, which should enable LFQ on it.
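
For readers who want a picture of what "very simple centroiding" can look like, here is a minimal sketch. It is not necessarily what this PR does: the `Peak` type, the `centroid` function, and the fixed Dalton tolerance `tol_da` are all assumptions made for illustration. It flattens a frame's peaks along m/z, merges neighbours that sit within the tolerance of each other, and reports an intensity-weighted m/z with summed intensity. Mobility is ignored entirely.

```rust
/// Hypothetical (m/z, intensity) pair; not a sage or timsrust type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Peak {
    pub mz: f64,
    pub intensity: f64,
}

/// Merge peaks whose m/z lies within `tol_da` of the previous peak.
/// Each cluster becomes one peak: intensity-weighted mean m/z, summed intensity.
/// Note: because each peak is compared to its left neighbour, long runs of
/// closely spaced peaks chain into a single cluster.
pub fn centroid(mut peaks: Vec<Peak>, tol_da: f64) -> Vec<Peak> {
    // Sort by m/z so that cluster members are adjacent.
    peaks.sort_by(|a, b| a.mz.total_cmp(&b.mz));

    let mut out = Vec::new();
    let mut sum_int = 0.0_f64; // summed intensity of the open cluster
    let mut sum_mz_int = 0.0_f64; // sum of m/z * intensity of the open cluster
    let mut last_mz = f64::NEG_INFINITY;

    for p in peaks {
        // Close the open cluster when the gap to the previous peak is too large.
        if sum_int > 0.0 && p.mz - last_mz > tol_da {
            out.push(Peak { mz: sum_mz_int / sum_int, intensity: sum_int });
            sum_int = 0.0;
            sum_mz_int = 0.0;
        }
        sum_int += p.intensity;
        sum_mz_int += p.mz * p.intensity;
        last_mz = p.mz;
    }
    // Flush the last cluster (zero-intensity-only clusters are dropped).
    if sum_int > 0.0 {
        out.push(Peak { mz: sum_mz_int / sum_int, intensity: sum_int });
    }
    out
}
```

Taking the tolerance as an argument rather than a constant keeps the door open for reading it from a config later; a ppm-based tolerance would likely be a better fit for real data than the absolute one assumed here.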

Why is it needed?

Right now the spectrum reader in timsrust does not export MS1 spectra ... @sander-willems-bruker might have a better idea as to why.

What is still missing

  1. While working on this, I noticed that the injection time reported for MS2 spectra is actually the retention time. @lazear, I am assuming this field is meant to be the injection time used to collect ions (usually in milliseconds, right?). @sander-willems-bruker, is there any reason why this is used here instead of the real injection time + correction factor?
  2. Right now the MS2 scans use the 'precursor index' as an index, whilst the MS1 scans use the frame index, which means that we might have collisions in indices (which ... I don't think should be a problem ...).
  3. Right now there is a parameter in the centroiding that is hard-coded ... we could propagate it from the config if we wanted to.
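
If the index collisions in item 2 ever do become a problem, one possible way out is to tag each index with its origin. The sketch below is purely illustrative: `SpectrumKey` and `count_unique` are hypothetical names, not part of sage or timsrust.

```rust
use std::collections::HashSet;

/// Hypothetical key that records which numbering scheme an index came from.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum SpectrumKey {
    /// MS1 spectrum, identified by its frame index.
    Frame(usize),
    /// MS2 spectrum, identified by its precursor index.
    Precursor(usize),
}

/// Count distinct keys; equal raw numbers from different schemes stay distinct.
pub fn count_unique(keys: &[SpectrumKey]) -> usize {
    keys.iter().copied().collect::<HashSet<_>>().len()
}
```

With this, `SpectrumKey::Frame(5)` and `SpectrumKey::Precursor(5)` hash and compare as different keys, whereas two bare `5`s would collide.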

FYI

@treitpeter

LMK what you think! I will wait a bit to get feedback on API design+thoughts before I do a final "ready to review" PR.

@treitpeter

@jspaezp This is amazing, I really appreciate it. It would be really nice to figure this out; many people are using the Bruker platform. If we can contribute more RAW files, let me know. I have hundreds of them, from HeLa, to sg. species, to community (SiHuMi_X), to stool files, so we can do proper tests. Please keep me in the loop, I'm happy to test at any time.

@jspaezp
Contributor Author

jspaezp commented Oct 22, 2024

I think we can start using some public data (https://www.ebi.ac.uk/pride/archive/projects/PXD028735 + https://pmc.ncbi.nlm.nih.gov/articles/PMC8967878/)

DDA, for my future reference ...

ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_A_Sample_Alpha_01.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_A_Sample_Alpha_02.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_A_Sample_Alpha_03.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_A_Sample_Beta_01.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_A_Sample_Beta_02.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_A_Sample_Beta_03.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_A_Sample_Gamma_01.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_A_Sample_Gamma_02.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_A_Sample_Gamma_03.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_B_Sample_Alpha_01.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_B_Sample_Alpha_02.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_B_Sample_Alpha_03.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_B_Sample_Beta_01.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_B_Sample_Beta_02.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_B_Sample_Beta_03.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_B_Sample_Gamma_01.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_B_Sample_Gamma_02.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_PASEF_Condition_B_Sample_Gamma_03.d

diaPASEF

ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_A_Sample_Alpha_01.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_A_Sample_Alpha_02.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_A_Sample_Alpha_03.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_A_Sample_Beta_01.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_A_Sample_Beta_02.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_A_Sample_Beta_03.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_A_Sample_Gamma_01.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_A_Sample_Gamma_02.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_A_Sample_Gamma_03.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_B_Sample_Alpha_01.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_B_Sample_Alpha_02.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_B_Sample_Alpha_03.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_B_Sample_Beta_01.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_B_Sample_Beta_02.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_B_Sample_Beta_03.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_B_Sample_Gamma_01.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_B_Sample_Gamma_02.d
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/10/PXD010000/LFQ_timsTOFPro_diaPASEF_Condition_B_Sample_Gamma_03.d

I feel like MS2 quant is out of the scope of sage ... so I will not be implementing that anytime soon.

@sander-willems-bruker
Contributor

What is this?

This basically implements a very simple centroiding strategy for tdf (bruker) data which should enable using LFQ on it.

Why is it needed?

Right now the spectrum reader for timsrust does not export MS1's ... @sander-willems-bruker might have a better idea as to why.

The reason for it not being there is that it is a somewhat strange construct for TIMS data, because you throw away all IM information. While there might be use cases where it works just fine, I think centroiding without this information will produce relatively poor quantifications.

What is still missing

  1. While working on this, I noticed that the injection time reported for MS2 spectra is actually the retention time. @lazear, I am assuming this field is meant to be the injection time used to collect ions (usually in milliseconds, right?). @sander-willems-bruker, is there any reason why this is used here instead of the real injection time + correction factor?

This indeed is probably incorrect. We should be able to parse this out of timsrust and propagate this correctly.

  2. Right now the MS2 scans use the 'precursor index' as an index, whilst the MS1 scans use the frame index, which means that we might have collisions in indices (which ... I don't think should be a problem ...).
  3. Right now there is a parameter in the centroiding that is hard-coded ... we could propagate it from the config if we wanted to.

FYI

@treitpeter

LMK what you think! I will wait a bit to get feedback on API design+thoughts before I do a final "ready to review" PR.

If this is a highly requested community feature, doesn't it make more sense to implement a "ms1_spectrum_reader" directly in TimsRust...?

@jspaezp
Contributor Author

jspaezp commented Oct 25, 2024

The reason for it not being there is that it is a somewhat strange construct for TIMS data, because you throw away all IM information. While there might be use cases where it works just fine, I think centroiding without this information will produce relatively poor quantifications.

I mean ... yes, but that is already what we are doing for the "DDA on DIA data" here; I'm not 100% sure why that is dramatically different.

If this is a highly requested community feature, doesn't it make more sense to implement a "ms1_spectrum_reader" directly in TimsRust...?

I would argue there is some demand ... not sure if you want to commit to a specific implementation of the centroiding in the crate.

Having said that ... this draft PR is definitely a patch ...
An alternative and much more laborious way to implement LFQ would be to extend the "processed spectrum" representation in sage to include mobility information, together with a way to propagate the IMS information from the detection to the peptide index (PrecursorRange, more accurately). That information could then be used in the query here:

sage/crates/sage/src/lfq.rs

Lines 218 to 252 in 888afad

log::info!("tracing MS1 features");
spectra
    .par_iter()
    .filter(|s| s.level == 1)
    .for_each(|spectrum| {
        let a = alignments[spectrum.file_id];
        let rt = (spectrum.scan_start_time / a.max_rt) * a.slope + a.intercept;
        let query = self.rt_slice(rt, RT_TOL);
        for peak in &spectrum.peaks {
            for entry in query.mass_lookup(peak.mass) {
                let id = match self.settings.combine_charge_states {
                    true => PrecursorId::Combined(entry.peptide),
                    false => PrecursorId::Charged((entry.peptide, entry.charge)),
                };
                let mut grid = scores.entry((id, entry.decoy)).or_insert_with(|| {
                    let p = &db[entry.peptide];
                    let composition = p
                        .sequence
                        .iter()
                        .map(|r| composition(*r))
                        .sum::<Composition>();
                    let dist = crate::isotopes::peptide_isotopes(
                        composition.carbon,
                        composition.sulfur,
                    );
                    Grid::new(entry, RT_TOL, dist, alignments.len(), GRID_SIZE)
                });
                grid.add_entry(rt, entry.isotope, spectrum.file_id, peak.intensity);
            }
        }
    });

We would still need to centroid in that case, since we need something that returns spectra with peaks in the hundreds (sage usually retains ~150/250 ... depending on params) and definitely not the 200,000 that are common in an MS1 frame.
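
To make the mobility-aware alternative a bit more concrete, here is a rough sketch of what matching with an extra ion-mobility constraint could look like. Everything in it is hypothetical: `MobilityPeak`, `PrecursorWindow`, and `summed_intensity` do not exist in sage, and the absolute tolerances are placeholders. The idea is only that a peak contributes to a precursor's quantification when it falls inside both the mass window and the mobility window.

```rust
/// Hypothetical MS1 peak that keeps its ion mobility next to mass and intensity.
#[derive(Debug, Clone, Copy)]
pub struct MobilityPeak {
    pub mass: f64,
    pub mobility: f64,
    pub intensity: f64,
}

/// Hypothetical query window for a single precursor.
#[derive(Debug, Clone, Copy)]
pub struct PrecursorWindow {
    pub mass: f64,
    pub mass_tol: f64,
    pub mobility: f64,
    pub mobility_tol: f64,
}

impl PrecursorWindow {
    /// True when the peak lies inside both the mass and the mobility tolerance.
    pub fn contains(&self, p: &MobilityPeak) -> bool {
        (p.mass - self.mass).abs() <= self.mass_tol
            && (p.mobility - self.mobility).abs() <= self.mobility_tol
    }
}

/// Sum the intensity of all peaks that match the window.
pub fn summed_intensity(peaks: &[MobilityPeak], window: &PrecursorWindow) -> f64 {
    peaks
        .iter()
        .filter(|p| window.contains(p))
        .map(|p| p.intensity)
        .sum()
}
```

In this picture, a co-eluting peak with the same mass but a different mobility is excluded, which is exactly the information a mobility-blind centroid throws away.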

@lazear
Owner

lazear commented Oct 26, 2024

which we would still need to centroid, since we would need something that returns spectra with peaks in the hundreds (sage retains ~150/250 usually ... depending on params) and definitely not the 200,000 that are common on an ms1 frame.

FWIW, all MS1 peaks are retained - but yes, centroiding to some degree will probably be necessary if a single frame has 200k peaks

@jspaezp
Contributor Author

jspaezp commented Oct 27, 2024

The more you know ... I didn't realize the processing was different for MS1/MS2


4 participants