
remove_unused_fragments() disrupts the order of other dataframe attributes like nAA #269

Open
CRYoushiwo opened this issue Jan 8, 2025 · 6 comments

Comments

@CRYoushiwo

Describe the bug
By the time compress_fragment_indices() is called, frag_start_col has already been sorted in ascending order, so there is no need to call sort_values() beforehand. Doing so can disrupt the order of nAA and other order-dependent columns.

Similarly, restoring the original order with sort_index() can break the sorted order of frag_start_col. Replacing it with reset_index() avoids this.
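For illustration (synthetic data, not alphabase code), a minimal pandas sketch of the interplay described above between sort_values(), sort_index(), and reset_index():

```python
import pandas as pd

# Hypothetical minimal precursor df: nAA sorted, frag_start_idx not sorted.
df = pd.DataFrame({
    "nAA": [7, 8, 9],
    "frag_start_idx": [6, 0, 3],
})

# Sorting by frag_start_idx reorders the rows ...
sorted_df = df.sort_values("frag_start_idx")
assert not sorted_df["nAA"].is_monotonic_increasing  # nAA order is lost

# ... and sort_index() restores the nAA order but un-sorts frag_start_idx again.
restored = sorted_df.sort_index()
assert restored["nAA"].is_monotonic_increasing
assert not restored["frag_start_idx"].is_monotonic_increasing

# reset_index(drop=True) instead keeps the sorted frag_start_idx ordering.
kept = sorted_df.reset_index(drop=True)
assert kept["frag_start_idx"].is_monotonic_increasing
```

Whichever of the two restore strategies is used, one of the two orderings (input order vs. frag_start_idx order) is necessarily given up.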

Changes

  • fix the bug
  • add a test and test data
@CRYoushiwo CRYoushiwo changed the title remove_unused_fragments() disrupts the order of other dataframe attributes like nAA. remove_unused_fragments() disrupts the order of other dataframe attributes like nAA Jan 8, 2025
@GeorgWa
Collaborator

GeorgWa commented Jan 8, 2025

Hi, can you give a minimal example which shows the bug?
As I understand it, we pass a precursor df in a certain order and it is returned in a different order?

@CRYoushiwo
Author

OK [GeorgWa](https://github.com/GeorgWa), I will clarify this bug in more detail.

When running the following code:

precursor_df = refine_precursor_df(precursor_df)
print(f"precursor_df.frag_start_idx.is_monotonic_increasing: {precursor_df.frag_start_idx.is_monotonic_increasing}\n",
      f"(precursor_df.frag_start_idx.iloc[1:].values == precursor_df.frag_stop_idx.iloc[:-1].values).all(): {(precursor_df.frag_start_idx.iloc[1:].values == precursor_df.frag_stop_idx.iloc[:-1].values).all()}\n",
      f"precursor_df.nAA.is_monotonic_increasing: {precursor_df.nAA.is_monotonic_increasing}.")
precursor_df, _ = remove_unused_fragments(precursor_df, [fragment_intensity_df])
print(f"precursor_df.frag_start_idx.is_monotonic_increasing: {precursor_df.frag_start_idx.is_monotonic_increasing}\n",
      f"(precursor_df.frag_start_idx.iloc[1:].values == precursor_df.frag_stop_idx.iloc[:-1].values).all(): {(precursor_df.frag_start_idx.iloc[1:].values == precursor_df.frag_stop_idx.iloc[:-1].values).all()}\n",
      f"precursor_df.nAA.is_monotonic_increasing: {precursor_df.nAA.is_monotonic_increasing}.")

The results are as follows (each print statement emits three flags: frag_start_idx monotonic, start/stop contiguous, nAA monotonic):

Before remove_unused_fragments():
False
False
True

After remove_unused_fragments():
True
True
False

However, the desired results after the call should be:

True
True
True

This is just a specific and straightforward example. The key issue is that the remove_unused_fragments() function should maintain the order of the nAA column in precursor_df for any input, regardless of whether the nAA column is ordered or unordered initially.

@CRYoushiwo
Author



The bug was initially discovered during testing of AlphaPeptDeep. When the model performs inference, there are two possible ways to store the prediction results:

  1. Assigning the entire block at once
  2. Assigning results batch by batch

For the entire block assignment to work, the following conditions must be met:

(len(precursor_df) == 0) or (
    (precursor_df.index.values[0] == 0)
    and precursor_df.frag_start_idx.is_monotonic_increasing
    and (precursor_df.frag_start_idx.iloc[1:].values == precursor_df.frag_stop_idx.iloc[:-1].values).all()
    and precursor_df.nAA.is_monotonic_increasing
    and np.all(np.diff(precursor_df.index.values) == 1)
)

For further details, you can refer to the pDeepModel._set_batch_predict_data() function in the AlphaPeptDeep repository:
https://github.com/MannLabs/alphapeptdeep/blob/main/peptdeep/model/ms2.py.
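For illustration, the whole-block condition quoted above can be wrapped as a helper (can_assign_whole_block is a hypothetical name; the condition itself is the one shown, not the peptdeep source):

```python
import numpy as np
import pandas as pd

def can_assign_whole_block(precursor_df: pd.DataFrame) -> bool:
    """Sketch of the whole-block condition quoted above."""
    if len(precursor_df) == 0:
        return True
    return bool(
        precursor_df.index.values[0] == 0
        and precursor_df.frag_start_idx.is_monotonic_increasing
        and (precursor_df.frag_start_idx.iloc[1:].values
             == precursor_df.frag_stop_idx.iloc[:-1].values).all()
        and precursor_df.nAA.is_monotonic_increasing
        and np.all(np.diff(precursor_df.index.values) == 1)
    )

# Contiguous, sorted case -> whole-block assignment is allowed.
ok = pd.DataFrame({"nAA": [7, 8], "frag_start_idx": [0, 6], "frag_stop_idx": [6, 13]})
assert can_assign_whole_block(ok)

# nAA out of order -> must fall back to batch-by-batch assignment.
bad = ok.assign(nAA=[8, 7])
assert not can_assign_whole_block(bad)
```

This is exactly the condition that fails after remove_unused_fragments() un-sorts nAA, forcing the slower batch-by-batch path.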

I hope my understanding is correct, and I apologize if my explanation has caused you any confusion.🥺

@GeorgWa
Collaborator

GeorgWa commented Jan 10, 2025

The Bug

At the moment remove_unused_fragments works as intended. In alphaBase, there is no requirement that the dataframe be "refined". Refinement can be useful for some applications of alphaBase, like peptdeep, but if we made sure that everything is refined after every operation, it would be very compute-intensive.

Instead, a dataframe can be refined by the user which allows functions to use optimizations. The function for this is refine_precursor_df. We currently do not have a function to refine a fragment df.

Furthermore, remove_unused_fragments works as intended in that it should return the dataframe in the same order as it was passed. This is why we sort first and restore the order later. Also, remove_precursor_df currently does not create the fragment df in a continuous manner, because this hasn't been implemented yet.

Your Solution

First, remove_unused_fragments should always return the precursor df in the same order. I think the better solution would be to call refine_precursor_df first (which orders by nAA) and then call remove_unused_fragments.

The solution you provide does not ensure that frag_start_idx is monotonic BEFORE it is passed to compress_fragment_indices, even though this is an optimized function which requires it. So it most likely won't work.

See docstring:

@nb.njit()
def compress_fragment_indices(frag_idx):
    """
    recalculates fragment indices to remove unused fragments. Can be used to compress a fragment library.
    Expects fragment indices to be ordered by increasing values (!!!).
    It should be O(N) runtime with N being the number of fragment rows.
    """

Lastly, we need a unittest which checks if the fragment-precursor pairing is still valid. Otherwise, we can't be sure that the fragments are still the same. Instead of using random fragment samples, for example assign 123, 456, 789 etc. and test if the changed frag_start_idx and frag_stop_idx still point to the same numbers.
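The sentinel-based pairing check described above could look like this sketch (check_pairing, the column names, and the data are hypothetical illustrations):

```python
import pandas as pd

def check_pairing(precursor_df, fragment_df, expected):
    """Sketch of the pairing check: for each precursor, the fragment rows in
    [frag_start_idx, frag_stop_idx) must still hold that precursor's sentinel."""
    for _, row in precursor_df.iterrows():
        block = fragment_df["intensity"].iloc[row.frag_start_idx:row.frag_stop_idx]
        assert (block == expected[row.precursor_id]).all(), row.precursor_id

# Sentinels instead of random values: precursor 0 -> 123, 1 -> 456, 2 -> 789.
precursor_df = pd.DataFrame({
    "precursor_id": [0, 1, 2],
    "frag_start_idx": [0, 3, 5],
    "frag_stop_idx": [3, 5, 7],
})
fragment_df = pd.DataFrame({"intensity": [123, 123, 123, 456, 456, 789, 789]})
check_pairing(precursor_df, fragment_df, {0: 123, 1: 456, 2: 789})
```

After remove_unused_fragments(), the same check would be rerun on the returned frames: the remapped frag_start_idx/frag_stop_idx must still point at the same sentinels.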

Suggested Solution

Let's split this into two tasks. First, the user (or peptdeep) calls refine_precursor_df which orders by nAA, then we call remove_unused_fragments.

We also have to change remove_unused_fragments and compress_fragment_indices so they not only remove unused fragments but also reorder the fragments while maintaining the mapping.
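A sketch of what such a combined reorder-and-compress step could look like (the function name and behavior are my assumption, not alphabase code): gather each precursor's fragment rows in the precursor df's current row order, then renumber the indices gap-free.

```python
import numpy as np
import pandas as pd

def reorder_fragments_to_precursor_order(precursor_df, fragment_df):
    """Sketch: gather each precursor's fragment rows in precursor_df's current
    row order and remap frag_start_idx/frag_stop_idx to a gap-free,
    monotonically increasing layout. Unused fragment rows are dropped
    implicitly because only referenced slices are kept."""
    pieces = []
    new_start = np.empty(len(precursor_df), dtype=np.int64)
    new_stop = np.empty(len(precursor_df), dtype=np.int64)
    offset = 0
    for i, (start, stop) in enumerate(
        zip(precursor_df["frag_start_idx"], precursor_df["frag_stop_idx"])
    ):
        pieces.append(fragment_df.iloc[start:stop])
        new_start[i] = offset
        offset += stop - start
        new_stop[i] = offset
    out_precursor = precursor_df.copy()
    out_precursor["frag_start_idx"] = new_start
    out_precursor["frag_stop_idx"] = new_stop
    out_fragment = (
        pd.concat(pieces, ignore_index=True) if pieces else fragment_df.iloc[0:0]
    )
    return out_precursor, out_fragment

# Precursors reference blocks [4, 6) and [0, 2); rows 2-3 are unused.
precursors = pd.DataFrame({"frag_start_idx": [4, 0], "frag_stop_idx": [6, 2]})
fragments = pd.DataFrame({"intensity": [100, 101, 902, 903, 200, 201]})
new_precursors, new_fragments = reorder_fragments_to_precursor_order(precursors, fragments)
# new frag_start_idx == [0, 2]; the unused rows 902/903 are gone
```

Because the output indices are monotonic by construction, this variant works regardless of the input ordering, at the cost of a gather/concat instead of a pure in-place renumbering.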

@CRYoushiwo
Author

Thx, GeorgWa

  1. As mentioned earlier, this is just a specific and straightforward example; I believe it is a minimal one. My primary use of remove_unused_fragments() is to obtain a precursor_df sorted by nAA. However, I noticed that remove_unused_fragments() modifies the input order in the following line:

precursor_df = precursor_df.sort_values([frag_start_col], ascending=True)

This step shows that the input order is altered during the process.

  2. I have reviewed the docstring of compress_fragment_indices() and noticed the statement "Expects fragment indices to be ordered by increasing values." However, I am unclear why this strict requirement is necessary. Could you provide a simple example demonstrating what could go wrong if this condition is not met?

  3. Splitting this process into two tasks is undoubtedly the correct approach. I have tried combining refine_precursor_df() with the original remove_unused_fragments(), but it did not work as expected. Perhaps there is a detail I have overlooked?

@GeorgWa
Collaborator

GeorgWa commented Jan 13, 2025

  1. It should have no effect on the input order, as we restore it afterwards:

# order by frag_start_col
# calling reset_index would be even better to make sure the index is clean
precursor_df = precursor_df.sort_values([frag_start_col], ascending=True)
# ... some code
# restore the old input order
precursor_df = precursor_df.sort_index()

  2. Yes, this is why I was concerned about the implementation. To my knowledge the ordering is necessary (given the explicit mention in the docstring). If you think it's not, let's remove it from the docstring and add tests which "prove" it. If the reordering turns out to be unnecessary, that would solve the problem.
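A minimal check of this round trip (synthetic data; it also illustrates why a clean index matters, per the reset_index remark above):

```python
import pandas as pd

# With a clean RangeIndex, sort_values followed by sort_index is a no-op round trip.
df = pd.DataFrame({"nAA": [9, 7, 8], "frag_start_idx": [5, 0, 3]})
round_trip = df.sort_values("frag_start_idx").sort_index()
assert round_trip.equals(df)

# With a non-default index (e.g. left over from earlier filtering), sort_index
# orders by that stale index, not by the original row positions.
dirty = df.set_axis([2, 0, 1], axis=0)
restored = dirty.sort_values("frag_start_idx").sort_index()
assert list(restored["nAA"]) == [7, 8, 9]  # index order, not input order [9, 7, 8]
```

So the restore-via-sort_index pattern is only order-preserving when the incoming index is a clean 0..N-1 range.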

What do you think about:

  • Add a unittest to check if compress_fragment_indices() works without ordered frag_start_idx
  • Add a unittest on the fragment level which checks if remove_unused_fragments works without ordered frag_start_idx.

Sorry for the confusion, I hope this way we can address it and merge your PR :)
