Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UMI support #72

Merged
merged 11 commits into from
Feb 13, 2024
Merged

Add UMI support #72

merged 11 commits into from
Feb 13, 2024

Conversation

remiolsen
Copy link
Member

@remiolsen remiolsen added this to the 0.6.1 milestone Feb 8, 2024
@remiolsen remiolsen requested a review from alneberg February 8, 2024 13:40
@remiolsen
Copy link
Member Author

remiolsen commented Feb 8, 2024

@alneberg The last feature before release. I would like to merge this before #71 if you don't mind. I guess my solution to reconciling our code is to split the baby in half then to sew her back together again (if that analogy works). My point is that it's a bit ugly but it works.

"""
nt = re.compile("\*n([atcg])")
nts = "".join(re.findall(nt, cs_string))

if umi_before > 0:
nts = nts[umi_before:]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really work? If there are substitutions/deletions/insertions in the alignment both the reference and query nucleotide will be in the cs string right?

For example:
cg:Z:6M2D21M cs:Z::1*at:2*ac:1-ac:21
Or am I missing something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works if you add the correct number of N's into the template sequence. So we need to know that the index region is e.g., exactly 10 nt index + 9 nt UMI long. That way we can read the sequence directly from the CS string. Unrelated example:

cs:Z::33*nt*nc*nc*ng*ng*na*ng*na:9

Insertions and deletions will of course affect this perfect picture, but I don't think too terribly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I think we could use the fact that we're looking specifically for cases where there is an n to the left inside the *na* section. But we can leave that for later

@@ -38,6 +38,7 @@
entry_points={
"console_scripts": [
"anglerfish=anglerfish.anglerfish:anglerfish",
"anglerfish-explore=anglerfish.explore.cli:main",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Although I don't know what the effect of this is. Does it make the explore command executable or add it to the path?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With conda, and I guess with most package managers, both! So now this will work:

% anglerfish-explore --help
Usage: anglerfish-explore [OPTIONS]

Options:
  -f, --fastq TEXT                Fastq file to align  [required]
...

I'm thinking maybe for a 1.0 release we can unify these commands. I'm open to having them as subcommands, eg. anglerfish run and anglerfish explore

Copy link
Member

@alneberg alneberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! This breaks the explore command a tiny bit for UMI:s. I've located the issue and I should be able to create a fix for it. Separate PR I guess. Your call if it should be included in this release or not. If I can fix it quickly I think it would make sense to include it.

@remiolsen
Copy link
Member Author

Great work! This breaks the explore command a tiny bit for UMI:s. I've located the issue and I should be able to create a fix for it. Separate PR I guess. Your call if it should be included in this release or not. If I can fix it quickly I think it would make sense to include it.

I was thinking of doing a release on Friday, if you can't get a fix out in time it's not a huge deal.

Also I think this code, the adaptor handling parts, could do with simplifying or refactoring. For instance the separate I7 and I5 handling could be DRY'd up. But that's for another time.

@remiolsen remiolsen merged commit 580fc97 into master Feb 13, 2024
12 checks passed
@remiolsen remiolsen deleted the umi_support branch February 13, 2024 09:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants