Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Umi format explore fix #73

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
46 changes: 33 additions & 13 deletions anglerfish/demux/adaptor.py
Original file line number Diff line number Diff line change
Expand Up @@ -90,24 +90,44 @@ def __init__(self, sequence, name, delim, index):
self.name = name
self.delim = delim
self.index = index
self.umi_after = 0
self.umi_before = 0
self.len_after_index = 0
self.len_before_index = 0

# Dynamically assign attributes
self.umi = re.findall(udelim, self.sequence)

# TODO Duplicated from Adaptor class, will be merged later
Copy link
Member

@remiolsen remiolsen Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not happy with this part when I first wrote it, and am not happy with seeing for a second time. Problem might be I wrote it too generic. In reality it's only the first case that is valid, i.e. the INDEX+UMI configuration provided by that company whose name is mostly known by it's three letter abbreviation.

I'm leaning towards allowing this. As I stated in the previous PR, these two adaptor classes needs a proper cleaning up.
Just as a side-note, from purely semantics it's not clear to me what an "adaptorpart" is and what such an object would be a reflection of.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I wanted to support the same type as you did in the other code essentially.

For me each adaptor would have two adaptorparts, i.e. i7 and i5. Does that make sense?

Copy link
Member

@remiolsen remiolsen Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't worry, I'm being nitpicky over the name. It might have been more clear if AdaptorPart was a child class of Adaptor. But that's currently a little bit weird since AdaptorParts are instance variables of Adaptor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, but no I think child classes would have been more confusing actually. Their just the two parts of an Adapter to DRY up all the methods and attributes which are identical between i7 and i5.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In isolation think it would be less confusing. BUT it doesn't really fit with how the adaptor class is currently used. So it would mean a whole lot of refactoring for the sake of idealism, which I don't advocate for.

# Check if UMI is before or after index
if len(self.umi) > 0 and ">" + self.umi[0] in self.sequence:
# The index region is INDEX+UMI
self.umi_after = int(re.search(ulen, self.umi[0]).group(1))
self.len_before_index = len(idelim.split(self.sequence)[0])
self.len_after_index = len(udelim.split(self.sequence)[-1])
elif len(self.umi) > 0 and self.umi[0] + "<" in self.sequence:
# The index region is UMI+INDEX
self.umi_before = int(re.search(ulen, self.umi[0]).group(1))
self.len_before_index = len(udelim.split(self.sequence)[0])
self.len_after_index = len(idelim.split(self.sequence)[-1])
elif len(self.umi) > 0:
# TODO give details which adaptor has the problem
raise UserWarning(
"Found adaptor with UMI but it does not flank an index. This is not supported."
)
# Non UMI cases
elif has_match(idelim, self.sequence):
self.len_before_index = len(idelim.split(self.sequence)[0])
self.len_after_index = len(idelim.split(self.sequence)[-1])

def has_index(self):
return self.sequence.find(self.delim) > -1

def len_before_index(self):
return self.sequence.find(self.delim)
def len_before_index_region(self):
return self.len_before_index

def len_after_index(self):
return len(self.sequence) - self.sequence.find(self.delim) - len(self.delim)

def get_mask(self, insert_Ns):
if self.has_index():
if not insert_Ns:
return self.sequence.replace(self.delim, "")
else:
return self.sequence.replace(self.delim, "N" * len(self.index))
else:
return self.sequence
def len_after_index_region(self):
return self.len_after_index


# General function to check if a string contains a pattern
Expand Down
11 changes: 8 additions & 3 deletions anglerfish/explore/explore.py
Original file line number Diff line number Diff line change
Expand Up @@ -104,9 +104,11 @@ def run_explore(

# Alignment thresholds
before_thres = round(
adaptor_end.len_before_index() * good_hit_threshold
adaptor_end.len_before_index_region() * good_hit_threshold
)
after_thres = round(
adaptor_end.len_after_index_region() * good_hit_threshold
)
after_thres = round(adaptor_end.len_after_index() * good_hit_threshold)
insert_thres_low = insert_thres_low
insert_thres_high = insert_thres_high

Expand All @@ -133,7 +135,10 @@ def run_explore(
] = match_col_df

thres = round(
(adaptor_end.len_before_index() + adaptor_end.len_after_index())
(
adaptor_end.len_before_index_region()
+ adaptor_end.len_after_index_region()
)
* good_hit_threshold
)
df_good_hits = df_good_hits[df_good_hits["match_1_len"] >= thres]
Expand Down
Loading