BED feature #45

Open · wants to merge 49 commits into base: master

isovic (Collaborator) commented Sep 7, 2020

This PR adds a BED feature to polish only the specified regions of the input draft.
The feature is useful when an assembly has been patched (for example, manually or by scaffolding) and the user wants to polish only those regions instead of risking disruption to an already high-quality assembly.

One could suggest that, instead of this feature, the user should simply filter and clip the overlaps externally, but that has a nasty side effect. A region can begin at an arbitrary position (e.g. 1237 bp in the target), while the current windowing splits the target at fixed intervals (e.g. 500 bp). The first 237 bp of the window where the region begins would therefore have 0x coverage, and the heuristics would kick in to trim the window. The end result would be a severely deteriorated assembly.
Another option might be to turn off window trimming, but then internal windows would have plenty of insertions at their ends.
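
The 237 bp figure above can be checked with a small sketch. This is an illustration only, not code from this PR; the function names `fixed_windows` and `uncovered_prefix` and the 3000 bp target length are hypothetical.

```python
def fixed_windows(target_len, window_len):
    """Split [0, target_len) at fixed intervals into half-open (start, end) windows."""
    return [(s, min(s + window_len, target_len))
            for s in range(0, target_len, window_len)]


def uncovered_prefix(windows, region_start):
    """Number of bases between the start of the window that contains
    region_start and region_start itself (these bases get 0x coverage
    if overlaps are clipped to the region externally)."""
    for start, end in windows:
        if start <= region_start < end:
            return region_start - start
    raise ValueError("region_start lies outside the target")


# Hypothetical 3000 bp target, 500 bp windows, region starting at 1237.
windows = fixed_windows(target_len=3000, window_len=500)
print(uncovered_prefix(windows, 1237))  # 237
```

The region start falls into the window [1000, 1500), so the bases 1000 to 1236 of that window receive no coverage.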

The only way around this is to implement proper region-based polishing that generalizes to full contigs as well as to specified regions.

To implement this, some refactoring had to be done. Here is a brief description of the work:

  • The construction of breaking points had to be refactored because it was rigid (it expected polishing to begin at coordinate 0 and split the CIGAR at equally spaced target coordinates) and entangled (finding the CIGAR breakpoints is not the responsibility of the Overlap class).
  • The Window class now stores both the start and the end target coordinate of a window, which is required to allow an arbitrary start/stop region for polishing. This will also be useful for an unrelated upcoming feature.
  • I added unit tests to verify that the breakpoint construction is correct. Previously, this was untested, and there was a bug in the old logic. The new version seems to work properly.
  • Added the BED parser, along with unit tests covering it.
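
To illustrate the first bullet, here is a minimal sketch of breaking an alignment into a supplied set of windows rather than at fixed intervals from coordinate 0. It is not the implementation in this PR: the function name `break_alignment_into_windows`, its signature, and the returned tuple layout are assumptions made for illustration. Windows are assumed to be sorted, non-overlapping, half-open target intervals, and `q_begin` is assumed to be the query coordinate of the first CIGAR operation.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def break_alignment_into_windows(cigar, t_begin, q_begin, windows):
    """For every window the alignment touches, return
    (window, first_aligned, last_aligned), where each aligned point is a
    (target_pos, query_pos) pair taken from an M/=/X column."""
    hits = {}          # window index -> (first point, last point)
    t, q = t_begin, q_begin
    w = 0              # index of the first window that may still contain t
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":                      # consumes target and query
            for _ in range(length):
                while w < len(windows) and windows[w][1] <= t:
                    w += 1                   # skip windows that end before t
                if w < len(windows) and windows[w][0] <= t:
                    first = hits[w][0] if w in hits else (t, q)
                    hits[w] = (first, (t, q))
                t += 1
                q += 1
        elif op in "IS":                     # consumes query only
            q += length
        elif op in "DN":                     # consumes target only
            t += length
        # H and P consume neither sequence
    return [(windows[i], first, last) for i, (first, last) in sorted(hits.items())]


points = break_alignment_into_windows(
    "3M2I3M2D2M", t_begin=1235, q_begin=0,
    windows=[(1237, 1242), (1242, 1300)])
print(points)
```

Because the windows are passed in, the same routine covers both cases: windows tiling a whole contig from 0, or windows tiling only a BED region that starts at an arbitrary coordinate.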

Both the regions and the windows may be of an arbitrary size.
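
As a rough sketch of how BED regions could drive the windowing (not the parser added in this PR), the snippet below reads the first three BED columns, rejects overlapping regions, and tiles a region with windows that start at the region start. The names `parse_bed` and `region_windows`, the contig name `ctg1`, and the choice to raise `ValueError` are assumptions for illustration.

```python
from collections import defaultdict


def parse_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} sorted by start.
    Only chrom, chromStart and chromEnd are used; BED intervals are
    0-based and half-open."""
    regions = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue                          # skip blanks, comments, header lines
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        if start < 0 or end <= start:
            raise ValueError(f"invalid BED interval: {line!r}")
        regions[chrom].append((start, end))
    for chrom, intervals in regions.items():
        intervals.sort()
        for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
            if next_start < prev_end:
                raise ValueError(f"overlapping BED regions on {chrom}")
    return dict(regions)


def region_windows(start, end, window_len):
    """Tile [start, end) with windows of at most window_len bases,
    keeping both the start and the end coordinate of each window."""
    return [(s, min(s + window_len, end)) for s in range(start, end, window_len)]


bed = parse_bed(["ctg1\t1237\t2500"])
for start, end in bed["ctg1"]:
    print(region_windows(start, end, 500))
# [(1237, 1737), (1737, 2237), (2237, 2500)]
```

Calling `region_windows(0, contig_length, window_len)` reproduces the fixed-interval tiling of a whole contig, which is the sense in which region-based windowing generalizes the original behaviour.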

…lidating that there aren't any regions which overlap.
… before the trees are constructed (will be useful later).
…ght now only generating empty windows for the BED regions.
…s_ and target_bed_trees_, and added the trees for the windows.
…dded a placeholder for a generic function to find window breaking points based on supplied windows.
…ld because these tests hijacked my unit tests.
…points in util.*. It breaks the alignment into supplied windows.
…igar and Overlap::find_breaking_points which take a set of windows.
@isovic isovic requested a review from rvaser September 7, 2020 08:00