Skip to content

AnvarSh1/TRiWS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TRiWS

Tandem Repeats in Weighted Sequences

Below are extracts from my MSc Individual Project Report (King's College London) (Full report available at http://bit.ly/ASMSc)

"Tandem repeat is a consecutively repeated string which is not empty. For e.g. www, 0101, and

TAGCTAGC. Any string containing such repeated sequences as a substring is called repetition-containing;

otherwise, a string is repetition-free (Barton et al., 2014). A tandem repeat in this case, is a string of the

form , where  is a non-empty string (Barton et al., 2014). Considering some string x, questions

regarding tandem repeats in x may arise. For e.g. if x contains tandem repeats and a possible number of

such repetitions, it has been demonstrated by Crochemore (1981) that there exists a chance of having O

(n log n) maximal repetitions within a string of length n. Eventually, Kolpakov and Kucherov (1999) devised

an O (n) time algorithm to calculate the compact representation for all such repetitions denoted by runs.

Usually, these algorithms work on regular strings. We will currently focus on locating tandem repeats in

weighted sequences. The first step is to understand what a weighted sequence is. A weighted sequence

may be defined as a type of string that allows characters of a given alphabet to occur at a particular

position with respective probability, unlike fixed single characters in a normal string. This model allows us,

among other things, to use the idea of equivalence classes in weighted sequences, introduced by Zhang,

Guo and Iliopoulos (2012) to calculate tandem repeats of all possible lengths."

Modules Overview.

"To make both implementation and further application more convenient, solution is divided into separate modules. Such model allows to use separate functions without bulking main code, develop and improve each one of them separately while avoiding interference with other modules. Python (version 3.5.1) was used during development; usage of external libraries was limited where possible, to create completely independent code. Input model The very first step in our calculations is input of the weighted string. Considering the mixture of different data types in one sequence, it is not always possible to process input string "as is"; consistent weighted string model is needed to start calculations. For this task module rwtstr.py is created, which reads a string directly from a file and outputs a sequence with separated nested values. Simple template was considered, according to examples from previous chapters, with addition of delimiter: Each position i in sequence x is separated from next by “;” symbol, making it possible to process each concerning different types of data. Taking into account that positions that show only letters are having occurrence probability equal to 1, numbers and missing syntax will be added After syntax corrections are made, next step is present each element as an object that we can address. We use Python’s built-in eval() function for this purpose, which evaluates a given string as pure Python expression. Above manipulations result in list of lists of tuples, where each nested list is position x[i] of the string, and each tuple is a letter with its given occurrence probability. While first reason for using tuples as (letter, occurrence) pairs resulted from syntax itself, it was decided to use this model further. Main reason is tuples are, unlike lists, immutable, which gives us additional safety over accidental modifications. At this point, we successfully read our sequence and made it available for further manipulations. Next, we implement second module naivestrcr.py, which allows us to find the maximal repetition in a given string. Algorithm uses the idea of naïve implementation of Crochemore’s algorithm. At the current moment, naivestrcr.py is the only module that uses additional library – we import Python’s re library, which allows us to use regular expressions within our code.

After regular string is passed to internal function NaiveCr(), two nested loops compare substring, starting from size of 1. With additional use of regular expressions, it is easier to compare the substring with the whole sequence. After maximum comparison is made, function will print maximal repeated substring. Main module Now, as all supplementary modules are implemented, we can move to the main file. Algorithm implemented here is the short variant of the algorithm developed by Barton, Iliopoulos and Pissis (2014). The difference between approaches is, patternalgo.py works nearly as sub-problem solver. After successful filtering colouring stages, algorithm starts to assemble extended factor from each given black position and appends each factor to the list of extended factor. Next, we simply pass this list, item by item, to naivestrcr.py, which will compute maximal repeating substring for given factor."

About

Tandem Repeats in Weighted Sequences

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages