DiMotif: Alignment-free Discriminative Protein Motif Discovery

We present DiMotif, an alignment-free discriminative motif discovery method, and evaluate it for finding protein motifs in three settings: (1) a comparison of DiMotif with two existing approaches on 20 distinct, experimentally verified motif discovery problems; (2) a classification-based evaluation of the motifs extracted for integrins, integrin-binding proteins, and biofilm formation; and (3) sequence pattern searching for nuclear localization signals. In general, DiMotif obtained high recall while achieving an F1 score comparable to the other methods in the discovery of experimentally verified motifs. The high recall suggests that DiMotif can be used to create short lists of candidate motifs for further experimental investigation.

The DiMotif paper is currently under review and is available on bioRxiv:

    @article {Asgari345843,
    author = {Asgari, Ehsaneddin and McHardy, Alice and Mofrad, Mohammad R. K.},
    title = {Probabilistic variable-length segmentation of protein sequences for discriminative motif mining (DiMotif) and sequence embedding (ProtVecX)},
    year = {2018},
    doi = {10.1101/345843},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2018/07/12/345843},
    eprint = {https://www.biorxiv.org/content/early/2018/07/12/345843.full.pdf},
    journal = {bioRxiv}
    }

DiMotif step-by-step

An IPython notebook containing a worked example of motif discovery with DiMotif is provided here: https://github.com/ehsanasgari/dimotif/blob/master/notebook/DiMotif_step_by_step_example.ipynb

User Manual

    python3 dimotif.py --pos seqfile_of_positive_class --neg seqfile_of_negative_class --outdir output_directory --topn top_N_motifs --segs number_of_segmentations

The command above runs all steps sequentially and writes the results to the output directory.
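
As an illustration, the command can also be assembled from Python. The sketch below only uses the flags documented in this README; the helper name build_dimotif_command, the file names, and the numeric values are hypothetical placeholders, not defaults of the tool.

```python
import shlex


def build_dimotif_command(pos, neg, outdir, topn, segs):
    """Assemble the dimotif.py command line from the documented flags."""
    return [
        "python3", "dimotif.py",
        "--pos", str(pos),
        "--neg", str(neg),
        "--outdir", str(outdir),
        "--topn", str(topn),
        "--segs", str(segs),
    ]


# Hypothetical inputs: file names and values are placeholders.
cmd = build_dimotif_command(
    pos="positive.fasta",
    neg="negative.fasta",
    outdir="results",
    topn=100,
    segs=10,
)

# Show the command as it would be typed in a shell.
print(shlex.join(cmd))

# To actually run it from the repository root:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.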

Main parameters

--pos sequence file of the positive class, in txt or FASTA format
--neg sequence file of the negative class, in txt or FASTA format
--outdir output directory
--topn number of motifs to extract
--segs number of segmentation schemes to sample
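
To check an input file before a run, a small reader along the following lines may help. It is a minimal sketch, not DiMotif's own parser: the function name read_sequences is hypothetical, and the txt layout is assumed to be one sequence per line, which this README does not specify.

```python
def read_sequences(path):
    """Return the sequences found in a FASTA or plain-text file."""
    with open(path) as handle:
        lines = [line.strip() for line in handle if line.strip()]

    # Assumption: a txt file holds one sequence per line and no '>' headers.
    if not any(line.startswith(">") for line in lines):
        return lines

    # FASTA: a '>' header starts a record; sequence lines may wrap.
    sequences, current = [], []
    for line in lines:
        if line.startswith(">"):
            if current:
                sequences.append("".join(current))
            current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences
```

For example, calling read_sequences on a hypothetical positive.fasta and printing the length of the result gives a quick count of the positive-class sequences.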

Computational Workflow

  • For a given set of positive sequences, DiMotif extracts the most discriminative motifs in the positive class using a probabilistic segmentation inferred from Swiss-Prot.
  • Motifs are hierarchically clustered according to their co-occurrence patterns in the positive sequences. Motifs are colored according to their most frequent secondary structure in the PDB database.
  • For each motif, normalized biophysical scores are also provided to support further biophysical interpretation.
  • The orange databases in the diagram are general-purpose databases and information, whereas the red and blue databases are the problem-specific datasets for which related motifs are sought.
    [Figure: DiMotif workflow diagram]
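
The idea of ranking motifs by how well they separate the positive from the negative class can be sketched in a few lines. This is a simplified illustration under stated assumptions, not DiMotif's implementation: fixed-length k-mers stand in for the probabilistic variable-length segmentation, a plain chi-square statistic on presence/absence counts is used as the score, and all function names and toy sequences are hypothetical.

```python
def kmers(seq, k):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def chi2_score(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def top_discriminative_motifs(pos, neg, k=3, topn=5):
    """Rank k-mers enriched in the positive class by chi-square score."""
    pos_sets = [kmers(s, k) for s in pos]
    neg_sets = [kmers(s, k) for s in neg]
    candidates = set().union(*pos_sets)
    scored = []
    for motif in candidates:
        a = sum(motif in s for s in pos_sets)  # positives containing motif
        c = sum(motif in s for s in neg_sets)  # negatives containing motif
        b = len(pos_sets) - a                  # positives without motif
        d = len(neg_sets) - c                  # negatives without motif
        # Keep only motifs more frequent in the positive class.
        if a * len(neg_sets) > c * len(pos_sets):
            scored.append((chi2_score(a, b, c, d), motif))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(motif, score) for score, motif in scored[:topn]]


# Toy data (made up): "RGD" occurs in every positive and no negative.
positives = ["MKRGDLA", "AARGDSS", "RGDKKLM"]
negatives = ["MKLAAAS", "SSKKLMA", "AAGGHHK"]
ranking = top_discriminative_motifs(positives, negatives, k=3, topn=3)
print(ranking[0])
```

In this toy setup the shared k-mer "RGD" ranks first because it is present in all positive sequences and absent from all negative ones.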