Skip to content

bayesomicslab/BSEEJ

Repository files navigation

Bayesian Reconstruction and Differential Testing of Excised Introns

Authors:
Marjan Hosseini
Devin J. McConnell
Derek Aguiar

Graphical abstract



The code for BSEEJ method is provided in 'bseej.py' file. The compressed version of junction files for one example gene (A2ML1) is added in the github. To run BSEEJ with default configurations, simply run the requirements.txt. Then run 'bseej.py'. More detailed guide is the follow.

Usage

  1. Clone the repository:
git clone https://github.com/bayesomicslab/BSEEJ.git
cd BSEEJ
  1. Install the project dependencies and activate the environment using one of these options:

Manually install a conda environment and install the dependencies (recommended choice):

conda create -n bseej_env python==3.10
conda install -c pytorch -c conda-forge -c defaults arviz nbclient nbconvert nbformat networkx numba numexpr numpy pandas scikit-image scikit-learn scipy Theano
conda activate bseej_env

or use the requirement.txt file to install specific versions we used for developing the code:

conda create -n bseej_env python==3.10
pip install -r requirements.txt
conda activate bseej_env

or alternatively you can use yml file that is provided.

conda env create -f bseej_env.yml
conda activate bseej_env
  1. Run the code:
python3 bseej.py

Or alternatively for changing model parameters or training other genes run: Please note that junction path should contain junction files (.junc) with the same format as we provided in the A2ML1.zip after decompression.

python3 bseej.py -k clusters_no -i max iteration  -e eta hyper parameter -a alpha hyper-parameter -r r -s s -p junction_path -g gene_name -o results_path
- clusters_no: number of clusters, this number should be larger than the number of minimum node cover of the interval graph of the intron excisions
- max iteration: the maximum number of iterations for the Gibbs updates
- eta corresponds to eta hyper-parameter (See the model section)  
- alpha corresponds to alpha hyper-parameter (See the model section)  
- r corresponds to r hyper-parameter (See the model section)  
- s corresponds to s hyper-parameter (See the model section) 
- junction_path is the relative or full path to the folder that contains the junctions reads. Here one gene (A2ML1.zip) has been uploaded for example to be run and it should first unzip and then give the extracted path to the code.
- gene: the name of the gene (Here by default, gene is A2ML1).

We also provided an example of usage of our code in example.ipynb file.

Prepare the input:

For creating the EGA and simulated data BAM files we ran the STAR [1] aligner for fast and accurate alignment.

STAR --runThreadN 20 --genomeDir ../genome_data/genome_index/ --outFileNamePrefix ./person_${i}_ --twopassMode  Basic --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate  --readFilesIn ${1}/person_${i}_1.fa ${1}/person_${i}_2.fa

In this example $${1}$ is the directory where the files are located. The input file 'person_*_1.fa' is a collection of genes for the ith sample on forward sequence and the second is the collection of genes on the backward sequence. This allows us to use the twopassMode and we also used the intronMotif in order to obtain spliced alignments (XS). Once the files have been aligned they are then separated out into the individual bam files of just one gene to work on at a time.

Regtools [2] was used for an efficient filtering of the junctions.

regtools junctions extract -s 0 -a 6 -m 50 -M 500000 %s -o %s.junc  

Here the two %s are the bam file and output name respectively. On all data used in this project, EGA, Geuvadis, and simulations, we used the following flags:

* -s: finds XS/unstranded flags
* -a: minimum anchor length into exon (6 bp)
* -m: minimum intron size (50 bp)
* -M: Maximum intron size (500000 bp)
portcullis prep -t 20 -v --force -o %s_portcullis/1-prep/ GRCh38.primary_assembly.genome.fa %s/%s.bam

First step of portcullis [3] is prep and it takes in the fasta file from the reference genome used, here is an example from the data simulations. It is important to note that portcullis was run on our simulations and both experimental results. Here %s is there name of the folder to direct output to and the bam file name.

portcullis junc -t 20 -v -o %s_portcullis/2-junc/portcullis_all --intron_gff %s_portcullis/1-prep/    

The next step that is being done is extraction of the junctions into a gff format. Here %s is just giving it the name of the folder to look into.

portcullis filt -t 20 -v -n --max_length 500000 --min_cov 30 -o %s_portcullis/3-filt/portcullis_filtered --intron_gff %s_portcullis/1-prep/ %s_portcullis/2-junc/portcullis_all.junctions.tab

Finally, we set some filtering using portcullis. We only keep introns that have a max length of 500000, do not use the machine learning filtering, and have a minimum coverage of 30. Here again %s is just point to the gene folder to look into for the files. After portcullis is complete we do an overlap check between the junctions found from regtools and portcullis and only keep the ones that overlap with at least 90%.

Model:

BREM Probabilistic Graphical Model



Main variables and parameters include:

  • $V$ is the set of unique intron excisions, indexed by $v$ and its size of denoted by $|V|$.

  • $N$ is the number of samples and are indexed by $i$.

  • $J_i$ is the number of intron excisions in ith sample.

  • $K$ is the number of clusters (indexed by $k$).

  • For the jth intron excision in the ith sample, we assign a cluster k.

  • Graph $G = (V, E)$, where V is the set of unique intron excision and there is an edge between two intron excisions iff they intersect each other.

  • $\Omega$ is the set of all the independent sets in $G$.

  • $r, s$ are priors for $\pi$ (Beta distribution):

    $$ \pi_k \sim Beta(r, s), \forall k = {1, \dots, K} $$



* Increase in mean of Beta$(r,s)$ results in increase in cluster size |SEEJ|.

  • The structure of a (clusters) SEEJs consists of the inclusion or exclusion of intron excisions.


$$ b_{kv} \sim Bernoulli(\pi_k), \ \forall v \in C \\ \text{s.t.} \ b_k \in \Omega $$

  • For cluster k, $\beta_{k}$ is a $|V|$-dimensional Dirichlet which represents the distribution of the cluster k over the intron excisions.

$$ \beta_k \sim Dirichlet_{|V|}(\eta \odot b_k) $$

* $\eta = (\eta_1, \eta_2, ..., \eta_{|V|})$ is $\beta$ variable prior.

  • For the ith sample, variable $\theta_i$ is a $K$-dimensional Dirichlet distribution and represents the proportions of the clusters in sample $i$.

$$ \theta_i \sim Dirichlet_K(\alpha) $$

* $\alpha = (\alpha_1, \alpha_2, ..., \alpha_N)$ is $\theta$ variable prior.

  • Variable $z_{ij}$ is the cluster assignment for jth intron excision in ith sample. It can take a natural value between 1 and $K$ and follows a Multinomial:

$$ z_{ij} \sim Multinomial(\theta_i) \\ z \in {1, \ldots, K}^{N \times J_i} $$

  • In the ith sample, the jth intron excision is $w_{ij}$ and is observed and follows a Multinomial distribution:

$$ w_{ij} \sim Multinomial(\beta_{z_{ij}}) $$

References

[1] Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., ... & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15-21.

[2] Feng, Y. Y., Ramu, A., Cotto, K. C., Skidmore, Z. L., Kunisaki, J., Conrad, D. F., ... & Griffith, M. (2018). RegTools: Integrated analysis of genomic and transcriptomic data for discovery of splicing variants in cancer. BioRxiv, 436634.

[3] Mapleson, D., Venturini, L., Kaithakottil, G., & Swarbreck, D. (2018). Efficient and accurate detection of splice junctions from RNA-seq with Portcullis. GigaScience, 7(12), giy131.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published