Nose2py (#224)
* Switched tests from nose to pytest.

* Switched deprecated imports to new ra path

* Addresses np 1.25 deprecation of np.where on 0d arrays

* Liberalize to python 3.12

* Update head readme to include new install and tests

* Update README.md

* FRET-dye-mc tests added. Added ability to fix rng

* Caught bug in MPI tests, reverted

* Update min np&py reqs. np < 3.2 fails np.concatenate(a,dtype)

* Added documentation for smFRET code to docs

* Updated from version 0.1.1 to 0.2.0 given major addition of FRET code

---------

Authored-by: Justin Miller <[email protected]>
Justin-J-Miller authored May 3, 2024
1 parent cb5ecb3 commit 3b8c23e
Showing 48 changed files with 1,504 additions and 301 deletions.
28 changes: 22 additions & 6 deletions README.md
@@ -13,6 +13,19 @@ Porter, J.R., Zimmerman, M.I. and Bowman, G.R., 2019. [Enspara: Modeling molecul

Installation is documented [here](https://enspara.readthedocs.io/en/latest/installation.html).

Current installation instructions are (using mamba):

```bash
git clone https://github.com/bowman-lab/enspara
mamba create -n enspara -c conda-forge cython numpy mdtraj scipy python=3.12
mamba activate enspara
cd enspara
pip install -e .
```

Optionally, install pytest to run the tests:
`mamba install -c conda-forge pytest`

## Building the docs

Enspara uses sphinx for documentation. They're a bit of a work in progress,
@@ -25,18 +38,21 @@ make html

## Running the tests

-Enspara uses `nosetests` as a test discovery and running tool. To run the
+Enspara uses `pytest` as a test discovery and running tool. To run the
tests, you should first make sure you have the development dependencies
installed then, from the enspara directory, run:

```bash
-nosetests enspara
+pytest
```

-If you want to run the MPI tests, you can run

+If you wish to skip the MPI tests, you can run:
```bash
-mpiexec -n 2 nosetests enspara -a mpi
+pytest -m 'not mpi'
```

-where `-a mpi` asks nose to run only the tests that are MPI tests.
+If you want to run the MPI tests, you can run:

```bash
mpiexec -n 2 pytest -m 'mpi'
```
8 changes: 4 additions & 4 deletions docs/source/developing.rst
@@ -5,21 +5,21 @@ Running the Tests
-----------------

After building the project for development, go to the project directory
-and use nose:
+and use pytest:

.. code-block:: bash

-   nosetests enspara -a '!mpi'
+   pytest enspara/test/ -m 'not mpi'

This command runs all the tests, except those that require MPI. To include
-those tests, simply remove the ``-a '!mpi'``. However, these may or may not
+those tests, simply remove the ``-m 'not mpi'``. However, these may or may not
function unless you run the tests under MPI.

To run the tests under MPI, you'll run something that looks like:

.. code-block:: bash

-   mpiexec -n 2 nosetests enspara -a mpi
+   mpiexec -n 2 pytest enspara/test/ -m 'mpi'

This will run only the mpi tests with an MPI swarm of two nodes. Obviously not
all pathology will be caught by just two MPI nodes, but it's a start!
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -29,6 +29,7 @@ Contents:
cookbook
pocket-detection
transition-path-theory
smFRET
developing
_modules/enspara

227 changes: 227 additions & 0 deletions docs/source/smFRET.rst
@@ -0,0 +1,227 @@
smFRET predictions
==================

:code:`enspara` enables prediction of smFRET results from Markov State models.
Accordingly, you will need to have clustered your data (see :doc:`clustering </clustering>`)
and built an appropriate MSM (see :doc:`Fitting-an-MSM </tutorial/fitting>`).

You will also need to acquire either point clouds or MSMs of the dyes you wish
to model. We have deposited some dye MSMs `here <https://osf.io/82xtd/?view_only=b7f354e86eb144a69d9d047b42e21a9f>`_ for general use.

Once you are satisfied with your MSMs, you may predict smFRET results from them.

There are two levels of detail available for use:

1. :ref:`Apps <smFRET-app>`. smFRET prediction code is available in a
command-line application that is capable of handling much of the necessary
bookkeeping.

2. Functions. The smFRET code is ultimately implemented as functions, which
offer the highest degree of control over their behavior, but also
require the most work on the user's part.


.. _smFRET-app:

smFRET App
--------------------------------

smFRET prediction functionality is available in :code:`enspara` via the scripts
:code:`apps/smFRET-dye-MC.py` and :code:`apps/smFRET-point-clouds.py`.
Their help output explains at a granular level of detail what they are capable of,
so here we provide a high-level discussion of how they can be used.

When predicting smFRET, you will need to make a few important choices:

1. How would you like to treat the dyes? Treating dyes as point clouds is faster,
though it does not account for dye lifetimes and requires choosing a Förster radius.
Accounting for dye dynamics (dye-MC) takes longer, but also provides dye lifetimes
and is more physically realistic.

2. Which dyes would you like to predict FRET for? You will need to acquire either
a dye MSM or a dye point cloud for each dye of interest. We describe creation of
additional dye MSMs
`here <https://osf.io/82xtd/?view_only=b7f354e86eb144a69d9d047b42e21a9f>`_.



Predicting smFRET using dye MSMs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

smFRET calculations are parallelized using a process pool. The calculation is
generally fairly fast, but MPI support has not been implemented yet. Scaling to
many cluster centers will make the calculation noticeably slower. Predicting smFRET
proceeds in two steps: first, model the dye lifetimes for each state in the
protein MSM.

This can be done as follows:

.. code-block:: bash

   python /home/username/enspara/enspara/apps/smFRET-dye-MC.py \
   --donor_name 'AlexaFluor 488 B1R' \
   --acceptor_name 'AlexaFluor 647 C2R' \
   --donor_centers /path/to/donor/dye/MSM_centers.xtc \
   --donor_top /path/to/donor/dye/topology_file.pdb \
   --acceptor_centers /path/to/acceptor/dye/MSM_centers.xtc \
   --acceptor_top /path/to/acceptor/dye/topology_file.pdb \
   --donor_tcounts /path/to/donor/dye/transition/counts.npy \
   --acceptor_tcounts /path/to/acceptor/dye/transition/counts.npy \
   --dye_lagtime 0.002 \
   --prot_top /path/to/protein/topology_file.pdb \
   --prot_centers /path/to/protein/MSM/centers.xtc \
   --resid_pairs /path/to/residue/labeling/pairs.txt \
   --n_procs 40 \
   --n_samples 1000 \
   --output_dir /path/to/output/location/

This will calculate the dye lifetimes for each residue pair in :code:`pairs.txt`,
for each protein center in the MSM, using the provided dye MSMs built with a lag
time of 2 ps (0.002 ns). In this case, 1000 Monte Carlo simulations will be run for
each protein center to generate a variety of dye lifetimes and emission events. We
tend to see convergence of FRET efficiency for these calculations with ~500-1000
events. :code:`pairs.txt` should contain a line such as :code:`112 136` to
calculate lifetimes for the dyes with labeling at positions 112 and 136, using the
residue notation provided in :code:`topology_file.pdb` (resSeq). To predict
lifetimes for multiple dye pairs, simply append more resSeq pairs to
:code:`pairs.txt`, one pair per line, as in the example below.
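
For instance, a hypothetical :code:`pairs.txt` requesting lifetimes for two
labeling pairs would be:

.. code-block:: bash

   112 136
   145 223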

One file will be generated for each residue pair, with a name of the form
:code:`events-112-136.npy`.

Each events file will be a numpy array of shape (n_protein_centers, 2, n_events).
For each protein center, the time to dye emission is saved along with the
emission outcome. In this case, energy transfer = acceptor emission,
radiative = donor emission, and non_radiative = no observed emission.
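
A minimal sketch of inspecting these events (assuming the hypothetical 112/136
pair above; :code:`allow_pickle` is needed if the array is stored as a numpy
object array):

.. code-block:: python

   import numpy as np

   # Lifetime events for the hypothetical 112/136 labeling pair.
   # Shape: (n_protein_centers, 2, n_events).
   events = np.load("events-112-136.npy", allow_pickle=True)

   # For the first protein center: emission times and the outcome of each
   # Monte Carlo run (energy transfer, radiative, or non_radiative).
   times, outcomes = events[0]
   print(times[:5], outcomes[:5])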

Optionally, you can save out dye trajectories and dye MSMs for each protein state
with the flags:

.. code-block:: bash

   --save_dtrj True \
   --save_dmsm True

This will result in dye MSMs being saved in the output directory for each protein
state. The dye MSMs will be similar to the original, though states that resulted
in steric clashes will be removed and the remaining states rebuilt based on the
original counts matrix.

The saved dtrj will be a ragged array of shape (n_bursts, variable). Each element
in the ragged array corresponds to the cluster center number that the dye was in
at that step of the Monte Carlo simulation.


Next, you will need to simulate smFRET bursts.

This can be done as follows:

.. code-block:: bash

   python /home/username/enspara/enspara/apps/smFRET-dye-MC.py run_burst \
   --eq_probs /path/to/protein/eq_probs.npy \
   --t_counts /path/to/protein/t_counts.npy \
   --lifetimes_dir /path/to/dye/lifetimes/output \
   --lagtime 5 \
   --donor_name 'AlexaFluor 488 B1R' \
   --acceptor_name 'AlexaFluor 647 C2R' \
   --prot_top /path/to/protein/topology_file.pdb \
   --prot_centers /path/to/protein/MSM/centers.xtc \
   --resid_pairs /path/to/residue/labeling/pairs.txt \
   --n_procs 2 \
   --output_dir /path/to/output/location/ \
   --correction_factor 10000 9000 8000 7000 6000 5000

This will run a kinetic Monte Carlo simulation to simulate smFRET bursts for
the provided Markov state model and protein centers, using dye labeling
on the residues specified in :code:`pairs.txt`. The correction factor is a scaling
factor that slows the simulation timescale to match the experimental timescale; in
this case, we are calculating the FRET efficiency for a series of rescaling times.
Note: you must have calculated dye lifetimes for these residues or the code will
error. This calculation is generally very fast and runs mostly single-threaded;
parallelization is supported across the number of dye pairs being calculated, via
a process pool. :code:`--lagtime` is the protein MSM lag time (in ns).

Three outputs will be created:

1. :code:`/path/to/output/FEs/FE-residue1-residue2-correction-factor.npy`.
This is a numpy array of the simulated FRET efficiencies. Each entry is
the FRET efficiency of a single photon burst. Typically, we histogram the
results and present these as the FRET efficiency.

2. :code:`/path/to/output/Lifetimes/(a_or_d)_lifetimes-residue1-residue2-correction-factor.npy` is a
numpy array of the acceptor or donor lifetimes. It will be a ragged array
of shape (n_bursts, n_photons).

3. :code:`/path/to/output/MSMs/residue1-donor_dye-residue2-acceptor_dye-(eqs or t_prbs).npy`
is the modified protein MSM, accounting for steric clashes that occurred when the
protein was labeled.
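
As a sketch of typical post-processing of the FRET efficiencies (the file name
here is hypothetical, following the pattern above):

.. code-block:: python

   import numpy as np
   import matplotlib.pyplot as plt

   # FRET efficiencies, one entry per simulated photon burst.
   fes = np.load("FEs/FE-112-136-10000.npy", allow_pickle=True)

   # Histogram the per-burst efficiencies, as typically presented.
   plt.hist(fes, bins=50, range=(0, 1))
   plt.xlabel("FRET efficiency")
   plt.ylabel("Bursts")
   plt.savefig("FE-112-136-hist.png")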

Predicting smFRET using dye point clouds
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

smFRET calculations are parallelized using a process pool. The calculation is
generally fairly fast, but MPI support has not been implemented yet. Scaling to
many cluster centers will make the calculation noticeably slower. Predicting smFRET
proceeds in two steps: first, model the inter-dye distances for each protein MSM
state.

This can be done as follows:

.. code-block:: bash

   python /home/username/enspara/enspara/apps/smFRET-point-clouds.py model_dyes \
   /path/to/protein/MSM/centers.xtc \
   /path/to/protein/topology_file.pdb \
   /path/to/residue/labeling/pairs.txt \
   --FRETdye1 /path/to/dye/pointcloud.pdb \
   --FRETdye2 /path/to/dye/pointcloud.pdb \
   --n_procs 40 \
   --output_dir /path/to/output/location/

This will calculate the pairwise distance distribution for each state
in the protein MSM, for each residue labeling pair in :code:`pairs.txt`.

:code:`pairs.txt` should look something like this:

.. code-block:: bash

   112 136
   145 223

This will calculate pairwise inter-dye distance distributions using the two
provided point clouds. We provide two point clouds, for the dyes AlexaFluor 488
and AlexaFluor 594, in :code:`enspara`; these are used by default if no dyes are
provided.

Two files will be output:

1. :code:`/path/to/output/location/dye_distributions/bin_edges_residue1_residue2.h5`,
a ragged array of shape (n_protein_centers, variable) containing the bin edges of
the histogrammed distance distribution. The bin width is fixed at 0.1 Å.

2. :code:`/path/to/output/location/dye_distributions/probs_residue1_residue2.h5`,
the probability that the dyes are at the given distance, associated with the above
histogram.
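
These ragged arrays can be read back for inspection; a minimal sketch, assuming
:code:`enspara`'s ragged-array loader and the hypothetical 112/136 pair from above:

.. code-block:: python

   from enspara import ra

   # Histogrammed inter-dye distances, ragged shape (n_protein_centers, variable).
   bin_edges = ra.load("dye_distributions/bin_edges_112_136.h5")
   probs = ra.load("dye_distributions/probs_112_136.h5")

   # Distance distribution for the first protein cluster center.
   print(bin_edges[0][:5], probs[0][:5])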


Next, you will need to simulate smFRET bursts.

This can be done as follows:

.. code-block:: bash

   python /home/username/enspara/enspara/apps/smFRET-point-clouds.py run_burst \
   /path/to/protein/MSM/eq_probs.npy \
   /path/to/protein/MSM/t_probs.npy \
   2 \
   /path/to/previous/output/dye_distributions \
   /path/to/residue/labeling/pairs.txt \
   --n_procs 40 \
   --time_factor 1000 \
   --n_chunks 2 \
   --R0 5.4 \
   --output_dir /path/to/output/location

This will run a kinetic Monte Carlo simulation to simulate smFRET bursts for
the provided Markov state model and protein centers, using dye-distance
distributions on the residues specified in :code:`pairs.txt`. You must have
calculated dye distance distributions for these residues or the code will error.

In this case, we have built an MSM with a 2 ns lag time, and believe the MSM is
1,000 times "faster" than the experiment. We are calculating FRET efficiency using
a Förster radius of 5.4 nm. One output file will be generated,
:code:`FRET_E_residue1_residue2_time_factor.npy`, which is a numpy array of the
FRET efficiencies for each observed burst.
2 changes: 1 addition & 1 deletion enspara/apps/smFRET-dye-MC.py
@@ -107,7 +107,7 @@ def process_command_line(argv):
"Generally parallel over number of frames in supplied trajectory/MSM state. ")
calc_lts_param_args.add_argument(
'--n_samples', required=False, type=int,
-    default=5000,
+    default=1000,
help="Number of times to run dye_lifetime calculations (per center).")
calc_lts_param_args.add_argument(
'--save_dtrj', required=False, default=False, type=bool,
23 changes: 12 additions & 11 deletions enspara/apps/smFRET-point-clouds.py
@@ -72,11 +72,11 @@ def process_command_line(argv):
"Generally parallel over number of frames in supplied trajectory/MSM state")
model_parameter_args.add_argument(
'--FRETdye1', required=False,
-    default=os.path.dirname(inspect.getfile(ra)) + '/../data/dyes/AF488.pdb',
+    default=os.path.dirname(inspect.getfile(ra)) + '/../data/dyes/point-clouds/AF488.pdb',
help="Path to point cloud of FRET dye pair 2")
model_parameter_args.add_argument(
'--FRETdye2', required=False,
-    default=os.path.dirname(inspect.getfile(ra)) + '/../data/dyes/AF594.pdb',
+    default=os.path.dirname(inspect.getfile(ra)) + '/../data/dyes/point-clouds/AF594.pdb',
help="Path to point cloud of FRET dye pair 2")
model_parameter_args.add_argument(
'--output_dir', required=False, action=readable_dir, default='./',
@@ -100,11 +100,6 @@ def process_command_line(argv):
't_probs',
help="transition probabilities from the MSM. "
"Should be of file type .npy")
-    fret_input_args.add_argument(
-        'photon_times',
-        help="File containing inter photon times. Each list is an individual photon burst "
-        "with photon wait times (in us) for each burst. Size (n_bursts, nphotons in burst) "
-        "Should be of file type .npy")
fret_input_args.add_argument(
'lagtime', type=float,
help="lag time used to construct the MSM (in ns) "
@@ -124,6 +119,12 @@ def process_command_line(argv):
'--n_procs', required=False, type=int, default=1,
help="Number of cores to use for parallel processing. "
"Generally parallel over number of frames in supplied trajectory/MSM state")
+    fret_parameters.add_argument(
+        '--photon_times',
+        default=os.path.dirname(inspect.getfile(ra)) + '/../data/dyes/interphoton_times.npy',
+        help="File containing inter photon times. Each list is an individual photon burst "
+        "with photon wait times (in us) for each burst. Size (n_bursts, nphotons in burst) "
+        "Should be of file type .npy")
fret_parameters.add_argument(
'--n_chunks', required=False, type=int, default=2,
help="Enables you to assess intraburst variation. "
@@ -132,7 +133,7 @@ def process_command_line(argv):
'--R0', required=False, type=float, default=5.4,
help="R0 value for FRET dye pair of interest")
fret_parameters.add_argument(
-        '--slowing_factor', required=False, type=int, default=1,
+        '--time_factor', required=False, type=int, default=1,
help="factor to slow your trajectories by")
fret_parameters.add_argument(
'--output_dir', required=False, action=readable_dir, default='./',
@@ -230,10 +231,10 @@ def main(argv=None):
cumulative_times = np.load(args.photon_times, allow_pickle=True)

# Convert Photon arrival times into MSM steps.
-    MSM_frames = dyes_from_expt_dist.convert_photon_times(cumulative_times, args.lagtime, args.slowing_factor)
+    MSM_frames = dyes_from_expt_dist.convert_photon_times(cumulative_times, args.lagtime, args.time_factor)

logger.info(f"Using r0 of {args.R0}")
logger.info(f"Using slowing factor of {args.slowing_factor}")
logger.info(f"Using time factor of {args.time_factor}")
# Calculate the FRET efficiencies
for n in np.arange(resSeq_pairs.shape[0]):
logger.info(f"Calculating FRET Efficiencies for residues {resSeq_pairs[n]}")
@@ -250,7 +251,7 @@ def main(argv=None):
FEs_sampling, trajs = dyes_from_expt_dist.sample_FRET_histograms(
T=t_probabilities, populations=populations, dist_distribution=dist_distribution,
MSM_frames=MSM_frames, R0=args.R0, n_procs=args.n_procs, n_photon_std=args.n_chunks)
np.save(f"{args.output_dir}/FRET_E_{title}_time_factor_{args.slowing_factor}.npy", FEs_sampling)
np.save(f"{args.output_dir}/FRET_E_{title}_time_factor_{args.time_factor}.npy", FEs_sampling)

if args.save_burst_frames==True:
np.save(f'{args.output_dir}/syn-trjs-{title}.npy', trajs)
2 changes: 1 addition & 1 deletion enspara/cluster/util.py
@@ -16,7 +16,7 @@
from ..geometry.libdist import euclidean, manhattan

from ..exception import ImproperlyConfigured, DataInvalid
-from ..util import partition_list, partition_indices
+from ..ra.ra import partition_list, partition_indices
from enspara import ra

logger = logging.getLogger(__name__)
