Merge in short motif code #161

Merged
merged 284 commits on Dec 1, 2023
Commits (284)
ae4b0ef
Merged in PositionalMotifFrequencies report
LonnekeScheffer Nov 1, 2022
0b92d2d
added todos for PositionalMotifFrequencies report
LonnekeScheffer Nov 1, 2022
63e8bae
small (formatting) corrections. updated todos
LonnekeScheffer Nov 1, 2022
8de83ef
added precision/recall to feature annotations
LonnekeScheffer Nov 1, 2022
1a84a1c
Added SignificantMotifPrecisionTP report
LonnekeScheffer Nov 2, 2022
7ad9999
- rename SignificantMotifEncoder to MotifEncoder
LonnekeScheffer Nov 3, 2022
d23618a
attempt at making MotifEncoder faster by initializing (long) growing …
LonnekeScheffer Nov 3, 2022
d301596
allow label to be str or dict
LonnekeScheffer Nov 3, 2022
ea93bc9
parallelisation of MotifEncoder encoded data matrix construction for …
LonnekeScheffer Nov 3, 2022
e6aaea6
more parallelisation in MotifEncoder
LonnekeScheffer Nov 3, 2022
fcff23d
add weight_thresholds, split_classes via YAML
EricEReber Nov 3, 2022
87bd4e1
minor updates
LonnekeScheffer Nov 3, 2022
76529e3
add WeightsDistribution report
EricEReber Nov 1, 2022
156efdf
add weight_thresholds, split_classes via YAML
EricEReber Nov 3, 2022
c3efdef
add docs, add unit test file (not completed)
EricEReber Nov 3, 2022
9fdc0f0
Merge remote-tracking branch 'origin/weight_report' into short_motif_…
LonnekeScheffer Nov 4, 2022
3019675
minor updates
LonnekeScheffer Nov 5, 2022
6f7d928
added todos for Eric in WeightsDistribution report
LonnekeScheffer Nov 6, 2022
1c8740f
fixed todos
EricEReber Nov 6, 2022
152f873
minor correction
LonnekeScheffer Nov 15, 2022
d4b2a02
bugfix: DataWeighter should return a clone of the dataset instead of …
LonnekeScheffer Nov 15, 2022
421b0fc
test print statements
LonnekeScheffer Nov 16, 2022
d75de73
debugging print statements
LonnekeScheffer Nov 16, 2022
3dcbee0
attempted bugfix
LonnekeScheffer Nov 16, 2022
25f73e1
debugging prints
LonnekeScheffer Nov 16, 2022
bb17c49
debugging
LonnekeScheffer Nov 16, 2022
50dc558
debugging
LonnekeScheffer Nov 16, 2022
f8e1b36
bugfix
LonnekeScheffer Nov 16, 2022
2b4ca84
removed debugging prints
LonnekeScheffer Nov 16, 2022
6d50647
Bugfixes in MotifGeneralizationAnalysis:
LonnekeScheffer Nov 16, 2022
8c30838
bugfixes & added smoothing option
LonnekeScheffer Nov 17, 2022
c9a113a
Bugfix: remove sorting from ElementDataset & add assert statement in …
LonnekeScheffer Nov 17, 2022
92347af
Merge branch 'bugfix_element_generator_make_subset' into short_motif_…
LonnekeScheffer Nov 17, 2022
e6fe3b8
extending importance weighting to restrict mutagenesis to only one class
LonnekeScheffer Nov 18, 2022
e8e57c3
finished implementation of class-specific ImportanceWeighting
LonnekeScheffer Nov 18, 2022
db1ba78
- Updated line smoothing code for MotifGeneralizationAnalysis
LonnekeScheffer Nov 21, 2022
4c160b7
added more todos for WeightsDistribution report
LonnekeScheffer Nov 21, 2022
d60aa7e
updated MotifGeneralizationAnalysis:
LonnekeScheffer Nov 21, 2022
fa10989
- Added AminoAcidFrequencyDistribution report: plots a barplot of eac…
LonnekeScheffer Nov 22, 2022
3622c64
Merge branch 'amino_acid_frequency_distribution_report' into short_mo…
LonnekeScheffer Nov 22, 2022
bad7086
Updated color palette
LonnekeScheffer Nov 22, 2022
42aece7
Merge branch 'amino_acid_frequency_distribution_report' into short_mo…
LonnekeScheffer Nov 22, 2022
c0ffb4a
updated AminoAcidFrequencyDistribution to include splitting by label …
LonnekeScheffer Nov 22, 2022
718712d
Merge branch 'amino_acid_frequency_distribution_report' into short_mo…
LonnekeScheffer Nov 22, 2022
563dc22
Updated docs
LonnekeScheffer Nov 22, 2022
6187dfa
Merge branch 'amino_acid_frequency_distribution_report' into short_mo…
LonnekeScheffer Nov 22, 2022
0c84689
update style
LonnekeScheffer Nov 23, 2022
80acc99
sorted categories AminoAcidFrequencyDistribution
LonnekeScheffer Nov 23, 2022
b15b4a0
made range of figures up to 1.01 to not cut off points
LonnekeScheffer Nov 23, 2022
d2c6375
temporarily add sequence hover data to WeightsDistribution report
LonnekeScheffer Nov 24, 2022
4f6a1a3
added option to predefine training set for MotifGeneralizationAnalysi…
LonnekeScheffer Dec 13, 2022
2328cef
automatically determine the optimal TP/recall cutoff and show in plot
LonnekeScheffer Dec 13, 2022
76a17a0
moved get_numpy_sequence_representation to PositionalMotifHelper
LonnekeScheffer Dec 14, 2022
644e851
update: write training set ids to files instead of printing in log (t…
LonnekeScheffer Dec 20, 2022
2a7d706
updated MotifGeneralisationAnalysis: choose last point of exceeding p…
LonnekeScheffer Dec 20, 2022
5a14ee8
plot highlighted motifs on top
LonnekeScheffer Dec 21, 2022
e358e5a
minor refactoring
LonnekeScheffer Dec 21, 2022
3d2207e
allow generalization plot for multiple motif sizes
LonnekeScheffer Dec 21, 2022
e934539
bugfix: dynamically change min_total_points_in_window
LonnekeScheffer Dec 21, 2022
9ab4876
Bugfix
LonnekeScheffer Dec 21, 2022
a3f2ead
bugfix
LonnekeScheffer Dec 22, 2022
7afc1cc
bugfix
LonnekeScheffer Dec 22, 2022
3a0ff78
plot fix
LonnekeScheffer Dec 23, 2022
bf016b8
separate recall cutoff for different motif sizes
LonnekeScheffer Dec 23, 2022
3c3b4f1
updated the way the recall threshold is determined
LonnekeScheffer Dec 31, 2022
5892daa
export confusion matrix
LonnekeScheffer Jan 2, 2023
018220c
theme white
LonnekeScheffer Jan 2, 2023
ea3f58a
minor fix
LonnekeScheffer Jan 2, 2023
7228c0b
added keep_all param to MotifClassifier
LonnekeScheffer Jan 3, 2023
458af3b
improved error message for Metric
LonnekeScheffer Jan 3, 2023
c383b30
merging in changes
LonnekeScheffer Jan 3, 2023
3c5fb2d
bugfix
LonnekeScheffer Jan 3, 2023
5bfac77
bugfix matches report: get subject ids
LonnekeScheffer Jan 4, 2023
d401aa4
Merge branch 'bugfixes' into short_motif_classifier
LonnekeScheffer Jan 4, 2023
df417bb
Merge branch 'master' into amino_acid_frequency_distribution_report
LonnekeScheffer Jan 4, 2023
fbacbfa
Merge branch 'bugfixes' into development
LonnekeScheffer Jan 4, 2023
45fb861
Merge branch 'development' into short_motif_classifier
LonnekeScheffer Jan 4, 2023
fd7b339
bugfix: class mapping
LonnekeScheffer Jan 4, 2023
9239812
added selected features as export value
LonnekeScheffer Jan 4, 2023
3548139
move selected feature writing to fit
LonnekeScheffer Jan 4, 2023
b48a56f
bugfix
LonnekeScheffer Jan 5, 2023
cdf168e
updated the way tp thresholds are determined
LonnekeScheffer Jan 10, 2023
f8e1652
added MotifTestSetPeformance report, refactored to share code with Mo…
LonnekeScheffer Jan 11, 2023
8a72c36
Merge branch 'weight_report' into short_motif_classifier
LonnekeScheffer Jan 13, 2023
6aeb886
New report: NonMotifSimilarity
LonnekeScheffer Jan 14, 2023
d383871
rename report
LonnekeScheffer Jan 14, 2023
8fac880
Merge remote-tracking branch 'origin/short_motif_classifier' into sho…
LonnekeScheffer Jan 14, 2023
a3d3652
removed deprecated report, added requirements specific for tensorflow
LonnekeScheffer Jan 14, 2023
0858ec8
updated format of example id files for compatibility
LonnekeScheffer Jan 14, 2023
46e7376
bugfix manual splitter: it didn't work for non-string classes, now ev…
LonnekeScheffer Jan 14, 2023
ef33a7e
Merge branch 'bugfixes' into short_motif_classifier
LonnekeScheffer Jan 14, 2023
82b5185
bugfix
LonnekeScheffer Jan 14, 2023
c7ea9d1
shorten log text - becomes extremely long and unreadable
LonnekeScheffer Jan 14, 2023
8419200
Merge branch 'bugfixes' into short_motif_classifier
LonnekeScheffer Jan 14, 2023
75cd713
bugfix identifiers
LonnekeScheffer Jan 16, 2023
a3ea577
refactored out col_names stuff for simplicity
LonnekeScheffer Jan 17, 2023
80e46fa
refactoring, more shared code, splitting per motif size of motiftests…
LonnekeScheffer Jan 17, 2023
4899298
Add MotifOverlapReport
EricEReber Jan 19, 2023
0f88191
prettier plots
LonnekeScheffer Jan 23, 2023
40c3d07
all tp cutoffs in one file
LonnekeScheffer Jan 23, 2023
295999c
started implementation, abandoned idea for now
LonnekeScheffer Jan 23, 2023
1e1d13c
export simple stats from MotifEncoder
LonnekeScheffer Jan 23, 2023
d1ef769
updated plot
LonnekeScheffer Jan 24, 2023
a4ca0de
Initial version
EricEReber Jan 24, 2023
393441b
backup, installing new OS
EricEReber Jan 24, 2023
22df0bf
small edits
LonnekeScheffer Jan 25, 2023
a80a463
comment out some experimental code
LonnekeScheffer Jan 25, 2023
68e2564
bugfix test
LonnekeScheffer Jan 25, 2023
24816b1
added SimilarToPositiveSequenceEncoder: a full sequence hamming dist-…
LonnekeScheffer Jan 26, 2023
46f3e3d
add facet
EricEReber Jan 26, 2023
780a415
different sizes
EricEReber Jan 26, 2023
4a1dc9a
clean up
EricEReber Jan 26, 2023
df953e9
slight speed improvement: allow lower size limit on motifs and don't …
LonnekeScheffer Jan 27, 2023
9dc3ed7
more helpful error message
LonnekeScheffer Jan 27, 2023
ebb0889
minor updates to plot styling
LonnekeScheffer Jan 27, 2023
fb54eee
minor updates
LonnekeScheffer Jan 27, 2023
e2f120c
minor bugfix
LonnekeScheffer Jan 27, 2023
b55478e
change dataframe structure
EricEReber Jan 29, 2023
0d8bda8
all in one plot, change table
EricEReber Jan 29, 2023
6671046
add help method
EricEReber Jan 29, 2023
2cd851b
update test bench
EricEReber Jan 29, 2023
d8eb122
add duplicate max values
EricEReber Jan 29, 2023
057bf24
added option for negative amino acids to Motif encoder
LonnekeScheffer Jan 30, 2023
690753f
added option for negative amino acids to Motif encoder
LonnekeScheffer Jan 30, 2023
c2d6b3e
add top/bottom n and filtering to FeatureValueBarplot
pavlovicmilena Jan 30, 2023
b76647c
added option for negative amino acids to Motif encoder
LonnekeScheffer Jan 31, 2023
69da4cd
Add max_gap_size_only functionality
EricEReber Feb 2, 2023
cc28900
Label:
LonnekeScheffer Feb 2, 2023
39f5875
Merge branch 'bugfix_label_classes' into bugfixes
LonnekeScheffer Feb 2, 2023
3c4bda8
Merge branch 'bugfix_label_classes' into short_motif_classifier
LonnekeScheffer Feb 2, 2023
fd3b82a
fixes after new update
LonnekeScheffer Feb 2, 2023
cc84032
cleaner way of getting label desc for storing ML models
LonnekeScheffer Feb 2, 2023
d2e8360
improved tests
LonnekeScheffer Feb 3, 2023
5b9f2ae
Merge branch 'bugfix_label_classes' into bugfixes
LonnekeScheffer Feb 3, 2023
df96243
Merge branch 'bugfixes' into short_motif_classifier
LonnekeScheffer Feb 3, 2023
34d7571
minor fix
LonnekeScheffer Feb 3, 2023
1f530ae
Merge branch 'short_motif_classifier' into MotifOverlap
LonnekeScheffer Feb 3, 2023
07857f8
little refactoring, cleaned up some shared code between GroundTruthMo…
LonnekeScheffer Feb 3, 2023
d7b1c04
Merge branch 'short_motif_classifier' into PositionalFreq
LonnekeScheffer Feb 3, 2023
d08fc07
made gap plot a lineplot
LonnekeScheffer Feb 3, 2023
579a44c
default param
LonnekeScheffer Feb 3, 2023
75a5ee7
check params
LonnekeScheffer Feb 3, 2023
6e11e56
added BinaryFeaturePrecisionRecall: a precision-recall plot for Binar…
LonnekeScheffer Feb 4, 2023
da55c0d
added precision-recall plot for BinaryFeatureClassifier, plus the opt…
LonnekeScheffer Feb 4, 2023
429aa48
added precision-recall plot for BinaryFeatureClassifier, plus the opt…
LonnekeScheffer Feb 4, 2023
dc18fa7
minor update error message
LonnekeScheffer Feb 4, 2023
9a5fae7
Merge branch 'bugfixes' into short_motif_classifier
LonnekeScheffer Feb 4, 2023
3bf16d6
bugfix
LonnekeScheffer Feb 4, 2023
f0e14f6
bugfix
LonnekeScheffer Feb 4, 2023
6e9e264
improved test
LonnekeScheffer Feb 4, 2023
afa79ee
bugfix, got stuck in an infinite loop
LonnekeScheffer Feb 4, 2023
6fc0d75
bugfix
LonnekeScheffer Feb 4, 2023
e0f9454
bugfix
LonnekeScheffer Feb 4, 2023
3fe09e0
bugfixes
LonnekeScheffer Feb 4, 2023
450a0e2
temporarily set higher recursion depth to prevent crashing
LonnekeScheffer Feb 5, 2023
e5caaaa
update report to show training-validation-test set performance indepe…
LonnekeScheffer Feb 6, 2023
57aa58a
Made CompAIRR-powered version of SimilarToPositiveSequenceEncoder
LonnekeScheffer Feb 6, 2023
14add54
minor fixes GroundTruthMotifOverlap plot & make it possible for Binar…
LonnekeScheffer Feb 6, 2023
1fc588c
remove print statement
LonnekeScheffer Feb 6, 2023
59ff6de
minor update
LonnekeScheffer Feb 6, 2023
0617a40
rename highlight_motifs_path to groundtruth_motifs_path
LonnekeScheffer Feb 6, 2023
5f1c53e
bugfixes compairr-version of SimilarToPositiveSequenceEncoder
LonnekeScheffer Feb 7, 2023
a4f0206
bugfixes compairr-version of SimilarToPositiveSequenceEncoder
LonnekeScheffer Feb 7, 2023
ba4d5a3
separate output folder for learning model
LonnekeScheffer Feb 7, 2023
1b76456
added option to automatically remove test dataset (can be large)
LonnekeScheffer Feb 7, 2023
fc2c8b9
Update AminoAcidFrequencyDistribution report to show log-fold change
LonnekeScheffer Feb 9, 2023
15286ae
implemented get_attribute for Receptor. All receptors have identifier…
LonnekeScheffer Feb 9, 2023
77d069c
Merge branch 'bugfixes' into short_motif_classifier
LonnekeScheffer Feb 9, 2023
eed9106
bugfix
LonnekeScheffer Feb 9, 2023
894235a
switch from logfold change to difference in relative frequency
LonnekeScheffer Feb 9, 2023
5033692
.
LonnekeScheffer Feb 11, 2023
8432452
1-based counting of positions
LonnekeScheffer Feb 11, 2023
925a10a
functionality to export non-optimal ML models in addition to the opti…
LonnekeScheffer Feb 15, 2023
8bb8fa3
undo partial commit
LonnekeScheffer Feb 15, 2023
db1d2ec
improved efficiency of BinaryFeatureClassifier
LonnekeScheffer Feb 15, 2023
4b4fae5
added lots of log statements to find out where the running time bottl…
LonnekeScheffer Feb 15, 2023
5b13d35
keep track of val predictions instead of recomputing them every time
LonnekeScheffer Feb 15, 2023
7719032
added multiprocessing option for BinaryFeatureClassifier
LonnekeScheffer Feb 15, 2023
31a4f6b
remove default cores for training to test
LonnekeScheffer Feb 15, 2023
e253da0
bugfix: pass cores_for_training in recursive function
LonnekeScheffer Feb 15, 2023
f911915
possible speed improvement: dont recompute scoring fn when array is e…
LonnekeScheffer Feb 15, 2023
102ed8e
remove log statement
LonnekeScheffer Feb 15, 2023
0476057
- in BinaryFeatureClassifier, keep track of indices that show improve…
LonnekeScheffer Feb 15, 2023
afcc07f
updated log statement
LonnekeScheffer Feb 15, 2023
8aeb80d
remove log statements
LonnekeScheffer Feb 16, 2023
5642ee1
minor fix docs
LonnekeScheffer Feb 16, 2023
8c28e59
fixes for Label in MLApplication instruction: explicitly pass on the …
LonnekeScheffer Feb 16, 2023
0b7f3a4
Merge branch 'bugfix_label_mlapplication' into short_motif_classifier
LonnekeScheffer Feb 16, 2023
2def537
Merge branch 'master' into development
LonnekeScheffer Feb 16, 2023
ca27b0e
Allow metrics to be computed during MLApplication if the same label i…
LonnekeScheffer Feb 16, 2023
d08166c
small fix to make tests pass
LonnekeScheffer Feb 16, 2023
0991971
fix: html was overwritten
LonnekeScheffer Feb 16, 2023
5785eb8
Merge branch 'development' into short_motif_classifier
LonnekeScheffer Feb 16, 2023
a9a3f2b
bugfixes
LonnekeScheffer Feb 16, 2023
c473f46
Merge branch 'development' into short_motif_classifier
LonnekeScheffer Feb 16, 2023
232fee7
restored example weights
LonnekeScheffer Feb 16, 2023
24fa1c1
bugfix: test if proba available
LonnekeScheffer Feb 17, 2023
afd221a
bugfix: dont access _proba columns when not defined
LonnekeScheffer Feb 17, 2023
5e0e66b
bugfix: convert everything to string
LonnekeScheffer Feb 17, 2023
c812179
Merge branch 'development' into short_motif_classifier
LonnekeScheffer Feb 17, 2023
e09fdf8
small fixes
LonnekeScheffer Feb 19, 2023
9ceb587
bugfixes
LonnekeScheffer Feb 19, 2023
32911ed
Merge branch 'development' into short_motif_classifier
LonnekeScheffer Feb 19, 2023
161c535
fix bug
EricEReber Feb 20, 2023
e135dd4
big bug fix
EricEReber Feb 21, 2023
7cb23fa
added test for GroundTruthMotifOverlap + small fixes
LonnekeScheffer Feb 25, 2023
445e129
Merge remote-tracking branch 'origin/short_motif_classifier' into sho…
LonnekeScheffer Feb 25, 2023
8494ea6
small fix for faster test
LonnekeScheffer Feb 25, 2023
1774127
small updates to motif reports
LonnekeScheffer Feb 27, 2023
c51be0e
axis title updates
LonnekeScheffer Feb 27, 2023
2fd3c48
minor aesthetic update
LonnekeScheffer Feb 27, 2023
6c017e2
undo change in test
LonnekeScheffer Feb 27, 2023
2c2e059
visual updates to plots
LonnekeScheffer Feb 28, 2023
200923b
bugfix to gaps report
LonnekeScheffer Mar 2, 2023
b04f000
fixed warning
LonnekeScheffer Mar 2, 2023
24d9d88
minor fix gaps figure
LonnekeScheffer Mar 3, 2023
495020b
minor fix gaps figure
LonnekeScheffer Mar 3, 2023
ad4c04d
minor fix gaps figure
LonnekeScheffer Mar 3, 2023
d5533e0
Add new _get_max_overlap
EricEReber Mar 4, 2023
f7d28a9
remove obsolete title
LonnekeScheffer Mar 7, 2023
98086be
minor updates
LonnekeScheffer Mar 18, 2023
4d7b4f9
plot update: show line on left side of test plots for motif generaliz…
LonnekeScheffer Mar 29, 2023
cfa1003
remove obsolete report
LonnekeScheffer Mar 29, 2023
c89cf97
remove internal cv in outer assessment loop for sklearn
pavlovicmilena Apr 3, 2023
cb8b89a
Merge branch 'master' into short_motif_classifier
LonnekeScheffer Apr 3, 2023
b5d6aa6
Merge branch 'sklearn_internal_cv_fix' into short_motif_classifier
LonnekeScheffer Apr 3, 2023
d265652
merge in sklearn cv bugfix
LonnekeScheffer Apr 3, 2023
a5611a6
Merge branch 'master' into merge_master_into_short_motif_classifier
LonnekeScheffer Oct 26, 2023
a3faaf2
final bugfixes merging in master
LonnekeScheffer Oct 26, 2023
77e5503
added parameter checking when using manual splittype
LonnekeScheffer Oct 26, 2023
dbe7d25
Keras sequence CNN documentation updates + minor fixes
LonnekeScheffer Oct 26, 2023
8a2ae61
updated installation docs
LonnekeScheffer Oct 27, 2023
96dd440
Updated SimilarToPositiveSequenceEncoder, MotifEncoder and BinaryFeat…
LonnekeScheffer Oct 27, 2023
00a8b65
fixes regarding disabling allow_negative_aas option
LonnekeScheffer Oct 27, 2023
c2e2386
updated MotifGeneralizationAnalysis docs
LonnekeScheffer Oct 27, 2023
62e557d
added motif recovery tutorial to documentation
LonnekeScheffer Oct 27, 2023
ebf6a9b
updated docs
LonnekeScheffer Oct 27, 2023
ab0315d
updated docs
LonnekeScheffer Oct 27, 2023
459d6b7
remove deprecated pseudocount parameter
LonnekeScheffer Oct 27, 2023
acd5aea
removed importanceweighting strategy and updated docs for predefinedw…
LonnekeScheffer Oct 27, 2023
67f65e2
removed importanceweighting tests
LonnekeScheffer Oct 27, 2023
871d744
removed importanceweighting tests
LonnekeScheffer Oct 30, 2023
42ea64e
fixing tests
LonnekeScheffer Oct 30, 2023
956bf5f
corrected docs (and variable names): percentage-wise frequency change…
LonnekeScheffer Nov 1, 2023
efa55e8
Merge branch 'master' into merge_master_into_short_motif_classifier
LonnekeScheffer Nov 16, 2023
8b61b43
Merge latest master into short motif, resolve merge conflicts.
LonnekeScheffer Nov 27, 2023
5a04ee9
Bugfixes related to sequence frame type and 'productive' status for f…
LonnekeScheffer Nov 27, 2023
2af9157
workaround bionumpy+pickle error: not using pool but for loop
LonnekeScheffer Nov 28, 2023
48782af
Update setup.py
pavlovicmilena Dec 1, 2023
2d062f5
Update Constants.py
pavlovicmilena Dec 1, 2023
4 changes: 2 additions & 2 deletions docs/source/developer_docs/how_to_add_new_encoding.rst
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ An example of the implementation of :code:`NewKmerFrequencyEncoder` for the :py:
"""
Encodes the repertoires of the dataset by k-mer frequencies and normalizes the frequencies to zero mean and unit variance.

Arguments:
Specification arguments:

k (int): k-mer length

Expand Down Expand Up @@ -324,7 +324,7 @@ This is the example of documentation for :py:obj:`~immuneML.encodings.filtered_s
Nature Genetics 49, no. 5 (May 2017): 659–65. `doi.org/10.1038/ng.3822 <https://doi.org/10.1038/ng.3822>`_.


Arguments:
Specification arguments:

comparison_attributes (list): The attributes to be considered to group receptors into clonotypes. Only the fields specified in
comparison_attributes will be considered, all other fields are ignored. Valid comparison value can be any repertoire field name.
Expand Down
4 changes: 2 additions & 2 deletions docs/source/developer_docs/how_to_add_new_preprocessing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@ It includes implementations of the abstract methods and class documentation at t
lower_limit, or more clonotypes than specified by the upper_limit.
Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.

Arguments:
Specification arguments:

lower_limit (int): The minimal inclusive lower limit for the number of clonotypes allowed in a repertoire.

Expand Down Expand Up @@ -260,7 +260,7 @@ This is the example of documentation for :py:obj:`~immuneML.preprocessing.filter
lower_limit, or more clonotypes than specified by the upper_limit.
Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.

Arguments:
Specification arguments:

lower_limit (int): The minimal inclusive lower limit for the number of clonotypes allowed in a repertoire.

Expand Down
59 changes: 50 additions & 9 deletions docs/source/installation/install_with_package_manager.rst
Original file line number Diff line number Diff line change
Expand Up @@ -50,14 +50,6 @@ Note: when creating a python virtual environment, it will automatically use the

pip install immuneML

Alternatively, if you want to use the :ref:`TCRdistClassifier` ML method and corresponding :ref:`TCRdistMotifDiscovery` report, include the optional extra :code:`TCRdist`:

.. code-block:: console

pip install immuneML[TCRdist]

See also this question under 'Troubleshooting': :ref:`I get an error when installing PyTorch (could not find a version that satisfies the requirement torch)`



Install immuneML with conda
Expand Down Expand Up @@ -95,6 +87,25 @@ Install immuneML with conda
Installing optional dependencies
----------------------------------

TCRDist
*******

If you want to use the :ref:`TCRdistClassifier` ML method and corresponding :ref:`TCRdistMotifDiscovery` report, you can include the optional extra :code:`TCRdist`:

.. code-block:: console

pip install immuneML[TCRdist]

The TCRdist dependencies can also be installed manually using the :download:`requirements_TCRdist.txt <https://raw.githubusercontent.com/uio-bmi/immuneML/master/requirements_TCRdist.txt>` file:

.. code-block:: console

pip install -r requirements_TCRdist.txt


DeepRC
******

Optionally, if you want to use the :ref:`DeepRC` ML method and corresponding :ref:`DeepRCMotifDiscovery` report, you also
have to install DeepRC dependencies using the :download:`requirements_DeepRC.txt <https://raw.githubusercontent.com/uio-bmi/immuneML/master/requirements_DeepRC.txt>` file.
Important note: DeepRC uses PyTorch functionalities that depend on GPU. Therefore, DeepRC does not work on a CPU.
Expand All @@ -104,8 +115,38 @@ To install the DeepRC dependencies, run:

pip install -r requirements_DeepRC.txt --no-dependencies

See also this question under 'Troubleshooting': :ref:`I get an error when installing PyTorch (could not find a version that satisfies the requirement torch)`


Keras-based sequence CNN
************************

In order to use the :ref:`KerasSequenceCNN`, optional dependencies :code:`keras` and :code:`tensorflow` need to be installed.
By default, version 2.11.0 of both dependencies is used.
Other versions may work as well, as long as the versions of :code:`keras` and :code:`tensorflow` used are compatible with each other.

To install the default versions of these packages, you can include the optional extra :code:`KerasSequenceCNN`:

.. code-block:: console

pip install immuneML[KerasSequenceCNN]

Or install the dependencies manually using the :download:`requirements_KerasSequenceCNN.txt <https://raw.githubusercontent.com/uio-bmi/immuneML/master/requirements_KerasSequenceCNN.txt>` file:

.. code-block:: console

pip install -r requirements_KerasSequenceCNN.txt


The :ref:`KerasSequenceCNN` runs on the CPU; it does *not* rely on a GPU.

CompAIRR
********

If you want to use the :ref:`CompAIRRDistance` or :ref:`CompAIRRSequenceAbundance` encoder, you have to install the C++ tool `CompAIRR <https://github.com/uio-bmi/compairr>`_.
The easiest way to do this is by cloning CompAIRR from GitHub and installing it using :code:`make` in the main folder:
Furthermore, the :ref:`SimilarToPositiveSequence` encoder can be run both with and without CompAIRR, but the CompAIRR-based version is faster.

The easiest way to install CompAIRR is by cloning CompAIRR from GitHub and installing it using :code:`make` in the main folder:

.. code-block:: console

Expand Down
7 changes: 5 additions & 2 deletions docs/source/tutorials/how_to_apply_to_new_data.rst
Original file line number Diff line number Diff line change
Expand Up @@ -33,8 +33,11 @@ For a tutorial on importing datasets to immuneML (for training or applying an ML
YAML specification example using the MLApplication instruction
------------------------------------------------------------------
The :ref:`MLApplication` instruction takes in a :code:`dataset` and a :code:`config_path`. The :code:`config_path` should
point at one of the .zip files exported by the previously run :ref:`TrainMLModel` instruction. They can be found in the sub-folder
:code:`instruction_name/optimal_label_name` in the results folder.
point at one of the .zip files exported by the previously run :ref:`TrainMLModel` instruction.
The configuration of the optimal ML setting can always be found in the sub-folder :code:`<instruction_name>/optimal_<label_name>/zip` in the results folder.
Alternatively, when the :ref:`TrainMLModel` instruction is run with the parameter :code:`export_all_ml_settings` set to :code:`True`,
a config file for every ML setting in every assessment split can be found inside :code:`<instruction_name>/split_<number>/<ml_setting_name>/ml_settings_config/zip`.
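
For example, with :code:`export_all_ml_settings` enabled, :code:`config_path` can point at one of these per-split zip files. The sketch below is hypothetical: the instruction, dataset and ML setting names, the split number and the exported zip file name are placeholders.

.. code-block:: yaml

    instructions:
      apply_model:
        type: MLApplication
        dataset: my_new_dataset  # a dataset defined under definitions/datasets (not shown here)
        config_path: train_instruction/split_1/my_ml_setting/ml_settings_config/zip/exported_config.zip  # placeholder; use the actual .zip file exported there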


.. highlight:: yaml
Expand Down
45 changes: 45 additions & 0 deletions docs/source/tutorials/motif_recovery.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,51 @@ immuneML provides several different options for recovering motifs associated wit
Depending on the context, immuneML provides several different reports which can be used for this purpose.


Discovering positional motifs using precision and recall thresholds
----------------------------------------------------------------------

It is often assumed that the antigen binding status of an immune receptor (antibody/TCR) may be determined by the *presence*
of a short motif in the CDR3.
We developed a method (manuscript in preparation) for the discovery of antigen binding associated motifs with the following properties:

- Short position-specific motifs with possible gaps
- High precision for predicting antigen binding
- High generalisability to unseen data, i.e., retaining a relatively high precision on test data


Method description
^^^^^^^^^^^^^^^^^^^^^^^^^^^

A motif with a high precision for predicting antigen binding implies that when the motif is present,
the probability that the sequence is a binder is high. One can thus iterate through every possible motif and filter
the candidate motifs by applying a precision threshold. However, the rarer a motif is, the more likely it is that the motif
has a high precision merely by chance (for example: a motif that occurs in only 1 binder and 0 non-binders has a perfect precision,
but may not retain high precision on unseen data). Thus, an additional recall threshold is applied to remove
rare motifs.
Our method allows the user to define a precision threshold and learn the optimal recall threshold using a training + validation set.
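
In standard terms, treating motif presence as a binary predictor of binding status, the precision and recall of a motif can be written as:

.. math::

    \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN},

where :math:`TP` is the number of binders containing the motif, :math:`FP` the number of non-binders containing it, and :math:`FN` the number of binders not containing it. In the example above, a motif found in 1 binder and 0 non-binders has precision :math:`1/(1+0) = 1`, but its recall is only :math:`1/(\text{total number of binders})`, so a recall threshold removes such motifs.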

The method consists of the following steps:

1. Splitting the data into training, validation and test sets.

2. Using the training set, find all motifs with a high training-precision.

3. Using the validation set, determine the recall threshold for which the validation-precision is still high (separate recall thresholds may be learned for motifs with different sizes).

4. Using the combined training + validation set, find all motifs exceeding the user-defined precision threshold and learned recall threshold(s).

5. Using the test set, report the precision and recall of these learned motifs.

6. Optional: use the set of learned motifs as input features for ML classifiers (e.g., :ref:`BinaryFeatureClassifier` or :ref:`LogisticRegression`) for antigen binding prediction.

Steps 2 and 3 are carried out by the :ref:`MotifGeneralizationAnalysis` report, which exports the learned recall cutoff(s).
It is recommended to run this report using the :ref:`ExploratoryAnalysis` instruction.
Steps 4 and 5 are carried out by the :ref:`Motif` encoder, which takes the learned recall cutoff(s) as input parameters. This encoder
can be used in either the :ref:`ExploratoryAnalysis` or :ref:`TrainMLModel` instruction.
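
To illustrate how these components fit together, a minimal YAML sketch is given below. It is a hypothetical example rather than a complete specification: the dataset, report, encoding and analysis names are placeholders, the dataset and label definitions are omitted, and the parameter values are only illustrative (most follow the default parameter files shipped with this feature).

.. code-block:: yaml

    definitions:
      reports:
        motif_generalization:  # steps 2+3: learn the recall cutoff(s) from training/validation data
          MotifGeneralizationAnalysis:
            max_positions: 4
            min_positions: 1
            min_precision: 0.9
            min_true_positives: 10
            training_percentage: 0.7
            split_by_motif_size: true
      encodings:
        motif_encoding:  # steps 4+5: keep motifs exceeding the precision threshold and learned recall cutoff(s)
          Motif:
            max_positions: 4
            min_positions: 1
            min_precision: 0.8
            min_recall: 0.005  # hypothetical value; in practice taken from the recall cutoff file exported by the report
            min_true_positives: 10
    instructions:
      motif_recovery:
        type: ExploratoryAnalysis  # recommended instruction for running the report
        analyses:
          learn_recall_cutoff:
            dataset: my_dataset  # placeholder; dataset definition not shown
            report: motif_generalization

The resulting encoded features can subsequently serve as input for classifiers such as :ref:`BinaryFeatureClassifier` in a :ref:`TrainMLModel` run (step 6).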




Discovering motifs learned by classifiers
-----------------------------------------

Expand Down
8 changes: 5 additions & 3 deletions immuneML/IO/dataset_export/AIRRExporter.py
Original file line number Diff line number Diff line change
Expand Up @@ -207,12 +207,14 @@ def _postprocess_dataframe(df, dataset_labels: dict, omit_columns: list = None):
if "frame_type" in df.columns:
AIRRExporter._enums_to_strings(df, "frame_type")

df["productive"] = df["frame_type"] == SequenceFrameType.IN.name
df.loc[df["frame_type"].isnull(), "productive"] = ''
df["productive"] = df["frame_type"] == SequenceFrameType.IN.value
df.loc[df["frame_type"].isnull(), "productive"] = ""
df.loc[df["frame_type"] == "", "productive"] = ""
df.loc[df["frame_type"] == SequenceFrameType.UNDEFINED.value, "productive"] = ""

df["vj_in_frame"] = df["productive"]

df["stop_codon"] = df["frame_type"] == SequenceFrameType.STOP.name
df["stop_codon"] = df["frame_type"] == SequenceFrameType.STOP.value
df.loc[df["frame_type"].isnull(), "stop_codon"] = ''

df.drop(columns=["frame_type"], inplace=True)
Expand Down
11 changes: 7 additions & 4 deletions immuneML/IO/dataset_import/AIRRImport.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,8 @@ class AIRRImport(DataImport):

- import_productive (bool): Whether productive sequences (with value 'T' in column productive) should be included in the imported sequences. By default, import_productive is True.

- import_unknown_productivity (bool): Whether sequences with unknown productivity (missing value in column productive) should be included in the imported sequences. By default, import_unknown_productivity is True.

- import_with_stop_codon (bool): Whether sequences with stop codons (with value 'T' in column stop_codon) should be included in the imported sequences. This only applies if column stop_codon is present. By default, import_with_stop_codon is False.

- import_out_of_frame (bool): Whether out of frame sequences (with value 'F' in column vj_in_frame) should be included in the imported sequences. This only applies if column vj_in_frame is present. By default, import_out_of_frame is False.
Expand Down Expand Up @@ -110,15 +112,16 @@ def preprocess_dataframe(df: pd.DataFrame, params: DatasetImportParams):
- the allele information is removed from the V and J genes
"""
if "productive" in df.columns:
df["frame_type"] = SequenceFrameType.OUT.name
df.loc[df["productive"], "frame_type"] = SequenceFrameType.IN.name
df["frame_type"] = SequenceFrameType.UNDEFINED.value
df.loc[df["productive"]==True, "frame_type"] = SequenceFrameType.IN.value
df.loc[df["productive"]==False, "frame_type"] = SequenceFrameType.OUT.value
else:
df["frame_type"] = None

if "vj_in_frame" in df.columns:
df.loc[df["vj_in_frame"], "frame_type"] = SequenceFrameType.IN.name
df.loc[df["vj_in_frame"]==True, "frame_type"] = SequenceFrameType.IN.value
if "stop_codon" in df.columns:
df.loc[df["stop_codon"], "frame_type"] = SequenceFrameType.STOP.name
df.loc[df["stop_codon"]==True, "frame_type"] = SequenceFrameType.STOP.value

if "productive" in df.columns:
frame_type_list = ImportHelper.prepare_frame_type_list(params)
Expand Down
1 change: 1 addition & 0 deletions immuneML/IO/dataset_import/DatasetImportParams.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ class DatasetImportParams:
column_mapping_synonyms: dict = None
region_type: RegionType = None
import_productive: bool = None
import_unknown_productivity: bool = None
import_unproductive: bool = None
import_with_stop_codon: bool = None
import_out_of_frame: bool = None
Expand Down
20 changes: 15 additions & 5 deletions immuneML/IO/dataset_import/TenxGenomicsImport.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,12 @@ class TenxGenomicsImport(DataImport):

- receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values for receptor_chains are the names of the :py:obj:`~immuneML.data_model.receptor.ChainPair.ChainPair` enum. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).

- import_productive (bool): Whether productive sequences (with value 'True' in column productive) should be included in the imported sequences. By default, import_productive is True.

- import_unproductive (bool): Whether unproductive sequences (with value 'False' in column productive) should be included in the imported sequences. By default, import_unproductive is False.

- import_unknown_productivity (bool): Whether sequences with unknown productivity (missing or 'NA' value in column productive) should be included in the imported sequences. By default, import_unknown_productivity is True.

- import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon '*', or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

- import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.
Expand Down Expand Up @@ -105,17 +111,21 @@ def import_dataset(params: dict, dataset_name: str) -> Dataset:

@staticmethod
def preprocess_dataframe(df: pd.DataFrame, params: DatasetImportParams):
df["frame_type"] = None
df['productive'] = df['productive'] == 'True'
df.loc[df['productive'], "frame_type"] = SequenceFrameType.IN.name
df["frame_type"] = SequenceFrameType.UNDEFINED.value
df.loc[df['productive']=="True", "frame_type"] = SequenceFrameType.IN.value
df.loc[df['productive']=="False", "frame_type"] = SequenceFrameType.OUT.value

allowed_productive_values = []
if params.import_productive:
allowed_productive_values.append(True)
allowed_productive_values.append('True')
if params.import_unproductive:
allowed_productive_values.append(False)
allowed_productive_values.append('False')
if params.import_unknown_productivity:
allowed_productive_values.append('')
allowed_productive_values.append('NA')

df = df[df.productive.isin(allowed_productive_values)]
df.drop(columns=["productive"], inplace=True)

ImportHelper.junction_to_cdr3(df, params.region_type)
df.loc[:, "region_type"] = params.region_type.name
Expand Down
2 changes: 1 addition & 1 deletion immuneML/IO/dataset_import/VDJdbImport.py
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,7 @@ def import_dataset(params: dict, dataset_name: str) -> Dataset:

@staticmethod
def preprocess_dataframe(df: pd.DataFrame, params: DatasetImportParams):
df["frame_type"] = SequenceFrameType.IN.name
df["frame_type"] = SequenceFrameType.IN.value
ImportHelper.junction_to_cdr3(df, params.region_type)
df.loc[:, "region_type"] = params.region_type.name

Expand Down
1 change: 1 addition & 0 deletions immuneML/config/default_params/datasets/airr_params.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ is_repertoire: True
path: ./
paired: False
import_productive: True
import_unknown_productivity: True
import_with_stop_codon: False
import_out_of_frame: False
import_illegal_characters: False
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ is_repertoire: True
path: ./
paired: False
import_productive: True
import_unknown_productivity: True
import_with_stop_codon: False
import_out_of_frame: False
import_illegal_characters: False
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ is_repertoire: True
path: ./
import_productive: True # whether to only import productive sequences
import_unproductive: False # whether to only import unproductive sequences
import_unknown_productivity: True # whether to import sequences with unknown productivity (missing/NA)
import_illegal_characters: False
region_type: "IMGT_CDR3" # which region to use - IMGT_CDR3 option means removing first and last amino acid as 10xGenomics uses IMGT junction as CDR3
separator: "," # column separator
Expand Down
5 changes: 5 additions & 0 deletions immuneML/config/default_params/encodings/motif_params.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
max_positions: 4
min_positions: 1
min_precision: 0.8
min_recall: 0
min_true_positives: 10
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
hamming_distance: 1
ignore_genes: false
threads: 8
keep_temporary_files: false
compairr_path: null
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
separator: "\t"
Original file line number Diff line number Diff line change
Expand Up @@ -10,4 +10,6 @@ assessment: # outer loop of nested CV
selection: # inner loop of nested CV
split_strategy: random # perform random split to train and validation datasets
split_count: 1 # how many fold to create
training_percentage: 0.7
training_percentage: 0.7
example_weighting: null
export_all_ml_settings: False # only export the optimal model
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
training_percentage: 0.7
max_features: 100
patience: 5
min_delta: 0
keep_all: false
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
training_percentage: 0.7
units_per_layer: [[CONV, 400, 3, 1], [DROP, 0.5], [POOL, 2, 1], [FLAT], [DENSE, 50]]
activation: relu
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
training_set_identifier_path: null
training_percentage: 0.7
split_by_motif_size: true
max_positions: 4
min_positions: 1
min_precision: 0.9
min_recall: 0
min_true_positives: 1
test_precision_threshold: 0.8
highlight_motifs_name: Highlighted motif
min_points_in_window: 50
smoothing_constant1: 5
smoothing_constant2: 10
training_set_name: training set
test_set_name: test set
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
n_splits: 5
max_positions: 4
min_precision: 0
min_recall: 0
min_true_positives: 1
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
highlight_motifs_name: Highlighted motif
min_points_in_window: 50
smoothing_constant1: 5
smoothing_constant2: 10
training_set_name: training set
test_set_name: test set
split_by_motif_size: true
keep_test_dataset: true