diff --git a/docs/source/developer_docs/how_to_add_new_encoding.rst b/docs/source/developer_docs/how_to_add_new_encoding.rst index df3490d58..525207e85 100644 --- a/docs/source/developer_docs/how_to_add_new_encoding.rst +++ b/docs/source/developer_docs/how_to_add_new_encoding.rst @@ -50,7 +50,7 @@ An example of the implementation of :code:`NewKmerFrequencyEncoder` for the :py: """ Encodes the repertoires of the dataset by k-mer frequencies and normalizes the frequencies to zero mean and unit variance. - Arguments: + Specification arguments: k (int): k-mer length @@ -324,7 +324,7 @@ This is the example of documentation for :py:obj:`~immuneML.encodings.filtered_s Nature Genetics 49, no. 5 (May 2017): 659–65. `doi.org/10.1038/ng.3822 `_. - Arguments: + Specification arguments: comparison_attributes (list): The attributes to be considered to group receptors into clonotypes. Only the fields specified in comparison_attributes will be considered, all other fields are ignored. Valid comparison value can be any repertoire field name. diff --git a/docs/source/developer_docs/how_to_add_new_preprocessing.rst b/docs/source/developer_docs/how_to_add_new_preprocessing.rst index 5e7510a12..b4f23af73 100644 --- a/docs/source/developer_docs/how_to_add_new_preprocessing.rst +++ b/docs/source/developer_docs/how_to_add_new_preprocessing.rst @@ -35,7 +35,7 @@ It includes implementations of the abstract methods and class documentation at t lower_limit, or more clonotypes than specified by the upper_limit. Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets. - Arguments: + Specification arguments: lower_limit (int): The minimal inclusive lower limit for the number of clonotypes allowed in a repertoire. @@ -260,7 +260,7 @@ This is the example of documentation for :py:obj:`~immuneML.preprocessing.filter lower_limit, or more clonotypes than specified by the upper_limit. 
Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets. - Arguments: + Specification arguments: lower_limit (int): The minimal inclusive lower limit for the number of clonotypes allowed in a repertoire. diff --git a/docs/source/installation/install_with_package_manager.rst b/docs/source/installation/install_with_package_manager.rst index e7c20c9aa..2a6b2a763 100644 --- a/docs/source/installation/install_with_package_manager.rst +++ b/docs/source/installation/install_with_package_manager.rst @@ -50,14 +50,6 @@ Note: when creating a python virtual environment, it will automatically use the pip install immuneML -Alternatively, if you want to use the :ref:`TCRdistClassifier` ML method and corresponding :ref:`TCRdistMotifDiscovery` report, include the optional extra :code:`TCRdist`: - -.. code-block:: console - - pip install immuneML[TCRdist] - -See also this question under 'Troubleshooting': :ref:`I get an error when installing PyTorch (could not find a version that satisfies the requirement torch)` - Install immuneML with conda @@ -95,6 +87,25 @@ Install immuneML with conda Installing optional dependencies ---------------------------------- +TCRdist +******* + +If you want to use the :ref:`TCRdistClassifier` ML method and corresponding :ref:`TCRdistMotifDiscovery` report, you can include the optional extra :code:`TCRdist`: + +.. code-block:: console + + pip install immuneML[TCRdist] + +The TCRdist dependencies can also be installed manually using the :download:`requirements_TCRdist.txt ` file: + +.. code-block:: console + + pip install -r requirements_TCRdist.txt + + +DeepRC +****** + Optionally, if you want to use the :ref:`DeepRC` ML method and corresponding :ref:`DeepRCMotifDiscovery` report, you also have to install DeepRC dependencies using the :download:`requirements_DeepRC.txt ` file. Important note: DeepRC uses PyTorch functionalities that depend on GPU. 
Therefore, DeepRC does not work on a CPU. @@ -104,8 +115,38 @@ To install the DeepRC dependencies, run: pip install -r requirements_DeepRC.txt --no-dependencies +See also this question under 'Troubleshooting': :ref:`I get an error when installing PyTorch (could not find a version that satisfies the requirement torch)` + + +Keras-based sequence CNN +************************ + +In order to use the :ref:`KerasSequenceCNN`, optional dependencies :code:`keras` and :code:`tensorflow` need to be installed. +By default, version 2.11.0 of both dependencies is used. +Other versions may work as well, as long as the used versions of :code:`keras` and :code:`tensorflow` are compatible with each other. + +To install the default versions of these packages, you can include the optional extra :code:`KerasSequenceCNN`: + +.. code-block:: console + + pip install immuneML[KerasSequenceCNN] + +Or install the dependencies manually using the :download:`requirements_KerasSequenceCNN.txt ` file: + +.. code-block:: console + + pip install -r requirements_KerasSequenceCNN.txt + + +The :ref:`KerasSequenceCNN` runs on CPU; it does *not* rely on GPU. + +CompAIRR +******** + If you want to use the :ref:`CompAIRRDistance` or :ref:`CompAIRRSequenceAbundance` encoder, you have to install the C++ tool `CompAIRR `_. -The easiest way to do this is by cloning CompAIRR from GitHub and installing it using :code:`make` in the main folder: +Furthermore, the :ref:`SimilarToPositiveSequence` encoder can be run both with and without CompAIRR, but the CompAIRR-based version is faster. + +The easiest way to install CompAIRR is by cloning CompAIRR from GitHub and installing it using :code:`make` in the main folder: .. 
code-block:: console diff --git a/docs/source/tutorials/how_to_apply_to_new_data.rst b/docs/source/tutorials/how_to_apply_to_new_data.rst index 0277fd0ad..88bc57801 100644 --- a/docs/source/tutorials/how_to_apply_to_new_data.rst +++ b/docs/source/tutorials/how_to_apply_to_new_data.rst @@ -33,8 +33,11 @@ For a tutorial on importing datasets to immuneML (for training or applying an ML YAML specification example using the MLApplication instruction ------------------------------------------------------------------ The :ref:`MLApplication` instruction takes in a :code:`dataset` and a :code:`config_path`. The :code:`config_path` should -point at one of the .zip files exported by the previously run :ref:`TrainMLModel` instruction. They can be found in the sub-folder -:code:`instruction_name/optimal_label_name` in the results folder. +point at one of the .zip files exported by the previously run :ref:`TrainMLModel` instruction. +The configuration of the optimal ML setting can always be found in the sub-folder :code:`/optimal_/zip` in the results folder. +Alternatively, when running the :ref:`TrainMLModel` instruction with the parameter :code:`export_all_ml_settings` set to :code:`True`, +the config file for each of the ML settings can be found inside :code:`/split_//ml_settings_config/zip` +for each ML setting in each assessment split. .. highlight:: yaml diff --git a/docs/source/tutorials/motif_recovery.rst b/docs/source/tutorials/motif_recovery.rst index 12cab0ddc..4c5edd409 100644 --- a/docs/source/tutorials/motif_recovery.rst +++ b/docs/source/tutorials/motif_recovery.rst @@ -5,6 +5,51 @@ immuneML provides several different options for recovering motifs associated wit Depending on the context, immuneML provides several different reports which can be used for this purpose. 
+Discovering positional motifs using precision and recall thresholds +---------------------------------------------------------------------- + +It is often assumed that the antigen binding status of an immune receptor (antibody/TCR) may be determined by the *presence* +of a short motif in the CDR3. +We developed a method (manuscript in preparation) for the discovery of antigen binding associated motifs with the following properties: + +- Short position-specific motifs with possible gaps +- High precision for predicting antigen binding +- High generalisability to unseen data, i.e., retaining a relatively high precision on test data + + +Method description +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +A motif with a high precision for predicting antigen binding implies that when the motif is present, +the probability that the sequence is a binder is high. One can thus iterate through every possible motif and filter +them by applying a precision threshold. However, the rarer a motif is, the more likely it is that the motif has +a high precision purely by chance (for example: a motif that occurs in only 1 binder and 0 non-binders has a perfect precision, +but may not retain high precision on unseen data). Thus, an additional recall threshold is applied to remove +rare motifs. +Our method allows the user to define a precision threshold and learn the optimal recall threshold using a training + validation set. + +The method consists of the following steps: + +1. Split the data into training, validation and test sets. + +2. Using the training set, find all motifs with a high training-precision. + +3. Using the validation set, determine the recall threshold for which the validation-precision is still high (separate recall thresholds may be learned for motifs with different sizes). + +4. Using the combined training + validation set, find all motifs exceeding the user-defined precision threshold and learned recall threshold(s). + +5. 
Using the test set, report the precision and recall of these learned motifs. + +6. Optional: use the set of learned motifs as input features for ML classifiers (e.g., :ref:`BinaryFeatureClassifier` or :ref:`LogisticRegression`) for antigen binding prediction. + +Steps 2+3 are done by the report :ref:`MotifGeneralizationAnalysis`. This report exports the learned recall cutoff(s). +It is recommended to run this report using the :ref:`ExploratoryAnalysis` instruction. +Steps 4+5 are done by the :ref:`Motif` encoder. The learned recall cutoff(s) are used as input parameters. This encoder +can be used either in :ref:`ExploratoryAnalysis` or :ref:`TrainMLModel` instructions. + + + + Discovering motifs learned by classifiers ----------------------------------------- diff --git a/immuneML/IO/dataset_export/AIRRExporter.py b/immuneML/IO/dataset_export/AIRRExporter.py index 946a93c64..bfe9978b6 100644 --- a/immuneML/IO/dataset_export/AIRRExporter.py +++ b/immuneML/IO/dataset_export/AIRRExporter.py @@ -207,12 +207,14 @@ def _postprocess_dataframe(df, dataset_labels: dict, omit_columns: list = None): if "frame_type" in df.columns: AIRRExporter._enums_to_strings(df, "frame_type") - df["productive"] = df["frame_type"] == SequenceFrameType.IN.name - df.loc[df["frame_type"].isnull(), "productive"] = '' + df["productive"] = df["frame_type"] == SequenceFrameType.IN.value + df.loc[df["frame_type"].isnull(), "productive"] = "" + df.loc[df["frame_type"] == "", "productive"] = "" + df.loc[df["frame_type"] == SequenceFrameType.UNDEFINED.value, "productive"] = "" df["vj_in_frame"] = df["productive"] - df["stop_codon"] = df["frame_type"] == SequenceFrameType.STOP.name + df["stop_codon"] = df["frame_type"] == SequenceFrameType.STOP.value df.loc[df["frame_type"].isnull(), "stop_codon"] = '' df.drop(columns=["frame_type"], inplace=True) diff --git a/immuneML/IO/dataset_import/AIRRImport.py b/immuneML/IO/dataset_import/AIRRImport.py index 33742c26e..8b1e120b7 100644 --- 
a/immuneML/IO/dataset_import/AIRRImport.py +++ b/immuneML/IO/dataset_import/AIRRImport.py @@ -38,6 +38,8 @@ class AIRRImport(DataImport): - import_productive (bool): Whether productive sequences (with value 'T' in column productive) should be included in the imported sequences. By default, import_productive is True. + - import_unknown_productivity (bool): Whether sequences with unknown productivity (missing value in column productive) should be included in the imported sequences. By default, import_unknown_productivity is True. + - import_with_stop_codon (bool): Whether sequences with stop codons (with value 'T' in column stop_codon) should be included in the imported sequences. This only applies if column stop_codon is present. By default, import_with_stop_codon is False. - import_out_of_frame (bool): Whether out of frame sequences (with value 'F' in column vj_in_frame) should be included in the imported sequences. This only applies if column vj_in_frame is present. By default, import_out_of_frame is False. 
@@ -110,15 +112,16 @@ def preprocess_dataframe(df: pd.DataFrame, params: DatasetImportParams): - the allele information is removed from the V and J genes """ if "productive" in df.columns: - df["frame_type"] = SequenceFrameType.OUT.name - df.loc[df["productive"], "frame_type"] = SequenceFrameType.IN.name + df["frame_type"] = SequenceFrameType.UNDEFINED.value + df.loc[df["productive"]==True, "frame_type"] = SequenceFrameType.IN.value + df.loc[df["productive"]==False, "frame_type"] = SequenceFrameType.OUT.value else: df["frame_type"] = None if "vj_in_frame" in df.columns: - df.loc[df["vj_in_frame"], "frame_type"] = SequenceFrameType.IN.name + df.loc[df["vj_in_frame"]==True, "frame_type"] = SequenceFrameType.IN.value if "stop_codon" in df.columns: - df.loc[df["stop_codon"], "frame_type"] = SequenceFrameType.STOP.name + df.loc[df["stop_codon"]==True, "frame_type"] = SequenceFrameType.STOP.value if "productive" in df.columns: frame_type_list = ImportHelper.prepare_frame_type_list(params) diff --git a/immuneML/IO/dataset_import/DatasetImportParams.py b/immuneML/IO/dataset_import/DatasetImportParams.py index 24e43b08d..91ac7ecfa 100644 --- a/immuneML/IO/dataset_import/DatasetImportParams.py +++ b/immuneML/IO/dataset_import/DatasetImportParams.py @@ -19,6 +19,7 @@ class DatasetImportParams: column_mapping_synonyms: dict = None region_type: RegionType = None import_productive: bool = None + import_unknown_productivity: bool = None import_unproductive: bool = None import_with_stop_codon: bool = None import_out_of_frame: bool = None diff --git a/immuneML/IO/dataset_import/TenxGenomicsImport.py b/immuneML/IO/dataset_import/TenxGenomicsImport.py index 7189a7307..25c47123b 100644 --- a/immuneML/IO/dataset_import/TenxGenomicsImport.py +++ b/immuneML/IO/dataset_import/TenxGenomicsImport.py @@ -38,6 +38,12 @@ class TenxGenomicsImport(DataImport): - receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. 
Valid values for receptor_chains are the names of the :py:obj:`~immuneML.data_model.receptor.ChainPair.ChainPair` enum. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire). + - import_productive (bool): Whether productive sequences (with value 'True' in column productive) should be included in the imported sequences. By default, import_productive is True. + + - import_unproductive (bool): Whether unproductive sequences (with value 'False' in column productive) should be included in the imported sequences. By default, import_unproductive is False. + + - import_unknown_productivity (bool): Whether sequences with unknown productivity (missing or 'NA' value in column productive) should be included in the imported sequences. By default, import_unknown_productivity is True. + - import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon '*', or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False. - import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True. 
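Review note: a minimal sketch of how the three productivity flags documented above might combine into a row filter. It follows the string-based comparison on the `productive` column, but uses plain Python dicts instead of a pandas DataFrame so that it runs on its own. The function names `allowed_productive_values` and `filter_rows` are hypothetical and are not part of immuneML.

```python
def allowed_productive_values(import_productive=True, import_unproductive=False,
                              import_unknown_productivity=True):
    """Collect the raw 'productive' column values that pass the filter (illustrative only)."""
    allowed = []
    if import_productive:
        allowed.append("True")
    if import_unproductive:
        allowed.append("False")
    if import_unknown_productivity:
        # unknown productivity: a missing value (empty string here) or the literal 'NA'
        allowed.extend(["", "NA"])
    return allowed


def filter_rows(rows, **flags):
    """Keep only rows whose 'productive' value is allowed; rows are plain dicts in this sketch."""
    allowed = allowed_productive_values(**flags)
    return [row for row in rows if row.get("productive", "") in allowed]


# Made-up example rows, one per productivity state
rows = [
    {"cdr3": "CASSL", "productive": "True"},
    {"cdr3": "CASSF", "productive": "False"},
    {"cdr3": "CASSQ", "productive": "NA"},
    {"cdr3": "CASSY", "productive": ""},
]

# Documented defaults: productive and unknown rows are kept, unproductive rows are dropped
default_kept = [row["cdr3"] for row in filter_rows(rows)]

# Excluding unknown productivity keeps only rows explicitly marked 'True'
strict_kept = [row["cdr3"] for row in filter_rows(rows, import_unknown_productivity=False)]
```

With the documented defaults, only the row marked 'False' is dropped. Setting `import_unknown_productivity` to False additionally drops the 'NA' and empty rows.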
@@ -105,17 +111,21 @@ def import_dataset(params: dict, dataset_name: str) -> Dataset: @staticmethod def preprocess_dataframe(df: pd.DataFrame, params: DatasetImportParams): - df["frame_type"] = None - df['productive'] = df['productive'] == 'True' - df.loc[df['productive'], "frame_type"] = SequenceFrameType.IN.name + df["frame_type"] = SequenceFrameType.UNDEFINED.value + df.loc[df['productive']=="True", "frame_type"] = SequenceFrameType.IN.value + df.loc[df['productive']=="False", "frame_type"] = SequenceFrameType.OUT.value allowed_productive_values = [] if params.import_productive: - allowed_productive_values.append(True) + allowed_productive_values.append('True') if params.import_unproductive: - allowed_productive_values.append(False) + allowed_productive_values.append('False') + if params.import_unknown_productivity: + allowed_productive_values.append('') + allowed_productive_values.append('NA') df = df[df.productive.isin(allowed_productive_values)] + df.drop(columns=["productive"], inplace=True) ImportHelper.junction_to_cdr3(df, params.region_type) df.loc[:, "region_type"] = params.region_type.name diff --git a/immuneML/IO/dataset_import/VDJdbImport.py b/immuneML/IO/dataset_import/VDJdbImport.py index 22e8b1556..5d965d3c2 100644 --- a/immuneML/IO/dataset_import/VDJdbImport.py +++ b/immuneML/IO/dataset_import/VDJdbImport.py @@ -109,7 +109,7 @@ def import_dataset(params: dict, dataset_name: str) -> Dataset: @staticmethod def preprocess_dataframe(df: pd.DataFrame, params: DatasetImportParams): - df["frame_type"] = SequenceFrameType.IN.name + df["frame_type"] = SequenceFrameType.IN.value ImportHelper.junction_to_cdr3(df, params.region_type) df.loc[:, "region_type"] = params.region_type.name diff --git a/immuneML/config/default_params/datasets/airr_params.yaml b/immuneML/config/default_params/datasets/airr_params.yaml index 6ece7a48c..dd04ff6e5 100644 --- a/immuneML/config/default_params/datasets/airr_params.yaml +++ 
b/immuneML/config/default_params/datasets/airr_params.yaml @@ -2,6 +2,7 @@ is_repertoire: True path: ./ paired: False import_productive: True +import_unknown_productivity: True import_with_stop_codon: False import_out_of_frame: False import_illegal_characters: False diff --git a/immuneML/config/default_params/datasets/i_receptor_params.yaml b/immuneML/config/default_params/datasets/i_receptor_params.yaml index 1c8e27ab5..cf7ce2453 100644 --- a/immuneML/config/default_params/datasets/i_receptor_params.yaml +++ b/immuneML/config/default_params/datasets/i_receptor_params.yaml @@ -2,6 +2,7 @@ is_repertoire: True path: ./ paired: False import_productive: True +import_unknown_productivity: True import_with_stop_codon: False import_out_of_frame: False import_illegal_characters: False diff --git a/immuneML/config/default_params/datasets/tenx_genomics_params.yaml b/immuneML/config/default_params/datasets/tenx_genomics_params.yaml index e80155dd8..2f136948a 100644 --- a/immuneML/config/default_params/datasets/tenx_genomics_params.yaml +++ b/immuneML/config/default_params/datasets/tenx_genomics_params.yaml @@ -2,6 +2,7 @@ is_repertoire: True path: ./ import_productive: True # whether to only import productive sequences import_unproductive: False # whether to only import unproductive sequences +import_unknown_productivity: True # whether to import sequences with unknown productivity (missing/NA) import_illegal_characters: False region_type: "IMGT_CDR3" # which region to use - IMGT_CDR3 option means removing first and last amino acid as 10xGenomics uses IMGT junction as CDR3 separator: "," # column separator diff --git a/immuneML/config/default_params/encodings/motif_params.yaml b/immuneML/config/default_params/encodings/motif_params.yaml new file mode 100644 index 000000000..9bc84eaaa --- /dev/null +++ b/immuneML/config/default_params/encodings/motif_params.yaml @@ -0,0 +1,5 @@ +max_positions: 4 +min_positions: 1 +min_precision: 0.8 +min_recall: 0 +min_true_positives: 10 \ No 
newline at end of file diff --git a/immuneML/config/default_params/encodings/similar_to_positive_sequence_params.yaml b/immuneML/config/default_params/encodings/similar_to_positive_sequence_params.yaml new file mode 100644 index 000000000..c067d86fc --- /dev/null +++ b/immuneML/config/default_params/encodings/similar_to_positive_sequence_params.yaml @@ -0,0 +1,5 @@ +hamming_distance: 1 +ignore_genes: false +threads: 8 +keep_temporary_files: false +compairr_path: null \ No newline at end of file diff --git a/immuneML/config/default_params/example_weighting/predefined_weighting_params.yaml b/immuneML/config/default_params/example_weighting/predefined_weighting_params.yaml new file mode 100644 index 000000000..262b81ace --- /dev/null +++ b/immuneML/config/default_params/example_weighting/predefined_weighting_params.yaml @@ -0,0 +1 @@ +separator: "\t" \ No newline at end of file diff --git a/immuneML/config/default_params/instructions/train_ml_model_params.yaml b/immuneML/config/default_params/instructions/train_ml_model_params.yaml index 5337c4fd3..70a7d341b 100644 --- a/immuneML/config/default_params/instructions/train_ml_model_params.yaml +++ b/immuneML/config/default_params/instructions/train_ml_model_params.yaml @@ -10,4 +10,6 @@ assessment: # outer loop of nested CV selection: # inner loop of nested CV split_strategy: random # perform random split to train and validation datasets split_count: 1 # how many fold to create - training_percentage: 0.7 \ No newline at end of file + training_percentage: 0.7 +example_weighting: null +export_all_ml_settings: False # only export the optimal model \ No newline at end of file diff --git a/immuneML/config/default_params/ml_methods/binary_feature_classifier_params.yaml b/immuneML/config/default_params/ml_methods/binary_feature_classifier_params.yaml new file mode 100644 index 000000000..4a6e305fb --- /dev/null +++ b/immuneML/config/default_params/ml_methods/binary_feature_classifier_params.yaml @@ -0,0 +1,5 @@ 
+training_percentage: 0.7 +max_features: 100 +patience: 5 +min_delta: 0 +keep_all: false \ No newline at end of file diff --git a/immuneML/config/default_params/ml_methods/keras_sequence_cnn_params.yaml b/immuneML/config/default_params/ml_methods/keras_sequence_cnn_params.yaml new file mode 100644 index 000000000..3209f764f --- /dev/null +++ b/immuneML/config/default_params/ml_methods/keras_sequence_cnn_params.yaml @@ -0,0 +1,3 @@ +training_percentage: 0.7 +units_per_layer: [[CONV, 400, 3, 1], [DROP, 0.5], [POOL, 2, 1], [FLAT], [DENSE, 50]] +activation: relu \ No newline at end of file diff --git a/immuneML/config/default_params/reports/motif_generalization_analysis_params.yaml b/immuneML/config/default_params/reports/motif_generalization_analysis_params.yaml new file mode 100644 index 000000000..f1e085e24 --- /dev/null +++ b/immuneML/config/default_params/reports/motif_generalization_analysis_params.yaml @@ -0,0 +1,15 @@ +training_set_identifier_path: null +training_percentage: 0.7 +split_by_motif_size: true +max_positions: 4 +min_positions: 1 +min_precision: 0.9 +min_recall: 0 +min_true_positives: 1 +test_precision_threshold: 0.8 +highlight_motifs_name: Highlighted motif +min_points_in_window: 50 +smoothing_constant1: 5 +smoothing_constant2: 10 +training_set_name: training set +test_set_name: test set \ No newline at end of file diff --git a/immuneML/config/default_params/reports/motif_overlap_params.yaml b/immuneML/config/default_params/reports/motif_overlap_params.yaml new file mode 100644 index 000000000..8ffc06c50 --- /dev/null +++ b/immuneML/config/default_params/reports/motif_overlap_params.yaml @@ -0,0 +1,5 @@ +n_splits: 5 +max_positions: 4 +min_precision: 0 +min_recall: 0 +min_true_positives: 1 \ No newline at end of file diff --git a/immuneML/config/default_params/reports/motif_test_set_performance_params.yaml b/immuneML/config/default_params/reports/motif_test_set_performance_params.yaml new file mode 100644 index 000000000..1f7b58a7f --- /dev/null +++ 
b/immuneML/config/default_params/reports/motif_test_set_performance_params.yaml @@ -0,0 +1,8 @@ +highlight_motifs_name: Highlighted motif +min_points_in_window: 50 +smoothing_constant1: 5 +smoothing_constant2: 10 +training_set_name: training set +test_set_name: test set +split_by_motif_size: true +keep_test_dataset: true \ No newline at end of file diff --git a/immuneML/data_model/dataset/Dataset.py b/immuneML/data_model/dataset/Dataset.py index 014541b29..889e378b7 100644 --- a/immuneML/data_model/dataset/Dataset.py +++ b/immuneML/data_model/dataset/Dataset.py @@ -8,11 +8,12 @@ class Dataset: SUBSAMPLED = "subsampled" PREPROCESSED = "preprocessed" - def __init__(self, encoded_data=None, name: str = None, identifier: str = None, labels: dict = None): + def __init__(self, encoded_data=None, name: str = None, identifier: str = None, labels: dict = None, example_weights: list = None): self.encoded_data = encoded_data self.identifier = identifier self.name = name if name is not None else self.identifier self.labels = labels + self.example_weights = example_weights @classmethod @abc.abstractmethod @@ -62,3 +63,13 @@ def get_metadata(self, field_names: list, return_df: bool = False): @abc.abstractmethod def get_data_from_index_range(self, start_index: int, end_index: int): pass + + def set_example_weights(self, example_weights: list): + if example_weights is not None: + assert len(example_weights) == self.get_example_count(), f"{self.__class__.__name__}: trying to set example weights " \ + f"for dataset {self.identifier} but number of weights ({len(example_weights)}) " \ + f"does not match example count ({self.get_example_count()}). 
" + self.example_weights = example_weights + + def get_example_weights(self): + return self.example_weights \ No newline at end of file diff --git a/immuneML/data_model/dataset/ElementDataset.py b/immuneML/data_model/dataset/ElementDataset.py index 277cc4f4f..7946f52e2 100644 --- a/immuneML/data_model/dataset/ElementDataset.py +++ b/immuneML/data_model/dataset/ElementDataset.py @@ -28,12 +28,10 @@ def build(cls, dataset_file: Path, types: dict = None, filenames: list = None, * def __init__(self, labels: dict = None, encoded_data: EncodedData = None, filenames: list = None, identifier: str = None, dataset_file: Path = None, - file_size: int = 100000, name: str = None, element_class_name: str = None, element_ids: list = None, + file_size: int = 100000, name: str = None, element_class_name: str = None, + element_ids: list = None, example_weights: list = None, buffer_type=None): - super().__init__() - self.labels = labels - self.encoded_data = encoded_data - self.identifier = identifier if identifier is not None else uuid4().hex + super().__init__(encoded_data, name, identifier if identifier is not None else uuid4().hex, labels, example_weights) self.filenames = filenames if filenames is not None else [] self.filenames = [Path(filename) for filename in self.filenames] if buffer_type is None: @@ -41,7 +39,6 @@ def __init__(self, labels: dict = None, encoded_data: EncodedData = None, filena self.element_generator = ElementGenerator(self.filenames, file_size, element_class_name, buffer_type) self.file_size = file_size self.element_ids = element_ids - self.name = name self.element_class_name = element_class_name self.dataset_file = Path(dataset_file) @@ -110,6 +107,11 @@ def make_subset(self, example_indices, path, dataset_type: str): dataset_file=path / f"{dataset_name}.yaml", types=types, identifier=new_dataset_id, name=dataset_name) + # todo check if this is necessary + original_example_weights = self.get_example_weights() + if original_example_weights is not None: + 
new_dataset.set_example_weights([original_example_weights[i] for i in example_indices]) + return new_dataset def get_label_names(self): diff --git a/immuneML/data_model/dataset/RepertoireDataset.py b/immuneML/data_model/dataset/RepertoireDataset.py index d75b7b04b..217ee97b8 100644 --- a/immuneML/data_model/dataset/RepertoireDataset.py +++ b/immuneML/data_model/dataset/RepertoireDataset.py @@ -63,8 +63,9 @@ def build(cls, **kwargs): return RepertoireDataset(**{**kwargs, **{"repertoires": repertoires}}) def __init__(self, labels: dict = None, encoded_data: EncodedData = None, repertoires: list = None, identifier: str = None, - metadata_file: Path = None, name: str = None, metadata_fields: list = None, repertoire_ids: list = None): - super().__init__(encoded_data, name, identifier if identifier is not None else uuid.uuid4().hex, labels) + metadata_file: Path = None, name: str = None, metadata_fields: list = None, repertoire_ids: list = None, + example_weights: list = None): + super().__init__(encoded_data, name, identifier if identifier is not None else uuid.uuid4().hex, labels, example_weights) self.metadata_file = Path(metadata_file) if metadata_file is not None else None self.metadata_fields = metadata_fields self.repertoire_ids = repertoire_ids @@ -168,6 +169,10 @@ def make_subset(self, example_indices, path: Path, dataset_type: str): new_dataset = RepertoireDataset(repertoires=[self.repertoires[i] for i in example_indices], labels=copy.deepcopy(self.labels), metadata_file=metadata_file, identifier=str(uuid.uuid1())) + original_example_weights = self.get_example_weights() + if original_example_weights is not None: + new_dataset.set_example_weights([original_example_weights[i] for i in example_indices]) + return new_dataset def get_repertoire_ids(self) -> list: diff --git a/immuneML/data_model/encoded_data/EncodedData.py b/immuneML/data_model/encoded_data/EncodedData.py index f72cf5a60..e8f687127 100644 --- a/immuneML/data_model/encoded_data/EncodedData.py +++ 
b/immuneML/data_model/encoded_data/EncodedData.py @@ -17,7 +17,7 @@ class EncodedData: """ def __init__(self, examples, labels: dict = None, example_ids: list = None, feature_names: list = None, - feature_annotations: pd.DataFrame = None, encoding: str = None, info: dict = None, + feature_annotations: pd.DataFrame = None, encoding: str = None, example_weights: list = None, info: dict = None, dimensionality_reduced_data: np.ndarray = None): assert feature_names is None or examples.shape[1] == len(feature_names) @@ -29,14 +29,21 @@ def __init__(self, examples, labels: dict = None, example_ids: list = None, feat .format(len(label), len(example_ids)) assert examples is None or len(example_ids) == examples.shape[0], "EncodedData: there are {} example ids, but {} examples."\ .format(len(example_ids), examples.shape[0]) + + if example_weights is not None: + assert len(example_weights) == len(example_ids) if examples is not None: assert all(len(labels[key]) == examples.shape[0] for key in labels.keys()) if labels is not None else True + if example_weights is not None: + assert len(example_weights) == examples.shape[0] + self.examples = examples self.labels = labels self.example_ids = example_ids self.feature_names = feature_names self.feature_annotations = feature_annotations self.encoding = encoding + self.example_weights = example_weights self.info = info self.dimensionality_reduced_data = dimensionality_reduced_data diff --git a/immuneML/dsl/definition_parsers/DefinitionParser.py b/immuneML/dsl/definition_parsers/DefinitionParser.py index c5c3e8c02..b0dbdc7cf 100644 --- a/immuneML/dsl/definition_parsers/DefinitionParser.py +++ b/immuneML/dsl/definition_parsers/DefinitionParser.py @@ -5,6 +5,7 @@ from immuneML.dsl.DefaultParamsLoader import DefaultParamsLoader from immuneML.dsl.definition_parsers.DefinitionParserOutput import DefinitionParserOutput from immuneML.dsl.definition_parsers.EncodingParser import EncodingParser +from 
immuneML.dsl.definition_parsers.ExampleWeightingParser import ExampleWeightingParser from immuneML.dsl.definition_parsers.MLParser import MLParser from immuneML.dsl.definition_parsers.MotifParser import MotifParser from immuneML.dsl.definition_parsers.PreprocessingParser import PreprocessingParser @@ -44,8 +45,8 @@ def parse(workflow_specification: dict, symbol_table: SymbolTable, result_path: specs_defs = {} - for parser in [MotifParser, SignalParser, SimulationParser, PreprocessingParser, EncodingParser, MLParser, - ReportParser, ImportParser]: + for parser in [MotifParser, SignalParser, SimulationParser, PreprocessingParser, EncodingParser, ExampleWeightingParser, + MLParser, ReportParser, ImportParser]: symbol_table, new_specs = DefinitionParser._call_if_exists(parser.keyword, parser.parse, specs, symbol_table, result_path) specs_defs[parser.keyword] = new_specs diff --git a/immuneML/dsl/definition_parsers/EncodingParser.py b/immuneML/dsl/definition_parsers/EncodingParser.py index 5286296e5..80497502c 100644 --- a/immuneML/dsl/definition_parsers/EncodingParser.py +++ b/immuneML/dsl/definition_parsers/EncodingParser.py @@ -35,11 +35,4 @@ def parse_encoder(key: str, specs: dict): return encoder, params - @staticmethod - def parse_encoder_internal(short_class_name: str, encoder_params: dict): - encoder_class = ReflectionHandler.get_class_by_name(f"{short_class_name}Encoder", "encodings") - params = ObjectParser.get_all_params({short_class_name: encoder_params}, "encodings", short_class_name) - return encoder_class, params, params - - diff --git a/immuneML/dsl/definition_parsers/ExampleWeightingParser.py b/immuneML/dsl/definition_parsers/ExampleWeightingParser.py new file mode 100644 index 000000000..a573d8cb9 --- /dev/null +++ b/immuneML/dsl/definition_parsers/ExampleWeightingParser.py @@ -0,0 +1,39 @@ +import inspect + +from immuneML.dsl.ObjectParser import ObjectParser +from immuneML.dsl.symbol_table.SymbolTable import SymbolTable +from 
immuneML.dsl.symbol_table.SymbolType import SymbolType +from immuneML.example_weighting.ExampleWeightingStrategy import ExampleWeightingStrategy +from immuneML.util.Logger import log +from immuneML.util.ParameterValidator import ParameterValidator +from immuneML.util.ReflectionHandler import ReflectionHandler + + +class ExampleWeightingParser: + keyword = "example_weightings" + + @staticmethod + def parse(example_weighting_specs: dict, symbol_table: SymbolTable): + for key in example_weighting_specs.keys(): + + example_weighting, params = ExampleWeightingParser.parse_weighting_strategy(key, example_weighting_specs[key]) + symbol_table.add(key, SymbolType.WEIGHTING, example_weighting, {"example_weighting_params": params}) + + return symbol_table, example_weighting_specs + + + @staticmethod + @log + def parse_weighting_strategy(key: str, specs: dict): + class_path = "example_weighting" + + valid_weighting_strategies = ReflectionHandler.all_nonabstract_subclasses(ExampleWeightingStrategy, subdirectory=class_path) + + + weighting_strategy = ObjectParser.get_class(specs, valid_weighting_strategies, "", class_path, "ExampleWeightingParser", key) + params = ObjectParser.get_all_params(specs, class_path, weighting_strategy.__name__, key) + + required_params = [p for p in list(inspect.signature(weighting_strategy.__init__).parameters.keys()) if p != "self"] + ParameterValidator.assert_all_in_valid_list(params.keys(), required_params, "ExampleWeightingParser", f"{key}/{weighting_strategy.__name__}") + + return weighting_strategy, params diff --git a/immuneML/dsl/instruction_parsers/ExploratoryAnalysisParser.py b/immuneML/dsl/instruction_parsers/ExploratoryAnalysisParser.py index b621fb515..2b13d139d 100644 --- a/immuneML/dsl/instruction_parsers/ExploratoryAnalysisParser.py +++ b/immuneML/dsl/instruction_parsers/ExploratoryAnalysisParser.py @@ -23,7 +23,7 @@ class ExploratoryAnalysisParser: Each analysis is defined by a dataset identifier, a report identifier and optionally 
encoding and labels and are loaded into ExploratoryAnalysisUnit objects; - DSL example for ExploratoryAnalysisInstruction assuming that d1, r1, r2, e1 are defined previously in definitions section: + DSL example for ExploratoryAnalysisInstruction assuming that d1, p1, r1, r2, e1, w1 are defined previously in definitions section: .. highlight:: yaml .. code-block:: yaml @@ -32,12 +32,14 @@ class ExploratoryAnalysisParser: type: ExploratoryAnalysis number_of_processes: 4 analyses: - my_first_analysis: + my_first_analysis: # simple analysis running a report on a dataset dataset: d1 report: r1 - my_second_analysis: + my_second_analysis: # more complicated analysis; including preprocessing, encoding, example weighting and running a report dataset: d1 + preprocessing_sequence: p1 encoding: e1 + example_weighting: w1 report: r2 labels: - CD @@ -64,7 +66,7 @@ def parse(self, key: str, instruction: dict, symbol_table: SymbolTable, path: Pa def _prepare_params(self, analysis: dict, symbol_table: SymbolTable, yaml_location: str) -> dict: - valid_keys = ["dataset", "report", "preprocessing_sequence", "labels", "encoding", "number_of_processes", "dim_reduction"] + valid_keys = ["dataset", "report", "preprocessing_sequence", "labels", "encoding", "example_weighting", "dim_reduction"] ParameterValidator.assert_keys(list(analysis.keys()), valid_keys, "ExploratoryAnalysisParser", "analysis", False) params = {"dataset": symbol_table.get(analysis["dataset"]), "report": copy.deepcopy(symbol_table.get(analysis["report"]))} @@ -91,6 +93,9 @@ def _prepare_optional_params(self, analysis: dict, symbol_table: SymbolTable, ya if "preprocessing_sequence" in analysis: params["preprocessing_sequence"] = symbol_table.get(analysis["preprocessing_sequence"]) + if "example_weighting" in analysis: + params["example_weighting"] = symbol_table.get(analysis["example_weighting"]).build_object(dataset, **symbol_table.get_config(analysis["example_weighting"])["example_weighting_params"]) + if 
"dim_reduction" in analysis: valid_dim_reductions = {el.symbol: el.item for el in symbol_table.get_by_type(SymbolType.ML_METHOD) if isinstance(el.item, DimRedMethod)} diff --git a/immuneML/dsl/instruction_parsers/TrainMLModelParser.py b/immuneML/dsl/instruction_parsers/TrainMLModelParser.py index 454160596..61f8317d5 100644 --- a/immuneML/dsl/instruction_parsers/TrainMLModelParser.py +++ b/immuneML/dsl/instruction_parsers/TrainMLModelParser.py @@ -10,6 +10,7 @@ from immuneML.dsl.symbol_table.SymbolTable import SymbolTable from immuneML.environment.EnvironmentSettings import EnvironmentSettings from immuneML.environment.LabelConfiguration import LabelConfiguration +from immuneML.example_weighting.ExampleWeightingStrategy import ExampleWeightingStrategy from immuneML.hyperparameter_optimization.HPSetting import HPSetting from immuneML.hyperparameter_optimization.config.LeaveOneOutConfig import LeaveOneOutConfig from immuneML.hyperparameter_optimization.config.ManualSplitConfig import ManualSplitConfig @@ -28,10 +29,11 @@ class TrainMLModelParser: def parse(self, key: str, instruction: dict, symbol_table: SymbolTable, path: Path = None) -> TrainMLModelInstruction: valid_keys = ["assessment", "selection", "dataset", "strategy", "labels", "metrics", "settings", "number_of_processes", "type", "reports", - "optimization_metric", 'refit_optimal_model'] + "optimization_metric", "refit_optimal_model", "example_weighting", "export_all_ml_settings"] ParameterValidator.assert_type_and_value(instruction['settings'], list, TrainMLModelParser.__name__, 'settings') ParameterValidator.assert_keys(list(instruction.keys()), valid_keys, TrainMLModelParser.__name__, "TrainMLModel") ParameterValidator.assert_type_and_value(instruction['refit_optimal_model'], bool, TrainMLModelParser.__name__, 'refit_optimal_model') + ParameterValidator.assert_type_and_value(instruction['export_all_ml_settings'], bool, TrainMLModelParser.__name__, 'export_all_ml_settings') 
ParameterValidator.assert_type_and_value(instruction['metrics'], list, TrainMLModelParser.__name__, 'metrics') ParameterValidator.assert_type_and_value(instruction['optimization_metric'], str, TrainMLModelParser.__name__, 'optimization_metric') ParameterValidator.assert_type_and_value(instruction['number_of_processes'], int, TrainMLModelParser.__name__, 'number_of_processes') @@ -39,6 +41,9 @@ def parse(self, key: str, instruction: dict, symbol_table: SymbolTable, path: Pa if instruction["reports"] is not None: ParameterValidator.assert_type_and_value(instruction['reports'], list, TrainMLModelParser.__name__, 'reports') + if instruction["example_weighting"] is not None: + ParameterValidator.assert_type_and_value(instruction['example_weighting'], str, TrainMLModelParser.__name__, 'example_weighting') + settings = self._parse_settings(instruction, symbol_table) dataset = symbol_table.get(instruction["dataset"]) label_config = LabelHelper.create_label_config(instruction["labels"], dataset, TrainMLModelParser.__name__, key) @@ -52,12 +57,15 @@ def parse(self, key: str, instruction: dict, symbol_table: SymbolTable, path: Pa path = self._prepare_path(instruction) context = self._prepare_context(instruction, symbol_table) reports = self._prepare_reports(instruction["reports"], symbol_table) + example_weighting = self._prepare_example_weighting(instruction, symbol_table) hp_instruction = TrainMLModelInstruction(dataset=dataset, hp_strategy=strategy(settings, metric_search_criterion), hp_settings=settings, assessment=assessment, selection=selection, metrics=metrics, optimization_metric=optimization_metric, refit_optimal_model=instruction['refit_optimal_model'], label_configuration=label_config, path=path, context=context, - number_of_processes=instruction["number_of_processes"], reports=reports, name=key) + number_of_processes=instruction["number_of_processes"], reports=reports, + example_weighting=example_weighting, 
export_all_ml_settings=instruction['export_all_ml_settings'], + name=key) return hp_instruction @@ -162,6 +170,19 @@ def _parse_split_config(self, instruction_key, instruction: dict, split_key: str raise ValueError(f"{TrainMLModelParser.__name__}: Stratified k-fold cross-validation cannot be used when " f"{len(label_config.get_labels_by_name())} labels are specified. It support only one label (and multiple classes).") + if split_strategy == SplitType.MANUAL: + ParameterValidator.assert_keys(keys=instruction[split_key]["manual_config"].keys(), + valid_keys=["train_metadata_path", "test_metadata_path"], + location=TrainMLModelParser.__name__, parameter_name="manual_config", exclusive=True) + + ParameterValidator.assert_valid_tabular_file(instruction[split_key]["manual_config"]["train_metadata_path"], + location=TrainMLModelParser.__name__, + parameter_name="train_metadata_path") + + ParameterValidator.assert_valid_tabular_file(instruction[split_key]["manual_config"]["test_metadata_path"], + location=TrainMLModelParser.__name__, + parameter_name="test_metadata_path") + return SplitConfig(split_strategy=split_strategy, split_count=int(instruction[split_key]["split_count"]), training_percentage=training_percentage, @@ -190,3 +211,20 @@ def _prepare_report_config(self, instruction_key, instruction, split_key, symbol report_config_input = {} return report_config_input + + + def _prepare_example_weighting(self, instruction: dict, symbol_table: SymbolTable) -> ExampleWeightingStrategy: + example_weighting = instruction["example_weighting"] + + if example_weighting is not None: + weighting_strategy_cls = symbol_table.get(example_weighting) + + weighting_strategy_object = weighting_strategy_cls.build_object(symbol_table.get(instruction["dataset"]), + **symbol_table.get_config(example_weighting)[ + "example_weighting_params"]).set_context({"dataset": symbol_table.get(instruction['dataset'])}) + + + ParameterValidator.assert_type_and_value(weighting_strategy_object, 
ExampleWeightingStrategy, TrainMLModelParser.__name__, "example_weighting") + return weighting_strategy_object + else: + return None \ No newline at end of file diff --git a/immuneML/dsl/symbol_table/SymbolType.py b/immuneML/dsl/symbol_table/SymbolType.py index 623262b4e..bf7fbd95e 100644 --- a/immuneML/dsl/symbol_table/SymbolType.py +++ b/immuneML/dsl/symbol_table/SymbolType.py @@ -13,3 +13,4 @@ class SymbolType(Enum): PREPROCESSING = 7 INSTRUCTION = 8 OUTPUT = 9 + WEIGHTING = 10 diff --git a/immuneML/encodings/EncoderParams.py b/immuneML/encodings/EncoderParams.py index d2122ae28..ad257b842 100644 --- a/immuneML/encodings/EncoderParams.py +++ b/immuneML/encodings/EncoderParams.py @@ -8,7 +8,6 @@ class EncoderParams: result_path: Path label_config: LabelConfiguration - filename: str = "" pool_size: int = 4 model: dict = None learn_model: bool = True diff --git a/immuneML/encodings/abundance_encoding/AbundanceEncoderHelper.py b/immuneML/encodings/abundance_encoding/AbundanceEncoderHelper.py index 7f0cf91d2..cf8a6706a 100644 --- a/immuneML/encodings/abundance_encoding/AbundanceEncoderHelper.py +++ b/immuneML/encodings/abundance_encoding/AbundanceEncoderHelper.py @@ -12,22 +12,6 @@ class AbundanceEncoderHelper: INVALID_P_VALUE = 2.0 - @staticmethod - def check_labels(label_config: LabelConfiguration, location: str): - labels = label_config.get_label_objects() - assert len(labels) == 1, f"{location}: this encoding works only for single label." - - label = labels[0] - - assert isinstance(label, Label) and label.positive_class is not None and label.positive_class != "", \ - f"{location}: positive_class parameter was not set for label {label}. It has to be set to determine the " \ - f"receptor sequences associated with the positive class. " \ - f"To use this encoder, in the label definition in the specification of the instruction, define " \ - f"the positive class for the label. See documentation for this encoder for more details." 
- - assert len(label.values) == 2, f"{location}: only binary classification (2 classes) is possible when extracting " \ - f"relevant sequences for the label, but got these classes for label {label.name} instead: {label.values}." - @staticmethod def check_is_positive_class(dataset, matrix_repertoire_ids, label_config: LabelConfiguration): label = label_config.get_label_objects()[0] diff --git a/immuneML/encodings/abundance_encoding/CompAIRRSequenceAbundanceEncoder.py b/immuneML/encodings/abundance_encoding/CompAIRRSequenceAbundanceEncoder.py index 50c14f3b9..baa1a4337 100644 --- a/immuneML/encodings/abundance_encoding/CompAIRRSequenceAbundanceEncoder.py +++ b/immuneML/encodings/abundance_encoding/CompAIRRSequenceAbundanceEncoder.py @@ -105,11 +105,17 @@ def __init__(self, p_value_threshold: float, compairr_path: str, sequence_batch_ self.context = None self.compairr_sequence_presence = None - self.compairr_params = CompAIRRParams(compairr_path=Path(compairr_path), keep_compairr_input=True, - differences=0, indels=False, - ignore_counts=True, ignore_genes=ignore_genes, - threads=threads, output_filename=None, - log_filename=None, output_pairs=False, pairs_filename=None) + self.compairr_params = CompAIRRParams(compairr_path=Path(compairr_path), + keep_compairr_input=True, + differences=0, + indels=False, + ignore_counts=True, + ignore_genes=ignore_genes, + threads=threads, + output_filename=None, + log_filename=None, + output_pairs=False, + pairs_filename=None) @staticmethod def _prepare_parameters(p_value_threshold: float, compairr_path: str, sequence_batch_size: int, ignore_genes: bool, keep_temporary_files: bool, @@ -141,7 +147,7 @@ def build_object(dataset, **params): return CompAIRRSequenceAbundanceEncoder(**prepared_params) def encode(self, dataset, params: EncoderParams): - AbundanceEncoderHelper.check_labels(params.label_config, CompAIRRSequenceAbundanceEncoder.__name__) + EncoderHelper.check_positive_class_labels(params.label_config, 
CompAIRRSequenceAbundanceEncoder.__name__) self.compairr_params.is_cdr3 = dataset.repertoires[0].get_region_type() == RegionType.IMGT_CDR3 self.compairr_sequence_presence = self._prepare_sequence_presence_data(dataset, params) @@ -278,6 +284,7 @@ def _encode_data(self, dataset: RepertoireDataset, params: EncoderParams): encoded_data = EncodedData(examples, dataset.get_metadata([label.name]) if params.encode_labels else None, dataset.get_repertoire_ids(), [CompAIRRSequenceAbundanceEncoder.RELEVANT_SEQUENCE_ABUNDANCE, CompAIRRSequenceAbundanceEncoder.TOTAL_SEQUENCE_ABUNDANCE], + example_weights=dataset.get_example_weights(), encoding=CompAIRRSequenceAbundanceEncoder.__name__, info={"relevant_sequence_path": self.relevant_sequence_path, "contingency_table_path": self.contingency_table_path, diff --git a/immuneML/encodings/abundance_encoding/KmerAbundanceEncoder.py b/immuneML/encodings/abundance_encoding/KmerAbundanceEncoder.py index fceb05e8c..5bd331773 100644 --- a/immuneML/encodings/abundance_encoding/KmerAbundanceEncoder.py +++ b/immuneML/encodings/abundance_encoding/KmerAbundanceEncoder.py @@ -123,7 +123,7 @@ def build_object(dataset, **params): return KmerAbundanceEncoder(**prepared_params) def encode(self, dataset, params: EncoderParams): - AbundanceEncoderHelper.check_labels(params.label_config, KmerAbundanceEncoder.__name__) + EncoderHelper.check_positive_class_labels(params.label_config, KmerAbundanceEncoder.__name__) self._prepare_kmer_presence_data(dataset, params) return self._encode_data(dataset, params) @@ -171,6 +171,7 @@ def _encode_data(self, dataset: RepertoireDataset, params: EncoderParams): encoded_data = EncodedData(examples, dataset.get_metadata([label.name]) if params.encode_labels else None, dataset.get_repertoire_ids(), [KmerAbundanceEncoder.RELEVANT_SEQUENCE_ABUNDANCE, KmerAbundanceEncoder.TOTAL_SEQUENCE_ABUNDANCE], + example_weights=dataset.get_example_weights(), encoding=KmerAbundanceEncoder.__name__, info={"relevant_sequence_path": 
self.relevant_sequence_path, "contingency_table_path": self.contingency_table_path, diff --git a/immuneML/encodings/abundance_encoding/SequenceAbundanceEncoder.py b/immuneML/encodings/abundance_encoding/SequenceAbundanceEncoder.py index d17adaf50..1c4c26eb1 100644 --- a/immuneML/encodings/abundance_encoding/SequenceAbundanceEncoder.py +++ b/immuneML/encodings/abundance_encoding/SequenceAbundanceEncoder.py @@ -94,7 +94,7 @@ def build_object(dataset, **params): return SequenceAbundanceEncoder(**params) def encode(self, dataset, params: EncoderParams): - AbundanceEncoderHelper.check_labels(params.label_config, SequenceAbundanceEncoder.__name__) + EncoderHelper.check_positive_class_labels(params.label_config, SequenceAbundanceEncoder.__name__) self.comparison_data = self._build_comparison_data(dataset, params) return self._encode_data(dataset, params) @@ -118,6 +118,7 @@ def _encode_data(self, dataset: RepertoireDataset, params: EncoderParams): encoded_data = EncodedData(examples, dataset.get_metadata([label_name]) if params.encode_labels else None, dataset.get_repertoire_ids(), [SequenceAbundanceEncoder.RELEVANT_SEQUENCE_ABUNDANCE, SequenceAbundanceEncoder.TOTAL_SEQUENCE_ABUNDANCE], + example_weights=dataset.get_example_weights(), encoding=SequenceAbundanceEncoder.__name__, info={'relevant_sequence_path': self.relevant_sequence_path, "contingency_table_path": self.contingency_table_path, "p_values_path": self.p_values_path}) diff --git a/immuneML/encodings/atchley_kmer_encoding/AtchleyKmerEncoder.py b/immuneML/encodings/atchley_kmer_encoding/AtchleyKmerEncoder.py index e14c000fe..cc2fe1c36 100644 --- a/immuneML/encodings/atchley_kmer_encoding/AtchleyKmerEncoder.py +++ b/immuneML/encodings/atchley_kmer_encoding/AtchleyKmerEncoder.py @@ -103,6 +103,7 @@ def encode(self, dataset, params: EncoderParams): feature_names = [f"atchley_factor_{j}_aa_{i}" for i in range(1, self.k + 1) for j in range(1, Util.ATCHLEY_FACTOR_COUNT + 1)] + ["abundance"] encoded_data = 
EncodedData(examples=examples, example_ids=dataset.get_example_ids(), feature_names=feature_names, labels=labels, + example_weights=dataset.get_example_weights(), encoding=AtchleyKmerEncoder.__name__, info={"kmer_keys": self.kmer_keys}) encoded_dataset = dataset.clone() diff --git a/immuneML/encodings/deeprc/DeepRCEncoder.py b/immuneML/encodings/deeprc/DeepRCEncoder.py index edfe026d9..0996fd191 100644 --- a/immuneML/encodings/deeprc/DeepRCEncoder.py +++ b/immuneML/encodings/deeprc/DeepRCEncoder.py @@ -93,7 +93,8 @@ def encode(self, dataset, params: EncoderParams) -> RepertoireDataset: encoded_dataset = dataset.clone() encoded_dataset.encoded_data = EncodedData(examples=None, labels=dataset.get_metadata(labels) if params.encode_labels else None, - example_ids=dataset.repertoire_ids, + example_ids=dataset.get_repertoire_ids(), + example_weights=dataset.get_example_weights(), encoding=DeepRCEncoder.__name__, info={"metadata_filepath": metadata_filepath, "max_sequence_length": self.max_sequence_length}) diff --git a/immuneML/encodings/distance_encoding/CompAIRRDistanceEncoder.py b/immuneML/encodings/distance_encoding/CompAIRRDistanceEncoder.py index cb0c69949..40fd5ad6b 100644 --- a/immuneML/encodings/distance_encoding/CompAIRRDistanceEncoder.py +++ b/immuneML/encodings/distance_encoding/CompAIRRDistanceEncoder.py @@ -141,6 +141,7 @@ def encode(self, dataset: RepertoireDataset, params: EncoderParams) -> Repertoir labels=labels, feature_names=distance_matrix.columns.values, example_ids=distance_matrix.index.values, + example_weights=EncoderHelper.get_example_weights_by_identifiers(dataset, distance_matrix.index.values), encoding=CompAIRRDistanceEncoder.__name__) return encoded_dataset diff --git a/immuneML/encodings/distance_encoding/DistanceEncoder.py b/immuneML/encodings/distance_encoding/DistanceEncoder.py index ca6f9374c..d4c9bbf22 100644 --- a/immuneML/encodings/distance_encoding/DistanceEncoder.py +++ b/immuneML/encodings/distance_encoding/DistanceEncoder.py @@ 
-1,3 +1,4 @@ +import warnings from pathlib import Path import pandas as pd @@ -118,13 +119,13 @@ def build_labels(self, dataset: RepertoireDataset, params: EncoderParams) -> dic return tmp_labels def encode(self, dataset, params: EncoderParams) -> RepertoireDataset: - train_repertoire_ids = EncoderHelper.prepare_training_ids(dataset, params) distance_matrix = self.build_distance_matrix(dataset, params, train_repertoire_ids) labels = self.build_labels(dataset, params) if params.encode_labels else None encoded_dataset = dataset.clone() encoded_dataset.encoded_data = EncodedData(examples=distance_matrix, labels=labels, example_ids=distance_matrix.index.values, + example_weights=EncoderHelper.get_example_weights_by_identifiers(dataset, distance_matrix.index.values), encoding=DistanceEncoder.__name__) return encoded_dataset diff --git a/immuneML/encodings/distance_encoding/TCRdistEncoder.py b/immuneML/encodings/distance_encoding/TCRdistEncoder.py index b51d69556..f55033dd1 100644 --- a/immuneML/encodings/distance_encoding/TCRdistEncoder.py +++ b/immuneML/encodings/distance_encoding/TCRdistEncoder.py @@ -1,3 +1,4 @@ +import warnings from pathlib import Path import pandas as pd @@ -60,6 +61,7 @@ def encode(self, dataset, params: EncoderParams): encoded_dataset = dataset.clone() encoded_dataset.encoded_data = EncodedData(examples=distance_matrix, labels=labels, example_ids=distance_matrix.index.values, + example_weights=EncoderHelper.get_example_weights_by_identifiers(dataset, distance_matrix.index.values), encoding=TCRdistEncoder.__name__) return encoded_dataset diff --git a/immuneML/encodings/evenness_profile/EvennessProfileEncoder.py b/immuneML/encodings/evenness_profile/EvennessProfileEncoder.py index b57b62afc..b1de32143 100644 --- a/immuneML/encodings/evenness_profile/EvennessProfileEncoder.py +++ b/immuneML/encodings/evenness_profile/EvennessProfileEncoder.py @@ -120,6 +120,7 @@ def _encode_data(self, dataset, params: EncoderParams) -> EncodedData: 
feature_names=feature_names, example_ids=example_ids, feature_annotations=feature_annotations, + example_weights=EncoderHelper.get_example_weights_by_identifiers(dataset, example_ids), encoding=EvennessProfileEncoder.__name__) return encoded_data diff --git a/immuneML/encodings/kmer_frequency/KmerFrequencyEncoder.py b/immuneML/encodings/kmer_frequency/KmerFrequencyEncoder.py index a933be9ce..0df6cbd13 100644 --- a/immuneML/encodings/kmer_frequency/KmerFrequencyEncoder.py +++ b/immuneML/encodings/kmer_frequency/KmerFrequencyEncoder.py @@ -229,6 +229,7 @@ def _encode_data(self, dataset, params: EncoderParams) -> EncodedData: feature_names=feature_names, example_ids=example_ids, feature_annotations=feature_annotations, + example_weights=EncoderHelper.get_example_weights_by_identifiers(dataset, example_ids), encoding=KmerFrequencyEncoder.__name__) return encoded_data diff --git a/immuneML/encodings/motif_encoding/MotifEncoder.py b/immuneML/encodings/motif_encoding/MotifEncoder.py new file mode 100644 index 000000000..898a87cb4 --- /dev/null +++ b/immuneML/encodings/motif_encoding/MotifEncoder.py @@ -0,0 +1,426 @@ +import logging +import warnings +from functools import partial +from multiprocessing.pool import Pool +from pathlib import Path + +import pandas as pd +import numpy as np +from sklearn.metrics import precision_score, recall_score, confusion_matrix + +from immuneML.caching.CacheHandler import CacheHandler +from immuneML.data_model.dataset.SequenceDataset import SequenceDataset +from immuneML.data_model.encoded_data.EncodedData import EncodedData +from immuneML.encodings.DatasetEncoder import DatasetEncoder +from immuneML.encodings.EncoderParams import EncoderParams +from immuneML.encodings.motif_encoding.PositionalMotifParams import PositionalMotifParams +from immuneML.environment.LabelConfiguration import LabelConfiguration +from immuneML.util.EncoderHelper import EncoderHelper +from immuneML.util.ParameterValidator import ParameterValidator + + +from 
immuneML.encodings.motif_encoding.PositionalMotifHelper import PositionalMotifHelper +from immuneML.util.PathBuilder import PathBuilder + + +class MotifEncoder(DatasetEncoder): + """ + This encoder enumerates every possible positional motif, and keeps only the motifs associated with the positive class. + A 'motif' is defined as a combination of position-specific amino acids. These motifs may contain one or multiple gaps. + Motifs are filtered out based on a minimal precision and recall threshold for predicting the positive class. + + Note: the MotifEncoder can only be used for sequences of the same length. + + The ideal recall threshold(s) given a user-defined precision threshold can be calibrated using the + :py:obj:`~immuneML.reports.data_reports.MotifGeneralizationAnalysis` report. It is recommended to first run this report + in :py:obj:`~immuneML.workflows.instructions.exploratory_analysis.ExploratoryAnalysisInstruction` before using this encoder for ML. + + This encoder can be used in combination with the :py:obj:`~immuneML.ml_methods.BinaryFeatureClassifier` in order to + learn a minimal set of compatible motifs for predicting the positive class. + Alternatively, it may be combined with scikit-learn methods, such as :py:obj:`~immuneML.ml_methods.LogisticRegression`, + to learn a weight per motif. + + + Arguments: + + max_positions (int): The maximum motif size. This is the number of positional amino acids the motif consists of (excluding gaps). The default value for max_positions is 4. + + min_positions (int): The minimum motif size (see also: max_positions). The default value for min_positions is 1. + + min_precision (float): The minimum precision threshold for keeping a motif. The default value for min_precision is 0.8. + + min_recall (float): The minimum recall threshold for keeping a motif. The default value for min_recall is 0. + It is also possible to specify a recall threshold for each motif size.
In this case, a dictionary must be specified where + the motif sizes are keys and the recall values are values. Use the :py:obj:`~immuneML.reports.data_reports.MotifGeneralizationAnalysis` report + to calibrate the optimal recall threshold given a user-defined precision threshold to ensure generalisability to unseen data. + + min_true_positives (int): The minimum number of true positive sequences that a motif needs to occur in. The default value for min_true_positives is 10. + + candidate_motif_filepath (str): Optional filepath for pre-filtered candidate motifs. This may be used to save time. Only the given candidate motifs are considered. + When this encoder has been run previously, a candidate motifs file named 'all_candidate_motifs.tsv' will have been exported. This file contains all + possible motifs with high enough min_true_positives without applying precision and recall thresholds. + The file must be a tab-separated file, structured as follows: + + ======== ============== + indices amino_acids + ======== ============== + 1&2&3 A&G&C + 5&7 E&D + ======== ============== + + The example above contains two motifs: AGC in positions 123, and E-D in positions 5-7 (with a gap at position 6). + + label (str): The name of the binary label to train the encoder for. This is only necessary when the dataset contains multiple labels. + + + YAML specification: + + .. indent with spaces + ..
code-block:: yaml + + my_motif_encoder: + MotifEncoder: + max_positions: 4 + min_precision: 0.8 + min_recall: # different recall thresholds for each motif size + 1: 0.5 # For shorter motifs, a stricter recall threshold is used + 2: 0.1 + 3: 0.01 + 4: 0.001 + min_true_positives: 10 + + + + + """ + + def __init__(self, max_positions: int = None, min_positions: int = None, + min_precision: float = None, min_recall: dict = None, + min_true_positives: int = None, + candidate_motif_filepath: str = None, label: str = None, name: str = None): + self.max_positions = max_positions + self.min_positions = min_positions + self.min_precision = min_precision + self.min_recall = min_recall + self.min_true_positives = min_true_positives + self.candidate_motif_filepath = Path(candidate_motif_filepath) if candidate_motif_filepath is not None else None + self.learned_motif_filepath = None + + self.label = label + self.name = name + self.context = None + + @staticmethod + def _prepare_parameters(max_positions: int = None, min_positions: int = None, min_precision: float = None, min_recall: dict = None, + min_true_positives: int = None, candidate_motif_filepath: str = None, label: str = None, name: str = None): + + location = MotifEncoder.__name__ + + ParameterValidator.assert_type_and_value(max_positions, int, location, "max_positions", min_inclusive=1) + ParameterValidator.assert_type_and_value(min_positions, int, location, "min_positions", min_inclusive=1) + assert max_positions >= min_positions, f"{location}: max_positions ({max_positions}) must be greater than or equal to min_positions ({min_positions})" + + ParameterValidator.assert_type_and_value(min_precision, (int, float), location, "min_precision", min_inclusive=0, max_inclusive=1) + ParameterValidator.assert_type_and_value(min_true_positives, int, location, "min_true_positives", min_inclusive=1) + + if isinstance(min_recall, dict): + assert set(min_recall.keys()) == set(range(min_positions, max_positions+1)), f"{location}: 
{min_recall} is not a valid value for parameter min_recall. " \ + f"When setting separate recall cutoffs for each motif size, the keys of the dictionary " \ + f"must equal to {list(range(min_positions, max_positions+1))}." + for recall_cutoff in min_recall.values(): + assert isinstance(recall_cutoff, (int, float)) or recall_cutoff is None, f"{location}: {min_recall} is not a valid value for parameter min_recall. " \ + f"When setting separate recall cutoffs for each motif size, the values of the dictionary " \ + f"must be numeric or None." + + min_recall = {key: value if isinstance(value, (int, float)) else 1 for key, value in min_recall.items()} + + else: + ParameterValidator.assert_type_and_value(min_recall, (int, float), location, "min_recall", min_inclusive=0, max_inclusive=1) + min_recall = {motif_size: min_recall for motif_size in range(min_positions, max_positions+1)} + + if candidate_motif_filepath is not None: + PositionalMotifHelper.check_motif_filepath(candidate_motif_filepath, location, "candidate_motif_filepath") + + if label is not None: + ParameterValidator.assert_type_and_value(label, str, location, "label") + + return { + "max_positions": max_positions, + "min_positions": min_positions, + "min_precision": min_precision, + "min_recall": min_recall, + "min_true_positives": min_true_positives, + "candidate_motif_filepath": candidate_motif_filepath, + "label": label, + "name": name, + } + + @staticmethod + def build_object(dataset=None, **params): + if isinstance(dataset, SequenceDataset): + prepared_params = MotifEncoder._prepare_parameters(**params) + return MotifEncoder(**prepared_params) + else: + raise ValueError(f"{MotifEncoder.__name__} is not defined for dataset types which are not SequenceDataset.") + + def encode(self, dataset, params: EncoderParams): + if params.learn_model: + EncoderHelper.check_positive_class_labels(params.label_config, MotifEncoder.__name__) + return self._encode_data(dataset, params) + else: + learned_motifs = 
PositionalMotifHelper.read_motifs_from_file(self.learned_motif_filepath) + return self.get_encoded_dataset_from_motifs(dataset, learned_motifs, params) + + def _encode_data(self, dataset, params: EncoderParams): + learned_motifs = self._compute_motifs(dataset, params) + + self.learned_motif_filepath = params.result_path / "significant_motifs.tsv" + self.motif_stats_filepath = params.result_path / "motif_stats.tsv" + + PositionalMotifHelper.write_motifs_to_file(learned_motifs, self.learned_motif_filepath) + self._write_motif_stats(learned_motifs, self.motif_stats_filepath) + + return self.get_encoded_dataset_from_motifs(dataset, learned_motifs, params) + + def _compute_motifs(self, dataset, params): + motifs = self._prepare_candidate_motifs(dataset, params) + + y_true = self._get_y_true(dataset, params.label_config) + + motifs = self._filter_motifs(motifs, dataset, y_true, params.pool_size, generalized=False) + + # Option disabled for now + # if self.generalize_motifs: + # motifs += self._filter_motifs(PositionalMotifHelper.get_generalized_motifs(motifs), dataset, y_true, params.pool_size, generalized=True) + + return motifs + + def _write_motif_stats(self, learned_motifs, motif_stats_filepath): + try: + data = {} + + data["motif_size"] = list(range(self.min_positions, self.max_positions + 1)) + data["min_precision"] = [self.min_precision] * len(data["motif_size"]) + data["min_recall"] = [self.min_recall.get(motif_size, 1) for motif_size in range(self.min_positions, self.max_positions + 1)] + + all_motif_sizes = [len(motif[0]) for motif in learned_motifs] + data["n_motifs"] = [all_motif_sizes.count(motif_size) for motif_size in range(self.min_positions, self.max_positions + 1)] + + df = pd.DataFrame(data) + df.to_csv(motif_stats_filepath, index=False, sep="\t") + except Exception as e: + warnings.warn(f"{MotifEncoder.__name__}: could not write motif stats.
Exception was: {e}") + + def get_encoded_dataset_from_motifs(self, dataset, motifs, params): + labels = EncoderHelper.encode_element_dataset_labels(dataset, params.label_config) + + examples, feature_names, feature_annotations = self._construct_encoded_data_matrix(dataset, motifs, + params.label_config, params.pool_size) + + self._export_confusion_matrix(params.result_path, feature_annotations) + + encoded_dataset = dataset.clone() + encoded_dataset.encoded_data = EncodedData(examples=examples, + labels=labels, + feature_names=feature_names, + feature_annotations=feature_annotations, + example_ids=dataset.get_example_ids(), + encoding=MotifEncoder.__name__, + example_weights=dataset.get_example_weights(), + info={"candidate_motif_filepath": self.candidate_motif_filepath, + "learned_motif_filepath": self.learned_motif_filepath, + "positive_class": self._get_positive_class(params.label_config)}) + + return encoded_dataset + + def _export_confusion_matrix(self, result_path, feature_annotations): + try: + PathBuilder.build(result_path) + feature_annotations.to_csv(result_path / "confusion_matrix.tsv", index=False, sep="\t") + except Exception as e: + logging.exception(f"MotifEncoder: An exception occurred while exporting the confusion matrix: {e}") + + def _prepare_candidate_motifs(self, dataset, params): + full_dataset = EncoderHelper.get_current_dataset(dataset, self.context) + candidate_motifs = self._get_candidate_motifs(full_dataset, params.pool_size) + assert len(candidate_motifs) > 0, f"{MotifEncoder.__name__}: no candidate motifs were found. " \ + f"Please try decreasing the value for parameter 'min_true_positives'." 
+ + self.candidate_motif_filepath = params.result_path / "all_candidate_motifs.tsv" + PositionalMotifHelper.write_motifs_to_file(candidate_motifs, self.candidate_motif_filepath) + + return candidate_motifs + + def _get_candidate_motifs(self, full_dataset, pool_size=4): + '''Returns all candidate motifs, which are either read from the input file or computed by finding + all motifs occurring in at least a given number of sequences of the full dataset.''' + if self.candidate_motif_filepath is None: + return CacheHandler.memo_by_params(self._build_candidate_motifs_params(full_dataset), + lambda: self._compute_candidate_motifs(full_dataset, pool_size)) + else: + return PositionalMotifHelper.read_motifs_from_file(self.candidate_motif_filepath) + + def _build_candidate_motifs_params(self, dataset: SequenceDataset): + return (("dataset_identifier", dataset.identifier), + ("sequence_ids", tuple(dataset.get_example_ids())), + ("example_weights", type(dataset.get_example_weights())), + ("max_positions", self.max_positions), + ("min_positions", self.min_positions), + ("min_true_positives", self.min_true_positives)) + + def _compute_candidate_motifs(self, full_dataset, pool_size=4): + np_sequences = PositionalMotifHelper.get_numpy_sequence_representation(full_dataset) + params = PositionalMotifParams(max_positions=self.max_positions, min_positions=self.min_positions, + count_threshold=self.min_true_positives, pool_size=pool_size) + return PositionalMotifHelper.compute_all_candidate_motifs(np_sequences, params) + + def _get_y_true(self, dataset, label_config: LabelConfiguration): + labels = EncoderHelper.encode_element_dataset_labels(dataset, label_config) + + label_name = self._get_label_name(label_config) + label = label_config.get_label_object(label_name) + + return np.array([cls == label.positive_class for cls in labels[label_name]]) + + def _get_positive_class(self, label_config): + label_name = self._get_label_name(label_config) + label = 
label_config.get_label_object(label_name) + + return label.positive_class + + def _get_label_name(self, label_config: LabelConfiguration): + if self.label is not None: + assert self.label in label_config.get_labels_by_name(), f"{MotifEncoder.__name__}: specified label " \ + f"'{self.label}' was not present among the dataset labels: " \ + f"{', '.join(label_config.get_labels_by_name())}" + label_name = self.label + else: + label_name = EncoderHelper.get_single_label_name_from_config(label_config, MotifEncoder.__name__) + + return label_name + + def check_filtered_motifs(self, filtered_motifs): + assert len(filtered_motifs) > 0, f"{MotifEncoder.__name__}: no significant motifs were found. " \ + f"Please try decreasing the values for parameters 'min_precision' or 'min_recall'" + + def _get_recall_repr(self): + '''Returns a string representation of the recall cutoff.''' + if len(set(self.min_recall.values())) == 1: + return str(list(self.min_recall.values())[0]) + else: + return ", ".join([f"{recall} (motif size {motif_size})" for motif_size, recall in self.min_recall.items()]) + + def _filter_motifs(self, candidate_motifs, dataset, y_true, pool_size, generalized=False): + motif_type = "generalized motifs" if generalized else "motifs" + + logging.info(f"{MotifEncoder.__name__}: filtering {len(candidate_motifs)} {motif_type} with precision >= {self.min_precision} and recall >= {self._get_recall_repr()}") + + np_sequences = PositionalMotifHelper.get_numpy_sequence_representation(dataset) + weights = dataset.get_example_weights() + + with Pool(pool_size) as pool: + partial_func = partial(self._check_motif, np_sequences=np_sequences, y_true=y_true, weights=weights) + + filtered_motifs = list(filter(None, pool.map(partial_func, candidate_motifs))) + + if not generalized: + self.check_filtered_motifs(filtered_motifs) + + logging.info(f"{MotifEncoder.__name__}: filtering {motif_type} done, {len(filtered_motifs)} motifs left") + + return filtered_motifs + + def 
_check_motif(self, motif, np_sequences, y_true, weights): + indices, amino_acids = motif + + pred = PositionalMotifHelper.test_motif(np_sequences, indices, amino_acids) + + if sum(pred & y_true) >= self.min_true_positives: + if precision_score(y_true=y_true, y_pred=pred, sample_weight=weights) >= self.min_precision: + if len(indices) in self.min_recall.keys(): + if recall_score(y_true=y_true, y_pred=pred, sample_weight=weights) >= self.min_recall[len(indices)]: + return motif + + + + def _construct_encoded_data_matrix(self, dataset, motifs, label_config, number_of_processes): + feature_names = [PositionalMotifHelper.motif_to_string(indices, amino_acids, motif_sep="-", newline=False) + for indices, amino_acids in motifs] + + weights = dataset.get_example_weights() + y_true = self._get_y_true(dataset, label_config) + np_sequences = PositionalMotifHelper.get_numpy_sequence_representation(dataset) + + logging.info(f"{MotifEncoder.__name__}: building encoded data matrix...") + + with Pool(number_of_processes) as pool: + predictions = pool.starmap(partial(self._test_motif, np_sequences=np_sequences), motifs) + conf_matrix_raw = np.array(pool.map(partial(self._get_confusion_matrix, y_true=y_true, weights=None), predictions)) + + if weights is not None: + conf_matrix_weighted = np.array(pool.map(partial(self._get_confusion_matrix, y_true=y_true, weights=weights), predictions)) + else: + conf_matrix_weighted = None + + # precision_scores = pool.map(partial(self._get_precision, y_true=y_true, weights=weights), predictions) + # recall_scores = pool.map(partial(self._get_recall, y_true=y_true, weights=weights), predictions) + # tp_counts = pool.map(partial(self._get_tp, y_true=y_true), predictions) + + logging.info(f"{MotifEncoder.__name__}: building encoded data matrix done") + + prefix = "weighted_" if weights is not None else "" + + feature_annotations = self._get_feature_annotations(feature_names, conf_matrix_raw, conf_matrix_weighted) + + return 
np.column_stack(predictions), feature_names, feature_annotations + + def _get_feature_annotations(self, feature_names, conf_matrix_raw, conf_matrix_weighted): + feature_annotations_mapping = {"feature_names": feature_names, + "TN": conf_matrix_raw.T[0], + "FP": conf_matrix_raw.T[1], + "FN": conf_matrix_raw.T[2], + "TP": conf_matrix_raw.T[3]} + + if conf_matrix_weighted is not None: + feature_annotations_mapping["weighted_TN"] = conf_matrix_weighted.T[0] + feature_annotations_mapping["weighted_FP"] = conf_matrix_weighted.T[1] + feature_annotations_mapping["weighted_FN"] = conf_matrix_weighted.T[2] + feature_annotations_mapping["weighted_TP"] = conf_matrix_weighted.T[3] + + return pd.DataFrame(feature_annotations_mapping) + + def _get_predictions(self, np_sequences, motifs, number_of_processes): + with Pool(number_of_processes) as pool: + partial_func = partial(self._test_motif, np_sequences=np_sequences) + predictions = pool.starmap(partial_func, motifs) + + return predictions + + def _test_motif(self, indices, amino_acids, np_sequences): + return PositionalMotifHelper.test_motif(np_sequences=np_sequences, indices=indices, amino_acids=amino_acids) + + def _get_confusion_matrix(self, pred, y_true, weights): + return confusion_matrix(y_true=y_true, y_pred=pred, sample_weight=weights).ravel() + # + # def _get_precision(self, pred, y_true, weights): + # return precision_score(y_true=y_true, y_pred=pred, sample_weight=weights, zero_division=0) + # + # def _get_recall(self, pred, y_true, weights): + # return recall_score(y_true=y_true, y_pred=pred, sample_weight=weights, zero_division=0) + + # def _get_tp(self, pred, y_true): + # return sum(pred & y_true) + + def set_context(self, context: dict): + self.context = context + return self + + @staticmethod + def export_encoder(path: Path, encoder) -> Path: + encoder_file = DatasetEncoder.store_encoder(encoder, path / "encoder.pickle") + return encoder_file + + @staticmethod + def load_encoder(encoder_file: Path): + encoder = 
DatasetEncoder.load_encoder(encoder_file) + return encoder diff --git a/immuneML/encodings/motif_encoding/PositionalMotifHelper.py b/immuneML/encodings/motif_encoding/PositionalMotifHelper.py new file mode 100644 index 000000000..616e42f01 --- /dev/null +++ b/immuneML/encodings/motif_encoding/PositionalMotifHelper.py @@ -0,0 +1,333 @@ +import logging +import numpy as np +from multiprocessing import Pool +import itertools as it +from functools import partial +from pathlib import Path + +from immuneML.caching.CacheHandler import CacheHandler +from immuneML.encodings.motif_encoding.PositionalMotifParams import PositionalMotifParams +from immuneML.environment.EnvironmentSettings import EnvironmentSettings +from immuneML.environment.SequenceType import SequenceType +from immuneML.util.ParameterValidator import ParameterValidator +from immuneML.util.PathBuilder import PathBuilder + + +class PositionalMotifHelper: + + @staticmethod + def get_numpy_sequence_representation(dataset): + return CacheHandler.memo_by_params((("dataset_identifier", dataset.identifier), + "np_sequence_representation", + ("example_ids", tuple(dataset.get_example_ids()))), + lambda: PositionalMotifHelper.compute_numpy_sequence_representation(dataset)) + + @staticmethod + def compute_numpy_sequence_representation(dataset, location=None): + '''Computes an efficient unicode representation for SequenceDatasets where all sequences have the same length''' + + location = PositionalMotifHelper.__name__ if location is None else location + + n_sequences = dataset.get_example_count() + all_sequences = [None] * n_sequences + sequence_length = None + + for i, sequence in enumerate(dataset.get_data()): + sequence_str = sequence.get_sequence() + all_sequences[i] = sequence_str + + if sequence_length is None: + sequence_length = len(sequence_str) + else: + assert len(sequence_str) == sequence_length, f"{location}: expected all " \ + f"sequences to be of length {sequence_length}, found " \ + f"{len(sequence_str)}: 
'{sequence_str}'." + + unicode = np.array(all_sequences, dtype=f"U{sequence_length}") + return unicode.view('U1').reshape(n_sequences, -1) + + @staticmethod + def test_aa(sequences, index, aa): + if aa.isupper(): + return sequences[:, index] == aa + else: + return sequences[:, index] != aa.upper() + + @staticmethod + def test_position(np_sequences, index, aas): + return np.logical_or.reduce([PositionalMotifHelper.test_aa(np_sequences, index, aa) for aa in aas]) + + @staticmethod + def test_motif(np_sequences, indices, amino_acids): + ''' + Tests for all sequences whether it contains the given motif (defined by indices and amino acids) + ''' + return np.logical_and.reduce([PositionalMotifHelper.test_position(np_sequences, index, amino_acid) for + index, amino_acid in zip(indices, amino_acids)]) + + @staticmethod + def _test_new_position(existing_positions, new_position, negative_aa=False): + if new_position in existing_positions: + return False + + # regular amino acids are only allowed to be added to the right of a motif (to prevent recomputing the same motif) + # whereas negative amino acids may be added anywhere + if not negative_aa: + if max(existing_positions) > new_position: + return False + + return True + + @staticmethod + def add_position_to_base_motif(base_motif, new_position, new_aa): + # new_index = base_motif[0] + [new_position] + # new_aas = base_motif[1] + [new_aa] + + new_index = sorted(base_motif[0] + [new_position]) + new_aas = base_motif[1].copy() + new_aas.insert(new_index.index(new_position), new_aa) + + return new_index, new_aas + + @staticmethod + def extend_motif(base_motif, np_sequences, legal_positional_aas, count_threshold=10, negative_aa=False): + new_candidates = [] + + sequence_length = len(np_sequences[0]) + + for new_position in range(sequence_length): + if PositionalMotifHelper._test_new_position(base_motif[0], new_position, negative_aa=negative_aa): + for new_aa in legal_positional_aas[new_position]: + new_aa = new_aa.lower() if 
negative_aa else new_aa + + new_index, new_aas = PositionalMotifHelper.add_position_to_base_motif(base_motif, new_position, new_aa) + pred = PositionalMotifHelper.test_motif(np_sequences, new_index, new_aas) + + if sum(pred) >= count_threshold: + new_candidates.append([new_index, new_aas]) + + return new_candidates + + # @staticmethod + # def identify_n_possible_motifs(np_sequences, count_threshold, motif_sizes): + # n_possible_motifs = {} + # + # legal_pos_aas = PositionalMotifHelper.identify_legal_positional_aas(np_sequences, count_threshold=count_threshold) + # n_aas_per_pos = {position: len(aas) for position, aas in legal_pos_aas.items()} + # + # for motif_size in motif_sizes: + # n_possible_motifs[motif_size] = PositionalMotifHelper._identify_n_motifs_of_size(n_aas_per_pos, motif_size) + # + # return n_possible_motifs + # + # @staticmethod + # def _identify_n_motifs_of_size(n_aas_per_pos, motif_size): + # n_motifs_for_motif_size = 0 + # + # for index_set in it.combinations(n_aas_per_pos.keys(), motif_size): + # n_motifs_for_index = 1 + # + # for index in index_set: + # n_motifs_for_index *= n_aas_per_pos[index] + # + # n_motifs_for_motif_size += n_motifs_for_index + # + # return n_motifs_for_motif_size + + @staticmethod + def identify_legal_positional_aas(np_sequences, count_threshold=10): + sequence_length = len(np_sequences[0]) + + legal_positional_aas = {position: [] for position in range(sequence_length)} + + for index in range(sequence_length): + for amino_acid in EnvironmentSettings.get_sequence_alphabet(SequenceType.AMINO_ACID): + pred = PositionalMotifHelper.test_position(np_sequences, index, amino_acid) + if sum(pred) >= count_threshold: + legal_positional_aas[index].append(amino_acid) + + return legal_positional_aas + + @staticmethod + def _get_single_aa_candidate_motifs(legal_positional_aas): + return {1: [[[index], [amino_acid]] for index in legal_positional_aas.keys() for amino_acid in + legal_positional_aas[index]]} + + @staticmethod + def 
_add_multi_aa_candidate_motifs(np_sequences, candidate_motifs, legal_positional_aas, params): + for n_positions in range(2, params.max_positions + 1): + logging.info(f"{PositionalMotifHelper.__name__}: extrapolating motifs with {n_positions} positions and occurrence > {params.count_threshold}") + + with Pool(params.pool_size) as pool: + partial_func = partial(PositionalMotifHelper.extend_motif, np_sequences=np_sequences, + legal_positional_aas=legal_positional_aas, count_threshold=params.count_threshold, + negative_aa=False) + new_candidates = pool.map(partial_func, candidate_motifs[n_positions - 1]) + + candidate_motifs[n_positions] = list( + it.chain.from_iterable(new_candidates)) + + logging.info(f"{PositionalMotifHelper.__name__}: found {len(candidate_motifs[n_positions])} candidate motifs with {n_positions} positions") + + return candidate_motifs + + def _add_negative_aa_candidate_motifs(np_sequences, candidate_motifs, legal_positional_aas, params): + ''' + Negative aa option is temporarily not in use for MotifEncoder, some fixes still need to be made: + - for a negative aa to be legal, both positive and negative version of that aa must occur at least count_threshold + times in that position. This to prevent 'clutter' motifs: if F never occurs in position 8, it is not worth having + a motif with not-F in position 8. + - All motif-related reports need to be checked to see if they can work with negative aas. 
+ ''' + + for n_positions in range(max(params.min_positions, 2), params.max_positions + 1): + logging.info(f"{PositionalMotifHelper.__name__}: computing motifs with {n_positions+1} positions of which 1 negative amino acid") + + with Pool(params.pool_size) as pool: + partial_func = partial(PositionalMotifHelper.extend_motif, np_sequences=np_sequences, + legal_positional_aas=legal_positional_aas, count_threshold=params.count_threshold, + negative_aa=True) + new_candidates = pool.map(partial_func, candidate_motifs[n_positions - 1]) + new_candidates = list(it.chain.from_iterable(new_candidates)) + + candidate_motifs[n_positions].extend(new_candidates) + + logging.info(f"{PositionalMotifHelper.__name__}: found {len(new_candidates)} candidate motifs with {n_positions} positions of which 1 negative amino acid") + + return candidate_motifs + + @staticmethod + def compute_all_candidate_motifs(np_sequences, params: PositionalMotifParams): + + logging.info(f"{PositionalMotifHelper.__name__}: computing candidate motifs with occurrence > {params.count_threshold} in dataset") + + legal_positional_aas = PositionalMotifHelper.identify_legal_positional_aas(np_sequences, params.count_threshold) + candidate_motifs = PositionalMotifHelper._get_single_aa_candidate_motifs(legal_positional_aas) + candidate_motifs = PositionalMotifHelper._add_multi_aa_candidate_motifs(np_sequences, candidate_motifs, legal_positional_aas, params) + + # todo caching at single aa and multi-aa + if params.allow_negative_aas: + candidate_motifs = PositionalMotifHelper._add_negative_aa_candidate_motifs(np_sequences, candidate_motifs, legal_positional_aas, params) + + candidate_motifs = {motif_size: motifs for motif_size, motifs in candidate_motifs.items() if motif_size >= params.min_positions} + + candidate_motifs = list(it.chain(*candidate_motifs.values())) + + logging.info(f"{PositionalMotifHelper.__name__}: candidate motif computing done. 
Found {len(candidate_motifs)} with a length between {params.min_positions} and {params.max_positions}") + + return candidate_motifs + + @staticmethod + def motif_to_string(indices, amino_acids, value_sep="&", motif_sep="\t", newline=True): + suffix = "\n" if newline else "" + return f"{value_sep.join([str(idx) for idx in indices])}{motif_sep}{value_sep.join(amino_acids)}{suffix}" + + @staticmethod + def string_to_motif(string, value_sep, motif_sep): + indices_str, amino_acids_str = string.strip().split(motif_sep) + indices = [int(i) for i in indices_str.split(value_sep)] + amino_acids = amino_acids_str.split(value_sep) + return indices, amino_acids + + @staticmethod + def get_motif_size(string_repr, value_sep="&", motif_sep="-"): + return len(PositionalMotifHelper.string_to_motif(string_repr, value_sep=value_sep, motif_sep=motif_sep)[0]) + + @staticmethod + def check_file_header(header, motif_filepath, expected_header="indices\tamino_acids\n"): + assert header == expected_header, f"{PositionalMotifHelper.__name__}: motif file at {motif_filepath} " \ + f"is expected to contain this header: '{expected_header}', " \ + f"found the following instead: '{header}'" + @staticmethod + def check_motif_filepath(motif_filepath, location, parameter_name, expected_header="indices\tamino_acids\n"): + ParameterValidator.assert_type_and_value(motif_filepath, str, location, parameter_name) + + motif_filepath = Path(motif_filepath) + + assert motif_filepath.is_file(), f"{location}: the file {motif_filepath} does not exist. " \ + f"Specify the correct path under motif_filepath." 
+ + with open(motif_filepath) as file: + PositionalMotifHelper.check_file_header(file.readline(), motif_filepath, expected_header) + + @staticmethod + def read_motifs_from_file(filepath): + with open(filepath) as file: + PositionalMotifHelper.check_file_header(file.readline(), filepath, expected_header="indices\tamino_acids\n") + motifs = [PositionalMotifHelper.string_to_motif(line, value_sep="&", motif_sep="\t") for line in file.readlines()] + + return motifs + + @staticmethod + def write_motifs_to_file(motifs, filepath): + PathBuilder.build(filepath.parent) + + with open(filepath, "a") as file: + file.write("indices\tamino_acids\n") + + for indices, amino_acids in motifs: + file.write(PositionalMotifHelper.motif_to_string(indices, amino_acids)) + + @staticmethod + def get_generalized_motifs(motifs): + ''' + Generalized motifs option is temporarily not in use by MotifEncoder, as there does not seem to be a clear purpose as of now. + ''' + sorted_motifs = PositionalMotifHelper.sort_motifs_by_index(motifs) + generalized_motifs = [] + + for indices, all_motif_amino_acids in sorted_motifs.items(): + if len(all_motif_amino_acids) > 1 and len(indices) > 1: + generalized_motifs.extend(list(PositionalMotifHelper.get_generalized_motifs_for_index(indices, all_motif_amino_acids))) + + return generalized_motifs + + @staticmethod + def sort_motifs_by_index(motifs): + sorted_motifs = {} + + for index, amino_acids in motifs: + if tuple(index) not in sorted_motifs: + sorted_motifs[tuple(index)] = [amino_acids] + else: + sorted_motifs[tuple(index)].append(amino_acids) + + return sorted_motifs + + @staticmethod + def get_generalized_motifs_for_index(indices, all_motif_amino_acids): + # loop over motifs, allowing flexibility only in the amino acid at flex_aa_index + for flex_aa_index in range(len(indices)): + shared_aa_indices = [i for i in range(len(indices)) if i != flex_aa_index] + + # for each motif, get the flexible aa (1) and constant aas (>= 1) + flex_aa = 
[motif[flex_aa_index] for motif in all_motif_amino_acids] + constant_aas = ["".join([motif[index] for index in shared_aa_indices]) for motif in all_motif_amino_acids] + + # get only those motifs where there exist another motif sharing the constant_aas + is_generalizable = [i for i in range(len(all_motif_amino_acids)) if constant_aas.count(constant_aas[i]) > 1] + flex_aa = [flex_aa[i] for i in is_generalizable] + constant_aas = [constant_aas[i] for i in is_generalizable] + + # from a constant part and multiple flexible amino acids, construct a generalized motif + for constant_motif_part in set(constant_aas): + flex_motif_aas = [flex_aa[i] for i in range(len(constant_aas)) if constant_aas[i] == constant_motif_part] + + for flex_aa_subset in PositionalMotifHelper.get_flex_aa_sets(flex_motif_aas): + generalized_motif = list(constant_motif_part) + generalized_motif.insert(flex_aa_index, flex_aa_subset) + + yield [list(indices), generalized_motif] + + @staticmethod + def get_flex_aa_sets(amino_acids): + sets = [] + amino_acids = sorted(amino_acids) + amino_acids = [aa for aa in amino_acids if aa.isupper()] + + for subset_size in range(2, len(amino_acids)+1): + for combo in it.combinations(amino_acids, subset_size): + sets.append("".join(combo)) + + return sets + diff --git a/immuneML/encodings/motif_encoding/PositionalMotifParams.py b/immuneML/encodings/motif_encoding/PositionalMotifParams.py new file mode 100644 index 000000000..6814eb1b7 --- /dev/null +++ b/immuneML/encodings/motif_encoding/PositionalMotifParams.py @@ -0,0 +1,9 @@ +from dataclasses import dataclass + +@dataclass +class PositionalMotifParams: + max_positions: int + min_positions: int + count_threshold: int + pool_size: int = 4 + allow_negative_aas: bool = False diff --git a/immuneML/encodings/motif_encoding/SimilarToPositiveSequenceEncoder.py b/immuneML/encodings/motif_encoding/SimilarToPositiveSequenceEncoder.py new file mode 100644 index 000000000..2712ad8c7 --- /dev/null +++ 
b/immuneML/encodings/motif_encoding/SimilarToPositiveSequenceEncoder.py @@ -0,0 +1,254 @@ +import logging +from pathlib import Path + +import numpy as np + + +from immuneML.analysis.SequenceMatcher import SequenceMatcher +from immuneML.data_model.dataset.Dataset import Dataset +from immuneML.data_model.dataset.SequenceDataset import SequenceDataset +from immuneML.data_model.encoded_data.EncodedData import EncodedData +from immuneML.encodings.DatasetEncoder import DatasetEncoder +from immuneML.encodings.EncoderParams import EncoderParams +from immuneML.environment.LabelConfiguration import LabelConfiguration +from immuneML.util.CompAIRRHelper import CompAIRRHelper +from immuneML.util.CompAIRRParams import CompAIRRParams +from immuneML.util.EncoderHelper import EncoderHelper +from immuneML.util.ParameterValidator import ParameterValidator + +from immuneML.util.PathBuilder import PathBuilder + + +class SimilarToPositiveSequenceEncoder(DatasetEncoder): + """ + A simple baseline encoding, to be used in combination with :py:obj:`~immuneML.ml_methods.BinaryFeatureClassifier.BinaryFeatureClassifier` using keep_all = True. + This encoder keeps track of all positive sequences in the training set, and ignores the negative sequences. + Any sequence within a given hamming distance from a positive training sequence will be classified positive, + all other sequences will be classified negative. + + Arguments: + + hamming_distance (int): Maximum number of differences allowed between any positive sequence of the training set and a + new observed sequence in order for the observed sequence to be classified as 'positive'. + + compairr_path (Path): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR + has been installed such that it can be called directly on the command line with the command 'compairr', + or that it is located at /usr/local/bin/compairr. + + ignore_genes (bool): Only used when compairr is used. 
Whether to ignore V and J gene information. If False, the V and J genes between two sequences + have to match for the sequence to be considered 'similar'. If True, gene information is ignored. By default, ignore_genes is False. + + threads (int): The number of threads to use for parallelization. This does not affect the results of the encoding, only the speed. + The default number of threads is 8. + + keep_temporary_files (bool): whether to keep temporary files, including CompAIRR input, output and log files, and the sequence + presence matrix. This may take a lot of storage space if the input dataset is large. By default temporary files are not kept. + + + + YAML specification: + + .. indent with spaces + .. code-block:: yaml + + my_sequence_encoder: + SimilarToPositiveSequenceEncoder: + hamming_distance: 2 + """ + + def __init__(self, hamming_distance: int = None, compairr_path: str = None, + ignore_genes: bool = None, threads: int = None, keep_temporary_files: bool = None, + name: str = None): + self.hamming_distance = hamming_distance + self.compairr_path = Path(compairr_path) if compairr_path is not None else None + self.ignore_genes = ignore_genes + self.threads = threads + self.keep_temporary_files = keep_temporary_files + + self.positive_sequences = None + self.name = name + self.context = None + + @staticmethod + def _prepare_parameters(hamming_distance: int = None, compairr_path: str = None, ignore_genes: bool = None, + threads: int = None, keep_temporary_files: bool = None, name: str = None): + location = SimilarToPositiveSequenceEncoder.__name__ + + ParameterValidator.assert_type_and_value(hamming_distance, int, location, "hamming_distance", min_inclusive=0) + + if compairr_path is not None: + ParameterValidator.assert_type_and_value(compairr_path, str, location, "compairr_path") + CompAIRRHelper.check_compairr_path(compairr_path) + + ParameterValidator.assert_type_and_value(ignore_genes, bool, location, "ignore_genes") + 
ParameterValidator.assert_type_and_value(threads, int, location, "threads") + ParameterValidator.assert_type_and_value(keep_temporary_files, bool, location, "keep_temporary_files") + + + return { + "hamming_distance": hamming_distance, + "ignore_genes": ignore_genes, + "threads": threads, + "keep_temporary_files": keep_temporary_files, + "compairr_path": compairr_path, + "name": name, + } + + @staticmethod + def build_object(dataset=None, **params): + if isinstance(dataset, SequenceDataset): + prepared_params = SimilarToPositiveSequenceEncoder._prepare_parameters(**params) + return SimilarToPositiveSequenceEncoder(**prepared_params) + else: + raise ValueError(f"{SimilarToPositiveSequenceEncoder.__name__} is not defined for dataset types which are not SequenceDataset.") + + def encode(self, dataset, params: EncoderParams): + if params.learn_model: + EncoderHelper.check_positive_class_labels(params.label_config, SimilarToPositiveSequenceEncoder.__name__) + + self.positive_sequences = self._get_positive_sequences(dataset, params) + + return self._encode_data(dataset, params) + + def _get_positive_sequences(self, dataset, params): + subset_path = PathBuilder.build(params.result_path / "positive_sequences") + label_name = EncoderHelper.get_single_label_name_from_config(params.label_config, + SimilarToPositiveSequenceEncoder.__name__) + + label_obj = params.label_config.get_label_object(label_name) + classes = dataset.get_metadata([label_name])[label_name] + + subset_indices = [idx for idx in range(dataset.get_example_count()) if classes[idx] == label_obj.positive_class] + + return dataset.make_subset(subset_indices, + path=subset_path, + dataset_type=Dataset.SUBSAMPLED) + + def _encode_data(self, dataset, params: EncoderParams): + examples = self.get_sequence_matching_feature(dataset, params) + + labels = EncoderHelper.encode_element_dataset_labels(dataset, params.label_config) + + encoded_dataset = dataset.clone() + encoded_dataset.encoded_data = 
EncodedData(examples=examples, + labels=labels, + feature_names=["similar_to_positive_sequence"], + feature_annotations=None, + example_ids=dataset.get_example_ids(), + encoding=SimilarToPositiveSequenceEncoder.__name__, + example_weights=dataset.get_example_weights(), + info={}) + + return encoded_dataset + + def get_sequence_matching_feature(self, dataset, params: EncoderParams): + if self.compairr_path is None: + return self.get_sequence_matching_feature_without_compairr(dataset) + else: + return self.get_sequence_matching_feature_with_compairr(dataset, params) + + def get_sequence_matching_feature_with_compairr(self, dataset, params: EncoderParams): + compairr_result_path = PathBuilder.build(params.result_path / f"compairr_data/learn_{params.learn_model}") + compairr_params = self._get_compairr_params() + + pos_sequences_path, all_sequences_path = self._write_compairr_input_files(dataset, compairr_result_path, compairr_params) + compairr_result = self._run_compairr(compairr_params, all_sequences_path, pos_sequences_path, compairr_result_path) + examples = self._parse_compairr_results(dataset, compairr_result, compairr_params, compairr_result_path) + + if not self.keep_temporary_files: + import shutil + shutil.rmtree(compairr_result_path, ignore_errors=False, onerror=None) + + return examples + + def _write_compairr_input_files(self, dataset, compairr_result_path, compairr_params): + pos_sequences_path = compairr_result_path / "positive_sequences.tsv" + all_sequences_path = compairr_result_path / "all_sequences.tsv" + + CompAIRRHelper.write_sequences_file(self.positive_sequences, pos_sequences_path, compairr_params, repertoire_id="positive_sequences") + CompAIRRHelper.write_sequences_file(dataset, all_sequences_path, compairr_params, repertoire_id="all_sequences") + + return pos_sequences_path, all_sequences_path + + def _run_compairr(self, compairr_params, all_sequences_path, pos_sequences_path, compairr_result_path): + import subprocess + + args = 
CompAIRRHelper.get_cmd_args(compairr_params, [all_sequences_path, pos_sequences_path], compairr_result_path) + logging.info(f"{SimilarToPositiveSequenceEncoder.__name__}: running CompAIRR with the following arguments: {' '.join(args)}") + compairr_result = subprocess.run(args, capture_output=True, text=True) + + return compairr_result + + def _parse_compairr_results(self, dataset, compairr_result, compairr_params, compairr_result_path): + result = CompAIRRHelper.process_compairr_output_file(compairr_result, compairr_params, compairr_result_path) + result.index = result.index.astype(str) + + if list(result.index) != dataset.get_example_ids(): + if set(list(result.index)) != set(dataset.get_example_ids()): + logging.warning(f"{SimilarToPositiveSequenceEncoder.__name__}: CompAIRR index: {list(result.index)}") + logging.warning(f"{SimilarToPositiveSequenceEncoder.__name__}: Dataset identifiers: {dataset.get_example_ids()}") + assert False, f"{SimilarToPositiveSequenceEncoder.__name__}: error when reindexing CompAIRR results: CompAIRR index does not match dataset identifiers. See the log file for more information." 
+ + result = result.reindex(dataset.get_example_ids()) + + return np.array([result["positive_sequences"] > 0]).T + + def _get_compairr_params(self): + return CompAIRRParams(compairr_path=self.compairr_path, + keep_compairr_input=self.keep_temporary_files, + differences=self.hamming_distance, + indels=False, + ignore_counts=True, + ignore_genes=self.ignore_genes, + threads=self.threads, + output_pairs=False, + pairs_filename=None, + output_filename="compairr_out.txt", + log_filename="compairr_log.txt", + do_repertoire_overlap=False, + do_sequence_matching=True) + + def get_sequence_matching_feature_without_compairr(self, dataset): + matcher = SequenceMatcher() + + examples = [] + + for sequence in dataset.get_data(): + is_matching = False + + for ref_sequence in self.positive_sequences.get_data(): + if matcher.matches_sequence(sequence, ref_sequence, self.hamming_distance): + is_matching = True + break + + examples.append(is_matching) + + return np.array([examples]).T + + def _get_y_true(self, dataset, label_config: LabelConfiguration): + labels = EncoderHelper.encode_element_dataset_labels(dataset, label_config) + + label_name = EncoderHelper.get_single_label_name_from_config(label_config, SimilarToPositiveSequenceEncoder.__name__) + label = label_config.get_label_object(label_name) + + return np.array([cls == label.positive_class for cls in labels[label_name]]) + + def _get_positive_class(self, label_config): + label_name = EncoderHelper.get_single_label_name_from_config(label_config, SimilarToPositiveSequenceEncoder.__name__) + label = label_config.get_label_object(label_name) + + return label.positive_class + + def set_context(self, context: dict): + self.context = context + return self + + @staticmethod + def export_encoder(path: Path, encoder) -> Path: + encoder_file = DatasetEncoder.store_encoder(encoder, path / "encoder.pickle") + return encoder_file + + @staticmethod + def load_encoder(encoder_file: Path): + encoder = 
DatasetEncoder.load_encoder(encoder_file) + return encoder diff --git a/immuneML/encodings/motif_encoding/__init__.py b/immuneML/encodings/motif_encoding/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/immuneML/encodings/onehot/OneHotReceptorEncoder.py b/immuneML/encodings/onehot/OneHotReceptorEncoder.py index f764af2e7..d60cac697 100644 --- a/immuneML/encodings/onehot/OneHotReceptorEncoder.py +++ b/immuneML/encodings/onehot/OneHotReceptorEncoder.py @@ -4,6 +4,7 @@ from immuneML.data_model.encoded_data.EncodedData import EncodedData from immuneML.encodings.EncoderParams import EncoderParams from immuneML.encodings.onehot.OneHotEncoder import OneHotEncoder +from immuneML.util.EncoderHelper import EncoderHelper class OneHotReceptorEncoder(OneHotEncoder): @@ -36,7 +37,6 @@ def _encode_data(self, dataset: ReceptorDataset, params: EncoderParams): max_seq_len = max(max([len(seq) for seq in first_chain_seqs]), max([len(seq) for seq in second_chain_seqs])) - example_ids = dataset.get_example_ids() labels = self._get_labels(receptor_objs, params) if params.encode_labels else None examples_first_chain = self._encode_sequence_list(first_chain_seqs, pad_n_sequences=len(receptor_objs), @@ -54,9 +54,10 @@ def _encode_data(self, dataset: ReceptorDataset, params: EncoderParams): encoded_data = EncodedData(examples=examples, labels=labels, - example_ids=example_ids, + example_ids=dataset.get_example_ids(), feature_names=feature_names, encoding=OneHotEncoder.__name__, + example_weights=dataset.get_example_weights(), info={"chain_names": receptor_objs[0].get_chains() if all(receptor_obj.get_chains() == receptor_objs[0].get_chains() for receptor_obj in receptor_objs) else None}) return encoded_data diff --git a/immuneML/encodings/onehot/OneHotRepertoireEncoder.py b/immuneML/encodings/onehot/OneHotRepertoireEncoder.py index ca6966f53..d321bc21c 100644 --- a/immuneML/encodings/onehot/OneHotRepertoireEncoder.py +++ 
b/immuneML/encodings/onehot/OneHotRepertoireEncoder.py @@ -10,6 +10,7 @@ from immuneML.data_model.encoded_data.EncodedData import EncodedData from immuneML.encodings.EncoderParams import EncoderParams from immuneML.encodings.onehot.OneHotEncoder import OneHotEncoder +from immuneML.util.EncoderHelper import EncoderHelper class OneHotRepertoireEncoder(OneHotEncoder): @@ -77,6 +78,7 @@ def _encode_data(self, dataset, params: EncoderParams): example_ids=repertoire_names, labels=labels, feature_names=feature_names, + example_weights=EncoderHelper.get_example_weights_by_identifiers(dataset, repertoire_names), encoding=OneHotEncoder.__name__) return encoded_data diff --git a/immuneML/encodings/onehot/OneHotSequenceEncoder.py b/immuneML/encodings/onehot/OneHotSequenceEncoder.py index d5b90bbd6..e7138bb18 100644 --- a/immuneML/encodings/onehot/OneHotSequenceEncoder.py +++ b/immuneML/encodings/onehot/OneHotSequenceEncoder.py @@ -35,7 +35,6 @@ def _encode_data(self, dataset: SequenceDataset, params: EncoderParams): f"{OneHotEncoder.__name__}: sequence dataset {dataset.name} (id: {dataset.identifier}) contains empty sequences for the specified " f"sequence type {self.sequence_type.name.lower()}. 
Please check that the dataset is imported correctly.") - example_ids = dataset.get_example_ids() max_seq_len = max([len(seq) for seq in sequences]) labels = self._get_labels(sequence_objs, params) if params.encode_labels else None @@ -49,8 +48,9 @@ def _encode_data(self, dataset: SequenceDataset, params: EncoderParams): encoded_data = EncodedData(examples=examples, labels=labels, - example_ids=example_ids, + example_ids=dataset.get_example_ids(), feature_names=feature_names, + example_weights=dataset.get_example_weights(), encoding=OneHotEncoder.__name__) return encoded_data diff --git a/immuneML/encodings/reference_encoding/MatchedRegexRepertoireEncoder.py b/immuneML/encodings/reference_encoding/MatchedRegexRepertoireEncoder.py index 63e12f0f5..e80146257 100644 --- a/immuneML/encodings/reference_encoding/MatchedRegexRepertoireEncoder.py +++ b/immuneML/encodings/reference_encoding/MatchedRegexRepertoireEncoder.py @@ -27,7 +27,8 @@ def _encode_new_dataset(self, dataset, params: EncoderParams): encoded_dataset.add_encoded_data(EncodedData( examples=encoded_repertoires, - example_ids=list(dataset.get_metadata(["subject_id"]).values())[0], + example_ids=dataset.get_example_ids(), #list(dataset.get_metadata(["subject_id"]).values())[0], + example_weights=dataset.get_example_weights(), feature_names=list(feature_annotations["chain_id"]), feature_annotations=feature_annotations, labels=labels, diff --git a/immuneML/encodings/word2vec/Word2VecEncoder.py b/immuneML/encodings/word2vec/Word2VecEncoder.py index f6e7c30fc..c1354880d 100644 --- a/immuneML/encodings/word2vec/Word2VecEncoder.py +++ b/immuneML/encodings/word2vec/Word2VecEncoder.py @@ -178,10 +178,12 @@ def _build_encoded_dataset(self, dataset: Dataset, scaled_examples, labels, para feature_annotations = pd.DataFrame({"feature": feature_names}) encoded_dataset.encoded_data = EncodedData(examples=scaled_examples, - labels={label: labels[i] for i, label in enumerate(label_names)} if labels is not None else None, - 
example_ids=[example.identifier for example in encoded_dataset.get_data()], + labels={label: labels[i] for i, label in + enumerate(label_names)} if labels is not None else None, + example_ids=dataset.get_example_ids(), feature_names=feature_names, feature_annotations=feature_annotations, + example_weights=dataset.get_example_weights(), encoding=Word2VecEncoder.__name__) return encoded_dataset diff --git a/immuneML/environment/Constants.py b/immuneML/environment/Constants.py index e38ad3b1e..995a2ca1c 100644 --- a/immuneML/environment/Constants.py +++ b/immuneML/environment/Constants.py @@ -1,6 +1,6 @@ class Constants: - VERSION = "3.0.0a1" + VERSION = "3.0.0a2" # encoding constants FEATURE_DELIMITER = "-" diff --git a/immuneML/example_weighting/ExampleWeightingParams.py b/immuneML/example_weighting/ExampleWeightingParams.py new file mode 100644 index 000000000..b78c756f5 --- /dev/null +++ b/immuneML/example_weighting/ExampleWeightingParams.py @@ -0,0 +1,9 @@ +from dataclasses import dataclass +from pathlib import Path + + +@dataclass +class ExampleWeightingParams: + result_path: Path + pool_size: int = 4 + learn_model: bool = True diff --git a/immuneML/example_weighting/ExampleWeightingStrategy.py b/immuneML/example_weighting/ExampleWeightingStrategy.py new file mode 100644 index 000000000..d4c3111b4 --- /dev/null +++ b/immuneML/example_weighting/ExampleWeightingStrategy.py @@ -0,0 +1,21 @@ +import abc + +from immuneML.example_weighting.ExampleWeightingParams import ExampleWeightingParams + + +class ExampleWeightingStrategy(metaclass=abc.ABCMeta): + + def __init__(self, name): + self.name = name + + @staticmethod + @abc.abstractmethod + def build_object(dataset, **params): + pass + + @abc.abstractmethod + def compute_weights(self, dataset, params: ExampleWeightingParams): + pass + + def set_context(self, context: dict): + return self \ No newline at end of file diff --git a/immuneML/example_weighting/__init__.py b/immuneML/example_weighting/__init__.py new file 
mode 100644 index 000000000..e69de29bb diff --git a/immuneML/example_weighting/predefined_weighting/PredefinedWeighting.py b/immuneML/example_weighting/predefined_weighting/PredefinedWeighting.py new file mode 100644 index 000000000..c95befe1d --- /dev/null +++ b/immuneML/example_weighting/predefined_weighting/PredefinedWeighting.py @@ -0,0 +1,69 @@ +from pathlib import Path +import pandas as pd + +from immuneML.example_weighting.ExampleWeightingParams import ExampleWeightingParams +from immuneML.example_weighting.ExampleWeightingStrategy import ExampleWeightingStrategy +from immuneML.util.ParameterValidator import ParameterValidator + + +class PredefinedWeighting(ExampleWeightingStrategy): + ''' + + Example weighting strategy where weights are supplied in a file. + + Arguments: + + file_path (Path): Path to the example weights file, which should contain the columns 'identifier' and 'example_weight': + + ========== ============== + identifier example_weight + ========== ============== + 1 0.5 + 2 1 + 3 1 + ========== ============== + + separator (str): Column separator in the input file. + + ''' + + def __init__(self, file_path, separator, name: str = None): + super().__init__(name) + self.file_path = Path(file_path) + self.separator = separator + + @staticmethod + def _prepare_parameters(file_path, separator, name: str = None): + file_path = Path(file_path) + + if not file_path.is_file(): + raise FileNotFoundError(f"{PredefinedWeighting.__name__}: example weights could not be loaded from {file_path}. 
" + f"Check if the path to the file is properly set.") + + ParameterValidator.assert_type_and_value(separator, str, location=PredefinedWeighting.__name__, parameter_name="separator") + + return { + "file_path": file_path, + "separator": separator, + "name": name + } + + @staticmethod + def build_object(dataset=None, **params): + prepared_params = PredefinedWeighting._prepare_parameters(**params) + + return PredefinedWeighting(**prepared_params) + + def compute_weights(self, dataset, params: ExampleWeightingParams): + weights_df = self._read_example_weights_file() + + return self._get_example_weights(dataset, weights_df) + + def _get_example_weights(self, dataset, weights_df): + return [self._get_example_weight_by_identifier(example.identifier, weights_df) for example in dataset.get_data()] + + def _read_example_weights_file(self): + return pd.read_csv(self.file_path, sep=self.separator, usecols=["identifier", "example_weight"]) + + def _get_example_weight_by_identifier(self, identifier, weights_df): + return float(weights_df[weights_df["identifier"] == identifier].example_weight) diff --git a/immuneML/example_weighting/predefined_weighting/__init__.py b/immuneML/example_weighting/predefined_weighting/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/immuneML/hyperparameter_optimization/core/HPAssessment.py b/immuneML/hyperparameter_optimization/core/HPAssessment.py index 5403c8570..5bf3461ea 100644 --- a/immuneML/hyperparameter_optimization/core/HPAssessment.py +++ b/immuneML/hyperparameter_optimization/core/HPAssessment.py @@ -102,7 +102,7 @@ def reeval_on_assessment_split(state, train_val_dataset: Dataset, test_dataset: report_context=state.context, ml_reports=state.assessment.reports.model_reports.values(), number_of_processes=state.number_of_processes, encoding_reports=state.assessment.reports.encoding_reports.values(), - label_config=LabelConfiguration([label])).run(split_index) + label_config=LabelConfiguration([label]), 
example_weighting=state.example_weighting).run(split_index) state.assessment_states[split_index].label_states[label.name].assessment_items[str(hp_setting)] = assessment_item diff --git a/immuneML/hyperparameter_optimization/core/HPSelection.py b/immuneML/hyperparameter_optimization/core/HPSelection.py index 53e7441a2..be2f67866 100644 --- a/immuneML/hyperparameter_optimization/core/HPSelection.py +++ b/immuneML/hyperparameter_optimization/core/HPSelection.py @@ -71,8 +71,8 @@ def run_setting(state: TrainMLModelState, hp_setting, train_dataset, val_dataset hp_item = MLProcess(train_dataset=train_dataset, test_dataset=val_dataset, encoding_reports=state.selection.reports.encoding_reports.values(), label_config=LabelConfiguration([label]), report_context=state.context, number_of_processes=state.number_of_processes, metrics=state.metrics, optimization_metric=state.optimization_metric, - ml_reports=state.selection.reports.model_reports.values(), label=label, path=current_path, hp_setting=hp_setting) \ - .run(split_index) + ml_reports=state.selection.reports.model_reports.values(), label=label, path=current_path, hp_setting=hp_setting, + example_weighting=state.example_weighting).run(split_index) state.assessment_states[assessment_index].label_states[label.name].selection_state.hp_items[hp_setting.get_key()].append(hp_item) diff --git a/immuneML/hyperparameter_optimization/core/HPUtil.py b/immuneML/hyperparameter_optimization/core/HPUtil.py index 44ded2868..41188e48a 100644 --- a/immuneML/hyperparameter_optimization/core/HPUtil.py +++ b/immuneML/hyperparameter_optimization/core/HPUtil.py @@ -7,6 +7,8 @@ from immuneML.environment.Constants import Constants from immuneML.environment.Label import Label from immuneML.environment.LabelConfiguration import LabelConfiguration +from immuneML.example_weighting.ExampleWeightingParams import ExampleWeightingParams +from immuneML.example_weighting.ExampleWeightingStrategy import ExampleWeightingStrategy from 
immuneML.hyperparameter_optimization.HPSetting import HPSetting from immuneML.hyperparameter_optimization.config.SplitConfig import SplitConfig from immuneML.hyperparameter_optimization.states.HPSelectionState import HPSelectionState @@ -14,8 +16,10 @@ from immuneML.reports.ReportResult import ReportResult from immuneML.reports.ReportUtil import ReportUtil from immuneML.util.PathBuilder import PathBuilder +from immuneML.workflows.steps.DataWeighter import DataWeighter from immuneML.workflows.steps.DataEncoder import DataEncoder from immuneML.workflows.steps.DataEncoderParams import DataEncoderParams +from immuneML.workflows.steps.DataWeighterParams import DataWeighterParams from immuneML.workflows.steps.MLMethodAssessment import MLMethodAssessment from immuneML.workflows.steps.MLMethodAssessmentParams import MLMethodAssessmentParams from immuneML.workflows.steps.data_splitter.DataSplitter import DataSplitter @@ -67,6 +71,20 @@ def preprocess_dataset(dataset: Dataset, preproc_sequence: list, path: Path, con else: return dataset + @staticmethod + def weight_examples(dataset, weighting_strategy: ExampleWeightingStrategy, path: Path, learn_model: bool, number_of_processes: int): + weighted_dataset = DataWeighter.run(DataWeighterParams( + dataset=dataset, + weighting_strategy=weighting_strategy, + weighting_params=ExampleWeightingParams( + result_path=path, + pool_size=number_of_processes, + learn_model=learn_model + ), + )) + return weighted_dataset + + @staticmethod def encode_dataset(dataset, hp_setting: HPSetting, path: Path, learn_model: bool, context: dict, number_of_processes: int, label_configuration: LabelConfiguration, encode_labels: bool = True): @@ -81,7 +99,6 @@ def encode_dataset(dataset, hp_setting: HPSetting, path: Path, learn_model: bool pool_size=number_of_processes, label_config=label_configuration, learn_model=learn_model, - filename="train_dataset.pkl" if learn_model else "test_dataset.pkl", encode_labels=encode_labels ), )) diff --git 
a/immuneML/hyperparameter_optimization/states/HPItem.py b/immuneML/hyperparameter_optimization/states/HPItem.py index 1a789c6b9..8f658a0db 100644 --- a/immuneML/hyperparameter_optimization/states/HPItem.py +++ b/immuneML/hyperparameter_optimization/states/HPItem.py @@ -19,6 +19,7 @@ class HPItem: train_predictions_path: Path = None test_predictions_path: Path = None ml_details_path: Path = None + ml_settings_export_path: Path = None train_dataset: Dataset = None test_dataset: Dataset = None split_index: int = None diff --git a/immuneML/hyperparameter_optimization/states/TrainMLModelState.py b/immuneML/hyperparameter_optimization/states/TrainMLModelState.py index 70d3fe3ed..4df14886d 100644 --- a/immuneML/hyperparameter_optimization/states/TrainMLModelState.py +++ b/immuneML/hyperparameter_optimization/states/TrainMLModelState.py @@ -4,6 +4,7 @@ from immuneML.data_model.dataset.Dataset import Dataset from immuneML.environment.LabelConfiguration import LabelConfiguration +from immuneML.example_weighting.ExampleWeightingStrategy import ExampleWeightingStrategy from immuneML.hyperparameter_optimization.HPSetting import HPSetting from immuneML.hyperparameter_optimization.config.SplitConfig import SplitConfig from immuneML.hyperparameter_optimization.states.HPAssessmentState import HPAssessmentState @@ -29,6 +30,8 @@ class TrainMLModelState: reports: dict = field(default_factory=dict) name: str = None refit_optimal_model: bool = None + export_all_ml_settings: bool = None + example_weighting: ExampleWeightingStrategy = None optimal_hp_items: Dict[str, HPItem] = field(default_factory=dict) optimal_hp_item_paths: Dict[str, str] = field(default_factory=dict) assessment_states: List[HPAssessmentState] = field(default_factory=list) diff --git a/immuneML/ml_methods/BinaryFeatureClassifier.py b/immuneML/ml_methods/BinaryFeatureClassifier.py new file mode 100644 index 000000000..1b0d3c6ef --- /dev/null +++ b/immuneML/ml_methods/BinaryFeatureClassifier.py @@ -0,0 +1,386 @@ +import 
sys +import copy +import logging +import warnings +from pathlib import Path + +import numpy as np +import yaml + +from functools import partial +from multiprocessing.pool import Pool + +from immuneML.data_model.encoded_data.EncodedData import EncodedData +from immuneML.environment.Label import Label +from immuneML.ml_methods.classifiers.MLMethod import MLMethod +from immuneML.ml_methods.util.Util import Util +from immuneML.ml_metrics.ClassificationMetric import ClassificationMetric +from immuneML.ml_metrics.MetricUtil import MetricUtil +from immuneML.util.PathBuilder import PathBuilder + + +class BinaryFeatureClassifier(MLMethod): + """ + A simple classifier that takes in encoded data containing features with only 1/0 or True/False values. + + This classifier gives a positive prediction if any of the binary features for an example are 'true'. + Optionally, the classifier can select an optimal subset of these features. In this case, the given data is split + into a training and validation set, a minimal set of features is learned through greedy forward selection, + and the validation set is used to determine when to stop growing the set of features (earlystopping). + Earlystopping is reached when the optimization metric on the validation set no longer improves for a given number of features (patience). + The optimization metric is the same metric as the one used for optimization in the :py:obj:`~immuneML.workflows.instructions.TrainMLModelInstruction`. + + Currently, this classifier can be used in combination with two encoders: + + - The classifier can be used in combination with the :py:obj:`~immuneML.encodings.motif_encoding.MotifEncoder.MotifEncoder`, + such that sequences containing any of the positive class-associated motifs are classified as positive. + A reduced subset of binding-associated motifs can be learned (when keep_all is false). + This results in a set of complementary motifs, minimizing the redundant predictions made by different motifs. 
+ + - Alternatively, this classifier can be combined with the :py:obj:`~immuneML.encodings.motif_encoding.SimilarToPositiveSequenceEncoder.SimilarToPositiveSequenceEncoder` + such that any sequence that falls within a given Hamming distance from any of the positive class sequences in the training set + is classified as positive. Parameter keep_all should be set to true, since this encoder creates only 1 feature. + + + Arguments: + + training_percentage (float): What percentage of data to use for training (the rest will be used for validation); values between 0 and 1 + + keep_all (bool): Whether to keep all the input features (true) or learn a reduced subset (false). By default, keep_all is false. + + random_seed (int): Random seed for splitting the data into training and validation sets when learning a minimal subset of features. This is only used when keep_all is false. + + max_features (int): The maximum number of features to allow in the reduced subset. When this number is reached, no more features are added even if the earlystopping criterion is not reached yet. + This is only used when keep_all is false. By default, max_features is 100. + + patience (int): The patience for earlystopping. When earlystopping is reached, more features are added to the reduced set to test whether the optimization metric on the validation set improves again. By default, patience is 5. + + min_delta (float): The delta value used to test if there was improvement between the previous set of features and the new set of features (+1). By default, min_delta is 0, meaning a new set of features counts as an improvement only if it yields a strictly higher optimization metric score on the validation set than the best previous set. + + + YAML specification: + + .. indent with spaces + .. 
code-block:: yaml + + my_motif_classifier: + BinaryFeatureClassifier: + training_percentage: 0.7 + max_features: 100 + patience: 5 + min_delta: 0 + keep_all: false + + """ + + def __init__(self, training_percentage: float = None, + random_seed: int = None, max_features: int = None, patience: int = None, + min_delta: float = None, keep_all: bool = None, + result_path: Path = None): + super().__init__() + self.training_percentage = training_percentage + self.random_seed = random_seed + self.max_features = max_features + self.patience = patience + self.min_delta = min_delta + self.keep_all = keep_all + + self.train_indices = None + self.val_indices = None + self.feature_names = None + self.rule_tree_indices = None + self.rule_tree_features = None + self.label = None + self.optimization_metric = None + self.class_mapping = None + + self.result_path = result_path + + def predict(self, encoded_data: EncodedData, label: Label): + self._check_features(encoded_data.feature_names) + + return {self.label.name: self._get_rule_tree_predictions_class(encoded_data, self.rule_tree_indices)} + + def predict_proba(self, encoded_data: EncodedData, label: Label): + warnings.warn(f"{BinaryFeatureClassifier.__name__}: cannot predict probabilities.") + return None + + def fit(self, encoded_data: EncodedData, label: Label, optimization_metric: str, cores_for_training: int = 2): + self.feature_names = encoded_data.feature_names + self.label = label + self.class_mapping = Util.make_binary_class_mapping(encoded_data.labels[self.label.name]) + self.optimization_metric = optimization_metric + + self.rule_tree_indices = self._build_rule_tree(encoded_data, cores_for_training) + self.rule_tree_features = self._get_rule_tree_features_from_indices(self.rule_tree_indices, self.feature_names) + self._export_selected_features(self.result_path, self.rule_tree_features) + + logging.info(f"{BinaryFeatureClassifier.__name__}: finished training.") + + def _get_optimization_scoring_fn(self): + return 
MetricUtil.get_metric_fn(ClassificationMetric[self.optimization_metric.upper()]) + + def _build_rule_tree(self, encoded_data, cores_for_training): + if self.keep_all: + rules = list(range(len(self.feature_names))) + logging.info(f"{BinaryFeatureClassifier.__name__}: all {len(rules)} rules kept.") + else: + encoded_train_data, encoded_val_data = self._prepare_and_split_data(encoded_data) + if self.max_features is None: + self.max_features = encoded_train_data.examples.shape[1] + + rules = self._start_recursive_search(encoded_train_data, encoded_val_data, cores_for_training) + + logging.info(f"{BinaryFeatureClassifier.__name__}: selected {len(rules)} out of {len(self.feature_names)} rules.") + + return rules + + def _start_recursive_search(self, encoded_train_data, encoded_val_data, cores_for_training): + old_recursion_limit = sys.getrecursionlimit() + new_recursion_limit = old_recursion_limit + encoded_train_data.examples.shape[1] + sys.setrecursionlimit(new_recursion_limit) + + rules = self._recursively_select_rules(encoded_train_data=encoded_train_data, + encoded_val_data=encoded_val_data, + index_candidates=list(range(encoded_train_data.examples.shape[1])), + prev_rule_indices=[], + prev_train_predictions=np.array([False] * encoded_train_data.examples.shape[0]), + prev_val_predictions=np.array([False] * encoded_val_data.examples.shape[0]), + prev_val_scores=[], + cores_for_training=cores_for_training) + + sys.setrecursionlimit(old_recursion_limit) + + return rules + + def _get_rule_tree_features_from_indices(self, rule_tree_indices, feature_names): + return [feature_names[idx] for idx in rule_tree_indices] + + def _recursively_select_rules(self, encoded_train_data, encoded_val_data, index_candidates, prev_rule_indices, prev_train_predictions, prev_val_predictions, prev_val_scores, cores_for_training): + new_rule_indices, new_train_predictions, new_index_candidates = self._add_next_best_rule(encoded_train_data, prev_rule_indices, prev_train_predictions, 
index_candidates, cores_for_training) + + if new_rule_indices == prev_rule_indices: + logging.info(f"{BinaryFeatureClassifier.__name__}: no improvement on training set") + return self._get_optimal_indices(prev_rule_indices, self._test_is_improvement(prev_val_scores, self.min_delta)) + + logging.info(f"{BinaryFeatureClassifier.__name__}: added rule {len(new_rule_indices)}/{min(self.max_features, encoded_train_data.examples.shape[1])} ({len(new_index_candidates)} candidates left)") + + new_val_predictions = np.logical_or(prev_val_predictions, encoded_val_data.examples[:, new_rule_indices[-1]]) + new_val_scores = prev_val_scores + [self._test_performance_predictions(encoded_val_data, pred=new_val_predictions)] + is_improvement = self._test_is_improvement(new_val_scores, self.min_delta) + + if len(new_rule_indices) >= self.max_features: + logging.info(f"{BinaryFeatureClassifier.__name__}: max features reached") + return self._get_optimal_indices(new_rule_indices, is_improvement) + + if self._test_earlystopping(is_improvement): + logging.info(f"{BinaryFeatureClassifier.__name__}: earlystopping criterion reached") + return self._get_optimal_indices(new_rule_indices, is_improvement) + + return self._recursively_select_rules(encoded_train_data, encoded_val_data, + prev_rule_indices=new_rule_indices, + index_candidates=new_index_candidates, + prev_train_predictions=new_train_predictions, + prev_val_predictions=new_val_predictions, + prev_val_scores=new_val_scores, + cores_for_training=cores_for_training) + + def _test_earlystopping(self, is_improvement): + # patience has not reached yet, continue training + if len(is_improvement) < self.patience: + return False + + # last few trees did not improve, stop training + if not any(is_improvement[-self.patience:]): + return True + + return False + + def _test_is_improvement(self, scores, min_delta): + if len(scores) == 0: + return [] + + best = scores[0] + is_improvement = [True] + + for score in scores[1:]: + if score > best + 
min_delta: + best = score + is_improvement.append(True) + else: + is_improvement.append(False) + + return is_improvement + + def _get_optimal_indices(self, rule_indices, is_improvement): + if len(rule_indices) == 0: + return [] + + optimal_tree_idx = max([i if is_improvement[i] else -1 for i in range(len(is_improvement))]) + + return rule_indices[:optimal_tree_idx + 1] + + def _add_next_best_rule(self, encoded_train_data, prev_rule_indices, prev_predictions, index_candidates, cores_for_training): + if len(index_candidates) == 0: + return prev_rule_indices, prev_predictions, [] + + prev_train_performance = self._test_performance_predictions(encoded_train_data, pred=prev_predictions) + new_training_performances = self._test_new_train_performances(encoded_train_data, prev_predictions, + index_candidates, prev_train_performance, cores_for_training) + + best_new_performance = max(new_training_performances) + best_new_index = index_candidates[new_training_performances.index(best_new_performance)] + new_index_candidates = [index_candidates[i] for i, new_performance in enumerate(new_training_performances) if new_performance > prev_train_performance] + + if best_new_performance > prev_train_performance: + new_rule_indices = prev_rule_indices + [best_new_index] + new_predictions = np.logical_or(prev_predictions, encoded_train_data.examples[:, best_new_index]) + + return new_rule_indices, new_predictions, new_index_candidates + else: + return prev_rule_indices, prev_predictions, new_index_candidates + + def _test_new_train_performances(self, encoded_train_data, prev_predictions, index_candidates, + prev_train_performance, cores_for_training): + y_true_train = Util.map_to_new_class_values(encoded_train_data.labels[self.label.name], self.class_mapping) + + example_weights = encoded_train_data.example_weights + + with Pool(cores_for_training) as pool: + partial_func = partial(self._apply_optimization_fn_to_new_rule_combo, y_true_train=y_true_train, + 
example_weights=example_weights, prev_predictions=prev_predictions, + prev_train_performance=prev_train_performance) + scores = pool.map(partial_func, encoded_train_data.examples[:, index_candidates].T) + return scores + + def _apply_optimization_fn_to_new_rule_combo(self, new_rule_predictions, + y_true_train, example_weights, + prev_predictions, prev_train_performance): + new_predictions = np.logical_or(prev_predictions, new_rule_predictions) + + if np.array_equal(new_predictions, prev_predictions): + return prev_train_performance + else: + optimization_scoring_fn = self._get_optimization_scoring_fn() + return optimization_scoring_fn(y_true=y_true_train, + y_pred=new_predictions, + sample_weight=example_weights) + + + # def _apply_optimization_fn_to_new_rule_combo(self, optimization_scoring_fn, examples, y_true_train, + # example_weights, prev_predictions, new_rule_idx): + # return optimization_scoring_fn(y_true=y_true_train, + # y_pred=np.logical_or(prev_predictions, examples[:, new_rule_idx]), + # sample_weight=example_weights) + # + + def _get_unused_rule_indices(self, encoded_train_data, rule_indices): + return [idx for idx in range(encoded_train_data.examples.shape[1]) if idx not in rule_indices] + + def _test_performance_predictions(self, encoded_data, pred): + y_true = Util.map_to_new_class_values(encoded_data.labels[self.label.name], self.class_mapping) + optimization_scoring_fn = self._get_optimization_scoring_fn() + + return optimization_scoring_fn(y_true=y_true, y_pred=pred, sample_weight=encoded_data.example_weights) + + + def _get_new_performances(self, encoded_data, prev_predictions, new_indices_to_test): + return [self._test_performance_predictions(encoded_data=encoded_data, + pred=np.logical_or(prev_predictions, encoded_data.examples[:, idx])) + for idx in new_indices_to_test] + + def _test_performance_rule_tree(self, encoded_data, rule_indices): + pred = self._get_rule_tree_predictions_bool(encoded_data, rule_indices) + return 
self._test_performance_predictions(encoded_data, pred=pred) + + def _get_rule_tree_predictions_bool(self, encoded_data, rule_indices): + return np.logical_or.reduce([encoded_data.examples[:, i] for i in rule_indices]) + + def _get_rule_tree_predictions_class(self, encoded_data, rule_indices): + y = self._get_rule_tree_predictions_bool(encoded_data, rule_indices).astype(int) + return Util.map_to_old_class_values(y, self.class_mapping) + + def _check_features(self, encoded_data_features): + if self.feature_names != encoded_data_features: + mssg = f"{BinaryFeatureClassifier.__name__}: features during evaluation did not match the features set during fitting." + + logging.info(mssg + f"\n\nEvaluation features: {encoded_data_features}\nFitting features: {self.feature_names}") + raise ValueError(mssg + " See the log file for more info.") + + def _export_selected_features(self, path, rule_tree_features): + if path is not None: + PathBuilder.build(path) + with open(path / "selected_features.txt", "w") as file: + file.writelines([f"{feature}\n" for feature in rule_tree_features]) + + def fit_by_cross_validation(self, encoded_data: EncodedData, label: Label = None, optimization_metric: str = None, + number_of_splits: int = 5, cores_for_training: int = -1): + logging.warning(f"{BinaryFeatureClassifier.__name__}: cross_validation is not implemented for this method. 
Using standard fitting instead...") + self.fit(encoded_data=encoded_data, label=label) + + def _prepare_and_split_data(self, encoded_data: EncodedData): + train_indices, val_indices = Util.get_train_val_indices(len(encoded_data.example_ids), self.training_percentage, random_seed=self.random_seed) + + self.train_indices = train_indices + self.val_indices = val_indices + + train_data = Util.subset_encoded_data(encoded_data, train_indices) + val_data = Util.subset_encoded_data(encoded_data, val_indices) + + return train_data, val_data + + def store(self, path: Path, feature_names=None, details_path: Path = None): + PathBuilder.build(path) + + custom_vars = copy.deepcopy(vars(self)) + del custom_vars["result_path"] + + if self.label: + custom_vars["label"] = {key.lstrip("_"): value for key, value in vars(self.label).items()} + + params_path = path / "custom_params.yaml" + with params_path.open('w') as file: + yaml.dump(custom_vars, file) + + def load(self, path): + params_path = path / "custom_params.yaml" + with params_path.open("r") as file: + custom_params = yaml.load(file, Loader=yaml.SafeLoader) + + for param, value in custom_params.items(): + if hasattr(self, param): + if param == "label": + setattr(self, "label", Label(**value)) + else: + setattr(self, param, value) + + def check_if_exists(self, path): + return self.rule_tree_indices is not None + + def get_params(self): + params = copy.deepcopy(vars(self)) + return params + + def get_label_name(self): + return self.label.name + + def get_package_info(self) -> str: + return Util.get_immuneML_version() + + def get_feature_names(self) -> list: + return self.feature_names + + def can_predict_proba(self) -> bool: + return False + + def get_class_mapping(self) -> dict: + return self.class_mapping + + def get_compatible_encoders(self): + from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder + from immuneML.encodings.motif_encoding.SimilarToPositiveSequenceEncoder import 
SimilarToPositiveSequenceEncoder + return [MotifEncoder, SimilarToPositiveSequenceEncoder] + + + + diff --git a/immuneML/ml_methods/KerasSequenceCNN.py b/immuneML/ml_methods/KerasSequenceCNN.py new file mode 100644 index 000000000..74f4ed3ae --- /dev/null +++ b/immuneML/ml_methods/KerasSequenceCNN.py @@ -0,0 +1,274 @@ +import copy +import logging +from pathlib import Path +import yaml + +from immuneML.data_model.encoded_data.EncodedData import EncodedData +from immuneML.encodings.onehot.OneHotSequenceEncoder import OneHotSequenceEncoder +from immuneML.environment.Label import Label +from immuneML.ml_methods.classifiers.MLMethod import MLMethod +from immuneML.ml_methods.util.Util import Util +from immuneML.util.PathBuilder import PathBuilder + + +class KerasSequenceCNN(MLMethod): + """ + A CNN-based classifier for sequence datasets. Should be used in combination with :py:obj:`immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder`. + This classifier integrates the CNN proposed by Mason et al.; the original code can be found at: https://github.com/dahjan/DMS_opt/blob/master/scripts/CNN.py + + Note: make sure keras and tensorflow dependencies are installed (see installation instructions). + + Reference: + Derek M. Mason, Simon Friedensohn, Cédric R. Weber, Christian Jordi, Bastian Wagner, Simon M. Meng, Roy A. Ehling, + Lucia Bonati, Jan Dahinden, Pablo Gainza, Bruno E. Correia and Sai T. Reddy + ‘Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning’. + Nat Biomed Eng 5, 600–612 (2021). https://doi.org/10.1038/s41551-021-00699-9 + + Arguments: + + units_per_layer (list): A nested list specifying the layers of the CNN. The first element in each nested list defines the layer type, other elements define the layer parameters. + Valid layer types are: CONV (keras.layers.Conv1D), DROP (keras.layers.Dropout), POOL (keras.layers.MaxPool1D), FLAT (keras.layers.Flatten), DENSE (keras.layers.Dense). 
+ The parameters per layer type are as follows: + + - [CONV, , , ] + + - [DROP, ] + + - [POOL, , ] + + - [FLAT] + + - [DENSE, ] + + activation (str): The activation function to use in the convolutional or dense layers. Activation functions can be chosen from keras.activations, for example relu or softmax. By default, relu is used. + + training_percentage (float): The fraction of sequences that will be randomly assigned to form the training set (the rest will be the validation set). Should be a value between 0 and 1. By default, training_percentage is 0.7. + + + YAML specification: + + .. indent with spaces + .. code-block:: yaml + + my_cnn: + KerasSequenceCNN: + training_percentage: 0.7 + units_per_layer: [[CONV, 400, 3, 1], [DROP, 0.5], [POOL, 2, 1], [FLAT], [DENSE, 50]] + activation: relu + + + + """ + + def __init__(self, units_per_layer: list = None, activation: str = None, training_percentage: float = None): + + super().__init__() + + self.units_per_layer = units_per_layer # todo refactor this to something more sensible + self.activation = activation + self.training_percentage = training_percentage + + self.background_probabilities = None + self.model = None + self.label = None + self.class_mapping = None + self.result_path = None + self.feature_names = None + + def predict(self, encoded_data: EncodedData, label: Label): + predictions_proba = self.predict_proba(encoded_data, label)[label.name][label.positive_class] + + return {label.name: [self.class_mapping[val] for val in (predictions_proba > 0.5).tolist()]} + + def predict_proba(self, encoded_data: EncodedData, label: Label): + predictions = self.model.predict(x=encoded_data.examples).flatten() + + return {label.name: {label.positive_class: predictions, + label.get_binary_negative_class(): 1 - predictions}} + + def _create_cnn(self, units_per_layer, input_shape, + activation): + """ + Based on: https://github.com/dahjan/DMS_opt/blob/master/scripts/CNN.py + + Generate the CNN layers with a Keras wrapper. 
+ + Parameters + --- + units_per_layer: architecture features in list format, i.e.: + Filter information: [CONV, # filters, kernel size, stride] + Max Pool information: [POOL, pool size, stride] + Dropout information: [DROP, dropout rate] + Flatten: [FLAT] + Dense layer: [DENSE, number nodes] + + input_shape: a tuple defining the input shape of the data + + activation: Activation function to use , i.e. ReLU, softmax + + # note: 'regularizer' option was removed, original authors used kernel_regularizer and bias_regularizer = None + """ + import keras + + # Initialize the CNN + model = keras.Sequential() + + # Input layer + model.add(keras.layers.InputLayer(input_shape)) + + # Build network + for i, units in enumerate(units_per_layer): + if units[0] == 'CONV': + model.add(keras.layers.Conv1D(filters=units[1], + kernel_size=units[2], + strides=units[3], + activation=activation, + kernel_regularizer=None, + bias_regularizer=None, + padding='same')) + elif units[0] == 'POOL': + model.add(keras.layers.MaxPool1D(pool_size=units[1], + strides=units[2])) + elif units[0] == 'DENSE': + model.add(keras.layers.Dense(units=units[1], + activation=activation, + kernel_regularizer=None, + bias_regularizer=None)) + elif units[0] == 'DROP': + model.add(keras.layers.Dropout(rate=units[1])) + elif units[0] == 'FLAT': + model.add(keras.layers.Flatten()) + else: + raise NotImplementedError('Layer type not implemented') + + # Output layer + # Activation function: Sigmoid + model.add(keras.layers.Dense(1, activation='sigmoid')) + + return model + + def fit(self, encoded_data: EncodedData, label: Label, optimization_metric=None, cores_for_training: int = 2): + self.feature_names = encoded_data.feature_names + self.label = label + self.class_mapping = Util.make_binary_class_mapping(encoded_data.labels[self.label.name]) + + encoded_train_data, encoded_val_data = self._prepare_and_split_data(encoded_data) + + self.model = self._create_cnn(units_per_layer=self.units_per_layer, # todo better 
input format... + input_shape=encoded_data.examples.shape[1:], + activation=self.activation) + + self._fit(encoded_train_data=encoded_train_data, encoded_val_data=encoded_val_data) + + + def _prepare_and_split_data(self, encoded_data: EncodedData): + train_indices, val_indices = Util.get_train_val_indices(len(encoded_data.example_ids), self.training_percentage) + + train_data = Util.subset_encoded_data(encoded_data, train_indices) + val_data = Util.subset_encoded_data(encoded_data, val_indices) + + return train_data, val_data + + def _fit(self, encoded_train_data, encoded_val_data): + """reference to original code, maybe the input should just be the encoded data instead? #todo""" + from keras.optimizers import Adam + + X_train = encoded_train_data.examples + X_val = encoded_val_data.examples + y_train = Util.map_to_new_class_values(encoded_train_data.labels[self.label.name], self.class_mapping) + y_val = Util.map_to_new_class_values(encoded_val_data.labels[self.label.name], self.class_mapping) + w_train = encoded_train_data.example_weights + w_val = encoded_val_data.example_weights + + # Compiling the CNN + opt = Adam(learning_rate=0.000075) + self.model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy']) + + # Fit the CNN to the training set + _ = self.model.fit( + x=X_train, y=y_train, sample_weight=w_train, shuffle=True, + validation_data=(X_val, y_val, w_val) if w_val is not None else (X_val, y_val), + epochs=20, batch_size=16, verbose=0 + ) + + + def fit_by_cross_validation(self, encoded_data: EncodedData, label: Label = None, optimization_metric: str = None, + number_of_splits: int = 5, cores_for_training: int = -1): + logging.warning(f"{KerasSequenceCNN.__name__}: cross_validation is not implemented for this method. 
Using standard fitting instead...") + self.fit(encoded_data=encoded_data, label=label) + + def store(self, path: Path, feature_names=None, details_path: Path = None): + PathBuilder.build(path) + + self.model.save(path / "model") + + custom_vars = copy.deepcopy(vars(self)) + del custom_vars["model"] + del custom_vars["result_path"] + + if self.label: + custom_vars["label"] = self.label.get_desc_for_storage() + + params_path = path / "custom_params.yaml" + with params_path.open('w') as file: + yaml.dump(custom_vars, file) + + def load(self, path): + import keras + + params_path = path / "custom_params.yaml" + + with params_path.open("r") as file: + custom_params = yaml.load(file, Loader=yaml.SafeLoader) + + for param, value in custom_params.items(): + if hasattr(self, param): + if param == "label": + setattr(self, "label", Label(**value)) + else: + setattr(self, param, value) + + self.model = keras.models.load_model(path / "model") + + def check_if_exists(self, path): + return self.model is not None + + def get_params(self): + params = copy.deepcopy(vars(self)) + params["model"] = self.model.get_config() + return params + + def get_label_name(self): + return self.label.name + + def get_package_info(self) -> str: + return Util.get_immuneML_version() + + def get_feature_names(self) -> list: + return self.feature_names + + def can_predict_proba(self) -> bool: + return True + + def get_class_mapping(self) -> dict: + return self.class_mapping + + def get_compatible_encoders(self): + from immuneML.encodings.onehot.OneHotEncoder import OneHotEncoder + return [OneHotEncoder] + + def check_encoder_compatibility(self, encoder): + """Checks whether the given encoder is compatible with this ML method, and throws an error if it is not.""" + from immuneML.encodings.onehot.OneHotEncoder import OneHotEncoder + + if not issubclass(encoder.__class__, OneHotEncoder): + raise ValueError( + f"{encoder.__class__.__name__} is not compatible with ML Method 
{self.__class__.__name__}. " + f"Please use one of the following encoders instead: {', '.join([enc_class.__name__ for enc_class in self.get_compatible_encoders()])}") + else: + if not isinstance(encoder, OneHotSequenceEncoder): + raise ValueError( + f"{self.__class__.__name__} is only compatible with SequenceDatasets.") + + assert encoder.flatten == False, f"{self.__class__.__name__} is only compatible with OneHotEncoder when setting OneHotEncoder.flatten to False" + assert encoder.use_positional_info == False, f"{self.__class__.__name__} is only compatible with OneHotEncoder when setting OneHotEncoder.use_positional_info to False" diff --git a/immuneML/ml_methods/classifiers/AtchleyKmerMILClassifier.py b/immuneML/ml_methods/classifiers/AtchleyKmerMILClassifier.py index 1fc3f2835..4a0fbd78d 100644 --- a/immuneML/ml_methods/classifiers/AtchleyKmerMILClassifier.py +++ b/immuneML/ml_methods/classifiers/AtchleyKmerMILClassifier.py @@ -1,6 +1,7 @@ import copy import logging import random +import warnings from pathlib import Path import numpy as np @@ -94,7 +95,10 @@ def __init__(self, iteration_count: int = None, threshold: float = None, evaluat def _make_log_reg(self): return PyTorchLogisticRegression(in_features=self.input_size, zero_abundance_weight_init=self.zero_abundance_weight_init) - def fit(self, encoded_data: EncodedData, label: Label, cores_for_training: int = 2): + def fit(self, encoded_data: EncodedData, label: Label, optimization_metric=None, cores_for_training: int = 2): + if encoded_data.example_weights is not None: + warnings.warn(f"{self.__class__.__name__}: cannot fit this classifier with example weights, fitting without example weights instead... 
Example weights will still be applied when computing evaluation metrics after fitting.") + self.feature_names = encoded_data.feature_names self.label = label @@ -169,11 +173,11 @@ def predict(self, encoded_data: EncodedData, label: Label): predictions_proba = self.predict_proba(encoded_data, label) return {label.name: [self.class_mapping[val] for val in (predictions_proba[label.name][label.positive_class] > 0.5).tolist()]} - def fit_by_cross_validation(self, encoded_data: EncodedData, number_of_splits: int = 5, label: Label = None, cores_for_training: int = -1, - optimization_metric=None): + def fit_by_cross_validation(self, encoded_data: EncodedData, label: Label = None, optimization_metric: str = None, + number_of_splits: int = 5, cores_for_training: int = -1): logging.warning(f"AtchleyKmerMILClassifier: fitting by cross validation is not implemented internally for the model, fitting without " f"cross-validation instead.") - self.fit(encoded_data, label) + self.fit(encoded_data=encoded_data, label=label) def store(self, path: Path, feature_names=None, details_path: Path = None): PathBuilder.build(path) diff --git a/immuneML/ml_methods/classifiers/DeepRC.py b/immuneML/ml_methods/classifiers/DeepRC.py index 929396d69..7488f2b5e 100644 --- a/immuneML/ml_methods/classifiers/DeepRC.py +++ b/immuneML/ml_methods/classifiers/DeepRC.py @@ -249,7 +249,10 @@ def _prepare_caching_params(self, encoded_data: EncodedData, type: str, label_na ("evaluate_at", self.evaluate_at), ("pytorch_device_name", self.pytorch_device_name)) - def fit(self, encoded_data: EncodedData, label: Label, cores_for_training: int = 2): + def fit(self, encoded_data: EncodedData, label: Label, optimization_metric=None, cores_for_training: int = 2): + if encoded_data.example_weights is not None: + warnings.warn(f"{self.__class__.__name__}: cannot fit this classifier with example weights, fitting without example weights instead... 
Example weights will still be applied when computing evaluation metrics after fitting.") + self.feature_names = encoded_data.feature_names self.label = label self.model = CacheHandler.memo_by_params(self._prepare_caching_params(encoded_data, "fit", label.name), @@ -349,9 +352,8 @@ def _fit_for_label(self, metadata_file, hdf5_filepath: Path, train_indices, val_ show_progress=False, device=self.pytorch_device, evaluate_at=self.evaluate_at, task_definition=task_definition, early_stopping_target_id=label.name) - def fit_by_cross_validation(self, encoded_data: EncodedData, number_of_splits: int = 5, label: Label = None, - cores_for_training: int = -1, - optimization_metric=None): + def fit_by_cross_validation(self, encoded_data: EncodedData, label: Label = None, optimization_metric: str = None, + number_of_splits: int = 5, cores_for_training: int = -1): warnings.warn("DeepRC: cross-validation on this classifier is not defined: fitting one model instead...") self.fit(encoded_data, label) diff --git a/immuneML/ml_methods/classifiers/MLMethod.py b/immuneML/ml_methods/classifiers/MLMethod.py index 536986a4e..02edcad25 100644 --- a/immuneML/ml_methods/classifiers/MLMethod.py +++ b/immuneML/ml_methods/classifiers/MLMethod.py @@ -30,7 +30,7 @@ def __init__(self): self.label = None @abc.abstractmethod - def fit(self, encoded_data: EncodedData, label: Label, cores_for_training: int = 2): + def fit(self, encoded_data: EncodedData, label: Label, optimization_metric: str, cores_for_training: int = 2): """ The fit function fits the parameters of the machine learning model. @@ -45,9 +45,14 @@ def fit(self, encoded_data: EncodedData, label: Label, cores_for_training: int = label (Label): the label for which the classifier will be created. immuneML also supports multi-label classification, but it is handled outside MLMethod class by creating an MLMethod instance for each label. This means that each MLMethod should handle only one label. 
+ optimization_metric (str): the name of the optimization metric to be used to select the best model during cross-validation; when used with + TrainMLModel instruction which is almost exclusively the case when the immuneML is run from the specification, this maps to the + optimization metric in the instruction. + cores_for_training (int): if parallelization is available in the MLMethod (and the availability depends on the specific classifier), this is the number of processes that will be creating when fitting the model to speed up the computation. + Returns: it doesn't return anything, but fits the model parameters instead @@ -80,8 +85,7 @@ def predict(self, encoded_data: EncodedData, label: Label): pass @abc.abstractmethod - def fit_by_cross_validation(self, encoded_data: EncodedData, number_of_splits: int = 5, label: Label = None, cores_for_training: int = -1, - optimization_metric=None): + def fit_by_cross_validation(self, encoded_data: EncodedData, label: Label, optimization_metric: str, number_of_splits: int = 5, cores_for_training: int = -1): """ The fit_by_cross_validation function should implement finding the best model hyperparameters through cross-validation. In immuneML, preprocessing, encoding and ML hyperparameters can be optimized by using nested cross-validation (see TrainMLModelInstruction for more @@ -98,19 +102,20 @@ def fit_by_cross_validation(self, encoded_data: EncodedData, number_of_splits: i which make multidimensional outputs that do not follow this pattern, but they are tailored to specific ML methods which require such input (for instance, one hot encoding and ReceptorCNN method). 
- number_of_splits (int): number of splits for the cross-validation to be performed for selection the best hyperparameters of the ML model; - note that if this is used in combination with nested cross-validation in TrainMLModel instruction, it can result in very few examples in - each split depending on the orginal dataset size and the nested cross-validation setup. - label (Label): the label for which the classifier will be created. immuneML also supports multi-label classification, but it is handled outside MLMethod class by creating an MLMethod instance for each label. This means that each MLMethod should handle only one label. - cores_for_training (int): number of processes to be used during the cross-validation for model selection - optimization_metric (str): the name of the optimization metric to be used to select the best model during cross-validation; when used with TrainMLModel instruction which is almost exclusively the case when the immuneML is run from the specification, this maps to the optimization metric in the instruction. + number_of_splits (int): number of splits for the cross-validation to be performed for selecting the best hyperparameters of the ML model; + note that if this is used in combination with nested cross-validation in TrainMLModel instruction, it can result in very few examples in + each split depending on the original dataset size and the nested cross-validation setup. 
+ + cores_for_training (int): number of processes to be used during the cross-validation for model selection + + Returns: it doesn't return anything, but fits the model parameters instead diff --git a/immuneML/ml_methods/classifiers/PrecomputedKNN.py b/immuneML/ml_methods/classifiers/PrecomputedKNN.py index 669c0895f..012609c9c 100644 --- a/immuneML/ml_methods/classifiers/PrecomputedKNN.py +++ b/immuneML/ml_methods/classifiers/PrecomputedKNN.py @@ -57,7 +57,6 @@ def get_params(self): def can_predict_proba(self) -> bool: return True - def get_compatible_encoders(self): from immuneML.encodings.distance_encoding.CompAIRRDistanceEncoder import CompAIRRDistanceEncoder from immuneML.encodings.distance_encoding.DistanceEncoder import DistanceEncoder diff --git a/immuneML/ml_methods/classifiers/ProbabilisticBinaryClassifier.py b/immuneML/ml_methods/classifiers/ProbabilisticBinaryClassifier.py index 0efb76773..d7da210a5 100644 --- a/immuneML/ml_methods/classifiers/ProbabilisticBinaryClassifier.py +++ b/immuneML/ml_methods/classifiers/ProbabilisticBinaryClassifier.py @@ -67,7 +67,10 @@ def __init__(self, max_iterations: int = None, update_rate: float = None, likeli self.label = None self.feature_names = None - def fit(self, encoded_data: EncodedData, label: Label, cores_for_training: int = 2): + def fit(self, encoded_data: EncodedData, label: Label, optimization_metric=None, cores_for_training: int = 2): + if encoded_data.example_weights is not None: + warnings.warn(f"{self.__class__.__name__}: cannot fit this classifier with example weights, fitting without example weights instead... Example weights will still be applied when computing evaluation metrics after fitting.") + self.feature_names = encoded_data.feature_names X = encoded_data.examples assert X.shape[1] == 2, "ProbabilisticBinaryClassifier: the shape of the input is not compatible with the classifier. 
" \ @@ -84,10 +87,10 @@ def fit(self, encoded_data: EncodedData, label: Label, cores_for_training: int = self.alpha_1, self.beta_1 = self._find_beta_distribution_parameters( X[np.nonzero(np.array(encoded_data.labels[self.label.name]) == self.class_mapping[1])], self.N_1) - def fit_by_cross_validation(self, encoded_data: EncodedData, number_of_splits: int = 5, label: Label = None, cores_for_training: int = -1, - optimization_metric=None): + def fit_by_cross_validation(self, encoded_data: EncodedData, label: Label = None, optimization_metric: str = None, + number_of_splits: int = 5, cores_for_training: int = -1): warnings.warn("ProbabilisticBinaryClassifier: cross-validation on this classifier is not defined: fitting one model instead...") - self.fit(encoded_data, label) + self.fit(encoded_data=encoded_data, label=label) def predict(self, encoded_data: EncodedData, label: Label): """ diff --git a/immuneML/ml_methods/classifiers/ReceptorCNN.py b/immuneML/ml_methods/classifiers/ReceptorCNN.py index 406d7014b..0f19d593e 100644 --- a/immuneML/ml_methods/classifiers/ReceptorCNN.py +++ b/immuneML/ml_methods/classifiers/ReceptorCNN.py @@ -1,7 +1,7 @@ import copy import logging import math -import random +import warnings from pathlib import Path import numpy as np @@ -122,7 +122,7 @@ def __init__(self, kernel_count: int = None, kernel_size=None, positional_channe self.feature_names = None def predict(self, encoded_data: EncodedData, label: Label): - predictions_proba = self.predict_proba(encoded_data, label)[label.name][self.label.positive_class] + predictions_proba = self.predict_proba(encoded_data, label)[label.name][label.positive_class] return {label.name: [self.class_mapping[val] for val in (predictions_proba > 0.5).tolist()]} def set_background_probabilities(self): @@ -147,7 +147,9 @@ def predict_proba(self, encoded_data: EncodedData, label: Label): return {self.label.name: {self.label.positive_class: np.array(predictions), self.label.get_binary_negative_class(): 1 - 
np.array(predictions)}} - def fit(self, encoded_data: EncodedData, label: Label, cores_for_training: int = 2): + def fit(self, encoded_data: EncodedData, label: Label, optimization_metric=None, cores_for_training: int = 2): + if encoded_data.example_weights is not None: + warnings.warn(f"{self.__class__.__name__}: cannot fit this classifier with example weights, fitting without example weights instead... Example weights will still be applied when computing evaluation metrics after fitting.") self.feature_names = encoded_data.feature_names @@ -203,8 +205,8 @@ def fit(self, encoded_data: EncodedData, label: Label, cores_for_training: int = logging.info("ReceptorCNN: finished training.") - def fit_by_cross_validation(self, encoded_data: EncodedData, number_of_splits: int = 5, label: Label = None, cores_for_training: int = -1, - optimization_metric=None): + def fit_by_cross_validation(self, encoded_data: EncodedData, label: Label = None, optimization_metric: str = None, + number_of_splits: int = 5, cores_for_training: int = -1): logging.warning(f"{ReceptorCNN.__name__}: cross_validation is not implemented for this method. 
Using standard fitting instead...") self.fit(encoded_data=encoded_data, label=label) @@ -216,12 +218,7 @@ def _get_data_batch(self, encoded_data: EncodedData, label_name: str): encoded_data.example_ids[start_index: end_index] def _prepare_and_split_data(self, encoded_data: EncodedData): - indices = list(range(len(encoded_data.example_ids))) - random.shuffle(indices) - - limit = int(len(encoded_data.example_ids) * self.training_percentage) - train_indices = indices[:limit] - val_indices = indices[limit:] + train_indices, val_indices = Util.get_train_val_indices(len(encoded_data.example_ids), self.training_percentage) train_data = self._make_encoded_data(encoded_data, train_indices) val_data = self._make_encoded_data(encoded_data, val_indices) diff --git a/immuneML/ml_methods/classifiers/SklearnMethod.py b/immuneML/ml_methods/classifiers/SklearnMethod.py index 0acb45c2f..43c34b1e6 100644 --- a/immuneML/ml_methods/classifiers/SklearnMethod.py +++ b/immuneML/ml_methods/classifiers/SklearnMethod.py @@ -1,6 +1,7 @@ import abc import os import warnings +import inspect from pathlib import Path import dill @@ -87,7 +88,7 @@ def __init__(self, parameter_grid: dict = None, parameters: dict = None): self.class_mapping = None self.label = None - def fit(self, encoded_data: EncodedData, label: Label, cores_for_training: int = 2): + def fit(self, encoded_data: EncodedData, label: Label, optimization_metric=None, cores_for_training: int = 2): self.label = label self.class_mapping = Util.make_class_mapping(encoded_data.labels[self.label.name], self.label.positive_class) @@ -95,29 +96,39 @@ def fit(self, encoded_data: EncodedData, label: Label, cores_for_training: int = mapped_y = Util.map_to_new_class_values(encoded_data.labels[self.label.name], self.class_mapping) - self.model = self._fit(encoded_data.examples, mapped_y, cores_for_training) + self.model = self._fit(encoded_data.examples, mapped_y, encoded_data.example_weights, cores_for_training) def predict(self, encoded_data: 
EncodedData, label: Label): self.check_is_fitted(label.name) - predictions = self.model.predict(encoded_data.examples) + + predictions = self.apply_with_weights(self.model.predict, + encoded_data.example_weights, + X=encoded_data.examples) + return {label.name: Util.map_to_old_class_values(np.array(predictions), self.class_mapping)} def predict_proba(self, encoded_data: EncodedData, label: Label): if self.can_predict_proba(): - probabilities = self.model.predict_proba(encoded_data.examples) + probabilities = self.apply_with_weights(self.model.predict_proba, encoded_data.example_weights, X=encoded_data.examples) class_names = Util.map_to_old_class_values(self.model.classes_, self.class_mapping) return {label.name: {class_name: probabilities[:, i] for i, class_name in enumerate(class_names)}} else: + warnings.warn(f"{self.__class__.__name__}: cannot predict probabilities.") return None - def _fit(self, X, y, cores_for_training: int = 1): + def _fit(self, X, y, w=None, cores_for_training: int = 1): + self.model = self._get_ml_model(cores_for_training, X) + + if w is not None and not self.supports_example_weight(self.model.fit) and not self.supports_example_weight(self.model.predict): + warnings.warn(f"{self.__class__.__name__}: cannot fit this classifier with example weights, fitting without example weights instead... 
Example weights will still be applied when computing evaluation metrics after fitting.") + if not self.show_warnings: warnings.simplefilter("ignore") os.environ["PYTHONWARNINGS"] = "ignore" self.model = self._get_ml_model(cores_for_training, X) - self.model.fit(X, y) + self.apply_with_weights(self.model.fit, w, X=X, y=y) if not self.show_warnings: del os.environ["PYTHONWARNINGS"] @@ -125,6 +136,21 @@ def _fit(self, X, y, cores_for_training: int = 1): return self.model + def apply_with_weights(self, method, weights, **kwargs): + ''' + Can be used to run self.model.fit, self.model.predict or self.model.predict_proba with sample weights if supported + + :param method: self.model.fit, self.model.predict or self.model.predict_proba + :return: the result of the supplied method + ''' + if weights is not None and self.supports_example_weight(method): + return method(**kwargs, sample_weight=weights) + else: + return method(**kwargs) + + def supports_example_weight(self, method): + return "sample_weight" in inspect.signature(method).parameters + def can_predict_proba(self) -> bool: return False @@ -132,19 +158,20 @@ def check_is_fitted(self, label_name: str): if self.label.name == label_name or label_name is None: return check_is_fitted(self.model, ["estimators_", "coef_", "estimator", "_fit_X", "dual_coef_"], all_or_any=any) - def fit_by_cross_validation(self, encoded_data: EncodedData, number_of_splits: int = 5, label: Label = None, cores_for_training: int = -1, - optimization_metric='balanced_accuracy'): + def fit_by_cross_validation(self, encoded_data: EncodedData, label: Label = None, optimization_metric="balanced_accuracy", + number_of_splits: int = 5, cores_for_training: int = -1): self.class_mapping = Util.make_class_mapping(encoded_data.labels[label.name], label.positive_class) self.feature_names = encoded_data.feature_names self.label = label mapped_y = Util.map_to_new_class_values(encoded_data.labels[self.label.name], self.class_mapping) - self.model = 
self._fit_by_cross_validation(encoded_data.examples, mapped_y, number_of_splits, label, cores_for_training, - optimization_metric) + self.model = self._fit_by_cross_validation(X=encoded_data.examples, y=mapped_y, w=encoded_data.example_weights, + label=label, optimization_metric=optimization_metric, + number_of_splits=number_of_splits, cores_for_training=cores_for_training) - def _fit_by_cross_validation(self, X, y, number_of_splits: int = 5, label: Label = None, cores_for_training: int = 1, - optimization_metric: str = "balanced_accuracy"): + def _fit_by_cross_validation(self, X, y, w, label: Label = None, optimization_metric: str = "balanced_accuracy", + number_of_splits: int = 5, cores_for_training: int = 1): model = self._get_ml_model() scoring = ClassificationMetric.get_sklearn_score_name(ClassificationMetric.get_metric(optimization_metric.upper())) @@ -160,7 +187,8 @@ def _fit_by_cross_validation(self, X, y, number_of_splits: int = 5, label: Label self.model = RandomizedSearchCV(model, param_distributions=self._parameter_grid, cv=number_of_splits, n_jobs=cores_for_training, scoring=scoring, refit=True) - self.model.fit(X, y) + + self.apply_with_weights(self.model.fit, w, X=X, y=y) if not self.show_warnings: del os.environ["PYTHONWARNINGS"] @@ -255,9 +283,10 @@ def get_compatible_encoders(self): from immuneML.encodings.reference_encoding.MatchedSequencesEncoder import MatchedSequencesEncoder from immuneML.encodings.reference_encoding.MatchedReceptorsEncoder import MatchedReceptorsEncoder from immuneML.encodings.reference_encoding.MatchedRegexEncoder import MatchedRegexEncoder + from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder return [KmerFrequencyEncoder, OneHotEncoder, Word2VecEncoder, EvennessProfileEncoder, - MatchedSequencesEncoder, MatchedReceptorsEncoder, MatchedRegexEncoder] + MatchedSequencesEncoder, MatchedReceptorsEncoder, MatchedRegexEncoder, MotifEncoder] @staticmethod def get_usage_documentation(model_name): diff --git 
a/immuneML/ml_methods/generative_models/OLGA.py b/immuneML/ml_methods/generative_models/OLGA.py index 517e5c3d1..ff277d8e7 100644 --- a/immuneML/ml_methods/generative_models/OLGA.py +++ b/immuneML/ml_methods/generative_models/OLGA.py @@ -185,7 +185,7 @@ def _generate_productive_sequences(self, count: int, path: Path, seed: int, olga sequences.loc[i] = ( seq_row[0], seq_row[1], olga_model.v_gene_mapping[seq_row[2]], olga_model.j_gene_mapping[seq_row[3]], - RegionType.IMGT_JUNCTION.name, SequenceFrameType.IN.name, p_gen, int(olga_model == self._olga_model), + RegionType.IMGT_JUNCTION.name, SequenceFrameType.IN.value, p_gen, int(olga_model == self._olga_model), -1, self.chain.value) sequences.to_csv(path, index=False, sep='\t') diff --git a/immuneML/ml_methods/util/Util.py b/immuneML/ml_methods/util/Util.py index 90170765a..c0fa2288a 100644 --- a/immuneML/ml_methods/util/Util.py +++ b/immuneML/ml_methods/util/Util.py @@ -1,11 +1,13 @@ import logging from datetime import datetime +import random import numpy as np import pkg_resources import torch from sklearn.preprocessing import label_binarize +from immuneML.data_model.encoded_data.EncodedData import EncodedData from immuneML.environment.Constants import Constants @@ -104,3 +106,29 @@ def get_immuneML_version(): return 'immuneML ' + Constants.VERSION except Exception as e: return f'immuneML-dev-{datetime.now()}' + + @staticmethod + def get_train_val_indices(n_examples, training_percentage, random_seed=None): + indices = list(range(n_examples)) + + random.seed(random_seed) + random.shuffle(indices) + random.seed(None) + + limit = int(n_examples * training_percentage) + train_indices = indices[:limit] + val_indices = indices[limit:] + + return train_indices, val_indices + + @staticmethod + def subset_encoded_data(encoded_data: EncodedData, indices): + return EncodedData(examples=encoded_data.examples[indices], + labels={label_name: [encoded_data.labels[label_name][i] for i in indices] + for label_name in 
encoded_data.labels.keys()}, + example_ids=[encoded_data.example_ids[i] for i in indices], + example_weights=[encoded_data.example_weights[i] for i in indices] if encoded_data.example_weights is not None else None, + feature_names=encoded_data.feature_names, + feature_annotations=encoded_data.feature_annotations, + encoding=encoded_data.encoding, + info=encoded_data.info) diff --git a/immuneML/ml_metrics/MetricUtil.py b/immuneML/ml_metrics/MetricUtil.py index 8a8ef65da..1b6b97e6e 100644 --- a/immuneML/ml_metrics/MetricUtil.py +++ b/immuneML/ml_metrics/MetricUtil.py @@ -20,7 +20,7 @@ def get_metric_fn(metric: ClassificationMetric): return fn @staticmethod - def score_for_metric(metric: ClassificationMetric, predicted_y, predicted_proba_y, true_y, classes): + def score_for_metric(metric: ClassificationMetric, predicted_y, predicted_proba_y, true_y, classes, example_weights=None): ''' Note: when providing label classes, make sure the 'positive class' is sorted last. This sorting should be done automatically when accessing Label.values @@ -42,9 +42,9 @@ def score_for_metric(metric: ClassificationMetric, predicted_y, predicted_proba_ predictions = predicted_y if 'labels' in inspect.getfullargspec(fn).kwonlyargs or 'labels' in inspect.getfullargspec(fn).args: - score = fn(true_y, predictions, labels=classes) + score = fn(true_y, predictions, sample_weight=example_weights, labels=classes) else: - score = fn(true_y, predictions) + score = fn(true_y, predictions, sample_weight=example_weights) except ValueError as err: warnings.warn(f"MLMethodAssessment: score for metric {metric.name} could not be calculated." 
diff --git a/immuneML/ml_metrics/ml_metrics.py b/immuneML/ml_metrics/ml_metrics.py index e56a89083..4980524f4 100644 --- a/immuneML/ml_metrics/ml_metrics.py +++ b/immuneML/ml_metrics/ml_metrics.py @@ -2,24 +2,24 @@ from sklearn import metrics -def f1_score_weighted(true_y, predicted_y): - return metrics.f1_score(true_y, predicted_y, average="weighted") +def f1_score_weighted(true_y, predicted_y, sample_weight=None): + return metrics.f1_score(true_y, predicted_y, average="weighted", sample_weight=sample_weight) -def f1_score_micro(true_y, predicted_y): - return metrics.f1_score(true_y, predicted_y, average="micro") +def f1_score_micro(true_y, predicted_y, sample_weight=None): + return metrics.f1_score(true_y, predicted_y, average="micro", sample_weight=sample_weight) -def f1_score_macro(true_y, predicted_y): - return metrics.f1_score(true_y, predicted_y, average="macro") +def f1_score_macro(true_y, predicted_y, sample_weight=None): + return metrics.f1_score(true_y, predicted_y, average="macro", sample_weight=sample_weight) -def roc_auc_score(true_y, predicted_y, labels=None): +def roc_auc_score(true_y, predicted_y, sample_weight=None, labels=None): predictions = np.array(predicted_y) if not isinstance(predicted_y, np.ndarray) else predicted_y true_values = np.array(true_y) if not isinstance(true_y, np.ndarray) else true_y if predictions.shape == true_values.shape: - return metrics.roc_auc_score(true_values, predictions, labels=labels) + return metrics.roc_auc_score(true_values, predictions, sample_weight=sample_weight, labels=labels) elif len(predictions.shape) == 2 and predictions.shape[1] == 2: - return metrics.roc_auc_score(true_values, predictions[:, 1], labels=labels) + return metrics.roc_auc_score(true_values, predictions[:, 1], sample_weight=sample_weight, labels=labels) else: return -1 diff --git a/immuneML/reports/data_reports/AminoAcidFrequencyDistribution.py b/immuneML/reports/data_reports/AminoAcidFrequencyDistribution.py index 29a0ce7e8..bbacf8dc4 
100644 --- a/immuneML/reports/data_reports/AminoAcidFrequencyDistribution.py +++ b/immuneML/reports/data_reports/AminoAcidFrequencyDistribution.py @@ -1,8 +1,10 @@ import warnings +from collections import Counter from pathlib import Path import pandas as pd import plotly.express as px +import numpy as np from immuneML.data_model.dataset.Dataset import Dataset from immuneML.data_model.dataset.ReceptorDataset import ReceptorDataset @@ -101,8 +103,7 @@ def _generate(self) -> ReportResult: figures.append(self._safe_plot(frequency_change=frequency_change, plot_callable="_plot_frequency_change")) return ReportResult(name=self.name, - info="A barplot showing the relative frequency of each amino acid at each position in " - "the sequences of a dataset.", + info="A barplot showing the relative frequency of each amino acid at each position in the sequences of a dataset.", output_figures=[fig for fig in figures if fig is not None], output_tables=[table for table in tables if table is not None]) @@ -273,8 +274,7 @@ def _get_position_order(self, positions): def _compute_frequency_change(self, freq_dist): classes = sorted(set(freq_dist["class"])) - assert len( - classes) == 2, f"{AminoAcidFrequencyDistribution.__name__}: cannot compute frequency change when the number of classes is not 2: {classes}" + assert len(classes) == 2, f"{AminoAcidFrequencyDistribution.__name__}: cannot compute frequency change when the number of classes is not 2: {classes}" class_a_df = freq_dist[freq_dist["class"] == classes[0]] class_b_df = freq_dist[freq_dist["class"] == classes[1]] @@ -333,13 +333,11 @@ def check_prerequisites(self): if self.split_by_label: if self.label_name is None: if len(self.dataset.get_label_names()) != 1: - warnings.warn( - f"{AminoAcidFrequencyDistribution.__name__}: ambiguous label: split_by_label was set to True but no label name was specified, and the number of available labels is {len(self.dataset.get_label_names())}: {self.dataset.get_label_names()}. 
Skipping this report...") + warnings.warn(f"{AminoAcidFrequencyDistribution.__name__}: ambiguous label: split_by_label was set to True but no label name was specified, and the number of available labels is {len(self.dataset.get_label_names())}: {self.dataset.get_label_names()}. Skipping this report...") return False else: if self.label_name not in self.dataset.get_label_names(): - warnings.warn( - f"{AminoAcidFrequencyDistribution.__name__}: the specified label name ({self.label_name}) was not available among the dataset labels: {self.dataset.get_label_names()}. Skipping this report...") + warnings.warn(f"{AminoAcidFrequencyDistribution.__name__}: the specified label name ({self.label_name}) was not available among the dataset labels: {self.dataset.get_label_names()}. Skipping this report...") return False return True diff --git a/immuneML/reports/data_reports/MotifGeneralizationAnalysis.py b/immuneML/reports/data_reports/MotifGeneralizationAnalysis.py new file mode 100644 index 000000000..c57cd27da --- /dev/null +++ b/immuneML/reports/data_reports/MotifGeneralizationAnalysis.py @@ -0,0 +1,366 @@ +from pathlib import Path + +import logging +import pandas as pd +import os +import warnings + +from immuneML.data_model.dataset.Dataset import Dataset +from immuneML.data_model.dataset.SequenceDataset import SequenceDataset +from immuneML.dsl.instruction_parsers.LabelHelper import LabelHelper +from immuneML.encodings.EncoderParams import EncoderParams +from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder +from immuneML.encodings.motif_encoding.PositionalMotifHelper import PositionalMotifHelper +from immuneML.ml_methods.util.Util import Util +from immuneML.reports.ReportResult import ReportResult +from immuneML.reports.data_reports.DataReport import DataReport +from immuneML.util.EncoderHelper import EncoderHelper +from immuneML.util.MotifPerformancePlotHelper import MotifPerformancePlotHelper +from immuneML.util.ParameterValidator import 
ParameterValidator +from immuneML.util.PathBuilder import PathBuilder +from immuneML.workflows.steps.DataEncoder import DataEncoder +from immuneML.workflows.steps.DataEncoderParams import DataEncoderParams + + +class MotifGeneralizationAnalysis(DataReport): + """ + This report splits the given dataset into a training and validation set, identifies significant motifs using the + :py:obj:`~immuneML.encodings.motif_encoding.MotifEncoder.MotifEncoder` + on the training set and plots the precision/recall and precision/true positive predictions of motifs + on both the training and validation sets. This can be used to: + - determine the optimal recall cutoff for motifs of a given size + - investigate how well motifs learned on a training set generalize to a test set + + After running this report and determining the optimal recall cutoffs, the report + :py:obj:`~immuneML.reports.encoding_reports.MotifTestSetPerformance.MotifTestSetPerformance` can be run to + plot the performance on an independent test set. + + Arguments: + + label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example. + + training_set_identifier_path (str): Path to a file containing 'sequence_identifiers' of the sequences used for the training set. This file should have a single column named 'example_id' and have one sequence identifier per line. If training_set_identifier_path is not set, a random subset of the data (according to training_percentage) will be assigned to be the training set. + + training_percentage (float): If training_set_identifier_path is not set, this value is used to specify the fraction of sequences that will be randomly assigned to form the training set. Should be a value between 0 and 1. By default, training_percentage is 0.7. + + random_seed (int): Random seed for splitting the data into training and validation sets if training_set_identifier_path is not provided.
+ + split_by_motif_size (bool): Whether to split the analysis per motif size. If true, a recall threshold is learned for each motif size, and figures are generated for each motif size independently. By default, split_by_motif_size is true. + + min_precision (float): :py:obj:`~immuneML.encodings.motif_encoding.MotifEncoder.MotifEncoder` parameter. The minimum precision threshold for keeping a motif on the training set. By default, min_precision is 0.9. + + test_precision_threshold (float): The desired precision on the test set, given that motifs are learned by using a training set with a precision threshold of min_precision. It is recommended for test_precision_threshold to be lower than min_precision, e.g., min_precision - 0.1. By default, test_precision_threshold is 0.8. + + min_recall (float): :py:obj:`~immuneML.encodings.motif_encoding.MotifEncoder.MotifEncoder` parameter. The minimum recall threshold for keeping a motif. Any learned recall threshold will be at least as high as the set min_recall value. The default value for min_recall is 0. + + min_true_positives (int): :py:obj:`~immuneML.encodings.motif_encoding.MotifEncoder.MotifEncoder` parameter. The minimum number of true positive training sequences that a motif needs to occur in. The default value for min_true_positives is 1. + + max_positions (int): :py:obj:`~immuneML.encodings.motif_encoding.MotifEncoder.MotifEncoder` parameter. The maximum motif size. This is the number of positional amino acids the motif consists of (excluding gaps). The default value for max_positions is 4. + + min_positions (int): :py:obj:`~immuneML.encodings.motif_encoding.MotifEncoder.MotifEncoder` parameter. The minimum motif size (see also: max_positions). The default value for min_positions is 1. + + smoothen_combined_precision (bool): Whether to add a smoothed line representing the combined precision to the precision-vs-TP plot. When set to True, this may take considerable extra time to compute.
By default, smoothen_combined_precision is set to True. + + min_points_in_window (int): Parameter for smoothing the combined_precision line in the precision-vs-TP plot through lognormal kernel smoothing with adaptive window size. This parameter determines the minimum number of points that need to be present in a window to determine the adaptive window size. By default, min_points_in_window is 50. + + smoothing_constant1 (float): Parameter for smoothing the combined_precision line in the precision-vs-TP plot through lognormal kernel smoothing with adaptive window size. This smoothing constant determines the dependence of the smoothness on the window size. Increasing this increases smoothness for regions where few points are present. By default, smoothing_constant1 is 5. + + smoothing_constant2 (float): Parameter for smoothing the combined_precision line in the precision-vs-TP plot through lognormal kernel smoothing with adaptive window size. This smoothing constant can be used to scale the overall kernel width, thus influencing the smoothness of all regions regardless of data density. By default, smoothing_constant2 is 10. + + training_set_name (str): Name of the training set to be used in figures. By default, the training_set_name is 'training set'. + + test_set_name (str): Name of the test set to be used in figures. By default, the test_set_name is 'test set'. + + highlight_motifs_path (str): Path to a set of motifs of interest to highlight in the output figures (such as implanted ground-truth motifs). By default, no motifs are highlighted. + + highlight_motifs_name (str): If highlight_motifs_path is defined, this name will be used to label the motifs of interest in the output figures. + + + YAML specification: + + .. indent with spaces + .. code-block:: yaml + + + my_report: + MotifGeneralizationAnalysis: + ...
+ label: # Define a label, and the positive class for that given label + CMV: + positive_class: + + """ + + def __init__(self, training_set_identifier_path: str = None, training_percentage: float = None, + max_positions: int = None, min_positions: int = None, min_precision: float = None, min_recall: float = None, + min_true_positives: int = None, + test_precision_threshold: float = None, + split_by_motif_size: bool = None, random_seed: int = None, label: dict = None, + min_points_in_window: int = None, smoothing_constant1: float = None, smoothing_constant2: float = None, + highlight_motifs_path: str = None, highlight_motifs_name: str = None, + training_set_name: str = None, test_set_name: str = None, + dataset: SequenceDataset = None, result_path: Path = None, number_of_processes: int = 1, name: str = None): + super().__init__(dataset=dataset, result_path=result_path, number_of_processes=number_of_processes, name=name) + self.training_set_identifier_path = Path(training_set_identifier_path) if training_set_identifier_path is not None else None + self.training_percentage = training_percentage + self.max_positions = max_positions + self.min_positions = min_positions + self.min_precision = min_precision + self.test_precision_threshold = test_precision_threshold + self.min_recall = min_recall + self.min_true_positives = min_true_positives + self.split_by_motif_size = split_by_motif_size + self.min_points_in_window = min_points_in_window + self.smoothing_constant1 = smoothing_constant1 + self.smoothing_constant2 = smoothing_constant2 + self.random_seed = random_seed + self.label = label + self.n_positives_in_training_data = None + + self.training_set_name = training_set_name + self.test_set_name = test_set_name + self.highlight_motifs_name = highlight_motifs_name + self.highlight_motifs_path = Path(highlight_motifs_path) if highlight_motifs_path is not None else None + + @classmethod + def build_object(cls, **kwargs): + location = MotifGeneralizationAnalysis.__name__ + 
+ ParameterValidator.assert_type_and_value(kwargs["max_positions"], int, location, "max_positions", min_inclusive=1) + ParameterValidator.assert_type_and_value(kwargs["min_positions"], int, location, "min_positions", min_inclusive=1) + assert kwargs["max_positions"] >= kwargs["min_positions"], f"{location}: max_positions ({kwargs['max_positions']}) must be greater than or equal to min_positions ({kwargs['min_positions']})" + + ParameterValidator.assert_type_and_value(kwargs["min_precision"], (int, float), location, "min_precision", min_inclusive=0, max_inclusive=1) + ParameterValidator.assert_type_and_value(kwargs["min_recall"], (int, float), location, "min_recall", min_inclusive=0, max_inclusive=1) + ParameterValidator.assert_type_and_value(kwargs["min_true_positives"], int, location, "min_true_positives", min_inclusive=1) + ParameterValidator.assert_type_and_value(kwargs["test_precision_threshold"], float, location, "test_precision_threshold", min_inclusive=0, max_exclusive=1) + ParameterValidator.assert_type_and_value(kwargs["split_by_motif_size"], bool, location, "split_by_motif_size") + ParameterValidator.assert_type_and_value(kwargs["min_points_in_window"], int, location, "min_points_in_window", min_inclusive=1) + ParameterValidator.assert_type_and_value(kwargs["smoothing_constant1"], (int, float), location, "smoothing_constant1", min_exclusive=0) + ParameterValidator.assert_type_and_value(kwargs["smoothing_constant2"], (int, float), location, "smoothing_constant2", min_exclusive=0) + + ParameterValidator.assert_type_and_value(kwargs["training_set_name"], str, location, "training_set_name") + ParameterValidator.assert_type_and_value(kwargs["test_set_name"], str, location, "test_set_name") + + assert kwargs["training_set_name"] != kwargs["test_set_name"], f"{location}: training_set_name cannot be the same as test_set_name. 
Both are: {kwargs['training_set_name']}" + + if kwargs["training_set_identifier_path"] is not None: + ParameterValidator.assert_type_and_value(kwargs["training_set_identifier_path"], str, location, "training_set_identifier_path") + assert os.path.isfile(kwargs["training_set_identifier_path"]), f"{location}: the file {kwargs['training_set_identifier_path']} does not exist. " \ + f"Specify the correct path under training_set_identifier_path." + else: + ParameterValidator.assert_type_and_value(kwargs["training_percentage"], float, location, "training_percentage", min_exclusive=0, max_exclusive=1) + + if "random_seed" in kwargs and kwargs["random_seed"] is not None: + ParameterValidator.assert_type_and_value(kwargs["random_seed"], int, location, "random_seed") + + ParameterValidator.assert_type_and_value(kwargs["label"], (dict, str), location, "label") + if type(kwargs["label"]) is dict: + assert len(kwargs["label"]) == 1, f"{location}: The number of specified labels must be 1, found {len(kwargs['label'])}: {', '.join(list(kwargs['label'].keys()))}" + + if "highlight_motifs_path" in kwargs and kwargs["highlight_motifs_path"] is not None: + PositionalMotifHelper.check_motif_filepath(kwargs["highlight_motifs_path"], location, "highlight_motifs_path") + + ParameterValidator.assert_type_and_value(kwargs["highlight_motifs_name"], str, location, "highlight_motifs_name") + + return MotifGeneralizationAnalysis(**kwargs) + + def _generate(self): + encoded_training_dataset, encoded_test_dataset = self._get_encoded_train_test_datasets() + training_plotting_data, test_plotting_data = MotifPerformancePlotHelper.get_plotting_data(encoded_training_dataset.encoded_data, + encoded_test_dataset.encoded_data, + self.highlight_motifs_path, self.highlight_motifs_name) + + self.n_positives_in_training_data = self._get_positive_count(encoded_training_dataset) + + return self._get_report_result(training_plotting_data, test_plotting_data) + + def _get_report_result(self,
training_plotting_data, test_plotting_data): + if self.split_by_motif_size: + output_tables, output_plots, tp_cutoff_dict = self._construct_and_plot_data_per_motif_size(training_plotting_data, test_plotting_data) + else: + output_tables, output_plots, tp_cutoff_dict = self._construct_and_plot_data(training_plotting_data, test_plotting_data) + + tp_cutoff_table = self._write_tp_recall_thresholds(tp_cutoff_dict) + output_tables.append(tp_cutoff_table) + + return ReportResult(output_tables=output_tables, + output_figures=output_plots) + + def _construct_and_plot_data_per_motif_size(self, training_plotting_data, test_plotting_data): + output_tables, output_plots = [], [] + tp_cutoff_dict = {} + + training_plotting_data["motif_size"] = training_plotting_data["feature_names"].apply(PositionalMotifHelper.get_motif_size) + test_plotting_data["motif_size"] = test_plotting_data["feature_names"].apply(PositionalMotifHelper.get_motif_size) + + for motif_size in sorted(set(training_plotting_data["motif_size"])): + sub_training_plotting_data = training_plotting_data[training_plotting_data["motif_size"] == motif_size] + sub_test_plotting_data = test_plotting_data[test_plotting_data["motif_size"] == motif_size] + + sub_output_tables, sub_output_plots, sub_tp_cutoff_dict = self._construct_and_plot_data(sub_training_plotting_data, sub_test_plotting_data, motif_size=motif_size) + + output_tables.extend(sub_output_tables) + output_plots.extend(sub_output_plots) + tp_cutoff_dict.update(sub_tp_cutoff_dict) + + return output_tables, output_plots, tp_cutoff_dict + + + def _construct_and_plot_data(self, training_plotting_data, test_plotting_data, motif_size=None): + training_combined_precision = self._get_combined_precision(training_plotting_data) + test_combined_precision = self._get_combined_precision(test_plotting_data) + tp_cutoff = self._determine_tp_cutoff(test_combined_precision, motif_size) + + motif_size_suffix = f"_motif_size={motif_size}" if motif_size is not None else "" + 
motifs_name = f"motifs of length {motif_size}" if motif_size is not None else "motifs" + + output_tables = MotifPerformancePlotHelper.write_output_tables(self, training_plotting_data, test_plotting_data, training_combined_precision, test_combined_precision, motifs_name=motifs_name, file_suffix=motif_size_suffix) + output_plots = MotifPerformancePlotHelper.write_plots(self, training_plotting_data, test_plotting_data, training_combined_precision, test_combined_precision, training_tp_cutoff=None, test_tp_cutoff=tp_cutoff, motifs_name=motifs_name, file_suffix=motif_size_suffix) + + return output_tables, output_plots, {motif_size: tp_cutoff} + + def _write_tp_recall_thresholds(self, tp_cutoff_dict): + training_set_name = self.training_set_name.replace(" ", "_") + + data = {"precision_cutoff": [self.min_precision] * len(tp_cutoff_dict), + "n_positives_in_training_data": [self.n_positives_in_training_data] * len(tp_cutoff_dict), + f"{training_set_name}_tp_cutoff": [], + f"{training_set_name}_recall_cutoff": [], + "motif_size": []} + + for motif_size, tp_cutoff in tp_cutoff_dict.items(): + data["motif_size"].append(motif_size if motif_size is not None else "all") + data[f"{training_set_name}_tp_cutoff"].append(tp_cutoff) + data[f"{training_set_name}_recall_cutoff"].append(self._tp_to_recall(tp_cutoff)) + + return self._write_output_table(table=pd.DataFrame(data), + file_path=self.result_path / "tp_recall_cutoffs.tsv", + name=f"{self.training_set_name}-TP and recall cutoff(s)") + def _tp_to_recall(self, tp_cutoff): + if tp_cutoff is not None: + return tp_cutoff / self.n_positives_in_training_data + + def _get_encoded_train_test_datasets(self): + train_data_path = PathBuilder.build(self.result_path / "datasets/train") + test_data_path = PathBuilder.build(self.result_path / "datasets/test") + + train_indices, val_indices = self._get_train_val_indices() + + training_data = self.dataset.make_subset(train_indices, train_data_path, Dataset.TRAIN) + test_data = 
self.dataset.make_subset(val_indices, test_data_path, Dataset.TEST) + + encoder = self._get_encoder() + + encoded_training_dataset = self._encode_dataset(training_data, encoder, learn_model=True) + encoded_test_dataset = self._encode_dataset(test_data, encoder, learn_model=False) + + return encoded_training_dataset, encoded_test_dataset + + def _get_train_val_indices(self): + if self.training_set_identifier_path is None: + return Util.get_train_val_indices(self.dataset.get_example_count(), + self.training_percentage, random_seed=self.random_seed) + else: + return self._get_train_val_indices_from_file() + + + def _get_train_val_indices_from_file(self): + input_train_identifiers = list(pd.read_csv(self.training_set_identifier_path, usecols=["example_id"])["example_id"].astype(str)) + + train_indices = [] + val_indices = [] + val_identifiers = [] + actual_train_identifiers = [] + + for idx, sequence in enumerate(self.dataset.get_data()): + if str(sequence.identifier) in input_train_identifiers: + train_indices.append(idx) + actual_train_identifiers.append(sequence.identifier) + else: + val_indices.append(idx) + val_identifiers.append(sequence.identifier) + + self._write_identifiers(self.result_path / "training_set_identifiers.txt", actual_train_identifiers, "Training") + self._write_identifiers(self.result_path / "validation_set_identifiers.txt", val_identifiers, "Validation") + + assert len(train_indices) > 0, f"{MotifGeneralizationAnalysis.__name__}: error when reading training set identifiers from training_set_identifier_path, 0 of the identifiers were present in the dataset. Please check training_set_identifier_path: {self.training_set_identifier_path}, and see the log file for more information." + assert len(val_indices) > 0, f"{MotifGeneralizationAnalysis.__name__}: error when inferring validation set identifiers from training_set_identifier_path, all of the identifiers were present in the dataset resulting in 0 sequences in the validation set. 
Please check training_set_identifier_path: {self.training_set_identifier_path}, and see the log file for more information." + assert len(train_indices) == len(input_train_identifiers), f"{MotifGeneralizationAnalysis.__name__}: error when reading training set identifiers from training_set_identifier_path, not all identifiers provided in the file occurred in the dataset ({len(train_indices)} of {len(input_train_identifiers)} found). Please check training_set_identifier_path: {self.training_set_identifier_path}, and see the log file for more information." + + return train_indices, val_indices + + def _write_identifiers(self, path, identifiers, set_name): + logging.info(f"{MotifGeneralizationAnalysis.__name__}: {len(identifiers)} {set_name} set identifiers written to: {path}") + + with open(path, "w") as file: + file.writelines([f"{identifier}\n" for identifier in identifiers]) + + def _get_encoder(self): + encoder = MotifEncoder.build_object(self.dataset, **{"max_positions": self.max_positions, + "min_positions": self.min_positions, + "min_precision": self.min_precision, + "min_recall": self.min_recall, + "min_true_positives": self.min_true_positives, + "label": None, + "name": f"motif_encoder"}) + + return encoder + + def _encode_dataset(self, dataset, encoder, learn_model): + encoded_dataset = DataEncoder.run(DataEncoderParams(dataset=dataset, encoder=encoder, + encoder_params=EncoderParams(result_path=self.result_path / f"encoded_data/{dataset.name}", + label_config=self._get_label_config(dataset), + pool_size=self.number_of_processes, + learn_model=learn_model, + encode_labels=True), + )) + + return encoded_dataset + + def _get_label_config(self, dataset): + label_config = LabelHelper.create_label_config([self.label], dataset, MotifGeneralizationAnalysis.__name__, + f"{MotifGeneralizationAnalysis.__name__}/label") + EncoderHelper.check_positive_class_labels(label_config, f"{MotifGeneralizationAnalysis.__name__}/label") + + return label_config + + def 
_get_positive_count(self, dataset): + label_config = self._get_label_config(dataset) + label_name = label_config.get_label_objects()[0].name + label_positive_class = label_config.get_label_objects()[0].positive_class + + return sum([1 for label_class in dataset.get_metadata([label_name])[label_name] if label_class == label_positive_class]) + + def _get_combined_precision(self, plotting_data): + return MotifPerformancePlotHelper.get_combined_precision(plotting_data, + min_points_in_window=self.min_points_in_window, + smoothing_constant1=self.smoothing_constant1, + smoothing_constant2=self.smoothing_constant2) + + def _determine_tp_cutoff(self, combined_precision, motif_size=None): + col = "smooth_combined_precision" if "smooth_combined_precision" in combined_precision else "combined_precision" + + try: + # assert all(training_combined_precision["training_TP"] == test_combined_precision["training_TP"]) + # + # train_test_difference = training_combined_precision[col] - test_combined_precision[col] + # return min(test_combined_precision[train_test_difference <= self.precision_difference]["training_TP"]) + + max_tp_below_threshold = max(combined_precision[combined_precision[col] < self.test_precision_threshold]["training_TP"]) + all_above_threshold = combined_precision[combined_precision["training_TP"] > max_tp_below_threshold] + + return min(all_above_threshold["training_TP"]) + except ValueError: + motif_size_warning = f" for motif size = {motif_size}" if motif_size is not None else "" + warnings.warn(f"{MotifGeneralizationAnalysis.__name__}: could not automatically determine optimal TP threshold{motif_size_warning} with precision difference based on {col}") + return None + + def _plot_precision_per_tp(self, file_path, plotting_data, combined_precision, dataset_type, tp_cutoff, motifs_name="motifs"): + return MotifPerformancePlotHelper.plot_precision_per_tp(file_path, plotting_data, combined_precision, dataset_type, + training_set_name=self.training_set_name, +
tp_cutoff=tp_cutoff, motifs_name=motifs_name, + highlight_motifs_name=self.highlight_motifs_name) + + def _plot_precision_recall(self, file_path, plotting_data, min_recall=None, min_precision=None, dataset_type=None, motifs_name="motifs"): + return MotifPerformancePlotHelper.plot_precision_recall(file_path, plotting_data, min_recall=min_recall, min_precision=min_precision, + dataset_type=dataset_type, motifs_name=motifs_name, highlight_motifs_name=self.highlight_motifs_name) diff --git a/immuneML/reports/data_reports/SequenceLengthDistribution.py b/immuneML/reports/data_reports/SequenceLengthDistribution.py index 710d9225c..4231be74f 100644 --- a/immuneML/reports/data_reports/SequenceLengthDistribution.py +++ b/immuneML/reports/data_reports/SequenceLengthDistribution.py @@ -38,7 +38,6 @@ class SequenceLengthDistribution(DataReport): @classmethod def build_object(cls, **kwargs): - ParameterValidator.assert_sequence_type(kwargs) return SequenceLengthDistribution(**{**kwargs, 'sequence_type': SequenceType[kwargs['sequence_type'].upper()]}) diff --git a/immuneML/reports/data_reports/SignificantKmerPositions.py b/immuneML/reports/data_reports/SignificantKmerPositions.py index e9a762ee2..0f9fde805 100644 --- a/immuneML/reports/data_reports/SignificantKmerPositions.py +++ b/immuneML/reports/data_reports/SignificantKmerPositions.py @@ -80,6 +80,7 @@ def __init__(self, dataset: RepertoireDataset = None, reference_sequences_path: self.k_values = k_values self.label = label self.compairr_path = compairr_path + self.label_config = None def check_prerequisites(self): if isinstance(self.dataset, RepertoireDataset): diff --git a/immuneML/reports/data_reports/WeightsDistribution.py b/immuneML/reports/data_reports/WeightsDistribution.py new file mode 100644 index 000000000..afc59f9b1 --- /dev/null +++ b/immuneML/reports/data_reports/WeightsDistribution.py @@ -0,0 +1,134 @@ +import plotly.express as px +import warnings +from pathlib import Path + +import pandas as pd + +from 
immuneML.data_model.dataset.Dataset import Dataset +from immuneML.data_model.dataset.SequenceDataset import SequenceDataset +from immuneML.reports.ReportOutput import ReportOutput +from immuneML.reports.ReportResult import ReportResult +from immuneML.reports.data_reports.DataReport import DataReport +from immuneML.util.PathBuilder import PathBuilder +from immuneML.dsl.instruction_parsers.LabelHelper import LabelHelper + +class WeightsDistribution(DataReport): + """ + Plots the distribution of weights in a given Dataset. This report can only be used if example weighting has been applied to the given dataset. + + + # todo: the report should work with any label, with any number of classes (currently assumes is_binder with classes 0 and 1) + # use self.label property to find the label name. + # do not hardcode any text related to the is_binder label + + # todo: make sure the same classes always get the same color in the figures + # currently when running the report multiple times, class 0 is sometimes blue and sometimes purple + # solution: retrieve the classes, sort the classes, use color_discrete_map instead of color_discrete_sequence + # to map each class to a color. Get these colors for example from px.colors.diverging.Tealrose + + # todo: label is only a useful parameter when split_classes is true. these parameters can thus be merged + # instead, use parameter color_grouping_label. if set, use different colors. 
if not set, use one color + # see FeatureComparison report for an example where multiple labels can be used to change features of the plot + + # todo add unit tests + # todo add the sequence as one of the hover values (plotly argument hover_data=["..."]), see example: https://plotly.com/python/hover-text-and-formatting/ + + + Example YAML specification: + r1: + WeightsDistribution: + label: + is_binding + weight_thresholds: + - 1 + - 0.1 + - 0.001 + split_classes: + True + """ + @classmethod + def build_object(cls, **kwargs): + return WeightsDistribution(**kwargs) + + def __init__(self, dataset: Dataset = None, result_path: Path = None, number_of_processes: int = 1, name: str = None, label: dict = None, weight_thresholds: dict = None, split_classes: bool = None): + super().__init__(dataset=dataset, result_path=result_path, number_of_processes=number_of_processes, name=name) + self.label = label + self.weight_thresholds = weight_thresholds + self.split_classes = split_classes + self.label_config = None + + def check_prerequisites(self): + if self.dataset.get_example_weights() is not None: + return True + else: + warnings.warn("WeightsDistribution: report requires weights to be set for the given Dataset. 
Skipping this report...") + return False + + def _generate(self) -> ReportResult: + self.label_config = LabelHelper.create_label_config([self.label], self.dataset, WeightsDistribution.__name__, + f"{WeightsDistribution.__name__}/label") + data = self._get_plotting_data() + report_output_fig = self._safe_plot(data=data) + output_figures = None if report_output_fig is None else [report_output_fig] + return ReportResult(name=self.name, + info="A line graph of weights over sequences for is-binding and non-binding sequences", + output_figures=output_figures) + + + def get_weights_by_class_df(self, data, weights): + data["weights"] = weights + data.sort_values(by=["weights"], inplace=True, ascending=False) + + data["sorted"] = None + data.loc[data["is_binding"] == "0", "sorted"] = range(len(data[data["is_binding"] == "0"])) + data.loc[data["is_binding"] == "1", "sorted"] = range(len(data[data["is_binding"] == "1"])) + + return data + + def get_weights_both_classes_df(self, data, weights): + data["weights"] = weights + data.sort_values(by=["weights"], inplace=True, ascending=False) + data.loc[:, "sorted"] = range(len(data)) + + return data + + def _get_plotting_data(self): + weights = self.dataset.get_example_weights() + data = self.dataset.get_metadata([self.label], return_df=True) + + # todo remove this, this was temporary + if isinstance(self.dataset, SequenceDataset): + data["seq"] = [seq.sequence_aa for seq in self.dataset.get_data()] + + if self.split_classes: + data = self.get_weights_by_class_df(data, weights) + else: + data = self.get_weights_both_classes_df(data, weights) + + return data + + def _plot(self, data) -> ReportOutput: + print(data) + PathBuilder.build(self.result_path) + + if self.split_classes: + color = self.label + else: + color = None + + fig = px.line(data, y="weights", color=color, x="sorted", log_y=True, + color_discrete_sequence=["#65D0B8", "#AB80D8"], + hover_data=["seq"], + labels={ + "weights": "Weights", + "sorted": "Sequences (sorted by 
weight)", + "is_binding": "Is binder" + }, template="plotly_white") + + if self.weight_thresholds is not None: + for hline in self.weight_thresholds: + fig.add_shape(type="line", x0=0, x1=max(data["sorted"]), y0=hline, y1=hline, line=dict(color="black", width=1.5, dash="dash")) + + file_path = self.result_path / "weights_distribution.html" + fig.write_html(str(file_path)) + return ReportOutput(path=file_path, name="weights distribution plot") \ No newline at end of file diff --git a/immuneML/reports/encoding_reports/FeatureDistribution.py b/immuneML/reports/encoding_reports/FeatureDistribution.py index 7e10d0031..9c1db82ea 100644 --- a/immuneML/reports/encoding_reports/FeatureDistribution.py +++ b/immuneML/reports/encoding_reports/FeatureDistribution.py @@ -122,6 +122,7 @@ def _plot_sparse(self, data_long_format) -> ReportOutput: return ReportOutput(path=file_path, name="feature boxplots") def _plot_normal(self, data_long_format) -> ReportOutput: + figure = px.box(data_long_format, x=self.x, y="value", color=self.color, facet_row=self.facet_row, facet_col=self.facet_column, labels={ diff --git a/immuneML/reports/encoding_reports/GroundTruthMotifOverlap.py b/immuneML/reports/encoding_reports/GroundTruthMotifOverlap.py new file mode 100644 index 000000000..668745a44 --- /dev/null +++ b/immuneML/reports/encoding_reports/GroundTruthMotifOverlap.py @@ -0,0 +1,176 @@ +from pathlib import Path + +import logging +import plotly.express as px +import numpy as np +import pandas as pd + +from immuneML.data_model.dataset.Dataset import Dataset +from immuneML.encodings.motif_encoding.PositionalMotifHelper import PositionalMotifHelper + +from immuneML.reports.ReportOutput import ReportOutput +from immuneML.reports.ReportResult import ReportResult +from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder +from immuneML.reports.encoding_reports.EncodingReport import EncodingReport +from immuneML.util.PathBuilder import PathBuilder + + +class 
GroundTruthMotifOverlap(EncodingReport): + """ + Creates report displaying overlap between learned motifs and groundtruth motifs + + # todo: arguments, yaml spec, explanation of format of highlight motifs file + """ + + def __init__(self, dataset: Dataset = None, result_path: Path = None, name: str = None, + number_of_processes: int = 1, groundtruth_motifs_path: str = None): + super().__init__(dataset=dataset, result_path=result_path, name=name, number_of_processes=number_of_processes) + self.groundtruth_motifs_path = groundtruth_motifs_path + + @classmethod + def build_object(cls, **kwargs): + location = GroundTruthMotifOverlap.__name__ + + if "groundtruth_motifs_path" in kwargs and kwargs["groundtruth_motifs_path"] is not None: + PositionalMotifHelper.check_motif_filepath(kwargs["groundtruth_motifs_path"], location, "groundtruth_motifs_path", expected_header="indices\tamino_acids\tn_sequences\n") + + return GroundTruthMotifOverlap(**kwargs) + + def _generate(self): + PathBuilder.build(self.result_path) + + groundtruth_motifs, implant_rate_dict = self._read_groundtruth_motifs(self.groundtruth_motifs_path) + + learned_motifs = self.dataset.encoded_data.feature_names + + overlap_df = self._generate_overlap(learned_motifs, groundtruth_motifs, implant_rate_dict) + output_table = self._write_output_table(overlap_df, self.result_path / "ground_truth_motif_overlap.tsv", name=None) + output_figure = self._safe_plot(overlap_df=overlap_df) + + return ReportResult( + name=self.name, + output_figures=[output_figure] if output_figure is not None else [], + output_tables=[output_table], + ) + + def _read_groundtruth_motifs(self, filepath): + with open(filepath) as file: + PositionalMotifHelper.check_file_header(file.readline(), filepath, expected_header="indices\tamino_acids\tn_sequences\n") + groundtruth_motifs = [] + groundtruth_implant_rate = [] + for line in file.readlines(): + motif, implant_rate = self._get_motif_and_implant_rate( + line, motif_sep="\t" + ) + 
groundtruth_motifs.append(motif) + groundtruth_implant_rate.append(implant_rate) + + implant_rate_dict = { + groundtruth_motifs[i]: groundtruth_implant_rate[i] + for i in range(len(groundtruth_motifs)) + } + return groundtruth_motifs, implant_rate_dict + + def _get_motif_and_implant_rate(self, string, motif_sep): + indices_str, amino_acids_str, implant_rate = string.strip().split(motif_sep) + motif = indices_str + "-" + amino_acids_str + return motif, implant_rate + + def _generate_overlap(self, learned_motifs, groundtruth_motifs, implant_rate_dict): + motif_size_list = list() + implant_rate_list = list() + max_overlap_list = list() + learned_motif_list = list() + gt_motif_list = list() + + for learned_motif in learned_motifs: + motif_size = len(learned_motif.split("-")[0].replace("&", "")) + for groundtruth_motif in groundtruth_motifs: + max_overlap = self._get_max_overlap(learned_motif, groundtruth_motif) + + if max_overlap != 0: + motif_size_list.append(motif_size) + implant_rate_list.append(implant_rate_dict[groundtruth_motif]) + max_overlap_list.append(max_overlap) + learned_motif_list.append(learned_motif) + gt_motif_list.append(groundtruth_motif) + + df = pd.DataFrame() + df["learned_motif"] = learned_motif_list + df["ground_truth_motif"] = gt_motif_list + df["implant_rate"] = implant_rate_list + df["max_overlap"] = max_overlap_list + df["motif_size"] = motif_size_list + + return df + + def _get_max_overlap(self, learned_motif, groundtruth_motif): + # assumes no duplicates will occur as is the case with motifs + + split_learned = learned_motif.replace("&", "").split("-") + split_groundtruth = groundtruth_motif.replace("&", "").split("-") + + learned_aa = split_learned[0] + learned_indices = split_learned[1] + groundtruth_aa = split_groundtruth[0] + groundtruth_indices = split_groundtruth[1] + + learned_pairs = [learned_aa[i] + learned_indices[i] for i in range(len(learned_aa))] + groundtruth_pairs = [groundtruth_aa[i] + groundtruth_indices[i] for i in 
range(len(groundtruth_aa))] + + score = 0 + for pair in learned_pairs: + if pair in groundtruth_pairs: + score += 1 + + return score + + def _get_color_discrete_sequence(self): + return px.colors.qualitative.Pastel[:-1] + px.colors.qualitative.Set3 + + def _plot(self, overlap_df) -> ReportOutput: + file_path = self.result_path / f"motif_overlap.html" + facet_barplot = px.histogram( + overlap_df, + x="implant_rate", + labels={ + "implant_rate": "Number of implanted ground truth motifs", + "max_overlap": "Ground truth motif overlap", + "motif_size": "Motif size", + }, + facet_col="max_overlap", + color_discrete_sequence=self._get_color_discrete_sequence(), + category_orders=dict(implant_rate=sorted([int(rate) for rate in overlap_df["implant_rate"].unique()]), + motif_size=sorted([int(size) for size in overlap_df["motif_size"].unique()]), + max_overlap=sorted([int(overlap) for overlap in overlap_df["max_overlap"].unique()])), + facet_col_spacing=0.05, + color="motif_size", + title="Amount of overlapping motifs per implant rate", + template="plotly_white" + ) + facet_barplot.update_yaxes(matches=None, showticklabels=True) + facet_barplot.update_layout( + yaxis_title="Number of overlapping learned motifs", + ) + facet_barplot.write_html(str(file_path)) + + return ReportOutput( + path=file_path, name="Amount of overlapping motifs per implant rate" + ) + + def check_prerequisites(self): + valid_encodings = [MotifEncoder.__name__] + + if self.dataset.encoded_data is None or self.dataset.encoded_data.info is None: + logging.warning( + "GroundTruthMotifOverlap: the dataset is not encoded, skipping this report..." + ) + return False + elif self.dataset.encoded_data.encoding not in valid_encodings: + logging.warning( + f"GroundTruthMotifOverlap: the dataset encoding ({self.dataset.encoded_data.encoding}) was not in the list of valid " + f"encodings ({valid_encodings}), skipping this report..." 
+ ) + return False + else: + return True diff --git a/immuneML/reports/encoding_reports/MotifTestSetPerformance.py b/immuneML/reports/encoding_reports/MotifTestSetPerformance.py new file mode 100644 index 000000000..0a25a2673 --- /dev/null +++ b/immuneML/reports/encoding_reports/MotifTestSetPerformance.py @@ -0,0 +1,325 @@ +import logging +import warnings +from pathlib import Path +import shutil + +import numpy as np + +from immuneML.IO.dataset_import.DataImport import DataImport +from immuneML.IO.dataset_import.DatasetImportParams import DatasetImportParams +from immuneML.data_model.dataset.Dataset import Dataset +from immuneML.dsl.DefaultParamsLoader import DefaultParamsLoader +from immuneML.dsl.import_parsers.ImportParser import ImportParser +from immuneML.encodings.EncoderParams import EncoderParams +from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder +from immuneML.encodings.motif_encoding.PositionalMotifHelper import PositionalMotifHelper +from immuneML.environment.Label import Label +from immuneML.environment.LabelConfiguration import LabelConfiguration +from immuneML.reports.ReportResult import ReportResult +from immuneML.reports.encoding_reports.EncodingReport import EncodingReport +from immuneML.util.ImportHelper import ImportHelper +from immuneML.util.MotifPerformancePlotHelper import MotifPerformancePlotHelper +from immuneML.util.ParameterValidator import ParameterValidator +from immuneML.util.ReflectionHandler import ReflectionHandler + + +class MotifTestSetPerformance(EncodingReport): + """ + This report can be used to show the performance of a learned set of motifs using the :py:obj:`~immuneML.encodings.motif_encoding.MotifEncoder.MotifEncoder` + on an independent test set of unseen data. 
+ + It is recommended to first run the report :py:obj:`~immuneML.reports.data_reports.MotifGeneralizationAnalysis.MotifGeneralizationAnalysis` + in order to calibrate the optimal recall thresholds and plot the performance of motifs on training- and validation sets. + + Arguments: + + test_dataset (dict): parameters for importing a SequenceDataset to use as an independent test set. By default, + the import parameters 'is_repertoire' and 'paired' will be set to False to ensure a SequenceDataset is imported. + + + YAML specification: + + .. indent with spaces + .. code-block:: yaml + + my_motif_report: + MotifTestSetPerformance: + test_dataset: + format: AIRR # choose any valid import format + params: + path: path/to/files/ + is_repertoire: False # is_repertoire must be False to import a SequenceDataset + paired: False # paired must be False to import a SequenceDataset + # optional other parameters... + + """ + + def __init__(self, dataset: Dataset = None, result_path: Path = None, test_dataset_import_cls: DataImport = None, + test_dataset_import_params: DatasetImportParams = None, + training_set_name: str = None, test_set_name: str = None, + split_by_motif_size: bool = None, + highlight_motifs_path: str = None, highlight_motifs_name: str = None, + min_points_in_window: int = None, + smoothing_constant1: float = None, smoothing_constant2: float = None, + keep_test_dataset: bool = None, + number_of_processes: int = 1, name: str = None): + super().__init__(dataset=dataset, result_path=result_path, number_of_processes=number_of_processes, name=name) + self.test_dataset_import_cls = test_dataset_import_cls + self.test_dataset_import_params = test_dataset_import_params + + self.keep_test_dataset = keep_test_dataset + self.split_by_motif_size = split_by_motif_size + self.training_set_name = training_set_name + self.test_set_name = test_set_name + self.highlight_motifs_path = highlight_motifs_path + self.highlight_motifs_name = highlight_motifs_name + 
self.min_points_in_window = min_points_in_window + self.smoothing_constant1 = smoothing_constant1 + self.smoothing_constant2 = smoothing_constant2 + + @classmethod + def build_object(cls, **kwargs): + location = MotifTestSetPerformance.__name__ + + import_cls, test_dataset_import_params = MotifTestSetPerformance._parse_dataset_params(kwargs) + + kwargs["test_dataset_import_cls"] = import_cls + kwargs["test_dataset_import_params"] = test_dataset_import_params + + del kwargs["test_dataset"] + + ParameterValidator.assert_type_and_value(kwargs["split_by_motif_size"], bool, location, "split_by_motif_size") + ParameterValidator.assert_type_and_value(kwargs["training_set_name"], str, location, "training_set_name") + ParameterValidator.assert_type_and_value(kwargs["test_set_name"], str, location, "test_set_name") + ParameterValidator.assert_type_and_value(kwargs["keep_test_dataset"], bool, location, "keep_test_dataset") + ParameterValidator.assert_type_and_value(kwargs["min_points_in_window"], int, location, "min_points_in_window", min_inclusive=1) + ParameterValidator.assert_type_and_value(kwargs["smoothing_constant1"], (int, float), location, "smoothing_constant1", min_exclusive=0) + ParameterValidator.assert_type_and_value(kwargs["smoothing_constant2"], (int, float), location, "smoothing_constant2", min_exclusive=0) + + if "highlight_motifs_path" in kwargs and kwargs["highlight_motifs_path"] is not None: + PositionalMotifHelper.check_motif_filepath(kwargs["highlight_motifs_path"], location, "highlight_motifs_path") + + ParameterValidator.assert_type_and_value(kwargs["highlight_motifs_name"], str, location, "highlight_motifs_name") + + return MotifTestSetPerformance(**kwargs) + + @staticmethod + def _parse_dataset_params(kwargs): + location = MotifTestSetPerformance.__name__ + + ParameterValidator.assert_type_and_value(kwargs["test_dataset"], dict, location, "test_dataset") + ParameterValidator.assert_keys_present(kwargs["test_dataset"].keys(), ["format", "params"], 
location, + "test_dataset") + ParameterValidator.assert_type_and_value(kwargs["test_dataset"]["format"], str, location, "test_dataset/format") + + import_cls = ReflectionHandler.get_class_by_name("{}Import".format(kwargs["test_dataset"]["format"])) + default_params = DefaultParamsLoader.load(ImportParser.keyword, kwargs["test_dataset"]["format"]) + params_dict = {**default_params, **kwargs["test_dataset"]["params"]} + + test_dataset_import_params = DatasetImportParams.build_object(**params_dict) + + if test_dataset_import_params.is_repertoire: + warnings.warn(f"{location}: This report only allows the reference dataset to be of type SequenceDataset. " + "Setting 'test_dataset/params/is_repertoire' to False...") + test_dataset_import_params.is_repertoire = False + + if test_dataset_import_params.paired: + warnings.warn(f"{location}: This report only allows the reference dataset to be of type SequenceDataset. " + "Setting 'test_dataset/params/paired' to False...") + test_dataset_import_params.paired = False + + assert test_dataset_import_params.metadata_column_mapping is not None, f"{location}: This report requires a test_dataset containing the same label as the encoded dataset. Please set a label using 'test_dataset/params/metadata_column_mapping'." 
+ + return import_cls, test_dataset_import_params + + def _generate(self) -> ReportResult: + test_dataset = self._get_test_dataset() + test_encoded_data = self._encode_test_data(test_dataset) + + training_plotting_data, test_plotting_data = MotifPerformancePlotHelper.get_plotting_data(self.dataset.encoded_data, + test_encoded_data.encoded_data, + self.highlight_motifs_path, + self.highlight_motifs_name) + + training_plotting_data["motif_size"] = training_plotting_data["feature_names"].apply(PositionalMotifHelper.get_motif_size) + test_plotting_data["motif_size"] = test_plotting_data["feature_names"].apply(PositionalMotifHelper.get_motif_size) + + output_tables, output_plots = self._get_report_outputs(training_plotting_data, test_plotting_data) + + if not self.keep_test_dataset: + shutil.rmtree(self.test_dataset_import_params.result_path) + + return ReportResult(name=self.name, + info="Performance of motifs on an independent test set", + output_figures=output_plots, + output_tables=output_tables) + + def _get_report_outputs(self, training_plotting_data, test_plotting_data): + if self.split_by_motif_size: + return self._construct_and_plot_data_per_motif_size(training_plotting_data, test_plotting_data) + else: + return self._construct_and_plot_data(training_plotting_data, test_plotting_data) + + + def _construct_and_plot_data_per_motif_size(self, training_plotting_data, test_plotting_data): + output_tables, output_plots = [], [] + + for motif_size in sorted(set(training_plotting_data["motif_size"])): + sub_training_plotting_data = training_plotting_data[training_plotting_data["motif_size"] == motif_size] + sub_test_plotting_data = test_plotting_data[test_plotting_data["motif_size"] == motif_size] + + sub_output_tables, sub_output_plots = self._construct_and_plot_data(sub_training_plotting_data, sub_test_plotting_data, motif_size=motif_size) + + output_tables.extend(sub_output_tables) + output_plots.extend(sub_output_plots) + + return output_tables, output_plots + def 
_construct_and_plot_data(self, training_plotting_data, test_plotting_data, motif_size=None): + training_combined_precision = self._get_combined_precision(training_plotting_data) + test_combined_precision = self._get_combined_precision(test_plotting_data) + + motif_size_suffix = f"_motif_size={motif_size}" if motif_size is not None else "" + motifs_name = f"motifs of length {motif_size}" if motif_size is not None else "motifs" + + output_tables = MotifPerformancePlotHelper.write_output_tables(self, training_plotting_data, test_plotting_data, training_combined_precision, test_combined_precision, motifs_name=motifs_name, file_suffix=motif_size_suffix) + output_plots = MotifPerformancePlotHelper.write_plots(self, training_plotting_data, test_plotting_data, training_combined_precision, test_combined_precision, training_tp_cutoff="auto", test_tp_cutoff="auto", motifs_name=motifs_name, file_suffix=motif_size_suffix) + + return output_tables, output_plots + + def _get_combined_precision(self, plotting_data): + return MotifPerformancePlotHelper.get_combined_precision(plotting_data, + min_points_in_window=self.min_points_in_window, + smoothing_constant1=self.smoothing_constant1, + smoothing_constant2=self.smoothing_constant2) + + def _write_output_tables(self, training_plotting_data, test_plotting_data, training_combined_precision, test_combined_precision, file_suffix=""): + results_table_name = "Confusion matrix and precision/recall scores for significant motifs on the {}" + combined_precision_table_name = "Combined precision scores of motifs on the {} for each TP value on the " + str(self.training_set_name) + + train_results_table = self._write_output_table(training_plotting_data, self.result_path / f"training_set_scores{file_suffix}.csv", results_table_name.format(self.training_set_name)) + test_results_table = self._write_output_table(test_plotting_data, self.result_path / f"test_set_scores{file_suffix}.csv", results_table_name.format(self.test_set_name)) + 
training_combined_precision_table = self._write_output_table(training_combined_precision, self.result_path / f"training_combined_precision{file_suffix}.csv", combined_precision_table_name.format(self.training_set_name)) + test_combined_precision_table = self._write_output_table(test_combined_precision, self.result_path / f"test_combined_precision{file_suffix}.csv", combined_precision_table_name.format(self.test_set_name)) + + return [table for table in [train_results_table, test_results_table, training_combined_precision_table, + test_combined_precision_table] if table is not None] + + def _plot_precision_per_tp(self, file_path, plotting_data, combined_precision, dataset_type, tp_cutoff, motifs_name="motifs"): + return MotifPerformancePlotHelper.plot_precision_per_tp(file_path, plotting_data, combined_precision, dataset_type, + training_set_name=self.training_set_name, + motifs_name=motifs_name, + tp_cutoff=tp_cutoff, + highlight_motifs_name=self.highlight_motifs_name) + + def _plot_precision_recall(self, file_path, plotting_data, min_recall=None, min_precision=None, dataset_type=None, motifs_name="motifs"): + return MotifPerformancePlotHelper.plot_precision_recall(file_path, plotting_data, min_recall=min_recall, min_precision=min_precision, + dataset_type=dataset_type, motifs_name=motifs_name, highlight_motifs_name=self.highlight_motifs_name) + + def _encode_test_data(self, test_dataset): + encoder = self._get_encoder() + params = EncoderParams(result_path=self.result_path / "encoded_test_dataset", + label_config=self._get_label_config(), + pool_size=self.number_of_processes, + learn_model=False) + + return encoder.encode(test_dataset, params) + + def _get_encoder(self): + encoder = MotifEncoder(label=self._get_label_name(), + name=f"motif_encoder_{self.name}") + + encoder.learned_motif_filepath = self.dataset.encoded_data.info["learned_motif_filepath"] + + return encoder + + + def _get_test_dataset_y_true(self, test_dataset): + label_name = self._get_label_name() 
+ positive_class = self._get_positive_class() + + y_true = [sequence.get_attribute(label_name) == positive_class for sequence in test_dataset.get_data()] + + return np.array(y_true) + + def _get_motifs(self): + motif_names = self._get_motif_names() + return [PositionalMotifHelper.string_to_motif(name, "&", "-") for name in motif_names] + + def _get_motif_names(self): + return list(self.dataset.encoded_data.feature_annotations.feature_names) + + def _get_test_dataset(self): + self._set_result_path() + test_dataset = self._import_test_dataset() + self._check_test_dataset(test_dataset) + + return test_dataset + + def _check_test_dataset(self, test_dataset): + self._check_sequence_length(test_dataset) + self._check_dataset_label(test_dataset) + + def _check_sequence_length(self, test_dataset): + legal_length = self._get_legal_sequence_length() + + for sequence in test_dataset.get_data(): + assert len(sequence.get_sequence()) == legal_length, f"{MotifTestSetPerformance.__name__}: the length of the sequences in the test dataset is required to match the length of the original dataset ({legal_length}). Found sequence of length: {len(sequence.get_sequence())}" + + def _get_legal_sequence_length(self): + sequence = next(self.dataset.get_data()) + + return len(sequence.get_sequence()) + + def _check_dataset_label(self, test_dataset): + label_name = self._get_label_name() + label_values = set(self.dataset.encoded_data.labels[label_name]) + + error = f"{self.__class__.__name__}: expected only one label to be set for the test_dataset (label: {label_name}, with values: {', '.join(label_values)}). Instead found: {', '.join(test_dataset.get_label_names())}" + + assert len(test_dataset.get_label_names()) > 0, error + "no label set for the test_dataset." 
+ assert len(test_dataset.get_label_names()) == 1, error + assert test_dataset.get_label_names() == [label_name], error + + test_dataset_label_values = set(test_dataset.get_metadata([label_name])[label_name]) + + assert label_values == test_dataset_label_values, error + f" with classes {', '.join(test_dataset_label_values)}" + + def _get_label_config(self): + return LabelConfiguration([self._get_label()]) + + def _get_label(self): + label_name = self._get_label_name() + label_values = list(set(self.dataset.encoded_data.labels[label_name])) + positive_class = self._get_positive_class() + + return Label(name=label_name, values=label_values, positive_class=positive_class) + + def _get_label_name(self): + return list(self.dataset.encoded_data.labels.keys())[0] + + def _get_positive_class(self): + return self.dataset.encoded_data.info["positive_class"] + + def _set_result_path(self): + self.test_dataset_import_params.result_path = self.result_path / f"test_dataset_{self.name}" + + def _import_test_dataset(self): + return ImportHelper.import_sequence_dataset(self.test_dataset_import_cls, self.test_dataset_import_params, f"test_dataset_{self.name}") + + def check_prerequisites(self) -> bool: + location = MotifTestSetPerformance.__name__ + + if self.dataset.encoded_data is None or self.dataset.encoded_data.info is None: + logging.warning(f"{location}: the dataset is not encoded, skipping this report...") + return False + elif self.dataset.encoded_data.encoding != MotifEncoder.__name__: + logging.warning( + f"{location}: the dataset encoding ({self.dataset.encoded_data.encoding}) " + f"does not match the required encoding ({MotifEncoder.__name__}), skipping this report...") + return False + elif self.dataset.encoded_data.feature_annotations is None: + logging.warning(f"{location}: missing feature annotations for {MotifEncoder.__name__}, " + f"skipping this report...") + return False + else: + return True diff --git 
a/immuneML/reports/encoding_reports/NonMotifSequenceSimilarity.py b/immuneML/reports/encoding_reports/NonMotifSequenceSimilarity.py new file mode 100644 index 000000000..d022174fa --- /dev/null +++ b/immuneML/reports/encoding_reports/NonMotifSequenceSimilarity.py @@ -0,0 +1,173 @@ +from pathlib import Path + +import logging +import plotly.express as px +import pandas as pd +from multiprocessing import Pool +from functools import partial + + +from immuneML.data_model.dataset import SequenceDataset +from immuneML.encodings.motif_encoding.PositionalMotifHelper import PositionalMotifHelper +from immuneML.reports.ReportOutput import ReportOutput +from immuneML.reports.ReportResult import ReportResult +from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder +from immuneML.reports.encoding_reports.EncodingReport import EncodingReport +from immuneML.util.ParameterValidator import ParameterValidator +from immuneML.util.PathBuilder import PathBuilder + + +class NonMotifSequenceSimilarity(EncodingReport): + """ + Plots the similarity of positions outside the motifs of interest. This report can be used to investigate if the + motifs of interest as determined by the :py:obj:`~immuneML.encodings.motif_encoding.MotifEncoder.MotifEncoder` + have a tendency to occur in sequences that are naturally very similar or dissimilar. + + For each motif, the subset of sequences containing the motif is selected, and the hamming distances are computed + between all sequences in this subset. Finally, a plot is created showing the distribution of hamming distances + between the sequences containing the motif. For motifs occurring in sets of very similar sequences, this distribution + will lean towards small hamming distances. Likewise, for motifs occurring in a very diverse set of sequences, the + distribution will lean towards containing more large hamming distances. + + + Specification arguments: + + motif_color_map (dict): An optional mapping between motif sizes and colors. 
If no mapping is given, default colors will be chosen. + + + YAML specification example: + + .. indent with spaces + .. code-block:: yaml + + my_motif_sim: + NonMotifSequenceSimilarity: + motif_color_map: + 3: "#66C5CC" + 4: "#F6CF71" + 5: "#F89C74" + + """ + + @classmethod + def build_object(cls, **kwargs): + if "motif_color_map" in kwargs: + ParameterValidator.assert_type_and_value(kwargs["motif_color_map"], dict, NonMotifSequenceSimilarity.__name__, "motif_color_map") + kwargs["motif_color_map"] = {str(key): value for key, value in kwargs["motif_color_map"].items()} + + return NonMotifSequenceSimilarity(**kwargs) + + def __init__(self, dataset: SequenceDataset = None, result_path: Path = None, + motif_color_map: dict = None, name: str = None, number_of_processes: int = 1): + super().__init__(dataset=dataset, result_path=result_path, name=name, number_of_processes=number_of_processes) + self.sequence_length = 0 + self.motif_color_map = motif_color_map + + def _generate(self): + PathBuilder.build(self.result_path) + + raw_counts = self.get_hamming_distance_counts() + plotting_data = self.get_plotting_data(raw_counts) + + table_1 = self._write_output_table(raw_counts, self.result_path / "sequence_hamming_distances_raw.tsv", + "Hamming distances between sequences sharing the same motif, raw counts.") + table_2 = self._write_output_table(plotting_data, self.result_path / "sequence_hamming_distances_percentage.tsv", + "Hamming distances between sequences sharing the same motif, expressed as percentage.") + + output_figure = self._safe_plot(plotting_data=plotting_data) + + return ReportResult( + name=self.name, + output_figures=[output_figure], + output_tables=[table_1, table_2], + ) + + def get_hamming_distance_counts(self): + np_sequences = PositionalMotifHelper.get_numpy_sequence_representation(self.dataset) + self.sequence_length = len(np_sequences[0]) + + raw_counts = pd.DataFrame([self._make_hamming_distance_hist_for_motif(motif_presence, np_sequences) + for
motif_presence in self.dataset.encoded_data.examples.T]) + + ### Original code with multiprocessing (fails with bionumpy + pickle error?) + # with Pool(processes=self.number_of_processes) as pool: + # partial_func = partial(self._make_hamming_distance_hist_for_motif, np_sequences=np_sequences) + # raw_counts = pd.DataFrame(pool.map(partial_func, self.dataset.encoded_data.examples.T)) + + raw_counts["motif"] = self.dataset.encoded_data.feature_names + + return raw_counts + def _make_hamming_distance_hist_for_motif(self, motif_presence, np_sequences): + positive_sequences = np_sequences[motif_presence] + return self._make_hamming_distance_hist(positive_sequences) + def _make_hamming_distance_hist(self, sequences): + counts = {i: 0 for i in range(self.sequence_length)} + for dist in self._calculate_all_hamming_distances(sequences): + counts[dist] += 1 + + return counts + def _calculate_all_hamming_distances(self, sequences): + for i in range(len(sequences)): + for j in range(i + 1, len(sequences)): + yield sum(sequences[i] != sequences[j]) + + def get_plotting_data(self, raw_counts): + motif_col = raw_counts.loc[:,"motif"] + counts = raw_counts.loc[:, raw_counts.columns != "motif"] + + total = counts.apply(sum, axis=1) + plotting_data = counts.div(total, axis=0) + plotting_data["motif"] = motif_col + plotting_data = plotting_data[total > 0] + plotting_data = plotting_data.loc[::-1] + plotting_data = pd.melt(plotting_data, id_vars=["motif"], var_name="Hamming", value_name="Percentage") + + plotting_data["Hamming"] = plotting_data["Hamming"].astype(str) + plotting_data["motif_size"] = plotting_data["motif"].apply(lambda motif: len(motif.split("-")[0].split("&"))) + + return plotting_data + + def _plot(self, plotting_data) -> ReportOutput: + if self.motif_color_map is not None: + color_discrete_map = self.motif_color_map + color_discrete_sequence = None + else: + color_discrete_map = None + color_discrete_sequence = px.colors.sequential.Sunsetdark + + 
plotting_data["motif_size"] = plotting_data["motif_size"].astype(str) + + fig = px.line(plotting_data, x='Hamming', y='Percentage', markers=True, line_group='motif', color="motif_size", + template='plotly_white', color_discrete_sequence=color_discrete_sequence, color_discrete_map=color_discrete_map, + labels={"Percentage": "Percentage of sequence pairs containing motif", + "Hamming": "Hamming distance between sequences", + "motif_size": "Motif length<br>(number of<br>amino acids)"}) + + fig.layout.yaxis.tickformat = ',.0%' + + fig.update_traces(opacity=0.7) + + fig.update_layout( + font=dict( + size=14, + ) + ) + + output_path = self.result_path / "sequence_hamming_distances.html" + + fig.write_html(str(output_path)) + + return ReportOutput(path=output_path, name="Hamming distances between sequences sharing the same motif") + + def check_prerequisites(self): + valid_encodings = [MotifEncoder.__name__] + + if self.dataset.encoded_data is None or self.dataset.encoded_data.info is None: + logging.warning(f"{self.__class__.__name__}: the dataset is not encoded, skipping this report...") + return False + elif self.dataset.encoded_data.encoding not in valid_encodings: + logging.warning(f"{self.__class__.__name__}: the dataset encoding ({self.dataset.encoded_data.encoding}) was not in the list of valid " + f"encodings ({valid_encodings}), skipping this report...") + return False + else: + return True diff --git a/immuneML/reports/encoding_reports/PositionalMotifFrequencies.py b/immuneML/reports/encoding_reports/PositionalMotifFrequencies.py new file mode 100644 index 000000000..e082feae6 --- /dev/null +++ b/immuneML/reports/encoding_reports/PositionalMotifFrequencies.py @@ -0,0 +1,262 @@ +from pathlib import Path + +import logging +import plotly.express as px +import pandas as pd +from typing import List + +from immuneML.data_model.dataset import SequenceDataset +from immuneML.encodings.motif_encoding.PositionalMotifHelper import ( + PositionalMotifHelper, +) +from immuneML.environment.EnvironmentSettings import EnvironmentSettings +from immuneML.environment.SequenceType import SequenceType +from immuneML.reports.PlotlyUtil import PlotlyUtil +from immuneML.reports.ReportOutput import ReportOutput +from immuneML.reports.ReportResult import ReportResult +from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder +from immuneML.reports.encoding_reports.EncodingReport import EncodingReport +from
immuneML.util.ParameterValidator import ParameterValidator +from immuneML.util.PathBuilder import PathBuilder + + +class PositionalMotifFrequencies(EncodingReport): + """ + This report must be used in combination with the :py:obj:`~immuneML.encodings.motif_encoding.MotifEncoder.MotifEncoder`. + Plots a stacked bar plot of amino acid occurrence at different indices in any given dataset, along with a plot + investigating motif continuity which displays a bar plot of the gap sizes between the amino acids in the motifs in + the given dataset. Note that a distance of 1 means that the amino acids are continuous (next to each other). + + Arguments: + + + YAML specification example: + + .. indent with spaces + .. code-block:: yaml + + my_pos_motif_report: PositionalMotifFrequencies + + """ + + @classmethod + def build_object(cls, **kwargs): + if "motif_color_map" in kwargs: + ParameterValidator.assert_type_and_value(kwargs["motif_color_map"], dict, PositionalMotifFrequencies.__name__, "motif_color_map") + kwargs["motif_color_map"] = {str(key): value for key, value in kwargs["motif_color_map"].items()} + + return PositionalMotifFrequencies(**kwargs) + + def __init__(self, dataset: SequenceDataset = None, result_path: Path = None, + motif_color_map: dict = None, name: str = None, number_of_processes: int = 1): + super().__init__(dataset=dataset, result_path=result_path, name=name, number_of_processes=number_of_processes) + self.motif_color_map = motif_color_map + + def get_sequence_length(self): + my_sequence = next(self.dataset.get_data()) + return len(my_sequence.get_sequence()) + + def _generate(self): + PathBuilder.build(self.result_path) + + motifs = self.dataset.encoded_data.feature_names + positional_aa_counts_df = self._get_positional_aa_counts(motifs) + max_gap_size_df = self._get_max_gap_sizes(motifs) + total_gap_size_df = self._get_total_gap_sizes(motifs) + + + positional_aa_counts_table = self._write_output_table(positional_aa_counts_df, + 
file_path=self.result_path / f"positional_aa_counts.csv", + name="Frequencies of amino acids found in the high-precision high-recall motifs") + + max_gap_size_table = self._write_output_table(max_gap_size_df, + file_path=self.result_path / f"max_gap_size_table.csv", + name="Maximum gap sizes within motifs") + + total_gap_size_table = self._write_output_table(total_gap_size_df, + file_path=self.result_path / f"total_gap_size_table.csv", + name="Total (summed) gap sizes within motifs") + + output_figure1 = self._safe_plot(positional_aa_counts_df=positional_aa_counts_df, plot_callable="_plot_positional_aa_counts") + output_figure2 = self._safe_plot(gap_size_df=max_gap_size_df, x="max_gap_size", plot_callable="_plot_gap_sizes") + output_figure3 = self._safe_plot(gap_size_df=total_gap_size_df, x="total_gap_size", plot_callable="_plot_gap_sizes") + + return ReportResult( + name=self.name, + output_figures=[fig for fig in [output_figure1, output_figure2, output_figure3] if fig is not None], + output_tables=[max_gap_size_table, total_gap_size_table, positional_aa_counts_table], + ) + + def _get_total_gap_sizes(self, motifs): + data = {"motif_size": [], + "total_gap_size": [], + "occurrence": []} + + gap_size_count = {} + + for motif in motifs: + motif_indices, amino_acids = PositionalMotifHelper.string_to_motif(motif, "&", "-") + total_gap_size = (max(motif_indices) - min(motif_indices)) - len(motif_indices) + 1 + motif_size = len(motif_indices) + + if motif_size not in gap_size_count: + gap_size_count[motif_size] = {total_gap_size: 1} + else: + if total_gap_size not in gap_size_count[motif_size]: + gap_size_count[motif_size][total_gap_size] = 1 + else: + gap_size_count[motif_size][total_gap_size] += 1 + + for motif_size, counts in gap_size_count.items(): + for total_gap_size, occurrence in counts.items(): + data["motif_size"].append(motif_size) + data["total_gap_size"].append(total_gap_size) + data["occurrence"].append(occurrence) + + return pd.DataFrame(data) + + def 
_get_max_gap_sizes(self, motifs): + gap_size_count = { + motif_size: { + gap_size: 0 for gap_size in range(0, self.get_sequence_length() - 1) + } + for motif_size in range(self.get_sequence_length()) + } + + for motif in motifs: + indices, amino_acids = PositionalMotifHelper.string_to_motif(motif, "&", "-") + motif_size = len(indices) + + if motif_size > 1: + gap_size = max([indices[i+1]-indices[i] -1 for i in range(motif_size-1)]) + gap_size_count[motif_size][gap_size] += 1 + + motif_sizes = list() + max_gap_sizes = list() + occurrence = list() + for motif_size in gap_size_count: + if sum(gap_size_count[motif_size].values()) != 0: + for gap_size in gap_size_count[motif_size]: + motif_sizes.append(str(motif_size)) + max_gap_sizes.append(gap_size) + occurrence.append(gap_size_count[motif_size][gap_size]) + + gap_size_df = pd.DataFrame() + gap_size_df["motif_size"] = motif_sizes + gap_size_df["max_gap_size"] = max_gap_sizes + gap_size_df["occurrence"] = occurrence + + return gap_size_df + + def _get_positional_aa_counts(self, motifs): + positional_aa_counts = { + aa: [0 for i in range(self.get_sequence_length())] + for aa in EnvironmentSettings.get_sequence_alphabet(SequenceType.AMINO_ACID) + } + + for motif in motifs: + indices, amino_acids = PositionalMotifHelper.string_to_motif( + motif, "&", "-" + ) + + for index, amino_acid in zip(indices, amino_acids): + positional_aa_counts[amino_acid][index] += 1 + + positional_aa_counts_df = pd.DataFrame(positional_aa_counts) + positional_aa_counts_df = positional_aa_counts_df.loc[ + :, (positional_aa_counts_df != 0).any(axis=0) + ] + + # start counting positions at 1 + positional_aa_counts_df.index = [idx+1 for idx in list(positional_aa_counts_df.index)] + + return positional_aa_counts_df + + def _plot_gap_sizes(self, gap_size_df, x): + file_path = self.result_path / f"{x}.html" + + gap_size_df["occurrence_total"] = gap_size_df.groupby("motif_size")["occurrence"].transform(sum) + gap_size_df["occurrence_percentage"] = 
gap_size_df["occurrence"] / gap_size_df["occurrence_total"] + + x_label = x.replace("_", " ").capitalize() + + if self.motif_color_map is not None: + color_discrete_map = self.motif_color_map + color_discrete_sequence = None + else: + color_discrete_map = None + color_discrete_sequence = self._get_color_discrete_sequence() + + gap_size_df = gap_size_df.sort_values(by=x) + gap_size_df["motif_size"] = gap_size_df["motif_size"].astype(str) + + gap_size_fig = px.line( + gap_size_df, + x=x, + y="occurrence_percentage", + color="motif_size", + color_discrete_map=color_discrete_map, + color_discrete_sequence=color_discrete_sequence, + category_orders=dict(motif_size=sorted([int(rate) for rate in gap_size_df["motif_size"].unique()])), + template="plotly_white", + markers=True, + labels={ + x: x_label, + "occurrence_percentage": "Percentage of motifs", + "motif_size": "Motif size", + }, + ) + gap_size_fig.layout.yaxis.tickformat = ',.0%' + + gap_size_fig.update_layout(font={"size": 14}, xaxis={"tickmode": "linear"}) + gap_size_fig.write_html(str(file_path)) + + return ReportOutput( + path=file_path, + name=f"Gap size between amino acids in high-precision high-recall motifs", + ) + + def _plot_positional_aa_counts(self, positional_aa_counts_df): + file_path = self.result_path / f"positional_motif_frequencies.html" + + # reverse sort column names makes amino acids stack alphabetically in bar chart + positional_aa_counts_df = positional_aa_counts_df[sorted(positional_aa_counts_df.columns)[::-1]] + + positional_aa_counts_fig = px.bar( + positional_aa_counts_df, + labels={ + "index": "Amino acid position", + "value": "Frequency across motifs", + }, + text="variable", + color_discrete_map=PlotlyUtil.get_amino_acid_color_map(), + template="plotly_white", + ) + positional_aa_counts_fig.update_layout( + showlegend=False, font={"size": 14}, xaxis={"tickmode": "linear"} + ) + positional_aa_counts_fig.write_html(str(file_path)) + return ReportOutput( + path=file_path, + 
name=f"Frequencies of amino acids found in the high-precision high-recall motifs", + ) + + def _get_color_discrete_sequence(self): + return px.colors.qualitative.Pastel[:-1] + px.colors.qualitative.Set3 + + def check_prerequisites(self): + valid_encodings = [MotifEncoder.__name__] + + if self.dataset.encoded_data is None or self.dataset.encoded_data.info is None: + logging.warning( + "PositionalMotifFrequencies: the dataset is not encoded, skipping this report..." + ) + return False + elif self.dataset.encoded_data.encoding not in valid_encodings: + logging.warning( + f"PositionalMotifFrequencies: the dataset encoding ({self.dataset.encoded_data.encoding}) was not in the list of valid " + f"encodings ({valid_encodings}), skipping this report..." + ) + return False + + return True diff --git a/immuneML/reports/ml_reports/BinaryFeaturePrecisionRecall.py b/immuneML/reports/ml_reports/BinaryFeaturePrecisionRecall.py new file mode 100644 index 000000000..5e76373b5 --- /dev/null +++ b/immuneML/reports/ml_reports/BinaryFeaturePrecisionRecall.py @@ -0,0 +1,158 @@ +import logging +import warnings +from pathlib import Path + +import numpy as np +import pandas as pd +import plotly.express as px + +from sklearn.metrics import precision_score, recall_score, accuracy_score, balanced_accuracy_score + +from immuneML.ml_methods.util.Util import Util +from immuneML.data_model.dataset.Dataset import Dataset +from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder +from immuneML.hyperparameter_optimization.HPSetting import HPSetting +from immuneML.ml_methods.BinaryFeatureClassifier import BinaryFeatureClassifier +from immuneML.reports.ReportOutput import ReportOutput +from immuneML.reports.ReportResult import ReportResult +from immuneML.reports.ml_reports.MLReport import MLReport +from immuneML.util.PathBuilder import PathBuilder + + + + +class BinaryFeaturePrecisionRecall(MLReport): + """ + Plots precision and recall on the training, validation (if available) and test sets for each subset of learned binary motifs. + + Arguments: + + + + YAML specification: + + .. indent with spaces + ..
code-block:: yaml + + my_report: BinaryFeaturePrecisionRecall + """ + + @classmethod + def build_object(cls, **kwargs): + + return BinaryFeaturePrecisionRecall(**kwargs) + + def __init__(self, train_dataset: Dataset = None, test_dataset: Dataset = None, + method: BinaryFeatureClassifier = None, result_path: Path = None, name: str = None, hp_setting: HPSetting = None, + label=None, number_of_processes: int = 1): + super().__init__(train_dataset=train_dataset, test_dataset=test_dataset, method=method, result_path=result_path, + name=name, hp_setting=hp_setting, label=label, number_of_processes=number_of_processes) + + def _generate(self): + PathBuilder.build(self.result_path) + + encoded_train_data, encoded_val_data = self._split_train_val_data(self.train_dataset.encoded_data) + encoded_test_data = self.test_dataset.encoded_data + + plotting_data_train = self._compute_plotting_data(encoded_train_data) + plotting_data_val = self._compute_plotting_data(encoded_val_data) + plotting_data_test = self._compute_plotting_data(encoded_test_data) + + train_table = self._write_plotting_data(plotting_data_train, dataset_type="training") + val_table = self._write_plotting_data(plotting_data_val, dataset_type="validation") + test_table = self._write_plotting_data(plotting_data_test, dataset_type="test") + + # train_table = self._write_output_table(plotting_data_train, self.result_path / "training_performance.tsv", name="Training set performance of every subset of binary features") + # test_table = self._write_output_table(plotting_data_test, self.result_path / "test_performance.tsv", name="Test set performance of every subset of binary features") + + train_fig = self._safe_plot(plotting_data=plotting_data_train, dataset_type="training") + val_fig = self._safe_plot(plotting_data=plotting_data_val, dataset_type="validation") + test_fig = self._safe_plot(plotting_data=plotting_data_test, dataset_type="test") + + return ReportResult(self.name, + info="Precision and recall scores for 
each subset of learned binary motifs", + output_tables=[table for table in [train_table, val_table, test_table] if table is not None], + output_figures=[fig for fig in [train_fig, val_fig, test_fig] if fig is not None]) + + def _split_train_val_data(self, encoded_train_val_data): + if self.method.train_indices and self.method.val_indices: + encoded_train_data = Util.subset_encoded_data(encoded_train_val_data, self.method.train_indices) + encoded_val_data = Util.subset_encoded_data(encoded_train_val_data, self.method.val_indices) + else: + encoded_train_data = encoded_train_val_data + encoded_val_data = None + + return encoded_train_data, encoded_val_data + + def _compute_plotting_data(self, encoded_data): + if encoded_data is None: + return None + + rule_tree_indices = self.method.rule_tree_indices + + data = {"n_rules": [], + "precision": [], + "recall": [], + "accuracy": [], + "balanced_accuracy": []} + + y_true_bool = np.array([cls == self.label.positive_class for cls in encoded_data.labels[self.label.name]]) + + if self.method.keep_all: + rules_range = range(len(rule_tree_indices), len(rule_tree_indices) + 1) + else: + rules_range = range(1, len(rule_tree_indices) + 1) + + for n_rules in rules_range: + rule_subtree = rule_tree_indices[:n_rules] + + y_pred_bool = self.method._get_rule_tree_predictions_bool(encoded_data, rule_subtree) + + data["n_rules"].append(n_rules) + data["precision"].append(precision_score(y_true_bool, y_pred_bool)) + data["recall"].append(recall_score(y_true_bool, y_pred_bool)) + data["accuracy"].append(accuracy_score(y_true_bool, y_pred_bool)) + data["balanced_accuracy"].append(balanced_accuracy_score(y_true_bool, y_pred_bool)) + + return pd.DataFrame(data) + + def _write_plotting_data(self, plotting_data, dataset_type): + if plotting_data is not None: + return self._write_output_table(plotting_data, self.result_path / f"{dataset_type}_performance.tsv", name=f"{dataset_type.title()} set performance of every subset of binary features") + + 
def _plot(self, plotting_data, dataset_type): + fig = px.line(plotting_data, x="recall", y="precision", + range_x=[0, 1.01], range_y=[0, 1.01], + template="plotly_white", + hover_data=["n_rules"], + color_discrete_sequence=px.colors.diverging.Tealrose, + markers=True) + + fig.update_traces(marker={'size': 4}) + + file_path = self.result_path / f"{dataset_type}_precision_recall.html" + + fig.write_html(str(file_path)) + + return ReportOutput(path=file_path, + name=f"Precision and recall scores on the {dataset_type} set for motif subsets") + + def check_prerequisites(self): + location = BinaryFeaturePrecisionRecall.__name__ + + run_report = True + + if not isinstance(self.method, BinaryFeatureClassifier): + logging.warning(f"{location} report can only be created for {BinaryFeatureClassifier.__name__}, but got " + f"{type(self.method).__name__} instead. {location} report will not be created.") + run_report = False + + if self.train_dataset.encoded_data is None or self.train_dataset.encoded_data.examples is None or self.train_dataset.encoded_data.feature_names is None or self.train_dataset.encoded_data.encoding != MotifEncoder.__name__: + warnings.warn( + f"{location}: this report can only be created for a dataset encoded with the {MotifEncoder.__name__}. Report {self.name} will not be created.") + run_report = False + + if hasattr(self.method, "keep_all") and self.method.keep_all: + warnings.warn(f"{location}: keep_all was set to True for ML method {self.method.name}, only one data point will be plotted. 
") + + return run_report diff --git a/immuneML/reports/ml_reports/ROCCurve.py b/immuneML/reports/ml_reports/ROCCurve.py index 73e9efbfb..872588a8a 100644 --- a/immuneML/reports/ml_reports/ROCCurve.py +++ b/immuneML/reports/ml_reports/ROCCurve.py @@ -46,6 +46,7 @@ def _generate(self) -> ReportResult: fpr, tpr, _ = roc_curve(true_y, predicted_y) roc_auc = auc(fpr, tpr) + trace1 = go.Scatter(x=fpr, y=tpr, mode='lines', line=dict(color='darkorange', width=2), diff --git a/immuneML/reports/ml_reports/TrainingPerformance.py b/immuneML/reports/ml_reports/TrainingPerformance.py index 2339dbada..f31cdd7f2 100644 --- a/immuneML/reports/ml_reports/TrainingPerformance.py +++ b/immuneML/reports/ml_reports/TrainingPerformance.py @@ -75,6 +75,7 @@ def _generate(self) -> ReportResult: predicted_proba_y = self.method.predict_proba(X, self.label)[self.label.name][self.label.positive_class] true_y = self.train_dataset.encoded_data.labels[self.label.name] classes = self.method.get_classes() + example_weights = self.train_dataset.get_example_weights() PathBuilder.build(self.result_path) @@ -87,7 +88,7 @@ def _generate(self) -> ReportResult: for metric in self.metrics_set: _score = MetricUtil.score_for_metric(metric=ClassificationMetric.get_metric(metric), predicted_y=predicted_y, predicted_proba_y=predicted_proba_y, - true_y=true_y, classes=classes) + true_y=true_y, example_weights=example_weights, classes=classes) if metric == 'CONFUSION_MATRIX': self._generate_heatmap(classes, classes, _score, metric, output) diff --git a/immuneML/util/CompAIRRHelper.py b/immuneML/util/CompAIRRHelper.py index 31b42bdc9..c65875f79 100644 --- a/immuneML/util/CompAIRRHelper.py +++ b/immuneML/util/CompAIRRHelper.py @@ -7,6 +7,7 @@ from immuneML.data_model.receptor.RegionType import RegionType from immuneML.environment.EnvironmentSettings import EnvironmentSettings +from immuneML.environment.SequenceType import SequenceType from immuneML.util.CompAIRRParams import CompAIRRParams @@ -60,7 +61,7 @@ def 
get_cmd_args(compairr_params: CompAIRRParams, input_file_list, result_path): command = '-m' if compairr_params.do_repertoire_overlap and not compairr_params.do_sequence_matching else '-x' return [str(compairr_params.compairr_path), command, "-d", str(compairr_params.differences), "-t", str(compairr_params.threads)] + \ - indels_args + frequency_args + ignore_genes + output_args + input_file_list + output_pairs + cdr3_indicator + indels_args + frequency_args + ignore_genes + output_args + [str(file) for file in input_file_list] + output_pairs + cdr3_indicator @staticmethod def write_repertoire_file(repertoire_dataset=None, filename=None, compairr_params=None, repertoires: list = None, @@ -84,6 +85,37 @@ def write_repertoire_file(repertoire_dataset=None, filename=None, compairr_param mode = "a" header = False + @staticmethod + def write_sequences_file(sequence_dataset, filename, compairr_params, repertoire_id="sequence_dataset"): + compairr_data = {"junction_aa": [], + "repertoire_id": [], + "sequence_id": []} + + if not compairr_params.ignore_genes: + compairr_data["v_call"] = [] + compairr_data["j_call"] = [] + + if not compairr_params.ignore_counts: + compairr_data["duplicate_count"] = [] + + for sequence in sequence_dataset.get_data(): + compairr_data["junction_aa"].append(sequence.get_sequence(sequence_type=SequenceType.AMINO_ACID)) + + assert sequence.identifier is not None, f"{CompAIRRHelper.__name__}: sequence identifiers must be set when exporting a sequence dataset for CompAIRR" + compairr_data["sequence_id"].append(sequence.identifier) + compairr_data["repertoire_id"].append(repertoire_id) + + if not compairr_params.ignore_genes: + compairr_data["v_call"].append(sequence.get_attribute("v_gene")) + compairr_data["j_call"].append(sequence.get_attribute("j_gene")) + + if not compairr_params.ignore_counts: + compairr_data["duplicate_count"].append(sequence.get_attribute("count")) + + df = pd.DataFrame(compairr_data) + + df.to_csv(filename, mode="w", 
header=True, index=False, sep="\t") + @staticmethod def get_repertoire_contents(repertoire, compairr_params, export_sequence_id=False): attributes = [EnvironmentSettings.get_sequence_type().value, "duplicate_count"] diff --git a/immuneML/util/EncoderHelper.py b/immuneML/util/EncoderHelper.py index 8c2565b23..208c5dc5a 100644 --- a/immuneML/util/EncoderHelper.py +++ b/immuneML/util/EncoderHelper.py @@ -1,12 +1,15 @@ import copy import pickle +import warnings from immuneML.IO.dataset_export.ImmuneMLExporter import ImmuneMLExporter from immuneML.caching.CacheHandler import CacheHandler from immuneML.data_model.dataset.Dataset import Dataset +from immuneML.data_model.dataset.ElementDataset import ElementDataset from immuneML.data_model.dataset.RepertoireDataset import RepertoireDataset from immuneML.encodings.EncoderParams import EncoderParams from immuneML.environment.Label import Label +from immuneML.environment.LabelConfiguration import LabelConfiguration from immuneML.pairwise_repertoire_comparison.ComparisonData import ComparisonData from immuneML.util.PathBuilder import PathBuilder @@ -29,6 +32,7 @@ def prepare_training_ids(dataset: Dataset, params: EncoderParams): @staticmethod def get_current_dataset(dataset, context): + '''Retrieves the full dataset (training+validation+test) if present in context, otherwise return the given dataset''' return dataset if context is None or "dataset" not in context else context["dataset"] @staticmethod @@ -69,5 +73,52 @@ def check_dataset_type_available_in_mapping(dataset, class_name): raise ValueError(f"{class_name.__name__}: this encoder is not defined for dataset of type {dataset.__class__.__name__}. 
" f"Valid dataset types for this encoder are: {', '.join(list(class_name.dataset_mapping.keys()))}") + @staticmethod + def encode_element_dataset_labels(dataset: ElementDataset, label_config: LabelConfiguration): + + labels = {name: [] for name in label_config.get_labels_by_name()} + + for sequence in dataset.get_data(): + for label_name in label_config.get_labels_by_name(): + label = sequence.get_attribute(label_name) + labels[label_name].append(label) + + return labels + + @staticmethod + def check_positive_class_labels(label_config: LabelConfiguration, location: str): + ''' + Performs checks for Encoders that explicitly predict a positive class. These Encoders can only be trained for a + single binary label at a time. + ''' + + labels = label_config.get_label_objects() + assert len(labels) == 1, f"{location}: this encoding works only for single label." + + label = labels[0] + + assert isinstance(label, Label) and label.positive_class is not None and label.positive_class != "", \ + f"{location}: positive_class parameter was not set for label {label}. It has to be set to determine the " \ + f"receptor sequences associated with the positive class. " \ + f"To use this encoder, in the label definition in the specification of the instruction, define " \ + f"the positive class for the label. See documentation for this encoder for more details." + + assert len(label.values) == 2, f"{location}: only binary classification (2 classes) is possible when extracting " \ + f"relevant sequences for the label, but got these classes for label {label.name} instead: {label.values}." 
+ + @staticmethod + def get_example_weights_by_identifiers(dataset, example_identifiers): + weights = dataset.get_example_weights() + + if weights is not None: + weights_dict = dict(zip(dataset.get_example_ids(), weights)) + + return [weights_dict[identifier] for identifier in example_identifiers] + @staticmethod + def get_single_label_name_from_config(label_config: LabelConfiguration, location="EncoderHelper"): + assert label_config.get_label_count() != 0, f"{location}: the dataset does not contain labels, please specify a label under 'instructions'." + assert label_config.get_label_count() == 1, f"{location}: multiple labels were found: {', '.join(label_config.get_labels_by_name())}, expected a single label." + + return label_config.get_labels_by_name()[0] diff --git a/immuneML/util/ImportHelper.py b/immuneML/util/ImportHelper.py index 3a05e98e0..511d365e2 100644 --- a/immuneML/util/ImportHelper.py +++ b/immuneML/util/ImportHelper.py @@ -142,7 +142,7 @@ def load_repertoire_as_object(import_class, metadata_row, params: DatasetImportP return repertoire except Exception as exception: raise RuntimeError( - f"{ImportHelper.__name__}: error when importing file {metadata_row['filename']}.") from exception + f"{ImportHelper.__name__}: error when importing file {metadata_row['filename']}: {exception}") from exception @staticmethod def load_sequence_dataframe(filepath, params, alternative_load_func=None): @@ -293,11 +293,13 @@ def is_illegal_sequence(sequence, legal_alphabet) -> bool: def prepare_frame_type_list(params: DatasetImportParams) -> list: frame_type_list = [] if params.import_productive: - frame_type_list.append(SequenceFrameType.IN.name) + frame_type_list.append(SequenceFrameType.IN.value) + if params.import_unknown_productivity: + frame_type_list.append(SequenceFrameType.UNDEFINED.value) if params.import_out_of_frame: - frame_type_list.append(SequenceFrameType.OUT.name) + frame_type_list.append(SequenceFrameType.OUT.value) if params.import_with_stop_codon: - 
frame_type_list.append(SequenceFrameType.STOP.name) + frame_type_list.append(SequenceFrameType.STOP.value) return frame_type_list @staticmethod diff --git a/immuneML/util/MotifPerformancePlotHelper.py b/immuneML/util/MotifPerformancePlotHelper.py new file mode 100644 index 000000000..26facba8f --- /dev/null +++ b/immuneML/util/MotifPerformancePlotHelper.py @@ -0,0 +1,290 @@ + +import warnings +from scipy.stats import lognorm +import pandas as pd + +import plotly.express as px +import plotly.graph_objects as go + +from immuneML.encodings.motif_encoding.PositionalMotifHelper import PositionalMotifHelper +from immuneML.reports.ReportOutput import ReportOutput + + +class MotifPerformancePlotHelper(): + + @staticmethod + def get_plotting_data(training_encoded_data, test_encoded_data, highlight_motifs_path=None, highlight_motifs_name="highlight"): + training_feature_annotations = MotifPerformancePlotHelper._get_annotated_feature_annotations(training_encoded_data, highlight_motifs_path, highlight_motifs_name) + test_feature_annotations = MotifPerformancePlotHelper._get_annotated_feature_annotations(test_encoded_data, highlight_motifs_path, highlight_motifs_name) + + training_feature_annotations["training_TP"] = training_feature_annotations["TP"] + test_feature_annotations = MotifPerformancePlotHelper.merge_train_test_feature_annotations(training_feature_annotations, test_feature_annotations) + + return training_feature_annotations, test_feature_annotations + + @staticmethod + def _get_annotated_feature_annotations(encoded_data, highlight_motifs_path, highlight_motifs_name): + feature_annotations = encoded_data.feature_annotations.copy() + MotifPerformancePlotHelper._annotate_confusion_matrix(feature_annotations) + MotifPerformancePlotHelper._annotate_highlight(feature_annotations, highlight_motifs_path, highlight_motifs_name) + + return feature_annotations + + @staticmethod + def _annotate_confusion_matrix(feature_annotations): + feature_annotations["precision"] = 
feature_annotations.apply( + lambda row: 0 if row["TP"] == 0 else row["TP"] / (row["TP"] + row["FP"]), axis="columns") + + feature_annotations["recall"] = feature_annotations.apply( + lambda row: 0 if row["TP"] == 0 else row["TP"] / (row["TP"] + row["FN"]), axis="columns") + + @staticmethod + def _annotate_highlight(feature_annotations, highlight_motifs_path, highlight_motifs_name): + feature_annotations["highlight"] = MotifPerformancePlotHelper._get_highlight(feature_annotations, highlight_motifs_path, highlight_motifs_name) + + @staticmethod + def _get_highlight(feature_annotations, highlight_motifs_path, highlight_motifs_name): + if highlight_motifs_path is not None: + # highlight_motifs = [PositionalMotifHelper.motif_to_string(indices, amino_acids, motif_sep="-", newline=False) + # for indices, amino_acids in PositionalMotifHelper.read_motifs_from_file(highlight_motifs_path)] + + highlight_motifs = PositionalMotifHelper.read_motifs_from_file(highlight_motifs_path) + motifs = [PositionalMotifHelper.string_to_motif(motif, value_sep="&", motif_sep="-") for motif in feature_annotations["feature_names"]] + + return [highlight_motifs_name if MotifPerformancePlotHelper._is_highlight_motif(motif, highlight_motifs) else "Motif" + for motif in motifs] + else: + return ["Motif"] * len(feature_annotations) + + @staticmethod + def _is_highlight_motif(motif, highlight_motifs): + for highlight_motif in highlight_motifs: + if motif == highlight_motif: + return True + + if len(motif[0]) > len(highlight_motif[0]): + if MotifPerformancePlotHelper.is_sub_motif(highlight_motif, motif): + return True + + return False + + @staticmethod + def is_sub_motif(short_motif, long_motif): + assert len(long_motif[0]) > len(short_motif[0]) + + long_motif_dict = {long_motif[0][i]: long_motif[1][i] for i in range(len(long_motif[0]))} + + for idx, aa in zip(short_motif[0], short_motif[1]): + if idx in long_motif_dict.keys(): + if long_motif_dict[idx] != aa: + return False + else: + return False + 
+ return True + + + @staticmethod + def merge_train_test_feature_annotations(training_feature_annotations, test_feature_annotations): + training_info_to_merge = training_feature_annotations[["feature_names", "training_TP"]].copy() + test_info_to_merge = test_feature_annotations.copy() + + merged_train_test_info = training_info_to_merge.merge(test_info_to_merge) + + return merged_train_test_info + + @staticmethod + def get_combined_precision(plotting_data, min_points_in_window, smoothing_constant1, smoothing_constant2): + group_by_tp = plotting_data.groupby("training_TP") + + combined_precision = group_by_tp["TP"].sum() / (group_by_tp["TP"].sum() + group_by_tp["FP"].sum()) + + df = pd.DataFrame({"training_TP": list(combined_precision.index), + "combined_precision": list(combined_precision)}) + + df["smooth_combined_precision"] = MotifPerformancePlotHelper._smooth_combined_precision(list(combined_precision.index), + list(combined_precision), + list(group_by_tp["TP"].count()), + min_points_in_window, + smoothing_constant1, + smoothing_constant2) + + return df + + @staticmethod + def _smooth_combined_precision(x, y, weights, min_points_in_window, smoothing_constant1, smoothing_constant2): + smoothed_y = [] + + for i in range(len(x)): + scale = MotifPerformancePlotHelper._get_lognorm_scale(x, i, weights, min_points_in_window, smoothing_constant1, smoothing_constant2) + + lognorm_for_this_x = lognorm.pdf(x, s=0.1, loc=x[i] - scale, scale=scale) + + smoothed_y.append(sum(lognorm_for_this_x * y) / sum(lognorm_for_this_x)) + + return smoothed_y + + @staticmethod + def _get_lognorm_scale(x, i, weights, min_points_in_window, smoothing_constant1, smoothing_constant2): + window_size = MotifPerformancePlotHelper._determine_window_size(x, i, weights, min_points_in_window) + return window_size * smoothing_constant1 + smoothing_constant2 + + @staticmethod + def _determine_window_size(x, i, weights, min_points_in_window): + x_rng = 0 + n_data_points = weights[i] + + if sum(weights) 
< min_points_in_window: + warnings.warn(f"{MotifPerformancePlotHelper.__name__}: min_points_in_window ({min_points_in_window}) is larger than the total number of points in the plot ({sum(weights)}). Setting min_points_in_window to {sum(weights)} instead...") + min_points_in_window = sum(weights) + else: + min_points_in_window = min_points_in_window + + while n_data_points < min_points_in_window: + x_rng += 1 + + to_select = [j for j in range(len(x)) if (x[i] - x_rng) <= x[j] <= (x[i] + x_rng)] + lower_index = min(to_select) + upper_index = max(to_select) + + n_data_points = sum(weights[lower_index:upper_index + 1]) + + return x_rng + + @staticmethod + def plot_precision_per_tp(file_path, plotting_data, combined_precision, dataset_type, training_set_name, + tp_cutoff, motifs_name="motifs", highlight_motifs_name="highlight"): + # fig = px.scatter(plotting_data, + # y="precision", x="training_TP", hover_data=["feature_names"], + # range_y=[0, 1.01], color_discrete_sequence=["#74C4C4"], + # # stripmode="overlay", + # log_x=True, + # labels={ + # "precision": f"Precision ({dataset_type})", + # "feature_names": "Motif", + # "training_TP": f"True positive predictions ({training_set_name})" + # }, template="plotly_white") + + + # make 'base figure' with 1 point + fig = px.scatter(plotting_data, y=[0], x=[0], range_y=[-0.01, 1.01], log_x=True, + template="plotly_white") + + # hide 'base figure' point + fig.update_traces(marker=dict(size=12, opacity=0), selector=dict(mode='markers')) + + # add data points (needs to be separate trace to show up in legend) + fig.add_trace(go.Scatter(x=plotting_data["training_TP"], y=plotting_data["precision"], + mode='markers', name="Motif precision", + marker=dict(symbol="circle", color="#74C4C4")), + secondary_y=False) + + # add combined precision + fig.add_trace(go.Scatter(x=combined_precision["training_TP"], y=combined_precision["combined_precision"], + mode='markers+lines', name="Combined precision", + marker=dict(symbol="diamond",
color=px.colors.diverging.Tealrose[0])), + secondary_y=False) + + # add highlighted motifs + plotting_data_highlight = plotting_data[plotting_data["highlight"] != "Motif"] + if len(plotting_data_highlight) > 0: + fig.add_trace(go.Scatter(x=plotting_data_highlight["training_TP"], y=plotting_data_highlight["precision"], + mode='markers', name=f"{highlight_motifs_name} precision", + marker=dict(symbol="circle", color="#F5C144")), + secondary_y=False) + + # add smoothed combined precision + if "smooth_combined_precision" in combined_precision: + fig.add_trace(go.Scatter(x=combined_precision["training_TP"], y=combined_precision["smooth_combined_precision"], + marker=dict(color=px.colors.diverging.Tealrose[-1]), + name="Combined precision, smoothed", + mode="lines", line_shape='spline', line={'smoothing': 1.3}), + secondary_y=False, ) + + # add vertical TP cutoff line + if tp_cutoff is not None: + if tp_cutoff == "auto": + tp_cutoff = min(plotting_data["training_TP"]) + + fig.add_vline(x=tp_cutoff, line_dash="dash") + + tickvals = MotifPerformancePlotHelper._get_log_x_axis_ticks(plotting_data, tp_cutoff) + fig.update_layout(xaxis=dict(tickvals=tickvals), + xaxis_title=f"True positive predictions ({training_set_name})", + yaxis_title=f"Precision ({dataset_type})", + showlegend=True) + + fig.write_html(str(file_path)) + + return ReportOutput( + path=file_path, + name=f"Precision scores on the {dataset_type} for {motifs_name} found at each true positive count of the {training_set_name}.", + ) + + @staticmethod + def _get_log_x_axis_ticks(plotting_data, tp_cutoff): + ticks = [] + + min_val, max_val = min(plotting_data["training_TP"]), max(plotting_data["training_TP"]) + + i = 1 + while i < max_val: + if i > min_val: + ticks.append(i) + i *= 10 + + ticks.append(min_val) + ticks.append(max_val) + + if tp_cutoff is not None: + ticks.append(tp_cutoff) + + return sorted(ticks) + + @staticmethod + def plot_precision_recall(file_path, plotting_data, min_recall=None, 
min_precision=None, dataset_type=None, motifs_name="motifs", + highlight_motifs_name="highlight"): + fig = px.scatter(plotting_data, + y="precision", x="recall", hover_data=["feature_names"], + range_x=[0, 1.01], range_y=[0, 1.01], color="highlight", + color_discrete_map={"Motif": px.colors.qualitative.Pastel[0], + highlight_motifs_name: px.colors.qualitative.Pastel[1]}, + labels={ + "precision": f"Precision ({dataset_type})", + "recall": f"Recall ({dataset_type})", + "feature_names": "Motif", + }, template="plotly_white") + + if min_precision is not None and min_precision > 0: + fig.add_hline(y=min_precision, line_dash="dash") + + if min_recall is not None and min_recall > 0: + fig.add_vline(x=min_recall, line_dash="dash") + + fig.write_html(str(file_path)) + + return ReportOutput( + path=file_path, + name=f"Precision versus recall of significant {motifs_name} on the {dataset_type}", + ) + + @staticmethod + def write_output_tables(report_obj, training_plotting_data, test_plotting_data, training_combined_precision, test_combined_precision, motifs_name="motifs", file_suffix=""): + results_table_name = f"Confusion matrix and precision/recall scores for significant {motifs_name}" + " on the {} set" + combined_precision_table_name = f"Combined precision scores of {motifs_name}" + " on the {} set for each TP value on the " + str(report_obj.training_set_name) + + train_results_table = report_obj._write_output_table(training_plotting_data, report_obj.result_path / f"training_set_scores{file_suffix}.csv", results_table_name.format(report_obj.training_set_name)) + test_results_table = report_obj._write_output_table(test_plotting_data, report_obj.result_path / f"test_set_scores{file_suffix}.csv", results_table_name.format(report_obj.test_set_name)) + training_combined_precision_table = report_obj._write_output_table(training_combined_precision, report_obj.result_path / f"training_combined_precision{file_suffix}.csv", 
combined_precision_table_name.format(report_obj.training_set_name)) + test_combined_precision_table = report_obj._write_output_table(test_combined_precision, report_obj.result_path / f"test_combined_precision{file_suffix}.csv", combined_precision_table_name.format(report_obj.test_set_name)) + + return [table for table in [train_results_table, test_results_table, training_combined_precision_table, test_combined_precision_table] if table is not None] + + @staticmethod + def write_plots(report_obj, training_plotting_data, test_plotting_data, training_combined_precision, test_combined_precision, training_tp_cutoff, test_tp_cutoff, motifs_name="motifs", file_suffix=""): + training_tp_plot = report_obj._safe_plot(plot_callable="_plot_precision_per_tp", plotting_data=training_plotting_data, combined_precision=training_combined_precision, dataset_type=report_obj.training_set_name, file_path=report_obj.result_path / f"training_precision_per_tp{file_suffix}.html", motifs_name=motifs_name, tp_cutoff=training_tp_cutoff) + test_tp_plot = report_obj._safe_plot(plot_callable="_plot_precision_per_tp", plotting_data=test_plotting_data, combined_precision=test_combined_precision, dataset_type=report_obj.test_set_name, file_path=report_obj.result_path / f"test_precision_per_tp{file_suffix}.html", motifs_name=motifs_name, tp_cutoff=test_tp_cutoff) + training_pr_plot = report_obj._safe_plot(plot_callable="_plot_precision_recall", plotting_data=training_plotting_data, dataset_type=report_obj.training_set_name, file_path=report_obj.result_path / f"training_precision_recall{file_suffix}.html", motifs_name=motifs_name) + test_pr_plot = report_obj._safe_plot(plot_callable="_plot_precision_recall", plotting_data=test_plotting_data, dataset_type=report_obj.test_set_name, file_path=report_obj.result_path / f"test_precision_recall{file_suffix}.html", motifs_name=motifs_name) + + return [plot for plot in [training_tp_plot, test_tp_plot, training_pr_plot, test_pr_plot] if plot is not None] diff 
--git a/immuneML/util/NumpyHelper.py b/immuneML/util/NumpyHelper.py index 43754d77e..0e76daf05 100644 --- a/immuneML/util/NumpyHelper.py +++ b/immuneML/util/NumpyHelper.py @@ -3,7 +3,6 @@ import numpy as np - class NumpyHelper: SIMPLE_TYPES = [str, int, float, bool, np.str_, np.int_, np.float_, np.bool_] diff --git a/immuneML/util/ParameterValidator.py b/immuneML/util/ParameterValidator.py index 7af40103a..24fd7e233 100644 --- a/immuneML/util/ParameterValidator.py +++ b/immuneML/util/ParameterValidator.py @@ -35,28 +35,26 @@ def assert_all_type_and_value(values, parameter_type, location: str, parameter_n def assert_type_and_value(value, parameter_type, location: str, parameter_name: str, min_inclusive=None, max_inclusive=None, min_exclusive=None, max_exclusive=None, exact_value=None): - assert isinstance(value, parameter_type), f"{location}: {value} is not a valid value for parameter {parameter_name}. " \ - f"It has to be of type {parameter_type.__name__}, but is now of type {type(value).__name__}." + type_name = " or ".join([t.__name__ for t in parameter_type]) if type(parameter_type) is tuple else parameter_type.__name__ + + base_mssg = f"{location}: {value} is not a valid value for parameter {parameter_name}. " + + assert isinstance(value, parameter_type), f"{base_mssg}It has to be of type {type_name}, but is now of type {type(value).__name__}." if min_inclusive is not None: - assert value >= min_inclusive, f"{location}: {value} is not a valid value for parameter {parameter_name}. " \ - f"It has to be greater or equal to {min_inclusive}." + assert value >= min_inclusive, base_mssg + f"It has to be greater or equal to {min_inclusive}." if max_inclusive is not None: - assert value <= max_inclusive, f"{location}: {value} is not a valid value for parameter {parameter_name}. " \ - f"It has to be less or equal to {max_inclusive}." + assert value <= max_inclusive, base_mssg + f"It has to be less or equal to {max_inclusive}." 
if min_exclusive is not None: - assert value > min_exclusive, f"{location}: {value} is not a valid value for parameter {parameter_name}. " \ - f"It has to be greater than {min_exclusive}." + assert value > min_exclusive, base_mssg + f"It has to be greater than {min_exclusive}." if max_exclusive is not None: - assert value < max_exclusive, f"{location}: {value} is not a valid value for parameter {parameter_name}. " \ - f"It has to be less than {max_exclusive}." + assert value < max_exclusive, base_mssg + f"It has to be less than {max_exclusive}." if exact_value is not None: - assert value == exact_value, f"{location}: {value} is not a valid value for parameter {parameter_name}. " \ - f"It has to be equal to {exact_value}." + assert value == exact_value, base_mssg + f"It has to be equal to {exact_value}." @staticmethod def assert_keys(keys, valid_keys, location: str, parameter_name: str, exclusive: bool = True): diff --git a/immuneML/workflows/instructions/MLProcess.py b/immuneML/workflows/instructions/MLProcess.py index de2e2e389..7ddc6eb4b 100644 --- a/immuneML/workflows/instructions/MLProcess.py +++ b/immuneML/workflows/instructions/MLProcess.py @@ -5,6 +5,7 @@ from immuneML.data_model.dataset.Dataset import Dataset from immuneML.environment.Label import Label from immuneML.environment.LabelConfiguration import LabelConfiguration +from immuneML.example_weighting.ExampleWeightingStrategy import ExampleWeightingStrategy from immuneML.hyperparameter_optimization.HPSetting import HPSetting from immuneML.hyperparameter_optimization.core.HPUtil import HPUtil from immuneML.hyperparameter_optimization.states.HPItem import HPItem @@ -31,7 +32,7 @@ class MLProcess: def __init__(self, train_dataset: Dataset, test_dataset: Dataset, label: Label, metrics: set, optimization_metric: ClassificationMetric, path: Path, ml_reports: List[MLReport] = None, encoding_reports: list = None, data_reports: list = None, number_of_processes: int = 2, - label_config: LabelConfiguration = None,
report_context: dict = None, hp_setting: HPSetting = None): + label_config: LabelConfiguration = None, report_context: dict = None, hp_setting: HPSetting = None, example_weighting: ExampleWeightingStrategy = None): self.train_dataset = train_dataset self.test_dataset = test_dataset self.label = label @@ -39,6 +40,7 @@ def __init__(self, train_dataset: Dataset, test_dataset: Dataset, label: Label, self.method = copy.deepcopy(hp_setting.ml_method) self.path = PathBuilder.build(path) if path is not None else None self.ml_details_path = path / "ml_details.yaml" if path is not None else None + self.ml_settings_export_path = path / "ml_settings_config" if path is not None else None self.ml_score_path = path / "ml_score.csv" if path is not None else None self.train_predictions_path = path / "train_predictions.csv" if path is not None else None self.test_predictions_path = path / "test_predictions.csv" if path is not None else None @@ -53,11 +55,13 @@ def __init__(self, train_dataset: Dataset, test_dataset: Dataset, label: Label, self.data_reports = data_reports if data_reports is not None else [] self.report_context = report_context self.hp_setting = copy.deepcopy(hp_setting) + self.example_weighting = example_weighting def _set_paths(self): if self.path is None: raise RuntimeError("MLProcess: path is not set, stopping execution...") self.ml_details_path = self.path / "ml_details.yaml" + self.ml_settings_export_path = self.path / "ml_settings_config" self.ml_score_path = self.path / "ml_score.csv" self.train_predictions_path = self.path / "train_predictions.csv" self.test_predictions_path = self.path / "test_predictions.csv" @@ -73,7 +77,10 @@ def run(self, split_index: int) -> HPItem: processed_dataset = HPUtil.preprocess_dataset(self.train_dataset, self.hp_setting.preproc_sequence, self.path / "preprocessed_train_dataset", self.report_context) - encoded_train_dataset = HPUtil.encode_dataset(processed_dataset, self.hp_setting, self.path / "encoded_datasets", 
learn_model=True, + weighted_dataset = HPUtil.weight_examples(dataset=processed_dataset, weighting_strategy=self.example_weighting, path=self.path / "weighted_train_datasets", + learn_model=True, number_of_processes=self.number_of_processes) + + encoded_train_dataset = HPUtil.encode_dataset(weighted_dataset, self.hp_setting, self.path / "encoded_datasets", learn_model=True, context=self.report_context, number_of_processes=self.number_of_processes, label_configuration=self.label_config) @@ -106,7 +113,12 @@ def _assess_on_test_dataset(self, encoded_train_dataset, encoding_train_results, if self.test_dataset is not None and self.test_dataset.get_example_count() > 0: processed_test_dataset = HPUtil.preprocess_dataset(self.test_dataset, self.hp_setting.preproc_sequence, self.path / "preprocessed_test_dataset") - encoded_test_dataset = HPUtil.encode_dataset(processed_test_dataset, self.hp_setting, self.path / "encoded_datasets", + + weighted_test_dataset = HPUtil.weight_examples(dataset=processed_test_dataset, weighting_strategy=self.example_weighting, + path=self.path / "weighted_test_datasets", learn_model=False, + number_of_processes=self.number_of_processes) + + encoded_test_dataset = HPUtil.encode_dataset(weighted_test_dataset, self.hp_setting, self.path / "encoded_datasets", learn_model=False, context=self.report_context, number_of_processes=self.number_of_processes, label_configuration=self.label_config) @@ -122,10 +134,11 @@ def _assess_on_test_dataset(self, encoded_train_dataset, encoding_train_results, test_predictions_path=self.test_predictions_path, ml_details_path=self.ml_details_path, train_dataset=self.train_dataset, test_dataset=self.test_dataset, split_index=split_index, model_report_results=model_report_results, encoding_train_results=encoding_train_results, encoding_test_results=encoding_test_results, performance=performance, - encoder=self.hp_setting.encoder) + encoder=self.hp_setting.encoder, ml_settings_export_path=self.ml_settings_export_path) 
else: hp_item = HPItem(method=method, hp_setting=self.hp_setting, train_predictions_path=self.train_predictions_path, test_predictions_path=None, ml_details_path=self.ml_details_path, train_dataset=self.train_dataset, - split_index=split_index, encoding_train_results=encoding_train_results, encoder=self.hp_setting.encoder) + split_index=split_index, encoding_train_results=encoding_train_results, encoder=self.hp_setting.encoder, + ml_settings_export_path=self.ml_settings_export_path) return hp_item diff --git a/immuneML/workflows/instructions/TrainMLModelInstruction.py b/immuneML/workflows/instructions/TrainMLModelInstruction.py index 11b69527a..49ced800e 100644 --- a/immuneML/workflows/instructions/TrainMLModelInstruction.py +++ b/immuneML/workflows/instructions/TrainMLModelInstruction.py @@ -1,11 +1,13 @@ from collections import Counter from pathlib import Path +import logging import pandas as pd from immuneML.IO.ml_method.MLExporter import MLExporter from immuneML.environment.Label import Label from immuneML.environment.LabelConfiguration import LabelConfiguration +from immuneML.example_weighting.ExampleWeightingStrategy import ExampleWeightingStrategy from immuneML.hyperparameter_optimization.config.SplitConfig import SplitConfig from immuneML.hyperparameter_optimization.config.SplitType import SplitType from immuneML.hyperparameter_optimization.core.HPAssessment import HPAssessment @@ -53,6 +55,8 @@ class TrainMLModelInstruction(Instruction): - optimization_metric (Metric): a metric to use for optimization and assessment in the nested cross-validation. + - example_weighting: which example weighting strategy to use. Example weighting can be used to up-weight or down-weight the importance of each example in the dataset. These weights will be applied when computing (optimization) metrics, and are used by some encoders and ML methods. + - label_configuration (LabelConfiguration): a list of labels for which to train the classifiers. 
The goal of the nested CV is to find the setting which will have best performance in predicting the given label (e.g., if a subject has experienced an immune event or not). Performance and optimal settings will be reported for each label separately. If a label is binary, instead of specifying only its name, one @@ -67,6 +71,9 @@ class TrainMLModelInstruction(Instruction): - refit_optimal_model (bool): if the final combination of preprocessing-encoding-ML model should be refitted on the full dataset thus providing the final model to be exported from instruction; alternatively, train combination from one of the assessment folds will be used + - export_all_ml_settings (bool): if set to True, all trained models in the assessment split are exported as .zip files. + If False, only the optimal model is exported. By default, export_all_ml_settings is False. + YAML specification: @@ -117,25 +124,37 @@ class TrainMLModelInstruction(Instruction): number_of_processes: 4 # number of parallel processes to create (could speed up the computation) optimization_metric: balanced_accuracy # the metric to use for choosing the optimal model and during training refit_optimal_model: False # use trained model, do not refit on the full dataset + export_all_ml_settings: False # only export the optimal setting """ def __init__(self, dataset, hp_strategy: HPOptimizationStrategy, hp_settings: list, assessment: SplitConfig, selection: SplitConfig, metrics: set, optimization_metric: ClassificationMetric, label_configuration: LabelConfiguration, path: Path = None, context: dict = None, - number_of_processes: int = 1, reports: dict = None, name: str = None, refit_optimal_model: bool = False): + number_of_processes: int = 1, reports: dict = None, name: str = None, refit_optimal_model: bool = False, + export_all_ml_settings: bool = False, example_weighting: ExampleWeightingStrategy = None): self.state = TrainMLModelState(dataset, hp_strategy, hp_settings, assessment, selection, metrics, optimization_metric,
label_configuration, path, context, number_of_processes, - reports if reports is not None else {}, name, refit_optimal_model) + reports if reports is not None else {}, name, refit_optimal_model, + export_all_ml_settings, example_weighting) def run(self, result_path: Path): self.state.path = result_path self.state = HPAssessment.run_assessment(self.state) + self._export_all_ml_settings() self._compute_optimal_hp_item_per_label() self.state.report_results = HPUtil.run_hyperparameter_reports(self.state, self.state.path / "reports") self.print_performances(self.state) self._export_all_performances_to_csv() return self.state + def _export_all_ml_settings(self): + if self.state.export_all_ml_settings: + for label in self.state.label_configuration.get_label_objects(): + for assessment_state in self.state.assessment_states: + for name, hp_item in assessment_state.label_states[label.name].assessment_items.items(): + zip_path = MLExporter.export_zip(hp_item=hp_item, path=hp_item.ml_settings_export_path, label_name=label.name) + logging.info(f"TrainMLModelInstruction: config for {name} for label {label.name} was exported to: {zip_path}") + def _compute_optimal_hp_item_per_label(self): n_labels = self.state.label_configuration.get_label_count() @@ -150,8 +169,9 @@ def _compute_optimal_item(self, label: Label, index_repr: str): if self.state.refit_optimal_model: print_log(f"TrainMLModel: retraining optimal model for label {label.name} {index_repr}.\n", include_datetime=True) self.state.optimal_hp_items[label.name] = MLProcess(self.state.dataset, None, label, self.state.metrics, self.state.optimization_metric, - self.state.path / f"optimal_{label.name}", number_of_processes=self.state.number_of_processes, - label_config=self.state.label_configuration, hp_setting=optimal_hp_setting).run(0) + self.state.path / f"optimal_{label.name}", number_of_processes=self.state.number_of_processes, + label_config=self.state.label_configuration, hp_setting=optimal_hp_setting, + 
example_weighting=self.state.example_weighting).run(0) print_log(f"TrainMLModel: finished retraining optimal model for label {label.name} {index_repr}.\n", include_datetime=True) else: diff --git a/immuneML/workflows/instructions/exploratory_analysis/ExploratoryAnalysisInstruction.py b/immuneML/workflows/instructions/exploratory_analysis/ExploratoryAnalysisInstruction.py index 6c7eae4b5..995a418bd 100644 --- a/immuneML/workflows/instructions/exploratory_analysis/ExploratoryAnalysisInstruction.py +++ b/immuneML/workflows/instructions/exploratory_analysis/ExploratoryAnalysisInstruction.py @@ -10,6 +10,8 @@ from immuneML.workflows.instructions.exploratory_analysis.ExploratoryAnalysisUnit import ExploratoryAnalysisUnit from immuneML.workflows.steps.DataEncoder import DataEncoder from immuneML.workflows.steps.DataEncoderParams import DataEncoderParams +from immuneML.workflows.steps.DataWeighter import DataWeighter +from immuneML.workflows.steps.DataWeighterParams import DataWeighterParams class ExploratoryAnalysisInstruction(Instruction): @@ -29,6 +31,8 @@ class ExploratoryAnalysisInstruction(Instruction): - preprocessing_sequence: which preprocessings to use on the dataset, this item is optional and does not have to be specified. + - example_weighting: which example weighting strategy to use before encoding the data, this item is optional and does not have to be specified. + - encoding: how to encode the dataset before running the report, this item is optional and does not have to be specified. - labels: if encoding is specified, the relevant labels should be specified here. 
@@ -91,6 +95,7 @@ def run(self, result_path: Path): def run_unit(self, unit: ExploratoryAnalysisUnit, result_path: Path) -> ReportResult: unit.dataset = self.preprocess_dataset(unit, result_path / "preprocessed_dataset") + unit.dataset = self.weight_examples(unit, result_path / "weighted_dataset") unit.dataset = self.encode(unit, result_path / "encoded_dataset") if unit.dim_reduction is not None: @@ -116,12 +121,23 @@ def preprocess_dataset(self, unit: ExploratoryAnalysisUnit, result_path: Path) - dataset = unit.dataset return dataset + def weight_examples(self, unit: ExploratoryAnalysisUnit, result_path: Path): + if unit.example_weighting is not None: + weighted_dataset = DataWeighter.run(DataWeighterParams(dataset=unit.dataset, weighting_strategy=unit.example_weighting, + weighting_params=ExampleWeightingParams(result_path=result_path, + pool_size=unit.number_of_processes, + learn_model=True), + )) + else: + weighted_dataset = unit.dataset + + return weighted_dataset + def encode(self, unit: ExploratoryAnalysisUnit, result_path: Path) -> Dataset: if unit.encoder is not None: encoded_dataset = DataEncoder.run(DataEncoderParams(dataset=unit.dataset, encoder=unit.encoder, encoder_params=EncoderParams(result_path=result_path, label_config=unit.label_config, - filename="encoded_dataset.pkl", pool_size=unit.number_of_processes, learn_model=True, encode_labels=unit.label_config is not None), diff --git a/immuneML/workflows/instructions/exploratory_analysis/ExploratoryAnalysisUnit.py b/immuneML/workflows/instructions/exploratory_analysis/ExploratoryAnalysisUnit.py index ad06ec3be..f1685eb32 100644 --- a/immuneML/workflows/instructions/exploratory_analysis/ExploratoryAnalysisUnit.py +++ b/immuneML/workflows/instructions/exploratory_analysis/ExploratoryAnalysisUnit.py @@ -6,6 +6,7 @@ from immuneML.encodings.DatasetEncoder import DatasetEncoder from immuneML.environment.LabelConfiguration import LabelConfiguration from immuneML.ml_methods.dim_reduction.DimRedMethod import 
DimRedMethod +from immuneML.example_weighting.ExampleWeightingStrategy import ExampleWeightingStrategy from immuneML.reports.Report import Report from immuneML.reports.ReportResult import ReportResult @@ -16,6 +17,7 @@ class ExploratoryAnalysisUnit: report: Report preprocessing_sequence: list = None encoder: DatasetEncoder = None + example_weighting: ExampleWeightingStrategy = None label_config: LabelConfiguration = None number_of_processes: int = 1 report_result: ReportResult = None diff --git a/immuneML/workflows/instructions/quickstart.py b/immuneML/workflows/instructions/quickstart.py index 20bed00d9..fc0f531fe 100644 --- a/immuneML/workflows/instructions/quickstart.py +++ b/immuneML/workflows/instructions/quickstart.py @@ -56,7 +56,7 @@ def create_specfication(self, path: Path): }, "hprep": "MLSettingsPerformance", "coef": "Coefficients" - } + }, }, "instructions": { "machine_learning_instruction": { diff --git a/immuneML/workflows/steps/DataWeighter.py b/immuneML/workflows/steps/DataWeighter.py new file mode 100644 index 000000000..ca65de155 --- /dev/null +++ b/immuneML/workflows/steps/DataWeighter.py @@ -0,0 +1,29 @@ +import datetime + +from immuneML.workflows.steps.DataWeighterParams import DataWeighterParams +from immuneML.workflows.steps.Step import Step +from immuneML.workflows.steps.StepParams import StepParams + + +class DataWeighter(Step): + + @staticmethod + def run(input_params: StepParams = None): + assert isinstance(input_params, DataWeighterParams), \ + "DataWeighter step: input_params have to be an instance of DataWeighterParams class." 
+ + dataset = input_params.dataset.clone() + weighting_strategy = input_params.weighting_strategy + weighting_params = input_params.weighting_params + + if weighting_strategy is None: + return dataset + + print(f"{datetime.datetime.now()}: Computing example weights...") + + example_weights = weighting_strategy.compute_weights(dataset, weighting_params) + dataset.set_example_weights(example_weights) + + print(f"{datetime.datetime.now()}: Example weights computed.") + + return dataset diff --git a/immuneML/workflows/steps/DataWeighterParams.py b/immuneML/workflows/steps/DataWeighterParams.py new file mode 100644 index 000000000..94ecb9135 --- /dev/null +++ b/immuneML/workflows/steps/DataWeighterParams.py @@ -0,0 +1,14 @@ +from dataclasses import dataclass + +from immuneML.data_model.dataset.Dataset import Dataset +from immuneML.example_weighting.ExampleWeightingParams import ExampleWeightingParams +from immuneML.example_weighting.ExampleWeightingStrategy import ExampleWeightingStrategy +from immuneML.workflows.steps.StepParams import StepParams + + +@dataclass +class DataWeighterParams(StepParams): + + dataset: Dataset + weighting_strategy: ExampleWeightingStrategy + weighting_params: ExampleWeightingParams diff --git a/immuneML/workflows/steps/MLMethodAssessment.py b/immuneML/workflows/steps/MLMethodAssessment.py index 308102b64..2e972843f 100644 --- a/immuneML/workflows/steps/MLMethodAssessment.py +++ b/immuneML/workflows/steps/MLMethodAssessment.py @@ -3,6 +3,7 @@ import numpy as np import pandas as pd +from immuneML.ml_metrics.MetricUtil import MetricUtil from immuneML.environment.Label import Label from immuneML.ml_methods.classifiers.MLMethod import MLMethod @@ -22,6 +23,7 @@ def run(input_params: MLMethodAssessmentParams = None): predicted_y = input_params.method.predict(X, input_params.label) predicted_proba_y_per_class = input_params.method.predict_proba(X, input_params.label) true_y = input_params.dataset.encoded_data.labels + example_weights = 
input_params.dataset.get_example_weights() example_ids = input_params.dataset.get_example_ids() @@ -32,13 +34,14 @@ def run(input_params: MLMethodAssessmentParams = None): scores = MLMethodAssessment._score(metrics_list=input_params.metrics, optimization_metric=input_params.optimization_metric, label=input_params.label, split_index=input_params.split_index, predicted_y=predicted_y, - predicted_proba_y_per_class=predicted_proba_y_per_class, true_y=true_y, method=input_params.method, + predicted_proba_y_per_class=predicted_proba_y_per_class, true_y=true_y, + example_weights=example_weights, method=input_params.method, ml_score_path=input_params.ml_score_path) return scores @staticmethod - def _score(metrics_list: set, optimization_metric: ClassificationMetric, label: Label, predicted_y, predicted_proba_y_per_class, true_y, ml_score_path: Path, + def _score(metrics_list: set, optimization_metric: ClassificationMetric, label: Label, predicted_y, predicted_proba_y_per_class, true_y, example_weights, ml_score_path: Path, split_index: int, method: MLMethod): results = {} scores = {} @@ -55,9 +58,9 @@ def _score(metrics_list: set, optimization_metric: ClassificationMetric, label: score = MetricUtil.score_for_metric(metric=metric, predicted_y=predicted_y[label.name], true_y=true_y[label.name], + example_weights=example_weights, classes=label.values, predicted_proba_y=predicted_proba_y) - results[f"{label.name}_{metric.name.lower()}"] = score scores[metric.name.lower()] = score diff --git a/immuneML/workflows/steps/MLMethodTrainer.py b/immuneML/workflows/steps/MLMethodTrainer.py index febbeaf00..cb807e980 100644 --- a/immuneML/workflows/steps/MLMethodTrainer.py +++ b/immuneML/workflows/steps/MLMethodTrainer.py @@ -34,7 +34,10 @@ def _fit_method(input_params: MLMethodTrainerParams): cores_for_training=input_params.cores_for_training, optimization_metric=input_params.optimization_metric) else: - method.fit(encoded_data=input_params.dataset.encoded_data, 
label=input_params.label, cores_for_training=input_params.cores_for_training) + method.fit(encoded_data=input_params.dataset.encoded_data, + label=input_params.label, + cores_for_training=input_params.cores_for_training, + optimization_metric=input_params.optimization_metric) return method diff --git a/requirements_KerasSequenceCNN.txt b/requirements_KerasSequenceCNN.txt new file mode 100644 index 000000000..01525c9b2 --- /dev/null +++ b/requirements_KerasSequenceCNN.txt @@ -0,0 +1,3 @@ +keras>=2.3.1 +tensorflow>=2.2.0 +numpy>=1.18.5,<=1.23.2 diff --git a/setup.py b/setup.py index 4fcfba651..a86d53e77 100644 --- a/setup.py +++ b/setup.py @@ -27,7 +27,8 @@ def import_requirements(filename) -> list: extras_require={ "TCRdist": ["tcrdist3>=0.1.6"], "gen_models": ['olga', 'sonnia', 'torch'], - "ligo": ['olga', 'stitchr', 'IMGTgeneDL'] + "ligo": ['olga', 'stitchr', 'IMGTgeneDL'], + "KerasSequenceCNN": ["keras==2.11.0", "tensorflow==2.11.0"] }, classifiers=[ "Programming Language :: Python :: 3", diff --git a/test/IO/dataset_export/test_AIRRExporter.py b/test/IO/dataset_export/test_AIRRExporter.py index f905822f0..754007368 100644 --- a/test/IO/dataset_export/test_AIRRExporter.py +++ b/test/IO/dataset_export/test_AIRRExporter.py @@ -39,7 +39,7 @@ def create_dummy_repertoire(self, path): j_call="TRAJ2", chain=Chain.ALPHA, duplicate_count=15, - frame_type=SequenceFrameType.UNDEFINED, + frame_type=SequenceFrameType.OUT, region_type="IMGT_CDR3", custom_params={"d_call": "TRAD2", "custom_test": "cust2", diff --git a/test/IO/dataset_import/test_AIRRImport.py b/test/IO/dataset_import/test_AIRRImport.py index f8d9b031d..c0b1ce3a1 100644 --- a/test/IO/dataset_import/test_AIRRImport.py +++ b/test/IO/dataset_import/test_AIRRImport.py @@ -49,7 +49,8 @@ def test_import_repertoire_dataset(self): column_mapping = self.get_column_mapping() params = {"is_repertoire": True, "result_path": path, "path": path, "metadata_file": path / "metadata.csv", "import_out_of_frame": False, 
"import_with_stop_codon": False, "import_illegal_characters": False, - "import_productive": True, "region_type": "IMGT_CDR3", "import_empty_nt_sequences": True, "import_empty_aa_sequences": False, + "import_productive": True, "import_unknown_productivity": True, + "region_type": "IMGT_CDR3", "import_empty_nt_sequences": True, "import_empty_aa_sequences": False, "column_mapping": column_mapping, "separator": "\t"} @@ -78,7 +79,8 @@ def test_sequence_dataset(self): column_mapping = self.get_column_mapping() params = {"is_repertoire": False, "result_path": path, "path": path, "import_out_of_frame": False, "import_with_stop_codon": False, - "import_productive": True, "region_type": "IMGT_CDR3", "import_empty_nt_sequences": True, "import_empty_aa_sequences": False, + "import_productive": True, "import_unknown_productivity": True, + "region_type": "IMGT_CDR3", "import_empty_nt_sequences": True, "import_empty_aa_sequences": False, "column_mapping": column_mapping, "import_illegal_characters": False, "separator": "\t", "sequence_file_size": 1} @@ -111,7 +113,9 @@ def test_receptor_dataset(self): params = {"is_repertoire": False, "result_path": path, "path": path, "paired": True, "import_illegal_characters": False, "import_out_of_frame": False, "import_with_stop_codon": False, - "import_productive": True, "region_type": "IMGT_CDR3", "import_empty_nt_sequences": True, "import_empty_aa_sequences": False, + "import_productive": True, "import_unknown_productivity": True, + "region_type": "IMGT_CDR3", "import_empty_nt_sequences": True, + "import_empty_aa_sequences": False, "column_mapping": column_mapping, "receptor_chains": "IGH_IGL", "separator": "\t", "sequence_file_size": 1} @@ -133,7 +137,8 @@ def test_import_exported_dataset(self): column_mapping = self.get_column_mapping() params = {"is_repertoire": True, "result_path": path / 'imported', "path": path / 'initial', "metadata_file": path / "initial/metadata.csv", "import_out_of_frame": False, "import_with_stop_codon": 
False, - "import_productive": True, "region_type": "IMGT_CDR3", "import_empty_nt_sequences": True, "import_empty_aa_sequences": False, + "import_productive": True, "import_unknown_productivity": True, + "region_type": "IMGT_CDR3", "import_empty_nt_sequences": True, "import_empty_aa_sequences": False, "column_mapping": column_mapping, "import_illegal_characters": False, "separator": "\t"} @@ -175,7 +180,8 @@ def test_minimal_dataset(self): params = {"is_repertoire": True, "result_path": path, "path": path, "metadata_file": path / "metadata.csv", "import_out_of_frame": False, "import_with_stop_codon": False, - "import_productive": True, "region_type": "IMGT_CDR3", "import_empty_nt_sequences": True, "import_empty_aa_sequences": False, + "import_productive": True, "import_unknown_productivity": True, + "region_type": "IMGT_CDR3", "import_empty_nt_sequences": True, "import_empty_aa_sequences": False, "column_mapping": column_mapping, "import_illegal_characters": False, "separator": "\t"} diff --git a/test/dsl/instruction_parsers/test_exploratoryAnalysisParser.py b/test/dsl/instruction_parsers/test_exploratoryAnalysisParser.py index 791f56739..d4b602bec 100644 --- a/test/dsl/instruction_parsers/test_exploratoryAnalysisParser.py +++ b/test/dsl/instruction_parsers/test_exploratoryAnalysisParser.py @@ -1,6 +1,7 @@ import os import shutil from unittest import TestCase +import pandas as pd from immuneML.caching.CacheType import CacheType from immuneML.data_model.dataset.RepertoireDataset import RepertoireDataset @@ -10,6 +11,7 @@ from immuneML.encodings.reference_encoding.MatchedSequencesEncoder import MatchedSequencesEncoder from immuneML.environment.Constants import Constants from immuneML.environment.EnvironmentSettings import EnvironmentSettings +from immuneML.example_weighting.predefined_weighting.PredefinedWeighting import PredefinedWeighting from immuneML.preprocessing.SubjectRepertoireCollector import SubjectRepertoireCollector from
immuneML.reports.data_reports.SequenceLengthDistribution import SequenceLengthDistribution from immuneML.reports.encoding_reports.Matches import Matches @@ -27,7 +29,7 @@ def test_parse(self): path = EnvironmentSettings.tmp_test_path / "explanalysisparser/" PathBuilder.remove_old_and_build(path) - dataset = self.prepare_dataset(path) + dataset, weights_path = self.prepare_dataset(path) report1 = SequenceLengthDistribution() file_content = """complex.id Gene CDR3 V J Species MHC A MHC B MHC class Epitope Epitope gene Epitope species Reference Method Meta CDR3fix Score @@ -42,6 +44,7 @@ def test_parse(self): report2 = Matches.build_object() encoding = MatchedSequencesEncoder p1 = [SubjectRepertoireCollector()] + weighting = PredefinedWeighting instruction = { "type": "ExploratoryAnalysis", @@ -49,7 +52,7 @@ def test_parse(self): "analyses": { "1": {"dataset": "d1", "report": "r1", "preprocessing_sequence": "p1"}, "2": {"dataset": "d1", "report": "r2", "encoding": "e1", }, - "3": {"dataset": "d1", "report": "r2", "encoding": "e1", "labels": ["l1"]} + "3": {"dataset": "d1", "report": "r2", "encoding": "e1", "labels": ["l1"], "example_weighting": "w1"}, } } @@ -65,6 +68,10 @@ def test_parse(self): "normalize": False }}) symbol_table.add("p1", SymbolType.PREPROCESSING, p1) + symbol_table.add("w1", SymbolType.WEIGHTING, weighting, {"example_weighting_params": { + "file_path": weights_path, + "separator": "," + }}) process = ExploratoryAnalysisParser().parse("a", instruction, symbol_table) @@ -78,7 +85,9 @@ def test_parse(self): self.assertTrue(isinstance(list(process.state.exploratory_analysis_units.values())[2].report, Matches)) self.assertTrue(isinstance(list(process.state.exploratory_analysis_units.values())[2].encoder, MatchedSequencesEncoder)) + self.assertTrue(isinstance(list(process.state.exploratory_analysis_units.values())[2].example_weighting, PredefinedWeighting)) self.assertEqual(1, 
len(list(process.state.exploratory_analysis_units.values())[2].encoder.reference_sequences)) + self.assertEqual(weights_path, list(process.state.exploratory_analysis_units.values())[2].example_weighting.file_path) self.assertEqual("l1", list(process.state.exploratory_analysis_units.values())[2].label_config.get_labels_by_name()[0]) self.assertEqual(32, process.state.exploratory_analysis_units["2"].number_of_processes) @@ -89,4 +98,9 @@ def prepare_dataset(self, path: str): {"l1": [1, 1, 1, 0, 0, 0], "l2": [2, 3, 2, 3, 2, 3]}) dataset = RepertoireDataset(repertoires=repertoires, labels={"l1": [0, 1], "l2": [2, 3]}, metadata_file=metadata) - return dataset + + weights_path = path / "mock_weights.tsv" + df = pd.DataFrame({"identifier": dataset.get_example_ids(), "example_weight": [1 for i in range(dataset.get_example_count())]}) + df.to_csv(weights_path, index=False) + + return dataset, weights_path diff --git a/test/dsl/test_exampleWeightingParser.py b/test/dsl/test_exampleWeightingParser.py new file mode 100644 index 000000000..addf8f9cd --- /dev/null +++ b/test/dsl/test_exampleWeightingParser.py @@ -0,0 +1,36 @@ +import os +from unittest import TestCase + +from immuneML.caching.CacheType import CacheType +from immuneML.dsl.definition_parsers.ExampleWeightingParser import ExampleWeightingParser +from immuneML.dsl.symbol_table.SymbolTable import SymbolTable +from immuneML.environment.Constants import Constants +from immuneML.example_weighting.predefined_weighting.PredefinedWeighting import PredefinedWeighting + + +class TestExampleWeightingParser(TestCase): + + def setUp(self) -> None: + os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name + + def test_parse_example_weightings(self): + + params = { + "w3": { + "PredefinedWeighting": { + "file_path": "example/path.csv" + } + } + } + + symbol_table = SymbolTable() + symbol_table, desc = ExampleWeightingParser.parse(params, symbol_table) + + self.assertEqual(PredefinedWeighting, symbol_table.get("w3")) + + 
self.assertEqual(symbol_table.get("w3"), PredefinedWeighting) + self.assertEqual(symbol_table.get_config("w3"), {'example_weighting_params': {'file_path': 'example/path.csv', + 'separator': '\t', + 'name': 'w3'}}) + + diff --git a/test/dsl/test_immuneMLParser.py b/test/dsl/test_immuneMLParser.py index b5fcb1e62..07020c324 100644 --- a/test/dsl/test_immuneMLParser.py +++ b/test/dsl/test_immuneMLParser.py @@ -8,6 +8,8 @@ from immuneML.data_model.dataset.RepertoireDataset import RepertoireDataset from immuneML.dsl.ImmuneMLParser import ImmuneMLParser from immuneML.environment.EnvironmentSettings import EnvironmentSettings +from immuneML.ml_methods.classifiers.LogisticRegression import LogisticRegression +from immuneML.reports.data_reports.SequenceLengthDistribution import SequenceLengthDistribution from immuneML.util.PathBuilder import PathBuilder from immuneML.util.RepertoireBuilder import RepertoireBuilder @@ -51,6 +53,12 @@ def test_parse_iml_yaml_file(): "reports": { "rep1": "SequenceLengthDistribution" + }, + "example_weightings": { + "w1": { + "PredefinedWeighting": + {"file_path": "test"} + } } }, "instructions": {} @@ -65,6 +73,8 @@ def test_parse_iml_yaml_file(): assert all([symbol_table.contains(key) for key in ["simpleLR", "rep1", "a1", "d1"]]) assert isinstance(symbol_table.get("d1"), RepertoireDataset) + assert isinstance(symbol_table.get("rep1"), SequenceLengthDistribution) + assert isinstance(symbol_table.get("simpleLR2"), LogisticRegression) with pytest.raises(YAMLError): with specs_filename.open("r") as file: diff --git a/test/encodings/distance_encoding/test_compAIRRDistanceEncoder.py b/test/encodings/distance_encoding/test_compAIRRDistanceEncoder.py index 0439f0f18..8eea30158 100644 --- a/test/encodings/distance_encoding/test_compAIRRDistanceEncoder.py +++ b/test/encodings/distance_encoding/test_compAIRRDistanceEncoder.py @@ -57,7 +57,7 @@ def _run_test(self, compairr_path): encoded = enc.encode(dataset, EncoderParams(result_path=path, 
label_config=LabelConfiguration( [Label("l1", [0, 1]), Label("l2", [2, 3])]), - pool_size=4, filename="dataset.pkl")) + pool_size=4)) self.assertEqual(8, encoded.encoded_data.examples.shape[0]) self.assertEqual(8, encoded.encoded_data.examples.shape[1]) diff --git a/test/encodings/distance_encoding/test_distanceEncoder.py b/test/encodings/distance_encoding/test_distanceEncoder.py index 5da4b44a8..efe426e0a 100644 --- a/test/encodings/distance_encoding/test_distanceEncoder.py +++ b/test/encodings/distance_encoding/test_distanceEncoder.py @@ -43,7 +43,7 @@ def test_encode(self): enc.set_context({"dataset": dataset}) encoded = enc.encode(dataset, EncoderParams(result_path=path, label_config=LabelConfiguration([Label("l1", [0, 1]), Label("l2", [2, 3])]), - pool_size=4, filename="dataset.pkl")) + pool_size=4)) self.assertEqual(8, encoded.encoded_data.examples.shape[0]) self.assertEqual(8, encoded.encoded_data.examples.shape[1]) diff --git a/test/encodings/kmer_frequency/test_kmerFreqReceptorEncoder.py b/test/encodings/kmer_frequency/test_kmerFreqReceptorEncoder.py index c221a12d3..a4b2bf872 100644 --- a/test/encodings/kmer_frequency/test_kmerFreqReceptorEncoder.py +++ b/test/encodings/kmer_frequency/test_kmerFreqReceptorEncoder.py @@ -62,7 +62,6 @@ def test(self): pool_size=2, learn_model=True, model={}, - filename="dataset.csv", encode_labels=False )) diff --git a/test/encodings/kmer_frequency/test_kmerFreqSequenceEncoder.py b/test/encodings/kmer_frequency/test_kmerFreqSequenceEncoder.py index 66357196c..1cc81ca3d 100644 --- a/test/encodings/kmer_frequency/test_kmerFreqSequenceEncoder.py +++ b/test/encodings/kmer_frequency/test_kmerFreqSequenceEncoder.py @@ -67,7 +67,6 @@ def test(self): pool_size=2, learn_model=True, model={}, - filename="dataset.csv" )) self.assertEqual(9, encoded_dataset.encoded_data.examples.shape[0]) diff --git a/test/encodings/kmer_frequency/test_kmerFrequencyEncoder.py b/test/encodings/kmer_frequency/test_kmerFrequencyEncoder.py index 
ce2029d4a..126b351a3 100644 --- a/test/encodings/kmer_frequency/test_kmerFrequencyEncoder.py +++ b/test/encodings/kmer_frequency/test_kmerFrequencyEncoder.py @@ -59,7 +59,6 @@ def test_encode(self): label_config=lc, learn_model=True, model={}, - filename="dataset.pkl" )) encoder = KmerFrequencyEncoder.build_object(dataset, **{ @@ -76,7 +75,6 @@ def test_encode(self): pool_size=2, learn_model=True, model={}, - filename="dataset.csv" )) encoder3 = KmerFrequencyEncoder.build_object(dataset, **{ @@ -92,7 +90,6 @@ def test_encode(self): label_config=lc, learn_model=True, model={}, - filename="dataset.pkl" )) shutil.rmtree(path) diff --git a/test/encodings/motif_encoding/__init__.py b/test/encodings/motif_encoding/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/test/encodings/motif_encoding/test_MotifEncoder.py b/test/encodings/motif_encoding/test_MotifEncoder.py new file mode 100644 index 000000000..6185c03f0 --- /dev/null +++ b/test/encodings/motif_encoding/test_MotifEncoder.py @@ -0,0 +1,128 @@ +import os +import shutil +from unittest import TestCase + + +from immuneML.caching.CacheType import CacheType +from immuneML.data_model.dataset.SequenceDataset import SequenceDataset +from immuneML.data_model.receptor.receptor_sequence.ReceptorSequence import ReceptorSequence +from immuneML.data_model.receptor.receptor_sequence.SequenceMetadata import SequenceMetadata +from immuneML.encodings.EncoderParams import EncoderParams +from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder +from immuneML.environment.Constants import Constants +from immuneML.environment.EnvironmentSettings import EnvironmentSettings +from immuneML.environment.LabelConfiguration import LabelConfiguration +from immuneML.util.PathBuilder import PathBuilder + + +class TestMotifEncoder(TestCase): + + def setUp(self) -> None: + os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name + + def _prepare_dataset(self, path): + sequences = 
[ReceptorSequence(sequence_aa="AACC", sequence_id="1", + metadata=SequenceMetadata(custom_params={"l1": 1})), + ReceptorSequence(sequence_aa="AGDD", sequence_id="2", + metadata=SequenceMetadata(custom_params={"l1": 1})), + ReceptorSequence(sequence_aa="AAEE", sequence_id="3", + metadata=SequenceMetadata(custom_params={"l1": 1})), + ReceptorSequence(sequence_aa="AGFF", sequence_id="4", + metadata=SequenceMetadata(custom_params={"l1": 1})), + ReceptorSequence(sequence_aa="CCCC", sequence_id="5", + metadata=SequenceMetadata(custom_params={"l1": 2})), + ReceptorSequence(sequence_aa="DDDD", sequence_id="6", + metadata=SequenceMetadata(custom_params={"l1": 2})), + ReceptorSequence(sequence_aa="EEEE", sequence_id="7", + metadata=SequenceMetadata(custom_params={"l1": 2})), + ReceptorSequence(sequence_aa="FFFF", sequence_id="8", + metadata=SequenceMetadata(custom_params={"l1": 2}))] + + + PathBuilder.build(path) + return SequenceDataset.build_from_objects(sequences, 100, PathBuilder.build(path / 'data'), 'd2') + + def test(self): + path = EnvironmentSettings.tmp_test_path / "significant_motif_sequence_encoder_test/" + dataset = self._prepare_dataset(path) + + lc = LabelConfiguration() + lc.add_label("l1", [1, 2], positive_class=1) + + encoder = MotifEncoder.build_object(dataset, **{ + "min_positions": 1, + "max_positions": 2, + "min_precision": 0.9, + "min_recall": 0.5, + "min_true_positives": 1, + }) + + encoded_dataset = encoder.encode(dataset, EncoderParams( + result_path=path / "encoder_result/", + label_config=lc, + pool_size=4, + learn_model=True, + model={}, + )) + + self.assertEqual(8, encoded_dataset.encoded_data.examples.shape[0]) + self.assertTrue(all(identifier in encoded_dataset.encoded_data.example_ids + for identifier in ['1', '2', '3', '4', '5', '6', '7', '8'])) + + self.assertListEqual(['0-A', '1-A', '1-G', '0&1-A&A', '0&1-A&G'], encoded_dataset.encoded_data.feature_names) + + self.assertListEqual([True, True, False, True, False], 
list(encoded_dataset.encoded_data.examples[0])) + self.assertListEqual([True, False, True, False, True], list(encoded_dataset.encoded_data.examples[1])) + self.assertListEqual([True, True, False, True, False], list(encoded_dataset.encoded_data.examples[2])) + self.assertListEqual([True, False, True, False, True], list(encoded_dataset.encoded_data.examples[3])) + self.assertListEqual([False, False, False, False, False], list(encoded_dataset.encoded_data.examples[4])) + self.assertListEqual([False, False, False, False, False], list(encoded_dataset.encoded_data.examples[5])) + self.assertListEqual([False, False, False, False, False], list(encoded_dataset.encoded_data.examples[6])) + self.assertListEqual([False, False, False, False, False], list(encoded_dataset.encoded_data.examples[7])) + + shutil.rmtree(path) + + def _disabled_test_generalized(self): + ''' + Old test, disabled as generalized_motifs option does not have a clear purpose as of now. + ''' + + path = EnvironmentSettings.tmp_test_path / "significant_motif_sequence_encoder_generalized/" + dataset = self._prepare_dataset(path) + + lc = LabelConfiguration() + lc.add_label("l1", [1, 2], positive_class=1) + + encoder = MotifEncoder.build_object(dataset, **{ + "min_positions": 1, + "max_positions": 2, + "min_precision": 0.9, + "min_recall": 0.5, + "generalized_motifs": True, + "min_true_positives": 1, + }) + + encoded_dataset = encoder.encode(dataset, EncoderParams( + result_path=path / "encoder_result/", + label_config=lc, + pool_size=2, + learn_model=True, + model={}, + )) + + self.assertEqual(8, encoded_dataset.encoded_data.examples.shape[0]) + self.assertTrue(all(identifier in encoded_dataset.encoded_data.example_ids + for identifier in ['1', '2', '3', '4', '5', '6', '7', '8'])) + + self.assertListEqual(['0-A', '1-A', '1-G', '0&1-A&A', '0&1-A&G', '0&1-A&AG'], encoded_dataset.encoded_data.feature_names) + + self.assertListEqual([True, True, False, True, False, True], 
list(encoded_dataset.encoded_data.examples[0])) + self.assertListEqual([True, False, True, False, True, True], list(encoded_dataset.encoded_data.examples[1])) + self.assertListEqual([True, True, False, True, False, True], list(encoded_dataset.encoded_data.examples[2])) + self.assertListEqual([True, False, True, False, True, True], list(encoded_dataset.encoded_data.examples[3])) + self.assertListEqual([False, False, False, False, False, False], list(encoded_dataset.encoded_data.examples[4])) + self.assertListEqual([False, False, False, False, False, False], list(encoded_dataset.encoded_data.examples[5])) + self.assertListEqual([False, False, False, False, False, False], list(encoded_dataset.encoded_data.examples[6])) + self.assertListEqual([False, False, False, False, False, False], list(encoded_dataset.encoded_data.examples[7])) + + shutil.rmtree(path) \ No newline at end of file diff --git a/test/encodings/motif_encoding/test_PositionalMotifHelper.py b/test/encodings/motif_encoding/test_PositionalMotifHelper.py new file mode 100644 index 000000000..ae269992b --- /dev/null +++ b/test/encodings/motif_encoding/test_PositionalMotifHelper.py @@ -0,0 +1,225 @@ +import os +import shutil +import numpy as np +from unittest import TestCase + + +from immuneML.caching.CacheType import CacheType +from immuneML.data_model.dataset.SequenceDataset import SequenceDataset +from immuneML.data_model.receptor.receptor_sequence.ReceptorSequence import ReceptorSequence +from immuneML.data_model.receptor.receptor_sequence.SequenceMetadata import SequenceMetadata +from immuneML.encodings.motif_encoding.PositionalMotifHelper import PositionalMotifHelper +from immuneML.encodings.motif_encoding.PositionalMotifParams import PositionalMotifParams +from immuneML.environment.Constants import Constants +from immuneML.environment.EnvironmentSettings import EnvironmentSettings +from immuneML.util.PathBuilder import PathBuilder + + +class TestPositionalMotifHelper(TestCase): + + def setUp(self) -> 
None: + os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name + + def _prepare_dataset(self, path): + sequences = [ReceptorSequence(sequence_aa="AA", sequence_id="1", + metadata=SequenceMetadata(custom_params={"l1": 1})), + ReceptorSequence(sequence_aa="CC", sequence_id="2", + metadata=SequenceMetadata(custom_params={"l1": 1})), + ReceptorSequence(sequence_aa="AC", sequence_id="3", + metadata=SequenceMetadata(custom_params={"l1": 1})), + ReceptorSequence(sequence_aa="CA", sequence_id="4", + metadata=SequenceMetadata(custom_params={"l1": 1}))] + + PathBuilder.build(path) + return SequenceDataset.build_from_objects(sequences, 100, PathBuilder.build(path / 'data'), 'd2') + + def test_get_numpy_sequence_representation(self): + path = EnvironmentSettings.tmp_test_path / "positional_motif_sequence_encoder/test_np/" + dataset = self._prepare_dataset(path = path) + output = PositionalMotifHelper.get_numpy_sequence_representation(dataset) + + expected = np.asarray(['A' 'A', 'C' 'C', 'A' 'C', 'C' 'A']).view('U1').reshape(4, -1) + + self.assertEqual(output.shape, expected.shape) + + for i in range(len(output)): + self.assertListEqual(list(output[i]), list(expected[i])) + + for j in range(len(output[i])): + self.assertEqual(type(output[i][j]), type(expected[i][j])) + + shutil.rmtree(path) + + def test_test_aa(self): + sequence_array = np.asarray(['A' 'A', 'B' 'B', 'A' 'B', 'B' 'A']).view('U1').reshape(4, -1) + + self.assertListEqual(list(PositionalMotifHelper.test_aa(sequence_array, 0, "A")), [True, False, True, False]) + self.assertListEqual(list(PositionalMotifHelper.test_aa(sequence_array, 1, "A")), [True, False, False, True]) + self.assertListEqual(list(PositionalMotifHelper.test_aa(sequence_array, 0, "B")), [False, True, False, True]) + self.assertListEqual(list(PositionalMotifHelper.test_aa(sequence_array, 1, "B")), [False, True, True, False]) + + def test_test_position(self): + sequence_array = np.asarray(['A' 'A', 'B' 'B', 'A' 'B', 'C' 'A']).view('U1').reshape(4, -1) 
+ + self.assertListEqual(list(PositionalMotifHelper.test_position(sequence_array, 0, "A")), [True, False, True, False]) + self.assertListEqual(list(PositionalMotifHelper.test_position(sequence_array, 0, "AB")), [True, True, True, False]) + self.assertListEqual(list(PositionalMotifHelper.test_position(sequence_array, 0, "BC")), [False, True, False, True]) + self.assertListEqual(list(PositionalMotifHelper.test_position(sequence_array, 0, "ABC")), [True, True, True, True]) + + def test_test_motif(self): + sequence_array = np.asarray(['A' 'A', 'B' 'B', 'A' 'B', 'B' 'A']).view('U1').reshape(4, -1) + + self.assertListEqual(list(PositionalMotifHelper.test_motif(sequence_array, (0, 1), ("A", "B"))), [False, False, True, False]) + self.assertListEqual(list(PositionalMotifHelper.test_motif(sequence_array, (0, 1), ("E", "E"))), [False, False, False, False]) + self.assertListEqual(list(PositionalMotifHelper.test_motif(sequence_array, (0, 1), ("DE", "DE"))), [False, False, False, False]) + self.assertListEqual(list(PositionalMotifHelper.test_motif(sequence_array, (0, 1), ("A", "BA"))), [True, False, True, False]) + self.assertListEqual(list(PositionalMotifHelper.test_motif(sequence_array, (0, 1), ("AB", "AB"))), [True, True, True, True]) + self.assertListEqual(list(PositionalMotifHelper.test_motif(sequence_array, (0, 1), ("C", "AB"))), [False, False, False, False]) + self.assertListEqual(list(PositionalMotifHelper.test_motif(sequence_array, (0, 1), ("AB", "C"))), [False, False, False, False]) + + def test_extend_motif(self): + np_sequences = np.asarray(['A' 'A', 'C' 'C', 'A' 'C', 'C' 'A']).view('U1').reshape(4, -1) + + outcome = PositionalMotifHelper.extend_motif([[0], ["A"]], np_sequences, {0: ["A", "C"], 1: ["C"]}, count_threshold=1) + self.assertListEqual(outcome, [[[0, 1], ['A', 'C']]]) + + outcome = PositionalMotifHelper.extend_motif([[0], ["A"]], np_sequences, {0: ["A", "C"], 1: ["A", "C", "D"]}, count_threshold=1) + self.assertListEqual(outcome, [[[0, 1], ['A', 'A']], 
[[0, 1], ['A', 'C']]]) + + outcome = PositionalMotifHelper.extend_motif([[0], ["A"]], np_sequences, {0: ["A", "C"], 1: ["A", "C", "D"]}, count_threshold=0) + self.assertListEqual(outcome, [[[0, 1], ['A', 'A']], [[0, 1], ['A', 'C']], [[0, 1], ['A', 'D']]]) + + def test_identify_legal_positional_aas(self): + np_sequences = np.asarray(['A' 'A', 'C' 'C', 'A' 'C', 'C' 'D']).view('U1').reshape(4, -1) + + outcome = PositionalMotifHelper.identify_legal_positional_aas(np_sequences, count_threshold=1) + expected = {0: ["A", "C"], 1: ["A", "C", "D"]} + self.assertDictEqual(expected, outcome) + + outcome = PositionalMotifHelper.identify_legal_positional_aas(np_sequences, count_threshold=2) + expected = {0: ["A", "C"], 1: ["C"]} + self.assertDictEqual(expected, outcome) + + def test_compute_all_candidate_motifs(self): + np_sequences = np.asarray(['A' 'A', 'A' 'A', 'C' 'C']).view('U1').reshape(3, -1) + + outcome = PositionalMotifHelper.compute_all_candidate_motifs(np_sequences, params=PositionalMotifParams(max_positions=1, min_positions=1, count_threshold=2,)) + expected = [[[0], ["A"]], [[1], ["A"]]] + self.assertListEqual(outcome, expected) + + outcome = PositionalMotifHelper.compute_all_candidate_motifs(np_sequences, params=PositionalMotifParams(max_positions=2, min_positions=1, count_threshold=2)) + expected = [[[0], ["A"]], [[1], ["A"]], [[0, 1], ["A", "A"]]] + self.assertListEqual(outcome, expected) + + outcome = PositionalMotifHelper.compute_all_candidate_motifs(np_sequences, params=PositionalMotifParams(max_positions=1, min_positions=1, count_threshold=1)) + expected = [[[0], ["A"]], [[0], ["C"]], [[1], ["A"]], [[1], ["C"]]] + self.assertListEqual(outcome, expected) + + outcome = PositionalMotifHelper.compute_all_candidate_motifs(np_sequences, params=PositionalMotifParams(max_positions=2, min_positions=1, count_threshold=2)) + expected = [[[0], ["A"]], [[1], ["A"]], [[0, 1], ["A", "A"]]] + self.assertListEqual(outcome, expected) + + np_sequences = np.asarray(['A' 'A', 
'A' 'C']).view('U1').reshape(2, -1) + + def _disabled_test_compute_all_candidate_motifs_negative_aas(self): + np_sequences = np.asarray(['A' 'A', 'A' 'A', 'C' 'C']).view('U1').reshape(3, -1) + + outcome = PositionalMotifHelper.compute_all_candidate_motifs(np_sequences, params=PositionalMotifParams(max_positions=2, min_positions=2, count_threshold=1, allow_negative_aas=True)) + expected = [[[0, 1], ["A", "A"]], [[0, 1], ["A", "C"]], [[0, 1], ["A", "a"]], [[0, 1], ["A", "c"]]] + self.assertListEqual(outcome, expected) + + np_sequences = np.asarray(['A' 'D', 'A' 'D', 'A' 'C', 'A' 'F']).view('U1').reshape(4, -1) + + outcome = PositionalMotifHelper.compute_all_candidate_motifs(np_sequences, params=PositionalMotifParams(max_positions=2, min_positions=2, count_threshold=2, allow_negative_aas=True)) + expected = [[[0, 1], ["A", "D"]], [[0, 1], ["A", "d"]]] + self.assertListEqual(outcome, expected) + + def test_add_position_to_base_motif(self): + base_motif = [[0, 5], ["A", "C"]] + + result = PositionalMotifHelper.add_position_to_base_motif(base_motif, 2, "D") + + self.assertListEqual(result[0], [0, 2, 5]) + self.assertListEqual(result[1], ["A", "D", "C"]) + self.assertListEqual(base_motif[0], [0, 5]) + self.assertListEqual(base_motif[1], ["A", "C"]) + + def test_readwrite(self): + path = EnvironmentSettings.tmp_test_path / "positional_motif_sequence_encoder/test_readwrite/" + + original_motifs = [([0], ["A"]), ([1], ["A"]), ([0, 1], ["A", "A"])] + PositionalMotifHelper.write_motifs_to_file(original_motifs, filepath=path / "motifs.tsv") + motifs = PositionalMotifHelper.read_motifs_from_file(filepath=path / "motifs.tsv") + + self.assertListEqual(original_motifs, motifs) + + shutil.rmtree(path) + + def test_get_generalized_motifs(self): + motifs = [[[2, 3, 5], ["A", "A", "A"]], [[2, 3, 5], ["A", "A", "D"]], [[2, 3, 6], ["A", "A", "C"]]] + + result = PositionalMotifHelper.get_generalized_motifs(motifs) + expected = [[[2, 3, 5], ["A", "A", "AD"]]] + + 
self.assertListEqual(result, expected) + + motifs = [[[2, 3, 5], ["A", "A", "A"]], [[2, 3, 7], ["A", "A", "D"]], [[2, 3, 6], ["A", "A", "C"]]] + + result = PositionalMotifHelper.get_generalized_motifs(motifs) + expected = [] + + self.assertListEqual(result, expected) + + motifs = [[[2, 3, 5], ["A", "A", "a"]], [[2, 3, 5], ["A", "A", "D"]], [[2, 3, 5], ["A", "A", "C"]]] + + result = PositionalMotifHelper.get_generalized_motifs(motifs) + expected = [[[2, 3, 5], ["A", "A", "CD"]]] + + self.assertListEqual(result, expected) + + motifs = [[[2, 3, 5], ["A", "A", "a"]], [[2, 3, 5], ["A", "A", "D"]]] + + result = PositionalMotifHelper.get_generalized_motifs(motifs) + expected = [] + + self.assertListEqual(result, expected) + + + + def test__sort_motifs_by_index(self): + motifs = [[[1,2], ["A", "A"]], [[1, 2], ["A", "F"]], [[1, 2], ["G", "D"]], [[5, 6], ["A", "A"]], [[6, 7], ["A", "A"]]] + result = PositionalMotifHelper.sort_motifs_by_index(motifs) + expected = {(1,2): [["A", "A"], ["A", "F"], ["G", "D"]], + (5, 6): [["A", "A"]], + (6, 7): [["A", "A"]]} + + self.assertDictEqual(result, expected) + + def test_get_generalized_motifs_for_index(self): + indices = [2, 3, 5] + all_motif_amino_acids = [["A", "A", "A"], ["A", "A", "C"], ["A", "A", "D"], ["D", "A", "D"]] + + result = list(PositionalMotifHelper.get_generalized_motifs_for_index(indices, all_motif_amino_acids)) + expected = [[[2, 3, 5], ["AD", "A", "D"]], [[2, 3, 5], ["A", "A", "AC"]], [[2, 3, 5], ["A", "A", "AD"]], + [[2, 3, 5], ["A", "A", "CD"]], [[2, 3, 5], ["A", "A", "ACD"]]] + + self.assertListEqual(result, expected) + + def test_get_flex_aa_sets(self): + amino_acids = ["A", "B", "C", "D"] + + result = PositionalMotifHelper.get_flex_aa_sets(amino_acids) + expected = ["AB", "AC", "AD", "BC", "BD", "CD", "ABC", "ABD", "ACD", "BCD", "ABCD"] + + self.assertListEqual(result, expected) + + # def test_identify_n_possible_motifs(self): + # np_sequences = np.asarray(['A' 'A' 'A', 'C' 'C' 'C', 'A' 'C' 'A', 'C' 'D' 
'C']).view('U1').reshape(4, -1) + # + # expected = {1: 7, 2: 10, 3: 4, 4: 0} + # + # result = PositionalMotifHelper.identify_n_possible_motifs(np_sequences, 1, [1,2,3,4]) + # + # self.assertDictEqual(result, expected) + # + # # problem: current approach is not looking at combinations of positions occurring at least once, just all motifs made up of individual positions! diff --git a/test/encodings/motif_encoding/test_SimilarToPositiveSequenceEncoder.py b/test/encodings/motif_encoding/test_SimilarToPositiveSequenceEncoder.py new file mode 100644 index 000000000..df77c5f79 --- /dev/null +++ b/test/encodings/motif_encoding/test_SimilarToPositiveSequenceEncoder.py @@ -0,0 +1,85 @@ +import os +import shutil +from pathlib import Path +from unittest import TestCase + + +from immuneML.caching.CacheType import CacheType +from immuneML.data_model.dataset.SequenceDataset import SequenceDataset +from immuneML.data_model.receptor.receptor_sequence.ReceptorSequence import ReceptorSequence +from immuneML.data_model.receptor.receptor_sequence.SequenceMetadata import SequenceMetadata +from immuneML.dsl.DefaultParamsLoader import DefaultParamsLoader +from immuneML.encodings.EncoderParams import EncoderParams +from immuneML.encodings.motif_encoding.SimilarToPositiveSequenceEncoder import SimilarToPositiveSequenceEncoder +from immuneML.environment.Constants import Constants +from immuneML.environment.EnvironmentSettings import EnvironmentSettings +from immuneML.environment.LabelConfiguration import LabelConfiguration +from immuneML.util.PathBuilder import PathBuilder + + +class TestSimilarToPositiveSequenceEncoder(TestCase): + + def setUp(self) -> None: + os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name + + def _prepare_dataset(self, path): + sequences = [ReceptorSequence(sequence_aa="AACC", sequence_id="5", + metadata=SequenceMetadata(custom_params={"l1": "yes"})), + ReceptorSequence(sequence_aa="AGDD", sequence_id="3", + metadata=SequenceMetadata(custom_params={"l1": "yes"})), 
+ ReceptorSequence(sequence_aa="AAEE", sequence_id="4", + metadata=SequenceMetadata(custom_params={"l1": "yes"})), + ReceptorSequence(sequence_aa="CCCC", sequence_id="1", + metadata=SequenceMetadata(custom_params={"l1": "no"})), + ReceptorSequence(sequence_aa="AGDE", sequence_id="2", + metadata=SequenceMetadata(custom_params={"l1": "no"})), + ReceptorSequence(sequence_aa="EEEE", sequence_id="6", + metadata=SequenceMetadata(custom_params={"l1": "no"}))] + + + PathBuilder.build(path) + dataset = SequenceDataset.build_from_objects(sequences, 100, PathBuilder.build(path / 'data'), 'd2') + + lc = LabelConfiguration() + lc.add_label("l1", ["yes", "no"], positive_class="yes") + + return dataset, lc + + def _get_encoder_params(self, path, lc): + return EncoderParams( + result_path=path / "encoder_result/", + label_config=lc, + pool_size=4, + learn_model=True, + model={}, + ) + + def test_generate(self, compairr_path=None): + path_suffix = "compairr" if compairr_path else "no_compairr" + path = EnvironmentSettings.tmp_test_path / f"significant_motif_sequence_encoder_test_{path_suffix}/" + dataset, lc = self._prepare_dataset(path) + + default_params = DefaultParamsLoader.load(EnvironmentSettings.default_params_path / "encodings/", "similar_to_positive_sequence") + + encoder = SimilarToPositiveSequenceEncoder.build_object(dataset, **{**default_params, **{"hamming_distance": 1, + "compairr_path": compairr_path, + "ignore_genes": True}}) + + encoded_dataset = encoder.encode(dataset, self._get_encoder_params(path, lc)) + + self.assertEqual(6, encoded_dataset.encoded_data.examples.shape[0]) + self.assertTrue(all(identifier in encoded_dataset.encoded_data.example_ids + for identifier in ["1", "2", "3", "4", "5", "6"])) + + self.assertListEqual(["similar_to_positive_sequence"], encoded_dataset.encoded_data.feature_names) + + self.assertListEqual([True, True, True, False, True, False], list(encoded_dataset.encoded_data.examples)) + + shutil.rmtree(path) + + def 
test_generate_with_compairr(self): + compairr_paths = [Path("/usr/local/bin/compairr"), Path("./compairr/src/compairr")] + + for compairr_path in compairr_paths: + if compairr_path.exists(): + self.test_generate(str(compairr_path)) diff --git a/test/encodings/onehot/test_oneHotEncoder.py b/test/encodings/onehot/test_oneHotEncoder.py index af473ed7d..0ef566f7f 100644 --- a/test/encodings/onehot/test_oneHotEncoder.py +++ b/test/encodings/onehot/test_oneHotEncoder.py @@ -103,7 +103,6 @@ def test_positional(self): pool_size=1, learn_model=True, model={}, - filename="dataset.pkl" )) self.assertTrue(isinstance(encoded_data, RepertoireDataset)) @@ -182,8 +181,7 @@ def test_nucleotide(self): encoder = OneHotEncoder.build_object(dataset, **{"use_positional_info": False, "distance_to_seq_middle": None, "flatten": False, 'sequence_type': 'nucleotide'}) - encoded_dataset = encoder.encode(dataset, EncoderParams(result_path=path, label_config=lc, pool_size=1, learn_model=True, model={}, - filename="dataset.pkl")) + encoded_dataset = encoder.encode(dataset, EncoderParams(result_path=path, label_config=lc, pool_size=1, learn_model=True, model={})) self.assertTrue(isinstance(encoded_dataset, RepertoireDataset)) self.assertEqual((2, 3, 4, 4), encoded_dataset.encoded_data.examples.shape) @@ -213,7 +211,6 @@ def test_repertoire_flattened(self): pool_size=1, learn_model=True, model={}, - filename="dataset.pkl" )) self.assertTrue(isinstance(encoded_dataset, RepertoireDataset)) diff --git a/test/encodings/onehot/test_oneHotReceptorEncoder.py b/test/encodings/onehot/test_oneHotReceptorEncoder.py index 3a60113fa..03c22ee1c 100644 --- a/test/encodings/onehot/test_oneHotReceptorEncoder.py +++ b/test/encodings/onehot/test_oneHotReceptorEncoder.py @@ -45,7 +45,6 @@ def test(self): label_config=lc, learn_model=True, model={}, - filename="dataset.pkl" )) self.assertTrue(isinstance(encoded_data, ReceptorDataset)) @@ -97,7 +96,6 @@ def test_receptor_flattened(self): pool_size=1, learn_model=True, 
model={}, - filename="dataset.pkl" )) self.assertTrue(isinstance(encoded_data, ReceptorDataset)) diff --git a/test/encodings/onehot/test_oneHotSequenceEncoder.py b/test/encodings/onehot/test_oneHotSequenceEncoder.py index 813b2f819..e423586bc 100644 --- a/test/encodings/onehot/test_oneHotSequenceEncoder.py +++ b/test/encodings/onehot/test_oneHotSequenceEncoder.py @@ -42,7 +42,6 @@ def test(self): label_config=lc, learn_model=True, model={}, - filename="dataset.pkl" )) self.assertTrue(isinstance(encoded_data, SequenceDataset)) @@ -87,7 +86,6 @@ def test_sequence_flattened(self): pool_size=1, learn_model=True, model={}, - filename="dataset.pkl" )) self.assertTrue(isinstance(encoded_data, SequenceDataset)) diff --git a/test/encodings/reference_encoding/test_matchedReceptorsEncoder.py b/test/encodings/reference_encoding/test_matchedReceptorsEncoder.py index 422359709..6fe105ba5 100644 --- a/test/encodings/reference_encoding/test_matchedReceptorsEncoder.py +++ b/test/encodings/reference_encoding/test_matchedReceptorsEncoder.py @@ -88,7 +88,6 @@ def test__encode_new_dataset(self): encoded = encoder.encode(dataset, EncoderParams( result_path=path, label_config=label_config, - filename="dataset.csv" )) expected_outcome = expected_outcomes[reads][normalize] @@ -131,7 +130,6 @@ def test__encode_new_dataset_sum(self): encoded = encoder.encode(dataset, EncoderParams( result_path=path, label_config=label_config, - filename="dataset.csv" )) expected_outcome = expected_outcomes[reads][normalize] diff --git a/test/encodings/reference_encoding/test_matchedRegexEncoder.py b/test/encodings/reference_encoding/test_matchedRegexEncoder.py index 92dd8e908..44cc7e47e 100644 --- a/test/encodings/reference_encoding/test_matchedRegexEncoder.py +++ b/test/encodings/reference_encoding/test_matchedRegexEncoder.py @@ -74,7 +74,6 @@ def test_encode_no_v_all(self): encoded = encoder.encode(dataset, EncoderParams( result_path=path, label_config=label_config, - filename="dataset.csv" )) 
expected_outcome = [[20, 10, 0, 0], [0, 0, 10, 0], [0, 0, 0, 5]] @@ -83,7 +82,7 @@ def test_encode_no_v_all(self): self.assertListEqual(list(encoded.encoded_data.examples[index]), expected_outcome[index]) self.assertListEqual(["1_IGL", "1_IGH", "2_IGH", "3_IGL"], encoded.encoded_data.feature_names) - self.assertListEqual(["subject_1", "subject_2", "subject_3"], encoded.encoded_data.example_ids) + self.assertListEqual(dataset.get_example_ids(), encoded.encoded_data.example_ids) shutil.rmtree(path) @@ -101,7 +100,6 @@ def test_encode_no_v_unique(self): encoded = encoder.encode(dataset, EncoderParams( result_path=path, label_config=label_config, - filename="dataset.csv" )) expected_outcome = [[2, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]] @@ -110,7 +108,7 @@ def test_encode_no_v_unique(self): self.assertListEqual(list(encoded.encoded_data.examples[index]), expected_outcome[index]) self.assertListEqual(["1_IGL", "1_IGH", "2_IGH", "3_IGL"], encoded.encoded_data.feature_names) - self.assertListEqual(["subject_1", "subject_2", "subject_3"], encoded.encoded_data.example_ids) + self.assertListEqual(dataset.get_example_ids(), encoded.encoded_data.example_ids) shutil.rmtree(path) @@ -128,7 +126,6 @@ def test_encode_with_v_all(self): encoded = encoder.encode(dataset, EncoderParams( result_path=path, label_config=label_config, - filename="dataset.csv" )) expected_outcome = [[10, 10, 0, 0, 0], [0, 0, 10, 0, 0], [0, 0, 0, 0, 5]] @@ -137,6 +134,6 @@ def test_encode_with_v_all(self): self.assertListEqual(list(encoded.encoded_data.examples[index]), expected_outcome[index]) self.assertListEqual(["1_IGL", "1_IGH", "2_IGH", "3_IGL", "4_IGL"], encoded.encoded_data.feature_names) - self.assertListEqual(["subject_1", "subject_2", "subject_3"], encoded.encoded_data.example_ids) + self.assertListEqual(dataset.get_example_ids(), encoded.encoded_data.example_ids) shutil.rmtree(path) diff --git a/test/encodings/reference_encoding/test_matchedSequencesEncoder.py 
b/test/encodings/reference_encoding/test_matchedSequencesEncoder.py index 68f99d257..bee2dfc7b 100644 --- a/test/encodings/reference_encoding/test_matchedSequencesEncoder.py +++ b/test/encodings/reference_encoding/test_matchedSequencesEncoder.py @@ -75,7 +75,6 @@ def test__encode_new_dataset(self): encoded = encoder.encode(dataset, EncoderParams( result_path=path, label_config=label_config, - filename="dataset.csv" )) expected_outcome = expected_outcomes[reads][normalize] @@ -115,7 +114,6 @@ def test__encode_new_dataset_sum(self): encoded = encoder.encode(dataset, EncoderParams( result_path=path, label_config=label_config, - filename="dataset.csv" )) expected_outcome = expected_outcomes[reads][normalize] diff --git a/test/encodings/word2vec/test_word2VecEncoder.py b/test/encodings/word2vec/test_word2VecEncoder.py index 8594acc1e..638aa7822 100644 --- a/test/encodings/word2vec/test_word2VecEncoder.py +++ b/test/encodings/word2vec/test_word2VecEncoder.py @@ -44,7 +44,6 @@ def test_encode_repertoire(self): learn_model=True, result_path=test_path, label_config=label_configuration, - filename="dataset.pkl" ) encoder = Word2VecEncoder.build_object(dataset, **{ @@ -80,7 +79,6 @@ def test_encode_sequences(self): learn_model=True, result_path=test_path / 'encoded', label_config=label_configuration, - filename="dataset.pkl" ) encoder = Word2VecEncoder.build_object(dataset, **{ diff --git a/test/example_weighting/__init__.py b/test/example_weighting/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/test/example_weighting/importance_weighting/__init__.py b/test/example_weighting/importance_weighting/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/test/example_weighting/importance_weighting/test_predefinedWeighting.py b/test/example_weighting/importance_weighting/test_predefinedWeighting.py new file mode 100644 index 000000000..900689b65 --- /dev/null +++ b/test/example_weighting/importance_weighting/test_predefinedWeighting.py @@ 
-0,0 +1,50 @@ +import os +import shutil +from unittest import TestCase +import pandas as pd + +from immuneML.caching.CacheType import CacheType +from immuneML.data_model.dataset.RepertoireDataset import RepertoireDataset +from immuneML.environment.Constants import Constants +from immuneML.environment.EnvironmentSettings import EnvironmentSettings +from immuneML.example_weighting.ExampleWeightingParams import ExampleWeightingParams +from immuneML.example_weighting.predefined_weighting.PredefinedWeighting import PredefinedWeighting +from immuneML.util.RepertoireBuilder import RepertoireBuilder + + +class TestPredefinedWeighting(TestCase): + + def setUp(self) -> None: + os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name + + def _prepare_dataset(self, path: str): + repertoires, metadata = RepertoireBuilder.build([["AAA"], ["AAAC"], ["ACA"], ["CAAA"], ["AAAC"], ["AAA"]], path, + {"l1": [1, 1, 1, 0, 0, 0], "l2": [2, 3, 2, 3, 2, 3]}) + + dataset = RepertoireDataset(repertoires=repertoires, labels={"l1": [0, 1], "l2": [2, 3]}, + metadata_file=metadata) + + weights_path = path / "mock_weights.tsv" + + df = pd.DataFrame({"identifier": dataset.get_example_ids() + ["missing1", "missing2"], + "example_weight": [i for i in range(8)]}) + df.to_csv(weights_path, index=False) + + return dataset, weights_path + + def test_compute_weights(self): + path = EnvironmentSettings.tmp_test_path / "positional_motif_sequence_encoder/test/" + dataset, weights_path = self._prepare_dataset(path) + + importance_weighter = PredefinedWeighting.build_object(dataset, + **{"separator": ",", + "file_path": weights_path} + ) + + w = importance_weighter.compute_weights(dataset, ExampleWeightingParams(result_path=path)) + + self.assertEqual(importance_weighter.file_path, weights_path) + self.assertEqual(w, [i for i in range(6)]) + + shutil.rmtree(path) + diff --git a/test/ml_methods/test_BinaryFeatureClassifier.py b/test/ml_methods/test_BinaryFeatureClassifier.py new file mode 100644 index 
000000000..f921fcfce --- /dev/null +++ b/test/ml_methods/test_BinaryFeatureClassifier.py @@ -0,0 +1,270 @@ +import os +import random +import shutil +from pathlib import Path +from unittest import TestCase + +import numpy as np + +from immuneML.caching.CacheType import CacheType +from immuneML.data_model.encoded_data.EncodedData import EncodedData +from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder +from immuneML.environment.Constants import Constants +from immuneML.environment.EnvironmentSettings import EnvironmentSettings +from immuneML.environment.Label import Label +from immuneML.ml_methods.BinaryFeatureClassifier import BinaryFeatureClassifier +from immuneML.util.PathBuilder import PathBuilder + + +class TestBinaryFeatureClassifier(TestCase): + + def setUp(self) -> None: + os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name + + def get_enc_data(self): + enc_data = EncodedData(encoding=MotifEncoder.__name__, + example_ids=["1", "2", "3", "4", "5", "6", "7", "8"], + feature_names=["useless_rule", "rule1", "rule2", "rule3"], + examples=np.array([[False, True, False, False], + [True, True, False, False], + [False, False, True, True], + [True, False, True, True], + [False, False, False, True], + [True, False, False, True], + [False, False, False, False], + [True, False, False, False]]), + labels={"l1": ["True", "True", "True", "True", "False", "False", "False", "False"]}) + + label = Label("l1", values=[True, False], positive_class=True) + return enc_data, label + + def get_fitted_classifier(self, path, enc_data, label, max_features=None): + motif_classifier = BinaryFeatureClassifier(training_percentage=0.7, + random_seed=1, + max_features=max_features, + patience=10, + min_delta=0, + keep_all=False, + result_path=path) + + random.seed(1) + motif_classifier.fit(encoded_data=enc_data, label=label, + optimization_metric="accuracy") + + return motif_classifier + + def test_fit(self): + path = PathBuilder.build(EnvironmentSettings.tmp_test_path / 
"binary_feature_classifier_fit") + + enc_data, label = self.get_enc_data() + motif_classifier = self.get_fitted_classifier(path, enc_data, label) + + predictions = motif_classifier.predict(enc_data, label) + + self.assertListEqual(sorted(motif_classifier.rule_tree_features), ["rule1", "rule2"]) + self.assertDictEqual(motif_classifier.class_mapping, {0: "False", 1: "True"}) + + self.assertListEqual(list(predictions.keys()), ["l1"]) + self.assertListEqual(list(predictions["l1"]), ["True", "True", "True", "True", "False", "False", "False", "False"]) + + with open(path / "selected_features.txt", "r") as file: + lines = file.readlines() + self.assertEqual(sorted(lines), ['rule1\n', 'rule2\n']) + + shutil.rmtree(path) + + # def test_learn_all(self): + # path = PathBuilder.build(EnvironmentSettings.tmp_test_path / "binary_feature_classifier_fit_learn_all") + # + # enc_data, label = self.get_enc_data() + # motif_classifier = self.get_fitted_classifier(path, enc_data, label, learn_all=True, max_features=2) + # + # predictions = motif_classifier.predict(enc_data, label) + # + # self.assertEqual(motif_classifier.max_features, 4) + # + # self.assertListEqual(motif_classifier.rule_tree_features, ["rule1", "rule2", "rule3", "useless_rule"]) + # self.assertListEqual(motif_classifier.rule_tree_indices, [1, 2, 3, 0]) + # + # self.assertListEqual(list(predictions.keys()), ["l1"]) + # self.assertListEqual(list(predictions["l1"]), ["True", "True", "True", "True", "True", "True", "False", "True"]) + # + # with open(path / "selected_features.txt", "r") as file: + # lines = file.readlines() + # self.assertEqual(sorted(lines), ["rule1\n", "rule2\n", "rule3\n", "useless_rule\n"]) + # + # shutil.rmtree(path) + + def test_load_store(self): + path = PathBuilder.build(EnvironmentSettings.tmp_test_path / "binary_feature_classifier_load_store") + + enc_data, label = self.get_enc_data() + motif_classifier = self.get_fitted_classifier(path, enc_data, label) + + motif_classifier.store(path / 
"model_storage") + + motif_classifier2 = BinaryFeatureClassifier() + motif_classifier2.load(path / "model_storage") + + motif_classifier2_vars = vars(motif_classifier2) + cnn_vars = vars(motif_classifier) + + for item, value in cnn_vars.items(): + if isinstance(value, Label): + self.assertDictEqual(vars(value), (vars(motif_classifier2_vars[item]))) + elif isinstance(value, Path): + pass + else: + self.assertEqual(value, motif_classifier2_vars[item]) + + self.assertEqual(motif_classifier.rule_tree_indices, motif_classifier2.rule_tree_indices) + + predictions = motif_classifier.predict(enc_data, label) + predictions2 = motif_classifier2.predict(enc_data, label) + + self.assertListEqual(list(predictions.keys()), ["l1"]) + self.assertListEqual(list(predictions["l1"]), list(predictions2["l1"])) + + # self.assertListEqual(), ["True", "True", "True", "True", "False", "False", "False", "False"]) + # + # self.assertListEqual(list(predictions2.keys()), ["l1"]) + # self.assertListEqual(list(predictions2["l1"]), ["True", "True", "True", "True", "False", "False", "False", "False"]) + + shutil.rmtree(path) + + def test_recursively_select_rules(self): + motif_classifier = BinaryFeatureClassifier(max_features = 100, + min_delta = 0, + patience = 10) + motif_classifier.optimization_metric = "accuracy" + motif_classifier.class_mapping = {0: False, 1: True} + motif_classifier.label = Label("l1", values=[True, False], positive_class=True) + motif_classifier.feature_names = ["rule"] + + enc_data = EncodedData(encoding=MotifEncoder.__name__, + example_ids=["1", "2", "3", "4"], + feature_names=["rule"], + examples=np.array([[False], + [True], + [False], + [True]]), + labels={"l1": [False, True, False, True]}) + + result_no_improvement_on_training = motif_classifier._recursively_select_rules(enc_data, + enc_data, + prev_val_scores=[1], + prev_rule_indices=[0], + prev_train_predictions=np.array([False, True, False, True]), + prev_val_predictions=np.array([False, True, False, True]), + 
index_candidates=[0], + cores_for_training=2) + + self.assertListEqual(result_no_improvement_on_training, [0]) + + enc_data = EncodedData(encoding=MotifEncoder.__name__, + example_ids=["1", "2", "3", "4"], + feature_names=["rule1", "rule2", "rule3"], + examples=np.array([[True, False, False], + [False, True, False], + [False, False, False], + [False, False, False]]), + labels={"l1": [True, True, True, True]}) + + motif_classifier.feature_names = ["rule1", "rule2", "rule3"] + + result_add_one_rule = motif_classifier._recursively_select_rules(enc_data, enc_data, + prev_val_scores=[0], + prev_rule_indices=[0], + prev_train_predictions=np.array([True, False, False, False]), + prev_val_predictions=np.array([True, False, False, False]), + index_candidates=[0, 1, 2], + cores_for_training=2) + self.assertListEqual(result_add_one_rule, [0, 1]) + + motif_classifier.max_features = 1 + + result_max_motifs_reached = motif_classifier._recursively_select_rules(enc_data, enc_data, + prev_val_scores=[], + prev_rule_indices=[], + prev_train_predictions=np.array([False, False, False, False]), + prev_val_predictions=np.array([False, False, False, False]), + index_candidates=[0, 1, 2], + cores_for_training=2) + self.assertListEqual(result_max_motifs_reached, [0]) + + motif_classifier.max_features = 2 + + result_max_motifs_reached = motif_classifier._recursively_select_rules(enc_data, enc_data, + prev_val_scores=[], + prev_rule_indices=[], + prev_train_predictions=np.array([False, False, False, False]), + prev_val_predictions=np.array([False, False, False, False]), + index_candidates=[0, 1, 2], + cores_for_training=2) + self.assertListEqual(result_max_motifs_reached, [0, 1]) + + + def test_get_rule_tree_features_from_indices(self): + motif_classifier = BinaryFeatureClassifier() + features = motif_classifier._get_rule_tree_features_from_indices([0, 2], ["A", "B", "C"]) + + self.assertListEqual(features, ["A", "C"]) + + def test_test_is_improvement(self): + motif_classifier = 
BinaryFeatureClassifier() + + result = motif_classifier._test_is_improvement([0.0, 0.1, 0.5, 1], 0.1) + self.assertListEqual(result, [True, False, True, True]) + + result = motif_classifier._test_is_improvement([0, 0, 0, 1], 0) + self.assertListEqual(result, [True, False, False, True]) + + result = motif_classifier._test_is_improvement([0], 0) + self.assertListEqual(result, [True]) + + def test_test_earlystopping(self): + motif_classifier = BinaryFeatureClassifier(patience=5) + + self.assertEqual(motif_classifier._test_earlystopping([]), False) + self.assertEqual(motif_classifier._test_earlystopping([False, False, False]), False) + self.assertEqual(motif_classifier._test_earlystopping([True, True, True]), False) + self.assertEqual(motif_classifier._test_earlystopping([True, True, True, True, True]), False) + self.assertEqual(motif_classifier._test_earlystopping([False, False, False, False, False]), True) + self.assertEqual(motif_classifier._test_earlystopping([True, True, True, False, False, False, False, False]), True) + + def test_get_optimal_indices(self): + motif_classifier = BinaryFeatureClassifier(patience=3) + + + result = motif_classifier._get_optimal_indices([1,2,3,4,5,6,7,8,9,10], [True, True, True, False, False]) + self.assertListEqual(result, [1,2,3]) + + result = motif_classifier._get_optimal_indices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [True]) + self.assertListEqual(result, [1]) + + result = motif_classifier._get_optimal_indices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [True, False, False, False, True]) + self.assertListEqual(result, [1, 2, 3, 4, 5]) + + def test_get_rule_tree_predictions(self): + enc_data = EncodedData(encoding=MotifEncoder.__name__, + example_ids=["1", "2", "3", "4"], + feature_names=["rule1", "rule2", "rule3"], + examples=np.array([[True, False, False], + [False, True, False], + [False, False, False], + [False, False, False]]), + labels={"l1": [True, True, True, True]}) + + motif_classifier = BinaryFeatureClassifier() + 
motif_classifier.feature_names = ["rule1", "rule2", "rule3"] + + result = motif_classifier._get_rule_tree_predictions_bool(enc_data, [0]) + self.assertListEqual(list(result), [True, False, False, False]) + + result = motif_classifier._get_rule_tree_predictions_bool(enc_data, [0, 1]) + self.assertListEqual(list(result), [True, True, False, False]) + + result = motif_classifier._get_rule_tree_predictions_bool(enc_data, [0, 1, 2]) + self.assertListEqual(list(result), [True, True, False, False]) + + diff --git a/test/ml_methods/test_kerasSequenceCNN.py b/test/ml_methods/test_kerasSequenceCNN.py new file mode 100644 index 000000000..038dda884 --- /dev/null +++ b/test/ml_methods/test_kerasSequenceCNN.py @@ -0,0 +1,98 @@ +import os +import shutil +from unittest import TestCase + +from immuneML.caching.CacheType import CacheType +from immuneML.encodings.EncoderParams import EncoderParams +from immuneML.encodings.kmer_frequency.KmerFreqSequenceEncoder import KmerFreqSequenceEncoder +from immuneML.encodings.onehot.OneHotReceptorEncoder import OneHotReceptorEncoder +from immuneML.encodings.onehot.OneHotSequenceEncoder import OneHotSequenceEncoder +from immuneML.environment.Constants import Constants +from immuneML.environment.EnvironmentSettings import EnvironmentSettings +from immuneML.environment.Label import Label +from immuneML.environment.LabelConfiguration import LabelConfiguration +from immuneML.ml_methods.KerasSequenceCNN import KerasSequenceCNN +from immuneML.simulation.dataset_generation.RandomDatasetGenerator import RandomDatasetGenerator +from immuneML.util.PathBuilder import PathBuilder + + + + +class TestKerasSequenceCNN(TestCase): + + def setUp(self) -> None: + os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name + + def test_if_keras_installed(self): + try: + import keras + from keras.optimizers import Adam + self._test_fit() + except ImportError as e: + print("Test ignored since keras is not installed.") + + + def _test_fit(self): + import keras + path = 
PathBuilder.build(EnvironmentSettings.tmp_test_path / "keras_cnn") + + dataset = RandomDatasetGenerator.generate_sequence_dataset(sequence_count=500, length_probabilities={5: 1}, + labels={"CMV": {"yes": 0.5, "no": 0.5}}, path=path / "dataset") + + label = Label("CMV", values=["yes", "no"], positive_class="yes") + encoder = OneHotSequenceEncoder(False, None, False, "enc1") + enc_dataset = encoder.encode(dataset, EncoderParams(path / "result", LabelConfiguration([label]))) + + cnn = KerasSequenceCNN(units_per_layer=[['CONV', 400, 3, 1], + ['DROP', 0.5], + ['POOL', 2, 1], + ['FLAT'], + ['DENSE', 50]], + activation="relu", + training_percentage=0.7) + + cnn.check_encoder_compatibility(encoder) + self.assertRaises(ValueError, lambda: cnn.check_encoder_compatibility(OneHotReceptorEncoder(use_positional_info=False, distance_to_seq_middle=1, flatten=False))) + self.assertRaises(ValueError, lambda: cnn.check_encoder_compatibility(KmerFreqSequenceEncoder(normalization_type=None, reads=None, sequence_encoding=None))) + self.assertRaises(AssertionError, lambda: cnn.check_encoder_compatibility(OneHotSequenceEncoder(use_positional_info=True, distance_to_seq_middle=1, flatten=False))) + self.assertRaises(AssertionError, lambda: cnn.check_encoder_compatibility(OneHotSequenceEncoder(use_positional_info=False, distance_to_seq_middle=1, flatten=True))) + + cnn.fit(encoded_data=enc_dataset.encoded_data, label=label) + + predictions = cnn.predict(enc_dataset.encoded_data, label) + self.assertEqual(500, len(predictions["CMV"])) + self.assertEqual(500, len([pred for pred in predictions["CMV"]])) + + predictions_proba = cnn.predict_proba(enc_dataset.encoded_data, label) + self.assertEqual(500 * [1], list(predictions_proba["CMV"]["yes"] + predictions_proba["CMV"]["no"])) + self.assertEqual(500, predictions_proba["CMV"]["yes"].shape[0]) + self.assertEqual(500, predictions_proba["CMV"]["no"].shape[0]) + + self.assertListEqual(list(predictions_proba["CMV"]["yes"] > 0.5), [pred == "yes" for 
pred in list(predictions["CMV"])]) + + cnn.store(path / "model_storage") + + cnn2 = KerasSequenceCNN() + cnn2.load(path / "model_storage") + + cnn2_vars = vars(cnn2) + del cnn2_vars["CNN"] + cnn_vars = vars(cnn) + del cnn_vars["CNN"] + + for item, value in cnn_vars.items(): + if isinstance(value, Label): + self.assertDictEqual(vars(value), (vars(cnn2_vars[item]))) + elif not isinstance(value, keras.Sequential): + self.assertEqual(value, cnn2_vars[item]) + + predictions_proba2 = cnn2.predict_proba(enc_dataset.encoded_data, label) + + print(predictions_proba2) + + self.assertTrue(all(predictions_proba["CMV"]["yes"] == predictions_proba2["CMV"]["yes"])) + self.assertTrue(all(predictions_proba["CMV"]["no"] == predictions_proba2["CMV"]["no"])) + + shutil.rmtree(path) + + diff --git a/test/presentation/html/test_ExploratoryAnalysisHTMLBuilder.py b/test/presentation/html/test_ExploratoryAnalysisHTMLBuilder.py index 90ddd3a2c..1915e9a30 100644 --- a/test/presentation/html/test_ExploratoryAnalysisHTMLBuilder.py +++ b/test/presentation/html/test_ExploratoryAnalysisHTMLBuilder.py @@ -60,7 +60,6 @@ def test_build(self): encoded = encoder.encode(dataset, EncoderParams( result_path=path, label_config=label_config, - filename="dataset.csv" )) units = {"named_analysis_1": ExploratoryAnalysisUnit(dataset=dataset, report=SequenceLengthDistribution(), number_of_processes=16), diff --git a/test/reports/data_reports/test_MotifGeneralizationAnalysis.py b/test/reports/data_reports/test_MotifGeneralizationAnalysis.py new file mode 100644 index 000000000..8f64cc659 --- /dev/null +++ b/test/reports/data_reports/test_MotifGeneralizationAnalysis.py @@ -0,0 +1,71 @@ +import os +import shutil +import pandas as pd +from unittest import TestCase + +from immuneML.dsl.DefaultParamsLoader import DefaultParamsLoader +from immuneML.environment.EnvironmentSettings import EnvironmentSettings +from immuneML.reports.data_reports.MotifGeneralizationAnalysis import MotifGeneralizationAnalysis +from 
immuneML.simulation.dataset_generation.RandomDatasetGenerator import RandomDatasetGenerator
+from immuneML.util.PathBuilder import PathBuilder
+
+
+class TestMotifGeneralizationAnalysis(TestCase):
+    def test_generate(self):
+        path = PathBuilder.build(EnvironmentSettings.tmp_test_path / "significant_motif_overlap/")
+
+        dataset = RandomDatasetGenerator.generate_sequence_dataset(100, {10: 1}, {"l1": {"A": 0.5, "B": 0.5}}, path / "dataset")
+
+        identifiers = [seq.identifier for seq in dataset.get_data()]
+        training_set_identifiers = identifiers[::2]
+
+        with open(path / "training_ids.txt", "w") as identifiers_file:
+            identifiers_file.writelines("example_id\n")
+            identifiers_file.writelines([identifier + "\n" for identifier in training_set_identifiers])
+
+        params = DefaultParamsLoader.load(EnvironmentSettings.default_params_path / "reports/", "MotifGeneralizationAnalysis")
+        params["training_set_identifier_path"] = str(path / "training_ids.txt")
+        params["min_positions"] = 1
+        params["max_positions"] = 1
+        params["min_precision"] = 0.8
+        params["split_by_motif_size"] = True
+        params["random_seed"] = 1
+        params["dataset"] = dataset
+        params["result_path"] = path / "result"
+        params["label"] = {"l1": {"positive_class": "A"}}
+
+        report = MotifGeneralizationAnalysis.build_object(**params)
+
+        report._generate()
+
+
+        self.assertTrue(os.path.isdir(path / "result/datasets/train"))
+        self.assertTrue(os.path.isdir(path / "result/datasets/test"))
+        self.assertTrue(os.path.isdir(path / "result/encoded_data"))
+
+        self.assertTrue(os.path.isfile(path / "result/training_set_scores_motif_size=1.csv"))
+        self.assertTrue(os.path.isfile(path / "result/test_set_scores_motif_size=1.csv"))
+        self.assertTrue(os.path.isfile(path / "result/training_combined_precision_motif_size=1.csv"))
+        self.assertTrue(os.path.isfile(path / "result/test_combined_precision_motif_size=1.csv"))
+
+        self.assertTrue(os.path.isfile(path / "result/training_precision_per_tp_motif_size=1.html"))
+        self.assertTrue(os.path.isfile(path / "result/test_precision_per_tp_motif_size=1.html"))
+
+        self.assertTrue(os.path.isfile(path / "result/training_precision_recall_motif_size=1.html"))
+        self.assertTrue(os.path.isfile(path / "result/test_precision_recall_motif_size=1.html"))
+
+        self.assertTrue(os.path.isfile(path / "result/tp_recall_cutoffs.tsv"))
+
+        shutil.rmtree(path)
+
+
+    def test_set_tp_cutoff(self):
+        test_df = pd.DataFrame({"training_TP": [1, 2, 3, 4, 5, 6, 7, 8], "combined_precision": [0.1, 0.2, 0.3, 0.4, 0.8, 0.6, 0.7, 0.8]})
+        ma = MotifGeneralizationAnalysis()
+
+        ma.test_precision_threshold = 0.7
+        self.assertEqual(ma._determine_tp_cutoff(test_df), 7)
+
+        ma.test_precision_threshold = 1
+        self.assertEqual(ma._determine_tp_cutoff(test_df), None)
+
diff --git a/test/reports/data_reports/test_WeightsDistribution.py b/test/reports/data_reports/test_WeightsDistribution.py
new file mode 100644
index 000000000..e229aa9e4
--- /dev/null
+++ b/test/reports/data_reports/test_WeightsDistribution.py
@@ -0,0 +1,122 @@
+import plotly.express as px
+import warnings
+from pathlib import Path
+
+import pandas as pd
+
+from immuneML.data_model.dataset.SequenceDataset import SequenceDataset
+from immuneML.reports.ReportOutput import ReportOutput
+from immuneML.reports.ReportResult import ReportResult
+from immuneML.reports.data_reports.DataReport import DataReport
+from immuneML.util.PathBuilder import PathBuilder
+from immuneML.dsl.instruction_parsers.LabelHelper import LabelHelper
+
+import os
+import shutil
+import pandas as pd
+from unittest import TestCase
+
+from immuneML.caching.CacheType import CacheType
+from immuneML.data_model.dataset.SequenceDataset import SequenceDataset
+from immuneML.data_model.receptor.receptor_sequence.ReceptorSequence import ReceptorSequence
+from immuneML.data_model.receptor.receptor_sequence.SequenceMetadata import SequenceMetadata
+from immuneML.encodings.EncoderParams import EncoderParams
+from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder
+from immuneML.reports.data_reports.WeightsDistribution import WeightsDistribution
+from immuneML.environment.LabelConfiguration import LabelConfiguration
+from immuneML.environment.Constants import Constants
+from immuneML.environment.EnvironmentSettings import EnvironmentSettings
+from immuneML.reports.ReportResult import ReportResult
+from immuneML.util.PathBuilder import PathBuilder
+
+class TestWeightsDistribution(TestCase):
+    def setUp(self) -> None:
+        os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name
+
+    def _create_dummy_encoded_data(self, path):
+        sequences = [
+            ReceptorSequence(
+                sequence_aa="AACC",
+                sequence_id="1",
+                metadata=SequenceMetadata(custom_params={"l1": 1}),
+            ),
+            ReceptorSequence(
+                sequence_aa="AGDD",
+                sequence_id="2",
+                metadata=SequenceMetadata(custom_params={"l1": 1}),
+            ),
+            ReceptorSequence(
+                sequence_aa="AAEE",
+                sequence_id="3",
+                metadata=SequenceMetadata(custom_params={"l1": 1}),
+            ),
+            ReceptorSequence(
+                sequence_aa="AGFF",
+                sequence_id="4",
+                metadata=SequenceMetadata(custom_params={"l1": 1}),
+            ),
+            ReceptorSequence(
+                sequence_aa="CCCC",
+                sequence_id="5",
+                metadata=SequenceMetadata(custom_params={"l1": 2}),
+            ),
+            ReceptorSequence(
+                sequence_aa="DDDD",
+                sequence_id="6",
+                metadata=SequenceMetadata(custom_params={"l1": 2}),
+            ),
+            ReceptorSequence(
+                sequence_aa="EEEE",
+                sequence_id="7",
+                metadata=SequenceMetadata(custom_params={"l1": 2}),
+            ),
+            ReceptorSequence(
+                sequence_aa="FFFF",
+                sequence_id="8",
+                metadata=SequenceMetadata(custom_params={"l1": 2}),
+            ),
+        ]
+
+        PathBuilder.build(path)
+
+        dataset = SequenceDataset.build_from_objects(
+            sequences, 100, PathBuilder.build(path / "data"), "d1"
+        )
+
+        lc = LabelConfiguration()
+        lc.add_label("l1", [1, 2], positive_class=1)
+
+        dataset.set_example_weights([i+1 for i in range(dataset.get_example_count())])
+
+
+        return dataset
+
+    def test_generate(self):
+        path = EnvironmentSettings.tmp_test_path / "weight_distribution/"
+        PathBuilder.build(path)
+
+        encoded_dataset = self._create_dummy_encoded_data(path)
+
+        label = "is_binding"
+        weight_thresholds = [0.001, 0.01, 0.1]
+        split_classes = True
+
+        report = WeightsDistribution.build_object(
+            **{"dataset": encoded_dataset, "result_path": path, "label": "l1"}
+        )
+
+        self.assertTrue(report.check_prerequisites())
+
+        result = report._generate()
+
+        self.assertIsInstance(result, ReportResult)
+
+        # self.assertEqual(result.output_figures[0].path, path / "gap_size_for_motif_size_2.html")
+
+        # content = pd.read_csv(path / "gap_size_table_motif_size_2.csv")
+        # self.assertEqual((list(content.columns))[1], "Gap size, occurrence")
+        #
+        # content = pd.read_csv(path / "positional_aa_counts.csv")
+        # self.assertEqual(list(content.index), [i for i in range(4)])
+
+        shutil.rmtree(path)
\ No newline at end of file
diff --git a/test/reports/encoding_reports/test_GroundTruthMotifOverlap.py b/test/reports/encoding_reports/test_GroundTruthMotifOverlap.py
new file mode 100644
index 000000000..13d5238fe
--- /dev/null
+++ b/test/reports/encoding_reports/test_GroundTruthMotifOverlap.py
@@ -0,0 +1,70 @@
+import os
+from unittest import TestCase
+import pandas as pd
+import shutil
+
+from immuneML.encodings.EncoderParams import EncoderParams
+from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder
+from immuneML.environment.EnvironmentSettings import EnvironmentSettings
+from immuneML.environment.LabelConfiguration import LabelConfiguration
+from immuneML.reports.encoding_reports.GroundTruthMotifOverlap import GroundTruthMotifOverlap
+from immuneML.simulation.dataset_generation.RandomDatasetGenerator import RandomDatasetGenerator
+from immuneML.util.PathBuilder import PathBuilder
+
+
+class TestGroundTruthMotifOverlap(TestCase):
+
+    def _get_encoded_dataset(self, path):
+        dataset = RandomDatasetGenerator.generate_sequence_dataset(10, {10: 1}, {"is_binder": {"yes": 0.5, "no": 0.5}},
+                                                                   path / "input_dataset")
+
+        lc = LabelConfiguration()
+        lc.add_label("is_binder", ["yes", "no"], positive_class="yes")
+
+        encoder = MotifEncoder.build_object(dataset, **{
+            "min_positions": 1,
+            "max_positions": 1,
+            "min_precision": 0.1,
+            "min_recall": 0,
+            "min_true_positives": 1,
+        })
+
+        encoded_dataset = encoder.encode(dataset, EncoderParams(
+            result_path=path / "encoder_result/",
+            label_config=lc,
+            pool_size=4,
+            learn_model=True,
+            model={},
+        ))
+
+        return encoded_dataset
+
+    def _write_groundtruth_motifs_file(self, path):
+        file_path = path / "gt_motifs.tsv"
+        with open(file_path, "w") as file:
+            file.writelines(["indices\tamino_acids\tn_sequences\n", "1\tI\t6\n", "5\tN\t10\n", "0\tA\t4\n", "4&7\t0&1\t30\n"])
+
+        return file_path
+
+    def test_generate(self):
+        path = PathBuilder.build(EnvironmentSettings.tmp_test_path / "motif_test_set_performance/")
+
+        report = GroundTruthMotifOverlap.build_object(**{"groundtruth_motifs_path": str(self._write_groundtruth_motifs_file(path))})
+
+        report.dataset = self._get_encoded_dataset(path)
+        report.result_path = path / "result_path"
+
+        self.assertTrue(report.check_prerequisites())
+
+        report._generate()
+
+        self.assertTrue(os.path.isfile(path / "result_path/ground_truth_motif_overlap.tsv"))
+
+        df = pd.read_csv(path / "result_path/ground_truth_motif_overlap.tsv", sep="\t")
+
+        if len(df) > 0:
+            self.assertTrue(os.path.isfile(path / "result_path/motif_overlap.html"))
+
+        shutil.rmtree(path)
+
+
diff --git a/test/reports/encoding_reports/test_Matches.py b/test/reports/encoding_reports/test_Matches.py
index 2368b1884..4e6615678 100644
--- a/test/reports/encoding_reports/test_Matches.py
+++ b/test/reports/encoding_reports/test_Matches.py
@@ -254,12 +254,12 @@ def test_generate_for_matchedregex(self):
         self.assertTrue(os.path.isfile(path / "report_results/repertoire_sizes.csv"))
         self.assertTrue(os.path.isdir(path / "report_results/paired_matches"))
-        self.assertTrue(os.path.isfile(path / "report_results/paired_matches/example_subject_1_label_yes_subject_id_subject_1.csv"))
-        self.assertTrue(os.path.isfile(path / "report_results/paired_matches/example_subject_2_label_no_subject_id_subject_2.csv"))
-        self.assertTrue(os.path.isfile(path / "report_results/paired_matches/example_subject_3_label_no_subject_id_subject_3.csv"))
+        self.assertTrue(os.path.isfile(path / f"report_results/paired_matches/example_{encoded_data.get_example_ids()[0]}_label_yes_subject_id_subject_1.csv"))
+        self.assertTrue(os.path.isfile(path / f"report_results/paired_matches/example_{encoded_data.get_example_ids()[1]}_label_no_subject_id_subject_2.csv"))
+        self.assertTrue(os.path.isfile(path / f"report_results/paired_matches/example_{encoded_data.get_example_ids()[2]}_label_no_subject_id_subject_3.csv"))
         matches = pd.read_csv(path / "report_results/complete_match_count_table.csv")
-        subj1_results = pd.read_csv(path / "report_results/paired_matches/example_subject_1_label_yes_subject_id_subject_1.csv")
+        subj1_results = pd.read_csv(path / f"report_results/paired_matches/example_{encoded_data.get_example_ids()[0]}_label_yes_subject_id_subject_1.csv")
         self.assertListEqual(list(matches["1_TRA"]), [20, 0, 0])
         self.assertListEqual(list(matches["1_TRB"]), [10, 0, 0])
diff --git a/test/reports/encoding_reports/test_MotifTestSetPerformance.py b/test/reports/encoding_reports/test_MotifTestSetPerformance.py
new file mode 100644
index 000000000..555ac1d1f
--- /dev/null
+++ b/test/reports/encoding_reports/test_MotifTestSetPerformance.py
@@ -0,0 +1,89 @@
+import os
+import shutil
+from unittest import TestCase
+
+from immuneML.IO.dataset_export.AIRRExporter import AIRRExporter
+from immuneML.dsl.DefaultParamsLoader import DefaultParamsLoader
+from immuneML.encodings.EncoderParams import EncoderParams
+from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder
+from immuneML.environment.EnvironmentSettings import EnvironmentSettings
+from immuneML.environment.LabelConfiguration import LabelConfiguration
+from immuneML.reports.encoding_reports.MotifTestSetPerformance import MotifTestSetPerformance
+from immuneML.simulation.dataset_generation.RandomDatasetGenerator import RandomDatasetGenerator
+
+
+class TestMotifTestSetPerformance(TestCase):
+
+    def _get_exported_test_dataset(self, path):
+        test_dataset = RandomDatasetGenerator.generate_sequence_dataset(50, {10: 1}, {"is_binder": {"yes": 0.5, "no": 0.5}},
+                                                                        path / "test_random_dataset")
+
+        export_path = path / "test_airr_dataset"
+        AIRRExporter.export(dataset=test_dataset, path=export_path)
+
+        return export_path
+
+    def _get_encoded_dataset(self, path):
+        dataset = RandomDatasetGenerator.generate_sequence_dataset(10, {10: 1}, {"is_binder": {"yes": 0.5, "no": 0.5}},
+                                                                   path / "input_dataset")
+
+        lc = LabelConfiguration()
+        lc.add_label("is_binder", ["yes", "no"], positive_class="yes")
+
+        encoder = MotifEncoder.build_object(dataset, **{
+            "min_positions": 1,
+            "max_positions": 1,
+            "min_precision": 0.1,
+            "min_recall": 0,
+            "min_true_positives": 1,
+        })
+
+        encoded_dataset = encoder.encode(dataset, EncoderParams(
+            result_path=path / "encoder_result/",
+            label_config=lc,
+            pool_size=4,
+            learn_model=True,
+            model={},
+        ))
+
+        return encoded_dataset
+
+    def _write_highlight_motifs_file(self, path):
+        file_path = path / "highlight_motifs.tsv"
+        with open(file_path, "w") as file:
+            file.writelines(["indices\tamino_acids\n", "1\tI\n", "5\tN\n", "0\tA\n", "4&7\t0&1\n"])
+
+        return file_path
+
+
+    def test_generate(self):
+        path = EnvironmentSettings.tmp_test_path / "motif_test_set_performance/"
+
+        test_dataset_path = self._get_exported_test_dataset(path)
+
+        params = DefaultParamsLoader.load(EnvironmentSettings.default_params_path / "reports/",
+                                          "MotifTestSetPerformance")
+        params["test_dataset"] = {"format": "AIRR",
+                                  "params": {"path": str(test_dataset_path),
+                                             "metadata_column_mapping": {"is_binder": "is_binder"}}}
+        params["name"] = "motif_set_perf"
+        params["highlight_motifs_path"] = str(self._write_highlight_motifs_file(path))
+
+        report = MotifTestSetPerformance.build_object(**params)
+
+        report.dataset = self._get_encoded_dataset(path)
+        report.result_path = path / "result_path"
+
+        self.assertTrue(report.check_prerequisites())
+
+        report._generate()
+
+        self.assertTrue(os.path.isfile(path / "result_path/training_set_scores_motif_size=1.csv"))
+        self.assertTrue(os.path.isfile(path / "result_path/training_combined_precision_motif_size=1.csv"))
+        self.assertTrue(os.path.isfile(path / "result_path/test_combined_precision_motif_size=1.csv"))
+        self.assertTrue(os.path.isfile(path / "result_path/training_precision_per_tp_motif_size=1.html"))
+        self.assertTrue(os.path.isfile(path / "result_path/test_precision_per_tp_motif_size=1.html"))
+
+        shutil.rmtree(path)
+
+
diff --git a/test/reports/encoding_reports/test_NonMotifSequenceSimilarity.py b/test/reports/encoding_reports/test_NonMotifSequenceSimilarity.py
new file mode 100644
index 000000000..8ede1c1ad
--- /dev/null
+++ b/test/reports/encoding_reports/test_NonMotifSequenceSimilarity.py
@@ -0,0 +1,73 @@
+import os
+import shutil
+from unittest import TestCase
+
+from immuneML.caching.CacheType import CacheType
+from immuneML.encodings.EncoderParams import EncoderParams
+from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder
+from immuneML.reports.encoding_reports.NonMotifSequenceSimilarity import NonMotifSequenceSimilarity
+from immuneML.environment.LabelConfiguration import LabelConfiguration
+from immuneML.environment.Constants import Constants
+from immuneML.environment.EnvironmentSettings import EnvironmentSettings
+from immuneML.reports.ReportResult import ReportResult
+from immuneML.simulation.dataset_generation.RandomDatasetGenerator import RandomDatasetGenerator
+from immuneML.util.PathBuilder import PathBuilder
+
+
+class TestNonMotifSequenceSimilarity(TestCase):
+    def setUp(self) -> None:
+        os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name
+
+    def _create_dummy_encoded_data(self, path):
+        dataset = RandomDatasetGenerator.generate_sequence_dataset(10, {10: 1}, {"l1": {"A": 0.5, "B": 0.5}},
+                                                                   path / "dataset")
+
+        lc = LabelConfiguration()
+        lc.add_label("l1", ["A", "B"], positive_class="A")
+
+        encoder = MotifEncoder.build_object(
+            dataset,
+            **{
+                "max_positions": 1,
+                "min_positions": 1,
+                "min_precision": 0.1,
+                "min_recall": 0,
+                "min_true_positives": 1,
+            }
+        )
+
+        encoded_dataset = encoder.encode(
+            dataset,
+            EncoderParams(
+                result_path=path / "encoded_data/",
+                label_config=lc,
+                pool_size=2,
+                learn_model=True,
+                model={},
+            ),
+        )
+
+        return encoded_dataset
+
+    def test_generate(self):
+        path = EnvironmentSettings.tmp_test_path / "non_motif_sequence_similarity/"
+        PathBuilder.build(path)
+
+        encoded_dataset = self._create_dummy_encoded_data(path)
+
+        report = NonMotifSequenceSimilarity.build_object(
+            **{"dataset": encoded_dataset, "result_path": path,
+               "motif_color_map": {1: "#66C5CC", 2: "#F6CF71", 3: "#F89C74"}}
+        )
+
+        self.assertTrue(report.check_prerequisites())
+
+        result = report._generate()
+
+        self.assertIsInstance(result, ReportResult)
+
+        self.assertTrue(os.path.isfile(path / "sequence_hamming_distances.html"))
+        self.assertTrue(os.path.isfile(path / "sequence_hamming_distances_percentage.tsv"))
+        self.assertTrue(os.path.isfile(path / "sequence_hamming_distances_raw.tsv"))
+
+        shutil.rmtree(path)
diff --git a/test/reports/encoding_reports/test_PositionalMotifFrequencies.py b/test/reports/encoding_reports/test_PositionalMotifFrequencies.py
new file mode 100644
index 000000000..9d2911d0b
--- /dev/null
+++ b/test/reports/encoding_reports/test_PositionalMotifFrequencies.py
@@ -0,0 +1,136 @@
+import os
+import shutil
+import pandas as pd
+from unittest import TestCase
+
+from immuneML.caching.CacheType import CacheType
+from immuneML.data_model.dataset.SequenceDataset import SequenceDataset
+from immuneML.data_model.receptor.receptor_sequence.ReceptorSequence import ReceptorSequence
+from immuneML.data_model.receptor.receptor_sequence.SequenceMetadata import SequenceMetadata
+from immuneML.encodings.EncoderParams import EncoderParams
+from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder
+from immuneML.reports.encoding_reports.PositionalMotifFrequencies import PositionalMotifFrequencies
+from immuneML.environment.LabelConfiguration import LabelConfiguration
+from immuneML.environment.Constants import Constants
+from immuneML.environment.EnvironmentSettings import EnvironmentSettings
+from immuneML.reports.ReportResult import ReportResult
+from immuneML.util.PathBuilder import PathBuilder
+
+
+class TestPositionalMotifFrequencies(TestCase):
+    def setUp(self) -> None:
+        os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name
+
+    def _create_dummy_encoded_data(self, path):
+        sequences = [
+            ReceptorSequence(
+                sequence_aa="AACC",
+                sequence_id="1",
+                metadata=SequenceMetadata(custom_params={"l1": 1}),
+            ),
+            ReceptorSequence(
+                sequence_aa="AGDD",
+                sequence_id="2",
+                metadata=SequenceMetadata(custom_params={"l1": 1}),
+            ),
+            ReceptorSequence(
+                sequence_aa="AAEE",
+                sequence_id="3",
+                metadata=SequenceMetadata(custom_params={"l1": 1}),
+            ),
+            ReceptorSequence(
+                sequence_aa="AGFF",
+                sequence_id="4",
+                metadata=SequenceMetadata(custom_params={"l1": 1}),
+            ),
+            ReceptorSequence(
+                sequence_aa="CCCC",
+                sequence_id="5",
+                metadata=SequenceMetadata(custom_params={"l1": 2}),
+            ),
+            ReceptorSequence(
+                sequence_aa="DDDD",
+                sequence_id="6",
+                metadata=SequenceMetadata(custom_params={"l1": 2}),
+            ),
+            ReceptorSequence(
+                sequence_aa="EEEE",
+                sequence_id="7",
+                metadata=SequenceMetadata(custom_params={"l1": 2}),
+            ),
+            ReceptorSequence(
+                sequence_aa="FFFF",
+                sequence_id="8",
+                metadata=SequenceMetadata(custom_params={"l1": 2}),
+            ),
+        ]
+
+        PathBuilder.build(path)
+
+        dataset = SequenceDataset.build_from_objects(
+            sequences, 100, PathBuilder.build(path / "data"), "d1"
+        )
+
+        lc = LabelConfiguration()
+        lc.add_label("l1", [1, 2], positive_class=1)
+
+        encoder = MotifEncoder.build_object(
+            dataset,
+            **{
+                "min_positions": 1,
+                "max_positions": 2,
+                "min_precision": 0.9,
+                "min_recall": 0.5,
+                "min_true_positives": 1,
+            }
+        )
+
+        encoded_dataset = encoder.encode(
+            dataset,
+            EncoderParams(
+                result_path=path / "encoded_data/",
+                label_config=lc,
+                pool_size=2,
+                learn_model=True,
+                model={},
+            ),
+        )
+
+        return encoded_dataset
+
+    def test_generate(self):
+        path = EnvironmentSettings.tmp_test_path / "positional_motif_frequencies/"
+        PathBuilder.build(path)
+
+        encoded_dataset = self._create_dummy_encoded_data(path)
+
+        report = PositionalMotifFrequencies.build_object(
+            **{"dataset": encoded_dataset, "result_path": path,
+               "motif_color_map": {1: "#66C5CC", 2: "#F6CF71", 3: "#F89C74"}}
+        )
+
+        self.assertTrue(report.check_prerequisites())
+
+        result = report._generate()
+
+        self.assertIsInstance(result, ReportResult)
+
+        self.assertTrue(os.path.isfile(path / "max_gap_size.html"))
+        self.assertTrue(os.path.isfile(path / "total_gap_size.html"))
+        self.assertTrue(os.path.isfile(path / "positional_motif_frequencies.html"))
+        self.assertTrue(os.path.isfile(path / "max_gap_size_table.csv"))
+        self.assertTrue(os.path.isfile(path / "total_gap_size_table.csv"))
+        self.assertTrue(os.path.isfile(path / "positional_aa_counts.csv"))
+
+        content = pd.read_csv(path / "max_gap_size_table.csv")
+        self.assertEqual((list(content.columns))[1], "max_gap_size")
+        self.assertEqual((list(content.columns))[2], "occurrence")
+
+        content = pd.read_csv(path / "total_gap_size_table.csv")
+        self.assertEqual((list(content.columns))[1], "total_gap_size")
+        self.assertEqual((list(content.columns))[2], "occurrence")
+
+        content = pd.read_csv(path / "positional_aa_counts.csv")
+        self.assertEqual(list(content.index), [i for i in range(4)])
+
+        shutil.rmtree(path)
diff --git a/test/reports/ml_reports/test_BinaryFeaturePrecisionRecall.py b/test/reports/ml_reports/test_BinaryFeaturePrecisionRecall.py
new file mode 100644
index 000000000..5eae64419
--- /dev/null
+++ b/test/reports/ml_reports/test_BinaryFeaturePrecisionRecall.py
@@ -0,0 +1,117 @@
+import os
+import shutil
+from unittest import TestCase
+
+import numpy as np
+import random
+
+from immuneML.caching.CacheType import CacheType
+from immuneML.data_model.dataset.SequenceDataset import SequenceDataset
+from immuneML.data_model.encoded_data.EncodedData import EncodedData
+from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder
+from immuneML.environment.Constants import Constants
+from immuneML.environment.EnvironmentSettings import EnvironmentSettings
+from immuneML.environment.Label import Label
+from immuneML.ml_methods.BinaryFeatureClassifier import BinaryFeatureClassifier
+from immuneML.reports.ReportResult import ReportResult
+from immuneML.reports.ml_reports.BinaryFeaturePrecisionRecall import BinaryFeaturePrecisionRecall
+from immuneML.util.PathBuilder import PathBuilder
+
+
+
+class TestBinaryFeaturePrecisionRecall(TestCase):
+
+    def setUp(self) -> None:
+        os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name
+
+    def _create_report(self, path, keep_all):
+        enc_data_train = EncodedData(encoding=MotifEncoder.__name__,
+                                     example_ids=["1", "2", "3", "4", "5", "6", "7", "8"],
+                                     feature_names=["useless_rule", "rule1", "rule2", "rule3"],
+                                     examples=np.array([[False, True, False, False],
+                                                        [True, True, False, False],
+                                                        [False, False, True, True],
+                                                        [True, False, True, True],
+                                                        [False, False, False, True],
+                                                        [True, False, False, True],
+                                                        [False, False, False, False],
+                                                        [True, False, False, False]]),
+                                     labels={"l1": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]})
+
+        enc_data_test = EncodedData(encoding=MotifEncoder.__name__,
+                                    example_ids=["9", "10"],
+                                    feature_names=["useless_rule", "rule1", "rule2", "rule3"],
+                                    examples=np.array([[True, False, False, False],
+                                                       [True, False, True, False]]),
+                                    labels={"l1": ["yes", "no"]})
+
+        label = Label("l1", values=["yes", "no"], positive_class="yes")
+
+        motif_classifier = BinaryFeatureClassifier(training_percentage=0.7,
+                                                   random_seed=1,
+                                                   max_features=100,
+                                                   patience=10,
+                                                   min_delta=0,
+                                                   keep_all=keep_all,
+                                                   result_path=path)
+
+        random.seed(1)
+        motif_classifier.fit(encoded_data=enc_data_train, label=label,
+                             optimization_metric="accuracy")
+        random.seed(None)
+
+        report = BinaryFeaturePrecisionRecall.build_object(**{})
+
+        report.method = motif_classifier
+        report.label = label
+        report.result_path = path
+        report.train_dataset = SequenceDataset(buffer_type="NA", dataset_file="")
+        report.test_dataset = SequenceDataset(buffer_type="NA", dataset_file="")
+        report.train_dataset.encoded_data = enc_data_train
+        report.test_dataset.encoded_data = enc_data_test
+
+        return report
+
+
+    def test_generate_keep_all_false(self):
+        path = EnvironmentSettings.root_path / "test/tmp/binary_feature_precision_recall"
+        PathBuilder.build(path)
+
+        report = self._create_report(path, keep_all=False)
+
+        self.assertTrue(report.check_prerequisites())
+
+        result = report._generate()
+
+        self.assertIsInstance(result, ReportResult)
+
+        self.assertTrue(os.path.isfile(path / "training_performance.tsv"))
+        self.assertTrue(os.path.isfile(path / "validation_performance.tsv"))
+        self.assertTrue(os.path.isfile(path / "test_performance.tsv"))
+        self.assertTrue(os.path.isfile(path / "training_precision_recall.html"))
+        self.assertTrue(os.path.isfile(path / "validation_precision_recall.html"))
+        self.assertTrue(os.path.isfile(path / "test_precision_recall.html"))
+
+        shutil.rmtree(path)
+
+    def test_generate_keep_all_true(self):
+        path = EnvironmentSettings.root_path / "test/tmp/binary_feature_precision_recall_keep_all"
+        PathBuilder.build(path)
+
+        report = self._create_report(path, keep_all=True)
+
+        self.assertTrue(report.check_prerequisites())
+
+        result = report._generate()
+
+        self.assertIsInstance(result, ReportResult)
+
+        self.assertTrue(os.path.isfile(path / "training_performance.tsv"))
+        self.assertTrue(os.path.isfile(path / "test_performance.tsv"))
+        self.assertTrue(os.path.isfile(path / "training_precision_recall.html"))
+        self.assertTrue(os.path.isfile(path / "test_precision_recall.html"))
+
+        shutil.rmtree(path)
+
+
+
diff --git a/test/workflows/instructions/test_MLApplicationInstruction.py b/test/workflows/instructions/test_MLApplicationInstruction.py
index ced6c0041..957a9e272 100644
--- a/test/workflows/instructions/test_MLApplicationInstruction.py
+++ b/test/workflows/instructions/test_MLApplicationInstruction.py
@@ -40,7 +40,7 @@ def test_run(self):
         label = Label("l1", [1, 2])
         label_config = LabelConfiguration([label])
-        enc_dataset = encoder.encode(dataset, EncoderParams(result_path=path, label_config=label_config, filename="tmp_enc_dataset.pickle", pool_size=4))
+        enc_dataset = encoder.encode(dataset, EncoderParams(result_path=path, label_config=label_config, pool_size=4))
         ml_method.fit(enc_dataset.encoded_data, label)
         hp_setting = HPSetting(encoder, {"normalization_type": "relative_frequency", "reads": "unique", "sequence_encoding": "continuous_kmer",
diff --git a/test/workflows/instructions/test_trainMLModelInstruction.py b/test/workflows/instructions/test_trainMLModelInstruction.py
index ace6d1ce3..b3fdaa378 100644
--- a/test/workflows/instructions/test_trainMLModelInstruction.py
+++ b/test/workflows/instructions/test_trainMLModelInstruction.py
@@ -74,7 +74,8 @@ def test_run(self):
         process = TrainMLModelInstruction(dataset, GridSearch(hp_settings), hp_settings,
                                           SplitConfig(SplitType.RANDOM, 1, 0.5, reports=ReportConfig(data_splits={"seqlen": report})),
                                           SplitConfig(SplitType.RANDOM, 1, 0.5, reports=ReportConfig(data_splits={"seqlen": report})),
-                                          {ClassificationMetric.BALANCED_ACCURACY}, ClassificationMetric.BALANCED_ACCURACY, label_config, path)
+                                          {ClassificationMetric.BALANCED_ACCURACY}, ClassificationMetric.BALANCED_ACCURACY, label_config, path,
+                                          export_all_ml_settings=True)
         state = process.run(result_path=path)
diff --git a/test/workflows/steps/test_dataEncoder.py b/test/workflows/steps/test_dataEncoder.py
index 931c14a57..6d0d4c85c 100644
--- a/test/workflows/steps/test_dataEncoder.py
+++ b/test/workflows/steps/test_dataEncoder.py
@@ -53,7 +53,6 @@ def test_run(self):
                 pool_size=2,
                 label_config=lc,
                 result_path=path,
-                filename="dataset.csv"
             )
         ))