uio-bmi · pavlovicmilena · Nov 12, 2024 · Oct 29, 2023 · Oct 29, 2023 · Oct 30, 2023
diff --git a/README.md b/README.md
@@ -30,7 +30,7 @@ Useful links:
 
 
 We recommend installing immuneML inside a virtual environment.
-immuneML uses Python 3.8 or later. If using immuneML simulation, Python 3.11 is recommended.
+immuneML uses Python 3.8 or later. If using immuneML simulation, Python 3.11 or later is recommended.
 
 immuneML can be [installed directly using a package manager](<https://docs.immuneml.uio.no/latest/installation/install_with_package_manager.html#>) such as pip or conda,
 or [set up via docker](<https://docs.immuneml.uio.no/latest/installation/installation_docker.html>).
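
The version guidance in this hunk can be checked before installing. The following is a minimal sketch, assuming only the thresholds stated in the README text (3.8 as the minimum, 3.11 or later recommended for simulation); the function name `check_python_version` and its return strings are hypothetical and not part of immuneML.

```python
import sys

# Thresholds copied from the README text above; everything else is illustrative.
MINIMUM = (3, 8)
RECOMMENDED_FOR_SIMULATION = (3, 11)


def check_python_version(version=None, use_simulation=False):
    """Classify a (major, minor) version as 'ok', 'upgrade recommended' or 'unsupported'."""
    if version is None:
        # Default to the interpreter running this script.
        version = sys.version_info[:2]
    version = tuple(version[:2])
    if version < MINIMUM:
        return "unsupported"
    if use_simulation and version < RECOMMENDED_FOR_SIMULATION:
        return "upgrade recommended"
    return "ok"


if __name__ == "__main__":
    print(check_python_version(use_simulation=True))
```

Running the script inside the activated virtual environment reports on the interpreter that environment actually uses.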

diff --git a/docs/source/_static/images/reports/amino_acid_frequency.png b/docs/source/_static/images/reports/amino_acid_frequency.png
diff --git a/docs/source/_static/images/reports/amino_acid_frequency_change.png b/docs/source/_static/images/reports/amino_acid_frequency_change.png
diff --git a/docs/source/_static/images/reports/feature_distribution.png b/docs/source/_static/images/reports/feature_distribution.png
diff --git a/docs/source/_static/images/reports/feature_value_barplot.png b/docs/source/_static/images/reports/feature_value_barplot.png
diff --git a/docs/source/installation/install_with_package_manager.rst b/docs/source/installation/install_with_package_manager.rst
@@ -17,7 +17,11 @@ Install immuneML with pip
 
 0. To install immuneML with pip, make sure to have Python version 3.7 or later installed.
 
-1. Create a virtual environment where immuneML will be installed. It is possible to install immuneML as a global package, but it is not recommended as there might be conflicting versions of different packages. For more details, see `the official documentation on creating virtual environments with Python <https://docs.python.org/3/library/venv.html>`_. To create an environment, run the following in the terminal (for Windows-specific commands, see the virtual environment documentation linked above):
+1. Create a virtual environment where immuneML will be installed. It is possible to install immuneML as a global
+package, but it is not recommended as there might be conflicting versions of different packages. For more details,
+see `the official documentation on creating virtual environments with Python <https://docs.python.org/3/library/venv.html>`_.
+To create an environment, run the following in the terminal (for Windows-specific commands, see the virtual
+environment documentation linked above):
 
 .. code-block:: console
 
@@ -29,7 +33,10 @@ Install immuneML with pip
 
   source ./immuneml_venv/bin/activate
 
-Note: when creating a python virtual environment, it will automatically use the same Python version as the environment it was created in. To ensure that the preferred Python version (3.8) is used, it is possible to instead make a conda environment (see :ref:`Install immuneML with conda` steps 0-3) and proceed to install immuneML with pip inside the conda environment.
+Note: when creating a python virtual environment, it will automatically use the same Python version as the environment
+it was created in. To ensure that the preferred Python version (3.8) is used, it is possible to instead make a conda
+environment (see :ref:`Install immuneML with conda` steps 0-3) and proceed to install immuneML with pip inside the
+conda environment.
 
 
 3. If not already up-to-date, update pip:
@@ -38,13 +45,8 @@ Note: when creating a python virtual environment, it will automatically use the
 
   python3 -m pip install --upgrade pip
 
-4. If not already installed, install the wheel package. If it is not installed, the installation of some of the dependencies might default to legacy 'setup.py install'.
 
-.. code-block:: console
-
-  pip install wheel
-
-5. To install `immuneML from PyPI <https://pypi.org/project/immuneML/>`_ in this virtual environment, run the following:
+4. To install `immuneML from PyPI <https://pypi.org/project/immuneML/>`_ in this virtual environment, run the following:
 
 .. code-block:: console
 
@@ -64,12 +66,12 @@ Install immuneML with conda
   mkdir immuneML/
   cd immuneML/
 
-2. Create a virtual environment using conda. immuneML has been tested extensively with Python versions 3.7, 3.8 and 3.11.
-   To create a conda virtual environment with Python version 3.8, use:
+2. Create a virtual environment using conda. immuneML has been tested extensively with Python version 3.11.
+   To create a conda virtual environment with Python version 3.11, use:
 
 .. code-block:: console
 
-  conda create --prefix immuneml_env/ python=3.8
+  conda create --prefix immuneml_env/ python=3.11
 
 3. Activate the created environment:
 
@@ -118,27 +120,33 @@ To install the DeepRC dependencies, run:
 See also this question under 'Troubleshooting': :ref:`I get an error when installing PyTorch (could not find a version that satisfies the requirement torch)`
 
 
-Keras-based sequence CNN
+Deep learning methods
 ************************
 
-In order to use the :ref:`KerasSequenceCNN`, optional dependencies :code:`keras` and :code:`tensorflow` need to be installed.
-By default, version 2.11.0 of both dependencies are used.
-Other versions may work as well, as long as the used versions of :code:`keras` and :code:`tensorflow` are compatible with eachother.
-
-To install the default versions of these packages, you can include the optional extra :code:`KerasSequenceCNN`:
+In order to use any of the supported deep learning models (KerasSequenceCNN or others), install the optional DL dependencies:
 
 .. code-block:: console
 
-  pip install immuneML[KerasSequenceCNN]
+  pip install immuneML[DL]
 
-Or install the dependencies manually using the :download:`requirements_KerasSequenceCNN.txt <https://raw.githubusercontent.com/uio-bmi/immuneML/master/requirements_KerasSequenceCNN.txt>` file:
+Fisher's exact test
+**********************
+
+To use ProbabilisticBinaryClassifier or any of the abundance encoders (following the Emerson et al. 2017 publication),
+install the optional 'fisher' dependencies:
 
 .. code-block:: console
 
-  pip install -r requirements_KerasSequenceCNN.txt
+  pip install immuneML[fisher]
+
+Full immuneML installation
+******************************
 
+To install all optional dependencies and have access to the full set of immuneML features, use the following installation command:
+
+.. code-block:: console
 
-The :ref:`KerasSequenceCNN` uses CPU, it does *not* rely on GPU.
+  pip install immuneML[all]
 
 CompAIRR
 ********

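After installing one of the optional extras documented above (`DL`, `fisher`, `all`), it can be useful to confirm that the corresponding packages are importable in the active environment. The sketch below uses only the standard library; the helper name `missing_modules` is hypothetical, and the module names listed per extra are assumptions for illustration, since this diff does not show which packages each extra installs.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    # ASSUMPTION: these module names are guesses, not taken from immuneML's packaging metadata.
    assumed_extras = {"DL": ["keras"], "fisher": ["fisher"]}
    for extra, modules in assumed_extras.items():
        missing = missing_modules(modules)
        if missing:
            print(f"immuneML[{extra}]: missing {', '.join(missing)}")
        else:
            print(f"immuneML[{extra}]: all assumed modules found")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy deep learning packages.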
diff --git a/docs/source/installation/installation_docker.rst b/docs/source/installation/installation_docker.rst
@@ -40,6 +40,12 @@ To exit the Docker container, use the following command:
 
   exit
 
+.. note:: Available data
+
+  Please note that the Docker container only has access to the data that was explicitly mounted to the container. This
+  means that if you followed the example above, immuneML running in the Docker container will only have access to files in
+  and under the current working directory and will see them under the /data path.
+
 Using the Docker container for longer immuneML runs
 ----------------------------------------------------
 

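The note added in this hunk implies a simple path translation: a file under the mounted host directory appears under `/data` inside the container, and anything outside that directory is not visible at all. A minimal sketch of that mapping follows, assuming POSIX-style host paths; the function name `to_container_path` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def to_container_path(host_path, mounted_dir, container_dir="/data"):
    """Map a host path under mounted_dir to the path seen inside the container.

    Raises ValueError when host_path lies outside mounted_dir, which mirrors the
    note above: such files are simply not available in the container.
    """
    relative = PurePosixPath(host_path).relative_to(PurePosixPath(mounted_dir))
    return PurePosixPath(container_dir) / relative


if __name__ == "__main__":
    # Hypothetical host layout, used only for illustration.
    print(to_container_path("/home/user/project/specs/run.yaml", "/home/user/project"))
```

Paths written into an analysis specification should therefore use the container-side form when immuneML is run through Docker.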
diff --git a/immuneML/IO/dataset_export/AIRRExporter.py b/immuneML/IO/dataset_export/AIRRExporter.py
@@ -1,5 +1,6 @@
 import logging
 import math
+import shutil
 from dataclasses import fields
 from enum import Enum
 from multiprocessing import Pool
@@ -8,18 +9,11 @@
 
 import airr
 import pandas as pd
-from olga.utils import nt2aa
 
 from immuneML.IO.dataset_export.DataExporter import DataExporter
-from immuneML.data_model.dataset import Dataset
-from immuneML.data_model.dataset.ReceptorDataset import ReceptorDataset
-from immuneML.data_model.dataset.RepertoireDataset import RepertoireDataset
-from immuneML.data_model.receptor.Receptor import Receptor
-from immuneML.data_model.receptor.RegionType import RegionType
-from immuneML.data_model.receptor.receptor_sequence.Chain import Chain
-from immuneML.data_model.receptor.receptor_sequence.ReceptorSequence import ReceptorSequence
-from immuneML.data_model.receptor.receptor_sequence.SequenceFrameType import SequenceFrameType
-from immuneML.data_model.repertoire.Repertoire import Repertoire
+from immuneML.data_model.datasets.Dataset import Dataset
+from immuneML.data_model.datasets.ElementDataset import ElementDataset
+from immuneML.data_model.datasets.RepertoireDataset import RepertoireDataset
 from immuneML.environment.Constants import Constants
 from immuneML.util.NumpyHelper import NumpyHelper
 from immuneML.util.PathBuilder import PathBuilder
@@ -41,211 +35,25 @@ class AIRRExporter(DataExporter):
     def export(dataset: Dataset, path: Path, number_of_processes: int = 1, omit_columns: list = None):
         PathBuilder.build(path)
 
-        if isinstance(dataset, RepertoireDataset):
-            repertoire_folder = "repertoires/"
-            repertoire_path = PathBuilder.build(path / repertoire_folder)
+        try:
 
-            with Pool(processes=number_of_processes) as pool:
-                arguments = [(repertoire, repertoire_path, dataset.labels, omit_columns)
-                             for repertoire in dataset.repertoires]
-                pool.starmap(AIRRExporter.export_repertoire, arguments)
+            if isinstance(dataset, RepertoireDataset):
+                repertoire_folder = "repertoires/"
+                repertoire_path = PathBuilder.build(path / repertoire_folder)
 
-            AIRRExporter.export_updated_metadata(dataset, path, repertoire_folder)
-        else:
+                for repertoire in dataset.repertoires:
+                    shutil.copyfile(repertoire.data_filename, repertoire_path / repertoire.data_filename.name)
+                    shutil.copyfile(repertoire.metadata_filename, repertoire_path / repertoire.metadata_filename.name)
 
-            index = 1
-            file_count = math.ceil(dataset.get_example_count() / dataset.file_size)
+                shutil.copyfile(dataset.metadata_file, path / dataset.metadata_file.name)
+                if dataset.dataset_file and dataset.dataset_file.is_file():
+                    shutil.copyfile(dataset.dataset_file, path / dataset.dataset_file.name)
 
-            for batch in dataset.get_batch():
-                filename = path / f"batch{''.join(['0' for i in range(1, len(str(file_count)) - len(str(index)) + 1)])}{index}.tsv"
+            elif isinstance(dataset, ElementDataset):
+                shutil.copyfile(dataset.filename, path / dataset.filename.name)
+                shutil.copyfile(dataset.dataset_file, path / dataset.dataset_file.name)
 
-                if isinstance(dataset, ReceptorDataset):
-                    df = AIRRExporter._receptors_to_dataframe(batch)
-                else:
-                    df = AIRRExporter._sequences_to_dataframe(batch)
+        except shutil.SameFileError:
+            logging.warning("AIRRExporter: target and input path are the same. Skipping the copy operation...")
 
-                df = AIRRExporter._postprocess_dataframe(df, dataset.labels, omit_columns)
-                airr.dump_rearrangement(df, str(filename))
-
-                index += 1
-
-    @staticmethod
-    def export_repertoire(repertoire: Repertoire, repertoire_path: Path, dataset_labels: dict, omit_columns: list = None):
-        df = AIRRExporter._repertoire_to_dataframe(repertoire)
-        df = AIRRExporter._postprocess_dataframe(df, dataset_labels, omit_columns)
-        output_file = repertoire_path / f"{repertoire.data_filename.stem if 'subject_id' not in repertoire.metadata else repertoire.metadata['subject_id']}.tsv"
-        airr.dump_rearrangement(df, str(output_file))
-
-    @staticmethod
-    def get_sequence_field(region_type):
-        if region_type == RegionType.IMGT_CDR3:
-            return "cdr3"
-        elif region_type == RegionType.IMGT_JUNCTION:
-            return "junction"
-        else:
-            return "sequence"
-
-    @staticmethod
-    def get_sequence_aa_field(region_type):
-        return f"{AIRRExporter.get_sequence_field(region_type)}_aa"
-
-    @staticmethod
-    def export_updated_metadata(dataset: RepertoireDataset, result_path: Path, repertoire_folder: str):
-        df = pd.read_csv(dataset.metadata_file, comment=Constants.COMMENT_SIGN)
-        identifiers = df["identifier"].values.tolist() if "identifier" in df.columns else dataset.get_example_ids()
-        df["filename"] = [f"{repertoire.data_filename.stem if 'subject_id' not in repertoire.metadata else repertoire.metadata['subject_id']}.tsv"
-                          for repertoire in dataset.get_data()]
-        df['identifier'] = identifiers
-        df.to_csv(result_path / "metadata.csv", index=False)
-
-    @staticmethod
-    def _repertoire_to_dataframe(repertoire: Repertoire):
-        rep_data = repertoire.load_bnp_data()
-        df = pd.DataFrame({field.name: getattr(rep_data, field.name).tolist() for field in fields(rep_data)})
-
-        region_type = repertoire.get_region_type()
-
-        # rename mandatory fields for airr-compliance
-        mapper = {"chain": "locus", "sequence": AIRRExporter.get_sequence_field(region_type),
-                  "sequence_aa": AIRRExporter.get_sequence_aa_field(region_type)}
-
-        df = df.rename(mapper=mapper, axis="columns")
-        df.drop(columns=['region_type'], inplace=True)
-
-        return df
-
-    @staticmethod
-    def add_full_length_seq(df, species, unique_chains):
-        if unique_chains is not None and len(unique_chains) <= 2 and all(chain in [Chain.ALPHA.value, Chain.BETA.value] for chain in unique_chains):
-            try:
-                from Stitchr import stitchr as st
-                from Stitchr import stitchrfunctions as fxn
-
-                tcr_dat, functionality, partial = {}, {}, {}
-
-                for chain in unique_chains:
-                    tcr_dat[chain], functionality[chain], partial[chain] = fxn.get_imgt_data(chain, st.gene_types, species.upper())
-
-                codons = fxn.get_optimal_codons('', species)
-
-                df['full_sequence'] = df.apply(lambda row: stitch_wrapper(row, st, fxn, species, tcr_dat, functionality, partial, codons), axis=1)
-
-                df['full_sequence_aa'] = df.apply(lambda row: nt2aa(row['full_sequence']), axis=1)
-
-            except Exception as e:
-                logging.warning(f"An error occurred while exporting full length sequence. Only CDR3/JUNCTION region "
-                                f"is exported instead.\nFull error: {e}")
-
-    @staticmethod
-    def _receptors_to_dataframe(receptors: List[Receptor]):
-        sequences = [(receptor.get_chain(receptor.get_chains()[0]), receptor.get_chain(receptor.get_chains()[1])) for receptor in receptors]
-        sequences = [item for sublist in sequences for item in sublist]
-        receptor_ids = [(receptor.identifier, receptor.identifier) for receptor in receptors]
-        receptor_ids = [item for sublist in receptor_ids for item in sublist]
-
-        df = AIRRExporter._sequences_to_dataframe(sequences)
-        df["cell_id"] = receptor_ids
-        return df
-
-    @staticmethod
-    def _get_sequence_list_region_type(sequences: List[ReceptorSequence]):
-        region_types = set([sequence.get_attribute("region_type") for sequence in sequences])
-
-        assert len(region_types) == 1, f"AIRRExporter: expected one region_type, found: {region_types}"
-
-        return RegionType(region_types.pop())
-
-    @staticmethod
-    def _sequences_to_dataframe(sequences: List[ReceptorSequence]):
-        region_type = AIRRExporter._get_sequence_list_region_type(sequences)
-        sequence_field = AIRRExporter.get_sequence_field(region_type)
-        sequence_aa_field = AIRRExporter.get_sequence_aa_field(region_type)
-
-        main_data_dict = {"sequence_id": [], sequence_field: [], sequence_aa_field: []}
-        attributes_dict = {"chain": [], "v_call": [], "j_call": [], "duplicate_count": [], "cell_id": [], "frame_type": []}
-
-        for i, sequence in enumerate(sequences):
-            main_data_dict["sequence_id"].append(sequence.sequence_id)
-            main_data_dict[sequence_field].append(sequence.sequence)
-            main_data_dict[sequence_aa_field].append(sequence.sequence_aa)
-
-            # add custom params of this receptor sequence to attributes dict
-            if sequence.metadata is not None and sequence.metadata.custom_params is not None:
-                for custom_param in sequence.metadata.custom_params:
-                    if custom_param not in attributes_dict:
-                        attributes_dict[custom_param] = ['' for i in range(i)]
-
-            for attribute in attributes_dict.keys():
-                try:
-                    attr_value = sequence.get_attribute(attribute)
-                    if isinstance(attr_value, Enum):
-                        attr_value = attr_value.value
-                    attributes_dict[attribute].append(attr_value)
-                except KeyError:
-                    attributes_dict[attribute].append('')
-
-        df = pd.DataFrame({**attributes_dict, **main_data_dict})
-
-        df.rename(columns={"chain": "locus"}, inplace=True)
-
-        return df
-
-    @staticmethod
-    def update_gene_columns(df, allele_name, gene_name):
-        for index, row in df.iterrows():
-            for gene in ['v', 'j']:
-                if NumpyHelper.is_nan_or_empty(row[f"{gene}_{allele_name}"]) and not NumpyHelper.is_nan_or_empty(row[f"{gene}_{gene_name}"]):
-                    df.at[index, f"{gene}_{allele_name}"] = row[f"{gene}_{gene_name}"]
-
-    @staticmethod
-    def _postprocess_dataframe(df, dataset_labels: dict, omit_columns: list = None):
-        if "locus" in df.columns:
-            df["locus"] = [Chain.get_chain(chain).value if chain and Chain.get_chain(chain) else '' for chain in df["locus"]]
-        else:
-            df['locus'] = df.apply(lambda row: Chain.get_chain(row['v_call'][:3]).value, axis=1)
-
-        if "frame_type" in df.columns:
-            AIRRExporter._enums_to_strings(df, "frame_type")
-
-            df["productive"] = df["frame_type"] == SequenceFrameType.IN.value
-            df.loc[df["frame_type"].isnull(), "productive"] = ""
-            df.loc[df["frame_type"] == "", "productive"] = ""
-            df.loc[df["frame_type"] == SequenceFrameType.UNDEFINED.value, "productive"] = ""
-
-            df["vj_in_frame"] = df["productive"]
-
-            df["stop_codon"] = df["frame_type"] == SequenceFrameType.STOP.value
-            df.loc[df["frame_type"].isnull(), "stop_codon"] = ''
-
-            df.drop(columns=["frame_type"], inplace=True)
-
-        if "region_type" in df.columns:
-            df.drop(columns=["region_type"], inplace=True)
-
-        if omit_columns is not None:
-            df.drop(columns=omit_columns, inplace=True)
-
-        AIRRExporter.add_full_length_seq(df, dataset_labels.get('species', None) if dataset_labels else None, list(set(df['locus'].values.tolist())))
-
-        return df
-
-    @staticmethod
-    def _enums_to_strings(df, field):
-        df.loc[:, field] = [field_value.value if isinstance(field_value, Enum) else field_value for field_value in df.loc[:, field]]
-
-
-def stitch_wrapper(row, st, fxn, species, tcr_dat, functionality, partial, codons):
-    full_sequence = ""
-
-    try:
-        full_sequence = st.stitch({'v': row['v_call'], 'j': row['j_call'], 'cdr3': row['junction_aa'],
-                   'skip_c_checks': False, '5_prime_seq': '', '3_prime_seq': '', 'name': '',
-                   'c': fxn.autofill_input({'c': None, 'species': species.upper(), 'j': row['j_call'],
-                                            'l': row['v_call']}, row['locus'])['c'],
-                   'species': species.upper(), 'l': row['v_call']},
-                  tcr_dat[row['locus']], functionality[row['locus']], partial[row['locus']], codons, 3, '')[1]
-
-    except Exception as e:
-        logging.warning(f"An error occurred while constructing full sequence from row: \n{row}. Error log: \n{e}")
-
-    return full_sequence
+        # TODO: add here export of full sequence if possible
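
The rewritten `export` replaces dataframe conversion with plain file copies and treats `shutil.SameFileError` as a signal to skip the copy. The following self-contained sketch shows that pattern in isolation; the helper `copy_into` and the file names are hypothetical and are not part of immuneML.

```python
import logging
import shutil
import tempfile
from pathlib import Path


def copy_into(source: Path, target_dir: Path) -> bool:
    """Copy source into target_dir under the same file name.

    Returns False and logs a warning when source already is the target file.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    try:
        shutil.copyfile(source, target_dir / source.name)
        return True
    except shutil.SameFileError:
        logging.warning("Target and input path are the same. Skipping the copy operation...")
        return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "input" / "rep1.tsv"
        src.parent.mkdir()
        src.write_text("sequence_id\tcdr3_aa\n")
        print(copy_into(src, Path(tmp) / "exported"))  # copies into a new directory
        print(copy_into(src, src.parent))  # source and target coincide, so the copy is skipped
```

One design difference is worth noting: this sketch catches the error per file, whereas the diff wraps all copies in a single `try`, so there the first `SameFileError` also skips every remaining copy in that export call. That is likely harmless when the whole dataset is exported onto itself, but it may matter if only some files share a location with the target.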
diff --git a/immuneML/IO/dataset_export/DataExporter.py b/immuneML/IO/dataset_export/DataExporter.py
@@ -3,8 +3,7 @@
 import abc
 from pathlib import Path
 
-from immuneML.data_model.dataset.Dataset import Dataset
-from immuneML.data_model.receptor.RegionType import RegionType
+from immuneML.data_model.datasets.Dataset import Dataset
 
 
 class DataExporter(metaclass=abc.ABCMeta):