Merge in short motif code #161

Merged: 284 commits into master, Dec 1, 2023

LonnekeScheffer (Collaborator)

  • Added short motif code: MotifEncoder, a sequence dataset encoder in which each feature is a motif (position-specific, possibly gapped); a value of 1 denotes that the motif is present and 0 that it is absent. Motifs are filtered by precision and recall thresholds. Several related reports are included (to calibrate the recall threshold or to analyse properties of the learned motifs).
  • Added SimilarToPositiveSequenceEncoder, a baseline method with a single feature. The encoder memorises all positive sequences in the training set; the feature is 1 if a sequence is within a given Hamming distance of ANY positive sequence in the training set, and 0 otherwise.
  • Added BinaryFeatureClassifier, which takes a binary encoder (such as MotifEncoder or SimilarToPositiveSequenceEncoder) and classifies an example as positive if any of its features is positive. For SimilarToPositiveSequenceEncoder there is only one feature anyway. For MotifEncoder, a subset of complementary motifs can optionally be learned, which together are used to classify the sequence dataset.
  • Added KerasSequenceCNN, which is the Mason CNN for sequence datasets.
  • Also contains example weighting code, to up-weight or down-weight individual examples. Example weighting is supported by default by most ML libraries, such as scikit-learn and PyTorch. While this code is not currently in active use, it does work and we may use it later. For now I removed the weighting strategy that I initially used for short motif (it only made sense for mutagenesis data), but kept that code on a separate branch. This PR now contains only the 'predefined weighting' strategy, where example weights are supplied in a file. Other weighting strategies can be added in the same way as encodings, ML methods, etc.
  • Updated documentation: installation (optional dependency: Mason CNN), and added a 'motif recovery' tutorial using the short motif method.
  • Updated version to 2.3.0
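
To make the encoders described above more concrete, here is a minimal, self-contained Python sketch of the ideas only. It is not the immuneML implementation and does not use the immuneML API: the motif representation (a dict mapping position to amino acid, with gaps being unlisted positions), all function names, the toy data, and the choice to treat sequences of different lengths as not similar are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical motif representation: {position: amino_acid}.
# Positions that are not listed act as gaps.
Motif = Dict[int, str]


def motif_present(sequence: str, motif: Motif) -> int:
    """Return 1 if every (position, amino acid) pair of the motif matches, else 0."""
    return int(all(pos < len(sequence) and sequence[pos] == aa
                   for pos, aa in motif.items()))


def precision_recall(motif: Motif, sequences: Sequence[str],
                     labels: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall of one motif used as a binary predictor."""
    hits = [motif_present(seq, motif) for seq in sequences]
    tp = sum(1 for h, y in zip(hits, labels) if h == 1 and y == 1)
    fp = sum(1 for h, y in zip(hits, labels) if h == 1 and y == 0)
    fn = sum(1 for h, y in zip(hits, labels) if h == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def filter_motifs(motifs: List[Motif], sequences: Sequence[str],
                  labels: Sequence[int], min_precision: float,
                  min_recall: float) -> List[Motif]:
    """Keep only motifs that reach both thresholds."""
    kept = []
    for motif in motifs:
        precision, recall = precision_recall(motif, sequences, labels)
        if precision >= min_precision and recall >= min_recall:
            kept.append(motif)
    return kept


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def similar_to_positive(sequence: str, positives: Sequence[str],
                        max_distance: int) -> int:
    """Single baseline feature: 1 if the sequence is within max_distance of ANY
    memorised positive sequence. Sequences of a different length are treated
    as not similar (an assumption of this sketch)."""
    return int(any(len(p) == len(sequence) and hamming(sequence, p) <= max_distance
                   for p in positives))


def classify(features: Sequence[int]) -> int:
    """Predict positive if any binary feature is 1."""
    return int(any(features))


# Toy data, invented for illustration.
sequences = ["CASSL", "CASTL", "CGSSL", "AAAAA"]
labels = [1, 1, 0, 0]
candidate_motifs = [{0: "C", 3: "S"}, {1: "A", 3: "T"}, {4: "L"}]

selected = filter_motifs(candidate_motifs, sequences, labels,
                         min_precision=0.9, min_recall=0.5)
positives = [s for s, y in zip(sequences, labels) if y == 1]
```

With these toy inputs, `classify([motif_present("CASTL", m) for m in selected])` gives the motif-based prediction for one sequence, and `classify([similar_to_positive("CASSV", positives, 1)])` gives the baseline prediction.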

LonnekeScheffer and others added 30 commits November 1, 2022 19:23
- added MotifGeneralizationAnalysis
- do not store learned motifs as parameter of MotifEncoder, instead read from file
- added random seed option to get_train_val_indices
- MotifPrecisionTP may soon be deprecated or must be refactored to share code with MotifGeneralizationAnalysis
…classifier

# Conflicts:
#	immuneML/reports/data_reports/WeightsDistribution.py
- do not ignore sequences with TP=0 in test set
- include FP scores when computing combined precision
# Conflicts:
#	immuneML/config/default_params/reports/amino_acid_frequency_distribution_params.yaml
#	immuneML/data_model/receptor/receptor_sequence/ReceptorSequence.py
#	immuneML/encodings/abundance_encoding/CompAIRRSequenceAbundanceEncoder.py
#	immuneML/hyperparameter_optimization/core/HPAssessment.py
#	immuneML/hyperparameter_optimization/core/HPSelection.py
#	immuneML/hyperparameter_optimization/core/HPUtil.py
#	immuneML/ml_methods/DeepRC.py
#	immuneML/ml_methods/SklearnMethod.py
#	immuneML/ml_metrics/MetricUtil.py
#	immuneML/reports/data_reports/AminoAcidFrequencyDistribution.py
#	immuneML/reports/data_reports/SequenceLengthDistribution.py
#	immuneML/reports/ml_reports/TrainingPerformance.py
#	immuneML/util/CompAIRRHelper.py
#	immuneML/util/ParameterValidator.py
#	immuneML/workflows/steps/MLMethodAssessment.py
#	test/reports/data_reports/test_AminoAcidFrequencyDistribution.py
…ureClassifier docs,

removed generalize_motifs option as it is currently not used in practice, and disabled allow_negative_aas option as it requires a few more fixes.
# Conflicts:
#	immuneML/data_model/dataset/Dataset.py
#	immuneML/data_model/dataset/ElementDataset.py
#	immuneML/data_model/encoded_data/EncodedData.py
#	immuneML/dsl/definition_parsers/DefinitionParser.py
#	immuneML/dsl/instruction_parsers/ExploratoryAnalysisParser.py
#	immuneML/encodings/word2vec/Word2VecEncoder.py
#	immuneML/environment/Constants.py
#	immuneML/ml_metrics/MetricUtil.py
#	immuneML/reports/data_reports/AminoAcidFrequencyDistribution.py
#	immuneML/reports/ml_reports/TrainingPerformance.py
#	immuneML/workflows/instructions/TrainMLModelInstruction.py
#	immuneML/workflows/instructions/exploratory_analysis/ExploratoryAnalysisInstruction.py
#	immuneML/workflows/instructions/exploratory_analysis/ExploratoryAnalysisUnit.py
#	immuneML/workflows/steps/MLMethodAssessment.py
#	setup.py
#	test/dsl/test_immuneMLParser.py
#	test/encodings/distance_encoding/test_compAIRRDistanceEncoder.py
#	test/encodings/onehot/test_oneHotEncoder.py
#	test/reports/encoding_reports/test_Matches.py
#	test/workflows/instructions/test_trainMLModelInstruction.py
- Bugfix in AIRRExporter which read sequences as 'productive'=False when 'productive' was missing
- some updated variable names
- updated docs
…ile import. Add explicit option to import sequences with unknown productivity where relevant (true by default; the option is not available for immunoseq import types, as their documentation indicates that the productivity type for those file formats is never 'unknown')
pavlovicmilena merged commit 751ae6b into master on Dec 1, 2023
1 check failed
LonnekeScheffer deleted the merge_master_into_short_motif_classifier branch on October 16, 2024 21:40
3 participants