Merge in short motif code #161
Merged
pavlovicmilena
merged 284 commits into
master
from
merge_master_into_short_motif_classifier
Dec 1, 2023
Collaborator
LonnekeScheffer
commented
Oct 27, 2023
- Added short motif code: MotifEncoder, a sequence dataset encoder in which each feature is a motif (position-specific, possibly gapped); 1 denotes that the motif is present and 0 that it is absent. Motifs are filtered by precision and recall thresholds. Several related reports are included (to calibrate the recall threshold or analyse properties of the learned motifs).
- Added SimilarToPositiveSequenceEncoder, a baseline method with a single feature. The encoder memorises all positive sequences in the training set, and the feature is 1 if a sequence is within a given Hamming distance of ANY positive sequence in the training set, and 0 otherwise.
- Added BinaryFeatureClassifier, which takes a binary encoder (such as MotifEncoder or SimilarToPositiveSequenceEncoder) and classifies an example as positive if it has any positive features. For SimilarToPositiveSequenceEncoder there is only one feature anyway. For MotifEncoder, a subset of complementary motifs can optionally be learned, which together can be used to classify the sequence dataset.
- Added KerasSequenceCNN, which is the Mason CNN for sequence datasets.
- Also contains example weighting code, to up-weight or down-weight individual examples. Example weighting is supported by default by most ML libraries, such as scikit-learn and PyTorch. While this code is currently not actively in use, it does work and we may use it later. For now I removed the weighting strategy that I initially used for short motif (it only made sense for mutagenesis data), but kept that code on a separate branch. This PR now only contains the 'predefined weighting' strategy, where example weights are supplied in a file. Other weighting strategies can be added in the same way as encodings, ML methods, etc.
- Updated documentation: installation (optional dependency: Mason CNN) and added a 'motif recovery' tutorial using the short motif method.
- Updated version to 2.3.0
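
The encoders and classifier described in the list above can be illustrated with a minimal, self-contained sketch. This is not the immuneML implementation: the motif representation (a dict from 0-based position to amino acid), all function names, and the choice to treat sequences of unequal length as non-matching are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical representation: a motif maps 0-based positions to required
# amino acids. Non-adjacent positions make the motif "gapped".
Motif = Dict[int, str]


def motif_present(sequence: str, motif: Motif) -> int:
    """Binary motif feature: 1 if every (position, amino acid) pair matches, else 0."""
    return int(all(pos < len(sequence) and sequence[pos] == aa
                   for pos, aa in motif.items()))


def precision_recall(motif: Motif, sequences: Sequence[str],
                     labels: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall of a single motif used as a one-feature predictor."""
    hits = [motif_present(seq, motif) for seq in sequences]
    tp = sum(1 for hit, label in zip(hits, labels) if hit and label)
    fp = sum(1 for hit, label in zip(hits, labels) if hit and not label)
    n_positive = sum(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / n_positive if n_positive else 0.0
    return precision, recall


def filter_motifs(motifs: List[Motif], sequences: Sequence[str],
                  labels: Sequence[int], min_precision: float,
                  min_recall: float) -> List[Motif]:
    """Keep only motifs that reach both thresholds on the given data."""
    kept = []
    for motif in motifs:
        precision, recall = precision_recall(motif, sequences, labels)
        if precision >= min_precision and recall >= min_recall:
            kept.append(motif)
    return kept


def hamming_distance(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def similar_to_positive(sequence: str, positives: Sequence[str],
                        max_distance: int) -> int:
    """Single baseline feature: 1 if within max_distance of ANY memorised positive.

    Assumption: sequences of a different length never match.
    """
    return int(any(len(p) == len(sequence)
                   and hamming_distance(p, sequence) <= max_distance
                   for p in positives))


def classify_any_positive(features: Sequence[int]) -> int:
    """Predict positive (1) if any binary feature is 1, else negative (0)."""
    return int(any(features))
```

With a motif-based encoding, the feature vector passed to `classify_any_positive` would be `[motif_present(seq, m) for m in kept_motifs]`; with the baseline encoding it would be the single value returned by `similar_to_positive`.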
- added MotifGeneralizationAnalysis
- do not store learned motifs as parameter of MotifEncoder, instead read from file
- added random seed option to get_train_val_indices
- MotifPrecisionTP may soon be deprecated or must be refactored to share code with MotifGeneralizationAnalysis
…lists with None values
…classifier # Conflicts: # immuneML/reports/data_reports/WeightsDistribution.py
…modifying the dataset
- do not ignore sequences with TP=0 in test set
- include FP scores when computing combined precision
# Conflicts:
#   immuneML/config/default_params/reports/amino_acid_frequency_distribution_params.yaml
#   immuneML/data_model/receptor/receptor_sequence/ReceptorSequence.py
#   immuneML/encodings/abundance_encoding/CompAIRRSequenceAbundanceEncoder.py
#   immuneML/hyperparameter_optimization/core/HPAssessment.py
#   immuneML/hyperparameter_optimization/core/HPSelection.py
#   immuneML/hyperparameter_optimization/core/HPUtil.py
#   immuneML/ml_methods/DeepRC.py
#   immuneML/ml_methods/SklearnMethod.py
#   immuneML/ml_metrics/MetricUtil.py
#   immuneML/reports/data_reports/AminoAcidFrequencyDistribution.py
#   immuneML/reports/data_reports/SequenceLengthDistribution.py
#   immuneML/reports/ml_reports/TrainingPerformance.py
#   immuneML/util/CompAIRRHelper.py
#   immuneML/util/ParameterValidator.py
#   immuneML/workflows/steps/MLMethodAssessment.py
#   test/reports/data_reports/test_AminoAcidFrequencyDistribution.py
…ureClassifier docs, removed generalize_motifs option as it is currently not used in practice, and disabled allow_negative_aas option as it requires a few more fixes.
… is plotted, not logfold
# Conflicts:
#   immuneML/data_model/dataset/Dataset.py
#   immuneML/data_model/dataset/ElementDataset.py
#   immuneML/data_model/encoded_data/EncodedData.py
#   immuneML/dsl/definition_parsers/DefinitionParser.py
#   immuneML/dsl/instruction_parsers/ExploratoryAnalysisParser.py
#   immuneML/encodings/word2vec/Word2VecEncoder.py
#   immuneML/environment/Constants.py
#   immuneML/ml_metrics/MetricUtil.py
#   immuneML/reports/data_reports/AminoAcidFrequencyDistribution.py
#   immuneML/reports/ml_reports/TrainingPerformance.py
#   immuneML/workflows/instructions/TrainMLModelInstruction.py
#   immuneML/workflows/instructions/exploratory_analysis/ExploratoryAnalysisInstruction.py
#   immuneML/workflows/instructions/exploratory_analysis/ExploratoryAnalysisUnit.py
#   immuneML/workflows/steps/MLMethodAssessment.py
#   setup.py
#   test/dsl/test_immuneMLParser.py
#   test/encodings/distance_encoding/test_compAIRRDistanceEncoder.py
#   test/encodings/onehot/test_oneHotEncoder.py
#   test/reports/encoding_reports/test_Matches.py
#   test/workflows/instructions/test_trainMLModelInstruction.py
- Bugfix in AIRRExporter which read sequences as 'productive'=False when 'productive' was missing
- some updated variable names
- updated docs
…ile import. Add explicit option to import sequences with unknown productivity where relevant (true by default; the option is not made available for immunoseq import types, as their documentation reveals that the productivity type for those file formats is never 'unknown')
pavlovicmilena
approved these changes
Dec 1, 2023