Merge in short motif code #161

LonnekeScheffer · 2023-10-27T16:08:00Z

Added short motif code: MotifEncoder, a sequence dataset encoder in which each feature is a motif (position-specific, possibly gapped); a value of 1 denotes that the motif is present and 0 that it is absent. Motifs are filtered by precision and recall thresholds. Several related reports are included (to calibrate the recall threshold or analyse properties of the learned motifs).
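
To illustrate the idea behind the motif encoding, here is a minimal standalone sketch. It is not the MotifEncoder implementation: the `Motif` representation (a dict from 0-based position to amino acid) and all function names are assumptions made for this example only.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical representation: a motif maps 0-based positions to required amino acids.
# Positions need not be contiguous, which is how a gap is expressed in this sketch.
Motif = Dict[int, str]


def motif_present(sequence: str, motif: Motif) -> int:
    """Return 1 if every (position, amino acid) pair of the motif matches, else 0."""
    return int(all(pos < len(sequence) and sequence[pos] == aa
                   for pos, aa in motif.items()))


def encode(sequences: Sequence[str], motifs: Sequence[Motif]) -> List[List[int]]:
    """Build a binary matrix: one row per sequence, one column per motif."""
    return [[motif_present(seq, motif) for motif in motifs] for seq in sequences]


def precision_recall(column: Sequence[int], labels: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall of one binary motif feature against binary labels."""
    tp = sum(1 for f, y in zip(column, labels) if f == 1 and y == 1)
    predicted_pos = sum(column)
    actual_pos = sum(labels)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall


def filter_motifs(sequences: Sequence[str], labels: Sequence[int],
                  motifs: Sequence[Motif], min_precision: float,
                  min_recall: float) -> List[Motif]:
    """Keep only motifs whose precision and recall both reach the thresholds."""
    matrix = encode(sequences, motifs)
    kept = []
    for j, motif in enumerate(motifs):
        column = [row[j] for row in matrix]
        precision, recall = precision_recall(column, labels)
        if precision >= min_precision and recall >= min_recall:
            kept.append(motif)
    return kept
```

For example, the gapped motif `{0: "A", 3: "E"}` matches `"ACDE"` and `"ACFE"` but not `"GCDE"`.
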
Added SimilarToPositiveSequenceEncoder, a baseline method with a single feature. The encoder memorises all positive sequences in the training set; the feature is 1 if a sequence is within a given Hamming distance of any positive training sequence, and 0 otherwise.
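
The baseline can be sketched in a few lines. This is an illustration only, not the SimilarToPositiveSequenceEncoder code: the class name, the `fit`/`transform` method names, and the choice to treat sequences of different lengths as not similar are all assumptions.

```python
from typing import List, Sequence


def hamming_distance(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


class SimilarToPositiveSketch:
    """Hypothetical single-feature baseline; not the immuneML class."""

    def __init__(self, max_hamming_distance: int):
        self.max_hamming_distance = max_hamming_distance
        self.positives = set()

    def fit(self, sequences: Sequence[str], labels: Sequence[int]) -> "SimilarToPositiveSketch":
        # Memorise every positive training sequence.
        self.positives = {seq for seq, y in zip(sequences, labels) if y == 1}
        return self

    def transform(self, sequences: Sequence[str]) -> List[List[int]]:
        # One feature per sequence: 1 if close to ANY memorised positive.
        # Assumption: sequences of a different length never count as similar.
        return [[int(any(len(pos) == len(seq)
                         and hamming_distance(pos, seq) <= self.max_hamming_distance
                         for pos in self.positives))]
                for seq in sequences]
```

With a maximum distance of 1 and `"AAAA"` as the only positive training sequence, `"AAAT"` is encoded as 1 and `"AATT"` as 0.
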
Added BinaryFeatureClassifier, which works with a binary encoder (such as MotifEncoder or SimilarToPositiveSequenceEncoder) and classifies an example as positive if any of its features is positive. SimilarToPositiveSequenceEncoder has only one feature anyway. For MotifEncoder, a subset of complementary motifs can optionally be learned, which together are used to classify the sequence dataset.
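
The decision rule itself is simple enough to show directly. The sketch below is a hypothetical helper, not the BinaryFeatureClassifier API. The optional `feature_indices` argument only stands in for a previously chosen subset of motifs; how that subset is learned is not shown here.

```python
from typing import List, Optional, Sequence


def predict_any_positive(matrix: Sequence[Sequence[int]],
                         feature_indices: Optional[Sequence[int]] = None) -> List[int]:
    """Classify each row as 1 if any (selected) binary feature is 1, else 0."""
    predictions = []
    for row in matrix:
        # Restrict to the selected columns if a subset was given.
        values = row if feature_indices is None else [row[j] for j in feature_indices]
        predictions.append(int(any(values)))
    return predictions
```

For an encoder with a single feature, the prediction equals the feature value.
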
Added KerasSequenceCNN which is the Mason CNN for sequence datasets.
Also contains example weighting code, to up-weight or down-weight individual examples. Example weighting is supported by default by most ML libraries, such as scikit-learn and PyTorch. While this code is currently not actively in use, it does work and we may use it later. For now I removed the weighting strategy that I initially used for short motif (it only made sense for mutagenesis data), but kept that code on a separate branch. This PR now contains only the 'predefined weighting' strategy, where example weights are supplied in a file. Other weighting strategies can be added in the same way as encodings, ML methods, etc.
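
As a rough illustration of weights supplied in a file, the sketch below reads weights from CSV and aligns them with the examples. The file layout (columns `identifier` and `example_weight`) and all function names are assumptions for this example and may differ from what the predefined weighting strategy expects.

```python
import csv
import io
from typing import Dict, List, Sequence, TextIO


def load_example_weights(handle: TextIO) -> Dict[str, float]:
    """Read a CSV with (assumed) columns 'identifier' and 'example_weight'."""
    reader = csv.DictReader(handle)
    return {row["identifier"]: float(row["example_weight"]) for row in reader}


def align_weights(example_ids: Sequence[str], weights: Dict[str, float]) -> List[float]:
    """Return the weights in the same order as the examples."""
    missing = [i for i in example_ids if i not in weights]
    if missing:
        raise ValueError(f"No weight supplied for: {missing}")
    return [weights[i] for i in example_ids]


def weighted_accuracy(y_true: Sequence[int], y_pred: Sequence[int],
                      sample_weight: Sequence[float]) -> float:
    """Accuracy in which each example counts in proportion to its weight."""
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


# In-memory stand-in for a weights file.
weights_file = io.StringIO("identifier,example_weight\nseq1,2.0\nseq2,0.5\nseq3,1.0\n")
sample_weight = align_weights(["seq3", "seq1", "seq2"], load_example_weights(weights_file))
```

A list aligned this way could then be passed as `sample_weight` to the `fit` method of scikit-learn estimators that accept it, for example `LogisticRegression().fit(X, y, sample_weight=sample_weight)`.
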
Updated documentation: installation (optional dependency: Mason CNN) and added 'motif recovery' tutorial using the short motif method.
Updated version to 2.3.0.

- added MotifGeneralizationAnalysis
- do not store learned motifs as parameter of MotifEncoder, instead read from file
- added random seed option to get_train_val_indices
- MotifPrecisionTP may soon be deprecated or must be refactored to share code with MotifGeneralizationAnalysis

# Conflicts:
# immuneML/config/default_params/reports/amino_acid_frequency_distribution_params.yaml
# immuneML/data_model/receptor/receptor_sequence/ReceptorSequence.py
# immuneML/encodings/abundance_encoding/CompAIRRSequenceAbundanceEncoder.py
# immuneML/hyperparameter_optimization/core/HPAssessment.py
# immuneML/hyperparameter_optimization/core/HPSelection.py
# immuneML/hyperparameter_optimization/core/HPUtil.py
# immuneML/ml_methods/DeepRC.py
# immuneML/ml_methods/SklearnMethod.py
# immuneML/ml_metrics/MetricUtil.py
# immuneML/reports/data_reports/AminoAcidFrequencyDistribution.py
# immuneML/reports/data_reports/SequenceLengthDistribution.py
# immuneML/reports/ml_reports/TrainingPerformance.py
# immuneML/util/CompAIRRHelper.py
# immuneML/util/ParameterValidator.py
# immuneML/workflows/steps/MLMethodAssessment.py
# test/reports/data_reports/test_AminoAcidFrequencyDistribution.py

# Conflicts:
# immuneML/data_model/dataset/Dataset.py
# immuneML/data_model/dataset/ElementDataset.py
# immuneML/data_model/encoded_data/EncodedData.py
# immuneML/dsl/definition_parsers/DefinitionParser.py
# immuneML/dsl/instruction_parsers/ExploratoryAnalysisParser.py
# immuneML/encodings/word2vec/Word2VecEncoder.py
# immuneML/environment/Constants.py
# immuneML/ml_metrics/MetricUtil.py
# immuneML/reports/data_reports/AminoAcidFrequencyDistribution.py
# immuneML/reports/ml_reports/TrainingPerformance.py
# immuneML/workflows/instructions/TrainMLModelInstruction.py
# immuneML/workflows/instructions/exploratory_analysis/ExploratoryAnalysisInstruction.py
# immuneML/workflows/instructions/exploratory_analysis/ExploratoryAnalysisUnit.py
# immuneML/workflows/steps/MLMethodAssessment.py
# setup.py
# test/dsl/test_immuneMLParser.py
# test/encodings/distance_encoding/test_compAIRRDistanceEncoder.py
# test/encodings/onehot/test_oneHotEncoder.py
# test/reports/encoding_reports/test_Matches.py
# test/workflows/instructions/test_trainMLModelInstruction.py

…ile import. Add explicit option to import sequences with unknown productivity where relevant (true by default; the option is not available for immunoseq import types, as their documentation indicates that the productivity type for those file formats is never 'unknown')

LonnekeScheffer and others added 30 commits November 1, 2022 19:23

Merged in PositionalMotifFrequencies report

ae4b0ef

added todos for PositionalMotifFrequencies report

0b92d2d

small (formatting) corrections. updated todos

63e8bae

added precision/recall to feature annotations

8de83ef

Added SignificantMotifPrecisionTP report

1a84a1c

attempt at making MotifEncoder faster by initializing (long) growing lists with None values

d23618a

allow label to be str or dict

d301596

parallelisation of MotifEncoder encoded data matrix construction for speed

ea93bc9

more parallelisation in MotifEncoder

e6aaea6

add weight_thresholds, split_classes via YAML

fcff23d

minor updates

87bd4e1

add WeightsDistribution report

76529e3

add weight_thresholds, split_classes via YAML

156efdf

add docs, add unit test file (not completed)

c3efdef

Merge remote-tracking branch 'origin/weight_report' into short_motif_classifier

9fdc0f0

# Conflicts:
# immuneML/reports/data_reports/WeightsDistribution.py

minor updates

3019675

added todos for Eric in WeightsDistribution report

6f7d928

fixed todos

1c8740f

minor correction

152f873

bugfix: DataWeighter should return a clone of the dataset instead of modifying the dataset

d4b2a02

test print statements

421b0fc

debugging print statements

d75de73

attempted bugfix

3dcbee0

debugging prints

25f73e1

debugging

bb17c49

debugging

50dc558

bugfix

f8e1b36

removed debugging prints

2b4ca84

Bugfixes in MotifGeneralizationAnalysis:

6d50647

- do not ignore sequences with TP=0 in test set
- include FP scores when computing combined precision

LonnekeScheffer added 15 commits April 3, 2023 14:36

merge in sklearn cv bugfix

d265652

final bugfixes merging in master

a3faaf2

added parameter checking when using manual splittype

77e5503

Keras sequence CNN documentation updates + minor fixes

dbe7d25

updated installation docs

8a2ae61

Updated SimilarToPositiveSequenceEncoder, MotifEncoder and BinaryFeatureClassifier docs, removed generalize_motifs option as it is currently not used in practice, and disabled allow_negative_aas option as it requires a few more fixes.

96dd440

fixes regarding disabling allow_negative_aas option

00a8b65

updated MotifGeneralizationAnalysis docs

c2e2386

added motif recovery tutorial to documentation

62e557d

updated docs

ebf6a9b

updated docs

ab0315d

remove deprecated pseudocount parameter

459d6b7

removed importanceweighting strategy and updated docs for predefinedweighting

acd5aea

removed importanceweighting tests

67f65e2

LonnekeScheffer closed this Oct 30, 2023

LonnekeScheffer added 2 commits October 30, 2023 14:31

removed importanceweighting tests

871d744

fixing tests

42ea64e

LonnekeScheffer reopened this Oct 31, 2023

LonnekeScheffer added 5 commits November 1, 2023 11:40

corrected docs (and variable names): percentage-wise frequency change is plotted, not logfold

956bf5f

Merge latest master into short motif, resolve merge conflicts.

8b61b43

- Bugfix in AIRRExporter which read sequences as 'productive'=False when 'productive' was missing
- some updated variable names
- updated docs

workaround bionumpy+pickle error: not using pool but for loop

2af9157

pavlovicmilena reviewed Dec 1, 2023

setup.py Outdated

pavlovicmilena added 2 commits December 1, 2023 12:36

Update setup.py

48782af

Update Constants.py

2d062f5

pavlovicmilena approved these changes Dec 1, 2023

pavlovicmilena merged commit 751ae6b into master Dec 1, 2023
1 check failed

LonnekeScheffer deleted the merge_master_into_short_motif_classifier branch October 16, 2024 21:40
