Merge in short motif code #161

Merged
merged 284 commits on Dec 1, 2023
Commits (284)
ae4b0ef
Merged in PositionalMotifFrequencies report
LonnekeScheffer Nov 1, 2022
0b92d2d
added todos for PositionalMotifFrequencies report
LonnekeScheffer Nov 1, 2022
63e8bae
small (formatting) corrections. updated todos
LonnekeScheffer Nov 1, 2022
8de83ef
added precision/recall to feature annotations
LonnekeScheffer Nov 1, 2022
1a84a1c
Added SignificantMotifPrecisionTP report
LonnekeScheffer Nov 2, 2022
7ad9999
- rename SignificantMotifEncoder to MotifEncoder
LonnekeScheffer Nov 3, 2022
d23618a
attempt at making MotifEncoder faster by initializing (long) growing …
LonnekeScheffer Nov 3, 2022
d301596
allow label to be str or dict
LonnekeScheffer Nov 3, 2022
ea93bc9
parallelisation of MotifEncoder encoded data matrix construction for …
LonnekeScheffer Nov 3, 2022
e6aaea6
more parallelisation in MotifEncoder
LonnekeScheffer Nov 3, 2022
fcff23d
add weight_thresholds, split_classes via YAML
EricEReber Nov 3, 2022
87bd4e1
minor updates
LonnekeScheffer Nov 3, 2022
76529e3
add WeightsDistribution report
EricEReber Nov 1, 2022
156efdf
add weight_thresholds, split_classes via YAML
EricEReber Nov 3, 2022
c3efdef
add docs, add unit test file (not completed)
EricEReber Nov 3, 2022
9fdc0f0
Merge remote-tracking branch 'origin/weight_report' into short_motif_…
LonnekeScheffer Nov 4, 2022
3019675
minor updates
LonnekeScheffer Nov 5, 2022
6f7d928
added todos for Eric in WeightsDistribution report
LonnekeScheffer Nov 6, 2022
1c8740f
fixed todos
EricEReber Nov 6, 2022
152f873
minor correction
LonnekeScheffer Nov 15, 2022
d4b2a02
bugfix: DataWeighter should return a clone of the dataset instead of …
LonnekeScheffer Nov 15, 2022
421b0fc
test print statements
LonnekeScheffer Nov 16, 2022
d75de73
debugging print statements
LonnekeScheffer Nov 16, 2022
3dcbee0
attempted bugfix
LonnekeScheffer Nov 16, 2022
25f73e1
debugging prints
LonnekeScheffer Nov 16, 2022
bb17c49
debugging
LonnekeScheffer Nov 16, 2022
50dc558
debugging
LonnekeScheffer Nov 16, 2022
f8e1b36
bugfix
LonnekeScheffer Nov 16, 2022
2b4ca84
removed debugging prints
LonnekeScheffer Nov 16, 2022
6d50647
Bugfixes in MotifGeneralizationAnalysis:
LonnekeScheffer Nov 16, 2022
8c30838
bugfixes & added smoothing option
LonnekeScheffer Nov 17, 2022
c9a113a
Bugfix: remove sorting from ElementDataset & add assert statement in …
LonnekeScheffer Nov 17, 2022
92347af
Merge branch 'bugfix_element_generator_make_subset' into short_motif_…
LonnekeScheffer Nov 17, 2022
e6fe3b8
extending importance weighting to restrict mutagenesis to only one class
LonnekeScheffer Nov 18, 2022
e8e57c3
finished implementation of class-specific ImportanceWeighting
LonnekeScheffer Nov 18, 2022
db1ba78
- Updated line smoothing code for MotifGeneralizationAnalysis
LonnekeScheffer Nov 21, 2022
4c160b7
added more todos for WeightsDistribution report
LonnekeScheffer Nov 21, 2022
d60aa7e
updated MotifGeneralizationAnalysis:
LonnekeScheffer Nov 21, 2022
fa10989
- Added AminoAcidFrequencyDistribution report: plots a barplot of eac…
LonnekeScheffer Nov 22, 2022
3622c64
Merge branch 'amino_acid_frequency_distribution_report' into short_mo…
LonnekeScheffer Nov 22, 2022
bad7086
Updated color palette
LonnekeScheffer Nov 22, 2022
42aece7
Merge branch 'amino_acid_frequency_distribution_report' into short_mo…
LonnekeScheffer Nov 22, 2022
c0ffb4a
updated AminoAcidFrequencyDistribution to include splitting by label …
LonnekeScheffer Nov 22, 2022
718712d
Merge branch 'amino_acid_frequency_distribution_report' into short_mo…
LonnekeScheffer Nov 22, 2022
563dc22
Updated docs
LonnekeScheffer Nov 22, 2022
6187dfa
Merge branch 'amino_acid_frequency_distribution_report' into short_mo…
LonnekeScheffer Nov 22, 2022
0c84689
update style
LonnekeScheffer Nov 23, 2022
80acc99
sorted categories AminoAcidFrequencyDistribution
LonnekeScheffer Nov 23, 2022
b15b4a0
made range of figures up to 1.01 to not cut off points
LonnekeScheffer Nov 23, 2022
d2c6375
temporarily add sequence hover data to WeightsDistribution report
LonnekeScheffer Nov 24, 2022
4f6a1a3
added option to predefine training set for MotifGeneralizationAnalysi…
LonnekeScheffer Dec 13, 2022
2328cef
automatically determine the optimal TP/recall cutoff and show in plot
LonnekeScheffer Dec 13, 2022
76a17a0
moved get_numpy_sequence_representation to PositionalMotifHelper
LonnekeScheffer Dec 14, 2022
644e851
update: write training set ids to files instead of printing in log (t…
LonnekeScheffer Dec 20, 2022
2a7d706
updated MotifGeneralisationAnalysis: choose last point of exceeding p…
LonnekeScheffer Dec 20, 2022
5a14ee8
plot highlighted motifs on top
LonnekeScheffer Dec 21, 2022
e358e5a
minor refactoring
LonnekeScheffer Dec 21, 2022
3d2207e
allow generalization plot for multiple motif sizes
LonnekeScheffer Dec 21, 2022
e934539
bugfix: dynamically change min_total_points_in_window
LonnekeScheffer Dec 21, 2022
9ab4876
Bugfix
LonnekeScheffer Dec 21, 2022
a3f2ead
bugfix
LonnekeScheffer Dec 22, 2022
7afc1cc
bugfix
LonnekeScheffer Dec 22, 2022
3a0ff78
plot fix
LonnekeScheffer Dec 23, 2022
bf016b8
separate recall cutoff for different motif sizes
LonnekeScheffer Dec 23, 2022
3c3b4f1
updated the way the recall threshold is determined
LonnekeScheffer Dec 31, 2022
5892daa
export confusion matrix
LonnekeScheffer Jan 2, 2023
018220c
theme white
LonnekeScheffer Jan 2, 2023
ea3f58a
minor fix
LonnekeScheffer Jan 2, 2023
7228c0b
added keep_all param to MotifClassifier
LonnekeScheffer Jan 3, 2023
458af3b
improved error message for Metric
LonnekeScheffer Jan 3, 2023
c383b30
merging in changes
LonnekeScheffer Jan 3, 2023
3c5fb2d
bugfix
LonnekeScheffer Jan 3, 2023
5bfac77
bugfix matches report: get subject ids
LonnekeScheffer Jan 4, 2023
d401aa4
Merge branch 'bugfixes' into short_motif_classifier
LonnekeScheffer Jan 4, 2023
df417bb
Merge branch 'master' into amino_acid_frequency_distribution_report
LonnekeScheffer Jan 4, 2023
fbacbfa
Merge branch 'bugfixes' into development
LonnekeScheffer Jan 4, 2023
45fb861
Merge branch 'development' into short_motif_classifier
LonnekeScheffer Jan 4, 2023
fd7b339
bugfix: class mapping
LonnekeScheffer Jan 4, 2023
9239812
added selected features as export value
LonnekeScheffer Jan 4, 2023
3548139
move selected feature writing to fit
LonnekeScheffer Jan 4, 2023
b48a56f
bugfix
LonnekeScheffer Jan 5, 2023
cdf168e
updated the way tp thresholds are determined
LonnekeScheffer Jan 10, 2023
f8e1652
added MotifTestSetPeformance report, refactored to share code with Mo…
LonnekeScheffer Jan 11, 2023
8a72c36
Merge branch 'weight_report' into short_motif_classifier
LonnekeScheffer Jan 13, 2023
6aeb886
New report: NonMotifSimilarity
LonnekeScheffer Jan 14, 2023
d383871
rename report
LonnekeScheffer Jan 14, 2023
8fac880
Merge remote-tracking branch 'origin/short_motif_classifier' into sho…
LonnekeScheffer Jan 14, 2023
a3d3652
removed deprecated report, added requirements specific for tensorflow
LonnekeScheffer Jan 14, 2023
0858ec8
updated format of example id files for compatibility
LonnekeScheffer Jan 14, 2023
46e7376
bugfix manual splitter: it didn't work for non-string classes, now ev…
LonnekeScheffer Jan 14, 2023
ef33a7e
Merge branch 'bugfixes' into short_motif_classifier
LonnekeScheffer Jan 14, 2023
82b5185
bugfix
LonnekeScheffer Jan 14, 2023
c7ea9d1
shorten log text - becomes extremely long and unreadable
LonnekeScheffer Jan 14, 2023
8419200
Merge branch 'bugfixes' into short_motif_classifier
LonnekeScheffer Jan 14, 2023
75cd713
bugfix identifiers
LonnekeScheffer Jan 16, 2023
a3ea577
refactored out col_names stuff for simplicity
LonnekeScheffer Jan 17, 2023
80e46fa
refactoring, more shared code, splitting per motif size of motiftests…
LonnekeScheffer Jan 17, 2023
4899298
Add MotifOverlapReport
EricEReber Jan 19, 2023
0f88191
prettier plots
LonnekeScheffer Jan 23, 2023
40c3d07
all tp cutoffs in one file
LonnekeScheffer Jan 23, 2023
295999c
started implementation, abandoned idea for now
LonnekeScheffer Jan 23, 2023
1e1d13c
export simple stats from MotifEncoder
LonnekeScheffer Jan 23, 2023
d1ef769
updated plot
LonnekeScheffer Jan 24, 2023
a4ca0de
Initial version
EricEReber Jan 24, 2023
393441b
backup, installing new OS
EricEReber Jan 24, 2023
22df0bf
small edits
LonnekeScheffer Jan 25, 2023
a80a463
comment out some experimental code
LonnekeScheffer Jan 25, 2023
68e2564
bugfix test
LonnekeScheffer Jan 25, 2023
24816b1
added SimilarToPositiveSequenceEncoder: a full sequence hamming dist-…
LonnekeScheffer Jan 26, 2023
46f3e3d
add facet
EricEReber Jan 26, 2023
780a415
different sizes
EricEReber Jan 26, 2023
4a1dc9a
clean up
EricEReber Jan 26, 2023
df953e9
slight speed improvement: allow lower size limit on motifs and don't …
LonnekeScheffer Jan 27, 2023
9dc3ed7
more helpful error message
LonnekeScheffer Jan 27, 2023
ebb0889
minor updates to plot styling
LonnekeScheffer Jan 27, 2023
fb54eee
minor updates
LonnekeScheffer Jan 27, 2023
e2f120c
minor bugfix
LonnekeScheffer Jan 27, 2023
b55478e
change dataframe structure
EricEReber Jan 29, 2023
0d8bda8
all in one plot, change table
EricEReber Jan 29, 2023
6671046
add help method
EricEReber Jan 29, 2023
2cd851b
update test bench
EricEReber Jan 29, 2023
d8eb122
add duplicate max values
EricEReber Jan 29, 2023
057bf24
added option for negative amino acids to Motif encoder
LonnekeScheffer Jan 30, 2023
690753f
added option for negative amino acids to Motif encoder
LonnekeScheffer Jan 30, 2023
c2d6b3e
add top/bottom n and filtering to FeatureValueBarplot
pavlovicmilena Jan 30, 2023
b76647c
added option for negative amino acids to Motif encoder
LonnekeScheffer Jan 31, 2023
69da4cd
Add max_gap_size_only functionality
EricEReber Feb 2, 2023
cc28900
Label:
LonnekeScheffer Feb 2, 2023
39f5875
Merge branch 'bugfix_label_classes' into bugfixes
LonnekeScheffer Feb 2, 2023
3c4bda8
Merge branch 'bugfix_label_classes' into short_motif_classifier
LonnekeScheffer Feb 2, 2023
fd3b82a
fixes after new update
LonnekeScheffer Feb 2, 2023
cc84032
cleaner way of getting label desc for storing ML models
LonnekeScheffer Feb 2, 2023
d2e8360
improved tests
LonnekeScheffer Feb 3, 2023
5b9f2ae
Merge branch 'bugfix_label_classes' into bugfixes
LonnekeScheffer Feb 3, 2023
df96243
Merge branch 'bugfixes' into short_motif_classifier
LonnekeScheffer Feb 3, 2023
34d7571
minor fix
LonnekeScheffer Feb 3, 2023
1f530ae
Merge branch 'short_motif_classifier' into MotifOverlap
LonnekeScheffer Feb 3, 2023
07857f8
little refactoring, cleaned up some shared code between GroundTruthMo…
LonnekeScheffer Feb 3, 2023
d7b1c04
Merge branch 'short_motif_classifier' into PositionalFreq
LonnekeScheffer Feb 3, 2023
d08fc07
made gap plot a lineplot
LonnekeScheffer Feb 3, 2023
579a44c
default param
LonnekeScheffer Feb 3, 2023
75a5ee7
check params
LonnekeScheffer Feb 3, 2023
6e11e56
added BinaryFeaturePrecisionRecall: a precision-recall plot for Binar…
LonnekeScheffer Feb 4, 2023
da55c0d
added precision-recall plot for BinaryFeatureClassifier, plus the opt…
LonnekeScheffer Feb 4, 2023
429aa48
added precision-recall plot for BinaryFeatureClassifier, plus the opt…
LonnekeScheffer Feb 4, 2023
dc18fa7
minor update error message
LonnekeScheffer Feb 4, 2023
9a5fae7
Merge branch 'bugfixes' into short_motif_classifier
LonnekeScheffer Feb 4, 2023
3bf16d6
bugfix
LonnekeScheffer Feb 4, 2023
f0e14f6
bugfix
LonnekeScheffer Feb 4, 2023
6e9e264
improved test
LonnekeScheffer Feb 4, 2023
afa79ee
bugfix, got stuck in an infinite loop
LonnekeScheffer Feb 4, 2023
6fc0d75
bugfix
LonnekeScheffer Feb 4, 2023
e0f9454
bugfix
LonnekeScheffer Feb 4, 2023
3fe09e0
bugfixes
LonnekeScheffer Feb 4, 2023
450a0e2
temporarily set higher recursion depth to prevent crashing
LonnekeScheffer Feb 5, 2023
e5caaaa
update report to show training-validation-test set performance indepe…
LonnekeScheffer Feb 6, 2023
57aa58a
Made CompAIRR-powered version of SimilarToPositiveSequenceEncoder
LonnekeScheffer Feb 6, 2023
14add54
minor fixes GroundTruthMotifOverlap plot & make it possible for Binar…
LonnekeScheffer Feb 6, 2023
1fc588c
remove print statement
LonnekeScheffer Feb 6, 2023
59ff6de
minor update
LonnekeScheffer Feb 6, 2023
0617a40
rename highlight_motifs_path to groundtruth_motifs_path
LonnekeScheffer Feb 6, 2023
5f1c53e
bugfixes compairr-version of SimilarToPositiveSequenceEncoder
LonnekeScheffer Feb 7, 2023
a4f0206
bugfixes compairr-version of SimilarToPositiveSequenceEncoder
LonnekeScheffer Feb 7, 2023
ba4d5a3
separate output folder for learning model
LonnekeScheffer Feb 7, 2023
1b76456
added option to automatically remove test dataset (can be large)
LonnekeScheffer Feb 7, 2023
fc2c8b9
Update AminoAcidFrequencyDistribution report to show log-fold change
LonnekeScheffer Feb 9, 2023
15286ae
implemented get_attribute for Receptor. All receptors have identifier…
LonnekeScheffer Feb 9, 2023
77d069c
Merge branch 'bugfixes' into short_motif_classifier
LonnekeScheffer Feb 9, 2023
eed9106
bugfix
LonnekeScheffer Feb 9, 2023
894235a
switch from logfold change to difference in relative frequency
LonnekeScheffer Feb 9, 2023
5033692
.
LonnekeScheffer Feb 11, 2023
8432452
1-based counting of positions
LonnekeScheffer Feb 11, 2023
925a10a
functionality to export non-optimal ML models in addition to the opti…
LonnekeScheffer Feb 15, 2023
8bb8fa3
undo partial commit
LonnekeScheffer Feb 15, 2023
db1d2ec
improved efficiency of BinaryFeatureClassifier
LonnekeScheffer Feb 15, 2023
4b4fae5
added lots of log statements to find out where the running time bottl…
LonnekeScheffer Feb 15, 2023
5b13d35
keep track of val predictions instead of recomputing them every time
LonnekeScheffer Feb 15, 2023
7719032
added multiprocessing option for BinaryFeatureClassifier
LonnekeScheffer Feb 15, 2023
31a4f6b
remove default cores for training to test
LonnekeScheffer Feb 15, 2023
e253da0
bugfix: pass cores_for_training in recursive function
LonnekeScheffer Feb 15, 2023
f911915
possible speed improvement: dont recompute scoring fn when array is e…
LonnekeScheffer Feb 15, 2023
102ed8e
remove log statement
LonnekeScheffer Feb 15, 2023
0476057
- in BinaryFeatureClassifier, keep track of indices that show improve…
LonnekeScheffer Feb 15, 2023
afcc07f
updated log statement
LonnekeScheffer Feb 15, 2023
8aeb80d
remove log statements
LonnekeScheffer Feb 16, 2023
5642ee1
minor fix docs
LonnekeScheffer Feb 16, 2023
8c28e59
fixes for Label in MLApplication instruction: explicitly pass on the …
LonnekeScheffer Feb 16, 2023
0b7f3a4
Merge branch 'bugfix_label_mlapplication' into short_motif_classifier
LonnekeScheffer Feb 16, 2023
2def537
Merge branch 'master' into development
LonnekeScheffer Feb 16, 2023
ca27b0e
Allow metrics to be computed during MLApplication if the same label i…
LonnekeScheffer Feb 16, 2023
d08166c
small fix to make tests pass
LonnekeScheffer Feb 16, 2023
0991971
fix: html was overwritten
LonnekeScheffer Feb 16, 2023
5785eb8
Merge branch 'development' into short_motif_classifier
LonnekeScheffer Feb 16, 2023
a9a3f2b
bugfixes
LonnekeScheffer Feb 16, 2023
c473f46
Merge branch 'development' into short_motif_classifier
LonnekeScheffer Feb 16, 2023
232fee7
restored example weights
LonnekeScheffer Feb 16, 2023
24fa1c1
bugfix: test if proba available
LonnekeScheffer Feb 17, 2023
afd221a
bugfix: dont access _proba columns when not defined
LonnekeScheffer Feb 17, 2023
5e0e66b
bugfix: convert everything to string
LonnekeScheffer Feb 17, 2023
c812179
Merge branch 'development' into short_motif_classifier
LonnekeScheffer Feb 17, 2023
e09fdf8
small fixes
LonnekeScheffer Feb 19, 2023
9ceb587
bugfixes
LonnekeScheffer Feb 19, 2023
32911ed
Merge branch 'development' into short_motif_classifier
LonnekeScheffer Feb 19, 2023
161c535
fix bug
EricEReber Feb 20, 2023
e135dd4
big bug fix
EricEReber Feb 21, 2023
7cb23fa
added test for GroundTruthMotifOverlap + small fixes
LonnekeScheffer Feb 25, 2023
445e129
Merge remote-tracking branch 'origin/short_motif_classifier' into sho…
LonnekeScheffer Feb 25, 2023
8494ea6
small fix for faster test
LonnekeScheffer Feb 25, 2023
1774127
small updates to motif reports
LonnekeScheffer Feb 27, 2023
c51be0e
axis title updates
LonnekeScheffer Feb 27, 2023
2fd3c48
minor aesthetic update
LonnekeScheffer Feb 27, 2023
6c017e2
undo change in test
LonnekeScheffer Feb 27, 2023
2c2e059
visual updates to plots
LonnekeScheffer Feb 28, 2023
200923b
bugfix to gaps report
LonnekeScheffer Mar 2, 2023
b04f000
fixed warning
LonnekeScheffer Mar 2, 2023
24d9d88
minor fix gaps figure
LonnekeScheffer Mar 3, 2023
495020b
minor fix gaps figure
LonnekeScheffer Mar 3, 2023
ad4c04d
minor fix gaps figure
LonnekeScheffer Mar 3, 2023
d5533e0
Add new _get_max_overlap
EricEReber Mar 4, 2023
f7d28a9
remove obsolete title
LonnekeScheffer Mar 7, 2023
98086be
minor updates
LonnekeScheffer Mar 18, 2023
4d7b4f9
plot update: show line on left side of test plots for motif generaliz…
LonnekeScheffer Mar 29, 2023
cfa1003
remove obsolete report
LonnekeScheffer Mar 29, 2023
c89cf97
remove internal cv in outer assessment loop for sklearn
pavlovicmilena Apr 3, 2023
cb8b89a
Merge branch 'master' into short_motif_classifier
LonnekeScheffer Apr 3, 2023
b5d6aa6
Merge branch 'sklearn_internal_cv_fix' into short_motif_classifier
LonnekeScheffer Apr 3, 2023
d265652
merge in sklearn cv bugfix
LonnekeScheffer Apr 3, 2023
a5611a6
Merge branch 'master' into merge_master_into_short_motif_classifier
LonnekeScheffer Oct 26, 2023
a3faaf2
final bugfixes merging in master
LonnekeScheffer Oct 26, 2023
77e5503
added parameter checking when using manual splittype
LonnekeScheffer Oct 26, 2023
dbe7d25
Keras sequence CNN documentation updates + minor fixes
LonnekeScheffer Oct 26, 2023
8a2ae61
updated installation docs
LonnekeScheffer Oct 27, 2023
96dd440
Updated SimilarToPositiveSequenceEncoder, MotifEncoder and BinaryFeat…
LonnekeScheffer Oct 27, 2023
00a8b65
fixes regarding disabling allow_negative_aas option
LonnekeScheffer Oct 27, 2023
c2e2386
updated MotifGeneralizationAnalysis docs
LonnekeScheffer Oct 27, 2023
62e557d
added motif recovery tutorial to documentation
LonnekeScheffer Oct 27, 2023
ebf6a9b
updated docs
LonnekeScheffer Oct 27, 2023
ab0315d
updated docs
LonnekeScheffer Oct 27, 2023
459d6b7
remove deprecated pseudocount parameter
LonnekeScheffer Oct 27, 2023
acd5aea
removed importanceweighting strategy and updated docs for predefinedw…
LonnekeScheffer Oct 27, 2023
67f65e2
removed importanceweighting tests
LonnekeScheffer Oct 27, 2023
871d744
removed importanceweighting tests
LonnekeScheffer Oct 30, 2023
42ea64e
fixing tests
LonnekeScheffer Oct 30, 2023
956bf5f
corrected docs (and variable names): percentage-wise frequency change…
LonnekeScheffer Nov 1, 2023
efa55e8
Merge branch 'master' into merge_master_into_short_motif_classifier
LonnekeScheffer Nov 16, 2023
8b61b43
Merge latest master into short motif, resolve merge conflicts.
LonnekeScheffer Nov 27, 2023
5a04ee9
Bugfixes related to sequence frame type and 'productive' status for f…
LonnekeScheffer Nov 27, 2023
2af9157
workaround bionumpy+pickle error: not using pool but for loop
LonnekeScheffer Nov 28, 2023
48782af
Update setup.py
pavlovicmilena Dec 1, 2023
2d062f5
Update Constants.py
pavlovicmilena Dec 1, 2023
4 changes: 2 additions & 2 deletions docs/source/developer_docs/how_to_add_new_encoding.rst
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ An example of the implementation of :code:`NewKmerFrequencyEncoder` for the :py:
"""
Encodes the repertoires of the dataset by k-mer frequencies and normalizes the frequencies to zero mean and unit variance.

Arguments:
Specification arguments:

k (int): k-mer length

Expand Down Expand Up @@ -324,7 +324,7 @@ This is the example of documentation for :py:obj:`~immuneML.encodings.filtered_s
Nature Genetics 49, no. 5 (May 2017): 659–65. `doi.org/10.1038/ng.3822 <https://doi.org/10.1038/ng.3822>`_.


Arguments:
Specification arguments:

comparison_attributes (list): The attributes to be considered to group receptors into clonotypes. Only the fields specified in
comparison_attributes will be considered, all other fields are ignored. Valid comparison value can be any repertoire field name.
Expand Down
4 changes: 2 additions & 2 deletions docs/source/developer_docs/how_to_add_new_preprocessing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@ It includes implementations of the abstract methods and class documentation at t
lower_limit, or more clonotypes than specified by the upper_limit.
Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.

Arguments:
Specification arguments:

lower_limit (int): The minimal inclusive lower limit for the number of clonotypes allowed in a repertoire.

Expand Down Expand Up @@ -260,7 +260,7 @@ This is the example of documentation for :py:obj:`~immuneML.preprocessing.filter
lower_limit, or more clonotypes than specified by the upper_limit.
Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.

Arguments:
Specification arguments:

lower_limit (int): The minimal inclusive lower limit for the number of clonotypes allowed in a repertoire.

Expand Down
59 changes: 50 additions & 9 deletions docs/source/installation/install_with_package_manager.rst
Original file line number Diff line number Diff line change
Expand Up @@ -50,14 +50,6 @@ Note: when creating a python virtual environment, it will automatically use the

pip install immuneML

Alternatively, if you want to use the :ref:`TCRdistClassifier` ML method and corresponding :ref:`TCRdistMotifDiscovery` report, include the optional extra :code:`TCRdist`:

.. code-block:: console

pip install immuneML[TCRdist]

See also this question under 'Troubleshooting': :ref:`I get an error when installing PyTorch (could not find a version that satisfies the requirement torch)`



Install immuneML with conda
Expand Down Expand Up @@ -95,6 +87,25 @@ Install immuneML with conda
Installing optional dependencies
----------------------------------

TCRDist
*******

If you want to use the :ref:`TCRdistClassifier` ML method and corresponding :ref:`TCRdistMotifDiscovery` report, you can include the optional extra :code:`TCRdist`:

.. code-block:: console

pip install immuneML[TCRdist]

The TCRdist dependencies can also be installed manually using the :download:`requirements_TCRdist.txt <https://raw.githubusercontent.com/uio-bmi/immuneML/master/requirements_TCRdist.txt>` file:

.. code-block:: console

pip install -r requirements_TCRdist.txt


DeepRC
******

Optionally, if you want to use the :ref:`DeepRC` ML method and corresponding :ref:`DeepRCMotifDiscovery` report, you also
have to install DeepRC dependencies using the :download:`requirements_DeepRC.txt <https://raw.githubusercontent.com/uio-bmi/immuneML/master/requirements_DeepRC.txt>` file.
Important note: DeepRC uses PyTorch functionalities that depend on GPU. Therefore, DeepRC does not work on a CPU.
Expand All @@ -104,8 +115,38 @@ To install the DeepRC dependencies, run:

pip install -r requirements_DeepRC.txt --no-dependencies

See also this question under 'Troubleshooting': :ref:`I get an error when installing PyTorch (could not find a version that satisfies the requirement torch)`


Keras-based sequence CNN
************************

In order to use the :ref:`KerasSequenceCNN`, optional dependencies :code:`keras` and :code:`tensorflow` need to be installed.
By default, version 2.11.0 of both dependencies is used.
Other versions may work as well, as long as the versions of :code:`keras` and :code:`tensorflow` used are compatible with each other.

To install the default versions of these packages, you can include the optional extra :code:`KerasSequenceCNN`:

.. code-block:: console

pip install immuneML[KerasSequenceCNN]

Or install the dependencies manually using the :download:`requirements_KerasSequenceCNN.txt <https://raw.githubusercontent.com/uio-bmi/immuneML/master/requirements_KerasSequenceCNN.txt>` file:

.. code-block:: console

pip install -r requirements_KerasSequenceCNN.txt


The :ref:`KerasSequenceCNN` runs on the CPU; it does *not* rely on a GPU.

CompAIRR
********

If you want to use the :ref:`CompAIRRDistance` or :ref:`CompAIRRSequenceAbundance` encoder, you have to install the C++ tool `CompAIRR <https://github.com/uio-bmi/compairr>`_.
The easiest way to do this is by cloning CompAIRR from GitHub and installing it using :code:`make` in the main folder:
Furthermore, the :ref:`SimilarToPositiveSequence` encoder can be run both with and without CompAIRR, but the CompAIRR-based version is faster.

The easiest way to install CompAIRR is by cloning CompAIRR from GitHub and installing it using :code:`make` in the main folder:

.. code-block:: console

Expand Down
7 changes: 5 additions & 2 deletions docs/source/tutorials/how_to_apply_to_new_data.rst
Original file line number Diff line number Diff line change
Expand Up @@ -33,8 +33,11 @@ For a tutorial on importing datasets to immuneML (for training or applying an ML
YAML specification example using the MLApplication instruction
------------------------------------------------------------------
The :ref:`MLApplication` instruction takes in a :code:`dataset` and a :code:`config_path`. The :code:`config_path` should
point at one of the .zip files exported by the previously run :ref:`TrainMLModel` instruction. They can be found in the sub-folder
:code:`instruction_name/optimal_label_name` in the results folder.
point at one of the .zip files exported by the previously run :ref:`TrainMLModel` instruction.
The configuration of the optimal ML setting can always be found in the sub-folder :code:`<instruction_name>/optimal_<label_name>/zip` in the results folder.
Alternatively, when the :ref:`TrainMLModel` instruction is run with the parameter :code:`export_all_ml_settings` set to :code:`True`,
a config file for every ML setting in every assessment split can be found inside :code:`<instruction_name>/split_<number>/<ml_setting_name>/ml_settings_config/zip`.
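
For example, with :code:`export_all_ml_settings` enabled, :code:`config_path` can point at one of these per-split zip files. The sketch below is hypothetical: the instruction, dataset and ML setting names, the split number and the exported zip file name are placeholders.

.. code-block:: yaml

    instructions:
      apply_model:
        type: MLApplication
        dataset: my_new_dataset  # a dataset defined under definitions/datasets (not shown here)
        config_path: train_instruction/split_1/my_ml_setting/ml_settings_config/zip/exported_config.zip  # placeholder; use the actual .zip file exported there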


.. highlight:: yaml
Expand Down
45 changes: 45 additions & 0 deletions docs/source/tutorials/motif_recovery.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,51 @@ immuneML provides several different options for recovering motifs associated wit
Depending on the context, immuneML provides several different reports which can be used for this purpose.


Discovering positional motifs using precision and recall thresholds
----------------------------------------------------------------------

It is often assumed that the antigen binding status of an immune receptor (antibody/TCR) may be determined by the *presence*
of a short motif in the CDR3.
We developed a method (manuscript in preparation) for the discovery of antigen binding associated motifs with the following properties:

- Short position-specific motifs with possible gaps
- High precision for predicting antigen binding
- High generalisability to unseen data, i.e., retaining a relatively high precision on test data


Method description
^^^^^^^^^^^^^^^^^^^^^^^^^^^

A motif with a high precision for predicting antigen binding implies that when the motif is present,
the probability that the sequence is a binder is high. One can thus iterate through every possible motif and filter
the candidate motifs by applying a precision threshold. However, the rarer a motif is, the more likely it is that the motif
has a high precision merely by chance (for example: a motif that occurs in only 1 binder and 0 non-binders has a perfect precision,
but may not retain high precision on unseen data). Thus, an additional recall threshold is applied to remove
rare motifs.
Our method allows the user to define a precision threshold and learn the optimal recall threshold using a training + validation set.
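
In standard terms, treating motif presence as a binary predictor of binding status, the precision and recall of a motif can be written as:

.. math::

    \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN},

where :math:`TP` is the number of binders containing the motif, :math:`FP` the number of non-binders containing it, and :math:`FN` the number of binders not containing it. In the example above, a motif found in 1 binder and 0 non-binders has precision :math:`1/(1+0) = 1`, but its recall is only :math:`1/(\text{total number of binders})`, so a recall threshold removes such motifs.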

The method consists of the following steps:

1. Splitting the data into training, validation and test sets.

2. Using the training set, find all motifs with a high training-precision.

3. Using the validation set, determine the recall threshold for which the validation-precision is still high (separate recall thresholds may be learned for motifs with different sizes).

4. Using the combined training + validation set, find all motifs exceeding the user-defined precision threshold and learned recall threshold(s).

5. Using the test set, report the precision and recall of these learned motifs.

6. Optional: use the set of learned motifs as input features for ML classifiers (e.g., :ref:`BinaryFeatureClassifier` or :ref:`LogisticRegression`) for antigen binding prediction.

Steps 2 and 3 are carried out by the :ref:`MotifGeneralizationAnalysis` report, which exports the learned recall cutoff(s).
It is recommended to run this report using the :ref:`ExploratoryAnalysis` instruction.
Steps 4 and 5 are carried out by the :ref:`Motif` encoder, which takes the learned recall cutoff(s) as input parameters. This encoder
can be used in either the :ref:`ExploratoryAnalysis` or :ref:`TrainMLModel` instruction.
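
To illustrate how these components fit together, a minimal YAML sketch is given below. It is a hypothetical example rather than a complete specification: the dataset, report, encoding and analysis names are placeholders, the dataset and label definitions are omitted, and the parameter values are only illustrative (most follow the default parameter files shipped with this feature).

.. code-block:: yaml

    definitions:
      reports:
        motif_generalization:  # steps 2+3: learn the recall cutoff(s) from training/validation data
          MotifGeneralizationAnalysis:
            max_positions: 4
            min_positions: 1
            min_precision: 0.9
            min_true_positives: 10
            training_percentage: 0.7
            split_by_motif_size: true
      encodings:
        motif_encoding:  # steps 4+5: keep motifs exceeding the precision threshold and learned recall cutoff(s)
          Motif:
            max_positions: 4
            min_positions: 1
            min_precision: 0.8
            min_recall: 0.005  # hypothetical value; in practice taken from the recall cutoff file exported by the report
            min_true_positives: 10
    instructions:
      motif_recovery:
        type: ExploratoryAnalysis  # recommended instruction for running the report
        analyses:
          learn_recall_cutoff:
            dataset: my_dataset  # placeholder; dataset definition not shown
            report: motif_generalization

The resulting encoded features can subsequently serve as input for classifiers such as :ref:`BinaryFeatureClassifier` in a :ref:`TrainMLModel` run (step 6).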




Discovering motifs learned by classifiers
-----------------------------------------

Expand Down
8 changes: 5 additions & 3 deletions immuneML/IO/dataset_export/AIRRExporter.py
Original file line number Diff line number Diff line change
Expand Up @@ -207,12 +207,14 @@ def _postprocess_dataframe(df, dataset_labels: dict, omit_columns: list = None):
if "frame_type" in df.columns:
AIRRExporter._enums_to_strings(df, "frame_type")

df["productive"] = df["frame_type"] == SequenceFrameType.IN.name
df.loc[df["frame_type"].isnull(), "productive"] = ''
df["productive"] = df["frame_type"] == SequenceFrameType.IN.value
df.loc[df["frame_type"].isnull(), "productive"] = ""
df.loc[df["frame_type"] == "", "productive"] = ""
df.loc[df["frame_type"] == SequenceFrameType.UNDEFINED.value, "productive"] = ""

df["vj_in_frame"] = df["productive"]

df["stop_codon"] = df["frame_type"] == SequenceFrameType.STOP.name
df["stop_codon"] = df["frame_type"] == SequenceFrameType.STOP.value
df.loc[df["frame_type"].isnull(), "stop_codon"] = ''

df.drop(columns=["frame_type"], inplace=True)
Expand Down
11 changes: 7 additions & 4 deletions immuneML/IO/dataset_import/AIRRImport.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,8 @@ class AIRRImport(DataImport):

- import_productive (bool): Whether productive sequences (with value 'T' in column productive) should be included in the imported sequences. By default, import_productive is True.

- import_unknown_productivity (bool): Whether sequences with unknown productivity (missing value in column productive) should be included in the imported sequences. By default, import_unknown_productivity is True.

- import_with_stop_codon (bool): Whether sequences with stop codons (with value 'T' in column stop_codon) should be included in the imported sequences. This only applies if column stop_codon is present. By default, import_with_stop_codon is False.

- import_out_of_frame (bool): Whether out of frame sequences (with value 'F' in column vj_in_frame) should be included in the imported sequences. This only applies if column vj_in_frame is present. By default, import_out_of_frame is False.
Expand Down Expand Up @@ -110,15 +112,16 @@ def preprocess_dataframe(df: pd.DataFrame, params: DatasetImportParams):
- the allele information is removed from the V and J genes
"""
if "productive" in df.columns:
df["frame_type"] = SequenceFrameType.OUT.name
df.loc[df["productive"], "frame_type"] = SequenceFrameType.IN.name
df["frame_type"] = SequenceFrameType.UNDEFINED.value
df.loc[df["productive"]==True, "frame_type"] = SequenceFrameType.IN.value
df.loc[df["productive"]==False, "frame_type"] = SequenceFrameType.OUT.value
else:
df["frame_type"] = None

if "vj_in_frame" in df.columns:
df.loc[df["vj_in_frame"], "frame_type"] = SequenceFrameType.IN.name
df.loc[df["vj_in_frame"]==True, "frame_type"] = SequenceFrameType.IN.value
if "stop_codon" in df.columns:
df.loc[df["stop_codon"], "frame_type"] = SequenceFrameType.STOP.name
df.loc[df["stop_codon"]==True, "frame_type"] = SequenceFrameType.STOP.value

if "productive" in df.columns:
frame_type_list = ImportHelper.prepare_frame_type_list(params)
Expand Down
1 change: 1 addition & 0 deletions immuneML/IO/dataset_import/DatasetImportParams.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ class DatasetImportParams:
column_mapping_synonyms: dict = None
region_type: RegionType = None
import_productive: bool = None
import_unknown_productivity: bool = None
import_unproductive: bool = None
import_with_stop_codon: bool = None
import_out_of_frame: bool = None
Expand Down
20 changes: 15 additions & 5 deletions immuneML/IO/dataset_import/TenxGenomicsImport.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,12 @@ class TenxGenomicsImport(DataImport):

- receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values for receptor_chains are the names of the :py:obj:`~immuneML.data_model.receptor.ChainPair.ChainPair` enum. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).

- import_productive (bool): Whether productive sequences (with value 'True' in column productive) should be included in the imported sequences. By default, import_productive is True.

- import_unproductive (bool): Whether unproductive sequences (with value 'False' in column productive) should be included in the imported sequences. By default, import_unproductive is False.

- import_unknown_productivity (bool): Whether sequences with unknown productivity (missing or 'NA' value in column productive) should be included in the imported sequences. By default, import_unknown_productivity is True.

- import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon '*', or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

- import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.
Expand Down Expand Up @@ -105,17 +111,21 @@ def import_dataset(params: dict, dataset_name: str) -> Dataset:

@staticmethod
def preprocess_dataframe(df: pd.DataFrame, params: DatasetImportParams):
df["frame_type"] = None
df['productive'] = df['productive'] == 'True'
df.loc[df['productive'], "frame_type"] = SequenceFrameType.IN.name
df["frame_type"] = SequenceFrameType.UNDEFINED.value
df.loc[df['productive']=="True", "frame_type"] = SequenceFrameType.IN.value
df.loc[df['productive']=="False", "frame_type"] = SequenceFrameType.OUT.value

allowed_productive_values = []
if params.import_productive:
allowed_productive_values.append(True)
allowed_productive_values.append('True')
if params.import_unproductive:
allowed_productive_values.append(False)
allowed_productive_values.append('False')
if params.import_unknown_productivity:
allowed_productive_values.append('')
allowed_productive_values.append('NA')

df = df[df.productive.isin(allowed_productive_values)]
df.drop(columns=["productive"], inplace=True)

ImportHelper.junction_to_cdr3(df, params.region_type)
df.loc[:, "region_type"] = params.region_type.name
Expand Down
2 changes: 1 addition & 1 deletion immuneML/IO/dataset_import/VDJdbImport.py
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,7 @@ def import_dataset(params: dict, dataset_name: str) -> Dataset:

@staticmethod
def preprocess_dataframe(df: pd.DataFrame, params: DatasetImportParams):
df["frame_type"] = SequenceFrameType.IN.name
df["frame_type"] = SequenceFrameType.IN.value
ImportHelper.junction_to_cdr3(df, params.region_type)
df.loc[:, "region_type"] = params.region_type.name

Expand Down
1 change: 1 addition & 0 deletions immuneML/config/default_params/datasets/airr_params.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ is_repertoire: True
path: ./
paired: False
import_productive: True
import_unknown_productivity: True
import_with_stop_codon: False
import_out_of_frame: False
import_illegal_characters: False
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ is_repertoire: True
path: ./
paired: False
import_productive: True
import_unknown_productivity: True
import_with_stop_codon: False
import_out_of_frame: False
import_illegal_characters: False
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ is_repertoire: True
path: ./
import_productive: True # whether to only import productive sequences
import_unproductive: False # whether to only import unproductive sequences
import_unknown_productivity: True # whether to import sequences with unknown productivity (missing/NA)
import_illegal_characters: False
region_type: "IMGT_CDR3" # which region to use - IMGT_CDR3 option means removing first and last amino acid as 10xGenomics uses IMGT junction as CDR3
separator: "," # column separator
Expand Down
5 changes: 5 additions & 0 deletions immuneML/config/default_params/encodings/motif_params.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
max_positions: 4
min_positions: 1
min_precision: 0.8
min_recall: 0
min_true_positives: 10
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
hamming_distance: 1
ignore_genes: false
threads: 8
keep_temporary_files: false
compairr_path: null
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
separator: "\t"
Original file line number Diff line number Diff line change
Expand Up @@ -10,4 +10,6 @@ assessment: # outer loop of nested CV
selection: # inner loop of nested CV
split_strategy: random # perform random split to train and validation datasets
split_count: 1 # how many fold to create
training_percentage: 0.7
training_percentage: 0.7
example_weighting: null
export_all_ml_settings: False # only export the optimal model
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
training_percentage: 0.7
max_features: 100
patience: 5
min_delta: 0
keep_all: false
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
training_percentage: 0.7
units_per_layer: [[CONV, 400, 3, 1], [DROP, 0.5], [POOL, 2, 1], [FLAT], [DENSE, 50]]
activation: relu
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
training_set_identifier_path: null
training_percentage: 0.7
split_by_motif_size: true
max_positions: 4
min_positions: 1
min_precision: 0.9
min_recall: 0
min_true_positives: 1
test_precision_threshold: 0.8
highlight_motifs_name: Highlighted motif
min_points_in_window: 50
smoothing_constant1: 5
smoothing_constant2: 10
training_set_name: training set
test_set_name: test set
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
n_splits: 5
max_positions: 4
min_precision: 0
min_recall: 0
min_true_positives: 1
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
highlight_motifs_name: Highlighted motif
min_points_in_window: 50
smoothing_constant1: 5
smoothing_constant2: 10
training_set_name: training set
test_set_name: test set
split_by_motif_size: true
keep_test_dataset: true