Microalgae model #19

Open
simonbrd opened this issue Mar 31, 2022 · 18 comments

Comments

@simonbrd

simonbrd commented Mar 31, 2022

Hello again,
I have been using your tools for some time, and I would now like to know more about how you designed your model model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt.

For context, I am working on microalgae data and would like to train my own model on it.

Also, do you plan to release a model like the one that is available (model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt), but for R10.3 data?

Thank you in advance

@PengNi
Owner

PengNi commented Mar 31, 2022

Hi @simonbrd, you can try something like the following steps to train a new model:

# demo cmds for generating training samples
# 1. deepsignal_plant extract (extract features from fast5s)
deepsignal_plant extract --fast5_dir fast5s/ [--corrected_group --basecall_subgroup --reference_path] --methy_label 1 --motifs CG --mod_loc 0 --write_path samples_CG.hc_poses_positive.tsv [--nproc] --positions /path/to/file/containing/high_confidence/positive/sites.tsv
deepsignal_plant extract --fast5_dir fast5s/ [--corrected_group --basecall_subgroup --reference_path] --methy_label 0 --motifs CG --mod_loc 0 --write_path samples_CG.hc_poses_negative.tsv [--nproc] --positions /path/to/file/containing/high_confidence/negative/sites.tsv

# 2. randomly select an equal number (e.g., 10m) of positive and negative samples
# the selected positive and negative samples can then be combined and used for training, see step 3.
python /path/to/scripts/randsel_file_rows.py --ori_filepath samples_CG.hc_poses_positive.tsv --write_filepath samples_CG.hc_poses_positive.r10m.tsv --num_lines 10000000 --header false &
python /path/to/scripts/randsel_file_rows.py --ori_filepath samples_CG.hc_poses_negative.tsv --write_filepath samples_CG.hc_poses_negative.r10m.tsv --num_lines 10000000 --header false &
# wait for both background jobs to finish before combining their outputs
wait

# 3. combine positive and negative samples for training
# after combining, the combined file can be split into two files as training/validation sets, see step 4.
python /path/to/scripts/concat_two_files.py --fp1 samples_CG.hc_poses_positive.r10m.tsv --fp2 samples_CG.hc_poses_negative.r10m.tsv --concated_fp samples_CG.hc_poses.r20m.tsv

# 4. split samples for training/validation
# suppose file "samples_CG.hc_poses.r20m.tsv" has 20000000 lines (samples), and we use 200k samples for validation
# the .train.tsv and .valid.tsv can be converted to .bin format to accelerate training (scripts/generate_binary_feature_file.py)
head -n 19800000 samples_CG.hc_poses.r20m.tsv > samples_CG.hc_poses.r20m.train.tsv
tail -n 200000 samples_CG.hc_poses.r20m.tsv > samples_CG.hc_poses.r20m.valid.tsv

# 5. train
CUDA_VISIBLE_DEVICES=0 deepsignal_plant train --train_file samples_CG.hc_poses.r20m.train.tsv --valid_file samples_CG.hc_poses.r20m.valid.tsv --model_dir model.deepsignal_plant.CG --display_step 2000

We currently have no R10.3 data, so we haven't planned to train an R10.3 model yet.

Best,
Peng
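
The head/tail split in step 4 takes the validation set from the end of the combined file, which is only representative if that file is already in random order. As a hedged alternative, here is a minimal Python sketch that picks the validation rows at random while streaming the file. The function name `split_train_valid` and its parameters are illustrative and are not part of deepsignal-plant.

```python
import random


def split_train_valid(combined_path, train_path, valid_path, n_valid, seed=0):
    """Send n_valid randomly chosen rows to valid_path and the rest to train_path."""
    # First pass: count rows so the validation row indices can be drawn up front.
    with open(combined_path) as fh:
        n_total = sum(1 for _ in fh)
    if n_valid > n_total:
        raise ValueError("n_valid exceeds the number of samples")
    valid_rows = set(random.Random(seed).sample(range(n_total), n_valid))

    # Second pass: route each row to the training or validation file.
    with open(combined_path) as src, \
            open(train_path, "w") as train, \
            open(valid_path, "w") as valid:
        for i, line in enumerate(src):
            (valid if i in valid_rows else train).write(line)
    return n_total - n_valid, n_valid
```

With the file names from step 4, a call might look like `split_train_valid("samples_CG.hc_poses.r20m.tsv", "samples_CG.hc_poses.r20m.train.tsv", "samples_CG.hc_poses.r20m.valid.tsv", 200000)`. Only the selected row indices are held in memory, not the samples themselves.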

@simonbrd
Author

Thanks very much

What exactly are sites.tsv files?

Thank you in advance

@PengNi
Owner

PengNi commented Mar 31, 2022

It is a text file with chrom\tpos\tstrand format in each line (pos is 0-based). You can check deepsignal_plant extract -h for more details.

Best,
Peng
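
To illustrate the format described above, here is a minimal Python sketch that produces such lines. Only the chrom\tpos\tstrand layout with a 0-based pos comes from the answer above. The helper name `sites_to_tsv_lines`, the example sites, and the assumption that the input positions are 1-based are all hypothetical.

```python
def sites_to_tsv_lines(sites_1based):
    """Format (chrom, pos, strand) tuples as chrom<TAB>pos<TAB>strand with 0-based pos."""
    lines = []
    for chrom, pos, strand in sites_1based:
        if strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")
        # Shift the assumed 1-based input position to the 0-based one expected here.
        lines.append("{}\t{}\t{}".format(chrom, pos - 1, strand))
    return lines


if __name__ == "__main__":
    # Hypothetical high-confidence sites given with 1-based positions.
    positive_sites = [("chr1", 101, "+"), ("chr1", 102, "-"), ("chr2", 5000, "+")]
    with open("positive_sites.tsv", "w") as out:
        out.write("\n".join(sites_to_tsv_lines(positive_sites)) + "\n")
```

If the source of the sites already reports 0-based positions, the `pos - 1` shift should be dropped.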

@simonbrd
Author

simonbrd commented Apr 1, 2022

Hello, I tried your workflow.
For the last step (train), I don't understand why there is no output file. Where is the newly trained model?

deepsignal_plant train --train_file samples_CG.hc_poses.r20m.train.tsv --valid_file samples_CG.hc_poses.r20m.valid.tsv --model_dir model.deepsignal_plant_prymnesium

cd model.deepsignal_plant_prymnesium/
No file was created in this folder.

Thank you in advance

`
(/appli/conda-env/bioinfo/deepsignal_plant-0.1.4) sbrocard@r1i4n7:/home1/scratch/sbrocard/model> deepsignal_plant train --train_file samples_CG.hc_poses.r20m.train.tsv --valid_file samples_CG.hc_poses.r20m.valid.tsv --model_dir model.deepsignal_plant_prymnesium
[main] start..

===============================================

parameters:

train_file:
samples_CG.hc_poses.r20m.train.tsv
valid_file:
samples_CG.hc_poses.r20m.valid.tsv
model_dir:
model.deepsignal_plant_prymnesium
model_type:
both_bilstm
seq_len:
13
signal_len:
16
layernum1:
3
layernum2:
1
class_num:
2
dropout_rate:
0.5
n_vocab:
16
n_embed:
4
is_base:
yes
is_signallen:
yes
hid_rnn:
256
optim_type:
Adam
batch_size:
512
lr:
0.001
lr_decay:
0.1
lr_decay_step:
2
max_epoch_num:
10
min_epoch_num:
5
step_interval:
100
pos_weight:
1.0
init_model:
None
tmpdir:
/tmp

===============================================

[train] start..
GPU is not available!
reading data..

using linecache to access 'samples_CG.hc_poses.r20m.train.tsv'<<<
after done using the file, remember to use linecache.clearcache() to clear cache for safety<<<
using linecache to access 'samples_CG.hc_poses.r20m.valid.tsv'<<<
after done using the file, remember to use linecache.clearcache() to clear cache for safety<<<
/appli/conda-env/bioinfo/deepsignal_plant-0.1.4/lib/python3.7/site-packages/torch/nn/modules/rnn.py:51: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.5 and num_layers=1
"num_layers={}".format(dropout, num_layers))
total_step: 6
best accuracy: 0, early stop!
[train] training cost 92.44051790237427 seconds
[main] costs 92.48895478248596 seconds
`

@PengNi
Owner

PengNi commented Apr 2, 2022

@simonbrd, this may indicate that you don't have enough samples for the training process to reach the point where it checks the model parameters. Please either use more samples or set a smaller --batch_size and/or --step_interval.

Best,
Peng
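
A rough way to see why the first run stopped without saving a model: the log reports total_step: 6 with batch_size 512 and step_interval 100. The sketch below works through that arithmetic under two assumptions that are not confirmed in this thread: that total_step is the number of batches per epoch, and that the model is only evaluated every step_interval steps. The function names and the sample count of 3,000 are hypothetical.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of mini-batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


def reaches_evaluation(n_samples, batch_size, step_interval):
    """True if an epoch has at least step_interval steps (assumed evaluation rule)."""
    return steps_per_epoch(n_samples, batch_size) >= step_interval


# About 3,000 samples (hypothetical count) with the defaults from the first log.
print(steps_per_epoch(3000, 512))          # 6
print(reaches_evaluation(3000, 512, 100))  # False
# The same data with the smaller values used in the second run.
print(reaches_evaluation(3000, 30, 10))    # True
```

Under these assumptions, either more samples or smaller --batch_size and --step_interval values let the run reach an evaluation step, which matches the advice above.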

@simonbrd
Author

simonbrd commented Apr 4, 2022

Hello,
Thank you for your help, but I get a new error with a larger dataset. Can you help me?

`(/appli/conda-env/bioinfo/deepsignal_plant-0.1.4) sbrocard@r1i4n7:/home1/scratch/sbrocard/methyldackel> deepsignal_plant train --train_file samples_CG.hc_poses.r20m.train.tsv --valid_file samples_CG.hc_poses.r20m.valid.tsv --model_dir model.deepsignal_plant.CG --batch_size 30 --step_interval 10
[main] start..

===============================================

parameters:

train_file:
samples_CG.hc_poses.r20m.train.tsv
valid_file:
samples_CG.hc_poses.r20m.valid.tsv
model_dir:
model.deepsignal_plant.CG
model_type:
both_bilstm
seq_len:
13
signal_len:
16
layernum1:
3
layernum2:
1
class_num:
2
dropout_rate:
0.5
n_vocab:
16
n_embed:
4
is_base:
yes
is_signallen:
yes
hid_rnn:
256
optim_type:
Adam
batch_size:
30
lr:
0.001
lr_decay:
0.1
lr_decay_step:
2
max_epoch_num:
10
min_epoch_num:
5
step_interval:
10
pos_weight:
1.0
init_model:
None
tmpdir:
/tmp

===============================================

[train] start..
GPU is not available!
reading data..

using linecache to access 'samples_CG.hc_poses.r20m.train.tsv'<<<
after done using the file, remember to use linecache.clearcache() to clear cache for safety<<<
using linecache to access 'samples_CG.hc_poses.r20m.valid.tsv'<<<
after done using the file, remember to use linecache.clearcache() to clear cache for safety<<<
/appli/conda-env/bioinfo/deepsignal_plant-0.1.4/lib/python3.7/site-packages/torch/nn/modules/rnn.py:51: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.5 and num_layers=1
"num_layers={}".format(dropout, num_layers))
total_step: 78148
Traceback (most recent call last):
  File "/appli/conda-env/bioinfo/deepsignal_plant-0.1.4/bin/deepsignal_plant", line 10, in <module>
    sys.exit(main())
  File "/appli/conda-env/bioinfo/deepsignal_plant-0.1.4/lib/python3.7/site-packages/deepsignal_plant/deepsignal_plant.py", line 477, in main
    args.func(args)
  File "/appli/conda-env/bioinfo/deepsignal_plant-0.1.4/lib/python3.7/site-packages/deepsignal_plant/deepsignal_plant.py", line 71, in main_train
    train(args)
  File "/appli/conda-env/bioinfo/deepsignal_plant-0.1.4/lib/python3.7/site-packages/deepsignal_plant/train.py", line 158, in train
    vlabels_total += vlabels
TypeError: add(): argument 'other' (position 1) must be Tensor, not list
`

@PengNi
Owner

PengNi commented Apr 4, 2022

Hi @simonbrd, this is a bug. I have fixed it and updated the code. Please install the latest version of deepsignal-plant from GitHub, or replace /appli/conda-env/bioinfo/deepsignal_plant-0.1.4/lib/python3.7/site-packages/deepsignal_plant/train.py directly with the train.py from GitHub.

Best,
Peng

@simonbrd
Author

simonbrd commented Apr 5, 2022

Thank you very much for your help. Your tool is awesome!
Best,
Simon

@simonbrd simonbrd closed this as completed Apr 5, 2022
@simonbrd
Author

Hi @PengNi,
I trained a model following your steps, but I have some new questions.
Why do I get a result like this when I run deepsignal-plant again with my new model? Is it because my data is not varied enough?
Thank you in advance

With my model:
Prymnesium_parvum_GenomeV1.0_Contig_78 146123 + 146123 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 TGCCA
Prymnesium_parvum_GenomeV1.0_Contig_78 146124 + 146124 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 GCCAG
Prymnesium_parvum_GenomeV1.0_Contig_78 146128 + 146128 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 GGCGT
Prymnesium_parvum_GenomeV1.0_Contig_78 146132 + 146132 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 TTCCT
Prymnesium_parvum_GenomeV1.0_Contig_78 146133 + 146133 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 TCCTC
Prymnesium_parvum_GenomeV1.0_Contig_78 146135 + 146135 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 CTCAT
Prymnesium_parvum_GenomeV1.0_Contig_78 146139 + 146139 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 TGCTA
Prymnesium_parvum_GenomeV1.0_Contig_78 146144 + 146144 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 TACGG
Prymnesium_parvum_GenomeV1.0_Contig_78 146149 + 146149 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 ATCGT
Prymnesium_parvum_GenomeV1.0_Contig_78 146156 + 146156 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 TACTG
Prymnesium_parvum_GenomeV1.0_Contig_78 146160 + 146160 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 GTCAG
Prymnesium_parvum_GenomeV1.0_Contig_78 146163 + 146163 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 AGCTG
Prymnesium_parvum_GenomeV1.0_Contig_78 146174 + 146174 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 ATCGA
Prymnesium_parvum_GenomeV1.0_Contig_78 146177 + 146177 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 GACGA
Prymnesium_parvum_GenomeV1.0_Contig_78 146183 + 146183 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 GGCAC
Prymnesium_parvum_GenomeV1.0_Contig_78 146185 + 146185 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 CACAA
Prymnesium_parvum_GenomeV1.0_Contig_78 146194 + 146194 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 GACGA
Prymnesium_parvum_GenomeV1.0_Contig_78 146199 + 146199 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 GACCA
Prymnesium_parvum_GenomeV1.0_Contig_78 146200 + 146200 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 ACCAT
Prymnesium_parvum_GenomeV1.0_Contig_78 146211 + 146211 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 AGCCA
Prymnesium_parvum_GenomeV1.0_Contig_78 146212 + 146212 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 GCCAG
Prymnesium_parvum_GenomeV1.0_Contig_78 146219 + 146219 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 AGCTG
Prymnesium_parvum_GenomeV1.0_Contig_78 146232 + 146232 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 TGCCA
Prymnesium_parvum_GenomeV1.0_Contig_78 146233 + 146233 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 GCCAG
Prymnesium_parvum_GenomeV1.0_Contig_78 146242 + 146242 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 TGCAG
Prymnesium_parvum_GenomeV1.0_Contig_78 146245 + 146245 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 AGCAC
Prymnesium_parvum_GenomeV1.0_Contig_78 146247 + 146247 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 CACTT
Prymnesium_parvum_GenomeV1.0_Contig_78 146251 + 146251 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 TTCCG
Prymnesium_parvum_GenomeV1.0_Contig_78 146252 + 146252 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 TCCGC
Prymnesium_parvum_GenomeV1.0_Contig_78 146254 + 146254 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 CGCTC
Prymnesium_parvum_GenomeV1.0_Contig_78 146256 + 146256 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 CTCTG
Prymnesium_parvum_GenomeV1.0_Contig_78 146259 + 146259 68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 TGCTT
Prymnesium_parvum_GenomeV1.0_Contig_190 9375 + 9375 03e72fdf-bdfd-4144-8f8c-bf2fbf1cade4 t 0.0 1.0 1 GCCGA
Prymnesium_parvum_GenomeV1.0_Contig_190 9381 + 9381 03e72fdf-bdfd-4144-8f8c-bf2fbf1cade4 t 0.0 1.0 1 GACGG
Prymnesium_parvum_GenomeV1.0_Contig_190 9394 + 9394 03e72fdf-bdfd-4144-8f8c-bf2fbf1cade4 t 0.0 1.0 1 AGCGC
Prymnesium_parvum_GenomeV1.0_Contig_190 9396 + 9396 03e72fdf-bdfd-4144-8f8c-bf2fbf1cade4 t 0.0 1.0 1 CGCGG
Prymnesium_parvum_GenomeV1.0_Contig_190 9405 + 9405 03e72fdf-bdfd-4144-8f8c-bf2fbf1cade4 t 0.0 1.0 1 GACCA
Prymnesium_parvum_GenomeV1.0_Contig_190 9406 + 9406 03e72fdf-bdfd-4144-8f8c-bf2fbf1cade4 t 0.0 1.0 1 ACCAA
Prymnesium_parvum_GenomeV1.0_Contig_190 9409 + 9409 03e72fdf-bdfd-4144-8f8c-bf2fbf1cade4 t 0.0 1.0 1 AACGT
Prymnesium_parvum_GenomeV1.0_Contig_190 9421 + 9421 03e72fdf-bdfd-4144-8f8c-bf2fbf1cade4 t 0.0 1.0 1 AACAT
Prymnesium_parvum_GenomeV1.0_Contig_190 9436 + 9436 03e72fdf-bdfd-4144-8f8c-bf2fbf1cade4 t 0.0 1.0 1 TACGC

With your model:

Prymnesium_parvum_GenomeV1.0_Contig_119 278052 + 278052 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.52201 0.47799 0 CTCCC
Prymnesium_parvum_GenomeV1.0_Contig_119 278053 + 278053 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.485137 0.514863 1 TCCCC
Prymnesium_parvum_GenomeV1.0_Contig_119 278054 + 278054 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.569415 0.430585 0 CCCCA
Prymnesium_parvum_GenomeV1.0_Contig_119 278055 + 278055 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.505096 0.494904 0 CCCAC
Prymnesium_parvum_GenomeV1.0_Contig_119 278057 + 278057 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.597858 0.402142 0 CACAT
Prymnesium_parvum_GenomeV1.0_Contig_119 278061 + 278061 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.594696 0.405305 0 TGCAG
Prymnesium_parvum_GenomeV1.0_Contig_119 278064 + 278064 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.983388 0.016612 0 AGCCG
Prymnesium_parvum_GenomeV1.0_Contig_119 278065 + 278065 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.558601 0.441399 0 GCCGT
Prymnesium_parvum_GenomeV1.0_Contig_119 278069 + 278069 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.971634 0.028366 0 TACCC
Prymnesium_parvum_GenomeV1.0_Contig_119 278070 + 278070 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.973943 0.026057 0 ACCCC
Prymnesium_parvum_GenomeV1.0_Contig_119 278071 + 278071 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.927618 0.072382 0 CCCCT
Prymnesium_parvum_GenomeV1.0_Contig_119 278072 + 278072 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.672811 0.327189 0 CCCTC
Prymnesium_parvum_GenomeV1.0_Contig_119 278074 + 278074 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.112322 0.887678 1 CTCCC
Prymnesium_parvum_GenomeV1.0_Contig_119 278075 + 278075 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.140023 0.859977 1 TCCCC
Prymnesium_parvum_GenomeV1.0_Contig_119 278076 + 278076 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.306377 0.693623 1 CCCCC
Prymnesium_parvum_GenomeV1.0_Contig_119 278077 + 278077 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.607476 0.392524 0 CCCCC
Prymnesium_parvum_GenomeV1.0_Contig_119 278078 + 278078 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.781513 0.218487 0 CCCCC
Prymnesium_parvum_GenomeV1.0_Contig_119 278079 + 278079 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.362203 0.637797 1 CCCCC
Prymnesium_parvum_GenomeV1.0_Contig_119 278080 + 278080 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.612741 0.387259 0 CCCCT
Prymnesium_parvum_GenomeV1.0_Contig_119 278081 + 278081 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.732287 0.267713 0 CCCTC
Prymnesium_parvum_GenomeV1.0_Contig_119 278083 + 278083 42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.896638 0.103362 0 CTCTC

@simonbrd simonbrd reopened this Apr 12, 2022
@PengNi
Owner

PengNi commented Apr 14, 2022

@simonbrd, maybe this is related to your training data. What did the log of your training show? I suggest rechecking the high-confidence positive/negative sites you selected and the commands you used to generate the training samples.
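
One quick check on the symptom reported above, where every call has the probabilities 0.0 and 1.0, is to tally the called labels and count the saturated rows. This is a minimal sketch, assuming the whitespace-separated 10-column layout visible in the pasted output (the two probabilities in columns 7 and 8, the called label in column 9). The function name `summarize_calls` is hypothetical.

```python
from collections import Counter


def summarize_calls(lines):
    """Count called labels and rows whose second probability is exactly 0 or 1."""
    labels = Counter()
    saturated = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or truncated rows
        prob_1, label = float(fields[7]), fields[8]
        labels[label] += 1
        if prob_1 in (0.0, 1.0):
            saturated += 1
    return labels, saturated


# Two rows copied from the outputs pasted above.
rows = [
    "Prymnesium_parvum_GenomeV1.0_Contig_78 146123 + 146123 "
    "68cbdeb0-9a8c-4b28-b94a-af9e1e01f2f5 t 0.0 1.0 1 TGCCA",
    "Prymnesium_parvum_GenomeV1.0_Contig_119 278052 + 278052 "
    "42bf785c-671c-4cef-8a37-130f00f3e9ca t 0.52201 0.47799 0 CTCCC",
]
labels, saturated = summarize_calls(rows)
print(dict(labels), saturated)  # {'1': 1, '0': 1} 1
```

If nearly every row is saturated on a single label, it may be worth running a similar tally on the label column of the training and validation files to confirm that both classes are present.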

@DelphIONe

Thanks for your tool!
Do you think this model training protocol could work on RNA?

@PengNi
Owner

PengNi commented Aug 9, 2022

Thanks for your interest, @DelphIONe.
I am not sure whether the training protocol is suitable for RNA methylation. To my knowledge, tools like Epinano and DENA also use basecall errors/quality, in addition to raw signals, as features to classify RNA m6A sites. So the features used for DNA methylation and RNA methylation are quite different.

@DelphIONe

Thanks for your reply!
We have tried Epinano, DRUMMER, and others, but without much success so far. I'm wondering whether training my own model could be more effective.
Thanks again

@WeipengMO

Hi Peng,

@PengNi Thanks for your excellent tools. How can I train a model for DNA methylation in the CG, CHG, and CHH contexts together, rather than only CG as shown above?

Thanks in advance.
Weipeng

@PengNi
Owner

PengNi commented Nov 1, 2022

@WeipengMO, you can just try combining the training samples of the three motifs (after step 3, before step 4). The training samples of CHG/CHH should preferably be denoised before combining.

@WeipengMO

Combining training samples using cat like this?

cat samples_CG.hc_poses_positive.r10m.tsv samples_CHG.hc_poses_positive.r10m.tsv samples_CHH.hc_poses_positive.r10m.tsv > samples_5mC.hc_poses_positive.r10m.tsv
cat samples_CG.hc_poses_negative.r10m.tsv samples_CHG.hc_poses_negative.r10m.tsv samples_CHH.hc_poses_negative.r10m.tsv > samples_5mC.hc_poses_negative.r10m.tsv

Thank you for your reply! @PengNi
Weipeng

@PengNi
Owner

PengNi commented Nov 1, 2022

@WeipengMO, cat would work. You can also use /path/to/scripts/concat_two_files.py to cat training sample files.
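
As a sketch of the combining step discussed above, the following Python function concatenates several sample files and rejects a file list that names the same path twice, which is an easy slip when typing three similar motif names. It is an illustration only: `combine_sample_files` is a hypothetical helper, not one of the deepsignal-plant scripts, and it keeps the rows in their original order without shuffling.

```python
import shutil


def combine_sample_files(in_paths, out_path):
    """Concatenate sample TSVs in order, rejecting duplicated input paths."""
    if len(set(in_paths)) != len(in_paths):
        raise ValueError("the same input file is listed more than once")
    with open(out_path, "wb") as out:
        for path in in_paths:
            with open(path, "rb") as src:
                # Stream the file in chunks instead of loading it into memory.
                shutil.copyfileobj(src, out)


# Example call with the positive-sample file names used in this thread:
# combine_sample_files(
#     ["samples_CG.hc_poses_positive.r10m.tsv",
#      "samples_CHG.hc_poses_positive.r10m.tsv",
#      "samples_CHH.hc_poses_positive.r10m.tsv"],
#     "samples_5mC.hc_poses_positive.r10m.tsv",
# )
```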

@WeipengMO

Thank you, Peng! @PengNi
