Microalgae model #19
Comments
Hi @simonbrd, you can try something like the following steps to train a new model:
# demo cmds for generating training samples
# 1. deepsignal_plant extract (extract features from fast5s)
deepsignal_plant extract --fast5_dir fast5s/ [--corrected_group --basecall_subgroup --reference_path] --methy_label 1 --motifs CG --mod_loc 0 --write_path samples_CG.hc_poses_positive.tsv [--nproc] --positions /path/to/file/containing/high_confidence/positive/sites.tsv
deepsignal_plant extract --fast5_dir fast5s/ [--corrected_group --basecall_subgroup --reference_path] --methy_label 0 --motifs CG --mod_loc 0 --write_path samples_CG.hc_poses_negative.tsv [--nproc] --positions /path/to/file/containing/high_confidence/negative/sites.tsv
# 2. randomly select an equal number (e.g., 10m) of positive and negative samples
# the selected positive and negative samples then can be combined and used for training, see step 3.
python /path/to/scripts/randsel_file_rows.py --ori_filepath samples_CG.hc_poses_positive.tsv --write_filepath samples_CG.hc_poses_positive.r10m.tsv --num_lines 10000000 --header false &
python /path/to/scripts/randsel_file_rows.py --ori_filepath samples_CG.hc_poses_negative.tsv --write_filepath samples_CG.hc_poses_negative.r10m.tsv --num_lines 10000000 --header false &
# wait for both background selections to finish before combining
wait
# 3. combine positive and negative samples for training
# after combining, the combined file can be split into two files as training/validation sets, see step 4.
python /path/to/scripts/concat_two_files.py --fp1 samples_CG.hc_poses_positive.r10m.tsv --fp2 samples_CG.hc_poses_negative.r10m.tsv --concated_fp samples_CG.hc_poses.r20m.tsv
# 4. split samples for training/validating
# suppose file "samples_CG.hc_poses.r20m.tsv" has 20000000 lines (samples), and we use 200k samples for validation
# the .train.tsv and .valid.tsv can be converted to .bin format to accelerate training (scripts/generate_binary_feature_file.py)
head -19800000 samples_CG.hc_poses.r20m.tsv > samples_CG.hc_poses.r20m.train.tsv
tail -200000 samples_CG.hc_poses.r20m.tsv > samples_CG.hc_poses.r20m.valid.tsv
# 5. train
CUDA_VISIBLE_DEVICES=0 deepsignal_plant train --train_file samples_CG.hc_poses.r20m.train.tsv --valid_file samples_CG.hc_poses.r20m.valid.tsv --model_dir model.deepsignal_plant.CG --display_step 2000
We currently have no R10.3 data, so we haven't planned to train an R10.3 model yet.
Best,
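Steps 2 to 4 above (balanced selection, combining, and a head/tail split) can also be sketched in memory with a few lines of Python. This is only an illustration, not the repository's scripts: the function name `balanced_train_valid_split` and the toy rows are hypothetical, and the sketch shuffles explicitly because a head/tail split only gives a mixed validation set if positive and negative rows are interleaved.

```python
import random


def balanced_train_valid_split(pos_rows, neg_rows, n_per_class, n_valid, seed=0):
    """Sample n_per_class rows per label, shuffle them, and split off n_valid rows."""
    rng = random.Random(seed)
    # Step 2: draw the same number of positive and negative samples.
    selected = rng.sample(pos_rows, n_per_class) + rng.sample(neg_rows, n_per_class)
    if not 0 < n_valid < len(selected):
        raise ValueError("n_valid must be at least 1 and smaller than the selected row count")
    # Step 3: combine and shuffle so both labels are mixed throughout.
    rng.shuffle(selected)
    # Step 4: the last n_valid rows become the validation set (the head/tail idea).
    return selected[:-n_valid], selected[-n_valid:]


if __name__ == "__main__":
    # Toy rows standing in for extracted feature lines (hypothetical format).
    pos = [f"pos_{i}\t1" for i in range(1000)]
    neg = [f"neg_{i}\t0" for i in range(1000)]
    train, valid = balanced_train_valid_split(pos, neg, n_per_class=500, n_valid=100)
    print(len(train), len(valid))
```

For files with tens of millions of rows, the file-based scripts above are the more practical route, since this sketch keeps every row in memory.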
Thanks very much. What exactly are the sites.tsv files? Thank you in advance.
It is a text file with Best, |
Hello, I tried your workflow.
Thank you in advance ` ===============================================parameters:train_file: ===============================================[train] start..
|
@simonbrd , this may indicate you don't have enough samples to make the training process check the model parameters. Please either use more samples, or set a smaller Best, |
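As a rough way to see why a small training set might never reach a parameter check, the arithmetic can be sketched as below. It assumes that the trainer evaluates once every `step_interval` batches, which is an assumption about the training loop and is not confirmed in this thread; `--batch_size` and `--step_interval` are the flags used in the follow-up command, and the numeric values are purely illustrative.

```python
import math


def count_validation_checks(num_samples, batch_size, step_interval):
    """Number of periodic checks per epoch, ASSUMING one check every step_interval batches."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch // step_interval


if __name__ == "__main__":
    # Illustrative numbers only: 1000 samples in batches of 512 give 2 steps per epoch.
    print(count_validation_checks(1000, batch_size=512, step_interval=100))
    # Smaller batches and a shorter interval give 34 steps per epoch.
    print(count_validation_checks(1000, batch_size=30, step_interval=10))
```

Under that assumption, either more samples or smaller values for these two settings increase the number of checks per epoch.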
Hello, `(/appli/conda-env/bioinfo/deepsignal_plant-0.1.4) sbrocard@r1i4n7:/home1/scratch/sbrocard/methyldackel> deepsignal_plant train --train_file samples_CG.hc_poses.r20m.train.tsv --valid_file samples_CG.hc_poses.r20m.valid.tsv --model_dir model.deepsignal_plant.CG --batch_size 30 --step_interval 10 ===============================================parameters:train_file: ===============================================[train] start..
|
Hi @simonbrd , it is a bug, I have fixed it and updated the code. Please install the latest version of deepsignal-plant from github, or replace Best, |
Thank you very much for your availability. Your tool is awesome ! |
Hi @PengNi, with my model : with your model :
|
@simonbrd, maybe this is related to your training data. What did your training log show? I suggest rechecking the high-confidence positive/negative sites you selected and the commands you used to generate the training samples.
Thanks for your tool ! |
Thanks for your interest @DelphIONe. |
Thanks for your reply! |
Hi Peng, @PengNi. Thanks for your excellent tools. I want to know how to train the model for DNA methylation in the CG, CHG, and CHH contexts together, rather than only the CG context shown above. Thanks in advance.
@WeipengMO, you can just try combining the training samples of the three motifs (after step 3, before step 4). The training samples of CHG/CHH should preferably be denoised before combining.
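One way to picture the combining step for several motifs is the small sketch below. It is not one of the repository's scripts: `combine_and_shuffle` is a hypothetical helper, the per-motif file names are placeholders, and the explicit shuffle is an assumption made so that a later head/tail split does not end up dominated by a single motif.

```python
import random


def combine_and_shuffle(sample_paths, out_path, seed=0):
    """Concatenate the lines of several sample files, shuffle them, and write them out."""
    lines = []
    for path in sample_paths:
        with open(path) as fh:
            for line in fh:
                if line.strip():
                    # Normalise line endings so a missing final newline cannot merge two rows.
                    lines.append(line.rstrip("\n") + "\n")
    random.Random(seed).shuffle(lines)
    with open(out_path, "w") as out:
        out.writelines(lines)
    return len(lines)


if __name__ == "__main__":
    import os
    import tempfile

    # Tiny placeholder files standing in for per-motif training samples.
    tmp_dir = tempfile.mkdtemp()
    paths = []
    for motif in ("CG", "CHG", "CHH"):
        path = os.path.join(tmp_dir, f"samples_{motif}.tsv")
        with open(path, "w") as fh:
            fh.writelines(f"{motif}_{i}\t1\n" for i in range(3))
        paths.append(path)
    total = combine_and_shuffle(paths, os.path.join(tmp_dir, "samples_all.tsv"))
    print(total)
```

The sketch reads every line into memory, so for very large sample files a streaming or disk-based shuffle would be a better fit.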
Combining training samples using
Thank you for your reply! @PengNi |
@WeipengMO , cat would work. You can also use |
Thank you, Peng! @PengNi |
Hello again,
I have been using your tools for some time, and now I would like to know more about how you designed your model model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt.
For information, I am working on microalgae data, and I would also like to train my own model on microalgae data.
Besides, have you also planned to design a model like the one that is available (model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt), but for R10.3 data?
Thank you in advance