===============================================================================
Fenner Macrae Megadaph - Nuclear Analysis Lab Notebook
===============================================================================
Logs of specific commands run can be found in analysis directories.
===============================================================================
August 21st, 2017
===============================================================================
-------------------------------------------------------------------------------
Summary of project so far:
-------------------------------------------------------------------------------
- Emily has trimmed and merged reads (~/megadaph/reads/cleaned_reads/)
- Emily has produced draft assemblies for all starting controls using SPAdes
(~/megadaph/assembly/spades_Aug2017/)
- I constructed a Kraken (https://ccb.jhu.edu/software/kraken/) database
containing all complete prokaryotic refseq assemblies
(~/megadaph/decontamination/kraken/db/). Kraken's database build scripts are
out of date, so I used a user-produced set of scripts
(https://github.com/mw55309/Kraken_db_install_scripts). The protocol was
followed from
http://www.opiniomics.org/building-a-kraken-database-with-new-ftp-structure-and-no-gi-numbers/
- I also downloaded the prebuilt full nt database for Centrifuge
(https://ccb.jhu.edu/software/centrifuge)
===============================================================================
November 19th, 2017
===============================================================================
-------------------------------------------------------------------------------
Activity Log for today:
-------------------------------------------------------------------------------
- Ran centrifuge on cleaned FA_SC reads
(~/megadaph/decontamination/centrifuge/reads) and the D. magna reference
assembly (~/megadaph/decontamination/centrifuge/reference_assembly). Used the
complete nt centrifuge database but excluded arthropods and bony fish. The
taxonomic IDs for these groups were found by examining
~/megadaph/decontamination/centrifuge/custom_db/taxonomy/names.dmp.
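The taxid lookup above was done by examining names.dmp by hand. A minimal
Python sketch of the same lookup (the function name is illustrative; the
'\t|\t' field delimiter and the 'scientific name' class are the standard NCBI
taxonomy dump format):

```python
def find_taxids(names_dmp, query_names):
    """Map names (e.g. 'Arthropoda') to NCBI taxonomy IDs using names.dmp."""
    wanted = {n.lower() for n in query_names}
    hits = {}
    with open(names_dmp) as fh:
        for line in fh:
            # names.dmp fields are delimited by '\t|\t'; lines end with '\t|'
            fields = line.rstrip("\t|\n").split("\t|\t")
            taxid, name, _unique_name, name_class = fields[:4]
            if name_class == "scientific name" and name.lower() in wanted:
                hits[name] = int(taxid)
    return hits
```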
===============================================================================
August 22nd, 2017 - August 23rd, 2017
===============================================================================
- Constructed a custom centrifuge database using the protocol from the
centrifuge manual. All Bacterial, Plant, Protozoan, Archaeal and Fungal
genomes classified as 'Complete' by NCBI were included, as well as the human
reference.
===============================================================================
August 23rd, 2017
===============================================================================
- Created an R package fen.R.util (https://github.com/fennerm/fen.R.util) to
house code shared between my other R packages. Decided to make it separate
from the bioinformatics_scripts repo since it will primarily be updated from
my personal machine.
-------------------------------------------------------------------------------
Preliminary Centrifuge Analysis Of D. magna reference assembly
-------------------------------------------------------------------------------
(~/megadaph/decontamination/centrifuge/reference_assembly)
- Wrote R functions for adding taxonomic name and match fraction to the
centrifuge output file.
Taxa with strong evidence for contamination:
- Thermus thermophilus (From Taq polymerase?)
- Homo sapiens
- Limnohabitans
- Plant of some kind (hits to tomato, rice, cucumber etc.)
- Parasitic worm (Trichobilharzia, Spirometra, Echinostoma, Protopolystoma,
Schistosoma)
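The R helpers aren't reproduced here, but the match-fraction calculation is
simple. A Python sketch, assuming centrifuge's standard per-read columns
(hitLength, queryLength); the taxid-to-name join is omitted:

```python
import csv

def add_match_fraction(centrifuge_tsv, out_tsv):
    """Append a matchFraction column (hitLength / queryLength) to a
    centrifuge per-read classification table."""
    with open(centrifuge_tsv) as fin, open(out_tsv, "w", newline="") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        fields = reader.fieldnames + ["matchFraction"]
        writer = csv.DictWriter(fout, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        for row in reader:
            # fraction of the read covered by the best hit
            row["matchFraction"] = "%.3f" % (
                int(row["hitLength"]) / int(row["queryLength"]))
            writer.writerow(row)
```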
===============================================================================
August 24th, 2017
===============================================================================
-------------------------------------------------------------------------------
Decontamination Strategy (to be continuously updated)
-------------------------------------------------------------------------------
- Identify confident contaminants with centrifuge and remove
- Reassemble and run centrifuge on assembly
- Use GC content and coverage to remove additional suspicious contigs
(See Koutsovoulos et al., 2016)
- Exclude contigs below 500bp
(See Koutsovoulos et al., 2016)
- Figure out plan for checking that assembly isn't overcleaned.
- Check for differences in relative coverage between different starting control
assemblies!
- Check for dramatic shifts in coverage across potential misassemblies
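The hard-filter idea above can be sketched as a single predicate. The
thresholds below are illustrative placeholders, not values chosen for this
project; only the 500 bp length floor comes from the notes:

```python
def gc_content(seq):
    """Fraction of G/C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / max(len(seq), 1)

def keep_contig(seq, coverage, min_len=500, gc_range=(0.2, 0.6), min_cov=5.0):
    """Koutsovoulos-style hard filter on length, GC window and coverage."""
    return (len(seq) >= min_len
            and gc_range[0] <= gc_content(seq) <= gc_range[1]
            and coverage >= min_cov)
```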
===============================================================================
August 25th, 2017
===============================================================================
Planned analysis:
- Goal: Produce confidence metric for centrifuge classifications.
- Strategy: Scramble all of the nucleotides in the fasta input to centrifuge.
Then run centrifuge. This will give us our expectations under the null.
- Wrote a script to scramble all of the sequences in a multifasta file
(fmacrae/code/R/scripts/scramble_fasta.R)
- Started test run of analysis on FA_SC in
megadaph/decontamination/centrifuge/null_distribution
- Found a number of false-positive hits, mostly to human, sheep, tomato, etc.,
i.e. essentially all large genomes. The largest hit score was only 289 though,
so filtering by score should be easy and effective.
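The scrambling step lives in an R script (scramble_fasta.R). An equivalent
Python sketch, just to show the idea: shuffle each record's bases so length
and base composition are preserved but any real signal is destroyed:

```python
import random

def scramble_fasta(in_path, out_path, seed=0):
    """Shuffle the nucleotides within each record of a multi-FASTA file."""
    rng = random.Random(seed)
    with open(in_path) as fin, open(out_path, "w") as fout:
        header, seq = None, []

        def flush():
            if header is not None:
                letters = list("".join(seq))
                # per-record shuffle keeps length and composition intact
                rng.shuffle(letters)
                fout.write(header + "\n" + "".join(letters) + "\n")

        for line in fin:
            line = line.rstrip("\n")
            if line.startswith(">"):
                flush()
                header, seq = line, []
            else:
                seq.append(line)
        flush()
```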
===============================================================================
August 28th, 2017
===============================================================================
- Large reorganization of project github. Local and IBEST directories should
now be kept fairly well in sync using github. Data will still reside only on
IBEST.
- Wrote the alignment function for the megadaph pipeline
(fmacrae/code/util/python_modules/run_bowtie2.py)
===============================================================================
August 29th, 2017
===============================================================================
- Added run_index_fasta module to the Megadaph decontamination pipeline.
- Added CheckedArgs class module to validate arguments passed to the pipeline.
- Added unit tests directory, and got testing working for CheckedArgs
===============================================================================
August 30th, 2017
===============================================================================
- Set up my personal python modules as a package (fmbiopy) for easier import
and extensibility.
- Using pytest for package testing (https://docs.pytest.org/en/latest/)
- Plan to set up fmbiopy and bioinformatics_scripts as subtrees of the Megadaph
repository, but the version of git on IBEST is super out of date. Messaged
Benji about getting a new version. Will use
https://medium.com/@porteneuve/mastering-git-subtrees-943d29a798ec as a guide
for subtree set up.
===============================================================================
August 31st, 2017
===============================================================================
- Fixed bugs in run_bowtie2.py, run_index_fasta.py and fen_util.run_command().
All tests now passing
- Came up with an object oriented pipeline structure. Basically will define
pipeline and submodule classes which will automatically handle argument
passing, logging and argument checking.
- Benji made an updated git module but it's only available on some of the
servers until they update the rest. Don't know how long this will take, so
I'll put off getting the subtree properly set up for the time being.
- Read about pytest features:
http://pythontesting.net/framework/pytest/pytest-fixtures-nuts-bolts/#bare
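The object-oriented pipeline idea (a base class owning argument checking and
logging, with subclasses supplying only the work) might look roughly like
this. All names here are illustrative, not the eventual fmbiopy API:

```python
import logging

class PipelineTask:
    """Base class: handles argument validation and logging for subclasses."""
    required_args = ()

    def __init__(self, **kwargs):
        missing = [a for a in self.required_args if a not in kwargs]
        if missing:
            raise ValueError("missing arguments: %s" % ", ".join(missing))
        self.args = kwargs
        self.log = logging.getLogger(type(self).__name__)

    def __call__(self):
        self.log.info("starting %s", type(self).__name__)
        result = self.run()
        self.log.info("finished %s", type(self).__name__)
        return result

    def run(self):
        raise NotImplementedError

class Bowtie2Align(PipelineTask):
    """Hypothetical task: a real version would shell out to bowtie2."""
    required_args = ("fastq1", "fastq2", "index")

    def run(self):
        return "aligned %s/%s to %s" % (
            self.args["fastq1"], self.args["fastq2"], self.args["index"])
```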
===============================================================================
September 3rd, 2017
===============================================================================
- Partially created BioFileGroup class for housing and type checking
bioinformatics files within the pipeline (~/fmacrae/code/fmbiopy/fmbiopy.py).
Some tests still failing. Tomorrow will start by refactoring the prevalidation
functions.
===============================================================================
September 4th, 2017
===============================================================================
- Added a bunch of classes to BioFileGroup.
===============================================================================
September 5th, 2017
===============================================================================
- Added further classes to BioFileGroup. Now just need to implement an
IndexedFasta class and I can move onto the pipeline code.
- IndexedFasta implemented. Documentation could still use a little work but
biofile.py is basically complete.
- Installed pydocstring plugin for easy documentation in reST formatting. Call
with :Pydocstring in vim while cursor is on function or class definition.
- Thought a lot about what docstring style to use from now on. Decided on numpy
style for the following reasons:
- Ease of reading
- Can use 'pyment -wo "numpydoc" <file.py>' to automate much of the process
- Can be converted into documentation with Sphinx
- Guidelines:
http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html
===============================================================================
September 6th, 2017
===============================================================================
- Think I might have reinvented the wheel a bit with biofile. Ruffus seems like
it includes the same functionality and more.
- New plan: Next step rewrite bowtie2_align.py and index_fasta.py as ruffus
functions
- Interrupted while working: currently editing the .vimrc to auto-open split
tabs with a python environment.
===============================================================================
September 7th, 2017
===============================================================================
- Officially switched over to Ruffus for pipelining.
- Set up Ruffus logging class in fmbiopy.fmruffus
===============================================================================
September 10th, 2017
===============================================================================
- Wrote and tested following tasks in fmbiopy.ruffus_tasks: bowtie_index_fasta,
samtools_index_fasta, gunzip, gzip, paired_bowtie2_align
===============================================================================
September 11th, 2017
===============================================================================
- Spent a while trying to get jedi-vim to play nice with vim 8 and pyenv.
Eventually got it to work by adding a script to .vim/after/ftplugin/python.vim,
and adding an __init__.py to fmbiopy.fmbiopy
- Started on the decontam_pipeline proper in
'fmacrae/megadaph/decontam_pipeline'. Decided to structure as a script rather
than a package.
- Started adding some additional functionality to biofile, so that it can be
used for checking argument validity in the pipeline.
===============================================================================
September 13th, 2017
===============================================================================
- Ideas from meeting
- Primers arrived
- Meeting with Sarah: Updated on progress. We shared a little concern about
Emily using hard GC content/coverage cutoffs for filtering contamination in the
assemblies. She showed me Emily's assembly stats though and they were super
impressive. Very large N50s and a pretty consistent total assembly length
~130-180 Mb.
- Lab meeting: Introductions - my presentation slot set for Nov. 29th
- Met with Emily about assembly progress. We talked about the hard-cutoff
criteria issue. I suggested running the published reference genome through her
filter and seeing how much is lost. A nontrivial amount was excluded, large
enough that I would be uncomfortable using these filters for the
mutation-calling pipeline. However, we agreed they are probably fine for TE
detection if she is comfortable with them. Given her timeframe, she probably
doesn't have the time to work at my snail's pace. I asked her about her
assembly parameters so that I could eventually include them in the Ruffus
pipeline.
===============================================================================
September 24th, 2017
===============================================================================
Been having trouble getting vim and pyenv to work with the new server
infrastructure which Benji is updating. Recompiled vim with the following
commands:
    pyenv local 3.5.2
    ./configure --prefix=$HOME --with-features=huge --enable-multibyte \
        --enable-cscope=yes --enable-pythoninterp=dynamic \
        --with-python3-config-dir=/opt/modules/devel/python/2.7.10/lib/python2.7/config \
        --enable-python3interp=dynamic \
        --with-python3-config-dir=/opt/modules/devel/python/3.5.2/lib/python3.5/config-3.5m/ \
        --disable-gui
    make VIMRUNTIMEDIR=$HOME/.vim
This sort of worked to get vim running with python3, but broke python2, and
syntax highlighting stopped working.
Decided to just switch to Neovim to avoid these issues. First installed
linuxbrew. Linuxbrew complained about our git version being out of date.
Attempted to update it with 'linuxbrew install git', but it throws the
following error:
    /mnt/lfs2/schaack/.linuxbrew/bin/ld: cannot find -ldb
    collect2: error: ld returned 1 exit status
===============================================================================
September 25th, 2017
===============================================================================
Installed a brewed version of perl with linuxbrew.
===============================================================================
September 27th, 2017
===============================================================================
Installed brewed versions of git and neovim.
===============================================================================
September 28th, 2017
===============================================================================
Neovim up and running. Installed vim-plug for managing plugins. Added Ale for
syntax checking and vim-test for testing code in nvim.
===============================================================================
September 29th, 2017
===============================================================================
Refactored my dotfile setup so that my custom dotfiles are copied over ssh
upon login. This means that any changes to my dotfiles won't affect other
users and can also be more easily transferred to new servers.
Finally got neovim to run entirely without errors with my setup.
===============================================================================
October 2nd, 2017
===============================================================================
Spent all day bug fixing the decontamination pipeline
===============================================================================
October 4th, 2017
===============================================================================
Finally finished major bugfixing on fmruffus module. Architecture for running
Ruffus tasks is now pretty solid. Should now be fairly simple to add new
Ruffus tasks. Next step is to get a basic pipeline running with just the
initial Bowtie2 alignment to get a feel for how Ruffus pipeline scripts work.
===============================================================================
October 5th, 2017
===============================================================================
Minor bugfixes in fmruffus
===============================================================================
October 7th, 2017
===============================================================================
Bugfixing and refactoring in fmbiopy test functions
===============================================================================
October 8th, 2017
===============================================================================
Started on the actual pipeline script. Got the bowtie2 alignment step working
on small test files.
===============================================================================
October 9th, 2017
===============================================================================
Got a little stuck in the mud refactoring biofile. Decided that it makes more
sense to have the base class be a singular Biofile rather than a BiofileGroup.
Fixing this requires moving the bulk of the validation steps into the new
Biofile class and refactoring all of the non-generic BiofileGroup classes to
singular Biofile classes. The generic BiofileGroup will stick around as a way
to store a set of Biofiles, and do validation checks which pertain to groups.
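The refactor described above might be sketched as follows. These are
hypothetical class skeletons mirroring the design, not the actual biofile
code (a real version would also check that the files exist):

```python
from pathlib import Path

class Biofile:
    """Single-file base class: validation now lives here."""
    accepted_extensions = ()  # subclasses restrict this

    def __init__(self, path):
        self.path = Path(path)
        self.validate()

    def validate(self):
        if self.accepted_extensions and \
                self.path.suffix not in self.accepted_extensions:
            raise ValueError("unexpected extension: %s" % self.path)

class Fasta(Biofile):
    accepted_extensions = (".fa", ".fasta")

class BiofileGroup:
    """Stores a set of Biofiles and runs group-level checks."""
    def __init__(self, files):
        self.files = list(files)
        if len({type(f) for f in self.files}) > 1:
            raise ValueError("mixed file types in group")
```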
===============================================================================
October 10th, 2017
===============================================================================
Worked on biofile refactoring
===============================================================================
October 11th, 2017
===============================================================================
Biofile refactoring mostly done. Generic Biofile class now works, as do Fasta
and Fastq. Commented out BiofileGroups and MatchedPrefixGroups for now. Will
fix them later if they end up being useful.
Created small Bam and Sam files with 10000 reads in testdat by subsampling
megadaph data with samtools.
===============================================================================
October 12th, 2017
===============================================================================
- Created small paired fastq files with 10000 reads by subsampling the
megadaph reads with seqtk, for use in testdat.
- Did the same thing (1000 contigs) with the September 2017 redundans Megadaph
assemblies.
- Cloned the new testdat repository into fmbiopy and the megadaph
decontam_pipeline.
- Rewrote fmbiopy functions which rely on testdat
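The subsampling itself was done with seqtk/samtools. A plain-Python sketch of
the paired case (a hypothetical helper; it loads everything into memory, so
it's only suitable for small test files like these):

```python
import random

def read_fastq(path):
    """Yield 4-line FASTQ records (newlines kept)."""
    with open(path) as fh:
        while True:
            rec = [fh.readline() for _ in range(4)]
            if not rec[0]:
                return
            yield rec

def subsample_pairs(fq1_in, fq2_in, fq1_out, fq2_out, n, seed=11):
    """Downsample paired FASTQs to n read pairs, keeping mates in sync
    (same idea as 'seqtk sample -s<seed>' run on both files)."""
    pairs = list(zip(read_fastq(fq1_in), read_fastq(fq2_in)))
    sample = random.Random(seed).sample(pairs, min(n, len(pairs)))
    with open(fq1_out, "w") as o1, open(fq2_out, "w") as o2:
        for r1, r2 in sample:
            o1.writelines(r1)
            o2.writelines(r2)
```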
===============================================================================
October 14th, 2017
===============================================================================
- Switched from os.path to pathlib.Path for path manipulation in fmbiopy.
- Added BiofileGroup and MatchedPrefixGroup classes back to fmbiopy.biofile
- Worked on input file validation in megadaph pipeline
===============================================================================
October 15th, 2017
===============================================================================
- Got input validation working. Pipeline script now capable of producing a
flowchart representing the pipeline.
- Added apply function to fmruffus which applies a single input RuffusTask to
multiple inputs.
===============================================================================
October 16th, 2017
===============================================================================
- Got test functions to work outside of the test directory. Can now be run from
the python package home folder.
===============================================================================
October 17th, 2017
===============================================================================
- Refactored, reorganized a bunch of fmbiopy code.
- Alignment and symlink parts of pipeline now working.
- Wrote RuffusTask for building a custom centrifuge index but didn't get time
to test it
===============================================================================
October 18th, 2017
===============================================================================
- Bug fixed fmruffus.RuffusLog
- Added fixtures to fmtest
===============================================================================
October 19th, 2017
===============================================================================
- Added RuffusTransform class and altered RuffusTask class to handle
ruffus.Pipeline.originate tasks.
===============================================================================
October 20th, 2017
===============================================================================
- Added centrifuge task to fmbiopy.fmruffus
===============================================================================
October 21st, 2017
===============================================================================
- Had to do a fresh linuxbrew install due to some persistent issues with
glibc6.
- Wrote script for converting centrifuge output to blobtools compatible hits
file ('fmscripts/python/centrifuge_to_hits.py')
===============================================================================
October 24th, 2017
===============================================================================
- Improved fmpaths test coverage
===============================================================================
October 25-30th, 2017
===============================================================================
- Major connection difficulties with IBEST
- Used the intermediate time to work on other project and improve my dotfiles.
===============================================================================
November 1st, 2017
===============================================================================
- Received assembly and read processing scripts from Emily.
===============================================================================
November 2nd, 2017
===============================================================================
- Started adding Emily's analyses to the pipeline. Everything should be fairly
easy except her script for removing heterozygosity. It explicitly repeats
the same protocol for each file rather than looping so it'll need to be
completely rewritten.
===============================================================================
November 4th-5th, 2017
===============================================================================
- Wrote Ruffus functions for all of Emily's tasks with the exception of the
custom script mentioned above.
===============================================================================
November 6th, 2017
===============================================================================
- Moved fmbiopy.fmruffus and fmbiopy.biofile into their own github repos.
===============================================================================
November 7th, 2017
===============================================================================
- Split duffus into multiple module files.
- Met with Emily about her strategy for reducing heterozygosity. Basically she
just removed all heterozygosity because it wasn't relevant to transposable
element discovery. I'll have to come up with a different solution.
- Found an interesting-looking tool called HaploMerger2 which rebuilds both
haploid assemblies from a heterozygous diploid assembly.
===============================================================================
November 8th and 10th, 2017
===============================================================================
- Worked on custom duffus Pipeline class.
===============================================================================
November 11th, 2017
===============================================================================
- The more time I spend looking at the Ruffus source code, the more I think
it's pretty poorly constructed. I've been spending an awful lot of time
getting its input -> output formatter to work and I think it's probably a
waste of time. Going to cut my losses and switch to an alternative pipelining
infrastructure. Snakemake seems like a good choice.
===============================================================================
November 14th, 2017
===============================================================================
- Made some good progress on the new snakemake pipeline. Got Emily's data set
up in the pipeline folder so we don't have to rebuild all of it.
===============================================================================
November 15th, 2017
===============================================================================
- Started to test blobtools on a single file in
'megadaph.private/data/exploratory/decontam/blobtools'. Hoping to nail down
the blobtools part of the protocol here.
===============================================================================
November 16th, 2017
===============================================================================
- Decent portion of the exploratory blobtools pipeline finished. Currently
running the initial alignment and centrifuge steps on trillian.
===============================================================================
November 17th, 2017
===============================================================================
- Job got stuck on Trillian for some reason. Restarted on Ford.
===============================================================================
November 18th, 2017
===============================================================================
- Exploratory pipeline has produced most of the inputs to blobtools and is now
working on creating the blobtools database.
- Added the trimming steps to the main snakemake pipeline but I'm running into
persistent syntax errors at the start of the script.
===============================================================================
November 19th, 2017
===============================================================================
- Blobtools is dependent on python2 but calls its submodules using 'python' so
it ends up defaulting to python3 and failing. I fixed this by replacing all
instances of python with python2 in the main script but a more intelligent
solution might be required for the main pipeline. Submitted an issue on
github.
- Can't get the initial symlink step to work in the main snakemake pipeline.
Snakemake seems to have some trouble with symlinks in general so I'm just
going to remove the step for now.
- Got the snakemake pipeline working all the way through to production of
multiqc files. Will start running it on the full dataset.
- Got some initial blobplots from the blobplot pipeline.
===============================================================================
November 20th, 2017
===============================================================================
- Wrote a script for downloading a new custom centrifuge database and added it
to the main snakemake pipeline.
- Started running the pipeline on the actual datafiles. When(if) this completes
all fastq cleaning, QC and database downloading will be done.
===============================================================================
November 21st, 2017
===============================================================================
- Centrifuge kept encountering time out errors while connecting to NCBI. Online
discussions suggested that this only occurs when centrifuge-download uses
rsync. Edited the binary in ~/bin to use curl instead. Main pipeline is now
running on Tesla.
- Meanwhile I've been testing SLURM scheduling of the pipeline to see if this
leads to faster execution.
===============================================================================
November 22nd, 2017
===============================================================================
- Getting more errors with centrifuge-download. I'm going to split the
centrifuge database building part of the pipeline into multiple tasks so that
I don't have to start fresh each time it fails.
===============================================================================
November 23rd, 2017
===============================================================================
- Decided to rerun the exploratory blobtools pipeline with the entire nt
database. It probably makes more sense to filter out unrealistic hits
afterwards than to spend too much time curating a custom database. The only
thing I'm worried about is that the centrifuge docs are very ambiguous as to
the contents of their nt database. The manual says the database contains all
'spliced coding' sequence, but their paper etc. say it's equivalent to the
database used in BLAST.
===============================================================================
November 23rd, 2017
===============================================================================
- The main pipeline encountered errors in the FastQC step. This resulted from
my removal of run_command in my fmbiopy library. Moved FastQC from a script
to a simple shell command, tested, and resubmitted with SLURM.
- Very unsatisfied with the output I'm getting from Centrifuge. Going to try
switching to BLAST for taxonomic assignment.
===============================================================================
November 27th, 2017
===============================================================================
- Got BLAST integrated into blobtools pipeline and running on slarti.
===============================================================================
November 28th, 2017
===============================================================================
- Fine-tuned many aspects of the exploratory pipeline: increased the number of
BLAST hits kept per contig from 1 to 10, and added a simple filter script to
remove obvious contaminants. All done very messily to be ready in time for
presentation at lab meeting. Will need to go over it again later.
- Tried to get Haplomerger2 set up. It's written very poorly, so it'll take a
lot of work to get it up and running.
-------------------------------------------------------------------------------
Pipeline for decontamination
-------------------------------------------------------------------------------
- Assemble
- Diploidify with haplomerger2
- Align to assembly
- blobtools pipeline
- Group by taxonomy and Concoct.
- Remove concoct groups which contain contaminants
- Remove obvious contaminants
- Possibly do a hard filter by coverage/GC if required
- Reassemble
===============================================================================
November 29th, 2017
===============================================================================
- Presented in lab meeting.
===============================================================================
November 30th, 2017
===============================================================================
- Started the initial SPAdes assemblies in the snakemake pipeline on SLURM.
===============================================================================
December 1st, 2017
===============================================================================
- Got haplomerger2 working on the provided test dataset. Had to install a bunch
of ucsc-* packages from bioconda, and add an old version of libpng to my
linker path.
================================================================================
December 22nd, 2017
================================================================================
- The assemblies died with a memory error again after being queued for weeks.
Upped the memory request to 100G and resubmitted.
================================================================================
January 12th, 2018
================================================================================
- The SPAdes assemblies finally completed. They took a long time due to
congestion on the server, but hopefully subsequent assemblies will be
snappier.
================================================================================
January 15th, 2018
================================================================================
- Added bowtie2 alignment and BLAST search steps to the pipeline.
- Went back and forth a lot on whether I should run Haplomerger2 before
decontamination. Decided against it because: 1. It would reduce our
decontamination sensitivity by half. 2. It would add a huge amount of
processing time.
- Instead we'll just have to be careful of removing heterozygous contigs during
decontamination then attempt to build haploid assemblies afterwards. Some loss
of heterozygous positions is probably inevitable though.
================================================================================
January 16th, 2018
================================================================================
- Typo in bowtie2 alignment, requeued pipeline.
================================================================================
January 21st, 2018
================================================================================
- The initial bowtie2 alignments completed successfully.
================================================================================
January 22nd, 2018
================================================================================
- Added a Diamond BLAST step to the pipeline. I didn't get a chance to test it
properly, so it will probably die with an error, but I submitted it to SLURM
just in case.
================================================================================
January 23rd, 2018
================================================================================
- Worked out the kinks in the diamond database construction script. Scheduled
the diamond BLAST runs on SLURM.
================================================================================
January 31st, 2018
================================================================================
- Diamond BLAST searches successfully completed after a false start. Now running
blobtools database initialization.
================================================================================
February 2nd, 2018
================================================================================
Initial round of blobplot and covplots are complete.
--------------------------------------------------------------------------------
Notes on initial plots
--------------------------------------------------------------------------------
- Major contaminant groups:
Proteobacteria, Bacteria-undef, Cyanobacteria, Actinobacteria.
- Should be safe just removing all bacterial reads.
- Initial inspection suggests that the mitosporidial contigs may actually be
misassigned Daphnia sequence.
Plan for first round of contamination removal:
- Remove all contigs with bacterial and plant taxonomic assignments.
- Remove all contigs with GC content > 0.6 or coverage < 20
- Potentially remove extra contigs based on coverage plots.
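The plan above amounts to a per-contig predicate. A minimal sketch, assuming a simple per-contig table of GC, coverage, and phylum-level taxonomy (as in a blobtools view table) — the taxa list comes from the plot notes, while 'Streptophyta' as the plant label, the column layout, and the thresholds are illustrative:

```python
# Sketch of the planned hard filter; taxa list from the blobplot notes,
# thresholds and table layout are placeholders to check against real output.
BAD_TAXA = {"Proteobacteria", "Bacteria-undef", "Cyanobacteria",
            "Actinobacteria", "Streptophyta"}

def keep_contig(gc, cov, phylum, max_gc=0.6, min_cov=20):
    """True if the contig passes the contamination filter."""
    return phylum not in BAD_TAXA and gc <= max_gc and cov >= min_cov

contigs = [
    ("contig_1", 0.42, 55.0, "Arthropoda"),      # passes
    ("contig_2", 0.65, 80.0, "Arthropoda"),      # fails: GC too high
    ("contig_3", 0.40, 5.0, "Arthropoda"),       # fails: coverage too low
    ("contig_4", 0.45, 60.0, "Proteobacteria"),  # fails: bacterial assignment
]
kept = [name for name, gc, cov, phy in contigs if keep_contig(gc, cov, phy)]
print(kept)  # ['contig_1']
```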
================================================================================
February 13th, 2018
================================================================================
- Initial round of decontamination completed. Adjusted removal criteria to
remove and reran:
1. Contigs with bacterial/plant/archaeal/viral taxonomic assignments
2. Contigs with GC > 0.6
3. Contigs with coverage < 10
================================================================================
February 20th, 2018
================================================================================
- Decontamination iteration 1 completed successfully. The majority of
contamination seems to be gone. Observed the following contamination
candidates in some assemblies:
1. Contigs with strong BLAST hits to primates
2. Contigs with various fungal taxonomic assignments.
3. New bacterial contigs
4. Contigs with ~0 coverage in some samples.
- Wrote a script for the second iteration of decontamination to target these
candidates. Also included a step to remove the mitochondrial contigs.
================================================================================
February 22nd, 2018
================================================================================
- Had to remove the mitochondrial filter step from the pipeline since it was
missing in some assemblies. Will have to do mitochondrial filtering manually
after the redundans step in the pipeline.
================================================================================
February 26th, 2018
================================================================================
- Added redundans, duplicate removal and basic variant calling to the pipeline.
Pipeline will probably have to run for a week or so.
================================================================================
March 5th, 2018
================================================================================
- Redundans process got stuck for a couple of inputs over the weekend. Restarted
with larger memory allocations.
================================================================================
March 6th, 2018
================================================================================
- Added script for filtering mitochondrial scaffolds from the redundans filtered
assembly.
================================================================================
March 9th, 2018
================================================================================
- Pipeline had a bunch of errors which I patched and reran.
================================================================================
March 12th, 2018
================================================================================
- Redundans still running for a couple of samples (has been running for > 3
days).
================================================================================
March 22nd, 2018
================================================================================
- Pipeline completed over break. Have initial variant sets for all lines except
FBSC for which assembly still isn't working for some reason.
- BUSCO reports suggest that redundans is being too aggressive in the reduction
step; assemblies contain ~99 % of conserved genes before redundans and ~90 %
afterwards.
- Wrote a snakemake script which (1) runs just the reduction step of redundans
with varying min-identity parameter values [0.5-0.98] and (2) creates a BUSCO
report. From the results of that script we should be able to choose the
identity parameter such that reduction is maximized without loss of true
genes from the assembly.
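The sweep boils down to one reduction-only redundans run plus one BUSCO run per identity value. A sketch of the command generation — the exact flag spellings (`--identity`, `--noscaffolding`, `--nogapclosing`, BUSCO's CLI) and file names are assumptions to check against the installed versions:

```python
# Sketch of the redundans min-identity sweep described above; flag names,
# output paths, and the BUSCO lineage are placeholders.
def sweep_commands(assembly="assembly.fa", lineage="arthropoda_odb9"):
    identities = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.98]
    cmds = []
    for ident in identities:
        outdir = f"redundans_id{ident}"
        # reduction step only: skip scaffolding and gap closing
        cmds.append(
            f"redundans.py -f {assembly} --identity {ident} "
            f"--noscaffolding --nogapclosing -o {outdir}"
        )
        # BUSCO report for the reduced assembly
        cmds.append(
            f"run_BUSCO.py -i {outdir}/scaffolds.reduced.fa "
            f"-o busco_id{ident} -l {lineage} -m genome"
        )
    return cmds

for cmd in sweep_commands():
    print(cmd)
```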
================================================================================
March 23rd, 2018
================================================================================
- Diagnosed the issue with the FBSC assembly as a memory error (again!). Upped
the memory allocation to 150G and reran.
- Results from redundans-test snakemake script:
- All results contained ~98 % of BUSCO genes. This indicates that the
reduction step of redundans may not be the culprit. Two options remain:
1) The BUSCO genes are in the assembly but BUSCO is missing them after
scaffolding.
2) Scaffolding/gap-closing is removing genes from the assembly.
- Option 1 seems far more likely. To test, I reran BUSCO on all the redundans
test files with a relaxed e-value cutoff (1).
================================================================================
March 26th, 2018
================================================================================
- Results for BUSCO with relaxed e-value cutoff:
- This seemed to solve the problem at first: with a looser e-value cutoff,
BUSCO found 98.6 % of expected genes across all redundans identity
thresholds. I decided to leave the redundans step alone in the main
pipeline and rerun BUSCO with an e-value cutoff of 0.01
- However, the BUSCO results were again ~ 89 % when run on the redundans
data from the full pipeline.
- It seems like some step other than reduction is causing the BUSCO genes to
be lost.
- For the next test I ran BUSCO on each intermediate redundans result for
the FA genotype to pinpoint which step causes the BUSCO genes to be lost.
- Results:
1. Scaffolding with long reads removed the majority of the BUSCOs
2. Scaffolding with the reference removed another ~5 BUSCOs.
3. The BUSCO genes removed by the reference scaffolding may have just
drifted below BUSCO's significance threshold. I'll look at the BLAST
results manually tomorrow.
================================================================================
March 27th, 2018
================================================================================
- Manually examined the BLAST results for the latest BUSCO reports.
- The BUSCO genes seem to have been completely removed from the assemblies
after scaffolding with merged reads.
- Several of the genes which went missing during scaffolding with the
reference seem to be present in the BLAST report but filtered out by some
later BUSCO step.
- This suggests that scaffolding with the merged reads may be the main
culprit. I'll try to rerun redundans without this step.
- Added GVCF genotyping to the pipeline.
- Found a really useful resource on using GATK in non-model organisms:
evodify.com/gatk-the-best-practice-for-genotype-calling-in-a-non-model-organism/
Main takeaways:
- We need to be careful about heterozygosity levels, GATK is
targeted at human data, so heterozygosity parameters will likely need to
be adjusted for our data. Tucker et al., 2013 estimates heterozygosity
levels at ~0.01 in sexual D. pulex, so this might be a good approximation
for now. We may need to produce our own estimates later.
- Base quality score recalibration will probably not be possible for our
data. The author argues that bootstrapping gives unsatisfactory results.
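The heterozygosity adjustment amounts to overriding GATK's default prior (0.001, tuned for humans) with the D. pulex estimate of ~0.01. A hedged sketch of the call in GATK4-style syntax (the GATK3 equivalent is `java -jar GenomeAnalysisTK.jar -T HaplotypeCaller ...`); all paths are placeholders:

```python
# Sketch of a HaplotypeCaller invocation with the heterozygosity prior set to
# the D. pulex estimate discussed above; paths and sample names are
# placeholders.
def haplotype_caller_cmd(ref, bam, sample, hets=0.01):
    return [
        "gatk", "HaplotypeCaller",
        "-R", ref,
        "-I", bam,
        "-ERC", "GVCF",                 # per-sample GVCFs for joint genotyping
        "--heterozygosity", str(hets),  # default is 0.001, tuned for humans
        "-O", f"{sample}.g.vcf.gz",
    ]

cmd = haplotype_caller_cmd("assembly.fa", "FA.bam", "FA")
print(" ".join(cmd))
```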
================================================================================
April 3rd, 2018
================================================================================
- The SLURM IBEST servers crashed last night so we lost 5 redundans jobs which
had been running for 6 days. Had to restart them fresh, which will put us back
another week unfortunately.
================================================================================
April 11th, 2018
================================================================================
- Getting a new redundans error for the last remaining redundans jobs: lastal
Input/Output error. I opened an issue on the redundans github. My hunch is
that it's memory-related, so I'll try again with a 200 GB allocation.
================================================================================
April 11th-25th, 2018
================================================================================
- Redundans maintainer recommended I upgrade my redundans version and try the
use-bwa parameter.
- Initial results indicate that this solved the issue. Redundans is still
running for a few samples but the majority succeeded very quickly.
- Followed the aforementioned GATK guide and called an initial set of indels
and SNPs. Also implemented a filter to remove all variants present in
multiple samples.
- This protocol proved to be way too permissive, as we ended up with ~200,000
variants after filtering.
- Misalignments due to repetitive sequences are the most likely cause of this.
I added a repeat masking step to the pipeline using Emily's repeat library.
Apparently bowtie2 has a long-standing bug when aligning to masked sequences,
so I also decided to switch to Novoalign for the final alignment step. This
will probably also increase alignment accuracy in non-repetitive sequences.
================================================================================
April 28-30th, 2018
================================================================================
- Did a large refactor of the pipeline, which had become very messy/buggy.
Found a couple of small bugs which had been causing continuous headaches.
===============================================================================
May 1st, 2018
===============================================================================
- The linuxbrew install on the server imploded for some unknown reason. The GCC
version just broke out of nowhere which caused constant random IO errors which
broke various pipeline steps. Had to do a clean reinstall of linuxbrew and all
installed packages. It's a lot more organized now, but I still can't seem to
get Perl modules to be detected properly when installing some programs. Also
installed conda so that it doesn't conflict with linuxbrew (mostly). Between
the two we should be able to install most programs.
- Added step to pipeline for counting number of callable loci using GATK.
===============================================================================
May 2nd-3rd, 2018
===============================================================================
- Realized that it's much more sensible to simply not call variants in
repetitive regions rather than masking repetitive regions in the assembly
directly. Rewrote and reran the variant calling steps.
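Skipping repeats at calling time can be done by passing the repeat annotation as exclusion intervals rather than touching the assembly. A sketch with placeholder file names (GATK accepts `-XL` for excluded intervals; GATK3 and GATK4 syntax differ elsewhere):

```python
# Sketch: exclude repeat intervals during variant calling instead of masking
# the assembly itself; file names are placeholders.
def calling_cmd(ref, bam, repeats, out):
    return [
        "gatk", "HaplotypeCaller",
        "-R", ref,
        "-I", bam,
        "-XL", repeats,  # do not call variants inside repeat intervals
        "-O", out,
    ]

print(" ".join(calling_cmd("assembly.fa", "FA.bam", "repeats.bed", "FA.vcf.gz")))
```

This keeps the alignments themselves untouched, so coverage and mapping quality in repeat regions remain inspectable even though no calls are made there.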
===============================================================================
May 3rd-8th, 2018
===============================================================================
- Added some extra variant plotting functionality and spent a lot of time
manually sifting through the initial variant sets to determine any potential
errors or biases.
- Clearly we need to add more stringent filtering for removing variants which
are shared across samples.
===============================================================================
May 10th, 2018
===============================================================================
- Added a script for filtering out variants which show evidence for being
variants across multiple samples.
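The shared-variant filter reduces to a per-site predicate over samples. A minimal sketch, assuming alternate-allele support can be tabulated per sample (sample names and the read threshold are illustrative):

```python
# Sketch of the uniqueness filter: keep a variant only if its alternate
# allele is supported in exactly one sample.
def is_unique(alt_support, min_reads=2):
    """alt_support: mapping of sample name -> reads supporting the alt allele."""
    supported = [s for s, n in alt_support.items() if n >= min_reads]
    return len(supported) == 1

sites = {
    "site_1": {"FA": 15, "FB": 0, "FC": 1},  # unique to FA -> keep
    "site_2": {"FA": 15, "FB": 6, "FC": 0},  # shared FA/FB -> drop
}
kept = [pos for pos, support in sites.items() if is_unique(support)]
print(kept)  # ['site_1']
```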
===============================================================================
May 13th, 2018
===============================================================================
- A lot of variants are slipping through the uniqueness filter due to being low
depth in one or more samples. I implemented a script which filters out any
site that is not within an ideal read depth range across all samples.
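A minimal sketch of that depth filter, assuming per-sample depths are available for each site; the depth window is illustrative:

```python
# Sketch of the depth filter: a site is only callable if its read depth falls
# inside an acceptable window in every sample.
def in_depth_range(depths, min_dp=10, max_dp=100):
    return all(min_dp <= d <= max_dp for d in depths)

sites = {
    "site_1": [35, 40, 38],
    "site_2": [35, 3, 38],    # too shallow in one sample -> drop
    "site_3": [35, 40, 250],  # too deep in one sample -> drop
}
callable_sites = [s for s, d in sites.items() if in_depth_range(d)]
print(callable_sites)  # ['site_1']
```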
===============================================================================
May 15th, 2018
===============================================================================
- GATK4 is still missing a lot of functionality particularly for doing pileup
analysis. I decided to revert all GATK analyses to GATK3.
- Switched from using GATK to a custom script for counting number of callable
loci since I'm using a custom read depth filter.
===============================================================================
May 16th-17th, 2018
===============================================================================
- A bunch of memory errors occurred during the depth + unique filtering steps.
Clearly the input files are too large to read in all at once. I rewrote the
scripts to read in chunks instead.
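The chunked rewrite boils down to streaming fixed-size batches of lines instead of loading the whole file; a sketch:

```python
# Sketch of reading a large text file in fixed-size line batches rather than
# loading it all at once.
from itertools import islice

def iter_chunks(path, chunk_lines=100_000):
    with open(path) as handle:
        while True:
            chunk = list(islice(handle, chunk_lines))
            if not chunk:
                return
            yield chunk
```

Each yielded chunk can be filtered and written out before the next is read, keeping memory use roughly constant regardless of input size.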
- The latest results look very good for the majority of samples with most having
~10-100 variants. However for several samples we see extremely inflated
mutation counts. The mutations in these samples seem to stem mostly from a few
scaffolds. I did a quick spotcheck of a couple in IGV and the mutations looked
fairly sensible. I'll look further into this tomorrow.
===============================================================================
May 18th, 2018
===============================================================================
Started my analysis by doing some datavis in R using the suspicious samples
(SUS).
- The affected scaffolds seem to be broadly distributed in the SUS samples. In
IAEC1 and IAEC2, which were both highly inflated, we see a strong but
imperfect correlation between the scaffolds which were affected.
- In many cases it seems that GATK is calling homozygous mutations when there
is clear evidence for heterozygosity in the alignments. The allele counts
output by GATK seem to be quite inaccurate.
Examples in IA genotype:
- scaffold00007:11527,11613, 10379
- Cases of normal looking mutations:
- scaffold00007:10460
- In other cases we see regions with a series of apparent loss of
heterozygosity mutations in one sample. This could potentially be due to
overcleaning during decontamination. To test this I'll realign all of the raw
read sets to the final assemblies to see if the heterozygous sites are
restored.
- There appears to be no correlation across samples between number of mutations
and generation number, strongly suggesting that most detected mutations are
false positives even in the less extreme samples.
===============================================================================
May 21st, 2018
===============================================================================
Examined the alignment of the raw IAEC1 reads to the final assembly to check for
evidence of overcleaning.
- Raw IAEC1 reads showed almost identical alignments in all positions I
examined.
- Overcleaning does not seem to explain the suspicious variant patterns.
Examined alignments of IAEC1 to assemblies from other genotypes:
- IAEC1 showed a large number of mismatches/zero coverage regions in these
alignments as compared to the alignments to IASC
The alignments of IAEC1 to other I isolates were less clear-cut. I'll need to
see similar alignments from a normal sample to determine the level of
difference I should expect. I scheduled alignments of IB2 to all other isolates
and will examine the results tomorrow.
Also worked on a new script for filtering shared alleles. I will need to parse
bam-readcount pileups instead, which will be a bit trickier than before.
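bam-readcount emits one tab-separated line per site: chrom, pos, ref, depth, then one colon-delimited field per allele whose second element is the read count. A parsing sketch under that assumption (the example line is shortened and uses made-up numbers; verify the field layout against the bam-readcount docs before relying on it):

```python
# Sketch of parsing a bam-readcount line into per-allele read counts.
# Assumed per-base field layout: base:count:avg_mapq:avg_baseq:...
# Only the first two elements of each field are used here.
def parse_readcount_line(line):
    chrom, pos, ref, depth, *base_fields = line.rstrip("\n").split("\t")
    counts = {}
    for field in base_fields:
        base, count = field.split(":")[:2]
        counts[base] = int(count)
    return chrom, int(pos), ref, int(depth), counts

example = ("scaffold_1\t11527\tA\t40\t"
           "=:0:0:0\tA:22:58.9:31.2\tC:0:0:0\tG:18:57.1:30.8\tT:0:0:0")
chrom, pos, ref, depth, counts = parse_readcount_line(example)
print(counts)  # {'=': 0, 'A': 22, 'C': 0, 'G': 18, 'T': 0}
```

With per-allele counts in hand, the shared-allele filter can be driven by the raw read support at each site instead of GATK's reported allele counts.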
===============================================================================
May 22nd, 2018
===============================================================================
Examined the mitochondrial alignments of IAEC1/IAEC2 vs IB and IC samples.
- Found marker polymorphisms in IB and IC which are not present in the IA
samples. This strongly suggests that the IAEC samples were not mislabelled.
Realized the probable cause of the extra variants! My unique variant filtering
step does not distinguish between gene conversion and de novo mutations. I've
spent too long studying mitochondria for my own good.
Reimplemented the uniqueness filter.
- Results look a lot better, with no samples having hugely elevated numbers of
mutations. The numbers are in general still too high though. I examined a few
mutations in IGV and found clear cases where alternative alleles were present
in multiple samples. It looks like I'll have to reimplement the filter using
the raw readcounts after all, since GATK's numbers are not completely
informative.
===============================================================================
June 1st, 2018
===============================================================================
- Uniqueness filter and depth filter have been reimplemented to work on
readcount files.
===============================================================================
June 5th, 2018
===============================================================================
- Uniqueness + depth filter required a couple more days of bug fixing to get
working completely. The final variant sets look to be almost there. The
strand bias filter does not seem to be working very effectively, so I'll need
to increase its stringency; I'll try filtering SOR > 3.0.
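Tightening the strand-bias cut can be expressed as a VariantFiltration step on SOR (StrandOddsRatio). A sketch with placeholder paths and GATK4-style flags (GATK3 spells them `--filterExpression`/`--filterName`):

```python
# Sketch of a hard strand-bias filter on SOR; variants exceeding the
# threshold are tagged with a FILTER label rather than removed outright.
def sor_filter_cmd(ref, vcf, out, threshold=3.0):
    return [
        "gatk", "VariantFiltration",
        "-R", ref,
        "-V", vcf,
        "--filter-expression", f"SOR > {threshold}",
        "--filter-name", "high_SOR",
        "-O", out,
    ]

cmd = sor_filter_cmd("assembly.fa", "variants.vcf.gz", "filtered.vcf.gz")
print(" ".join(cmd))
```

Tagging instead of deleting keeps the flagged records available for inspection in IGV before they are dropped for good.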
- While examining variants in IGV I also noticed that there are cases where IGV