===============================================================================
Fenner Macrae Megadaph - Nuclear Analysis Lab Notebook
===============================================================================
Logs of specific commands run can be found in analysis directories.
===============================================================================
August 21st, 2017
===============================================================================
-------------------------------------------------------------------------------
Summary of project so far:
-------------------------------------------------------------------------------
- Emily has trimmed and merged reads (~/megadaph/reads/cleaned_reads/)
- Emily has produced draft assemblies for all starting controls using SPAdes
(~/megadaph/assembly/spades_Aug2017/)
- I constructed a Kraken (https://ccb.jhu.edu/software/kraken/) database
containing all complete prokaryotic refseq assemblies
(~/megadaph/decontamination/kraken/db/). Kraken's database build scripts are
out of date, so I used a user-produced set of scripts
(https://github.com/mw55309/Kraken_db_install_scripts). The protocol was
followed from
http://www.opiniomics.org/building-a-kraken-database-with-new-ftp-structure-and-no-gi-numbers/
- I also downloaded the prebuilt full nt database for Centrifuge
(https://ccb.jhu.edu/software/centrifuge)
===============================================================================
November 19th, 2017
===============================================================================
-------------------------------------------------------------------------------
Activity Log for today:
-------------------------------------------------------------------------------
- Ran centrifuge on cleaned FA_SC reads
(~/megadaph/decontamination/centrifuge/reads) and the D. magna reference
assembly (~/megadaph/decontamination/centrifuge/reference_assembly). Used the
complete nt centrifuge database but excluded arthropods and bony fish. The
taxonomic IDs for these groups were found by examining
~/megadaph/decontamination/centrifuge/custom_db/taxonomy/names.dmp.
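The taxid lookup above was done by examining names.dmp by hand. A minimal
Python sketch of the same lookup (the function name is illustrative; the
'\t|\t' field delimiter and the 'scientific name' class are the standard NCBI
taxonomy dump format):

```python
def find_taxids(names_dmp, query_names):
    """Map names (e.g. 'Arthropoda') to NCBI taxonomy IDs using names.dmp."""
    wanted = {n.lower() for n in query_names}
    hits = {}
    with open(names_dmp) as fh:
        for line in fh:
            # names.dmp fields are delimited by '\t|\t'; lines end with '\t|'
            fields = line.rstrip("\t|\n").split("\t|\t")
            taxid, name, _unique_name, name_class = fields[:4]
            if name_class == "scientific name" and name.lower() in wanted:
                hits[name] = int(taxid)
    return hits
```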
===============================================================================
August 22nd, 2017 - August 23rd, 2017
===============================================================================
- Constructed a custom centrifuge database using the protocol from the
centrifuge manual. All Bacterial, Plant, Protozoan, Archaeal and Fungal
genomes classified as 'Complete' by NCBI were included, as well as the human
reference.
===============================================================================
August 23rd, 2017
===============================================================================
- Created an R package fen.R.util (https://github.com/fennerm/fen.R.util) to
house code shared between my other R packages. Decided to make it separate
from the bioinformatics_scripts repo since it will primarily be updated from
my personal machine.
-------------------------------------------------------------------------------
Preliminary Centrifuge Analysis Of D. magna reference assembly
-------------------------------------------------------------------------------
(~/megadaph/decontamination/centrifuge/reference_assembly)
- Wrote R functions for adding taxonomic name and match fraction to the
centrifuge output file.
Taxa with strong evidence for contamination:
- Thermus thermophilus (From Taq polymerase?)
- Homo sapiens
- Limnohabitans
- Plant of some kind (hits to tomato, rice, cucumber etc.)
- Parasitic worm (Trichobilharzia, Spirometra, Echinostoma, Protopolystoma,
Schistosoma)
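The R helpers aren't reproduced here, but the match-fraction calculation is
simple. A Python sketch, assuming centrifuge's standard per-read columns
(hitLength, queryLength); the taxid-to-name join is omitted:

```python
import csv

def add_match_fraction(centrifuge_tsv, out_tsv):
    """Append a matchFraction column (hitLength / queryLength) to a
    centrifuge per-read classification table."""
    with open(centrifuge_tsv) as fin, open(out_tsv, "w", newline="") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        fields = reader.fieldnames + ["matchFraction"]
        writer = csv.DictWriter(fout, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        for row in reader:
            # fraction of the read covered by the best hit
            row["matchFraction"] = "%.3f" % (
                int(row["hitLength"]) / int(row["queryLength"]))
            writer.writerow(row)
```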
===============================================================================
August 24th, 2017
===============================================================================
-------------------------------------------------------------------------------
Decontamination Strategy (to be continuously updated)
-------------------------------------------------------------------------------
- Identify confident contaminants with centrifuge and remove
- Reassemble and run centrifuge on assembly
- Use GC content and coverage to remove additional suspicious contigs
(See Koutsovoulos et al., 2016)
- Exclude contigs below 500bp
(See Koutsovoulos et al., 2016)
- Figure out plan for checking that assembly isn't overcleaned.
- Check for differences in relative coverage between different starting control
assemblies!
- Check for dramatic shifts in coverage across potential misassemblies
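The hard-filter idea above can be sketched as a single predicate. The
thresholds below are illustrative placeholders, not values chosen for this
project; only the 500 bp length floor comes from the notes:

```python
def gc_content(seq):
    """Fraction of G/C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / max(len(seq), 1)

def keep_contig(seq, coverage, min_len=500, gc_range=(0.2, 0.6), min_cov=5.0):
    """Koutsovoulos-style hard filter on length, GC window and coverage."""
    return (len(seq) >= min_len
            and gc_range[0] <= gc_content(seq) <= gc_range[1]
            and coverage >= min_cov)
```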
===============================================================================
August 25th, 2017
===============================================================================
Planned analysis:
- Goal: Produce confidence metric for centrifuge classifications.
- Strategy: Scramble all of the nucleotides in the fasta input to centrifuge.
Then run centrifuge. This will give us our expectations under the null.
- Wrote a script to scramble all of the sequences in a multifasta file
(fmacrae/code/R/scripts/scramble_fasta.R)
- Started test run of analysis on FA_SC in
megadaph/decontamination/centrifuge/null_distribution
- Found a number of false-positive hits, mostly to human, sheep, tomato, etc.,
i.e. essentially all large genomes. The largest hit score was only 289 though,
so filtering by score should be easy and effective.
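The scrambling step lives in an R script (scramble_fasta.R). An equivalent
Python sketch, just to show the idea: shuffle each record's bases so length
and base composition are preserved but any real signal is destroyed:

```python
import random

def scramble_fasta(in_path, out_path, seed=0):
    """Shuffle the nucleotides within each record of a multi-FASTA file."""
    rng = random.Random(seed)
    with open(in_path) as fin, open(out_path, "w") as fout:
        header, seq = None, []

        def flush():
            if header is not None:
                letters = list("".join(seq))
                # per-record shuffle keeps length and composition intact
                rng.shuffle(letters)
                fout.write(header + "\n" + "".join(letters) + "\n")

        for line in fin:
            line = line.rstrip("\n")
            if line.startswith(">"):
                flush()
                header, seq = line, []
            else:
                seq.append(line)
        flush()
```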
===============================================================================
August 28th, 2017
===============================================================================
- Large reorganization of project github. Local and IBEST directories should
now be kept fairly well in sync using github. Data will still reside only on
IBEST.
- Wrote the alignment function for the megadaph pipeline
(fmacrae/code/util/python_modules/run_bowtie2.py)
===============================================================================
August 29th, 2017
===============================================================================
- Added run_index_fasta module to the Megadaph decontamination pipeline.
- Added CheckedArgs class module to validate arguments passed to the pipeline.
- Added unit tests directory, and got testing working for CheckedArgs
===============================================================================
August 30th, 2017
===============================================================================
- Set up my personal python modules as a package (fmbiopy) for easier import
and extensibility.
- Using pytest for package testing (https://docs.pytest.org/en/latest/)
- Plan to set up fmbiopy and bioinformatics_scripts as subtrees of the Megadaph
repository, but the version of git on IBEST is super out of date. Messaged
Benji about getting a new version. Will use
https://medium.com/@porteneuve/mastering-git-subtrees-943d29a798ec as a guide
for subtree set up.
===============================================================================
August 31st, 2017
===============================================================================
- Fixed bugs in run_bowtie2.py, run_index_fasta.py and fen_util.run_command().
All tests now passing
- Came up with an object oriented pipeline structure. Basically will define
pipeline and submodule classes which will automatically handle argument
passing, logging and argument checking.
- Benji made an updated git module but it's only available on some of the
servers until they update the rest. Don't know how long this will take, so
I'll put off getting the subtree properly set up for the time being.
- Read about pytest features:
http://pythontesting.net/framework/pytest/pytest-fixtures-nuts-bolts/#bare
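The object-oriented pipeline idea (a base class owning argument checking and
logging, with subclasses supplying only the work) might look roughly like
this. All names here are illustrative, not the eventual fmbiopy API:

```python
import logging

class PipelineTask:
    """Base class: handles argument validation and logging for subclasses."""
    required_args = ()

    def __init__(self, **kwargs):
        missing = [a for a in self.required_args if a not in kwargs]
        if missing:
            raise ValueError("missing arguments: %s" % ", ".join(missing))
        self.args = kwargs
        self.log = logging.getLogger(type(self).__name__)

    def __call__(self):
        self.log.info("starting %s", type(self).__name__)
        result = self.run()
        self.log.info("finished %s", type(self).__name__)
        return result

    def run(self):
        raise NotImplementedError

class Bowtie2Align(PipelineTask):
    """Hypothetical task: a real version would shell out to bowtie2."""
    required_args = ("fastq1", "fastq2", "index")

    def run(self):
        return "aligned %s/%s to %s" % (
            self.args["fastq1"], self.args["fastq2"], self.args["index"])
```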
===============================================================================
September 3rd, 2017
===============================================================================
- Partially created BioFileGroup class for housing and type checking
bioinformatics files within the pipeline (~/fmacrae/code/fmbiopy/fmbiopy.py).
Some tests still failing. Tomorrow will start by refactoring the prevalidation
functions.
===============================================================================
September 4th, 2017
===============================================================================
- Added a bunch of classes to BioFileGroup.
===============================================================================
September 5th, 2017
===============================================================================
- Added further classes to BioFileGroup. Now just need to implement an
IndexedFasta class and I can move onto the pipeline code.
- IndexedFasta implemented. Documentation could still use a little work but
biofile.py is basically complete.
- Installed pydocstring plugin for easy documentation in reST formatting. Call
with :Pydocstring in vim while cursor is on function or class definition.
- Thought a lot about what docstring style to use from now on. Decided on numpy
style for the following reasons:
- Ease of reading
- Can use 'pyment -wo "numpydoc" <file.py>' to automate much of the process
- Can be converted into documentation with Sphinx
- Guidelines:
http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html
===============================================================================
September 6th, 2017
===============================================================================
- Think I might have reinvented the wheel a bit with biofile. Ruffus seems like
it includes the same functionality and more.
- New plan: Next step rewrite bowtie2_align.py and index_fasta.py as ruffus
functions
- Interrupted while working: currently editing the .vimrc to auto-open split
tabs with a python environment.
===============================================================================
September 7th, 2017
===============================================================================
- Officially switched over to Ruffus for pipelining.
- Set up Ruffus logging class in fmbiopy.fmruffus
===============================================================================
September 10th, 2017
===============================================================================
- Wrote and tested following tasks in fmbiopy.ruffus_tasks: bowtie_index_fasta,
samtools_index_fasta, gunzip, gzip, paired_bowtie2_align
===============================================================================
September 11th, 2017
===============================================================================
- Spent a while trying to get jedi-vim to play nice with vim 8 and pyenv.
Eventually got it to work by adding a script to .vim/after/ftplugin/python.vim,
and adding an __init__.py to fmbiopy.fmbiopy
- Started on the decontam_pipeline proper in
'fmacrae/megadaph/decontam_pipeline'. Decided to structure as a script rather
than a package.
- Started adding some additional functionality to biofile, so that it can be
used for checking argument validity in the pipeline.
===============================================================================
September 13th, 2017
===============================================================================
- Ideas from meeting
- Primers arrived
- Meeting with Sarah: Updated on progress. We shared a little concern about
Emily using hard GC content/coverage cutoffs for filtering contamination in the
assemblies. She showed me Emily's assembly stats though and they were super
impressive. Very large N50s and a pretty consistent total assembly length
~130-180 Mb.
- Lab meeting: Introductions - my presentation slot set for Nov. 29th
- Met with Emily about assembly progress. We talked about the hard-cutoff
criteria issue. I suggested running the published reference genome through her
filter and seeing how much is lost. A nontrivial amount was excluded, large
enough that I would be uncomfortable using these filters for the
mutation-calling pipeline. However, we agreed they are probably fine for TE
detection if she is comfortable with them. Given her timeframe, she probably
doesn't have the time to work at my snail's pace. I asked her about her
assembly parameters so that I could eventually include them in the Ruffus
pipeline.
===============================================================================
September 24th, 2017
===============================================================================
Been having trouble getting vim and pyenv to work with the new server
infrastructure which Benji is updating. Recompiled vim with the following
commands:
    pyenv local 3.5.2
    ./configure --prefix=$HOME --with-features=huge --enable-multibyte \
        --enable-cscope=yes --enable-pythoninterp=dynamic \
        --with-python3-config-dir=/opt/modules/devel/python/2.7.10/lib/python2.7/config \
        --enable-python3interp=dynamic \
        --with-python3-config-dir=/opt/modules/devel/python/3.5.2/lib/python3.5/config-3.5m/ \
        --disable-gui
    make VIMRUNTIMEDIR=$HOME/.vim
This sort of worked to get vim running with python3, but broke python2, and
syntax highlighting stopped working.
Decided to just switch to Neovim to avoid these issues. First installed
linuxbrew. Linuxbrew complained about our git version being out of date.
Attempted to update it with 'linuxbrew install git', but it throws the
following error:
    /mnt/lfs2/schaack/.linuxbrew/bin/ld: cannot find -ldb
    collect2: error: ld returned 1 exit status
===============================================================================
September 25th, 2017
===============================================================================
Installed a brewed version of perl with linuxbrew.
===============================================================================
September 27th, 2017
===============================================================================
Installed brewed versions of git and neovim.
===============================================================================
September 28th, 2017
===============================================================================
Neovim up and running. Installed vim-plug for managing plugins. Added Ale for
syntax checking and vim-test for testing code in nvim.
===============================================================================
September 29th, 2017
===============================================================================
Refactored my dotfile setup so that my custom dotfiles are copied over ssh
upon login. This means that any changes to my dotfiles won't affect other
users and can also be more easily transferred to new servers.
Finally got neovim to run entirely without errors with my setup.
===============================================================================
October 2nd, 2017
===============================================================================
Spent all day bug fixing the decontamination pipeline
===============================================================================
October 4th, 2017
===============================================================================
Finally finished major bugfixing on fmruffus module. Architecture for running
Ruffus tasks is now pretty solid. Should now be fairly simple to add new
Ruffus tasks. Next step is to get a basic pipeline running with just the
initial Bowtie2 alignment to get a feel for how Ruffus pipeline scripts work.
===============================================================================
October 5th, 2017
===============================================================================
Minor bugfixes in fmruffus
===============================================================================
October 7th, 2017
===============================================================================
Bugfixing and refactoring in fmbiopy test functions
===============================================================================
October 8th, 2017
===============================================================================
Started on the actual pipeline script. Got the bowtie2 alignment step working
on small test files.
===============================================================================
October 9th, 2017
===============================================================================
Got a little stuck in the mud refactoring biofile. Decided that it makes more
sense to have the base class be a singular Biofile rather than a BiofileGroup.
Fixing this requires moving the bulk of the validation steps into the new
Biofile class and refactoring all of the non-generic BiofileGroup classes to
singular Biofile classes. The generic BiofileGroup will stick around as a way
to store a set of Biofiles, and do validation checks which pertain to groups.
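The refactor described above might be sketched as follows. These are
hypothetical class skeletons mirroring the design, not the actual biofile
code (a real version would also check that the files exist):

```python
from pathlib import Path

class Biofile:
    """Single-file base class: validation now lives here."""
    accepted_extensions = ()  # subclasses restrict this

    def __init__(self, path):
        self.path = Path(path)
        self.validate()

    def validate(self):
        if self.accepted_extensions and \
                self.path.suffix not in self.accepted_extensions:
            raise ValueError("unexpected extension: %s" % self.path)

class Fasta(Biofile):
    accepted_extensions = (".fa", ".fasta")

class BiofileGroup:
    """Stores a set of Biofiles and runs group-level checks."""
    def __init__(self, files):
        self.files = list(files)
        if len({type(f) for f in self.files}) > 1:
            raise ValueError("mixed file types in group")
```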
===============================================================================
October 10th, 2017
===============================================================================
Worked on biofile refactoring
===============================================================================
October 11th, 2017
===============================================================================
Biofile refactoring mostly done. Generic Biofile class now works, as do Fasta
and Fastq. Commented out BiofileGroups and MatchedPrefixGroups for now. Will
fix them later if they end up being useful.
Created small Bam and Sam files with 10000 reads in testdat by subsampling
megadaph data with samtools.
===============================================================================
October 12th, 2017
===============================================================================
- Created small paired fastq files with 10000 reads by subsampling the
megadaph reads with seqtk, for use in testdat.
- Did the same thing (1000 contigs) with the September 2017 redundans Megadaph
assemblies.
- Cloned the new testdat repository into fmbiopy and the megadaph
decontam_pipeline.
- Rewrote fmbiopy functions which rely on testdat
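The subsampling itself was done with seqtk/samtools. A plain-Python sketch of
the paired case (a hypothetical helper; it loads everything into memory, so
it's only suitable for small test files like these):

```python
import random

def read_fastq(path):
    """Yield 4-line FASTQ records (newlines kept)."""
    with open(path) as fh:
        while True:
            rec = [fh.readline() for _ in range(4)]
            if not rec[0]:
                return
            yield rec

def subsample_pairs(fq1_in, fq2_in, fq1_out, fq2_out, n, seed=11):
    """Downsample paired FASTQs to n read pairs, keeping mates in sync
    (same idea as 'seqtk sample -s<seed>' run on both files)."""
    pairs = list(zip(read_fastq(fq1_in), read_fastq(fq2_in)))
    sample = random.Random(seed).sample(pairs, min(n, len(pairs)))
    with open(fq1_out, "w") as o1, open(fq2_out, "w") as o2:
        for r1, r2 in sample:
            o1.writelines(r1)
            o2.writelines(r2)
```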
===============================================================================
October 14th, 2017
===============================================================================
- Switched from os.path to pathlib.Path for path manipulation in fmbiopy.
- Added BiofileGroup and MatchedPrefixGroup classes back to fmbiopy.biofile
- Worked on input file validation in megadaph pipeline
===============================================================================
October 15th, 2017
===============================================================================
- Got input validation working. Pipeline script now capable of producing a
flowchart representing the pipeline.
- Added apply function to fmruffus which applies a single input RuffusTask to
multiple inputs.
===============================================================================
October 16th, 2017
===============================================================================
- Got test functions to work outside of the test directory. Can now be run from
the python package home folder.
===============================================================================
October 17th, 2017
===============================================================================
- Refactored, reorganized a bunch of fmbiopy code.
- Alignment and symlink parts of pipeline now working.
- Wrote RuffusTask for building a custom centrifuge index but didn't get time
to test it
===============================================================================
October 18th, 2017
===============================================================================
- Bug fixed fmruffus.RuffusLog
- Added fixtures to fmtest
===============================================================================
October 19th, 2017
===============================================================================
- Added RuffusTransform class and altered RuffusTask class to handle
ruffus.Pipeline.originate tasks.
===============================================================================
October 20th, 2017
===============================================================================
- Added centrifuge task to fmbiopy.fmruffus
===============================================================================
October 21st, 2017
===============================================================================
- Had to do a fresh linuxbrew install due to some persistent issues with
glibc6.
- Wrote script for converting centrifuge output to blobtools compatible hits
file ('fmscripts/python/centrifuge_to_hits.py')
===============================================================================
October 24th, 2017
===============================================================================
- Improved fmpaths test coverage
===============================================================================
October 25-30th, 2017
===============================================================================
- Major connection difficulties with IBEST
- Used the intermediate time to work on other project and improve my dotfiles.
===============================================================================
November 1st, 2017
===============================================================================
- Received assembly and read processing scripts from Emily.
===============================================================================
November 2nd, 2017
===============================================================================
- Started adding Emily's analyses to the pipeline. Everything should be fairly
easy except her script for removing heterozygosity. It explicitly repeats
the same protocol for each file rather than looping so it'll need to be
completely rewritten.
===============================================================================
November 4th-5th, 2017
===============================================================================
- Wrote Ruffus functions for all of Emily's tasks with the exception of the
custom script mentioned above.
===============================================================================
November 6th, 2017
===============================================================================
- Moved fmbiopy.fmruffus and fmbiopy.biofile into their own github repos.
===============================================================================
November 7th, 2017
===============================================================================
- Split duffus into multiple module files.
- Met with Emily about her strategy for reducing heterozygosity. Basically she
just removed all heterozygosity because it wasn't relevant to transposable
element discovery. I'll have to come up with a different solution.
- Found an interesting-looking tool called HaploMerger2 which rebuilds both
haploid assemblies from a heterozygous diploid assembly.
===============================================================================
November 8th and 10th, 2017
===============================================================================
- Worked on custom duffus Pipeline class.
===============================================================================
November 11th, 2017
===============================================================================
- The more time I spend looking at the Ruffus source code, the more I think
it's pretty poorly constructed. I've been spending an awful lot of time
getting its input -> output formatter to work and I think it's probably a
waste of time. Going to cut my losses and switch to an alternative pipelining
infrastructure. Snakemake seems like a good choice.
===============================================================================
November 14th, 2017
===============================================================================
- Made some good progress on the new snakemake pipeline. Got Emily's data set
up in the pipeline folder so we don't have to rebuild all of it.
===============================================================================
November 15th, 2017
===============================================================================
- Started to test blobtools on a single file in
'megadaph.private/data/exploratory/decontam/blobtools'. Hoping to nail down
the blobtools part of the protocol here.
===============================================================================
November 16th, 2017
===============================================================================
- Decent portion of the exploratory blobtools pipeline finished. Currently
running the initial alignment and centrifuge steps on trillian.
===============================================================================
November 17th, 2017
===============================================================================
- Job got stuck on Trillian for some reason. Restarted on Ford.
===============================================================================
November 18th, 2017
===============================================================================
- Exploratory pipeline has produced most of the inputs to blobtools and is now
working on creating the blobtools database.
- Added the trimming steps to the main snakemake pipeline but I'm running into
persistent syntax errors at the start of the script.
===============================================================================
November 19th, 2017
===============================================================================
- Blobtools is dependent on python2 but calls its submodules using 'python' so
it ends up defaulting to python3 and failing. I fixed this by replacing all
instances of python with python2 in the main script but a more intelligent
solution might be required for the main pipeline. Submitted an issue on
github.
- Can't get the initial symlink step to work in the main snakemake pipeline.
Snakemake seems to have some trouble with symlinks in general so I'm just
going to remove the step for now.
- Got the snakemake pipeline working all the way through to production of
multiqc files. Will start running it on the full dataset.
- Got some initial blobplots from the blobplot pipeline.
===============================================================================
November 20th, 2017
===============================================================================
- Wrote a script for downloading a new custom centrifuge database and added it
to the main snakemake pipeline.
- Started running the pipeline on the actual datafiles. When(if) this completes
all fastq cleaning, QC and database downloading will be done.
===============================================================================
November 21st, 2017
===============================================================================
- Centrifuge kept encountering time out errors while connecting to NCBI. Online
discussions suggested that this only occurs when centrifuge-download uses
rsync. Edited the binary in ~/bin to use curl instead. Main pipeline is now
running on Tesla.
- Meanwhile I've been testing SLURM scheduling of the pipeline to see if this
leads to faster execution.
===============================================================================
November 22nd, 2017
===============================================================================
- Getting more errors with centrifuge-download. I'm going to split the
centrifuge database building part of the pipeline into multiple tasks so that
I don't have to start fresh each time it fails.
===============================================================================
November 23rd, 2017
===============================================================================
- Decided to rerun the exploratory blobtools pipeline with the entire nt
database. It probably makes more sense to filter out unrealistic hits
afterwards than to spend too much time curating a custom database. The only
thing I'm worried about is that the centrifuge docs are very ambiguous as to
the contents of their nt database. The manual says the database contains all
'spliced coding' sequence, but their paper etc. say it's equivalent to the
database used in BLAST.
===============================================================================
November 23rd, 2017
===============================================================================
- The main pipeline encountered errors in the FastQC step. This resulted from
my removal of run_command in my fmbiopy library. Moved FastQC from a script
to a simple shell command, tested, and resubmitted with SLURM.
- Very unsatisfied with the output I'm getting from Centrifuge. Going to try
switching to BLAST for taxonomic assignment.
===============================================================================
November 27th, 2017
===============================================================================
- Got BLAST integrated into blobtools pipeline and running on slarti.
===============================================================================
November 28th, 2017
===============================================================================
- Fine-tuned many aspects of the exploratory pipeline: increased the number of
BLAST hits kept per contig from 1 to 10, and added a simple filter script to
remove obvious contaminants. All done very messily to be ready in time for
presentation at lab meeting. Will need to go over it again later.
- Tried to get Haplomerger2 set up. It's written very poorly, so it'll take a
lot of work to get it up and running.
-------------------------------------------------------------------------------
Pipeline for decontamination
-------------------------------------------------------------------------------
- Assemble
- Diploidify with haplomerger2
- Align to assembly
- blobtools pipeline
- Group by taxonomy and Concoct.
- Remove concoct groups which contain contaminants
- Remove obvious contaminants
- Possibly do a hard filter by coverage/GC if required
- Reassemble
===============================================================================
November 29th, 2017
===============================================================================
- Presented in lab meeting.
===============================================================================
November 30th, 2017
===============================================================================
- Started the initial SPAdes assemblies in the snakemake pipeline on SLURM.
===============================================================================
December 1st, 2017
===============================================================================
- Got haplomerger2 working on the provided test dataset. Had to install a bunch
of ucsc-* packages from bioconda, and add an old version of libpng to my
linker path.
================================================================================
December 22nd, 2017
================================================================================
- The assemblies died with a memory error again after being queued for weeks.
Upped the memory request to 100G and resubmitted.
================================================================================
January 12th, 2018
================================================================================
- The SPAdes assemblies finally completed. They took a long time due to
congestion on the server, but hopefully subsequent assemblies will be
snappier.
================================================================================
January 15th, 2018
================================================================================
- Added bowtie2 alignment and BLAST search steps to the pipeline.
- Went back and forth a lot on whether I should run Haplomerger2 before
decontamination. Decided against it because: 1. It would reduce our
decontamination sensitivity by half. 2. It would add a huge amount of
processing time.
- Instead we'll just have to be careful of removing heterozygous contigs during
decontamination then attempt to build haploid assemblies afterwards. Some loss
of heterozygous positions is probably inevitable though.
================================================================================
January 16th, 2018
================================================================================
- Typo in bowtie2 alignment, requeued pipeline.
================================================================================
January 21st, 2018
================================================================================
- The initial bowtie2 alignments completed successfully.
================================================================================
January 22nd, 2018
================================================================================
- Added a Diamond BLAST step to the pipeline. I didn't get a chance to test it
properly, so it will probably die with an error, but I submitted it to SLURM
just in case.
================================================================================
January 23rd, 2018
================================================================================
- Worked out the kinks in the diamond database construction script. Scheduled
the diamond BLAST runs on SLURM.
================================================================================
January 31st, 2018
================================================================================
- Diamond BLAST searches successfully completed after a false start. Now running
blobtools database initialization.
================================================================================
February 2nd, 2018
================================================================================
Initial round of blobplot and covplots are complete.
--------------------------------------------------------------------------------
Notes on initial plots
--------------------------------------------------------------------------------
- Major contaminant groups:
Proteobacteria, Bacteria-undef, Cyanobacteria, Actinobacteria.
- Should be safe just removing all bacterial reads.
- Initial inspection suggests that the mitosporidial contigs may actually be
misassigned Daphnia sequence.
Plan for first round of contamination removal:
- Remove all contigs with bacterial and plant taxonomic assignments.
- Remove all contigs with GC content > 0.6 or coverage < 20
- Potentially remove extra contigs based on coverage plots.
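The plan above amounts to a per-contig predicate. A minimal sketch, assuming a simple per-contig table of GC, coverage, and phylum-level taxonomy (as in a blobtools view table) — the taxa list comes from the plot notes, while 'Streptophyta' as the plant label, the column layout, and the thresholds are illustrative:

```python
# Sketch of the planned hard filter; taxa list from the blobplot notes,
# thresholds and table layout are placeholders to check against real output.
BAD_TAXA = {"Proteobacteria", "Bacteria-undef", "Cyanobacteria",
            "Actinobacteria", "Streptophyta"}

def keep_contig(gc, cov, phylum, max_gc=0.6, min_cov=20):
    """True if the contig passes the contamination filter."""
    return phylum not in BAD_TAXA and gc <= max_gc and cov >= min_cov

contigs = [
    ("contig_1", 0.42, 55.0, "Arthropoda"),      # passes
    ("contig_2", 0.65, 80.0, "Arthropoda"),      # fails: GC too high
    ("contig_3", 0.40, 5.0, "Arthropoda"),       # fails: coverage too low
    ("contig_4", 0.45, 60.0, "Proteobacteria"),  # fails: bacterial assignment
]
kept = [name for name, gc, cov, phy in contigs if keep_contig(gc, cov, phy)]
print(kept)  # ['contig_1']
```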
================================================================================
February 13th, 2018
================================================================================
- Initial round of decontamination completed. Adjusted removal criteria to
remove and reran:
1. Contigs with bacterial/plant/archaeal/viral taxonomic assignments
2. Contigs with GC > 0.6
3. Contigs with coverage < 10
================================================================================
February 20th, 2018
================================================================================
- Decontamination iteration 1 completed successfully. The majority of
contamination seems to be gone. Observed the following contamination
candidates in some assemblies:
1. Contigs with strong BLAST hits to primates
2. Contigs with various fungal taxonomic assignments.
3. New bacterial contigs
4. Contigs with ~0 coverage in some samples.
- Wrote a script for the second iteration of decontamination to target these
candidates. Also included a step to remove the mitochondrial contigs.
================================================================================
February 22nd, 2018
================================================================================
- Had to remove the mitochondrial filter step from the pipeline since it was
missing in some assemblies. Will have to do mitochondrial filtering manually
after the redundans step in the pipeline.
================================================================================
February 26th, 2018
================================================================================
- Added redundans, duplicate removal and basic variant calling to the pipeline.
Pipeline will probably have to run for a week or so.
================================================================================
March 5th, 2018
================================================================================
- Redundans process got stuck for a couple of inputs over the weekend. Restarted
with larger memory allocations.
================================================================================
March 6th, 2018
================================================================================
- Added script for filtering mitochondrial scaffolds from the redundans filtered
assembly.
================================================================================
March 9th, 2018
================================================================================
- Pipeline had a bunch of errors which I patched and reran.
================================================================================
March 12th, 2018
================================================================================
- Redundans still running for a couple of samples (has been running for > 3
days).
================================================================================
March 22nd, 2018
================================================================================
- Pipeline completed over break. Have initial variant sets for all lines except
FBSC for which assembly still isn't working for some reason.
- BUSCO reports suggest that redundans is being too aggressive in the reduction
step; assemblies contain ~99 % of conserved genes before redundans and ~90 %
afterwards.
- Wrote a snakemake script which (1) runs just the reduction step of redundans
with varying min-identity parameter values [0.5-0.98] and (2) creates a BUSCO
report. From the results of that script we should be able to choose the
identity parameter such that reduction is maximized without loss of true
genes from the assembly.
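The sweep boils down to one reduction-only redundans run plus one BUSCO run per identity value. A sketch of the command generation — the exact flag spellings (`--identity`, `--noscaffolding`, `--nogapclosing`, BUSCO's CLI) and file names are assumptions to check against the installed versions:

```python
# Sketch of the redundans min-identity sweep described above; flag names,
# output paths, and the BUSCO lineage are placeholders.
def sweep_commands(assembly="assembly.fa", lineage="arthropoda_odb9"):
    identities = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.98]
    cmds = []
    for ident in identities:
        outdir = f"redundans_id{ident}"
        # reduction step only: skip scaffolding and gap closing
        cmds.append(
            f"redundans.py -f {assembly} --identity {ident} "
            f"--noscaffolding --nogapclosing -o {outdir}"
        )
        # BUSCO report for the reduced assembly
        cmds.append(
            f"run_BUSCO.py -i {outdir}/scaffolds.reduced.fa "
            f"-o busco_id{ident} -l {lineage} -m genome"
        )
    return cmds

for cmd in sweep_commands():
    print(cmd)
```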
================================================================================
March 23rd, 2018
================================================================================
- Diagnosed the issue with the FBSC assembly as a memory error (again!). Upped
the memory allocation to 150G and reran.
- Results from redundans-test snakemake script:
- All results contained ~98 % of BUSCO genes. This indicates that the
reduction step of redundans may not be the culprit. Two options remain:
1) The BUSCO genes are in the assembly but BUSCO is missing them after
scaffolding.
2) Scaffolding/gap-closing is removing genes from the assembly.
- Option 1 seems far more likely. To test, I reran BUSCO on all the redundans
test files with a relaxed e-value cutoff (1).
================================================================================
March 26th, 2018
================================================================================
- Results for BUSCO with relaxed e-value cutoff:
- This seemed to solve the problem at first: with a looser e-value cutoff,
BUSCO found 98.6 % of expected genes across all redundans identity
thresholds. I decided to leave the redundans step alone in the main
pipeline and rerun BUSCO with an e-value cutoff of 0.01
- However, the BUSCO results were again ~ 89 % when run on the redundans
data from the full pipeline.
- It seems like some step other than reduction is causing the BUSCO genes to
be lost.
- For the next test I ran BUSCO on each intermediate redundans result for
the FA genotype to pinpoint which step causes the BUSCO genes to be lost.
- Results:
1. Scaffolding with long reads removed the majority of the BUSCOs
2. Scaffolding with the reference removed another ~5 BUSCOs.
3. The BUSCO genes removed by the reference scaffolding may have just
drifted below BUSCO's significance threshold. I'll look at the BLAST
results manually tomorrow.
================================================================================
March 27th, 2018
================================================================================
- Manually examined the BLAST results for the latest BUSCO reports.
- The BUSCO genes seem to have been completely removed from the assemblies
after scaffolding with merged reads.
- Several of the genes which went missing during scaffolding with the
reference seem to be present in the BLAST report but filtered out by some
later BUSCO step.
- This suggests that scaffolding with the merged reads may be the main
culprit. I'll try to rerun redundans without this step.
- Added GVCF genotyping to the pipeline.
- Found a really useful resource on using GATK in non-model organisms:
evodify.com/gatk-the-best-practice-for-genotype-calling-in-a-non-model-organism/
Main takeaways:
- We need to be careful about heterozygosity levels, GATK is
targeted at human data, so heterozygosity parameters will likely need to
be adjusted for our data. Tucker et al., 2013 estimates heterozygosity
levels at ~0.01 in sexual D. pulex, so this might be a good approximation
for now. We may need to produce our own estimates later.
- Base quality score recalibration will probably not be possible for our
data. The author argues that bootstrapping gives unsatisfactory results.
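The heterozygosity adjustment amounts to overriding GATK's default prior (0.001, tuned for humans) with the D. pulex estimate of ~0.01. A hedged sketch of the call in GATK4-style syntax (the GATK3 equivalent is `java -jar GenomeAnalysisTK.jar -T HaplotypeCaller ...`); all paths are placeholders:

```python
# Sketch of a HaplotypeCaller invocation with the heterozygosity prior set to
# the D. pulex estimate discussed above; paths and sample names are
# placeholders.
def haplotype_caller_cmd(ref, bam, sample, hets=0.01):
    return [
        "gatk", "HaplotypeCaller",
        "-R", ref,
        "-I", bam,
        "-ERC", "GVCF",                 # per-sample GVCFs for joint genotyping
        "--heterozygosity", str(hets),  # default is 0.001, tuned for humans
        "-O", f"{sample}.g.vcf.gz",
    ]

cmd = haplotype_caller_cmd("assembly.fa", "FA.bam", "FA")
print(" ".join(cmd))
```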
================================================================================
April 3rd, 2018
================================================================================
- The SLURM IBEST servers crashed last night so we lost 5 redundans jobs which
had been running for 6 days. Had to restart them fresh, which will put us back
another week unfortunately.
================================================================================
April 11th, 2018
================================================================================
- Getting a new redundans error for the last remaining redundans jobs: lastal
Input/Output error. I opened an issue on the redundans github. My hunch is
that it's memory-related, so I'll try again with a 200 GB allocation.
================================================================================
April 11th-25th, 2018
================================================================================
- Redundans maintainer recommended I upgrade my redundans version and try the
use-bwa parameter.
- Initial results indicate that this solved the issue. Redundans is still
running for a few samples but the majority succeeded very quickly.
- Followed the aforementioned GATK guide and called an initial set of indels
and SNPs. Also implemented a filter to remove all variants present in
multiple samples.
- This protocol proved to be way too permissive, as we ended up with ~200,000
variants after filtering.
- Misalignments due to repetitive sequences are the most likely cause of this.
I added a repeat masking step to the pipeline using Emily's repeat library.
Apparently bowtie2 has a long-standing bug when aligning to masked sequences,
so I also decided to switch to Novoalign for the final alignment step. This
will probably also increase alignment accuracy in non-repetitive sequences.
================================================================================
April 28-30th, 2018
================================================================================
- Did a large refactor of the pipeline, which had become very messy/buggy.
Found a couple of small bugs which had been causing continuous headaches.
===============================================================================
May 1st, 2018
===============================================================================
- The linuxbrew install on the server imploded for some unknown reason. The GCC
version just broke out of nowhere which caused constant random IO errors which
broke various pipeline steps. Had to do a clean reinstall of linuxbrew and all
installed packages. It's a lot more organized now, but I still can't seem to
get Perl modules to be detected properly when installing some programs. Also
installed conda so that it doesn't conflict with linuxbrew (mostly). Between
the two we should be able to install most programs.
- Added step to pipeline for counting number of callable loci using GATK.
===============================================================================
May 2nd-3rd, 2018
===============================================================================
- Realized that it's much more sensible to simply not call variants in
repetitive regions rather than masking repetitive regions in the assembly
directly. Rewrote and reran the variant calling steps.
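Skipping repeats at calling time can be done by passing the repeat annotation as exclusion intervals rather than touching the assembly. A sketch with placeholder file names (GATK accepts `-XL` for excluded intervals; GATK3 and GATK4 syntax differ elsewhere):

```python
# Sketch: exclude repeat intervals during variant calling instead of masking
# the assembly itself; file names are placeholders.
def calling_cmd(ref, bam, repeats, out):
    return [
        "gatk", "HaplotypeCaller",
        "-R", ref,
        "-I", bam,
        "-XL", repeats,  # do not call variants inside repeat intervals
        "-O", out,
    ]

print(" ".join(calling_cmd("assembly.fa", "FA.bam", "repeats.bed", "FA.vcf.gz")))
```

This keeps the alignments themselves untouched, so coverage and mapping quality in repeat regions remain inspectable even though no calls are made there.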
===============================================================================
May 3rd-8th, 2018
===============================================================================
- Added some extra variant plotting functionality and spent a lot of time
manually sifting through the initial variant sets to determine any potential
errors or biases.
- Clearly we need to add more stringent filtering for removing variants which
are shared across samples.
===============================================================================
May 10th, 2018
===============================================================================
- Added a script for filtering out variants which show evidence for being
variants across multiple samples.
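The shared-variant filter reduces to a per-site predicate over samples. A minimal sketch, assuming alternate-allele support can be tabulated per sample (sample names and the read threshold are illustrative):

```python
# Sketch of the uniqueness filter: keep a variant only if its alternate
# allele is supported in exactly one sample.
def is_unique(alt_support, min_reads=2):
    """alt_support: mapping of sample name -> reads supporting the alt allele."""
    supported = [s for s, n in alt_support.items() if n >= min_reads]
    return len(supported) == 1

sites = {
    "site_1": {"FA": 15, "FB": 0, "FC": 1},  # unique to FA -> keep
    "site_2": {"FA": 15, "FB": 6, "FC": 0},  # shared FA/FB -> drop
}
kept = [pos for pos, support in sites.items() if is_unique(support)]
print(kept)  # ['site_1']
```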
===============================================================================
May 13th, 2018
===============================================================================
- A lot of variants are slipping through the uniqueness filter due to being low
depth in one or more samples. I implemented a script which filters out any
site that is not within an ideal read depth range across all samples.
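A minimal sketch of that depth filter, assuming per-sample depths are available for each site; the depth window is illustrative:

```python
# Sketch of the depth filter: a site is only callable if its read depth falls
# inside an acceptable window in every sample.
def in_depth_range(depths, min_dp=10, max_dp=100):
    return all(min_dp <= d <= max_dp for d in depths)

sites = {
    "site_1": [35, 40, 38],
    "site_2": [35, 3, 38],    # too shallow in one sample -> drop
    "site_3": [35, 40, 250],  # too deep in one sample -> drop
}
callable_sites = [s for s, d in sites.items() if in_depth_range(d)]
print(callable_sites)  # ['site_1']
```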
===============================================================================
May 15th, 2018
===============================================================================
- GATK4 is still missing a lot of functionality particularly for doing pileup
analysis. I decided to revert all GATK analyses to GATK3.
- Switched from using GATK to a custom script for counting number of callable
loci since I'm using a custom read depth filter.
===============================================================================
May 16th-17th, 2018
===============================================================================
- A bunch of memory errors occurred during the depth + unique filtering steps.
Clearly the input files are too large to read in all at once. I rewrote the
scripts to read in chunks instead.
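The chunked rewrite boils down to streaming fixed-size batches of lines instead of loading the whole file; a sketch:

```python
# Sketch of reading a large text file in fixed-size line batches rather than
# loading it all at once.
from itertools import islice

def iter_chunks(path, chunk_lines=100_000):
    with open(path) as handle:
        while True:
            chunk = list(islice(handle, chunk_lines))
            if not chunk:
                return
            yield chunk
```

Each yielded chunk can be filtered and written out before the next is read, keeping memory use roughly constant regardless of input size.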
- The latest results look very good for the majority of samples with most having
~10-100 variants. However for several samples we see extremely inflated
mutation counts. The mutations in these samples seem to stem mostly from a few
scaffolds. I did a quick spotcheck of a couple in IGV and the mutations looked
fairly sensible. I'll look further into this tomorrow.
===============================================================================
May 18th, 2018
===============================================================================
Started my analysis by doing some datavis in R using the suspicious samples
(SUS).
- The affected scaffolds seem to be broadly distributed in the SUS samples. In
IAEC1 and IAEC2, which were both highly inflated, we see a strong but
imperfect correlation between the scaffolds which were affected.
- In many cases it seems that GATK is calling homozygous mutations when there
is clear evidence for heterozygosity in the alignments. The allele counts
output by GATK seem to be quite inaccurate.
Examples in IA genotype:
- scaffold00007:11527,11613, 10379
- Cases of normal looking mutations:
- scaffold00007:10460
- In other cases we see regions with a series of apparent loss of
heterozygosity mutations in one sample. This could potentially be due to
overcleaning during decontamination. To test this I'll realign all of the raw
read sets to the final assemblies to see if the heterozygous sites are
restored.
- There appears to be no correlation across samples between number of mutations
and generation number, strongly suggesting that most detected mutations are
false positives even in the less extreme samples.
===============================================================================
May 21st, 2018
===============================================================================
Examined the alignment of the raw IAEC1 reads to the final assembly to check for
evidence of overcleaning.
- Raw IAEC1 reads showed almost identical alignments in all positions I
examined.
- Overcleaning does not seem to explain the suspicious variant patterns.
Examined alignments of IAEC1 to assemblies from other genotypes:
- IAEC1 showed a large number of mismatches/zero coverage regions in these
alignments as compared to the alignments to IASC
The alignments of IAEC1 to other I isolates were less clear-cut. I'll need to
see similar alignments from a normal sample to determine the level of
difference I should expect. I scheduled alignments of IB2 to all other isolates
and will examine the results tomorrow.
Also worked on a new script for filtering shared alleles. I will need to parse
bam-readcount pileups instead, which will be a bit trickier than before.
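bam-readcount emits one tab-separated line per site: chrom, pos, ref, depth, then one colon-delimited field per allele whose second element is the read count. A parsing sketch under that assumption (the example line is shortened and uses made-up numbers; verify the field layout against the bam-readcount docs before relying on it):

```python
# Sketch of parsing a bam-readcount line into per-allele read counts.
# Assumed per-base field layout: base:count:avg_mapq:avg_baseq:...
# Only the first two elements of each field are used here.
def parse_readcount_line(line):
    chrom, pos, ref, depth, *base_fields = line.rstrip("\n").split("\t")
    counts = {}
    for field in base_fields:
        base, count = field.split(":")[:2]
        counts[base] = int(count)
    return chrom, int(pos), ref, int(depth), counts

example = ("scaffold_1\t11527\tA\t40\t"
           "=:0:0:0\tA:22:58.9:31.2\tC:0:0:0\tG:18:57.1:30.8\tT:0:0:0")
chrom, pos, ref, depth, counts = parse_readcount_line(example)
print(counts)  # {'=': 0, 'A': 22, 'C': 0, 'G': 18, 'T': 0}
```

With per-allele counts in hand, the shared-allele filter can be driven by the raw read support at each site instead of GATK's reported allele counts.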
===============================================================================
May 22nd, 2018
===============================================================================
Examined the mitochondrial alignments of IAEC1/IAEC2 vs IB and IC samples.
- Found marker polymorphisms in IB and IC which are not present in the IA
samples. This strongly suggests that the IAEC samples were not mislabelled.
Realized the probable cause of the extra variants! My unique variant filtering
step does not distinguish between gene conversion and de novo mutations. I've
spent too long studying mitochondria for my own good.
Reimplemented the uniqueness filter.
- Results look a lot better, with no samples having hugely elevated numbers of
mutations. The numbers are in general still too high though. I examined a few
mutations in IGV and found clear cases where alternative alleles were present
in multiple samples. It looks like I'll have to reimplement the filter using
the raw readcounts after all, since GATK's numbers are not completely
informative.
===============================================================================
June 1st, 2018
===============================================================================
- Uniqueness filter and depth filter have been reimplemented to work on
readcount files.
===============================================================================
June 5th, 2018
===============================================================================
- Uniqueness + depth filter required a couple more days of bug fixing to get
working completely. The final variant sets look to be almost there. The
strand bias filter does not seem to be working very effectively, so I'll need
to increase its stringency; I'll try filtering SOR > 3.0.
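Tightening the strand-bias cut can be expressed as a VariantFiltration step on SOR (StrandOddsRatio). A sketch with placeholder paths and GATK4-style flags (GATK3 spells them `--filterExpression`/`--filterName`):

```python
# Sketch of a hard strand-bias filter on SOR; variants exceeding the
# threshold are tagged with a FILTER label rather than removed outright.
def sor_filter_cmd(ref, vcf, out, threshold=3.0):
    return [
        "gatk", "VariantFiltration",
        "-R", ref,
        "-V", vcf,
        "--filter-expression", f"SOR > {threshold}",
        "--filter-name", "high_SOR",
        "-O", out,
    ]

cmd = sor_filter_cmd("assembly.fa", "variants.vcf.gz", "filtered.vcf.gz")
print(" ".join(cmd))
```

Tagging instead of deleting keeps the flagged records available for inspection in IGV before they are dropped for good.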
- While examining variants in IGV I also noticed that there are cases where IGV