docs/mode_modify.txt

SYNOPSIS

    metacache modify <database> <sequence file/directory>... [OPTION]...

    metacache modify <database> [OPTION]... <sequence file/directory>...


DESCRIPTION

    Add reference sequences and/or taxonomic information to an existing database.


REQUIRED PARAMETERS

    <database>        database file name;
                      A MetaCache database contains taxonomic information and
                      min-hash signatures of reference sequences (complete
                      genomes, scaffolds, contigs, ...).

    <sequence file/directory>...
                      FASTA or FASTQ files containing genomic sequences
                      (complete genomes, scaffolds, contigs, ...) that shall
                      be used as representatives of an organism/taxon.
                      If directory names are given, they will be searched for
                      sequence files (at most 10 levels deep).
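
    Before passing a directory, it can help to preview which files sit
    within the 10-level search depth. The following is a minimal sketch
    using 'find'; the helper name 'list_sequence_files' and the extension
    filter are assumptions made for illustration, and MetaCache may apply
    different rules when it selects files or counts directory levels.

```shell
# List candidate sequence files below a directory, at most 10 levels deep.
# Assumption: files are recognized by common FASTA/FASTQ extensions.
list_sequence_files() {
    find "$1" -maxdepth 10 -type f \
        \( -name '*.fna' -o -name '*.fa' -o -name '*.fasta' \
           -o -name '*.fastq' -o -name '*.fq' \) \
        | sort
}

# Usage (hypothetical directory name):
#   list_sequence_files new_genomes
```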


BASIC OPTIONS

    -taxonomy <path>  directory with taxonomic hierarchy data (see NCBI's
                      taxonomic data files)

    -taxpostmap <file>
                      Files with sequence-to-taxon-id mappings that are used as
                      an alternative source in a post-processing step.
                      default: 'nucl_(gb|wgs|est|gss).accession2taxid'

    -silent|-verbose  information level during build:
                      silent => none / verbose => most detailed
                      default: neither => only errors/important info


ADVANCED OPTIONS

    -reset-taxa       Attempts to re-rank all sequences after the main build
                      phase using '.accession2taxid' files. This will reset the
                      taxon id of a reference sequence even if a taxon id could
                      be obtained from other sources during the build phase.
                      default: off

    -max-locations-per-feature <#>
                      maximum number of reference sequence locations to be
                      stored per feature;
                      If the value is too high it will significantly impact
                      querying speed. Note that an upper hard limit is always
                      imposed by the data type used for the hash table bucket
                      size (set with compilation macro
                      '-DMC_LOCATION_LIST_SIZE_TYPE').
                      default: 254

    -remove-overpopulated-features
                      Removes all features that have reached the maximum allowed
                      number of locations per feature. This can improve querying
                      speed and can be used to remove non-discriminative
                      features.
                      default: off
                      Not available in the GPU version.

    -remove-ambig-features <rank>
                      Removes all features that have more distinct reference
                      sequence taxa at the given taxonomic rank than set by
                      '-max-ambig-per-feature'. This can decrease the database
                      size significantly at the expense of sensitivity. Note
                      that the lower the given taxonomic rank is, the more
                      pronounced the effect will be.
                      Valid values: sequence, form, variety, subspecies,
                      species, subgenus, genus, subtribe, tribe, subfamily,
                      family, suborder, order, subclass, class, subphylum,
                      phylum, subkingdom, kingdom, domain
                      default: off
                      Not available in the GPU version.

    -max-ambig-per-feature <#>
                      Maximum number of allowed different reference sequence
                      taxa per feature if option '-remove-ambig-features' is
                      used.
                      Not available in the GPU version.

    -max-load-fac <factor>
                      maximum hash table load factor;
                      This can be used to trade off larger memory consumption
                      for speed and vice versa. A lower load factor will improve
                      speed; a larger one will improve memory efficiency.
                      default: 0.800000
                      Not available in the GPU version.
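
    To get a feeling for the memory side of '-max-load-fac', the sketch
    below computes how many slots a generic hash table needs so that a
    given number of entries stays at or below a maximum load factor. It
    illustrates the general load factor relation (slots >= entries / factor)
    only; the helper name 'min_slots' is hypothetical and the calculation
    is not taken from MetaCache's internal table sizing.

```shell
# Minimum number of slots so that `features` entries stay at or below a
# maximum load factor, given in percent (80 means 0.8).
min_slots() {
    features=$1
    load_pct=$2
    # Ceiling of features * 100 / load_pct using integer arithmetic.
    echo $(( (features * 100 + load_pct - 1) / load_pct ))
}

min_slots 1000000 80   # prints 1250000
min_slots 1000000 50   # prints 2000000
```

    In this generic model, lowering the factor from 0.8 to 0.5 raises the
    slot count by 60 percent for the same number of entries.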


EXAMPLES

    Add reference sequence 'penicillium.fna' to database 'fungi'
        metacache modify fungi penicillium.fna

    Add taxonomic information from NCBI to database 'myBacteria'
        download_ncbi_taxonomy myTaxo
        metacache modify myBacteria -taxonomy myTaxo
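
    Since the synopsis allows options after the sequence arguments, both
    steps can presumably be combined in one call. The sketch below only
    assembles and prints such a command line as a dry run; the directory
    name 'new_genomes' is hypothetical, while 'myBacteria' and 'myTaxo'
    are reused from the example above.

```shell
db="myBacteria"      # existing database
seqs="new_genomes"   # hypothetical directory with FASTA/FASTQ files
taxo="myTaxo"        # taxonomy directory, e.g. from download_ncbi_taxonomy

# Build the command from options documented on this page and print it.
cmd="metacache modify $db $seqs -taxonomy $taxo -verbose"
echo "$cmd"

# To actually run it once the inputs exist:
#   eval "$cmd"
```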