SYNOPSIS
metacache modify <database> <sequence file/directory>... [OPTION]...
metacache modify <database> [OPTION]... <sequence file/directory>...
DESCRIPTION
Add reference sequences and/or taxonomic information to an existing database.
REQUIRED PARAMETERS
<database> database file name;
A MetaCache database contains taxonomic information and
min-hash signatures of reference sequences (complete
genomes, scaffolds, contigs, ...).
<sequence file/directory>...
FASTA or FASTQ files containing genomic sequences
(complete genomes, scaffolds, contigs, ...) that shall
be used as representatives of an organism/taxon.
If directory names are given, they will be searched for
sequence files (at most 10 levels deep).
BASIC OPTIONS
-taxonomy <path> directory with taxonomic hierarchy data (see NCBI's
taxonomic data files)
-taxpostmap <file>
Files with sequence-to-taxon-id mappings that are used as
an alternative source in a post-processing step.
default: 'nucl_(gb|wgs|est|gss).accession2taxid'
-silent|-verbose information level during build:
silent => none / verbose => most detailed
default: neither => only errors/important info
ADVANCED OPTIONS
-reset-taxa Attempts to re-rank all sequences after the main build
phase using '.accession2taxid' files. This will reset the
taxon id of a reference sequence even if a taxon id could
be obtained from other sources during the build phase.
default: off
-max-locations-per-feature <#>
maximum number of reference sequence locations to be
stored per feature;
If the value is too high, it will significantly slow down
querying. Note that a hard upper limit is always
imposed by the data type used for the hash table bucket
size (set with compilation macro
'-DMC_LOCATION_LIST_SIZE_TYPE').
default: 254
-remove-overpopulated-features
Removes all features that have reached the maximum allowed
number of locations per feature. This can improve querying
speed and can be used to remove non-discriminative
features.
default: off
Not available in the GPU version.
-remove-ambig-features <rank>
Removes all features that have more distinct reference
sequences at the given taxonomic rank than set by
'-max-ambig-per-feature'. This can decrease the database
size significantly at the expense of sensitivity. Note
that the lower the given taxonomic rank is, the more
pronounced the effect will be.
Valid values: sequence, form, variety, subspecies,
species, subgenus, genus, subtribe, tribe, subfamily,
family, suborder, order, subclass, class, subphylum,
phylum, subkingdom, kingdom, domain
default: off
Not available in the GPU version.
-max-ambig-per-feature <#>
Maximum number of allowed different reference sequence
taxa per feature if option '-remove-ambig-features' is
used.
Not available in the GPU version.
-max-load-fac <factor>
maximum hash table load factor;
This can be used to trade off larger memory consumption
for speed and vice versa. A lower load factor will improve
speed, a larger one will improve memory efficiency.
default: 0.800000
Not available in the GPU version.
EXAMPLES
Add reference sequence 'penicillium.fna' to database 'fungi'
metacache modify fungi penicillium.fna
Add taxonomic information from NCBI to database 'myBacteria'
download_ncbi_taxonomy myTaxo
metacache modify myBacteria -taxonomy myTaxo
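Add all sequence files found in a directory while limiting the locations stored per feature. This is only a sketch that combines options documented above: the function name 'modify_cmd', the directory 'newGenomes' and the value 128 are illustrative assumptions, not recommended settings. Note that '-remove-overpopulated-features' is not available in the GPU version.

```shell
# Hypothetical helper: prints a 'metacache modify' command line for
# a database ($1) and a sequence file or directory ($2).
# The option values are illustrative assumptions only.
modify_cmd() {
    printf 'metacache modify %s %s -max-locations-per-feature 128 -remove-overpopulated-features -verbose\n' "$1" "$2"
}

# Show the command for database 'myBacteria' and directory 'newGenomes'.
modify_cmd myBacteria newGenomes
```

The helper only prints the command so that it can be reviewed first; pipe its output to 'sh' to run it on a system where metacache is installed.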