Add CCHFV to loculus #1920

Merged on Jun 6, 2024. The diff below shows changes from 60 of the 81 commits.

Commits:
- `9277b4b` Add ccfv to yaml (anna-parker, May 10, 2024)
- `21d2877` Fix ingest for single segment case (anna-parker, May 15, 2024)
- `d65a076` Fix: values.yaml - nucleotideSequences need to be a list in prepro co… (anna-parker, May 15, 2024)
- `fef7ebe` Add correct genome annotations from NCBI (anna-parker, May 15, 2024)
- `c323743` Update configs to use githubusercontent for nextclade_datasets. (anna-parker, May 16, 2024)
- `309dfeb` Use new dataset link (anna-parker, May 16, 2024)
- `c94ba9f` Fix preprocessing issues after default values.yaml changes. (anna-parker, May 22, 2024)
- `4756ffe` Add segmented as a config param (anna-parker, May 22, 2024)
- `49ff8e2` Join segments based on isolate name. (anna-parker, May 23, 2024)
- `1d9df16` Fix some prepro issues (anna-parker, May 23, 2024)
- `e0f8801` Add default config changes (anna-parker, May 23, 2024)
- `9dea930` Update silo configs (anna-parker, May 23, 2024)
- `b3c7645` Remove preprocessing temp results file. (anna-parker, May 23, 2024)
- `530cb30` Fix cchfv table columns as metdata has now been renamed. (anna-parker, May 23, 2024)
- `5717cd4` Fix author_affiliations (anna-parker, May 23, 2024)
- `61bb4f7` Merge branch 'main' into ccfv (anna-parker, May 23, 2024)
- `7069e0b` Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv (anna-parker, May 23, 2024)
- `9d1eb2a` Fix merge issues with instanceName. (anna-parker, May 23, 2024)
- `f375fb7` Merge branch 'main' into ccfv (anna-parker, May 23, 2024)
- `1d915a1` Fix prepare_metdata bug. (anna-parker, May 23, 2024)
- `320d1f3` Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv (anna-parker, May 23, 2024)
- `9b299d4` Add back missing website metadata. (anna-parker, May 23, 2024)
- `303c630` Fix author list sorting, fix displayName. (anna-parker, May 23, 2024)
- `6e6e75c` Fix values.yaml (anna-parker, May 24, 2024)
- `ddbc4d9` Fix reingest. (anna-parker, May 25, 2024)
- `7afc4ed` Add segmented to ingest configs and make use in scripts consistent. (anna-parker, May 25, 2024)
- `60f5992` Update README. (anna-parker, May 25, 2024)
- `7e70b73` Fix little ingest bug (anna-parker, May 25, 2024)
- `1c6d8ea` Refactor ingest to make steps clearer. (anna-parker, May 27, 2024)
- `31d1af1` Fix webpage bug. (anna-parker, May 27, 2024)
- `b14e4ee` Small prepro fixes (anna-parker, May 27, 2024)
- `1c0c841` Remove unnecessary files from gitignore (anna-parker, May 27, 2024)
- `0fbb72c` Merge branch 'main' into ccfv (anna-parker, May 28, 2024)
- `38778f3` Small fixes (anna-parker, May 28, 2024)
- `80c52de` Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv (anna-parker, May 28, 2024)
- `4fdfd1b` Clean up preprocessing (anna-parker, May 28, 2024)
- `fc6c7e4` add args (anna-parker, May 28, 2024)
- `f535691` Use links to sequences instead of full sequences in values.yaml. (anna-parker, May 28, 2024)
- `3cbbcc7` Merge branch 'main' into ccfv (anna-parker, May 28, 2024)
- `498a88d` Fix little bug (anna-parker, May 28, 2024)
- `3e1c6e0` Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv (anna-parker, May 28, 2024)
- `8b85432` Fix length bug (anna-parker, May 28, 2024)
- `c177584` Merge branch 'main' into ccfv (anna-parker, May 28, 2024)
- `aacad98` Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv (anna-parker, May 28, 2024)
- `4b48c46` Fix merge bug (anna-parker, May 28, 2024)
- `3e5377d` Make check stricter (anna-parker, May 28, 2024)
- `0591f09` Update docs (anna-parker, May 28, 2024)
- `16bb0aa` Merge remote-tracking branch 'origin/main' into ccfv (anna-parker, May 31, 2024)
- `eaba61f` Merge branch 'main' into ccfv (anna-parker, May 31, 2024)
- `b16026e` Fix prepro bug introduced by merge (anna-parker, May 31, 2024)
- `aab29a1` Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv (anna-parker, May 31, 2024)
- `7e512a7` Remove ncbi_length from defaults - this was removed from values.yaml … (anna-parker, May 31, 2024)
- `0186f52` Update READMEs with suggestions. (anna-parker, May 31, 2024)
- `a29ca19` Resolve some issues (anna-parker, Jun 1, 2024)
- `8ede192` Change `segmented` to `per_segment`. (anna-parker, Jun 2, 2024)
- `28c9402` Remove the requirement for adding `segmented:True` to the config.yaml (anna-parker, Jun 3, 2024)
- `9026276` Fix backend bug (anna-parker, Jun 3, 2024)
- `f77b447` Fix bug (anna-parker, Jun 3, 2024)
- `9c14af7` Second try to fix bug (anna-parker, Jun 3, 2024)
- `22c29dd` Merge branch 'main' into ccfv (corneliusroemer, Jun 4, 2024)
- `542606d` Add dag for segmented (corneliusroemer, Jun 4, 2024)
- `0914133` Simplify segmentation inference (corneliusroemer, Jun 4, 2024)
- `18db327` Remove unnecessary/confusing functions (corneliusroemer, Jun 4, 2024)
- `3cb089a` Simplify extraction script, DRYer (corneliusroemer, Jun 4, 2024)
- `ef375a4` Reorder to never have rules do forward references (corneliusroemer, Jun 4, 2024)
- `94b9706` Remove unused function (corneliusroemer, Jun 4, 2024)
- `fe20091` Keep top level dir clean by moving images to folder (corneliusroemer, Jun 4, 2024)
- `3fb9060` Review segment parsing script (corneliusroemer, Jun 4, 2024)
- `47afdea` Switch default log level to INFO, debug is very verbose and there's n… (corneliusroemer, Jun 4, 2024)
- `14d784d` Log a few important lines at INFO, not everything at debug only (corneliusroemer, Jun 4, 2024)
- `1fbc79c` Avoid a very broad try/except block, if necessary, use in more locali… (corneliusroemer, Jun 4, 2024)
- `c6d0ceb` Mention all config in `params:` blocks, so snakemake can rerun rule o… (corneliusroemer, Jun 4, 2024)
- `1012ecb` Use input.script consistently (the advantage of using the script as i… (corneliusroemer, Jun 4, 2024)
- `2cfdb12` All config files to be used by Python MUST use snake case, not camel … (corneliusroemer, Jun 4, 2024)
- `de3461b` Fix ruff lints and unnecessary indentations (corneliusroemer, Jun 4, 2024)
- `fac633e` Update documentation of group_segments (anna-parker, Jun 6, 2024)
- `8543f6c` Fix issues raised in get_segment_details (anna-parker, Jun 6, 2024)
- `df9c57d` Fix weird error I introduced when merging changes (anna-parker, Jun 6, 2024)
- `77119db` Go back to old regex as this catches more cases. (anna-parker, Jun 6, 2024)
- `bcefc7c` Update ingest config file (anna-parker, Jun 6, 2024)
- `cd19b1c` Merge branch 'main' into ccfv (anna-parker, Jun 6, 2024)

40 changes: 39 additions & 1 deletion docs/src/content/docs/guides/getting-started.md
@@ -134,12 +134,50 @@ organisms:

In this example, the configuration for the "ebolavirus-sudan" organism is defined. It includes schema settings, website display options, silo configuration, preprocessing details, and reference genome information.

Note that the metadata section includes various fields controlling how the metadata of specific sequences should be displayed. Each metadata item must have a `name`, which will also be displayed on the page unless `displayName` is set. The `type` of the data, whether the field is `required`, and whether `autoComplete` is enabled can also be specified. Additionally, links from metadata entries to external websites can be added using the `customDisplay` option. Metadata can also be grouped into sections, specified by the `header` field. The `noInput` parameter specifies that a field is generated internally by Loculus (it can be set by the preprocessing pipeline) and should not be expected as input.
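
For example, a metadata entry using these options might look like the following sketch (field names here are illustrative, not taken from a real configuration):

```yaml
metadata:
  - name: collection_date
    displayName: Collection date
    type: date
    required: true
    autoComplete: true
    header: "Sample details"
  - name: completeness
    type: float
    noInput: true # generated internally, e.g. by the preprocessing pipeline
```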

Your preprocessing pipeline can be customized for each organism. Currently, we use `nextclade run` in our preprocessing pipeline and suggest it as a fast option for basic checks on input sequences. Given a `nextclade dataset` (in its simplest form, a reference sequence and a gene_annotation file), nextclade tries to align new sequences to the reference and will discard sequences that cannot be aligned. It will also compute mutations, insertions and deletions for the nucleotide sequence as well as for the corresponding genes. If you would like to use our preprocessing setup, you can add a nextclade dataset to your `values.yaml` as follows:

```yaml
preprocessing:
  - version: 1
    image: ghcr.io/loculus-project/preprocessing-nextclade
    args:
      - "prepro"
    configFile:
      log_level: DEBUG
      nextclade_dataset_name: nextstrain/ebola/zaire
      genes: [NP, VP35, VP40, GP, sGP, ssGP, VP30, VP24, L]
      batch_size: 100
```

Additionally, the `tableColumns` section defines which metadata fields are shown as columns in the search results.

You can add multiple organisms under the organisms section, each with its own unique configuration.

### Multi-segmented pathogens

In Loculus, sequence data from multi-segmented viruses is stored in accessioned sequence entries which group together the segments from a particular sample or isolate. Multi-segmented organisms should be annotated with a list of segment names supplied as `nucleotideSequences`. For CCHFV this looks like:

```yaml
organisms:
  cchf:
    schema:
      organismName: "Crimean-Congo Hemorrhagic Fever Virus"
      nucleotideSequences: [L, M, S]
      metadata:
        - name: length
          type: int
          header: "Length"
          per_segment: true
```

Additionally, if you are using the preprocessing or ingest pipelines, `nucleotideSequences` must also be defined in those sections of the config.

Metadata fields can be isolate- or segment-specific. By default, we assume metadata fields are isolate-specific (i.e. shared across all segments); therefore segment-specific fields must be marked as `per_segment` in the config file. Marking a field as `per_segment` results in one field per segment: in the example above, instead of a single metadata field called `length` there will be three fields called `length_L`, `length_M` and `length_S`.

Loculus expects multi-segmented pathogen sequences to be submitted in a specific format. FASTA files should have a separate entry/record for each segment, with a header of `>[submissionID]_[segmentName]`, e.g. `>sample123_L` for the `L` segment of the sample with the submissionID `sample123`. Metadata is uploaded for an entire sequence entry, rather than per segment, i.e. there will be only one row for each `submissionID`.
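
For example, a FASTA submission for the three CCHFV segments of a single sample might look like this (sequences truncated for illustration):

```
>sample123_L
ACTGACTG...
>sample123_M
ACTGACTG...
>sample123_S
ACTGACTG...
```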

## Secrets

Our secrets configuration supports three types of secrets.
26 changes: 20 additions & 6 deletions ingest/README.md
@@ -18,6 +18,12 @@ Using NCBI `datasets` CLI, download all sequences and corresponding NCBI curated

Sequences and metadata are transformed into (nd)json files to simplify (de)serialization and further processing.

### Segmented viruses

NCBI handles segmented viruses differently from Loculus. In NCBI, the primary unit of accession is a single segment of a genomic sequence, with each segment having its own metadata. In Loculus, a sample is uploaded with all its segments grouped under a collective accession ID, and metadata applies at the sample (or group) level. Downloaded FASTA files have each segment under a header of the form `>[accessionID]_[segmentName]`. (When uploaded to Loculus, they must instead use `>[submissionID]_[segmentName]`.)

The segment a sequence corresponds to can only be determined from the description of its FASTA record. In `get_segment_details.py` we discard all sequences with unclear segment annotations and add `segment` as a metadata field. (TODO #2079: Use nextclade instead of a regex search to determine which segment the sequence aligns with best, to keep as much data as possible.)
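
As a rough illustration of the regex-based approach (the actual patterns and edge-case handling live in `scripts/get_segment_details.py`; this sketch is not the real implementation):

```python
import re

# Hypothetical pattern; the real script's regex differs.
SEGMENT_PATTERN = re.compile(r"\bsegment[:\s]*([LMS])\b", re.IGNORECASE)

def infer_segment(fasta_description: str) -> str | None:
    """Return 'L', 'M' or 'S' if the description names exactly one segment."""
    matches = {m.group(1).upper() for m in SEGMENT_PATTERN.finditer(fasta_description)}
    if len(matches) == 1:
        return matches.pop()
    return None  # ambiguous or missing annotation: the sequence is discarded

print(infer_segment("CCHFV strain X segment L, complete sequence"))  # "L"
print(infer_segment("CCHFV partial genome"))  # None
```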

### Transforming values to conform with Loculus' expectations

Metadata as received from `datasets` is transformed to conform to Loculus' expectations. This includes for example:
@@ -26,15 +32,25 @@
- transforming values, e.g. turn author strings from `LastName1, Initial1, LastName2, Initial2` into `Initial1 LastName1, Initial2 LastName2`
- splitting fields, e.g. NCBI's single, complex collection country field (`Germany: Munich, Bavaria`) is split into multiple fields `country`, `state`, `division` (`Germany`, `Bavaria`, `Munich`)
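
As a sketch of the author-string transform described above (the real ingest scripts handle more edge cases, such as missing initials):

```python
def reformat_authors(ncbi_authors: str) -> str:
    """Turn 'LastName1, Initial1., LastName2, Initial2.' into
    'Initial1. LastName1, Initial2. LastName2'."""
    parts = [p.strip() for p in ncbi_authors.split(",")]
    pairs = zip(parts[0::2], parts[1::2])  # (last name, initials) pairs
    return ", ".join(f"{initials} {last}" for last, initials in pairs)

print(reformat_authors("Parker, A., Roemer, C."))  # "A. Parker, C. Roemer"
```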

Note that the `submissionId` is just the `genbank_accession` for non-segmented viruses; for segmented viruses it is the concatenation of the `genbank_accession` of each segment (with the segment name appended to each).

### Calculating a hash for each sequence entry

Every sequence entry is to be uploaded only once and must be ignored by future periodic ingest runs unless the metadata and/or sequence has changed.

To achieve this, an md5 hash is generated for each sequence entry based on the post-transform metadata and sequence content. The hash is based on all metadata fields submitted to Loculus as well as the sequence. Hence, changes to the ingest pipeline's transform step (above) can lead to changes in hash and resubmission - even without underlying data change on INSDC. Likewise, some changes to the INSDC data might not cause a sequence update on Loculus if what has been changed does not affect the post-transformed metadata.

For segmented viruses we calculate the md5 hash of each segment; after grouping segments, we concatenate the per-segment hashes and hash the result.
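
A minimal sketch of this hashing scheme (assuming, for illustration, JSON-serialized metadata and a fixed ordering of segment hashes; the real implementation may differ in details):

```python
import hashlib
import json

def record_hash(metadata: dict, sequence: str) -> str:
    """MD5 over the post-transform metadata plus the sequence."""
    payload = json.dumps(metadata, sort_keys=True) + sequence
    return hashlib.md5(payload.encode()).hexdigest()

def group_hash(segment_hashes: list[str]) -> str:
    """For segmented viruses: hash the concatenation of per-segment hashes."""
    return hashlib.md5("".join(segment_hashes).encode()).hexdigest()
```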

### Grouping segmented viruses

In NCBI, sequences are uploaded for each segment separately. To upload all segments from the same isolate together, we need to group the sequences. We do this by grouping NCBI segments based on `ncbi_isolate_name` and other isolate-specific attributes; segments are only uploaded together if all these attributes match. We also add additional checks to prevent multiple sequences of the same segment from being grouped together. If a check fails, or the segments do not have isolate information, the segments are ingested and uploaded to Loculus individually.

We group segments by adding a `joint_accession` field to the metadata, which consists of a concatenated list of the `genbank_accession` IDs of all segments in the group. Each FASTA record is also modified to use the `joint_accession`, with its segment name appended, as its ID (as required by Loculus).
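
A condensed sketch of this grouping logic (field names and the exact `joint_accession` format are illustrative; the authoritative version is `scripts/group_segments.py`):

```python
from collections import defaultdict

def group_segments(records: list[dict]) -> None:
    groups = defaultdict(list)
    for rec in records:
        # Isolate-specific attributes; segments are grouped only if all match.
        key = (rec.get("ncbi_isolate_name"), rec.get("sample_collection_date"))
        groups[key].append(rec)

    for key, members in groups.items():
        segments = [m["segment"] for m in members]
        if None in key or len(set(segments)) != len(segments):
            continue  # missing isolate info or duplicated segment: upload individually
        joint = "/".join(sorted(m["genbank_accession"] for m in members))
        for m in members:
            m["joint_accession"] = joint
```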

### Getting status and hashes of previously submitted sequences and triaging

Before uploading new sequences, the pipeline queries the Loculus backend for the status and hash of all previously submitted sequences. This is done to avoid uploading sequences that have already been submitted and have not changed. Furthermore, only accessions whose highest version is in status `APPROVED_FOR_RELEASE` can be updated through revision. Entries in other states cannot currently be updated (TODO: Potentially use `/submit-edited-data` endpoint to allow updating entries in more states).

Hashes and statuses are used to triage sequences into 4 categories which determine the action to be taken:

@@ -79,7 +95,7 @@ In production the ingest pipeline runs in a docker container that takes a config
We use the Snakemake workflow management system which also uses different config files:

- `profiles/default/config.yaml` sets default command line options while running Snakemake.
- `config/config.yaml` and `config/defaults.yaml` turn into a config dict in the `Snakefile` (attributes defined in these files become global variables in the `Snakefile`). The `config/config.yaml` used in production is generated by `kubernetes/loculus/templates/loculus-preprocessing-config.yaml`.

TLDR: The `Snakefile` contains workflows defined as rules with required input and expected output files. By default, Snakemake takes the first rule as the target and then constructs a graph of dependencies (a DAG) required to produce that rule's expected output. The target rule can be specified using `snakemake {rule}`.
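
For illustration, a toy Snakefile (not part of this repo): running bare `snakemake` targets the first rule, `all`, and Snakemake infers that `process` must run first to produce the required input.

```snakemake
rule all:
    input:
        "results/processed.txt"


rule process:
    input:
        "results/raw.txt"
    output:
        "results/processed.txt"
    shell:
        "sort {input} > {output}"
```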

@@ -109,6 +125,8 @@ Then run snakemake using `snakemake` or `snakemake {rule}`.

Note that by default the pipeline will submit sequences to main. If you want to change this to another branch (that has a preview tag) you can modify the `backend_url` and `keycloak_token_url` arguments in the `config.yaml` file. They are of the form `https://backend-{branch_name}.loculus.org/` and `https://authentication-{branch_name}.loculus.org`. Alternatively, if you are running the backend locally, you can also specify the local backend port: `http://localhost:8079` and the local keycloak port: `http://localhost:8083`.

The ingest pipeline requires config files, found in the `config` directory. `defaults.yaml` contains default values, which are overridden by `config.yaml`. To produce the `config.yaml` used in production, you can run `../generate_local_test_config.sh` and then copy the configs for your pathogen into `config.yaml`.

## Testing

Currently, there is no automated testing other than running the pipeline manually and in preview deployments.
@@ -127,10 +145,6 @@ To be able to run tests independently, we should use UUIDs for mock data.

One complication for testing is that we don't have ARM containers for the backend yet (see <https://github.com/loculus-project/loculus/issues/1765>).
> **Contributor:** We do now 😀


### Multi-segment support

Currently, the pipeline only supports single-segment sequences. We need to add support for multi-segment viruses, like CCHF and Influenza.

### Recover from processing errors

At some point we should be able to recover from processing errors by using `/submit-edited-data` to update entries in more states than just `APPROVED_FOR_RELEASE`.
112 changes: 103 additions & 9 deletions ingest/Snakefile
@@ -10,11 +10,20 @@ for key, value in defaults.items():
    if not key in config:
        config[key] = value

# Check if organism is segmented
if "nucleotideSequences" not in config:
    config["nucleotideSequences"] = ["main"]
config["segmented"] = not (
    len(config["nucleotideSequences"]) == 1
    and config["nucleotideSequences"][0] == "main"
)

Path("results").mkdir(parents=True, exist_ok=True)
with open("results/config.yaml", "w") as f:
f.write(yaml.dump(config))

TAXON_ID = config["taxon_id"]
SEGMENTED = config["segmented"]
ALL_FIELDS = ",".join(config["all_fields"])
COLUMN_MAPPING = config["column_mapping"]
LOG_LEVEL = config.get("log_level", "INFO")
@@ -50,17 +59,54 @@ rule fetch_ncbi_dataset_package:
        """


def get_extract_output(wildcard):
    if wildcard:
        return ("results/sequences_full.fasta",)
    else:
        return ("results/sequences.fasta",)


rule extract_ncbi_dataset_sequences:
    input:
        dataset_package="results/ncbi_dataset.zip",
    output:
        ncbi_dataset_sequences=get_extract_output(SEGMENTED),
    params:
        segmented=SEGMENTED,
    shell:
        """
        if [[ {params.segmented} ]]; then
            unzip -jp {input.dataset_package} \
                ncbi_dataset/data/genomic.fna \
            | seqkit seq -w0 \
            > {output.ncbi_dataset_sequences}
        else
            unzip -jp {input.dataset_package} \
                ncbi_dataset/data/genomic.fna \
            | seqkit seq -i -w0 \
            > {output.ncbi_dataset_sequences}
        fi
        """


rule get_segment_details:
    """Check if viruses are segmented, if so add segment to metadata"""
    input:
        sequences="results/sequences_full.fasta",
        script="scripts/get_segment_details.py",
        ncbi_dataset_tsv="results/metadata_post_rename.tsv",
        config="results/config.yaml",
    output:
        sequences_processed="results/sequences.fasta",
        ncbi_dataset_tsv="results/metadata_post_segment.tsv",
    shell:
        """
        python {input.script} \
            --config-file {input.config} \
            --input-seq {input.sequences} \
            --input-metadata {input.ncbi_dataset_tsv} \
            --output-seq {output.sequences_processed} \
            --output-metadata {output.ncbi_dataset_tsv}
        """


@@ -106,9 +152,16 @@ rule rename_columns:
        rename_columns(input.ncbi_dataset_tsv, output.ncbi_dataset_tsv)


def get_prepare_metadata(wildcard):
    if wildcard:
        return ("results/metadata_post_segment.tsv",)
    else:
        return ("results/metadata_post_rename.tsv",)


rule prepare_metadata:
    input:
        metadata=get_prepare_metadata(SEGMENTED),
> **Contributor:** No need for a function here, we can just use an inline foo if boolean else baz expression. Calling the variable wildcard looks like a ChatGPT hallucination 😀

        sequence_hashes="results/sequence_hashes.json",
        config="results/config.yaml",
        script="scripts/prepare_metadata.py",
@@ -127,6 +180,36 @@ rule prepare_metadata:
        """


rule group_segments:
    input:
        metadata="results/metadata_post_prepare.json",
        sequences="results/sequences.json",
        config="results/config.yaml",
        script="scripts/group_segments.py",
    output:
        metadata="results/metadata_post_group.json",
        sequences="results/sequences_post_group.json",
    params:
        log_level=LOG_LEVEL,
    shell:
        """
        python scripts/group_segments.py \
            --config-file {input.config} \
            --input-metadata {input.metadata} \
            --input-seq {input.sequences} \
            --output-metadata {output.metadata} \
            --output-seq {output.sequences} \
            --log-level {params.log_level} \
        """


def get_grouped_metadata(wildcard):
    if wildcard:
        return ("results/metadata_post_group.json",)
    else:
        return ("results/metadata_post_prepare.json",)

rule get_previous_submissions:
    """Download metadata and sequence hashes of all previously submitted sequences
    Produces mapping from INSDC accession to loculus id/version/hash
@@ -142,7 +225,7 @@ rule get_previous_submissions:
        ...
    """
    input:
        prepped_metadata=get_grouped_metadata(SEGMENTED),  # Reduce likelihood of race condition of multi-submission
        config="results/config.yaml",
        script="scripts/call_loculus.py",
    output:
@@ -163,8 +246,9 @@

rule compare_hashes:
    input:
        config="results/config.yaml",
        old_hashes="results/previous_submissions.json",
        metadata=get_grouped_metadata(SEGMENTED),
        script="scripts/compare_hashes.py",
    output:
        to_submit="results/to_submit.json",
@@ -177,6 +261,7 @@
    shell:
        """
        python scripts/compare_hashes.py \
            --config-file {input.config} \
            --old-hashes {input.old_hashes} \
            --metadata {input.metadata} \
            --to-submit {output.to_submit} \
@@ -188,10 +273,18 @@
        """


def get_grouped_sequences(wildcard):
    if wildcard:
        return ("results/sequences_post_group.json",)
    else:
        return ("results/sequences.json",)

rule prepare_files:
    input:
        config="results/config.yaml",
        metadata=get_grouped_metadata(SEGMENTED),
        sequences=get_grouped_sequences(SEGMENTED),
        to_submit="results/to_submit.json",
        to_revise="results/to_revise.json",
        script="scripts/prepare_files.py",
@@ -203,6 +296,7 @@
    shell:
        """
        python scripts/prepare_files.py \
            --config-file {input.config} \
            --metadata-path {input.metadata} \
            --sequences-path {input.sequences} \
            --to-submit-path {input.to_submit} \
9 changes: 9 additions & 0 deletions ingest/config/config.yaml
@@ -2,3 +2,12 @@ taxon_id: 186538
backend_url: https://backend-main.loculus.org/
keycloak_token_url: https://authentication-main.loculus.org/realms/loculus/protocol/openid-connect/token
organism: ebola-zaire

# taxon_id: 3052518
# backend_url: http://localhost:8079/
# keycloak_token_url: http://localhost:8083/realms/loculus/protocol/openid-connect/token
# organism: cchf
# nucleotideSequences:
# - M
# - L
# - S
26 changes: 20 additions & 6 deletions ingest/config/defaults.yaml
@@ -5,13 +5,17 @@ log_level: DEBUG
compound_country_field: ncbi_geo_location
fasta_id_field: genbank_accession
rename:
  bioprojects: bioproject_accessions
  country: geo_loc_country
  division: geo_loc_admin_1
  genbank_accession: insdc_accession_full
  ncbi_collection_date: sample_collection_date
  ncbi_host_name: host_name_scientific
  ncbi_host_tax_id: host_taxon_id
  ncbi_is_lab_host: is_lab_host
  ncbi_isolate_name: specimen_collector_sample_id
  ncbi_sra_accessions: sra_run_accession
  ncbi_submitter_affiliation: author_affiliations
  ncbi_submitter_names: authors
keep:
  - division
@@ -33,6 +37,16 @@ keep:
  - ncbi_virus_tax_id
  - sequence_md5
  - genbank_accession
  - joint_accession
segment_specific:
  - biosample_accession
  - bioproject_accessions
  - ncbi_completeness
  - sra_run_accession
  - ncbi_protein_count
  - insdc_accession_base
  - insdc_version
  - insdc_accession_full
all_fields:
  - accession
  - bioprojects