Skip to content

Commit

Permalink
feat(ingest): Map more ncbi fields to loculus fields, remove unnecess…
Browse files Browse the repository at this point in the history
…ary all_fields definition. (#2441)
  • Loading branch information
anna-parker authored Aug 17, 2024
1 parent 1cc44e1 commit 1ad50d4
Show file tree
Hide file tree
Showing 3 changed files with 4 additions and 53 deletions.
4 changes: 0 additions & 4 deletions ingest/Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,6 @@ with open("results/config.yaml", "w") as f:

TAXON_ID = config["taxon_id"]
SEGMENTED = config["segmented"]
ALL_FIELDS = ",".join(config["all_fields"])
COLUMN_MAPPING = config["column_mapping"]
LOG_LEVEL = config.get("log_level", "INFO")
NCBI_API_KEY = os.getenv("NCBI_API_KEY")
Expand Down Expand Up @@ -74,13 +73,10 @@ rule format_ncbi_dataset_report:
dataset_package="results/ncbi_dataset.zip",
output:
ncbi_dataset_tsv="results/metadata_post_extract.tsv",
params:
fields_to_include=ALL_FIELDS,
shell:
"""
dataformat tsv virus-genome \
--package {input.dataset_package} \
--fields {params.fields_to_include:q} \
> {output.ncbi_dataset_tsv}
"""

Expand Down
49 changes: 0 additions & 49 deletions ingest/config/defaults.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -24,55 +24,6 @@ keep:
- sequence_md5
- genbank_accession
- joint_accession
all_fields:
- accession
- bioprojects
- biosample-acc
- completeness
- gene-count
- geo-location
- geo-region
- host-common-name
- host-infraspecific-breed
- host-infraspecific-cultivar
- host-infraspecific-ecotype
- host-infraspecific-isolate
- host-infraspecific-sex
- host-infraspecific-strain
- host-name
- host-pangolin
- host-tax-id
- is-annotated
- is-complete
- is-lab-host
- is-vaccine-strain
- isolate-collection-date
- isolate-lineage
- isolate-lineage-source
- lab-host
- length
- matpeptide-count
- mol-type
- nucleotide-completeness
- protein-count
- purpose-of-sampling
- release-date
- sourcedb
- sra-accs
- submitter-affiliation
- submitter-country
- submitter-names
- update-date
- virus-common-name
- virus-infraspecific-breed
- virus-infraspecific-cultivar
- virus-infraspecific-ecotype
- virus-infraspecific-isolate
- virus-infraspecific-sex
- virus-infraspecific-strain
- virus-name
- virus-pangolin
- virus-tax-id
column_mapping:
Accession: genbank_accession
BioProjects: bioprojects
Expand Down
4 changes: 4 additions & 0 deletions kubernetes/loculus/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -397,6 +397,7 @@ defaultOrganismConfig: &defaultOrganismConfig
generateIndex: true
autocomplete: true
header: Sample details
ingest: ncbi_geo_region
- name: geo_loc_city
displayName: Collection city
generateIndex: true
Expand Down Expand Up @@ -494,6 +495,7 @@ defaultOrganismConfig: &defaultOrganismConfig
guidance: Select a value from the pick list in the template
example: Diagnostic testing
header: Sampling
ingest: ncbi_purpose_of_sampling
- name: presampling_activity
ontology_id: GENEPIO:0100433
definition: The activities or variables introduced upstream of sample collection that may affect the sample collected.
Expand Down Expand Up @@ -608,6 +610,7 @@ defaultOrganismConfig: &defaultOrganismConfig
example: Male [NCIT:C46109]
displayName: Host Gender
header: Host
ingest: ncbi_host_sex
- name: host_origin_country
ontology_id: GENEPIO:0100438
definition: The country of origin of the host.
Expand Down Expand Up @@ -889,6 +892,7 @@ defaultOrganismConfig: &defaultOrganismConfig
generateIndex: true
autocomplete: true
header: "Host"
ingest: ncbi_host_common_name
- name: host_taxon_id
type: int
autocomplete: true
Expand Down

0 comments on commit 1ad50d4

Please sign in to comment.