Strain mappings
(This is an extended version of the description of how to best prepare a set of strain mappings for use with NPLinker)
NPLinker requires a set of user-supplied strain mappings in order to correctly match up strain labels from genomics data with those from metabolomics data. The mappings must be stored in a simple CSV file called strain_mappings.csv in the dataset folder. This page describes how to create and edit this file, and how to label strains so that NPLinker can potentially parse additional information (e.g. growth media).
Start by creating the strain_mappings.csv file and opening it in a text editor. Add one line to the file for each strain in the dataset. If you don't add a line for a particular strain, NPLinker will still be able to load the dataset but it will NOT include that strain, so the resulting sets of BGC/GCF/Spectrum/MolFam objects will be incomplete.
The first column of each line should contain the most relevant/useful/easy-to-read strain label. Each subsequent column should contain any other labels used to refer to the same strain throughout the dataset.
Note that the number of columns does not need to be the same on every line, and that NPLinker automatically removes .mzML and .mzXML from the strain labels it parses, so you don't need to include these extensions in the mappings file.
Here is a trivial example:
strain1,strain1A,strain1.B,strain1_C,strainONE
strain2,strainTWO,strainTWO_
strain3
Taking this line by line, it is saying:
- strain1 is also known as strain1A, strain1.B, strain1_C, and strainONE. NPLinker will therefore treat any instances of the latter 4 labels as equivalent to strain1
- strain2 is also known as strainTWO and strainTWO_, so again NPLinker will treat these as equivalent
- strain3 is only known as strain3 in the dataset, it has no other labels
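The effect of these lines can be sketched in a few lines of Python. This is only an illustration of the alias-to-primary-label idea and not NPLinker's actual loading code; the function name load_strain_mappings is hypothetical.

```python
import csv
import io


def load_strain_mappings(lines):
    """Build an alias -> primary-label lookup from strain mapping rows.

    Hypothetical helper for illustration; NPLinker's own loader may differ.
    """
    lookup = {}
    for row in csv.reader(lines):
        labels = [label.strip() for label in row if label.strip()]
        if not labels:
            continue  # skip blank lines
        primary = labels[0]  # the first column is the primary label
        for label in labels:
            lookup[label] = primary  # the primary label maps to itself
    return lookup


# The trivial example from above, one strain per line.
example = io.StringIO(
    "strain1,strain1A,strain1.B,strain1_C,strainONE\n"
    "strain2,strainTWO,strainTWO_\n"
    "strain3\n"
)
mappings = load_strain_mappings(example)
print(mappings["strain1.B"])   # strain1
print(mappings["strainTWO_"])  # strain2
print(mappings["strain3"])     # strain3
```

Because rows are read independently, lines with different numbers of columns need no special handling.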
To help populate this file, NPLinker will emit warning messages if it encounters any unknown strains while loading your data, and it will also generate a pair of CSV files called "unknown_strains_met.csv" and "unknown_strains_gen.csv" each time it is executed. These files will be located inside the dataset folder. They list the unknown strain labels found in the metabolomics and genomics data respectively. When you have added all strains to the strain_mappings.csv file, these 2 files should be empty except for their header lines.
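A quick way to check whether anything is still missing is to count the non-header rows in the two files. The sketch below assumes each file has a single header line followed by one label per row in the first column; that layout, the header text used in the demo, and the function name remaining_unknown_strains are assumptions made for illustration.

```python
import csv
import io


def remaining_unknown_strains(lines):
    """Return the labels listed after the header line (assumed layout)."""
    rows = list(csv.reader(lines))
    # rows[0] is treated as the header, whatever its text is
    return [row[0].strip() for row in rows[1:] if row and row[0].strip()]


# Stand-in for the contents of unknown_strains_met.csv; the header text is made up.
demo = io.StringIO("label\nAbc_12-34_DEF\n")
still_unknown = remaining_unknown_strains(demo)
print(still_unknown)

# With a real dataset you would pass an open file instead, for example:
# with open("unknown_strains_met.csv", newline="") as handle:
#     print(remaining_unknown_strains(handle))
```

An empty list for both files means every strain label in the data is covered by the mappings.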
On the metabolomics side, NPLinker matches strains using a relatively simple process. Given a raw strain label like Abc_12-34_DEF.mzML, NPLinker will simply remove the .mzML (or .mzXML) part and treat Abc_12-34_DEF as the name of the strain. This in turn means it will expect to find an entry for Abc_12-34_DEF in strain_mappings.csv. If you wanted to refer to this strain simply as "Abc", you could add the line Abc,Abc_12-34_DEF, making Abc the primary label.
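The matching step described here amounts to stripping the extension and looking the result up in the mappings. A minimal sketch follows, assuming the extension match is case-sensitive (the page does not say); strip_ms_extension and alias_to_primary are hypothetical names, not part of NPLinker.

```python
MS_EXTENSIONS = (".mzML", ".mzXML")


def strip_ms_extension(raw_label):
    """Remove a trailing .mzML or .mzXML from a raw strain label."""
    for ext in MS_EXTENSIONS:
        if raw_label.endswith(ext):
            return raw_label[: -len(ext)]
    return raw_label  # no known extension, leave the label unchanged


# Equivalent of the mappings line "Abc,Abc_12-34_DEF".
alias_to_primary = {"Abc": "Abc", "Abc_12-34_DEF": "Abc"}

label = strip_ms_extension("Abc_12-34_DEF.mzML")
primary = alias_to_primary.get(label)  # None would mean an unknown strain
print(label, "->", primary)
```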
However, for datasets where the metabolomics data has been generated using the GNPS Feature-based Molecular Networking (FBMN) workflow, NPLinker can optionally take advantage of the included "metadata table" file to automatically extract extra per-strain information. This relies on the existence of columns named ATTRIBUTE_Medium and ATTRIBUTE_Strain in the metadata table file (normally located under <dataset folder>/metadata_table/metadata_table-00000.txt).
Where these columns exist and contain valid data, NPLinker can be configured to use an alternative method of parsing strain labels. There are two differences to note compared to the normal method:
- The content of the ATTRIBUTE_Strain column will automatically be treated as an alternative to the label in the filename column. This is easiest to explain with an example. Given a filename/label like Abc_12-34_DEF.mzML, NPLinker would normally expect to find an entry in the strain mappings file for Abc_12-34_DEF. However, in this mode it will also treat the value of the ATTRIBUTE_Strain column in the corresponding row of the metadata table as an alternative label for the same strain. If the column contained Abc, you could add that label to your strain mappings and it would match all rows in the table with the same value, e.g. if the same strain occurred with multiple different growth media. This may help to reduce the number of mappings that need to be manually created.
- The second important change is that the ATTRIBUTE_Medium column can also be parsed (if present) in order to record the growth media for each strain. This should happen automatically as long as the column is present, and the information will be displayed in the web application when viewing strains.
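To make the two points above concrete, here is a rough sketch of reading these columns from a metadata table. It assumes the file is tab-separated with a header row and a column literally named filename; the sample rows, the medium names and the function name parse_metadata_table are invented for illustration, and this is not NPLinker's own parser.

```python
import csv
import io


def parse_metadata_table(lines):
    """Collect file label, strain and medium for each metadata row."""
    records = []
    for row in csv.DictReader(lines, delimiter="\t"):
        label = (row.get("filename") or "").strip()
        if not label:
            continue
        for ext in (".mzML", ".mzXML"):
            if label.endswith(ext):
                label = label[: -len(ext)]
                break
        records.append({
            "file_label": label,
            # Empty or missing cells become None
            "strain": (row.get("ATTRIBUTE_Strain") or "").strip() or None,
            "medium": (row.get("ATTRIBUTE_Medium") or "").strip() or None,
        })
    return records


# Made-up sample: one strain grown on two hypothetical media.
table = io.StringIO(
    "filename\tATTRIBUTE_Strain\tATTRIBUTE_Medium\n"
    "Abc_12-34_DEF.mzML\tAbc\tmedium_A\n"
    "Abc_56-78_GHI.mzML\tAbc\tmedium_B\n"
)
records = parse_metadata_table(table)

media_by_strain = {}
for record in records:
    media_by_strain.setdefault(record["strain"], set()).add(record["medium"])
print(media_by_strain)
```

In this sample a single mapping entry for Abc would cover both files, which is the reduction in manual mappings described above.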
To enable this mode, you need to add a line to the "[dataset]" section of your nplinker.toml file. If the "[dataset]" section already exists, add the new line underneath it, otherwise simply paste in the following:
[dataset]
extended_metadata_table_parsing = true
This option defaults to being disabled.