
LoadingLocalData


Loading a new dataset into NPLinker

This guide describes the steps a first-time user of NPLinker needs to take in order to load a new dataset.

Prerequisites:

  • Docker
  • Metabolomics data. Typically the contents of the GNPS "Clustered Spectra as MGF" .zip file for the job submitted.
  • Genomics data. At minimum a folder of antiSMASH .gbk files. BiG-SCAPE data can optionally be provided, but will be generated if not found.
  • A strain mappings CSV file (see below for details).

1. Create a shared folder for NPLinker files

When using the Docker version of NPLinker, the application has no direct access to the files stored on your system. Instead, you give Docker access to a chosen "shared folder" where one or more datasets are located. All files you wish to load into NPLinker must be somewhere inside this folder, but you can create any number of subfolders inside the top-level one to organise different datasets.

It doesn't matter where the shared folder is located, or what it is called. Simply pick a location and create a new empty folder. This guide assumes the folder is called nplinker_shared.

2. Create a dedicated folder for the dataset

Inside the nplinker_shared folder, create a new subfolder. Once again the name doesn't matter. This guide assumes the folder is called dataset_1 but feel free to substitute the name of your dataset.
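For example, on Linux or macOS both folders could be created in one step from a terminal (the path shown here is just an example; any location will do):

mkdir -p ~/nplinker_shared/dataset_1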

3. Create a basic NPLinker configuration file

NPLinker has various options that can be configured. The easiest way to do this is by using a configuration file. The file contains text formatted using simple TOML syntax and so has a .toml extension. A complete example of an NPLinker configuration file can be found here, but a typical example will be only a few lines long.

To configure NPLinker to load a dataset from the dataset_1 folder, create a new file in nplinker_shared called nplinker.toml. The Docker version of NPLinker is already configured to look for a file in this location with this name.

Open the file in a text editor, and add a "root" value as shown below:

[dataset]
root = "/data/dataset_1"

The "root" value tells NPLinker the location of the folder containing all the files to be loaded as part of a dataset.

The path might look strange because it doesn't exist on your system. However, inside the Docker container the "/data" folder is mapped to the nplinker_shared folder, so any files/folders that exist inside nplinker_shared will be visible to the NPLinker application in the container through "/data".
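With the configuration file in place, the shared folder should contain something like the following (using the example names from this guide):

nplinker_shared/
├── nplinker.toml
└── dataset_1/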

4. Populating the dataset folder

4.1 Metabolomics data

NPLinker is designed to work with the folder structure generated by the zip files available to download from completed GNPS jobs. The content may vary slightly depending on the GNPS workflow used, but NPLinker is known to work with the following workflow outputs:

  • METABOLOMICS-SNETS (version 1.2.3); use the "Download clustered spectra as MGF" link
  • METABOLOMICS-SNETS-V2 (version 1.2.3); use the "Download clustered spectra as MGF" link
  • FEATURE-BASED-MOLECULAR-NETWORKING (version release_14, release 28.2); use the "Download Cytoscape Data" link

Extract the downloaded zip file inside the dataset_1 folder.
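For example, assuming the downloaded file is called gnps_job.zip (the real filename will differ for your job), it could be extracted from a terminal with:

unzip gnps_job.zip -d ~/nplinker_shared/dataset_1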

Some of the files/folders generated by GNPS are not used by NPLinker, but can safely be left in place. The files and folders which NPLinker expects to find are listed below. Entries such as "*.tsv" indicate that NPLinker will load any file with a ".tsv" extension; the exact filename is not important.

  • clusterinfosummarygroup_attributes_withIDs_withcomponentID/*.clustersummary OR clusterinfo_summary/*.tsv
  • networkedges_selfloop/*.selfloop
  • *.mgf OR spectra/*.mgf
  • (optional, not in all workflows) metadata_table/metadata_table*.txt
  • (optional, not in all workflows) quantification_table_reformatted/*.csv
  • (optional) DB_result/*.tsv OR result_specnets_DB/*.tsv
  • (optional) params.xml
  • (optional) for using NPClassScore, you may want to provide CANOPUS output if you have already calculated it (based on the molecular networking MGF file), and/or MolNetEnhancer output downloaded from GNPS. CANOPUS output can also be generated by including run_canopus = true under the [docker] section in the .toml file (see the example after this list). CANOPUS and MolNetEnhancer output should be placed in folders called 'canopus' and 'molnetenhancer', respectively.
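For example, to have NPLinker generate the CANOPUS output itself, the configuration file from step 3 could be extended with a [docker] section (a minimal sketch; the [dataset] part is unchanged):

[dataset]
root = "/data/dataset_1"

[docker]
run_canopus = true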

4.2 Genomics data

On the genomics side, NPLinker requires at minimum a folder of antiSMASH .gbk files. These may be in a single flat folder or in subfolders. Simply create an "antismash" folder inside dataset_1 and copy/move these files into that folder.
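For example, if your antiSMASH results are currently in a folder called my_antismash_output (a placeholder name), they could be copied with:

mkdir ~/nplinker_shared/dataset_1/antismash
cp -r my_antismash_output/* ~/nplinker_shared/dataset_1/antismash/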

BiG-SCAPE files are not required, but if you already have them available create a new "bigscape" folder inside dataset_1 and copy/move them there.

If BiG-SCAPE files are not available, BiG-SCAPE will be run during the NPLinker loading process and the results stored in the same location (this will only happen once per dataset).

At this stage, you should have a dataset_1 folder which looks something like this:
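(This is only an illustrative layout; the exact GNPS filenames and folders will vary depending on the workflow used.)

dataset_1/
├── antismash/                      (.gbk files, optionally in subfolders)
├── bigscape/                       (optional)
├── clusterinfo_summary/*.tsv
├── networkedges_selfloop/*.selfloop
├── spectra/*.mgf
├── DB_result/*.tsv                 (optional)
└── params.xml                      (optional)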

5. Creating strain mappings

The next key step is to create a set of strain mappings so that NPLinker can correctly identify strains across the genomics and metabolomics data. To supply these mappings to the application, begin by creating a file called "strain_mappings.csv" in the dataset_1 folder. For more details see here.

Open the file in a text editor. You should add a single line for each strain in the dataset. The first column of each line should contain the most relevant/useful strain label. Each subsequent column should contain the other labels used to refer to the same strain throughout the dataset (NOTE: the number of columns on each line does NOT need to be consistent).

Here is a trivial example:

strain1,strain1A,strain1.B,strain1_C,strainONE
strain2,strainTWO,strainTWO_
strain3

Taking this line by line, it is saying:

  • strain1 is also known as strain1A, strain1.B, strain1_C, and strainONE. NPLinker will therefore treat any instances of the latter 4 labels as equivalent to strain1
  • strain2 is also known as strainTWO and strainTWO_, so again NPLinker will treat these as equivalent
  • strain3 is only known as strain3 in the dataset; it has no other labels

When NPLinker loads your data, it will warn if it encounters any strains that don't appear in the set of mappings you supply. To make it easier to determine if there are missing mappings, NPLinker generates a pair of CSV files called "unknown_strains_met.csv" and "unknown_strains_gen.csv" each time it is executed. These files will be located inside the dataset_1 folder. Each contains a list of the unknown strain labels found in the metabolomics and genomics data.

NOTE: it may also be necessary to adjust how NPLinker parses strain labels from BiG-SCAPE output. See this FAQ entry for more information.

6. Starting the NPLinker Docker image

At this point, you should be ready to run NPLinker. Assuming Docker is already installed, first make sure you have the latest version of the NPLinker image by running the command:

docker pull nlesc/nplinker:latest

To run NPLinker itself, use the command:

docker run --name webapp -p 5006:5006 -v your_shared_folder:/data:rw nlesc/nplinker

replacing "your_shared_folder" with the full path to your nplinker_shared folder. It should contain the dataset_1 folder and the nplinker.toml configuration file, which will be loaded automatically by the application if found in this location. For example, if your shared folder is "C:/Users/myusername/nplinker_shared", the Docker run command becomes:

docker run --name webapp -p 5006:5006 -v c:/Users/myusername/nplinker_shared:/data:rw nlesc/nplinker

(A brief explanation of the parameters used: --name tells Docker to assign the name webapp to this container, -p tells it to open a port so you can connect to the web app running inside, and -v tells Docker where your shared folder is located.)

7. Accessing the NPLinker web application

Depending on your dataset and hardware, the loading process may take anywhere from a few seconds to several minutes. When NPLinker loads a dataset for the first time it generates various data files that are cached for subsequent use, so loading the same dataset again will typically be significantly faster.

As NPLinker loads the dataset, it will display logging messages in the console/terminal window where the Docker container was started. If the loading process succeeds, the final few lines of output should be:

==========================
NPLinker server loading completed!
==========================

You should now be able to connect to the NPLinker web application by browsing to http://localhost:5006/nplinker.

For more information on the web application itself, see WebappUsage.

8. Stopping the web application

To force the Docker container to stop and remove it, run this command:

docker container rm -f webapp
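(If you would rather stop the container without removing it, so it can be restarted later with docker start webapp, the standard docker stop command can be used instead:)

docker stop webapp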