From 3ee2bdb50a6738f7868bbdd1c9f3a3660281c49b Mon Sep 17 00:00:00 2001
From: Dominic Bennett
Date: Wed, 27 Jun 2018 16:37:10 +0200
Subject: [PATCH] paper update

---
 paper/paper.bib  |  2 +-
 paper/paper.html | 15 ++++++++-------
 paper/paper.md   |  5 +++--
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/paper/paper.bib b/paper/paper.bib
index 7baebe3..4555034 100644
--- a/paper/paper.bib
+++ b/paper/paper.bib
@@ -59,7 +59,7 @@ @online{restez_gh
}

@misc{restez_z,
- author = {{Zenodo}},
+ author = {{TBD}},
title = {restez: Create and Query a Local Copy of GenBank in R}
}

diff --git a/paper/paper.html b/paper/paper.html
index ad69848..eb3d95d 100644
--- a/paper/paper.html
+++ b/paper/paper.html
@@ -10,7 +10,7 @@
-
+
restez: Create and Query a Local Copy of GenBank in R
@@ -119,7 +119,7 @@

restez: Create and Query a Local Copy of GenBank in R

-

19 June 2018

+

27 June 2018

@@ -127,8 +127,9 @@

19 June 2018

Summary

Downloading sequences and sequence information from GenBank (Benson et al. 2013) and related NCBI databases is often performed via the NCBI API, Entrez (Ostell 2002). Entrez, however, has a limit on the number of requests, thus downloading large amounts of sequence data in this way can be inefficient. For situations where a large number of Entrez calls are made, downloading may take days, weeks or even months and could even result in a user’s IP address being blacklisted from the NCBI services due to server overload. Additionally, Entrez limits the number of entries that can be retrieved at once, requiring a user to develop code for querying in batches.

-

The restez package (Zenodo, n.d.) aims to make sequence retrieval more efficient by allowing a user to download the GenBank database, either in its entirety or in subsets, to their local machine and query this local database instead. This process is more time efficient as GenBank downloads are made via NCBI’s FTP server using compressed sequence files. With a good internet connection and a middle-of-the-road computer, a database comprising 7 GB of sequence information (i.e. the total sequence data available for Rodentia as of 27 June 2018) can be generated in less than 10 minutes.

-

restez outline Figure 1. The functions and file structure for downloading, setting up and querying a local copy of GenBank.`

+

The restez package (TBD, n.d.) aims to make sequence retrieval more efficient by allowing a user to download the GenBank database, either in its entirety or in subsets, to their local machine and query this local database instead. This process is more time efficient as GenBank downloads are made via NCBI’s FTP server using compressed sequence files. With a good internet connection and a middle-of-the-road computer, a database comprising 7 GB of sequence information (i.e. the total sequence data available for Rodentia as of 27 June 2018) can be generated in less than 10 minutes.

+

+

Figure 1. The functions and file structure for downloading, setting up and querying a local copy of GenBank.

Rentrez integration

rentrez (Winter 2017) is a popular R package for querying NCBI’s databases via Entrez in R. To maximize the compatibility of restez, we implemented wrapper functions with the same names and arguments as the rentrez equivalents. Whenever a wrapper function is called the local database copy is searched first. If IDs are missing in the local database a secondary call to Entrez is made via the internet. This allows for easy employment of restez in scripts and packages that are already using rentrez. At a minimum, a user currently using rentrez will only need to create a local, subset of the GenBank database and call restez instead of rentrez.

@@ -190,12 +191,12 @@

References

Ostell, J. 2002. “The Entrez search and retrieval system.” In The Ncbi Handbook, 1–6. http://www.ncbi.nlm.nih.gov/books/NBK21081/.

+
+

TBD. n.d. “Restez: Create and Query a Local Copy of Genbank in R.”

+

Winter, David J. 2017. “rentrez: An R package for the NCBI eUtils API.” The R Journal 9 (2): 520–26. doi:10.7287/peerj.preprints.3179v2.

-
-

Zenodo. n.d. “Restez: Create and Query a Local Copy of Genbank in R.”

-
-
diff --git a/paper/paper.md b/paper/paper.md
index a732642..e51a339 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -43,8 +43,9 @@ Downloading sequences and sequence information from GenBank [@Benson2013] and re

The `restez` package [@restez_z] aims to make sequence retrieval more efficient by allowing a user to download the GenBank database, either in its entirety or in subsets, to their local machine and query this local database instead. This process is more time efficient as GenBank downloads are made via NCBI’s FTP server using compressed sequence files. With a good internet connection and a middle-of-the-road computer, a database comprising 7 GB of sequence information (i.e. the total sequence data available for Rodentia as of 27 June 2018) can be generated in less than 10 minutes.

-![restez outline](outline.png)
-**Figure 1. The functions and file structure for downloading, setting up and querying a local copy of GenBank.`**
+
+
+**Figure 1. The functions and file structure for downloading, setting up and querying a local copy of GenBank.**

##Rentrez integration