IJSEM-webscraper

A Jupyter notebook web scraper (Selenium and BeautifulSoup) that takes the content of the weekly HTML email sent by IJSEM, opens the link to each publication, and extracts the organism name, accession numbers, and type strain information. These are then compared with what NCBI Taxonomy holds, using srcchk. NOTE: srcchk is an internal NCBI tool, and Linux is required to access it. The outputs are compared in pandas to find differing organism names. The final output is an Excel file to support taxonomy updates. Organism names, strain names, and accessions are extracted from the species description for each species described in the paper. If strains are not found, they are left blank in the dataframe; these cases require manual inspection. Ongoing work aims to make the strain matching more accurate. The notebook may eventually be converted to a standard .py script executable from the Linux command line.
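The steps described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it uses only the Python standard library (`html.parser` and `re`) as a stand-in for Selenium, BeautifulSoup, and pandas, and the names `LinkCollector`, `extract_links`, `extract_accessions`, `find_name_mismatches`, and the accession pattern are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html_text):
    """Return every link found in the saved email HTML, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    return parser.links


# Assumed, simplified accession shapes: 1-2 letters + 5-6 digits, or
# 4-6 letters + 8-10 digits, each with an optional ".version" suffix.
# Real accession formats are more varied than this pattern covers.
ACCESSION_RE = re.compile(
    r"\b(?:[A-Z]{1,2}\d{5,6}|[A-Z]{4,6}\d{8,10})(?:\.\d+)?\b"
)


def extract_accessions(text):
    """Return accession-like tokens found in a species description."""
    return ACCESSION_RE.findall(text)


def find_name_mismatches(paper_names, ncbi_names):
    """Compare two accession -> organism name mappings.

    Returns (accession, paper_name, ncbi_name) tuples where the names differ.
    Accessions missing from ncbi_names are skipped here.
    """
    rows = []
    for accession, paper_name in paper_names.items():
        ncbi_name = ncbi_names.get(accession)
        if ncbi_name is not None and ncbi_name != paper_name:
            rows.append((accession, paper_name, ncbi_name))
    return rows
```

In this sketch, the `ncbi_names` mapping would be filled from srcchk output, and the mismatch rows would be the ones written to the Excel file. The organism names passed in are whatever the caller supplies; nothing here queries NCBI.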

Repository files (11 commits):
IJSEMemail1.htm
IJSEMemailscraper.ipynb
README.md
Webscraper2.ipynb
selenium_test.py
selenium_webscraper-ver2.ipynb
selenium_webscraper-ver3.ipynb
selenium_webscraper-ver4.ipynb
selenium_webscraper.ipynb

mcveigh-h16/IJSEM-webscraper
