TextCorpusLabs/NJGovNews

New Jersey Government News

Python MIT license

Scrape news feeds from the New Jersey government

Punch List

Known bugs before v1.0

  • GitHub Install
  • Encoding

Operation

Install

You can install the package using the following steps:

  1. pip install using an admin prompt
    pip uninstall NJGovNews
    pip install -v git+https://github.com/TextCorpusLabs/NJGovNews.git
    

Run

You can run the package as follows:

 NJGovNews SITE -out FILE_OUT

The scraper currently supports the following SITEs:

  1. The Department of the Treasury, e.g. NJGovNews treasury -out "c:/data/news/nj_treasury.csv"
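
If you want to drive the scraper from a script instead of typing the command by hand, the call above can be wrapped roughly as follows. This is only a sketch: build_command and run_scraper are hypothetical helper names, not part of the package, and creating the output folder beforehand is an assumption about what the scraper expects.

```python
import subprocess
from pathlib import Path


def build_command(site: str, out_file: str) -> list[str]:
    """Return the argument list for one run: NJGovNews SITE -out FILE_OUT."""
    return ["NJGovNews", site, "-out", out_file]


def run_scraper(site: str, out_file: str) -> None:
    """Run the installed NJGovNews command for one site.

    Assumption: the output folder should exist before the scraper starts,
    so it is created here if it is missing.
    """
    Path(out_file).parent.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the command exits with an error.
    subprocess.run(build_command(site, out_file), check=True)
```

For example, run_scraper("treasury", "c:/data/news/nj_treasury.csv") would issue the same command as the Treasury example above.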

Cache

This scraper uses requests-cache to improve performance. To force a full reload of all the data, delete the file named 'SITE.cache.sqlite'. It is in the same folder as the .csv the scraper created.
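
Clearing the cache can also be scripted. The sketch below assumes that SITE in the cache file name stands for the site name given on the command line (for example treasury.cache.sqlite); cache_path and clear_cache are hypothetical helpers, not functions the package provides.

```python
from pathlib import Path


def cache_path(csv_file, site: str) -> Path:
    """Locate the cache file that sits next to the scraper's .csv output.

    Assumption: the file is named '<site>.cache.sqlite'.
    """
    return Path(csv_file).parent / f"{site}.cache.sqlite"


def clear_cache(csv_file, site: str) -> bool:
    """Delete the cache file if it exists; return True if a file was removed."""
    path = cache_path(csv_file, site)
    if path.exists():
        path.unlink()
        return True
    return False
```

Calling clear_cache("c:/data/news/nj_treasury.csv", "treasury") before the next run should then force the full reload described above.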

Development

Prerequisites

You can install the package for development using the following steps:

Note: You can replace steps 1-3 with the VS Code "Git: Clone" command

  1. Download the project from GitHub
    • Click the green "Code" button on the right. Select "Download Zip"
  2. Remove zip protections by right-clicking on the file, selecting properties, and checking "security: unblock"
  3. Unzip the folder. I recommend using the folder c:/repos/TextCorpusLabs/NJGovNews
  4. Run pip's editable install using an admin prompt
    pip uninstall NJGovNews
    pip install -v -e c:/repos/TextCorpusLabs/NJGovNews
    
  5. Install the nltk add-ons using an admin prompt
    python -c "import nltk;nltk.download('punkt')"
    
