This was my first contact with web scraping: I scraped a website called Affluent, and I learned a few things worth considering before reaching for Puppeteer or any other automated browser:
- Always check whether the site exposes a public API we can use.
- If there is no public API, check whether we can reverse engineer the site's own network calls and use its internal API.
- If neither works, try plain HTTP requests first: a request uses far fewer resources than Puppeteer, Selenium, or similar tools.
- Only when there is no API, reverse engineering is not possible, and plain requests fail because the site renders its content with JavaScript, fall back to an automated browser.
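The last two points boil down to one question: does the data already appear in the raw HTML, or is it rendered client-side? A minimal sketch of that check, with a hypothetical `needsBrowser` helper (not part of this project) and made-up HTML samples:

```javascript
// Hypothetical helper: decide whether a page needs a headless browser.
// If the data we want is already in the raw HTML returned by a plain HTTP
// request, requests are enough; if the page is an empty shell filled in by
// client-side JavaScript, we need Puppeteer.
function needsBrowser(rawHtml, expectedText) {
  return !rawHtml.includes(expectedText);
}

// Static page: the data is present in the HTML itself.
const staticHtml = '<table><tr><td>revenue: $120</td></tr></table>';
// JS-rendered shell: the data only appears after scripts run.
const shellHtml = '<div id="app"></div><script src="bundle.js"></script>';

console.log(needsBrowser(staticHtml, 'revenue')); // false
console.log(needsBrowser(shellHtml, 'revenue'));  // true
```

In practice you would fetch the page once with a plain request, run a check like this against the data you expect, and only reach for Puppeteer when the static HTML comes back empty.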
Ensure you have the following installed on your local machine:
- Make sure you have [Node.js](https://nodejs.org/) installed.
- Clone the repository and install its dependencies:

  ```
  git clone https://github.com/GOlmedoFormosa/node-scraping
  cd node-scraping
  npm install
  ```
- Create/configure a `.env` file with your credentials. A sample `.env.example` file has been provided to get you started: make a duplicate of `.env.example`, rename it to `.env`, then configure your credentials.
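The variable names below are assumptions for illustration only (check the repo's actual `.env.example` for the real keys); the project needs MySQL credentials and, presumably, a login for the site being scraped:

```
DB_HOST=localhost
DB_USER=root
DB_PASSWORD=your-db-password
DB_NAME=scraping
SITE_USER=you@example.com
SITE_PASSWORD=your-site-password
```

Keep `.env` out of version control; only `.env.example` (with placeholder values) should be committed.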
- Run `npm run watch` to start the server and watch for changes.
- Open your browser and go to `localhost:8080`.
- Run `npm run processUsers`. This uses request-promise to fetch user data and store it in the MySQL database.
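The real request-promise call and the MySQL insert need live credentials, so here is only a sketch of the middle step: mapping a fetched API user object into a flat row ready to insert. The field names (`id`, `name`, `email`) are my assumption, not the project's actual schema:

```javascript
// Hypothetical mapper: shape one user object from the API response into a
// row for MySQL. In the project, the input comes from a request-promise
// call and the output goes into an INSERT statement.
function toUserRow(apiUser) {
  return {
    id: apiUser.id,
    name: apiUser.name,
    // Normalize the email so lookups in the database are case-insensitive.
    email: (apiUser.email || '').toLowerCase(),
  };
}

const sample = { id: 7, name: 'Ada', email: 'ADA@Example.com' };
console.log(toUserRow(sample)); // { id: 7, name: 'Ada', email: 'ada@example.com' }
```

Keeping the mapping in a pure function like this makes the fetch-and-store script easy to test without a network connection or a database.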
- Run `npm run processScraping`. This runs Puppeteer, gets the data, and stores it in the MySQL database.
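The scraping step looks roughly like the sketch below. The URL handling, the `#stats-table` selector, and the tab-separated row format are all assumptions for illustration; only the overall Puppeteer flow (launch, navigate, extract rendered text, close) reflects what the script does:

```javascript
// Sketch of a Puppeteer scrape, assuming the dashboard renders a stats
// table client-side. Puppeteer is loaded lazily so the pure helper below
// can be used and tested without the browser installed.
async function scrapeStats(url) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // networkidle2 waits until the page has (mostly) finished loading,
  // giving client-side JavaScript time to render the data.
  await page.goto(url, { waitUntil: 'networkidle2' });
  const text = await page.$eval('#stats-table', (el) => el.innerText);
  await browser.close();
  return parseRows(text);
}

// Pure helper: turn "name<TAB>clicks" lines of rendered table text into
// row objects ready to insert into MySQL.
function parseRows(text) {
  return text
    .trim()
    .split('\n')
    .map((line) => {
      const [name, clicks] = line.split('\t');
      return { name, clicks: Number(clicks) };
    });
}

console.log(parseRows('campaignA\t12\ncampaignB\t30'));
```

Separating the browser automation (`scrapeStats`) from the parsing (`parseRows`) keeps the fragile part small: if the site's markup changes, only the selector and the parser need updating.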