Universal Webscraper is a tool for extracting data points such as company websites, descriptions, founders, emails, and addresses from the web, given an entity name. It accepts a CSV file with a column named `Entity` listing the entities to look up, and retrieves the requested data points for each one.
- Jina AI for scraping
- Tavily AI for internet search
You can optionally switch to FireCrawl as needed.
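For orientation, here is a minimal sketch of how these two services are commonly called from Python, using the `tavily-python` client and the public Jina Reader endpoint. This is an illustration only; the repo's actual LangGraph wiring may differ, and the query string is a placeholder:

```python
import requests
from tavily import TavilyClient  # pip install tavily-python
import os

# Find candidate pages about an entity with Tavily web search.
client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
results = client.search("Example Corp official website")

# Fetch the top hit as LLM-friendly markdown through the Jina Reader
# endpoint: prefixing any URL with https://r.jina.ai/ returns cleaned text
# (works without an API key at a reduced rate limit).
top_url = results["results"][0]["url"]
page = requests.get(f"https://r.jina.ai/{top_url}", timeout=30)
print(page.text[:500])
```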
- Clone this Repository:
  ```bash
  git clone https://github.com/jayaraj/universal-scraper-langgraph.git
  cd universal-scraper-langgraph
  ```
- Install Poetry & Create Environment:
- Install Poetry if you haven’t already:
  ```bash
  pip install poetry
  ```
- Install dependencies and activate the virtual environment:
  ```bash
  poetry install --no-root
  poetry shell
  ```
- Create a .env File:
- Obtain your API keys for OpenAI, Tavily AI, and FireCrawl.
- Update the .env file with your API keys:
  ```env
  OPENAI_API_KEY="xxxxxxxxxxxxxxxxxxx"
  TAVILY_API_KEY="xxxxxxxxxxxxxxxxxxx"
  FIRECRAWL_API_KEY="xxxxxxxxxxxxxxxxxxx"
  ```
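Assuming the app reads these values with python-dotenv (check app.py for the actual mechanism), the keys become available through the process environment:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads KEY="value" pairs from .env into the environment
openai_key = os.getenv("OPENAI_API_KEY")
assert openai_key, "OPENAI_API_KEY missing - check your .env file"
```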
- Prepare Input File:
- Update input.csv with the entity names you want to search, one per row under the `Entity` column; see the sample below.
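  A minimal input.csv looks like this (the `Entity` header is required; the company names are just placeholders):

  ```csv
  Entity
  OpenAI
  Anthropic
  Hugging Face
  ```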
Run the scraper with the following commands:
- Default run:
  ```bash
  python app.py
  ```
- Specify an input file:
  ```bash
  python app.py -f ./input.csv
  ```
  or
  ```bash
  python app.py --file ./input.csv
  ```
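For reference, `-f/--file` behaves like a standard argparse flag with `./input.csv` as the default. A minimal sketch of that pattern follows; it is not necessarily the repo's exact code, and the description string is assumed:

```python
import argparse

# Typical argparse setup for an optional -f/--file flag with a default.
parser = argparse.ArgumentParser(description="Universal Webscraper")
parser.add_argument(
    "-f", "--file",
    default="./input.csv",  # matches the default run: python app.py
    help="Path to the input CSV containing an 'Entity' column",
)
args = parser.parse_args()
print(f"Reading entities from {args.file}")
```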