scraper

This repository contains our scraping code. The scraper pulls company names from /r/cscareerquestions and crawls the web for positions at those companies. We are also actively looking for other reliable sources of company names.

Installation

Make sure the following dependencies have been installed on your system.

Docker

You will also need to place a valid hibernate.cfg.xml file in the src/main/resources folder. This file is responsible for providing SQL database connection details, enabling the scraper to read and write companies/positions. Please see src/main/resources/hibernate.cfg.xml.example for an example.
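For orientation, a minimal sketch of what such a file can look like is shown here. It uses the standard Hibernate 3.0 configuration DTD and the usual `hibernate.connection.*` properties. The MySQL driver, dialect, host, port, database name, and credentials are all placeholders assumed for illustration, not values taken from this repository. The bundled hibernate.cfg.xml.example remains the authoritative template, and it may contain further entries (such as entity mappings) that this sketch omits.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE hibernate-configuration PUBLIC
    "-//Hibernate/Hibernate Configuration DTD 3.0//EN"
    "http://www.hibernate.org/dtd/hibernate-configuration-3.0.dtd">
<hibernate-configuration>
  <session-factory>
    <!-- ASSUMPTION: a MySQL database; substitute the driver for your own SQL database -->
    <property name="hibernate.connection.driver_class">com.mysql.cj.jdbc.Driver</property>
    <!-- ASSUMPTION: hypothetical host, port, and database name -->
    <property name="hibernate.connection.url">jdbc:mysql://localhost:3306/scraper</property>
    <!-- Placeholder credentials; replace with your own -->
    <property name="hibernate.connection.username">YOUR_USERNAME</property>
    <property name="hibernate.connection.password">YOUR_PASSWORD</property>
    <!-- ASSUMPTION: dialect matching the MySQL driver above -->
    <property name="hibernate.dialect">org.hibernate.dialect.MySQLDialect</property>
  </session-factory>
</hibernate-configuration>
```

If the scraper runs inside Docker while the database runs on the host machine, `localhost` in the connection URL would refer to the container itself, so the host name will likely need adjusting for your setup.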

Usage

The following commands are assumed to be run from the root of the repository directory.

To fetch all companies and save them to the database, ignoring duplicates, use:

scripts/start_docker.sh -c

To fetch all positions for every company already in the database and save them, ignoring duplicates, use:

scripts/start_docker.sh -p