Home
Guamaso edited this page Sep 24, 2014 · 3 revisions
The Cary has a page that displays show times, but the page 1) is not very readable, 2) does not make full use of external data (like IMDB), and 3) does not give others access to the data.
The Cary Project seeks to fix these issues by scraping the data, storing it, and displaying it in a much more useful way.
Primary:
- Collect data and keep it as accurate as possible, given the state the source data is in.
- Store The Cary's show times in a way that lets us provide a REST API for others to use freely.
- Provide a simple and responsive display of show times in a clear and filterable format.
Secondary:
- Use third-party APIs (like IMDB) to give users more information on events.
- Attempt to make the scraping tool as modular as possible, so it can be used on other projects simply by replacing the data source, data end point, and scraping modules.
- Provide tools for possible data analysis on past show times?
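One way to read the modularity goal is as a pipeline whose data source, scraping module, and data end point are each a swappable function. The following is only a minimal sketch under that assumption: `ScrapePipeline`, `fake_fetch`, `pipe_parse`, and the pipe-delimited sample input are all hypothetical names and data made up for illustration, not part of the project.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]  # hypothetical record shape


@dataclass
class ScrapePipeline:
    """Wires together three replaceable parts (all names are assumptions)."""
    fetch: Callable[[], str]                  # data source: returns raw page text
    parse: Callable[[str], Iterable[Record]]  # scraping module: raw text -> records
    store: Callable[[List[Record]], None]     # data end point: receives the records

    def run(self) -> int:
        raw = self.fetch()
        records = list(self.parse(raw))
        self.store(records)
        return len(records)


def fake_fetch() -> str:
    # Stand-in for an HTTP request; the content is invented sample data.
    return "Example Film A|2014-10-02 19:00\nExample Film B|2014-10-03 21:00"


def pipe_parse(raw: str) -> Iterable[Record]:
    # Stand-in parser for the invented "title|datetime" format above.
    for line in raw.splitlines():
        title, when = line.split("|", 1)
        yield {"title": title.strip(), "starts_at": when.strip()}


saved: List[Record] = []  # stand-in for a database or API end point
pipeline = ScrapePipeline(fetch=fake_fetch, parse=pipe_parse, store=saved.extend)
count = pipeline.run()
```

Reusing the tool on another project would then mean passing different `fetch`, `parse`, and `store` functions, with `run` left unchanged.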
The tasks below are very rough; as the project moves along, the descriptions and tasks can change.
- Proof of concept - Create a proof of concept that scrapes The Cary showtimes page, Cary's calendar page, or both, and produces a fairly consistent JSON file that we can then store in a database.
- Scraper Process - Refactor the proof of concept, or rebuild it, so we can scrape raw page data, parse it into a JSON object, validate it against expected data, and shape it into a valid JSON structure.
  - Scrape Module - Scrapes the page and stores the result as a raw flat file.
  - Parse Module - Picks up the flat files and attempts to pull out pertinent data. Saves the result as a JSON file.
  - Validate Module - Picks up the JSON file and attempts to validate dates, movie titles (using IMDB), categories, etc. Updates the JSON file.
  - Storage Module - Picks up the JSON file and saves it to the database.
- Server - Create a MongoDB server and add an API in order to easily save and pull data.
- Application - The actual front-end client app. Start simple: list the data in a more useful and filterable format.
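The parse and validate steps described in the tasks above could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's implementation: the raw line format, the `LINE_RE` pattern, the `KNOWN_CATEGORIES` set, and the output field names are all invented, since the real page layout is not described here. Title validation against IMDB is left out because it would need a network call.

```python
import json
import re
from datetime import datetime

# Hypothetical raw line format: "<date time> - <title> (<category>)".
LINE_RE = re.compile(
    r"^(?P<when>.+?)\s+-\s+(?P<title>.+?)\s+\((?P<category>[^)]+)\)$"
)
KNOWN_CATEGORIES = {"Film", "Live"}  # assumed category list


def parse_raw(raw):
    """Parse step: pull when/title/category out of each matching line."""
    records = []
    for line in raw.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            records.append(match.groupdict())
    return records


def validate(record):
    """Validate step: normalize the date and note problems instead of failing."""
    errors = []
    try:
        when = datetime.strptime(record["when"], "%b %d, %Y %I:%M %p")
        starts_at = when.isoformat()
    except ValueError:
        starts_at = None
        errors.append("unparseable date")
    if record["category"] not in KNOWN_CATEGORIES:
        errors.append("unknown category")
    return {
        "title": record["title"],
        "category": record["category"],
        "starts_at": starts_at,
        "errors": errors,
    }


# Invented sample input standing in for a raw flat file.
raw_page = """\
Oct 2, 2014 7:00 PM - Example Film A (Film)
sometime soon - Example Concert (Live)
"""

validated = [validate(r) for r in parse_raw(raw_page)]
print(json.dumps(validated, indent=2))
```

Keeping an `errors` list on each record, rather than dropping bad rows, lets the storage step decide what to do with entries whose dates or categories could not be confirmed.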