forked from kevinwylder/collegeswimming.com-Data-Analysis
-
Notifications
You must be signed in to change notification settings - Fork 0
MattDBailey/collegeswimming.com-Data-Analysis
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Written by Brad Beacham Adapted from code written by Kevin Wylder TL;DR - ####THIS NEEDS TO BE REWRITTEN WHEN DONE#### Set the parameters when you run the program. Run the following commands python get_swim_data.py python process_swim_data.py and look in the graphs directory for fun data. You can check the database for your own analysis, but you'd have to RTFM. Files: constants.py A python module can be used to control what is done in get_swim_data.py and process_swim_data.py. You can use it to choose what events, teams, genders, seasons, etc. are collected by get_swim_data.py. You can also change how calculations in process_swim_data are made. Global parameters - DATABASE_FILE_NAME: the output file name, it is where get_swim_data.py stores its data SEASON_LINE_MONTH: the first month of the year that starts off the season SEASON_LINE_DAY: the day of the month that corresponds to the first day of season URL's for pulling data: SWIMMER_URL = Base for URLs leading to individual swimmers pages SWIMMER_EVENT_URL = Base for URLs leading to dictionary of swimmer times in specific events ROSTER_URL = Base for URLs leading to a team's roster RESULTS_URL = Base for URLs leading to list of all meets a team competed in during a given season MEET_URL = Base for URLs going to webpage for a specific meet MEET_EVENT_URL = Base for URLs going to a webpage for specific event in a meet SPLASH_SPLITS_URL = Base for URLs going to webpage for split times in a relay event for a single relay team get_swim_data.py parameters - DEFAULT_EVENTS_TO_PULL: an array of events that will be searched for in each swimmer DEFAULT_GENDER: an array of which genders to search for. "M" and/or "F" DEFAULT_TEAMS_TO_PULL: an array of team names that will be searched for. You must input a full team/school name DEFAULT_YEAR_START: the starting year of this search DEFAULT_YEAR_END: the ending year of this search process_swim_data.py parameters - INDIVIDUAL_POINTS: A dictionary where the values are arrays of integers that correspond to the number of points a player would be awarded for placing in an individual event at a swim meet. The keys correspond to pools of different sizes RELAY_POINTS: A dictionary where values are arrays of integers corresponding to points a team is awarded for placing in a relay event at a swim meet. Keys correspond to pools of different sizes SCORER_LIMIT: A dictionary where values are arrays of integers corresponding to maximum number of players that can place in a single event from one team. First index of array is for individual events, second is for relays. Keys correspond to different pool sizes. get_swim_data.py A python module to create the database of swims. This is the main script of the project so you should definitely start by running this. It may take a while, and obviously requires an internet connection. The result of running this is a sqlite3 database in the working directory with the name set in constants.py process_swim_data.py A python module for processing data collected by get_swim_data. Contains functions for finding lineups used by teams at different meets. Can be used for calculating predicted score matrix between two teams each using any number of lineups. Important Structures: database structure The table "Swims" will hold all the swims pulled off collegeswimming.com. the columns of these tables in the database are as follows (in order) ------------------------------------------------------------------------------ | swimmer | team | time | scaled | meet_id | event | date | taper | snapshot | ------------------------------------------------------------------------------ swimmer: an int for the swimmer's ID on collegeswimming.com team: an int for the team's teamId on collegeswimming.com time: a float representing the number of seconds the swim took scaled: a float z score according to this event's season meet_id: an int for the ID of the meet this swim came from on collegeswimming.com event: a gender char followed by collegeswimming.com's event structure (ex: M150Y) date: the day of the swim represented by the number of seconds since unix epoch taper: was this a taper swim? We have to guess b/c it's not online. possible values: 0 - we haven't determined whether or not this was a taper swim (the initial value) 1 - this swim appears to be a taper swim (based off season times) 2 - this swim appears to not be a taper swim (based off season times) 3 - this is an outlier swim (more than 3 sd from their mean) snapshot: an integer that corresponds to when this row was added to the database. This is just to help control duplicates and have more info about the farming. The table "Swimmers" holds all swimmers pulled off collegeswimming.com, the columns of the table are as follows (in order) ---------------------------------------- | name | gender | swimmer_id | team_id | ---------------------------------------- name: String spelling out swimmer's name gender: Character "M" or "F" representing gender ("Male" or "Female") of swimmer swimmer_id: Integer ID that matches this swimmer. Uniquely identifies any swimmer. team_id: Integer ID that corresponds to the team this swimmer is from. Uniquely identifies any team. The "Teams" table matches Strings to team identification ints from collegeswimming.com. This table has the following columns: ------------- | name | id | ------------- name: A string representation of an entity (ie. "Kevin Wylder" or "UC San Diego") id: the integer that matches this entity (ex. "UC San Diego" has 121 as it's id) "Snapshots" is a table with information about the parameters of the search when this data was extracted. It has the columns: ------------------------------------ | snapshot | date | teams | events | ------------------------------------ snapshot: a integer for this pull that will match many rows in the event tables date: a human readable string displaying specified time range for this pull (ex: "2014.9.15-2015.9.15") teams: a string list of teamIds specified in this pull, separated by "," (ex: "121,122,123,125") events: a string list of event table names in this pull, separated by "," (ex: "M150Y,F150Y,M4100Y,F4100Y") collegeswimming.com event structure the definition an event is as follows ABBBBC with characters representing A: The stroke id number. An integer between 1 and 5 1 - Freestyle 2 - Backstroke 3 - Breaststroke 4 - Butterfly 5 - Individual Medley F - Freestyle Relay M - Medley Relay B: The distance of the event. While it shows there being 4 B's above, there is no specification, it can be from 2 to 4 digits C: The pool type Y - short course yards L - long course meters S - I think short course meters, but I've never seen or tested this examples: 150Y - 50 Yard Freestyle 4100Y - 100 Yard Butterfly 3200L - 200 Meter Breaststroke collegeswimming.com url structure There are two urls we'll use from the website. The first is a list of the team roster. http://www.collegeswimming.com/team/{A}/mod/teamroster?season={B}&gender={C} This url returns an HTML page, which must be scraped for swimmer names and ids. The curly braces "{}" are omitted and replaced with A: The team's id number. This is an index on the website. You can manually look this up by searching for a team and looking at the number in the url B: The range of years to get swimmers. In the format SSSS-EEEE SSSS - the starting season year EEEE - the ending season year Using the selected dropdown spinner on the website, you're only allowed to do two consecutive years (ex: 2014-2015), but strings are allowed to have unlimited range C: The gender of the roster. A 1 char representation. M - Men's team roster F - Women's team roster The second url we'll use is for a specific swimmer's event http://www.collegeswimming.com/swimmer/{A}/times/byeventid/{B} This url returns all the swims of the specified event json encoded. The curly braces "{}" are omitted and replaced with A: The swimmer's id number. This should be retreived from the team roster B: the event structure. This is defined above as "collegeswimming.com event structure" A note from Brad: All of the R/analysis files were deleted on June 21st, 2019. They can be found in any of the branches that predate brad-task-014.
About
A python script to pull data from collegeswimming.com and perform data analysis on selected teams
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published
Languages
- Jupyter Notebook 54.2%
- HTML 39.4%
- Python 6.4%