GitHub - MattDBailey/collegeswimming.com-Data-Analysis: A python script to pull data from collegeswimming.com and perform data analysis on selected teams

MattDBailey / collegeswimming.com-Data-Analysis Public

forked from kevinwylder/collegeswimming.com-Data-Analysis

Notifications You must be signed in to change notification settings
Fork 0
Star 0

A python script to pull data from collegeswimming.com and perform data analysis on selected teams

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
.vscode		.vscode
.gitignore		.gitignore
AssignmentHeatmap_12.png		AssignmentHeatmap_12.png
BucknellPerf.csv		BucknellPerf.csv
BucknellPerf_2.csv		BucknellPerf_2.csv
BucknellPerf_SwimCloud18_19.csv		BucknellPerf_SwimCloud18_19.csv
BucknellPerfwNAs.csv		BucknellPerfwNAs.csv
BucknellPerfwNAs_GreedyRank.csv		BucknellPerfwNAs_GreedyRank.csv
Bucknell_1_Lineup.csv		Bucknell_1_Lineup.csv
Bucknell_1_Lineup_old.csv		Bucknell_1_Lineup_old.csv
Bucknell_Lehigh_Team_Data.csv		Bucknell_Lehigh_Team_Data.csv
Bucknell_Lineup_Greedy.csv		Bucknell_Lineup_Greedy.csv
Bucknell_Predicted_Performance.xlsx		Bucknell_Predicted_Performance.xlsx
Bucknell_SMI.csv		Bucknell_SMI.csv
FieldNamesRelationToMeetopt.xlsx		FieldNamesRelationToMeetopt.xlsx
GameConvergence_12LUs.png		GameConvergence_12LUs.png
GameConvergence_12LUs_hotstart.png		GameConvergence_12LUs_hotstart.png
LehighPerf.csv		LehighPerf.csv
LehighPerf_2.csv		LehighPerf_2.csv
LehighPerfwNAs.csv		LehighPerfwNAs.csv
Lehigh_1_Lineup.csv		Lehigh_1_Lineup.csv
Lehigh_1_Lineup_old.csv		Lehigh_1_Lineup_old.csv
Lehigh_Lineup_Greedy.csv		Lehigh_Lineup_Greedy.csv
MeetOptPythonCode_ConvertToFunction_2_20_20.py		MeetOptPythonCode_ConvertToFunction_2_20_20.py
MeetOptPythonCode_mdbEdit_7_23_2019.py		MeetOptPythonCode_mdbEdit_7_23_2019.py
NCAA Swim Meet Order of Events.pdf		NCAA Swim Meet Order of Events.pdf
NCAA Swim Rules.pdf		NCAA Swim Rules.pdf
Payoff3x4.png		Payoff3x4.png
Payoff4x4.png		Payoff4x4.png
Payoff_GreedyStart9x9.png		Payoff_GreedyStart9x9.png
README		README
SMI_Chart_5_27_2022.png		SMI_Chart_5_27_2022.png
SimpleGame.py		SimpleGame.py
SwimGameTheoryProject.vsdx		SwimGameTheoryProject.vsdx
SwimMeetOpt.lp		SwimMeetOpt.lp
SwimmerNames.csv		SwimmerNames.csv
SwimmingGame.html		SwimmingGame.html
SwimmingGame.ipynb		SwimmingGame.ipynb
SwimmingGame.zip		SwimmingGame.zip
SwimmingGame_old.ipynb		SwimmingGame_old.ipynb
SwimmingGame_pdf_05_27_22.pdf		SwimmingGame_pdf_05_27_22.pdf
Variables_and_Assumptions.xlsx		Variables_and_Assumptions.xlsx
collegeswimming.db		collegeswimming.db
collegeswimming_11_8_19.db		collegeswimming_11_8_19.db
collegeswimming_11_8_19.db-journal		collegeswimming_11_8_19.db-journal
collegeswimming_test.db		collegeswimming_test.db
constants.py		constants.py
get_swim_data.py		get_swim_data.py
helperfunctions.py		helperfunctions.py
process_swim_data.py		process_swim_data.py
team_dict.py		team_dict.py
team_dict_generator.py		team_dict_generator.py
~$FieldNamesRelationToMeetopt.xlsx		~$FieldNamesRelationToMeetopt.xlsx

Repository files navigation

Written by Brad Beacham
Adapted from code written by Kevin Wylder

TL;DR - ####THIS NEEDS TO BE REWRITTEN WHEN DONE####
Set the parameters when you run the program. Run the following commands
python get_swim_data.py
python process_swim_data.py
and look in the graphs directory for fun data.
You can check the database for your own analysis, but you'd have to RTFM.

Files:
constants.py
A python module can be used to control what is done in get_swim_data.py and process_swim_data.py. You can use it to
choose what events, teams, genders, seasons, etc. are collected by get_swim_data.py. You can also change how
calculations in process_swim_data are made.
Global parameters -
DATABASE_FILE_NAME: the output file name, it is where get_swim_data.py stores its data
SEASON_LINE_MONTH: the first month of the year that starts off the season
SEASON_LINE_DAY: the day of the month that corresponds to the first day of season
URL's for pulling data:
SWIMMER_URL = Base for URLs leading to individual swimmers pages
SWIMMER_EVENT_URL = Base for URLs leading to dictionary of swimmer times in specific events
ROSTER_URL = Base for URLs leading to a team's roster
RESULTS_URL = Base for URLs leading to list of all meets a team competed in during a given season
MEET_URL = Base for URLs going to webpage for a specific meet
MEET_EVENT_URL = Base for URLs going to a webpage for specific event in a meet
SPLASH_SPLITS_URL = Base for URLs going to webpage for split times in a relay event for a single relay team
get_swim_data.py parameters -
DEFAULT_EVENTS_TO_PULL: an array of events that will be searched for in each swimmer
DEFAULT_GENDER: an array of which genders to search for. "M" and/or "F"
DEFAULT_TEAMS_TO_PULL: an array of team names that will be searched for. You must input a full team/school name
DEFAULT_YEAR_START: the starting year of this search
DEFAULT_YEAR_END: the ending year of this search
process_swim_data.py parameters -
INDIVIDUAL_POINTS: A dictionary where the values are arrays of integers that correspond to the number of points
a player would be awarded for placing in an individual event at a swim meet. The keys correspond to pools
of different sizes
RELAY_POINTS: A dictionary where values are arrays of integers corresponding to points a team is awarded for
placing in a relay event at a swim meet. Keys correspond to pools of different sizes
SCORER_LIMIT: A dictionary where values are arrays of integers corresponding to maximum number of players that
can place in a single event from one team. First index of array is for individual events, second is for
relays. Keys correspond to different pool sizes.

get_swim_data.py
A python module to create the database of swims. This is the main script of the
project so you should definitely start by running this. It may take a while, and
obviously requires an internet connection. The result of running this is a sqlite3
database in the working directory with the name set in constants.py

process_swim_data.py
A python module for processing data collected by get_swim_data. Contains functions for finding lineups used by
teams at different meets. Can be used for calculating predicted score matrix between two teams each using any number
of lineups.

Important Structures:
database structure
The table "Swims" will hold all the swims pulled off collegeswimming.com.
the columns of these tables in the database are as follows (in order)
------------------------------------------------------------------------------
| swimmer | team | time | scaled | meet_id | event | date | taper | snapshot |
------------------------------------------------------------------------------
swimmer: an int for the swimmer's ID on collegeswimming.com
team: an int for the team's teamId on collegeswimming.com
time: a float representing the number of seconds the swim took
scaled: a float z score according to this event's season
meet_id: an int for the ID of the meet this swim came from on collegeswimming.com
event: a gender char followed by collegeswimming.com's event structure (ex: M150Y)
date: the day of the swim represented by the number of seconds since unix epoch
taper: was this a taper swim? We have to guess b/c it's not online. possible values:
0 - we haven't determined whether or not this was a taper swim (the initial value)
1 - this swim appears to be a taper swim (based off season times)
2 - this swim appears to not be a taper swim (based off season times)
3 - this is an outlier swim (more than 3 sd from their mean)
snapshot: an integer that corresponds to when this row was added to the database. This
is just to help control duplicates and have more info about the farming.

The table "Swimmers" holds all swimmers pulled off collegeswimming.com,
the columns of the table are as follows (in order)
----------------------------------------
| name | gender | swimmer_id | team_id |
----------------------------------------
name: String spelling out swimmer's name
gender: Character "M" or "F" representing gender ("Male" or "Female") of swimmer
swimmer_id: Integer ID that matches this swimmer. Uniquely identifies any swimmer.
team_id: Integer ID that corresponds to the team this swimmer is from. Uniquely identifies
any team.

The "Teams" table matches Strings to team identification ints from collegeswimming.com.
This table has the following columns:
-------------
| name | id |
-------------
name: A string representation of an entity (ie. "Kevin Wylder" or "UC San Diego")
id: the integer that matches this entity (ex. "UC San Diego" has 121 as it's id)

"Snapshots" is a table with information about the parameters of the search when this
data was extracted. It has the columns:
------------------------------------
| snapshot | date | teams | events |
------------------------------------
snapshot: a integer for this pull that will match many rows in the event tables
date: a human readable string displaying specified time range for this pull
(ex: "2014.9.15-2015.9.15")
teams: a string list of teamIds specified in this pull, separated by ","
(ex: "121,122,123,125")
events: a string list of event table names in this pull, separated by ","
(ex: "M150Y,F150Y,M4100Y,F4100Y")

collegeswimming.com event structure
the definition an event is as follows
ABBBBC
with characters representing
A: The stroke id number. An integer between 1 and 5
1 - Freestyle
2 - Backstroke
3 - Breaststroke
4 - Butterfly
5 - Individual Medley
F - Freestyle Relay
M - Medley Relay
B: The distance of the event. While it shows there being 4 B's above, there is no
specification, it can be from 2 to 4 digits
C: The pool type
Y - short course yards
L - long course meters
S - I think short course meters, but I've never seen or tested this
examples:
150Y - 50 Yard Freestyle
4100Y - 100 Yard Butterfly
3200L - 200 Meter Breaststroke

collegeswimming.com url structure
There are two urls we'll use from the website. The first is a list of the team roster.
http://www.collegeswimming.com/team/{A}/mod/teamroster?season={B}&gender={C}
This url returns an HTML page, which must be scraped for swimmer names and ids.
The curly braces "{}" are omitted and replaced with
A: The team's id number. This is an index on the website. You can manually look this
up by searching for a team and looking at the number in the url
B: The range of years to get swimmers. In the format
SSSS-EEEE
SSSS - the starting season year
EEEE - the ending season year
Using the selected dropdown spinner on the website, you're only allowed to do two
consecutive years (ex: 2014-2015), but strings are allowed to have unlimited range
C: The gender of the roster. A 1 char representation.
M - Men's team roster
F - Women's team roster

The second url we'll use is for a specific swimmer's event
http://www.collegeswimming.com/swimmer/{A}/times/byeventid/{B}
This url returns all the swims of the specified event json encoded.
The curly braces "{}" are omitted and replaced with
A: The swimmer's id number. This should be retreived from the team roster
B: the event structure. This is defined above as "collegeswimming.com event structure"

A note from Brad:
All of the R/analysis files were deleted on June 21st, 2019. They can be found in any of
the branches that predate brad-task-014.