forked from kevinwylder/collegeswimming.com-Data-Analysis
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME
161 lines (146 loc) · 8.84 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
Written by Brad Beacham
Adapted from code written by Kevin Wylder
TL;DR - ####THIS NEEDS TO BE REWRITTEN WHEN DONE####
Set the parameters when you run the program. Run the following commands
python get_swim_data.py
python process_swim_data.py
and look in the graphs directory for fun data.
You can check the database for your own analysis, but you'd have to RTFM.
Files:
constants.py
A python module can be used to control what is done in get_swim_data.py and process_swim_data.py. You can use it to
choose what events, teams, genders, seasons, etc. are collected by get_swim_data.py. You can also change how
calculations in process_swim_data are made.
Global parameters -
DATABASE_FILE_NAME: the output file name, it is where get_swim_data.py stores its data
SEASON_LINE_MONTH: the first month of the year that starts off the season
SEASON_LINE_DAY: the day of the month that corresponds to the first day of season
URL's for pulling data:
SWIMMER_URL = Base for URLs leading to individual swimmers pages
SWIMMER_EVENT_URL = Base for URLs leading to dictionary of swimmer times in specific events
ROSTER_URL = Base for URLs leading to a team's roster
RESULTS_URL = Base for URLs leading to list of all meets a team competed in during a given season
MEET_URL = Base for URLs going to webpage for a specific meet
MEET_EVENT_URL = Base for URLs going to a webpage for specific event in a meet
SPLASH_SPLITS_URL = Base for URLs going to webpage for split times in a relay event for a single relay team
get_swim_data.py parameters -
DEFAULT_EVENTS_TO_PULL: an array of events that will be searched for in each swimmer
DEFAULT_GENDER: an array of which genders to search for. "M" and/or "F"
DEFAULT_TEAMS_TO_PULL: an array of team names that will be searched for. You must input a full team/school name
DEFAULT_YEAR_START: the starting year of this search
DEFAULT_YEAR_END: the ending year of this search
process_swim_data.py parameters -
INDIVIDUAL_POINTS: A dictionary where the values are arrays of integers that correspond to the number of points
a player would be awarded for placing in an individual event at a swim meet. The keys correspond to pools
of different sizes
RELAY_POINTS: A dictionary where values are arrays of integers corresponding to points a team is awarded for
placing in a relay event at a swim meet. Keys correspond to pools of different sizes
SCORER_LIMIT: A dictionary where values are arrays of integers corresponding to maximum number of players that
can place in a single event from one team. First index of array is for individual events, second is for
relays. Keys correspond to different pool sizes.
get_swim_data.py
A python module to create the database of swims. This is the main script of the
project so you should definitely start by running this. It may take a while, and
obviously requires an internet connection. The result of running this is a sqlite3
database in the working directory with the name set in constants.py
process_swim_data.py
A python module for processing data collected by get_swim_data. Contains functions for finding lineups used by
teams at different meets. Can be used for calculating predicted score matrix between two teams each using any number
of lineups.
Important Structures:
database structure
The table "Swims" will hold all the swims pulled off collegeswimming.com.
the columns of these tables in the database are as follows (in order)
------------------------------------------------------------------------------
| swimmer | team | time | scaled | meet_id | event | date | taper | snapshot |
------------------------------------------------------------------------------
swimmer: an int for the swimmer's ID on collegeswimming.com
team: an int for the team's teamId on collegeswimming.com
time: a float representing the number of seconds the swim took
scaled: a float z score according to this event's season
meet_id: an int for the ID of the meet this swim came from on collegeswimming.com
event: a gender char followed by collegeswimming.com's event structure (ex: M150Y)
date: the day of the swim represented by the number of seconds since unix epoch
taper: was this a taper swim? We have to guess b/c it's not online. possible values:
0 - we haven't determined whether or not this was a taper swim (the initial value)
1 - this swim appears to be a taper swim (based off season times)
2 - this swim appears to not be a taper swim (based off season times)
3 - this is an outlier swim (more than 3 sd from their mean)
snapshot: an integer that corresponds to when this row was added to the database. This
is just to help control duplicates and have more info about the farming.
The table "Swimmers" holds all swimmers pulled off collegeswimming.com,
the columns of the table are as follows (in order)
----------------------------------------
| name | gender | swimmer_id | team_id |
----------------------------------------
name: String spelling out swimmer's name
gender: Character "M" or "F" representing gender ("Male" or "Female") of swimmer
swimmer_id: Integer ID that matches this swimmer. Uniquely identifies any swimmer.
team_id: Integer ID that corresponds to the team this swimmer is from. Uniquely identifies
any team.
The "Teams" table matches Strings to team identification ints from collegeswimming.com.
This table has the following columns:
-------------
| name | id |
-------------
name: A string representation of an entity (ie. "Kevin Wylder" or "UC San Diego")
id: the integer that matches this entity (ex. "UC San Diego" has 121 as it's id)
"Snapshots" is a table with information about the parameters of the search when this
data was extracted. It has the columns:
------------------------------------
| snapshot | date | teams | events |
------------------------------------
snapshot: a integer for this pull that will match many rows in the event tables
date: a human readable string displaying specified time range for this pull
(ex: "2014.9.15-2015.9.15")
teams: a string list of teamIds specified in this pull, separated by ","
(ex: "121,122,123,125")
events: a string list of event table names in this pull, separated by ","
(ex: "M150Y,F150Y,M4100Y,F4100Y")
collegeswimming.com event structure
the definition an event is as follows
ABBBBC
with characters representing
A: The stroke id number. An integer between 1 and 5
1 - Freestyle
2 - Backstroke
3 - Breaststroke
4 - Butterfly
5 - Individual Medley
F - Freestyle Relay
M - Medley Relay
B: The distance of the event. While it shows there being 4 B's above, there is no
specification, it can be from 2 to 4 digits
C: The pool type
Y - short course yards
L - long course meters
S - I think short course meters, but I've never seen or tested this
examples:
150Y - 50 Yard Freestyle
4100Y - 100 Yard Butterfly
3200L - 200 Meter Breaststroke
collegeswimming.com url structure
There are two urls we'll use from the website. The first is a list of the team roster.
http://www.collegeswimming.com/team/{A}/mod/teamroster?season={B}&gender={C}
This url returns an HTML page, which must be scraped for swimmer names and ids.
The curly braces "{}" are omitted and replaced with
A: The team's id number. This is an index on the website. You can manually look this
up by searching for a team and looking at the number in the url
B: The range of years to get swimmers. In the format
SSSS-EEEE
SSSS - the starting season year
EEEE - the ending season year
Using the selected dropdown spinner on the website, you're only allowed to do two
consecutive years (ex: 2014-2015), but strings are allowed to have unlimited range
C: The gender of the roster. A 1 char representation.
M - Men's team roster
F - Women's team roster
The second url we'll use is for a specific swimmer's event
http://www.collegeswimming.com/swimmer/{A}/times/byeventid/{B}
This url returns all the swims of the specified event json encoded.
The curly braces "{}" are omitted and replaced with
A: The swimmer's id number. This should be retreived from the team roster
B: the event structure. This is defined above as "collegeswimming.com event structure"
A note from Brad:
All of the R/analysis files were deleted on June 21st, 2019. They can be found in any of
the branches that predate brad-task-014.