Skip to content

A python script to pull data from collegeswimming.com and perform data analysis on selected teams

Notifications You must be signed in to change notification settings

MattDBailey/collegeswimming.com-Data-Analysis

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

54 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Written by Brad Beacham
Adapted from code written by Kevin Wylder

TL;DR - ####THIS NEEDS TO BE REWRITTEN WHEN DONE####
        Set the parameters when you run the program. Run the following commands
            python get_swim_data.py
            python process_swim_data.py
        and look in the graphs directory for fun data.
        You can check the database for your own analysis, but you'd have to RTFM.

Files:
constants.py
    A python module can be used to control what is done in get_swim_data.py and process_swim_data.py. You can use it to
    choose what events, teams, genders, seasons, etc. are collected by get_swim_data.py. You can also change how
    calculations in process_swim_data are made.
    Global parameters -
        DATABASE_FILE_NAME: the output file name, it is where get_swim_data.py stores its data
        SEASON_LINE_MONTH: the first month of the year that starts off the season
        SEASON_LINE_DAY: the day of the month that corresponds to the first day of season
    URL's for pulling data:
        SWIMMER_URL = Base for URLs leading to individual swimmers pages
        SWIMMER_EVENT_URL = Base for URLs leading to dictionary of swimmer times in specific events
        ROSTER_URL = Base for URLs leading to a team's roster
        RESULTS_URL = Base for URLs leading to list of all meets a team competed in during a given season
        MEET_URL = Base for URLs going to webpage for a specific meet
        MEET_EVENT_URL = Base for URLs going to a webpage for specific event in a meet
        SPLASH_SPLITS_URL = Base for URLs going to webpage for split times in a relay event for a single relay team
    get_swim_data.py parameters -
        DEFAULT_EVENTS_TO_PULL: an array of events that will be searched for in each swimmer
        DEFAULT_GENDER: an array of which genders to search for. "M" and/or "F"
        DEFAULT_TEAMS_TO_PULL: an array of team names that will be searched for. You must input a full team/school name
        DEFAULT_YEAR_START: the starting year of this search
        DEFAULT_YEAR_END: the ending year of this search
    process_swim_data.py parameters -
    	INDIVIDUAL_POINTS: A dictionary where the values are arrays of integers that correspond to the number of points
            a player would be awarded for placing in an individual event at a swim meet. The keys correspond to pools
            of different sizes
        RELAY_POINTS: A dictionary where values are arrays of integers corresponding to points a team is awarded for
            placing in a relay event at a swim meet. Keys correspond to pools of different sizes
        SCORER_LIMIT: A dictionary where values are arrays of integers corresponding to maximum number of players that
            can place in a single event from one team. First index of array is for individual events, second is for
            relays. Keys correspond to different pool sizes.

get_swim_data.py
    A python module to create the database of swims. This is the main script of the
    project so you should definitely start by running this. It may take a while, and
    obviously requires an internet connection. The result of running this is a sqlite3
    database in the working directory with the name set in constants.py

process_swim_data.py
    A python module for processing data collected by get_swim_data. Contains functions for finding lineups used by
    teams at different meets. Can be used for calculating predicted score matrix between two teams each using any number
    of lineups.

Important Structures:
database structure
    The table "Swims" will hold all the swims pulled off collegeswimming.com.
    the columns of these tables in the database are as follows (in order)
    ------------------------------------------------------------------------------
    | swimmer | team | time | scaled | meet_id | event | date | taper | snapshot |
    ------------------------------------------------------------------------------
    swimmer: an int for the swimmer's ID on collegeswimming.com
    team: an int for the team's teamId on collegeswimming.com
    time: a float representing the number of seconds the swim took
    scaled: a float z score according to this event's season
    meet_id: an int for the ID of the meet this swim came from on collegeswimming.com
    event: a gender char followed by collegeswimming.com's event structure (ex: M150Y)
    date: the day of the swim represented by the number of seconds since unix epoch
    taper: was this a taper swim? We have to guess b/c it's not online. possible values:
        0 - we haven't determined whether or not this was a taper swim (the initial value)
        1 - this swim appears to be a taper swim (based off season times)
        2 - this swim appears to not be a taper swim (based off season times)
        3 - this is an outlier swim (more than 3 sd from their mean)
    snapshot: an integer that corresponds to when this row was added to the database. This
              is just to help control duplicates and have more info about the farming.

    The table "Swimmers" holds all swimmers pulled off collegeswimming.com,
    the columns of the table are as follows (in order)
    ----------------------------------------
    | name | gender | swimmer_id | team_id |
    ----------------------------------------
    name: String spelling out swimmer's name
    gender: Character "M" or "F" representing gender ("Male" or "Female") of swimmer
    swimmer_id: Integer ID that matches this swimmer. Uniquely identifies any swimmer.
    team_id: Integer ID that corresponds to the team this swimmer is from. Uniquely identifies
             any team.

    The "Teams" table matches Strings to team identification ints from collegeswimming.com.
    This table has the following columns:
    -------------
    | name | id |
    -------------
    name: A string representation of an entity (ie. "Kevin Wylder" or "UC San Diego")
    id: the integer that matches this entity (ex. "UC San Diego" has 121 as it's id)

    "Snapshots" is a table with information about the parameters of the search when this
    data was extracted. It has the columns:
    ------------------------------------
    | snapshot | date | teams | events |
    ------------------------------------
    snapshot: a integer for this pull that will match many rows in the event tables
    date: a human readable string displaying specified time range for this pull
          (ex: "2014.9.15-2015.9.15")
    teams: a string list of teamIds specified in this pull, separated by ","
          (ex: "121,122,123,125")
    events: a string list of event table names in this pull, separated by ","
            (ex: "M150Y,F150Y,M4100Y,F4100Y")


collegeswimming.com event structure
    the definition an event is as follows
                 ABBBBC
    with characters representing
    A: The stroke id number. An integer between 1 and 5
        1 - Freestyle
        2 - Backstroke
        3 - Breaststroke
        4 - Butterfly
        5 - Individual Medley
        F - Freestyle Relay
        M - Medley Relay
    B: The distance of the event. While it shows there being 4 B's above, there is no
       specification, it can be from 2 to 4 digits
    C: The pool type
        Y - short course yards
        L - long course meters
        S - I think short course meters, but I've never seen or tested this
    examples:
        150Y - 50 Yard Freestyle
        4100Y - 100 Yard Butterfly
        3200L - 200 Meter Breaststroke

collegeswimming.com url structure
    There are two urls we'll use from the website. The first is a list of the team roster.
        http://www.collegeswimming.com/team/{A}/mod/teamroster?season={B}&gender={C}
    This url returns an HTML page, which must be scraped for swimmer names and ids.
    The curly braces "{}" are omitted and replaced with
    A: The team's id number. This is an index on the website. You can manually look this
       up by searching for a team and looking at the number in the url
    B: The range of years to get swimmers. In the format
                        SSSS-EEEE
       SSSS - the starting season year
       EEEE - the ending season year
       Using the selected dropdown spinner on the website, you're only allowed to do two
       consecutive years (ex: 2014-2015), but strings are allowed to have unlimited range
    C: The gender of the roster. A 1 char representation.
        M - Men's team roster
        F - Women's team roster

    The second url we'll use is for a specific swimmer's event
    http://www.collegeswimming.com/swimmer/{A}/times/byeventid/{B}
    This url returns all the swims of the specified event json encoded.
    The curly braces "{}" are omitted and replaced with
    A: The swimmer's id number. This should be retreived from the team roster
    B: the event structure. This is defined above as "collegeswimming.com event structure"


A note from Brad:
All of the R/analysis files were deleted on June 21st, 2019. They can be found in any of
the branches that predate brad-task-014.

About

A python script to pull data from collegeswimming.com and perform data analysis on selected teams

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 54.2%
  • HTML 39.4%
  • Python 6.4%