Add functionality to list StatsBomb open competitions and available matches #363

UnravelSports · 2024-11-12T08:59:33Z

Hi!

While working on the docs I realized it wasn't very straightforward to see which competitions and matches are available in the StatsBomb open data repository, at least not from within Kloppy.

I have added functionality such that we can do:

from kloppy import statsbomb

statsbomb.list_open_competitions(detailed=True)
statsbomb.list_open_matches(223, 282, detailed=True)

A couple things to consider:

Both these functions return a pandas dataframe, but Kloppy has no explicit dependency on pandas. I have inserted a warning (similar to the one shown when doing dataset.to_df() if pandas has not been installed.
I have added a detailed (bool=False) parameter to both functions. Setting it to False will provide all available columns, leaving it to the default False will give us only the most relevant columns
I have moved the underlying functions to kloppy.infra.serializers.event.statsbomb.helpers. I'm not sure if this is desirable.
Because of point 1, we cannot explicitly state the return type of our functions, ie we cannot do the following:

def list_open_matches(
    competition_id: int, season_id: int, detailed: bool = False
) -> pd.DataFrame:

Instead we simply do:

def list_open_matches(
    competition_id: int, season_id: int, detailed: bool = False
):

UnravelSports · 2024-11-12T11:49:08Z

Just to add, if the basic functionality is approved I'll add some tests

probberechts · 2024-11-12T12:40:28Z

A few remarks:

I prefer the initial proposal of adding a list_open_data function that immediately retrieves all match ids for all competitions. It can have optional arguments to filter on a competition and / or season. With list_open_competitions and list_open_matches it's again a two-step operation to retrieve a certain match id.
StatsBomb has a great well-maintained Python API. Would it be an option to simply wrap their API?

def list_open_data(fmt="dataframe"):
    try:
        from statsbombpy import sb
    except ImportError:
        print("Please install the statsbombpy library to use this function.")
        return

    try:
        # Fetch all available competitions as a dictionary
        competitions = sb.competitions(fmt="dict")

        # Initialize an empty list to store DataFrames for matches
        all_matches = []

        # Loop through each competition and fetch matches using competition_id and season_id
        for competition in competitions.values():
            matches = sb.matches(
                competition_id=competition['competition_id'], 
                season_id=competition['season_id'],
                fmt=fmt
            )
            all_matches.append(matches)

        # Combine all matches into a single DataFrame
        if fmt == "dataframe":
            import pandas as pd
            combined_matches = pd.concat(all_matches, ignore_index=True)
            return combined_matches
        elif fmt == "dict":
            return all_matches
        else:
            raise ValueError("Invalid format. Use 'dataframe' or 'dict'.")

    except Exception as e:
        raise RuntimeError(f"An error occurred while fetching data: {e}")

# Example usage:
df = list_open_data(fmt="dataframe")
print(df.head())

UnravelSports · 2024-11-12T13:10:07Z

This seems fine to me too! I wasn't sure about relying on the sb package initially, but why not.

UnravelSports · 2024-11-14T08:20:31Z

@probberechts do you have any suggestions for passing the competition / season filters? It might make sense to have this be a dict with comp / season_id value pairs, but it might also make sense to simply be able to provide a competition_id and or a competition / season_id pair?

Also, thinking about the functionality I have built and that you are suggesting, the first step is always going to be a two step process. You will need to identify the competition_ids first to even know what you want. So, would it be an idea to have both as an option?

probberechts · 2024-11-15T09:25:08Z

do you have any suggestions for passing the competition / season filters?

Maybe simply allow everything? You can define a StatsBombPartitionIdentifier object and then try to parse a dict or tuple into this object.

@dataclass
class StatsBombPartitionIdentifier:
    competition_id: Optional[int] = None
    season_id: Optional[int] = None

    @classmethod
    def from_dict(cls, data: Dict):
        """Create a PartitionIdentifier from a dictionary."""
        return cls(
            competition_id=data.get('competition_id'),
            season_id=data.get('season_id')
        )

    @classmethod
    def from_tuple(cls, data: Tuple[Optional[int], Optional[int]]):
        """Create a PartitionIdentifier from a tuple."""
        if len(data) != 2:
            raise ValueError("Tuple must have exactly two elements: (competition_id, season_id)")
        return cls(
            competition_id=data[0],
            season_id=data[1]
        )

Although, I don't think I would use these identifiers a lot. I can see it might be nice to have this option since it will make data retrieval more efficient (you won't have to retrieve the file with matches for each competition-season pair). However, the only ID that I know by hard is the one of the EPL and would probably rather wait an extra second rather than typing another parameter 🙃. So, maybe we don't have to add this option.

the first step is always going to be a two step process

Yes, typically it will be. But I prefer to do these two steps directly in Pandas. Below are a few use cases in which I think this functionality would be useful:

🤔 I would like to do some analysis with 360 data, which datasets can I use?

df = list_open_data(fmt="dataframe") 
df["has_360"] = df["match_status_360"] == "available"
print(df.groupby(["competition", "season"])["has_360"].sum().sort_values(ascending=False))

🤔 What is the ID of the 2022 World Cup final?

df = list_open_data(fmt="dataframe") 
print(df[df.match_date == "2022-12-18"])

🤔 Get the metadata of a particular game.

df = list_open_data(fmt="dataframe") 
print(df[df.match_id == 7584])

UnravelSports · 2024-11-18T10:30:40Z

I've changed it to your suggested implementation. Where we allow a single combination of competition_id and season_id or we pull everything.

Some notes:

Have suppressed the NoAuthWarning in the sb.matches() for-loop because one warning is enough, and we get that warning when we grab the competitions.
I had to add the competition_id and season_id to the dataframe implementation, because for some reason statsbomb was not giving that by default.

@koenvo do we need tests for this? And if so, what would they look like?

@probberechts note: I see now that I missed your previous message, not sure this is something we need. I'm also trying to keep it a bit simple, in the end it's just a wrapper. And if people are now getting the full list of data first, then they can do the rest themselves from that.

koenvo · 2024-11-18T11:13:24Z

kloppy/infra/serializers/event/statsbomb/helpers.py

+import requests as re
+
+
+def get_response(path):


Please move this to a more generic place like utils.py

Thanks for pointing this out. These lines weren't necessary anymore so I've removed them.

koenvo · 2024-11-18T11:21:41Z

@probberechts what do you think of putting ingestify in between? I might have broken the current statsbomb open data Source, but when it's fixed it provides all necessary parts to work with the open data.

It might need an additional "get or fetch" method to either get it directly from (local) storage, or retrieve it first.

probberechts · 2024-11-18T16:00:21Z

@probberechts what do you think of putting ingestify in between? I might have broken the current statsbomb open data Source, but when it's fixed it provides all necessary parts to work with the open data.

It might need an additional "get or fetch" method to either get it directly from (local) storage, or retrieve it first.

I think adding ingestify might be overkill for our needs. We only need to download ~10 small JSON files, maybe once in a while. The main reason I proposed using the statsbombpy package is that it already parses everything neatly into dataframes. I am not very familiar with ingestify, but I guess it does not do that part.

UnravelSports · 2024-11-18T16:16:04Z

I also don't know what ingestify does, but does indeed feel like overkill. The statsbombpy package works fine, plus it allows users to directly grab non open data too if they have access

koenvo · 2024-11-18T17:12:49Z

Ok. Let’s go for this solution.

I do think it would be good to introduce some models for competition, seasons and match etc to prevent having StatsBomb specific structures in through the code.

UnravelSports · 2024-11-18T18:15:49Z

How/where do you see this?

probberechts · 2024-11-18T18:40:01Z

I do think it would be good to introduce some models for competition, seasons and match etc to prevent having StatsBomb specific structures in through the code.

Yes, this is a good point! I guess what you mean is that we should add something like a DatasetCollection entity to kloppy?

@dataclass
class Game:
    game_id: int
    home_team: Team
    away_team: Team
    date: datetime

@dataclass
class Season:
    season_year: str  # e.g., "2023-2024"
    season_id: int
    games: List[Game] = field(default_factory=list)

@dataclass
class Competition:
    name: str  # e.g., "Premier League"
    competition_id: int
    seasons: List[Season] = field(default_factory=list)

@dataclass
class DatasetCollection:
    """Represents a dataset containing games from multiple competitions."""
    competitions: List[Competition] = field(default_factory=list)
        
   def to_df():
       """Retrieve all games from all competitions and seasons as a dataframe."""
       raise NotImplementedException()

UnravelSports [JB] added 2 commits November 12, 2024 09:49

list open comps and matches

0edb471

add ordering by match_date

57a4881

UnravelSports [JB] added 2 commits November 18, 2024 11:25

changed to statsbombpy

3935110

fixed useless lines

99c5903

koenvo reviewed Nov 18, 2024

View reviewed changes

removed lines

0c20e12

probberechts marked this pull request as draft December 17, 2024 19:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add functionality to list StatsBomb open competitions and available matches #363

Add functionality to list StatsBomb open competitions and available matches #363

UnravelSports commented Nov 12, 2024 •

edited

Loading

UnravelSports commented Nov 12, 2024

probberechts commented Nov 12, 2024

UnravelSports commented Nov 12, 2024

UnravelSports commented Nov 14, 2024 •

edited

Loading

probberechts commented Nov 15, 2024

UnravelSports commented Nov 18, 2024 •

edited

Loading

koenvo Nov 18, 2024

UnravelSports Nov 18, 2024

koenvo commented Nov 18, 2024

probberechts commented Nov 18, 2024

UnravelSports commented Nov 18, 2024

koenvo commented Nov 18, 2024

UnravelSports commented Nov 18, 2024

probberechts commented Nov 18, 2024

Add functionality to list StatsBomb open competitions and available matches #363

Are you sure you want to change the base?

Add functionality to list StatsBomb open competitions and available matches #363

Conversation

UnravelSports commented Nov 12, 2024 • edited Loading

UnravelSports commented Nov 12, 2024

probberechts commented Nov 12, 2024

UnravelSports commented Nov 12, 2024

UnravelSports commented Nov 14, 2024 • edited Loading

probberechts commented Nov 15, 2024

UnravelSports commented Nov 18, 2024 • edited Loading

koenvo Nov 18, 2024

Choose a reason for hiding this comment

UnravelSports Nov 18, 2024

Choose a reason for hiding this comment

koenvo commented Nov 18, 2024

probberechts commented Nov 18, 2024

UnravelSports commented Nov 18, 2024

koenvo commented Nov 18, 2024

UnravelSports commented Nov 18, 2024

probberechts commented Nov 18, 2024

UnravelSports commented Nov 12, 2024 •

edited

Loading

UnravelSports commented Nov 14, 2024 •

edited

Loading

UnravelSports commented Nov 18, 2024 •

edited

Loading