-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: vastly reduces number of db queries #252
base: main
Are you sure you want to change the base?
Conversation
Problem: The creation of some tables was causing a large number of queries to be run which itself scaled with the size of the database, i.e., larger databases (such as the one in production) caused more queries to be made the the db itself (a problem which componds on itself). This issue was not apparent in development because of the smaller size of the development db. First, we can verify the above problem by increasing the size of the development db. We can do this by temporarily editing `scripts/entrypoint.sh`, chanding line python make_fake_fixtures.py 1000 1000 1000 > /tmp/fake_agagd_data.json with python make_fake_fixtures.py 100 10000 100 > /tmp/fake_agagd_data.json and then deleting and recreating the docker images/containers/volumes. Before the change, the development database was creating data for 1000 players, 1000 games, and 1000 tournaments. Note that this meant that, on average, each player only played one game and each tournament only contained one game. This was helping mask the primary issue. After the change, the development database is creating 100 players, 10_000 games, and 100 tournaments, meaning that, on average, each player played 100 games, and each tournament contained 100 games. We note that, although this new development db is still significantly smaller than the one in production, the local development app is already having trouble responding to requests in involving player and tournament pages. With the new larger development database in place, we observe the following concerning facts: * a random player page requires ~1000 queries to be made. * a random tournament page requires ~100 queries to be made. Solution: This commit addresses the above issues by making 2 primary changes in how some tables are computed or rendered: 1. The first involves the way player names and ids were rendered in the game tables using the custom `agagd_core.tables.players.LinkFullMembersNameColumn`. This custom column had a `.render()` method which internally made a query to the db. Then, for the rendering of the whole table, the `.render()` method was being called for every entry in the table data, causing many queries to be made. To solve this issue, we removed `LinkFullMembersNameColumn` altogether, and constructed the appropriate queries to construct the "player-name (player-id)" label once for all entries in the table. 2. The second involves the way opponent and tournament data was collected to create the opponent and tournament tables in the player_profile view. This data was collected and manipulated in pure python, using explicit python loops over the Games model, and at least one db query per iteration. To solve this issue, we created appropriate queries which gathered and aggregated the required data using a couple of large queries (rather than many small ones). After the above changes: * the number of queries made in a random player page dropped from ~1000 to ~10. * the number of queries made in a random tournament page dropped from ~100 to ~5.
I will note that this commit gets rid of a complicated part of code which was constructing the data for some tables, and reimplements similar logic using more straightforward queries. However, in doing so, some facets of the previous data-collection may have been changed. I will add some comments in the places where I know this is happening, so that somebody can review the importance of those aspects. |
# Set default game_date to None | ||
game_date = None | ||
|
||
# Check for 0000-00-00 dates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of this check is not entirely clear; supposedly all games should have a valid date? This piece of logic, which leaves the game_date as None
in selected cases, is not reflected in the new implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the records in the games database are not clean or unified. generally a good idea to assume that the checks are because there's bad data in prod that are >20 years old :) I'll see if i can run some queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very fair, but I'm still a bit confused as to how what exactly this check is doing;
the Game table uses DateField to represent dates, which returns datetime.date
objects. So this comparison against a string would always return True, meaning that the field is always used anyway.. Further, you cannot create a `datetime.date(0, 0, 0)' object to begin with, as the year/month/day combination is out of range, so it's not like there's any chance that any of the dates actually match '0000-00-00'. I'm a bit skeptical this check is doing anything at all; if there's a way for you to access the database and find some examples of problematic game dates, that'll help clear a light into what's going on here, and how I can integrate the special cases in the current query.
tourney_data.values(), | ||
key=lambda d: d.get("date", date.today()) or date.today(), | ||
reverse=True, | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of giving tourney_data
twice to the PlayersTournamentTable
is unclear to me. Especially since the first time is unsorted while the second time is sorted. This may represent some kind of logic which is not reflected in the new implementation.
Before this commit, the construction of "player_name (player_id)" was performed at the db query level by creating a new composite data field. After this commit, the construction of "player_name (player_id)" is performed post-query as a simpler rendering function based on given fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is definitely faster but the tournament totals are not working correctly.
# Set default game_date to None | ||
game_date = None | ||
|
||
# Check for 0000-00-00 dates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the records in the games database are not clean or unified. generally a good idea to assume that the checks are because there's bad data in prod that are >20 years old :) I'll see if i can run some queries.
tournament_date=F("tournament_code__tournament_date"), | ||
tournament_total_players=F("tournament_code__total_players"), | ||
date=F("game_date"), | ||
won=Count("game_id", filter=Q_won), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to be working -- all tournaments report just a single game. Possibly missing a group-by?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see them always as being a single game with the fixture data, but I do see something fishy going on; a bit hard to diagnose because the fixture data is not realistic, e.g., many players share the same name, and a tournament is made up from games from different dates. Currently looking into this to see what is the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
found the issue; the entire aggregation was wrong because I was trying to construct the data starting from the Games table rather than the Tournament table.. Already have a fix; I'll push it after I address the other feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually nvm; I've pushed this one for now, since I'm not sure what to make of the other issue. Latest commit should fix the tournament table in the player profile.
the previous commit which reduces the number of db queries was constructing tournament tables in player profiles with wrong numbers of won/lost games. This commit fixes that issue.
Problem: The creation of some tables was causing a large number of
queries to be run which itself scaled with the size of the database,
i.e., larger databases (such as the one in production) caused more
queries to be made the the db itself (a problem which componds on
itself). This issue was not apparent in development because of the
smaller size of the development db.
First, we can verify the above problem by increasing the size of the
development db. We can do this by temporarily editing
scripts/entrypoint.sh
, chanding linewith
and then deleting and recreating the docker images/containers/volumes.
Before the change, the development database was creating data for 1000
players, 1000 games, and 1000 tournaments. Note that this meant that,
on average, each player only played one game and each tournament only
contained one game. This was helping mask the primary issue. After the
change, the development database is creating 100 players, 10_000 games,
and 100 tournaments, meaning that, on average, each player played 100
games, and each tournament contained 100 games. We note that, although
this new development db is still significantly smaller than the one in
production, the local development app is already having trouble
responding to requests in involving player and tournament pages.
With the new larger development database in place, we observe the
following concerning facts:
Solution:
This commit addresses the above issues by making 2 primary changes in
how some tables are computed or rendered:
The first involves the way player names and ids were rendered in the
game tables using the custom
agagd_core.tables.players.LinkFullMembersNameColumn
. This customcolumn had a
.render()
method which internally made a query to thedb. Then, for the rendering of the whole table, the
.render()
method was being called for every entry in the table data, causing
many queries to be made. To solve this issue, we removed
LinkFullMembersNameColumn
altogether, and constructed theappropriate queries to construct the "player-name (player-id)" label
once for all entries in the table.
The second involves the way opponent and tournament data was
collected to create the opponent and tournament tables in the
player_profile view. This data was collected and manipulated in pure
python, using explicit python loops over the Games model, and at
least one db query per iteration. To solve this issue, we created
appropriate queries which gathered and aggregated the required data
using a couple of large queries (rather than many small ones).
After the above changes:
the number of queries made in a random player page dropped from ~1000
to ~10.
the number of queries made in a random tournament page dropped from
~100 to ~5.