Feature optimize database inserts #104
Conversation
Added my questions/comments. Let me know if you think we need to address any of these before trying it out!
ALSO it's failing tests:

alembic/versions/c7c033c1cdb5_populate_aoi_chunks_table.py:20 in public function upgrade: D103 Missing docstring in public function
alembic/versions/c7c033c1cdb5_populate_aoi_chunks_table.py:43 in public function downgrade: D103 Missing docstring in public function
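For reference, the D103 failures just need docstrings added to the two public migration functions; something like this (docstring wording illustrative) would satisfy the check:

def upgrade() -> None:
    """Create and populate the aoi_chunks table."""
    ...


def downgrade() -> None:
    """Drop the aoi_chunks table."""
    ...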
"aoi.id", ondelete="CASCADE", deferrable=True, initially="DEFERRED" | ||
), | ||
), | ||
sa.Column("geometry", Geometry("POLYGON", srid=4326), nullable=False), |
Are you sure that it will always be POLYGON and never MULTIPOLYGON?
Also, it will be important to confirm that we don't suffer any accuracy loss by using Geometry instead of Geography here.
@bitner any concerns with using POLYGON instead of MULTIPOLYGON in the aoi_chunks table?
If you are building the aoi_chunks table using the script I made that has both the st_dump (this takes any MULTI geometry and splits it into a separate row per single geometry) and st_subdivide calls, then aoi_chunks will always be POLYGON.
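For anyone following along, here is a minimal sketch of that chunking step as it might appear in the migration (the actual script isn't shown in this thread, so the SQL is an illustration; 256 is PostGIS's default vertex cap):

from alembic import op


def upgrade() -> None:
    """Split every AOI into small, single-polygon chunks."""
    # ST_Dump splits any MULTI geometry into one row per single geometry;
    # ST_Subdivide then caps the vertex count of each resulting piece.
    op.execute(
        """
        INSERT INTO aoi_chunks (id, geometry)
        SELECT id, ST_Subdivide((ST_Dump(geometry)).geom, 256)
        FROM aoi;
        """
    )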
"id", | ||
sa.BigInteger, | ||
sa.ForeignKey( | ||
"aoi.id", ondelete="CASCADE", deferrable=True, initially="DEFERRED" |
I'm unfamiliar with these: ondelete="CASCADE", deferrable=True, initially="DEFERRED". Should those be added to our other foreign keys?
ondelete="CASCADE"
is a behaviour which will delete items in the dependent table (slick_to_aoi
) if the item that the foreign key references from the parent table (slick
) is deleted. Basically, if you delete the slick, it will automatically delete the corresponding slick_to_aoi
entry. I think this is default behaviour in SQLAlchemy, but leaving this explicitely defined just to make sure we're following David's recommendations as closely as possible
deferrable=True, initially="DEFERRED" allows all the foreign key constraints to be checked at the end of the transaction, instead of on each insert (reduces overhead).
Probably not a bad thing to add to other tables for consistency, but not critical.
The typical use case for deferrable foreign keys is if you need to insert foreign keys in a way that creates a circular dependency. Since the foreign key constraint would fail on the insert, we can insert all the items in the transaction and, at the end of the transaction, check that everything is in place.
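A tiny illustration of the difference (schema simplified and DSN a placeholder; assume slick 42 already exists):

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/db")  # placeholder

with engine.begin() as conn:
    # Child row first: an ordinary FK would raise here, because aoi 7
    # doesn't exist yet.
    conn.execute(text("INSERT INTO slick_to_aoi (slick, aoi) VALUES (42, 7)"))
    conn.execute(text("INSERT INTO aoi (id) VALUES (7)"))
# With DEFERRABLE INITIALLY DEFERRED, the FK checks run only here, at
# COMMIT, by which point both rows are in place.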
Great--let's add these to the other columns as appropriate then, or add a comment to that effect? I think it is more confusing if the code is left in a state where otherwise identical variables are treated differently, when the reason for that choice is absent.
deferred constraints can definitely make data loads quite a bit faster in some cases
op.create_table(
    "slick_to_aoi",
    sa.Column("id", sa.BigInteger, primary_key=True),
Flagging that this may break things on our front end, if the UI ever utilizes this field
Got it - I think this might also help marginally with insert times, so we can check with Aemon on Monday.
As this is a table that is just the link table for a many-to-many join that is automatically managed using triggers, is there any use case where the surrogate id key was ever exposed? Typically for cases like this, where there is usually nothing that needs to refer to or edit the table directly (it's always done through the foreign relationships), the surrogate key is just "dead weight".
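For what it's worth, dropping the surrogate key might look like this in the migration (a sketch only; the slick/aoi column names are assumed from the discussion, not taken from the PR):

import sqlalchemy as sa
from alembic import op

op.create_table(
    "slick_to_aoi",
    # The two foreign keys together form a natural composite primary key,
    # so no surrogate id column is needed.
    sa.Column(
        "slick",
        sa.BigInteger,
        sa.ForeignKey(
            "slick.id", ondelete="CASCADE", deferrable=True, initially="DEFERRED"
        ),
        primary_key=True,
    ),
    sa.Column(
        "aoi",
        sa.BigInteger,
        sa.ForeignKey(
            "aoi.id", ondelete="CASCADE", deferrable=True, initially="DEFERRED"
        ),
        primary_key=True,
    ),
)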
stack/database.py (outdated)
-database_flags=[dict(name="max_connections", value=500)],
+# Postgres tuning values ref: https://github.com/developmentseed/how/tree/main/dev/postgresql
+database_flags=[
+    dict(name="pg_stat_statements.track", value="ALL"),
Is this the expensive one that should not be on PRODUCTION?
It seems like this value (or any other that needs to be set verbose on TEST, but quiet on PRODUCTION) should be set in our pulumi.yaml for each stack, rather than hardcoded here.
@bitner do you know if pg_stat_statements.track is too expensive to leave in production?
pg_stat_statements will have a small performance penalty, but far smaller than anything that adds any kind of logging into the postgres logs. The benefits of having it in your database far outweigh the hit, in my opinion, as it gives you the hook to see where any slowdowns are happening much more easily, even as your database evolves.
I would go ahead and always leave it enabled in both TEST and PRODUCTION.
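For reference, once the extension is enabled the slowest statements can be pulled with a query like this (column names per PostgreSQL 13+, which renamed total_time/mean_time; DSN a placeholder):

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/db")  # placeholder

with engine.connect() as conn:
    rows = conn.execute(
        text(
            """
            SELECT query, calls, total_exec_time, mean_exec_time
            FROM pg_stat_statements
            ORDER BY total_exec_time DESC
            LIMIT 10;
            """
        )
    )
    for query, calls, total_ms, mean_ms in rows:
        print(f"{total_ms:10.1f} ms total  {calls:6d} calls  {query[:60]}")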
# Postgres tuning values ref: https://github.com/developmentseed/how/tree/main/dev/postgresql
database_flags=[
    dict(name="pg_stat_statements.track", value="ALL"),
    # Should be 1/4 of total system memory (15Gb)
I LOVE these comments--so helpful!!!
Also, where does this 15GB come from? And is that a good size for the system memory?
I think my question is more like "how do we right-size the database system memory?" and "is 15GB enough or overkill for our scope?"
There are two questions here that really become iterative.
- How do you make sure you have the right settings for your database instance
- What size database instance do you need
SHARED_BUFFERS and EFFECTIVE_CACHE_SIZE are going to be dictated directly by the database instance size. MAX_CONNECTIONS and WORK_MEM need to be set together relative to the value of SHARED_BUFFERS.
First step would be to start at an instance size, determine (or guess at) the concurrency you need, and set your WORK_MEM to be SHARED_BUFFERS / MAX_CONNECTIONS. Then try to get as real a load on the database as possible. You can use pg_stat_statements to track which queries are taking the longest. If you dig into those queries with EXPLAIN ANALYZE and see any indications that data is getting swapped to disk in things like sort/merge operations, that indicates that you need more WORK_MEM, and so you would either need to drop your MAX_CONNECTIONS or bump up your instance size. Similarly, if you find that you are hitting issues about running out of connections, you would need to either adjust your settings within that instance or bump up to the next sized instance.
I always hesitate to give any kind of blanket recommendation for PG settings, as they almost always depend very heavily on how the database is actually getting used, more so than on the shape of the data itself. You should always expect some degree of iteration - and the right answer for today's load may well be different as your data and usage patterns change, so it is always something that should be monitored and periodically tweaked.
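To make that concrete with the numbers already in this PR (illustrative values only, not a recommendation: the 15 GB figure comes from the code comment above, shared_buffers follows the 1/4-of-RAM note, effective_cache_size uses the common ~3/4-of-RAM rule of thumb, and work_mem follows the shared_buffers / max_connections guideline):

database_flags = [
    dict(name="pg_stat_statements.track", value="ALL"),
    # ~1/4 of 15 GB system memory
    dict(name="shared_buffers", value="3840MB"),
    # ~3/4 of system memory (common rule of thumb)
    dict(name="effective_cache_size", value="11GB"),
    dict(name="max_connections", value="100"),
    # work_mem ~= shared_buffers / max_connections = 3840MB / 100
    dict(name="work_mem", value="38MB"),
]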
stack/database.py (outdated)
# Only used by temp tables
dict(name="temp_buffers", value="512MB"),
# Max number of concurrent i/o processes
dict(name="effective_io_concurrency", value="100"),
Should this not be related to (or higher than) the number of max_connections?
No. EFFECTIVE_IO_CONCURRENCY is really about what the underlying storage technology is. The PG defaults assume the worst case scenario of spinning disk. When using SSDs, both concurrency and random access are vastly different than on spinning disk. This setting along with RANDOM_PAGE_COST both relate to the type of disk and influence how PG goes about making query plans.
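For reference, SSD-oriented values for those two settings often look something like this (illustrative, not tuned for this instance):

database_flags = [
    # SSDs can service many concurrent random reads; the PG default of 1
    # assumes a single spinning disk
    dict(name="effective_io_concurrency", value="200"),
    # The default random_page_cost of 4.0 also assumes spinning disk;
    # values near seq_page_cost (1.0) are typical for SSDs
    dict(name="random_page_cost", value="1.1"),
]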
@@ -57,7 +57,7 @@ def upgrade() -> None:
     aoi_eez = database_schema.AoiEez(
         type=1,
         name=feat["properties"]["GEONAME"],
-        geometry=to_wkt(from_geojson(json.dumps(feat["geometry"]))),
+        geometry=to_wkt(from_geojson(json.dumps(feat["geometry"])).buffer(0)),
Automatically using things like st_makevalid or buffer(0) can be dangerous as you don't necessarily know "why" the geometry is invalid and it is possible that this just allows a bad record (typically where simplification has done something like down-classed a polygon to a linestring). You should make sure you understand where these invalid geometries are coming from further up if at all possible and use tricks like this as a last resort.
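One way to keep the buffer(0) fallback while surfacing the "why" (a sketch reusing the feat variable from the diff above; assumes shapely >= 2.0 and stdlib logging):

import json
import logging

from shapely import from_geojson
from shapely.validation import explain_validity

geom = from_geojson(json.dumps(feat["geometry"]))
if not geom.is_valid:
    # Log the reason (e.g. "Ring Self-intersection[...]") before repairing,
    # so bad upstream records are visible rather than silently patched.
    logging.warning("Invalid geometry: %s", explain_validity(geom))
    geom = geom.buffer(0)  # last-resort repair, per the caveat above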
BEGIN
    RAISE NOTICE '---------------------------------------------------------';
    RAISE NOTICE 'In slick_before_trigger_func. %', (clock_timestamp() - timer)::interval;
It is OK to leave them in as long as the overall server setting for CLIENT_MIN_MESSAGES is set to warning, in which case this type of logging will have no appreciable effect, but it will allow you to debug things when running manually by executing SET CLIENT_MIN_MESSAGES TO NOTICE; in your session.
Feature optimize database inserts
This is an initial implementation of several database improvements as suggested by @bitner:
- Postgres tuning values (work_mem, max_connections, etc.)
- aoi_chunks table which partitions the AOIs into manageable geometries (ref.)
- buffer(0) applied to shapely geometries to reduce the number of invalid geometries in the database