Feature optimize database inserts #104

Merged: 8 commits into main from feature-optimize-database-inserts, Nov 21, 2023

Conversation

@leothomas (Collaborator) commented Nov 18, 2023

This is an initial implementation of several database improvements as suggested by @bitner:

  • Set database tuning parameters (work_mem, max_connections, etc.); see the sketch after this list
  • Add an aoi_chunks table that partitions the AOIs into manageable geometries
  • Combine the insert triggers into a single trigger
  • Remove generated columns in favor of a trigger
  • Update foreign keys to be deferrable
  • Add .buffer(0) to shapely geometries to reduce the number of invalid geometries in the database
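
A minimal sketch of what the first bullet looks like in the pulumi stack, in the database_flags style this repo uses. The flag names are real Postgres settings, but the values here are illustrative, not the exact ones merged in this PR:

database_flags = [
    dict(name="max_connections", value="500"),
    dict(name="shared_buffers", value="3840MB"),      # ~1/4 of a 15GB instance
    dict(name="work_mem", value="8MB"),               # ~shared_buffers / max_connections
    dict(name="effective_cache_size", value="11GB"),
]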

@jonaraphael (Collaborator) left a comment:

Added my questions/comments. Let me know if you think we need to address any of these before trying it out!

Also, it's failing these docstring checks:

alembic/versions/c7c033c1cdb5_populate_aoi_chunks_table.py:20 in public function upgrade:
D103: Missing docstring in public function
alembic/versions/c7c033c1cdb5_populate_aoi_chunks_table.py:43 in public function downgrade:
D103: Missing docstring in public function

alembic/versions/3c4693517ef6_add_tables.py (outdated, resolved)
"aoi.id", ondelete="CASCADE", deferrable=True, initially="DEFERRED"
),
),
sa.Column("geometry", Geometry("POLYGON", srid=4326), nullable=False),

Collaborator:

Are you sure that it will always be POLYGON and never MULTIPOLYGON?
Also, it will be important to confirm that we don't suffer any accuracy loss by using Geometry instead of Geography here.


Collaborator (Author):

@bitner any concerns with using POLYGON instead of MULTIPOLYGON in the aoi_chunks table?


Collaborator:

If you are building the aoi_chunks table using the script I made, which has both the st_dump (this takes any MULTI geometry and splits it into a separate row per single geometry) and st_subdivide calls, then aoi_chunks will always be POLYGON.
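
A hedged sketch of that chunking pattern, for anyone reading along; the aoi_chunks column names and the 255-vertex cap are assumptions, not a copy of the actual script:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/demo")  # placeholder DSN

with engine.begin() as conn:
    # st_dump splits each MULTIPOLYGON into one row per POLYGON; st_subdivide
    # then caps the vertex count of each piece, so every aoi_chunks geometry
    # is a plain POLYGON of manageable size.
    conn.execute(text("""
        INSERT INTO aoi_chunks (id, geometry)
        SELECT aoi.id, st_subdivide((st_dump(aoi.geometry)).geom, 255)
        FROM aoi
    """))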

"id",
sa.BigInteger,
sa.ForeignKey(
"aoi.id", ondelete="CASCADE", deferrable=True, initially="DEFERRED"

Collaborator:

I'm unfamiliar with these:
ondelete="CASCADE", deferrable=True, initially="DEFERRED"
Should those be added to our other foreign keys?


Collaborator (Author):

ondelete="CASCADE" is a behaviour which will delete items in the dependent table (slick_to_aoi) if the item that the foreign key references from the parent table (slick) is deleted. Basically, if you delete the slick, it will automatically delete the corresponding slick_to_aoi entry. I think this is default behaviour in SQLAlchemy, but leaving this explicitely defined just to make sure we're following David's recommendations as closely as possible

deferrable=True, initially="DEFERRED" allows all the foreign key constraints to be checked at the end of the transactions, instead of on each insert (reduces overhead).

Probably not a bad thing to add to other tables for consistency, but not critical


Collaborator (Author):

The typical use case for deferrable foreign keys is when you need to insert rows in a way that creates a circular dependency. Since the foreign key constraint would otherwise fail on the first insert, we can insert all the items in the transaction and, at the end of the transaction, check that everything is in place.
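
A minimal, self-contained sketch of that behaviour; the table names and DSN are placeholders, not from this repo:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/demo")  # placeholder DSN

with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE parent (id bigint PRIMARY KEY);
        CREATE TABLE child (
            id bigint PRIMARY KEY,
            parent_id bigint REFERENCES parent (id)
                ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
        );
    """))
    # Insert the child row first; this would fail immediately with a
    # non-deferred foreign key.
    conn.execute(text("INSERT INTO child VALUES (1, 100)"))
    conn.execute(text("INSERT INTO parent VALUES (100)"))
    # The FK constraint is validated here, when engine.begin() commits.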


Collaborator:

Great! Let's add these to the other columns as appropriate then, or add a comment to that effect. I think it is more confusing if the code is left in a state where otherwise identical columns are treated differently and the reason for that choice is absent.


Collaborator:

Deferred constraints can definitely make data loads quite a bit faster in some cases.

op.create_table(
"slick_to_aoi",
sa.Column("id", sa.BigInteger, primary_key=True),

Collaborator:

Flagging that this may break things on our front end, if the UI ever utilizes this field


Collaborator (Author):

Got it. I think this might also help marginally with insert times, so we can check with Aemon on Monday.


Collaborator:

Since this is just the link table for a many-to-many join that is automatically managed using triggers, is there any use case where the surrogate id key was ever exposed? Typically, for cases like this, where nothing needs to refer to or edit the table directly (it's always done through the foreign relationships), the surrogate key is just "dead weight".
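
If the surrogate key is dropped, the migration would look roughly like the sketch below; the slick and aoi column names are assumptions, not copied from the actual migration:

import sqlalchemy as sa
from alembic import op

# Sketch: surrogate id removed; the natural composite key (slick, aoi)
# becomes the primary key of the link table.
op.create_table(
    "slick_to_aoi",
    sa.Column(
        "slick",
        sa.BigInteger,
        sa.ForeignKey(
            "slick.id", ondelete="CASCADE", deferrable=True, initially="DEFERRED"
        ),
    ),
    sa.Column(
        "aoi",
        sa.BigInteger,
        sa.ForeignKey(
            "aoi.id", ondelete="CASCADE", deferrable=True, initially="DEFERRED"
        ),
    ),
    sa.PrimaryKeyConstraint("slick", "aoi"),
)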

alembic/versions/3c4693517ef6_add_tables.py (resolved)
cerulean_cloud/database_client.py (resolved)
database_flags=[dict(name="max_connections", value=500)],
# Postgres tuning values ref: https://github.com/developmentseed/how/tree/main/dev/postgresql
database_flags=[
dict(name="pg_stat_statments.track", value="ALL"),

Collaborator:

Is this the expensive one that should not be on PRODUCTION?
It seems like this value (or any other that needs to be set verbose on TEST, but quiet on PRODUCTION) should be set in our pulumi.yaml for each stack, rather than hardcoded here.
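
Something like the sketch below would do it, using pulumi's per-stack config; the config key name is hypothetical:

import pulumi

config = pulumi.Config()
# Read the flag from Pulumi.<stack>.yaml so TEST can be verbose and
# PRODUCTION quiet without touching this file; falls back to "NONE".
track_level = config.get("pgStatStatementsTrack") or "NONE"
database_flags = [dict(name="pg_stat_statements.track", value=track_level)]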


Collaborator (Author):

@bitner do you know if pg_stat_statements.track is too expensive to leave in production?


Collaborator:

pg_stat_statements will have a small performance penalty, but far smaller than anything that adds any kind of logging to the postgres logs. The benefits of having it in your database far outweigh the hit, in my opinion, as it gives you the hook to see where any slowdowns are happening much more easily, even as your database evolves.

I would go ahead and always leave it enabled in both TEST and PRODUCTION.
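
For reference, a sketch of the kind of hook this gives you, pulling the most expensive statements out of the view; column names are per PostgreSQL 13+ and the DSN is a placeholder:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/demo")  # placeholder DSN

with engine.connect() as conn:
    rows = conn.execute(text("""
        SELECT query, calls, mean_exec_time, total_exec_time
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 10
    """))
    for query, calls, mean_ms, total_ms in rows:
        print(f"{total_ms:10.1f} ms total  {calls:6d} calls  {query[:60]}")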

# Postgres tuning values ref: https://github.com/developmentseed/how/tree/main/dev/postgresql
database_flags=[
dict(name="pg_stat_statments.track", value="ALL"),
# Should be 1/4 of total system memory (15Gb)

Collaborator:

I LOVE these comments--so helpful!!!

Also, where does this 15GB come from? And is that a good size for the system memory?


Collaborator (Author):

The 15GB comes from the database configuration in GCP:
[screenshot: GCP database instance configuration]


Collaborator:

I think my question is more like "how do we right size the database system memory?" and "is 15GB enough or overkill for our scope?"


Collaborator:

There are two questions here that really become iterative.

  1. How do you make sure you have the right settings for your database instance?
  2. What size database instance do you need?

SHARED_BUFFERS and EFFECTIVE_CACHE_SIZE are going to be dictated directly by the database instance size. MAX_CONNECTIONS and WORK_MEM need to be set together relative to the value of SHARED_BUFFERS.

The first step would be to start at an instance size, determine (or make a guess at) the concurrency you need, and set your WORK_MEM to be SHARED_BUFFERS / MAX_CONNECTIONS. Then try to get as real a load on the database as possible. You can use pg_stat_statements to track which queries are taking the longest. If you dig into those queries with EXPLAIN ANALYZE and see any indications that data is getting swapped to disk in things like sort/merge operations, that indicates that you need more WORK_MEM, so you would either need to drop your MAX_CONNECTIONS or bump up your instance size. Similarly, if you find that you are hitting issues with running out of connections, you would need to either adjust your settings within that instance or bump up to the next sized instance.

I always hesitate to give any kind of blanket recommendation for PG settings, as they almost always depend very heavily on how the database is actually getting used, more so than on the shape of the data itself. You should always expect some degree of iteration, and the right answer for today's load may well change as your data and usage patterns evolve, so it is always something that should be monitored and periodically tweaked.
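
A worked example of that starting-point arithmetic, using the numbers that appear in this PR (15GB instance, max_connections = 500); illustrative only:

# First-iteration sizing per the rule above; every number here is a starting
# point, to be revisited under real load.
total_memory_mb = 15 * 1024               # GCP instance memory (15GB)
shared_buffers_mb = total_memory_mb // 4  # "1/4 of total system memory"
max_connections = 500
work_mem_mb = shared_buffers_mb / max_connections  # per sort/hash operation

print(f"shared_buffers = {shared_buffers_mb}MB")   # 3840MB
print(f"work_mem       = {work_mem_mb:.1f}MB")     # ~7.7MB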

# Only used by temp tables
dict(name="temp_buffers", value="512MB"),
# Max number of concurrent i/o processes
dict(name="effective_io_concurrency", value="100"),

Collaborator:

Should this not be related to (or higher than) the number of max_connections?


Collaborator:

No. EFFECTIVE_IO_CONCURRENCY is really about what the underlying storage technology is. The PG defaults assume the worst case scenario of spinning disk. When using SSDs, both concurrency and random access are vastly different than on spinning disk. This setting along with RANDOM_PAGE_COST both relate to the type of disk and influence how PG goes about making query plans.
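
In the database_flags style used in this PR, the SSD-oriented counterparts would look something like this; the values are common SSD starting points, not tested recommendations:

database_flags = [
    # SSDs service many concurrent reads; the PG default of 1 assumes spinning disk.
    dict(name="effective_io_concurrency", value="200"),
    # Random reads cost nearly the same as sequential reads on SSD (default 4.0).
    dict(name="random_page_cost", value="1.1"),
]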

stack/database.py (outdated, resolved)
github-actions bot commented Nov 18, 2023

🍹 preview on cerulean-cloud-images/test

Pulumi report
Previewing update (test):
pulumi:pulumi:Stack cerulean-cloud-images-test running 
docker:image:Image cerulean-cloud-images-test-cloud-run-offset-tile-image  
docker:image:Image cerulean-cloud-images-test-cloud-run-orchestrator-image  
docker:image:Image cerulean-cloud-images-test-cloud-run-tipg-image  
gcp:container:Registry cerulean-cloud-images-test-registry  
docker:image:Image cerulean-cloud-images-test-cloud-run-offset-tile-image  Building image '../'...
docker:image:Image cerulean-cloud-images-test-cloud-run-orchestrator-image  Building image '../'...
docker:image:Image cerulean-cloud-images-test-cloud-run-tipg-image  Building image '../'...
docker:image:Image cerulean-cloud-images-test-cloud-run-tipg-image  sha256:aa489d8f63a6f166736c98c90eea425cdf9a2ef08c0a47f0498e400d397acbef
docker:image:Image cerulean-cloud-images-test-cloud-run-tipg-image  Image build succeeded.
docker:image:Image cerulean-cloud-images-test-cloud-run-orchestrator-image  sha256:439ef6c96358d2e5003a2923c306937e8a7429b2b2cfe6fec8d0bfb7c945c350
docker:image:Image cerulean-cloud-images-test-cloud-run-orchestrator-image  Image build succeeded.
docker:image:Image cerulean-cloud-images-test-cloud-run-offset-tile-image  sha256:f7f9fa4e6ffc09432e54737d066e447550a4c987eecc93d07cd09698a9f94f8f
docker:image:Image cerulean-cloud-images-test-cloud-run-offset-tile-image  Image build succeeded.
pulumi:pulumi:Stack cerulean-cloud-images-test  
Resources:
5 unchanged

github-actions bot commented Nov 18, 2023

🍹 preview on cerulean-cloud/test

Pulumi report
Previewing update (test):
pulumi:pulumi:Stack cerulean-cloud-test running 
pulumi:providers:docker cerulean-cloud-images-test-gcr  
gcp:storage:Bucket cerulean-cloud-test-bucket-cloud-function-ais  
gcp:cloudtasks:Queue cerulean-cloud-test-queue-cloud-tasks-ais-analysis  
gcp:serviceaccount:Account cerulean-cloud-test-cloud-function-ais  
~  gcp:sql:DatabaseInstance cerulean-cloud-test-database-instance update [diff: ~settings]
gcp:projects:IAMMember cerulean-cloud-test-cloud-function-ais-iam  
-- gcp:storage:BucketObject cerulean-cloud-test-source-cloud-function-ais delete original 
+- gcp:storage:BucketObject cerulean-cloud-test-source-cloud-function-ais replace [diff: ~detectMd5hash,name,source]
++ gcp:storage:BucketObject cerulean-cloud-test-source-cloud-function-ais create replacement [diff: ~detectMd5hash,name,source]
gcp:sql:Database cerulean-cloud-test-database  
gcp:sql:User cerulean-cloud-test-database-users  
~  gcp:cloudfunctions:Function cerulean-cloud-test-cloud-function-ais update [diff: ~secretEnvironmentVariables,sourceArchiveObject]
gcp:cloudfunctions:FunctionIamMember cerulean-cloud-test-cloud-function-ais-invoker  
docker:index:RemoteImage cerulean-cloud-images-test-remote-orchestrator  
docker:index:RemoteImage cerulean-cloud-images-test-remote-tipg  
docker:index:RemoteImage cerulean-cloud-images-test-remote-offset  
aws:iam:Role cerulean-cloud-test-lambda-titiler-role  
aws:s3:Bucket cerulean-cloud-test-titiler-lambda-archive  
aws:iam:Policy cerulean-cloud-test-lambda-titiler-policy  
aws:apigatewayv2:Api cerulean-cloud-test-lambda-titiler-api  
gcp:storage:Bucket cerulean-cloud-test-bucket-cloud-function-sr  
gcp:cloudtasks:Queue cerulean-cloud-test-queue-cloud-run-orchestrator  
aws:iam:Role cerulean-cloud-test-lambda-sentinel1-iam  
gcp:serviceaccount:Account cerulean-cloud-test-cloud-run-orchestrator  
gcp:serviceaccount:Account cerulean-cloud-test-cloud-function-sr  
~  gcp:cloudrun:Service cerulean-cloud-test-cloud-run-offset-tiles update [diff: ~metadata]
~  gcp:cloudrun:Service cerulean-cloud-test-cloud-run-tipg update [diff: ~metadata]
aws:iam:RolePolicyAttachment cerulean-cloud-test-lambda-titiler-attachment2  
aws:iam:RolePolicyAttachment cerulean-cloud-test-lambda-titiler-attachment  
-- gcp:storage:BucketObject cerulean-cloud-test-source-cloud-function-sr delete original 
+- gcp:storage:BucketObject cerulean-cloud-test-source-cloud-function-sr replace [diff: ~name]
++ gcp:storage:BucketObject cerulean-cloud-test-source-cloud-function-sr create replacement [diff: ~name]
-- gcp:storage:BucketObject cerulean-cloud-test-source-cloud-function-historical-run delete original 
+- gcp:storage:BucketObject cerulean-cloud-test-source-cloud-function-historical-run replace [diff: ~name]
++ gcp:storage:BucketObject cerulean-cloud-test-source-cloud-function-historical-run create replacement [diff: ~name]
gcp:projects:IAMMember cerulean-cloud-test-cloud-function-sr-iam  
gcp:projects:IAMMember cerulean-cloud-test-cloud-run-orchestrator-cloudTasksEnqueuer  
gcp:projects:IAMMember cerulean-cloud-test-cloud-run-orchestrator-cloudSqlClient  
gcp:cloudrun:IamPolicy cerulean-cloud-test-cloud-run-noauth-iam-policy-offset  
~  gcp:cloudrun:Service cerulean-cloud-test-cloud-run-orchestrator update [diff: ~metadata,template]
gcp:cloudrun:IamPolicy cerulean-cloud-test-cloud-run-noauth-iam-policy-tipg  
~  aws:s3:BucketObject cerulean-cloud-test-titiler-lambda-archive update [diff: ~source]
gcp:cloudrun:IamPolicy cerulean-cloud-test-cloud-run-noauth-iam-policy-orchestrator  
~  gcp:cloudfunctions:Function cerulean-cloud-test-cloud-function-historical-run update [diff: ~sourceArchiveObject]
~  gcp:cloudfunctions:Function cerulean-cloud-test-cloud-function-sr update [diff: ~sourceArchiveObject]
~  aws:lambda:Function cerulean-cloud-test-lambda-titiler-sentinel update [diff: ~sourceCodeHash]
gcp:cloudfunctions:FunctionIamMember cerulean-cloud-test-cloud-function-historical-run-invoker  
aws:lambda:Function cerulean-cloud-test-lambda-sentinel1-subscription  
gcp:cloudfunctions:FunctionIamMember cerulean-cloud-test-cloud-function-sr-invoker  
aws:apigatewayv2:Integration cerulean-cloud-test-lambda-titiler-integration  
aws:lambda:Permission cerulean-cloud-test-lambda-titiler-permission  
aws:sns:TopicSubscription cerulean-cloud-test-sentinel1-subscription  
aws:lambda:Permission cerulean-cloud-test-lambda-sentinel1-permission  
aws:apigatewayv2:Route cerulean-cloud-test-lambda-titiler-route  
aws:apigatewayv2:Stage cerulean-cloud-test-lambda-titiler-stage  
pulumi:pulumi:Stack cerulean-cloud-test  9 warnings; 4 messages
Diagnostics:
pulumi:pulumi:Stack (cerulean-cloud-test):
warning: serving_state is deprecated: `serving_state` is deprecated and will be removed in a future major release. This field is not supported by the Cloud Run API.
warning: env_froms is deprecated: `env_from` is deprecated and will be removed in a future major release. This field is not supported by the Cloud Run API.
warning: working_dir is deprecated: `working_dir` is deprecated and will be removed in a future major release. This field is not supported by the Cloud Run API.
warning: serving_state is deprecated: `serving_state` is deprecated and will be removed in a future major release. This field is not supported by the Cloud Run API.
warning: env_froms is deprecated: `env_from` is deprecated and will be removed in a future major release. This field is not supported by the Cloud Run API.
warning: working_dir is deprecated: `working_dir` is deprecated and will be removed in a future major release. This field is not supported by the Cloud Run API.
warning: serving_state is deprecated: `serving_state` is deprecated and will be removed in a future major release. This field is not supported by the Cloud Run API.
warning: env_froms is deprecated: `env_from` is deprecated and will be removed in a future major release. This field is not supported by the Cloud Run API.
warning: working_dir is deprecated: `working_dir` is deprecated and will be removed in a future major release. This field is not supported by the Cloud Run API.

Creating lambda package [running in Docker]...
Building docker image...
Copying package.zip ...
Copied package package.zip ...

Resources:
~ 9 to update
+-3 to replace
12 changes. 38 unchanged

"id",
sa.BigInteger,
sa.ForeignKey(
"aoi.id", ondelete="CASCADE", deferrable=True, initially="DEFERRED"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deferred constraints can definitely make data loads quite a bit faster in some cases

"aoi.id", ondelete="CASCADE", deferrable=True, initially="DEFERRED"
),
),
sa.Column("geometry", Geometry("POLYGON", srid=4326), nullable=False),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are building the aoi_chunks table using the script I made that has both the st_dump (this takes any MULTI geometry and splits it into separate row per single geometry) and st_subdivide calls, then aoi_chunks will always be POLYGON.

op.create_table(
"slick_to_aoi",
sa.Column("id", sa.BigInteger, primary_key=True),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is a table that is just the link table for a many-to-may join that is automatically managed using triggers, is there any use case where the surrogate id key was ever exposed? Typically for cases like this where there is usually nothing that needs to refer or edit the table directly (it's always done through the foreign relationships), the surrogate key is just "dead weight".

@@ -57,7 +57,7 @@ def upgrade() -> None:
aoi_eez = database_schema.AoiEez(
type=1,
name=feat["properties"]["GEONAME"],
geometry=to_wkt(from_geojson(json.dumps(feat["geometry"]))),
geometry=to_wkt(from_geojson(json.dumps(feat["geometry"])).buffer(0)),

Collaborator:

Automatically using things like st_makevalid or buffer(0) can be dangerous, as you don't necessarily know "why" the geometry is invalid, and it is possible that this just lets a bad record through (typically where simplification has done something like down-classing a polygon to a linestring). You should make sure you understand where these invalid geometries are coming from further upstream if at all possible, and use tricks like this only as a last resort.
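
A small sketch of that "understand first, repair last" approach with shapely; the GeoJSON here is a made-up self-intersecting bowtie:

from shapely import from_geojson          # shapely >= 2.0
from shapely.validation import explain_validity

bowtie = '{"type": "Polygon", "coordinates": [[[0, 0], [2, 2], [2, 0], [0, 2], [0, 0]]]}'
geom = from_geojson(bowtie)

if not geom.is_valid:
    # Log *why* it is invalid before deciding whether buffer(0) is safe,
    # e.g. "Self-intersection[1 1]" for this bowtie.
    print(explain_validity(geom))
    # Last resort: buffer(0) may silently drop part of the shape, which is
    # exactly the danger described above.
    geom = geom.buffer(0)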

BEGIN
RAISE NOTICE '---------------------------------------------------------';
RAISE NOTICE 'In slick_before_trigger_func. %', (clock_timestamp() - timer)::interval;

Collaborator:

It is OK to leave them in as long as the server-wide CLIENT_MIN_MESSAGES setting is set to warning, in which case this type of logging will have no appreciable effect but will still let you debug things when running manually, by running SET CLIENT_MIN_MESSAGES TO NOTICE; in your session.
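
That is, something like the following sketch (DSN is a placeholder):

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/demo")  # placeholder DSN

with engine.connect() as conn:
    # Opt this one session back into NOTICE output; the server-wide default
    # stays at 'warning', so normal inserts pay no logging cost.
    conn.execute(text("SET client_min_messages TO notice"))
    # Statements run on this connection will now surface the trigger's
    # RAISE NOTICE timing lines.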

database_flags=[dict(name="max_connections", value=500)],
# Postgres tuning values ref: https://github.com/developmentseed/how/tree/main/dev/postgresql
database_flags=[
dict(name="pg_stat_statments.track", value="ALL"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pg_stat_statements will have a small performance penalty, but far smaller than anything that is adding any kind of logging into the postgres logs. The benefits of having it in your database far outweigh the hit in my opinion as it gives you the hook to see where any slow downs are happening much more easily even as your database evolves.

I would go ahead and always leave it enabled in both TEST and PRODUCTION.

# Postgres tuning values ref: https://github.com/developmentseed/how/tree/main/dev/postgresql
database_flags=[
dict(name="pg_stat_statments.track", value="ALL"),
# Should be 1/4 of total system memory (15Gb)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two questions here that really become iterative.

  1. How do you make sure you have the right settings for your database instance
  2. What size database instance do you need

SHARED_BUFFERS and EFFECTIVE_CACHE_SIZE are going to be dictated directly by the database instance size. MAX_CONNECTIONS and WORK_MEM need to be set together relative to the value of SHARED_BUFFERS.

First step would be to start at an instance size, determine (or make a guess) at the concurrency you need, set your WORK_MEM to be SHARED_BUFFERS / MAX_CONNECTIONS. Then try to get as real a load on the database as possible. You can use pg_stat_statements to track what queries are taking the longest. If you dig into those queries with EXPLAIN ANALYZE and you see any indications that data is getting swapped to disk in things like sort/merge operations, that indicates that you need more WORK_MEM and so you would either need to drop your MAX_CONNECTIONS or bump up your instance size. Similarly, if you find that you are hitting issues about running out of connections, you would need to either adjust your settings within that instance or bump up to the next sized instance.

I always hesitate just giving any kind of blanket recommendations for PG settings as they almost always depend very heavily on how the database is actually getting used, more so than on the shape of the data itself. You should always expect some degree of iteration - and the right answer for today's load may well be different than as your data and usage patterns change, so it is always something that should be monitored and periodically tweaked.

# Only used by temp tables
dict(name="temp_buffers", value="512MB"),
# Max number of concurrent i/o processes
dict(name="effective_io_concurrency", value="100"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. EFFECTIVE_IO_CONCURRENCY is really about what the underlying storage technology is. The PG defaults assume the worst case scenario of spinning disk. When using SSDs, both concurrency and random access are vastly different than on spinning disk. This setting along with RANDOM_PAGE_COST both relate to the type of disk and influence how PG goes about making query plans.

@jonaraphael jonaraphael marked this pull request as ready for review November 21, 2023 15:22
@jonaraphael jonaraphael merged commit 5948700 into main Nov 21, 2023
10 checks passed
@jonaraphael jonaraphael deleted the feature-optimize-database-inserts branch November 21, 2023 15:23
leothomas pushed a commit that referenced this pull request Nov 21, 2023