Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PG17 - SQLSmith - Server Crash Due to MERGE Command on Distributed Table with Schema-Based Sharding #7846

Closed
m3hm3t opened this issue Jan 13, 2025 · 5 comments · Fixed by #7853

Comments

@m3hm3t
Copy link
Contributor

m3hm3t commented Jan 13, 2025

Executing a MERGE command on a distributed table (dist_sc_1.schema_based_sharding_1) caused the PostgreSQL server to crash with signal 6: Aborted, forcing the termination of active server processes.

Steps to Reproduce

CREATE SCHEMA dist_sc_1;

CREATE TABLE dist_sc_1.schema_based_sharding_1 (
    id INTEGER
);

-- Distributed table configuration
SELECT create_distributed_table('dist_sc_1.schema_based_sharding_1', 'id');

MERGE INTO dist_sc_1.schema_based_sharding_1 AS target_0
USING pg_catalog.pg_class AS ref_0
  INNER JOIN information_schema.check_constraints AS ref_1
  ON (
    CASE 
      WHEN 
        CASE 
          WHEN (ref_0.reltuples = (SELECT null_frac FROM pg_catalog.pg_stats LIMIT 1 OFFSET 4)) 
          OR (CAST(NULL AS xid8) >= CAST(NULL AS xid8)) 
          THEN CAST(NULL AS lseg) 
          ELSE CAST(NULL AS lseg) 
        END > CAST(NULL AS lseg) 
      THEN CAST(NULL AS jsonb) 
      ELSE CAST(NULL AS jsonb) 
    END 
    <= 
    CASE 
      WHEN FALSE OR FALSE 
      THEN CAST(NULL AS jsonb) 
      ELSE CAST(NULL AS jsonb) 
    END
  )
ON target_0.id = ref_0.relpages 
WHEN NOT MATCHED
  AND CAST(NULL AS circle) <@ CAST(NULL AS circle)
THEN DO NOTHING;

Schema Details

\d+ dist_sc_1.schema_based_sharding_1
                                Table "dist_sc_1.schema_based_sharding_1"
┌────────┬─────────┬───────────┬──────────┬─────────┬─────────┬─────────────┬──────────────┬─────────────┐
│ Column │  Type   │ Collation │ Nullable │ Default │ Storage │ Compression │ Stats target │ Description │
├────────┼─────────┼───────────┼──────────┼─────────┼─────────┼─────────────┼──────────────┼─────────────┤
│ id     │ integer │           │          │         │ plain   │             │              │             │
└────────┴─────────┴───────────┴──────────┴─────────┴─────────┴─────────────┴──────────────┴─────────────┘
Access method: heap

Crash Logs

2025-01-10 22:02:11.451 UTC [11846] LOG:  server process (PID 59459) was terminated by signal 6: Aborted
2025-01-10 22:02:11.451 UTC [11846] DETAIL:  Failed process was running: MERGE INTO dist_sc_1.schema_based_sharding_1 as target_0
        USING pg_catalog.pg_class as ref_0
          inner join information_schema.check_constraints as ref_1
          on (case when case when (ref_0.reltuples = (select null_frac from pg_catalog.pg_stats limit 1 offset 4)
                          ) 
                      or (cast(null as xid8) >= cast(null as xid8)) then cast(null as lseg) else cast(null as lseg) end
                     > cast(null as lseg) then cast(null as jsonb) else cast(null as jsonb) end
                 <= case when (false) 
                  or (false) then cast(null as jsonb) else cast(null as jsonb) end
                )
        ON target_0.id = ref_0.relpages 
        WHEN NOT MATCHED
          AND cast(null as circle) <@ cast(null as circle)
           THEN DO NOTHING

2025-01-10 22:02:11.451 UTC [11846] LOG:  terminating any other active server processes
@m3hm3t
Copy link
Contributor Author

m3hm3t commented Jan 20, 2025

[local] citus@citus:9700-83795=# SELECT version();
┌───────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                version                                                │
├───────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PostgreSQL 16.3 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit │
└───────────────────────────────────────────────────────────────────────────────────────────────────────┘
(1 row)

Time: 10.758 ms

[local] citus@citus:9700-83795=# MERGE INTO dist_sc_1.schema_based_sharding_1 AS target_0
USING pg_catalog.pg_class AS ref_0
  INNER JOIN information_schema.check_constraints AS ref_1
  ON (
    CASE 
      WHEN 
        CASE 
          WHEN (ref_0.reltuples = (SELECT null_frac FROM pg_catalog.pg_stats LIMIT 1 OFFSET 4)) 
          OR (CAST(NULL AS xid8) >= CAST(NULL AS xid8)) 
          THEN CAST(NULL AS lseg) 
          ELSE CAST(NULL AS lseg) 
        END > CAST(NULL AS lseg) 
      THEN CAST(NULL AS jsonb) 
      ELSE CAST(NULL AS jsonb) 
    END 
    <= 
    CASE 
      WHEN FALSE OR FALSE 
      THEN CAST(NULL AS jsonb) 
      ELSE CAST(NULL AS jsonb) 
    END
  )
ON target_0.id = ref_0.relpages 
WHEN NOT MATCHED
  AND CAST(NULL AS circle) <@ CAST(NULL AS circle)
THEN DO NOTHING;
ERROR:  non-IMMUTABLE functions are not yet supported in MERGE sql with distributed tables
Time: 31.655 ms

@m3hm3t
Copy link
Contributor Author

m3hm3t commented Jan 20, 2025

[local] citus@citus:9700-89140=# SELECT version();
┌───────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                version                                                │
├───────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PostgreSQL 17.0 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit │
└───────────────────────────────────────────────────────────────────────────────────────────────────────┘
(1 row)

Time: 1.625 ms

[local] citus@citus:9700-88881=# MERGE INTO dist_sc_1.schema_based_sharding_1 AS target_0
USING pg_catalog.pg_class AS ref_0
  INNER JOIN information_schema.check_constraints AS ref_1
  ON (
    CASE 
      WHEN 
        CASE 
          WHEN (ref_0.reltuples = (SELECT null_frac FROM pg_catalog.pg_stats LIMIT 1 OFFSET 4)) 
          OR (CAST(NULL AS xid8) >= CAST(NULL AS xid8)) 
          THEN CAST(NULL AS lseg) 
          ELSE CAST(NULL AS lseg) 
        END > CAST(NULL AS lseg) 
      THEN CAST(NULL AS jsonb) 
      ELSE CAST(NULL AS jsonb) 
    END 
    <= 
    CASE 
      WHEN FALSE OR FALSE 
      THEN CAST(NULL AS jsonb) 
      ELSE CAST(NULL AS jsonb) 
    END
  )
ON target_0.id = ref_0.relpages 
WHEN NOT MATCHED
  AND CAST(NULL AS circle) <@ CAST(NULL AS circle)
THEN DO NOTHING;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
Time: 25.230 ms

[local] citus@citus:9700-7081=# SELECT version();
┌───────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                version                                                │
├───────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PostgreSQL 17.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit │
└───────────────────────────────────────────────────────────────────────────────────────────────────────┘
(1 row)

[local] citus@citus:9700-7081=# MERGE INTO dist_sc_1.schema_based_sharding_1 AS target_0
USING pg_catalog.pg_class AS ref_0
  INNER JOIN information_schema.check_constraints AS ref_1
  ON (
    CASE 
      WHEN 
        CASE 
          WHEN (ref_0.reltuples = (SELECT null_frac FROM pg_catalog.pg_stats LIMIT 1 OFFSET 4)) 
          OR (CAST(NULL AS xid8) >= CAST(NULL AS xid8)) 
          THEN CAST(NULL AS lseg) 
          ELSE CAST(NULL AS lseg) 
        END > CAST(NULL AS lseg) 
      THEN CAST(NULL AS jsonb) 
      ELSE CAST(NULL AS jsonb) 
    END 
    <= 
    CASE 
      WHEN FALSE OR FALSE 
      THEN CAST(NULL AS jsonb) 
      ELSE CAST(NULL AS jsonb) 
    END
  )
ON target_0.id = ref_0.relpages 
WHEN NOT MATCHED
  AND CAST(NULL AS circle) <@ CAST(NULL AS circle)
THEN DO NOTHING;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
Time: 24.990 ms

@m3hm3t
Copy link
Contributor Author

m3hm3t commented Jan 20, 2025

[local] citus@citus:9700-9626=# CREATE TABLE non_dist_table (id INTEGER);
CREATE TABLE
Time: 31.731 ms

[local] citus@citus:9700-9626=# MERGE INTO non_dist_table AS target_0
USING pg_catalog.pg_class AS ref_0
ON target_0.id = ref_0.relpages
WHEN NOT MATCHED THEN DO NOTHING;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
Time: 30.947 ms

2025-01-20 11:07:13.126 UTC [9626] LOG:  statement: MERGE INTO non_dist_table AS target_0
	USING pg_catalog.pg_class AS ref_0
	ON target_0.id = ref_0.relpages
	WHEN NOT MATCHED THEN DO NOTHING;
TRAP: failed Assert("root->hasLateralRTEs"), File: "initsplan.c", Line: 2238, PID: 9626
postgres: citus citus [local] MERGE(ExceptionalCondition+0x6e)[0x5590ae9d19bc]
postgres: citus citus [local] MERGE(+0x42f3e4)[0x5590ae7613e4]
postgres: citus citus [local] MERGE(+0x42f92d)[0x5590ae76192d]
postgres: citus citus [local] MERGE(+0x42fb75)[0x5590ae761b75]
postgres: citus citus [local] MERGE(deconstruct_jointree+0x10e)[0x5590ae7620ce]
postgres: citus citus [local] MERGE(query_planner+0x103)[0x5590ae76343e]
postgres: citus citus [local] MERGE(+0x439827)[0x5590ae76b827]
postgres: citus citus [local] MERGE(subquery_planner+0xb42)[0x5590ae76d3b6]
postgres: citus citus [local] MERGE(standard_planner+0x1a1)[0x5590ae76da1c]
/home/citus/.pgenv/pgsql-17.2/lib/citus.so(+0xd084e)[0x7f71589cf84e]
/home/citus/.pgenv/pgsql-17.2/lib/pg_stat_statements.so(+0x79d2)[0x7f7157dde9d2]
postgres: citus citus [local] MERGE(planner+0x16)[0x5590ae76e024]
postgres: citus citus [local] MERGE(pg_plan_query+0x48)[0x5590ae86a9b1]
postgres: citus citus [local] MERGE(pg_plan_queries+0x44)[0x5590ae86aa67]
postgres: citus citus [local] MERGE(+0x538f92)[0x5590ae86af92]
postgres: citus citus [local] MERGE(PostgresMain+0x7d8)[0x5590ae86d1a2]
postgres: citus citus [local] MERGE(BackendMain+0x51)[0x5590ae866235]
postgres: citus citus [local] MERGE(postmaster_child_launch+0xc7)[0x5590ae7b2dc2]
postgres: citus citus [local] MERGE(+0x4855ff)[0x5590ae7b75ff]
postgres: citus citus [local] MERGE(+0x485882)[0x5590ae7b7882]
postgres: citus citus [local] MERGE(BackgroundWorkerInitializeConnection+0x0)[0x5590ae7b8eea]
postgres: citus citus [local] MERGE(main+0x220)[0x5590ae6cc59f]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f715ac65d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f715ac65e40]
postgres: citus citus [local] MERGE(_start+0x25)[0x5590ae40ff05]
2025-01-20 11:07:13.155 UTC [6984] LOG:  server process (PID 9626) was terminated by signal 6: Aborted
2025-01-20 11:07:13.155 UTC [6984] DETAIL:  Failed process was running: MERGE INTO non_dist_table AS target_0
	USING pg_catalog.pg_class AS ref_0
	ON target_0.id = ref_0.relpages
	WHEN NOT MATCHED THEN DO NOTHING;
2025-01-20 11:07:13.155 UTC [6984] LOG:  terminating any other active server processes
2025-01-20 11:07:13.157 UTC [9803] FATAL:  the database system is in recovery mode

the error originates from this assertion in PostgreSQL 17's source code:

Assert(root->hasLateralRTEs);

The issue here seems to stem from the PostgreSQL planner's handling of the MERGE statement rather than any interaction with Citus or distributed table logic. Since the crash occurs even with a non-distributed table, it indicates a potential bug in PostgreSQL's query planning logic for MERGE statements.

@m3hm3t
Copy link
Contributor Author

m3hm3t commented Jan 20, 2025

libc.so.6!__pthread_kill_implementation(int no_tid, int signo, pthread_t threadid) (pthread_kill.c:44)
libc.so.6!__pthread_kill_internal(int signo, pthread_t threadid) (pthread_kill.c:78)
libc.so.6!__GI___pthread_kill(pthread_t threadid, int signo) (pthread_kill.c:89)
libc.so.6!__GI_raise(int sig) (raise.c:26)
libc.so.6!__GI_abort() (abort.c:79)
ExceptionalCondition(const char * conditionName, const char * fileName, int lineNumber) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\utils\error\assert.c:66)
distribute_qual_to_rels(PlannerInfo * root, Node * clause, JoinTreeItem * jtitem, SpecialJoinInfo * sjinfo, Index security_level, Relids qualscope, Relids ojscope, Relids outerjoin_nonnullable, Relids incompatible_relids, _Bool allow_equivalence, _Bool has_clone, _Bool is_clone, List ** postponed_oj_qual_list) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\optimizer\plan\initsplan.c:2238)
distribute_quals_to_rels(PlannerInfo * root, List * clauses, JoinTreeItem * jtitem, SpecialJoinInfo * sjinfo, Index security_level, Relids qualscope, Relids ojscope, Relids outerjoin_nonnullable, Relids incompatible_relids, _Bool allow_equivalence, _Bool has_clone, _Bool is_clone, List ** postponed_oj_qual_list) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\optimizer\plan\initsplan.c:2138)
deconstruct_distribute(PlannerInfo * root, JoinTreeItem * jtitem) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\optimizer\plan\initsplan.c:1154)
deconstruct_jointree(PlannerInfo * root) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\optimizer\plan\initsplan.c:782)
query_planner(PlannerInfo * root, query_pathkeys_callback qp_callback, void * qp_extra) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\optimizer\plan\planmain.c:188)
grouping_planner(PlannerInfo * root, double tuple_fraction, SetOperationStmt * setops) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\optimizer\plan\planner.c:1520)
subquery_planner(PlannerGlobal * glob, Query * parse, PlannerInfo * parent_root, _Bool hasRecursion, double tuple_fraction, SetOperationStmt * setops) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\optimizer\plan\planner.c:1089)
standard_planner(Query * parse, const char * query_string, int cursorOptions, ParamListInfo boundParams) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\optimizer\plan\planner.c:415)
citus.so!distributed_planner(Query * parse, const char * query_string, int cursorOptions, ParamListInfo boundParams) (\workspaces\citus\src\backend\distributed\planner\distributed_planner.c:272)
pg_stat_statements.so!pgss_planner(Query * parse, const char * query_string, int cursorOptions, ParamListInfo boundParams) (\home\citus\.pgenv\src\postgresql-17.2\contrib\pg_stat_statements\pg_stat_statements.c:962)
planner(Query * parse, const char * query_string, int cursorOptions, ParamListInfo boundParams) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\optimizer\plan\planner.c:280)
pg_plan_query(Query * querytree, const char * query_string, int cursorOptions, ParamListInfo boundParams) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\tcop\postgres.c:908)
pg_plan_queries(List * querytrees, const char * query_string, int cursorOptions, ParamListInfo boundParams) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\tcop\postgres.c:1000)
exec_simple_query(const char * query_string) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\tcop\postgres.c:1197)
PostgresMain(const char * dbname, const char * username) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\tcop\postgres.c:4767)
BackendMain(char * startup_data, size_t startup_data_len) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\tcop\backend_startup.c:105)
postmaster_child_launch(BackendType child_type, char * startup_data, size_t startup_data_len, ClientSocket * client_sock) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\postmaster\launch_backend.c:277)
BackendStartup(ClientSocket * client_sock) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\postmaster\postmaster.c:3593)
ServerLoop() (\home\citus\.pgenv\src\postgresql-17.2\src\backend\postmaster\postmaster.c:1674)
PostmasterMain(int argc, char ** argv) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\postmaster\postmaster.c:1372)
main(int argc, char ** argv) (\home\citus\.pgenv\src\postgresql-17.2\src\backend\main\main.c:197)

@m3hm3t
Copy link
Contributor Author

m3hm3t commented Jan 20, 2025

postgres=# SELECT version();
                                                       version
---------------------------------------------------------------------------------------------------------------------
 PostgreSQL 17.2 (Debian 17.2-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
(1 row)

postgres=# CREATE TABLE non_dist_table (id INTEGER);
CREATE TABLE
postgres=# MERGE INTO non_dist_table AS target_0
USING pg_catalog.pg_class AS ref_0
ON target_0.id = ref_0.relpages
WHEN NOT MATCHED THEN DO NOTHING;
MERGE 0
postgres=#

@m3hm3t m3hm3t linked a pull request Jan 20, 2025 that will close this issue
m3hm3t added a commit that referenced this issue Jan 21, 2025
This pull request addresses Issue #7846, where specific MERGE queries on
non-distributed and distributed tables can result in crashes in certain
scenarios. The issue stems from the usage of `pg_class` catalog table,
and the `FilterShardsFromPgclass` function in Citus. This function goes
through the query's jointree to hide the shards. However, in PG17,
MERGE's join quals are in a separate structure called
`mergeJoinCondition`. Therefore FilterShardsFromPgclass was not
filtering correctly in a `MERGE` command that involves `pg_class`. To
fix the issue, we handle `mergeJoinCondition` separately in PG17.

Relevant PG commit:

postgres/postgres@0294df2

**Non-Distributed Tables:**
A MERGE query involving a non-distributed table using
`pg_catalog.pg_class` as the source may execute successfully but needs
testing to ensure stability.

**Distributed Tables:**
Performing a MERGE on a distributed table using `pg_catalog.pg_class` as
the source raises an error:
`ERROR: MERGE INTO a distributed table from Postgres table is not yet
supported`
However, in some cases, this can lead to a server crash if the
unsupported operation is not properly handled.

This is the test output from the same test conducted prior to the code
changes being implemented.

```
-- Issue #7846: Test crash scenarios with MERGE on non-distributed and distributed tables
-- Step 1: Connect to a worker node to verify shard visibility
\c postgresql://postgres@localhost::worker_1_port/regression?application_name=psql
SET search_path TO pg17;
-- Step 2: Create and test a non-distributed table
CREATE TABLE non_dist_table_12345 (id INTEGER);
-- Test MERGE on the non-distributed table
MERGE INTO non_dist_table_12345 AS target_0
USING pg_catalog.pg_class AS ref_0
ON target_0.id = ref_0.relpages
WHEN NOT MATCHED THEN DO NOTHING;
SSL SYSCALL error: EOF detected
connection to server was lost
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants