-
Notifications
You must be signed in to change notification settings - Fork 888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Could not find pathkey item to sort #2232
Comments
Do you still get this with the latest version? If so could you provide a minimal test script that reproduces the behaviour in a fresh database. |
I've been unable to produce a repro script. I do not know what manner of data poisons the planner thusly. There are no null values or anything particularly interesting among the data and it crashes long before any tuples are considered, during a simple |
This error vacillates around the Of particular note: When using explicit time stamps in the filter this happens. If you replace the equivalent explicit time filters with |
Is there any fix around this issue ? Because we have such an issue in our monitoring (grafana) and we see it often. When time passes for example two hours, it works again, we are using time_bucket and group by based on that. Also I checked it now, while I remove other conditions it works ! Not working:
Working:
|
For a workaround instead of |
Thank you very much, by the way if there are more information required to fix the issue, I can try my best to reproduce it using a SQL file. |
I just ran into this issue as well when attempting to order with a join:
I tried the workaround suggested above, but (perhaps obviously) it produces the wrong aggregates even though it does not throw an error:
In this particular case I'm not using Grafana and the result set isn't very big I'm able to workaround by sorting outside of Timescaledb by dropping the
@WarriorOfWire thanks for reporting and suggesting the workaround above. |
@edsinclair Are you using
|
Getting similar issues on Timescale 2.0.0, for example: CREATE TABLE bugs (
observed timestamptz not null,
metric bigint not null
);
SELECT create_hypertable('bugs', 'observed', chunk_time_interval => '1 week'::interval);
INSERT INTO bugs VALUES ('2021-01-12T14:00', 4);
INSERT INTO bugs VALUES ('2021-01-19T14:01', 4);
INSERT INTO bugs VALUES ('2021-01-19T14:05', 6); And then running the following query: SELECT
time_bucket_gapfill('60 seconds', observed) as observed
FROM
bugs
WHERE
observed >= '2021-01-18T14:05'
AND observed <= now()
GROUP BY
1
ORDER BY
1 DESC; Results in: ERROR: XX000: could not find pathkey item to sort
LOCATION: prepare_sort_from_pathkeys, createplan.c:5896 PostgreSQL 12.5 with TimescaleDB 2.0.0 |
Got this result with PG12.5 and current master running using timescaledb test-suite, but was able to reproduce using psql. +CREATE TABLE bugs (
+ observed timestamptz not null,
+ metric bigint not null
+);
+SELECT create_hypertable('bugs', 'observed', chunk_time_interval => '1 week'::interval);
+ create_hypertable
+-------------------
+ (2,public,bugs,t)
+(1 row)
+
+INSERT INTO bugs VALUES ('2021-01-12T14:00', 4);
+INSERT INTO bugs VALUES ('2021-01-19T14:01', 4);
+INSERT INTO bugs VALUES ('2021-01-19T14:05', 6);
+SELECT
+ time_bucket_gapfill('60 seconds', observed) as observed
+FROM
+ bugs
+WHERE
+ observed >= '2021-01-18T14:05'
+ AND observed <= now()
+GROUP BY
+ 1
+ORDER BY
+ 1 DESC;
+ observed
+----------
+(0 rows)
+ |
This change fixes a bug with gapfill that lead to certain query plans failing with the error "could not find pathkey item to sort". This was caused by a corruption in the query plan caused by removing the last two arguments to the `time_bucket_gapfill` during query planning. Since the function's expression was modified via a reference to the original expression in the query plan, it affected also the target list. The error occurred because the planner couldn't match the function (including arguments) with the corresponding equivalence member (which still included the two removed arguments). The reason the two arguments were removed, originally, was to avoid passing them on to `time_bucket`, which is called internally by `time_bucket_gapfill`. However, the last to arguments aren't passed on anyway, so it isn't necessary to modify the original argument list. Fixes timescale#2232
This change fixes a bug with gapfill that caused certain query plans to fail with the error "could not find pathkey item to sort". This was caused by a corruption in the query plan which happened as a result of removing the last two arguments to the `time_bucket_gapfill` function expression during query planning. Since the function expression was modified via a reference to the original query plan tree, it affected also the expression in the target list. When the planner couldn't match the target list with the corresponding equivalence member (which still included the two removed arguments), the error was generated. The original reason for removing the two arguments was to avoid passing them on to `time_bucket`, which is called internally by `time_bucket_gapfill`. However, the last to arguments aren't passed on anyway, so it isn't necessary to modify the original argument list. Fixes timescale#2232
This change fixes a bug with gapfill that caused certain query plans to fail with the error "could not find pathkey item to sort". This was caused by a corruption in the query plan which happened as a result of removing the last two arguments to the `time_bucket_gapfill` function expression during query planning. Since the function expression was modified via a reference to the original query plan tree, it affected also the expression in the target list. When the planner couldn't match the target list with the corresponding equivalence member (which still included the two removed arguments), the error was generated. The original reason for removing the two arguments was to avoid passing them on to `time_bucket`, which is called internally by `time_bucket_gapfill`. However, the last to arguments aren't passed on anyway, so it isn't necessary to modify the original argument list. Fixes timescale#2232
This change fixes a bug with gapfill that caused certain query plans to fail with the error "could not find pathkey item to sort". This was caused by a corruption in the query plan which happened as a result of removing the last two arguments to the `time_bucket_gapfill` function expression during query planning. Since the function expression was modified via a reference to the original query plan tree, it affected also the expression in the target list. When the planner couldn't match the target list with the corresponding equivalence member (which still included the two removed arguments), the error was generated. The original reason for removing the two arguments was to avoid passing them on to `time_bucket`, which is called internally by `time_bucket_gapfill`. However, the last to arguments aren't passed on anyway, so it isn't necessary to modify the original argument list. Fixes timescale#2232
This change fixes a bug with gapfill that caused certain query plans to fail with the error "could not find pathkey item to sort". This was caused by a corruption in the query plan which happened as a result of removing the last two arguments to the `time_bucket_gapfill` function expression during query planning. Since the function expression was modified via a reference to the original query plan tree, it affected also the expression in the target list. When the planner couldn't match the target list with the corresponding equivalence member (which still included the two removed arguments), the error was generated. The original reason for removing the two arguments was to avoid passing them on to `time_bucket`, which is called internally by `time_bucket_gapfill`. However, the last to arguments aren't passed on anyway, so it isn't necessary to modify the original argument list. Fixes timescale#2232
This change fixes a bug with gapfill that caused certain query plans to fail with the error "could not find pathkey item to sort". This was caused by a corruption in the query plan which happened as a result of removing the last two arguments to the `time_bucket_gapfill` function expression during query planning. Since the function expression was modified via a reference to the original query plan tree, it affected also the expression in the target list. When the planner couldn't match the target list with the corresponding equivalence member (which still included the two removed arguments), the error was generated. The original reason for removing the two arguments was to avoid passing them on to `time_bucket`, which is called internally by `time_bucket_gapfill`. However, the last to arguments aren't passed on anyway, so it isn't necessary to modify the original argument list. Fixes timescale#2232
This change fixes a bug with gapfill that caused certain query plans to fail with the error "could not find pathkey item to sort". This was caused by a corruption in the query plan which happened as a result of removing the last two arguments to the `time_bucket_gapfill` function expression during query planning. Since the function expression was modified via a reference to the original query plan tree, it affected also the expression in the target list. When the planner couldn't match the target list with the corresponding equivalence member (which still included the two removed arguments), the error was generated. The original reason for removing the two arguments was to avoid passing them on to `time_bucket`, which is called internally by `time_bucket_gapfill`. However, the last to arguments aren't passed on anyway, so it isn't necessary to modify the original argument list. Fixes timescale#2232
This change fixes a bug with gapfill that caused certain query plans to fail with the error "could not find pathkey item to sort". This was caused by a corruption in the query plan which happened as a result of removing the last two arguments to the `time_bucket_gapfill` function expression during query planning. Since the function expression was modified via a reference to the original query plan tree, it affected also the expression in the target list. When the planner couldn't match the target list with the corresponding equivalence member (which still included the two removed arguments), the error was generated. The original reason for removing the two arguments was to avoid passing them on to `time_bucket`, which is called internally by `time_bucket_gapfill`. However, the last to arguments aren't passed on anyway, so it isn't necessary to modify the original argument list. Fixes timescale#2232
This change fixes a bug with gapfill that caused certain query plans to fail with the error "could not find pathkey item to sort". This was caused by a corruption in the query plan which happened as a result of removing the last two arguments to the `time_bucket_gapfill` function expression during query planning. Since the function expression was modified via a reference to the original query plan tree, it affected also the expression in the target list. When the planner couldn't match the target list with the corresponding equivalence member (which still included the two removed arguments), the error was generated. The original reason for removing the two arguments was to avoid passing them on to `time_bucket`, which is called internally by `time_bucket_gapfill`. However, the last to arguments aren't passed on anyway, so it isn't necessary to modify the original argument list. Fixes timescale#2232
This change fixes a bug with gapfill that caused certain query plans to fail with the error "could not find pathkey item to sort". This was caused by a corruption in the query plan which happened as a result of removing the last two arguments to the `time_bucket_gapfill` function expression during query planning. Since the function expression was modified via a reference to the original query plan tree, it affected also the expression in the target list. When the planner couldn't match the target list with the corresponding equivalence member (which still included the two removed arguments), the error was generated. The original reason for removing the two arguments was to avoid passing them on to `time_bucket`, which is called internally by `time_bucket_gapfill`. However, the last to arguments aren't passed on anyway, so it isn't necessary to modify the original argument list. Fixes timescale#2232
This change fixes a bug with gapfill that caused certain query plans to fail with the error "could not find pathkey item to sort". This was caused by a corruption in the query plan which happened as a result of removing the last two arguments to the `time_bucket_gapfill` function expression during query planning. Since the function expression was modified via a reference to the original query plan tree, it affected also the expression in the target list. When the planner couldn't match the target list with the corresponding equivalence member (which still included the two removed arguments), the error was generated. The original reason for removing the two arguments was to avoid passing them on to `time_bucket`, which is called internally by `time_bucket_gapfill`. However, the last to arguments aren't passed on anyway, so it isn't necessary to modify the original argument list. Fixes #2232
This change fixes a bug with gapfill that caused certain query plans to fail with the error "could not find pathkey item to sort". This was caused by a corruption in the query plan which happened as a result of removing the last two arguments to the `time_bucket_gapfill` function expression during query planning. Since the function expression was modified via a reference to the original query plan tree, it affected also the expression in the target list. When the planner couldn't match the target list with the corresponding equivalence member (which still included the two removed arguments), the error was generated. The original reason for removing the two arguments was to avoid passing them on to `time_bucket`, which is called internally by `time_bucket_gapfill`. However, the last to arguments aren't passed on anyway, so it isn't necessary to modify the original argument list. Fixes #2232
Hmm, my query has evolved and been simplified and casting to ::INTEGER is no longer needed and it's not reproducing. If it happens again I will provide steps to reproduce, or possibly I can dig through the history... |
Dear Author, This issue has been automatically marked as stale due to lack of activity. With only the issue description that is currently provided, we do not have enough information to take action. If you have or find the answers we would need, please reach out. Otherwise, this issue will be closed in 30 days. Thank you! |
Hi! Postgres 15, TimescaleDB 2.12 - 2.13 The query contain next parts: ...
FROM hypertable AS ht
INNER JOIN postgrestable AS pt ON pt.id = ht.id
WHERE ht.ts >= TIMESTAMP '2023-10-01' AND ht.ts < TIMESTAMP '2023-11-15'
GROUP BY ... Mentioned period gives After recreation of table with INTEGER fields instead of SMALLINT the query works on any period. Also this error occurs when JOIN or/and GROUP BY contain field with NULL values. As a workaround I used COALESCE(field, 0) |
Hey, Any chance we can get more details in order to reproduce this issue? Thanks. |
For some examples, you can check out the original post, this post, this post, or this post. There’s quite a bit of information in here:
FYI, the original issue here is not limited to gapfill queries and this issue should not have been closed with that rationale #2232 (comment) |
@kvc0 thanks for the in-depth explanation of the history of the issue. I do have a suggestion for you (or anybody else) when you come across this issue, try doing: If that fixes the issue, it helps narrow down the root cause. Thanks! |
That’s a great help to try! When I was still maintaining a large timescale database restarting the service would make the error go away for a while. So having an online setting adjustment that helps narrow it is super helpful! Are there any other helpful online settings to try? I don’t have an affected database currently ($last_job) but I’m sure others would be happy to try. And I will too if my small NAS timescaledb hits this issue ;-) |
That's about it for now, hopefully we get new least so we can get to the bottom of this. |
I just ran into this bug, and I thought I had a reproduction for you but it disappeared on me... In the hope that this can be helpful: on the period of time where the bug appeared, I was able to make the bug happen or not by changing one line in my query. I'll give the full query (because I was unable to drill down to a minimal example), even though I doubt most of the columns are relevant. This triggers the bug: select
m.sender,
coalesce(sum(m.recipients), 0) tot_rcpt,
coalesce(sum(m.recipients) filter (where m.accepted), 0) acc_rcpt,
coalesce(sum(m.recipients) filter (where not m.accepted), 0) rej_rcpt,
to_char(coalesce(sum(m.recipients) filter (where not m.accepted) / (sum(m.recipients)::float + 0.001)), '0.00') rej_ratio,
coalesce(count(m.*), 0) tot_msg,
coalesce(count(m.*) filter (where m.accepted), 0) acc_msg,
coalesce(count(m.*) filter (where not m.accepted), 0) rej_msg,
count(c.*) conns,
count(distinct(c.host)) distinct_hosts
from
messages m
join connections c on (c.icid = m.icid)
where
not m.sender_is_list
and m.sender_dom = 'xxxxxx'
and m.time > '2023-12-15'
group by
m.sender But removing the line count(distinct(c.host)) distinct_hosts (and the "," above) made the bug disappear. I was able to go back and forth between the two queries, and reliably trigger the bug or not. So it was not a fluke at the time I tested it. However, I also had an (apparently wrong) hypothesis that it would be related to something else: I'm working on a log parsing script that populates the db with data, and it appeared to me that it seemed to happen only when I parsed a single day of the testing logs I had... So I mistakenly went for recreating the db a few times, and the bug disappeared in both conditions (single day of logs or multiple days), and now I seem unable to experiment and reproduce it. I also unfortunately was unable to try the suggested |
It happened to me again! I was able to narrow down a bit more stuff on working / non-working queries, and try
select
count(c.*),
count(distinct(c.host)) hosts
from
messages m
join connections c on (c.icid = m.icid)
where
m.time > now() - interval '1 hours'
group by
m.sender
;
count(distinct(c.host)) hosts effectively makes the bug disappear.
select
count(c.*),
count(distinct(c.host)) hosts
from
connections c
where
c.time > now() - interval '1 hours'
;
select
count(c.*),
count(distinct(c.host)) hosts
from
messages m
left join connections c on (c.icid = m.icid)
where
m.time > now() - interval '1 hours'
group by
m.sender
;
I have to leave, but for now, the issue did not disappear by itself by ingesting a bit more data. I have not tried recreating the db yet. It will probably make it disappear I guess. If you have more ideas on things to try, don't hesitate. For a little bit more context if that can help: I'm ingesting logs, putting them in the db. But as I'm "backfilling" a few days / weeks of logs, my program is applying timescale compression "manually" after each day of log file ingested, to save space. There are probably a few things being inserted into already compressed chunks... No idea if that is in any way related... |
Hi @nguiard, Thanks for the detailed information. Is it possible to share the definition of the hypertable as well? Do you use a compressed hypertable or are all chunks uncompressed? When the issue happens the next time, could you share the generated query plan (this can be done by executing |
Hi @jnidzwetzki, thanks a lot for your answer. It seems I have the issue happening right now :)
I can share the related table definitions (with a tiny bit of redaction): create extension plpgsql_check;
create extension pg_stat_statements;
create extension timescaledb;
-----------------------------------------------
--
create table connections (
time timestamptz not null,
icid int8 not null,
ip inet not null,
host varchar(512) not null,
country varchar(512) ,
accepted boolean not null default false,
relay boolean not null default false,
is_exchange boolean generated always as
(ip << '10.xx.xx/24' or ip << '10.xx.xx/24') stored
);
select create_hypertable('connections', by_range('time', interval '1 day'));
alter table connections set (
timescaledb.compress,
timescaledb.compress_segmentby = 'host',
timescaledb.compress_orderby = 'icid desc'
);
select add_compression_policy('connections', interval '3 days');
select add_retention_policy('connections', interval '1 year');
create index on connections (icid, time desc);
create index on connections (ip, time desc);
create index on connections (host, time desc);
create index on connections (country, time desc);
create index on connections (accepted, time desc);
create index on connections (relay, time desc);
create index on connections (is_exchange, time desc);
-----------------------------------------------
--
create table messages (
time timestamptz not null,
mid int8 not null,
icid int8 not null,
sender varchar(512) ,
sender_dom varchar(256) generated always as
(split_part(sender, '@', 2)) stored,
sender_is_list boolean generated always as
(right(sender, 3) = 'xxx'
or right(sender, 3) = 'yyy') stored,
accepted boolean default false,
aborted boolean default false,
avg_fl_entropy float4 ,
recipients int2 ,
subject varchar(1024) ,
spf_helo varchar(512) ,
spf_helo_st varchar(16) ,
spf_mailfrom varchar(512) ,
spf_mailfrom_st varchar(16)
);
select create_hypertable('messages', by_range('time', interval '1 day'));
alter table messages set (
timescaledb.compress,
timescaledb.compress_segmentby = 'sender',
timescaledb.compress_orderby = 'mid desc'
);
select add_compression_policy('messages', interval '3 days');
select add_retention_policy('messages', interval '1 year');
create index on messages (icid, time desc);
create index on messages (mid, time desc);
create index on messages (sender, time desc);
create index on messages (sender_dom, time desc);
create index on messages (sender_is_list, time desc);
create index on messages (accepted, time desc);
create index on messages (aborted, time desc);
create index on messages (recipients, time desc);
create index on messages (avg_fl_entropy, time desc);
create index on messages (subject, time desc);
create index on messages (sender, subject, time desc);
create index on messages (sender, accepted, time desc);
create index on messages (spf_helo, time desc);
create index on messages (spf_helo_st, time desc);
create index on messages (spf_mailfrom, time desc);
create index on messages (spf_mailfrom_st, time desc);
-- I also have a msg_recipients table but I highly doubt it's relevant, since the failing queries don't use it.
-- And a few view... When I'm "backfilling" (reading for example 30 days of logs) I run this function after each log file (corresponding to 1 day of logs) is parsed and loaded into the db (the files are parsed sequentially, in ascending time order, working up to current day): create or replace function backfill_maintenance() returns void
language plpgsql as $$
begin
perform compress_chunk(i, if_not_compressed => true)
from show_chunks('connections', older_than => interval '3 days') i;
perform compress_chunk(i, if_not_compressed => true)
from show_chunks('messages', older_than => interval '3 days') i;
perform compress_chunk(i, if_not_compressed => true)
from show_chunks('msg_recipients', older_than => interval '3 days') i;
end
$$; This is so:
However, as the log files are not stopped exactly at midnight, I clearly do have some inserts happening on already compressed chunks. I have no idea if this is related to the issue but at least you have a bit of context. Seems I have the issue right now on the live db. Switching to a Tell me if you want me to try something or have an further idea! :) (edit: I had an unclosed parenthesis in the messages table, fixed) |
Hi @nguiard, Thanks for all the information. A college of mine made me aware that this problem could be triggered by a new optimization in PostgreSQL 16. Could you disable the setting |
Hey @jnidzwetzki, thanks for the suggestion. I just tried it and it did not solve the issue (still same behaviour, getting |
Hi @nguiard, Thanks for getting back. I tried to reproduce the behavior with PG 16 in my local environment, but I was not able to trigger the error so far. Could you perform the following steps when the error happens again:
|
Closing this since it is an issue for a very old version. Feel free to open new ticket if this can be reproduced in most recent version. |
Relevant system information:
Describe the bug
When querying a hypertable and
GROUP BY
the hypertable's time column, the width of the time bucket determines whether the postgresql planner returns the errorCould not find pathkey item to sort
.To Reproduce
It is not known if this is reproducible on other environments.
Example failure, note the time_bucket value:
Example success, note the time bucket value:
Expected behavior
For any valid width of time_bucket, the planner can produce a query plan. Alternatively, date_trunc can work with both
minute
andhour
.Actual behavior
Any grouping of the hypertable's time column that is larger than 83021 milliseconds causes the query planner to error.
Additional context
Certain selections of the table for this time range do this while others do not.
text
ordouble precision
.The text was updated successfully, but these errors were encountered: