-
Notifications
You must be signed in to change notification settings - Fork 891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Align gapfill bucket generation with time_bucket #6155
Conversation
@mkindahl, @gayyappan: please review this pull request.
|
Codecov Report
@@ Coverage Diff @@
## main #6155 +/- ##
==========================================
- Coverage 65.05% 65.03% -0.03%
==========================================
Files 246 246
Lines 56949 56955 +6
Branches 12620 12621 +1
==========================================
- Hits 37048 37040 -8
- Misses 18030 18066 +36
+ Partials 1871 1849 -22
... and 60 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Hmm the interval handling seems a bit convoluted, can't we initialize it with gapfill_interval and then add to it, no need to ever memset or palloc |
9f159b1
to
ed138f2
Compare
ed138f2
to
6827718
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the computation is correct. Doing a naïve version of the computation gives me a completely different result.
DateADTGetDatum(state->next_timestamp), | ||
IntervalPGetDatum(state->gapfill_interval)); | ||
DateADTGetDatum(state->gapfill_start), | ||
IntervalPGetDatum(state->next_offset)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am slightly unsure what was causing the previous issue, but it looks like this approach suffer from the same problem as the previous one, or I might be misunderstanding what difference it makes. Was the original problem due to a precision issue, or something else?
Don't you want to do a computation along the lines gapfill_start + n * gapfill_interval
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous implementation depended on previous timestamp to calculate the next one. This becomes a problem in case of variable bucket sizes like months when the bucket lands on the last day of the month (which is a different day every time).
The computation you suggest is the same one I'm doing here but instead of maintaining the position (n), I maintain the offset by incrementing it for every generated bucket.
My implementation seemed simpler in terms of calculating the next timestamp so that's why I went with it.
tsl/test/shared/expected/gapfill.out
Outdated
Sat Apr 29 15:00:00 2000 PDT | ||
Thu Jun 29 15:00:00 2000 PDT | ||
Tue Aug 29 15:00:00 2000 PDT | ||
Sun Oct 29 15:00:00 2000 PST | ||
Fri Dec 29 15:00:00 2000 PST | ||
Sun Apr 30 15:00:00 2000 PDT | ||
Fri Jun 30 15:00:00 2000 PDT | ||
Thu Aug 31 15:00:00 2000 PDT | ||
Tue Oct 31 15:00:00 2000 PST | ||
Sun Dec 31 15:00:00 2000 PST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that this is the last day of the month rather than the 29:th, but is correct? When I do a test I get the following result:
mats=# select day::date, day::date + '1 month'::interval, day::date + '2 month'::interval from (values ('2023-02-28'), ('2023-08-31'), ('2023-01-29')) as d(day);
day | ?column? | ?column?
------------+---------------------+---------------------
2023-02-28 | 2023-03-28 00:00:00 | 2023-04-28 00:00:00
2023-08-31 | 2023-09-30 00:00:00 | 2023-10-31 00:00:00
2023-01-29 | 2023-02-28 00:00:00 | 2023-03-29 00:00:00
(3 rows)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why is it not correct? If these are monthly buckets, should they not start/end on the last day of the month for every bucket?
Even your example shows that here:
2023-08-31 | 2023-09-30 00:00:00 | 2023-10-31 00:00:00
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, the case above is actually correct so it is not a good illustration of the problem I was wondering about.
I implemented this in PL/pgSQL to convince myself this works the way it should and it seems like that.
CREATE OR REPLACE FUNCTION advance_offset(start date, steps int, ival interval)
RETURNS date
LANGUAGE plpgsql AS $$
DECLARE
off interval;
count int := steps - 1;
BEGIN
off := ival;
LOOP
off := off + ival;
count := count - 1;
IF count = 0 THEN
EXIT;
END IF;
END LOOP;
RETURN start + off;
END
$$;
CREATE OR REPLACE FUNCTION advance_date(start date, steps int, ival interval)
RETURNS date
LANGUAGE plpgsql AS $$
DECLARE
ts date;
count int := steps;
BEGIN
ts := start;
LOOP
ts := ts + ival;
count := count - 1;
IF count = 0 THEN
EXIT;
END IF;
END LOOP;
RETURN ts;
END
$$;
Which then gives the following result:
mats=# SELECT start,
advance_date(start::date, 10, '1 month'::interval),
advance_offset(start::date, 10, '1 month'::interval),
CAST(start::date + 10 * '1 month'::interval AS date)
FROM (VALUES ('2023-01-28'), ('2023-01-29'), ('2023-01-30'), ('2023-01-31')) AS tbl(start);
start | advance_date | advance_offset | date
------------+--------------+----------------+------------
2023-01-28 | 2023-11-28 | 2023-11-28 | 2023-11-28
2023-01-29 | 2023-11-28 | 2023-11-29 | 2023-11-29
2023-01-30 | 2023-11-28 | 2023-11-30 | 2023-11-30
2023-01-31 | 2023-11-28 | 2023-11-30 | 2023-11-30
(4 rows)
The middle one is the original computation, which illustrates the problem, and the last one is the new implementation. I will approve the patch, but it would be good if you could use some extra dates to the test similar to the ones I have here so that we make sure that it works both when the date lands on the "last day of the month" as part of the computation and when the starting date is not the last day of the month (and the end result should not end on the last day of the month).
tsl/test/shared/expected/gapfill.out
Outdated
Sat Apr 29 15:00:00 2000 PDT | ||
Thu Jun 29 15:00:00 2000 PDT | ||
Tue Aug 29 15:00:00 2000 PDT | ||
Sun Oct 29 15:00:00 2000 PST | ||
Fri Dec 29 15:00:00 2000 PST | ||
Sun Apr 30 15:00:00 2000 PDT | ||
Fri Jun 30 15:00:00 2000 PDT | ||
Thu Aug 31 15:00:00 2000 PDT | ||
Tue Oct 31 15:00:00 2000 PST | ||
Sun Dec 31 15:00:00 2000 PST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, the case above is actually correct so it is not a good illustration of the problem I was wondering about.
I implemented this in PL/pgSQL to convince myself this works the way it should and it seems like that.
CREATE OR REPLACE FUNCTION advance_offset(start date, steps int, ival interval)
RETURNS date
LANGUAGE plpgsql AS $$
DECLARE
off interval;
count int := steps - 1;
BEGIN
off := ival;
LOOP
off := off + ival;
count := count - 1;
IF count = 0 THEN
EXIT;
END IF;
END LOOP;
RETURN start + off;
END
$$;
CREATE OR REPLACE FUNCTION advance_date(start date, steps int, ival interval)
RETURNS date
LANGUAGE plpgsql AS $$
DECLARE
ts date;
count int := steps;
BEGIN
ts := start;
LOOP
ts := ts + ival;
count := count - 1;
IF count = 0 THEN
EXIT;
END IF;
END LOOP;
RETURN ts;
END
$$;
Which then gives the following result:
mats=# SELECT start,
advance_date(start::date, 10, '1 month'::interval),
advance_offset(start::date, 10, '1 month'::interval),
CAST(start::date + 10 * '1 month'::interval AS date)
FROM (VALUES ('2023-01-28'), ('2023-01-29'), ('2023-01-30'), ('2023-01-31')) AS tbl(start);
start | advance_date | advance_offset | date
------------+--------------+----------------+------------
2023-01-28 | 2023-11-28 | 2023-11-28 | 2023-11-28
2023-01-29 | 2023-11-28 | 2023-11-29 | 2023-11-29
2023-01-30 | 2023-11-28 | 2023-11-30 | 2023-11-30
2023-01-31 | 2023-11-28 | 2023-11-30 | 2023-11-30
(4 rows)
The middle one is the original computation, which illustrates the problem, and the last one is the new implementation. I will approve the patch, but it would be good if you could use some extra dates to the test similar to the ones I have here so that we make sure that it works both when the date lands on the "last day of the month" as part of the computation and when the starting date is not the last day of the month (and the end result should not end on the last day of the month).
In certain scenarios, when generating buckets with monthly buckets and different timezones, gapfill would create timestamps which don't align with time_bucket and thus potentially generating multiple rows for an individual month. Instead of relying on previous timestamp to generate the next one, now we generate them always from the start point which will make us align with time_bucket buckets.
6827718
to
ef75c52
Compare
Automated backport to 2.12.x not done: cherry-pick failed. Git status
|
This release contains bug fixes since the 2.12.1 release. We recommend that you upgrade at the next available opportunity. **Bugfixes** * timescale#6155 Align gapfill bucket generation with time_bucket * timescale#6210 Fix EXPLAIN ANALYZE for compressed DML
This release contains bug fixes since the 2.12.1 release. We recommend that you upgrade at the next available opportunity. **Bugfixes** * timescale#6155 Align gapfill bucket generation with time_bucket * timescale#6181 Ensure fixed_schedule field is populated * timescale#6210 Fix EXPLAIN ANALYZE for compressed DML
This release contains bug fixes since the 2.12.1 release. We recommend that you upgrade at the next available opportunity. **Bugfixes** * timescale#6155 Align gapfill bucket generation with time_bucket * timescale#6181 Ensure fixed_schedule field is populated * timescale#6210 Fix EXPLAIN ANALYZE for compressed DML
This release contains bug fixes since the 2.12.1 release. We recommend that you upgrade at the next available opportunity. **Bugfixes** * timescale#6155 Align gapfill bucket generation with time_bucket * timescale#6181 Ensure fixed_schedule field is populated * timescale#6210 Fix EXPLAIN ANALYZE for compressed DML
This release contains bug fixes since the 2.12.1 release. We recommend that you upgrade at the next available opportunity. **Bugfixes** * timescale#6155 Align gapfill bucket generation with time_bucket * timescale#6181 Ensure fixed_schedule field is populated * timescale#6210 Fix EXPLAIN ANALYZE for compressed DML
In certain scenarios, when generating buckets with monthly buckets and different timezones, gapfill
would create timestamps which don't align with time_bucket and thus potentially generating multiple
rows for an individual month. Instead of relying on previous timestamp to generate the next one, now
we generate them always from the start point which will make us align with time_bucket buckets.