Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Align gapfill bucket generation with time_bucket #6155

Merged
merged 1 commit into from
Oct 19, 2023

Conversation

antekresic
Copy link
Contributor

In certain scenarios, when generating buckets with monthly buckets and different timezones, gapfill
would create timestamps which don't align with time_bucket and thus potentially generating multiple
rows for an individual month. Instead of relying on previous timestamp to generate the next one, now
we generate them always from the start point which will make us align with time_bucket buckets.

@antekresic antekresic added the bug label Oct 5, 2023
@antekresic antekresic requested a review from svenklemm October 5, 2023 12:28
@antekresic antekresic self-assigned this Oct 5, 2023
@github-actions github-actions bot requested review from gayyappan and mkindahl October 5, 2023 12:28
@github-actions
Copy link

github-actions bot commented Oct 5, 2023

@mkindahl, @gayyappan: please review this pull request.

Powered by pull-review

@codecov
Copy link

codecov bot commented Oct 5, 2023

Codecov Report

Merging #6155 (ef75c52) into main (a409065) will decrease coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #6155      +/-   ##
==========================================
- Coverage   65.05%   65.03%   -0.03%     
==========================================
  Files         246      246              
  Lines       56949    56955       +6     
  Branches    12620    12621       +1     
==========================================
- Hits        37048    37040       -8     
- Misses      18030    18066      +36     
+ Partials     1871     1849      -22     
Files Coverage Δ
tsl/src/nodes/gapfill/gapfill_exec.c 94.68% <100.00%> (+0.05%) ⬆️

... and 60 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@svenklemm svenklemm added this to the TimescaleDB 2.12.1 milestone Oct 5, 2023
@svenklemm
Copy link
Member

Hmm the interval handling seems a bit convoluted, can't we initialize it with gapfill_interval and then add to it, no need to ever memset or palloc

@antekresic antekresic force-pushed the gapfill-fix branch 2 times, most recently from 9f159b1 to ed138f2 Compare October 11, 2023 10:10
@antekresic antekresic requested a review from svenklemm October 11, 2023 10:11
Copy link
Contributor

@mkindahl mkindahl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the computation is correct. Doing a naïve version of the computation gives me a completely different result.

Comment on lines -643 to +644
DateADTGetDatum(state->next_timestamp),
IntervalPGetDatum(state->gapfill_interval));
DateADTGetDatum(state->gapfill_start),
IntervalPGetDatum(state->next_offset));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am slightly unsure what was causing the previous issue, but it looks like this approach suffer from the same problem as the previous one, or I might be misunderstanding what difference it makes. Was the original problem due to a precision issue, or something else?

Don't you want to do a computation along the lines gapfill_start + n * gapfill_interval?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous implementation depended on previous timestamp to calculate the next one. This becomes a problem in case of variable bucket sizes like months when the bucket lands on the last day of the month (which is a different day every time).

The computation you suggest is the same one I'm doing here but instead of maintaining the position (n), I maintain the offset by incrementing it for every generated bucket.

My implementation seemed simpler in terms of calculating the next timestamp so that's why I went with it.

Comment on lines 3295 to 3301
Sat Apr 29 15:00:00 2000 PDT
Thu Jun 29 15:00:00 2000 PDT
Tue Aug 29 15:00:00 2000 PDT
Sun Oct 29 15:00:00 2000 PST
Fri Dec 29 15:00:00 2000 PST
Sun Apr 30 15:00:00 2000 PDT
Fri Jun 30 15:00:00 2000 PDT
Thu Aug 31 15:00:00 2000 PDT
Tue Oct 31 15:00:00 2000 PST
Sun Dec 31 15:00:00 2000 PST
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this is the last day of the month rather than the 29:th, but is correct? When I do a test I get the following result:

mats=# select day::date, day::date + '1 month'::interval, day::date + '2 month'::interval from (values ('2023-02-28'), ('2023-08-31'), ('2023-01-29')) as d(day);
    day     |      ?column?       |      ?column?       
------------+---------------------+---------------------
 2023-02-28 | 2023-03-28 00:00:00 | 2023-04-28 00:00:00
 2023-08-31 | 2023-09-30 00:00:00 | 2023-10-31 00:00:00
 2023-01-29 | 2023-02-28 00:00:00 | 2023-03-29 00:00:00
(3 rows)

Copy link
Contributor Author

@antekresic antekresic Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why is it not correct? If these are monthly buckets, should they not start/end on the last day of the month for every bucket?

Even your example shows that here:
2023-08-31 | 2023-09-30 00:00:00 | 2023-10-31 00:00:00

Copy link
Contributor

@mkindahl mkindahl Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, the case above is actually correct so it is not a good illustration of the problem I was wondering about.

I implemented this in PL/pgSQL to convince myself this works the way it should and it seems like that.

CREATE OR REPLACE FUNCTION advance_offset(start date, steps int, ival interval)
RETURNS date
LANGUAGE plpgsql AS $$
DECLARE
    off interval;
    count int := steps - 1;
BEGIN
    off := ival;
    LOOP
        off := off + ival;
	count := count - 1;
	IF count = 0 THEN
	    EXIT;
	END IF;
    END LOOP;
    RETURN start + off;
END
$$;

CREATE OR REPLACE FUNCTION advance_date(start date, steps int, ival interval)
RETURNS date
LANGUAGE plpgsql AS $$
DECLARE
    ts date;
    count int := steps;
BEGIN
    ts := start;
    LOOP
        ts := ts + ival;
	count := count - 1;
	IF count = 0 THEN
	    EXIT;
	END IF;
    END LOOP;
    RETURN ts;
END
$$;

Which then gives the following result:

mats=# SELECT start,
       advance_date(start::date, 10, '1 month'::interval),
       advance_offset(start::date, 10, '1 month'::interval),
       CAST(start::date + 10 * '1 month'::interval AS date)
  FROM (VALUES ('2023-01-28'), ('2023-01-29'), ('2023-01-30'), ('2023-01-31')) AS tbl(start);
   start    | advance_date | advance_offset |    date    
------------+--------------+----------------+------------
 2023-01-28 | 2023-11-28   | 2023-11-28     | 2023-11-28
 2023-01-29 | 2023-11-28   | 2023-11-29     | 2023-11-29
 2023-01-30 | 2023-11-28   | 2023-11-30     | 2023-11-30
 2023-01-31 | 2023-11-28   | 2023-11-30     | 2023-11-30
(4 rows)

The middle one is the original computation, which illustrates the problem, and the last one is the new implementation. I will approve the patch, but it would be good if you could use some extra dates to the test similar to the ones I have here so that we make sure that it works both when the date lands on the "last day of the month" as part of the computation and when the starting date is not the last day of the month (and the end result should not end on the last day of the month).

Comment on lines 3295 to 3301
Sat Apr 29 15:00:00 2000 PDT
Thu Jun 29 15:00:00 2000 PDT
Tue Aug 29 15:00:00 2000 PDT
Sun Oct 29 15:00:00 2000 PST
Fri Dec 29 15:00:00 2000 PST
Sun Apr 30 15:00:00 2000 PDT
Fri Jun 30 15:00:00 2000 PDT
Thu Aug 31 15:00:00 2000 PDT
Tue Oct 31 15:00:00 2000 PST
Sun Dec 31 15:00:00 2000 PST
Copy link
Contributor

@mkindahl mkindahl Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, the case above is actually correct so it is not a good illustration of the problem I was wondering about.

I implemented this in PL/pgSQL to convince myself this works the way it should and it seems like that.

CREATE OR REPLACE FUNCTION advance_offset(start date, steps int, ival interval)
RETURNS date
LANGUAGE plpgsql AS $$
DECLARE
    off interval;
    count int := steps - 1;
BEGIN
    off := ival;
    LOOP
        off := off + ival;
	count := count - 1;
	IF count = 0 THEN
	    EXIT;
	END IF;
    END LOOP;
    RETURN start + off;
END
$$;

CREATE OR REPLACE FUNCTION advance_date(start date, steps int, ival interval)
RETURNS date
LANGUAGE plpgsql AS $$
DECLARE
    ts date;
    count int := steps;
BEGIN
    ts := start;
    LOOP
        ts := ts + ival;
	count := count - 1;
	IF count = 0 THEN
	    EXIT;
	END IF;
    END LOOP;
    RETURN ts;
END
$$;

Which then gives the following result:

mats=# SELECT start,
       advance_date(start::date, 10, '1 month'::interval),
       advance_offset(start::date, 10, '1 month'::interval),
       CAST(start::date + 10 * '1 month'::interval AS date)
  FROM (VALUES ('2023-01-28'), ('2023-01-29'), ('2023-01-30'), ('2023-01-31')) AS tbl(start);
   start    | advance_date | advance_offset |    date    
------------+--------------+----------------+------------
 2023-01-28 | 2023-11-28   | 2023-11-28     | 2023-11-28
 2023-01-29 | 2023-11-28   | 2023-11-29     | 2023-11-29
 2023-01-30 | 2023-11-28   | 2023-11-30     | 2023-11-30
 2023-01-31 | 2023-11-28   | 2023-11-30     | 2023-11-30
(4 rows)

The middle one is the original computation, which illustrates the problem, and the last one is the new implementation. I will approve the patch, but it would be good if you could use some extra dates to the test similar to the ones I have here so that we make sure that it works both when the date lands on the "last day of the month" as part of the computation and when the starting date is not the last day of the month (and the end result should not end on the last day of the month).

In certain scenarios, when generating buckets with
monthly buckets and different timezones, gapfill
would create timestamps which don't align with
time_bucket and thus potentially generating multiple
rows for an individual month. Instead of relying on
previous timestamp to generate the next one, now
we generate them always from the start point
which will make us align with time_bucket buckets.
@antekresic antekresic enabled auto-merge (rebase) October 19, 2023 09:35
@antekresic antekresic merged commit 65ecda9 into timescale:main Oct 19, 2023
36 checks passed
@timescale-automation
Copy link

Automated backport to 2.12.x not done: cherry-pick failed.

Git status

HEAD detached at origin/2.12.x
You are currently cherry-picking commit 65ecda93e.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   .unreleased/PR_6155
	modified:   tsl/src/nodes/gapfill/gapfill_exec.c
	modified:   tsl/src/nodes/gapfill/gapfill_internal.h
	modified:   tsl/test/shared/expected/gapfill.out
	modified:   tsl/test/shared/sql/gapfill.sql

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	deleted by us:   tsl/test/shared/expected/gapfill-14.out
	deleted by us:   tsl/test/shared/expected/gapfill-15.out
	deleted by us:   tsl/test/shared/expected/gapfill-16.out


Job log

@timescale-automation timescale-automation added the auto-backport-not-done Automated backport of this PR has failed non-retriably (e.g. conflicts) label Oct 19, 2023
svenklemm added a commit to svenklemm/timescaledb that referenced this pull request Oct 19, 2023
This release contains bug fixes since the 2.12.1 release.
We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#6155 Align gapfill bucket generation with time_bucket
* timescale#6210 Fix EXPLAIN ANALYZE for compressed DML
@svenklemm svenklemm mentioned this pull request Oct 19, 2023
svenklemm added a commit to svenklemm/timescaledb that referenced this pull request Oct 19, 2023
This release contains bug fixes since the 2.12.1 release.
We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#6155 Align gapfill bucket generation with time_bucket
* timescale#6181 Ensure fixed_schedule field is populated
* timescale#6210 Fix EXPLAIN ANALYZE for compressed DML
svenklemm added a commit that referenced this pull request Oct 19, 2023
This release contains bug fixes since the 2.12.1 release.
We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* #6155 Align gapfill bucket generation with time_bucket
* #6181 Ensure fixed_schedule field is populated
* #6210 Fix EXPLAIN ANALYZE for compressed DML
svenklemm added a commit to svenklemm/timescaledb that referenced this pull request Oct 19, 2023
This release contains bug fixes since the 2.12.1 release.
We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#6155 Align gapfill bucket generation with time_bucket
* timescale#6181 Ensure fixed_schedule field is populated
* timescale#6210 Fix EXPLAIN ANALYZE for compressed DML
@svenklemm svenklemm mentioned this pull request Oct 19, 2023
svenklemm added a commit that referenced this pull request Oct 19, 2023
This release contains bug fixes since the 2.12.1 release.
We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* #6155 Align gapfill bucket generation with time_bucket
* #6181 Ensure fixed_schedule field is populated
* #6210 Fix EXPLAIN ANALYZE for compressed DML
jnidzwetzki pushed a commit to jnidzwetzki/timescaledb that referenced this pull request Nov 9, 2023
This release contains bug fixes since the 2.12.1 release.
We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#6155 Align gapfill bucket generation with time_bucket
* timescale#6181 Ensure fixed_schedule field is populated
* timescale#6210 Fix EXPLAIN ANALYZE for compressed DML
jnidzwetzki pushed a commit to jnidzwetzki/timescaledb that referenced this pull request Nov 19, 2023
This release contains bug fixes since the 2.12.1 release.
We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#6155 Align gapfill bucket generation with time_bucket
* timescale#6181 Ensure fixed_schedule field is populated
* timescale#6210 Fix EXPLAIN ANALYZE for compressed DML
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport-not-done Automated backport of this PR has failed non-retriably (e.g. conflicts) backported-2.12.x bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants