
[Bug]: Compressed chunks getting inflated after upgrading to 2.12.2 #6255

Closed
fetchezar opened this issue Oct 31, 2023 · 13 comments · Fixed by #6343


What type of bug is this?

Unexpected error

What subsystems and features are affected?

Background worker, Compression

What happened?

Hi! I'm following up after reporting this on the Timescale Slack server. This issue prevents the automatic compression policies from working normally, prevents safe ingestion of old data into the architecture, and, due to the exponential increase in disk size, makes it hard to keep the database and its streaming replicas working.

After upgrading from 2.10.0 to 2.12.2, it seems the compression policy worker is "inflating" old compressed chunks without actually accomplishing anything, and failing on a deadlock while doing so. The behavior we see is the worker grabbing a partially compressed chunk, "inflating" the related uncompressed chunk table to its previously uncompressed size, and then failing to recompress anything, with a deadlock error showing in the log.

In the Postgres log it also seems I'm missing the exception details from two of the exception handlers in _timescaledb_functions.policy_compression_execute, which both raise:

USING DETAIL = format('Message: (%s), Detail: (%s).', _message, _detail),

It looks like _message and _detail aren't set there, unlike in the handler above them, which raises the same line:

USING DETAIL = format('Message: (%s), Detail: (%s).', _message, _detail),

Hence the log shows the deadlock message and detail in one case, but empty "()" messages and details in the others.
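As a minimal illustration (not the actual policy code), the handlers would need a GET STACKED DIAGNOSTICS call to populate those variables; when it is missing, _message and _detail stay NULL and format() renders them as empty parentheses:

DO $$
DECLARE
    _message text;
    _detail  text;
BEGIN
    PERFORM 1 / 0;  -- provoke an error
EXCEPTION WHEN OTHERS THEN
    -- Without this GET STACKED DIAGNOSTICS call, _message and _detail remain NULL
    -- and the DETAIL below prints as "Message: (), Detail: ()."
    GET STACKED DIAGNOSTICS
        _message = MESSAGE_TEXT,
        _detail  = PG_EXCEPTION_DETAIL;
    RAISE WARNING 'operation failed'
        USING DETAIL = format('Message: (%s), Detail: (%s).', _message, _detail);
END
$$;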

These are the statistics of one of the partially compressed chunks for example:

(screenshot: 20231027_chunk_1557_deadtup_after_analyze)

An uncompressed chunk is roughly 14GB, covering 4 days of data, ~30 million rows per day, ~120 million rows total over the span of the chunk, which ends up compressed to 900MB. It's a very narrow schema; the table definition is the following, with compression segmented by the id column:

 CREATE TABLE public.sample (
	id int8 NOT NULL,
	tstamp timestamptz NOT NULL,
	value_int64 int8 NULL,
	value_double float8 NULL,
	value_float float4 NULL,
	value_bool bool NULL,
	quality bool NOT NULL,
	CONSTRAINT sample_pkey PRIMARY KEY (id, tstamp),
	CONSTRAINT sample_id_fkey FOREIGN KEY (id) REFERENCES public.signal_base(id)
);
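For reference, the compression configuration is along these lines, segmented by id as described (the orderby column and the policy interval shown here are assumptions, for illustration only):

-- Compression settings; segmentby = 'id' is as described above,
-- the orderby column and policy interval are illustrative assumptions.
ALTER TABLE public.sample SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'id',
    timescaledb.compress_orderby = 'tstamp'
);
SELECT add_compression_policy('public.sample', INTERVAL '8 days');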

The number of ids present in a single chunk is on the order of 300k, which gives around 400 rows per signal id on average. These chunks get very, very few sporadic inserts of old data: a couple hundred rows inserted, compared to the existing ~120 million rows in the span of the entire chunk. From what I see, around 10 chunks are partially compressed, out of the couple hundred we have in total.

The business use case makes insertion of old data almost nonexistent (we continually ingest oil field installation sensor data, continually polled from the field). Deletes/updates over old data are a use case we outright don't perform or support within the system. There is the odd case where some particular instrument is a little more complicated and has a log that could possibly contain some older values, but that's it. It's impossible that we've reinserted 4 days of values again over the same compressed chunk without noticing, as the statistics say (240 million inserts); I doubt the system could even handle it, so it makes no sense for the background job to inflate the chunk to its full size. At most it should inflate the impacted segments, which for a couple hundred uncompressed samples inserted post-compression probably means fewer than a hundred id segments were affected.

After the compression policy job is done with a partially compressed chunk (I verified the status flag is 9 in the chunks catalog, which I understand means compressed | partially_compressed), we end up with a compressed chunk holding the 120 million rows at 900MB, plus an uncompressed 14GB chunk with only a couple hundred live rows. So practically all of it is "dead" tuples for Postgres.
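The status flag can be checked directly in the catalog with a query along these lines (assuming the usual bit layout, where 1 = compressed and 8 = partially compressed, so 9 = both):

-- Find chunks flagged as partially compressed (status bit 8).
SELECT schema_name, table_name, status
FROM _timescaledb_catalog.chunk
WHERE status & 8 > 0;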

If we perform a VACUUM (FULL) of the chunk, its size returns to normal, but we're still left with most of the data in the compressed chunk, plus barely a couple hundred uncompressed rows in this other adjacent chunk table. This means that if we enable the compression policy again and the background worker picks up this chunk, it tries to do exactly the same thing, that is, inflate it to 14GB, blowing up our disk usage and itself in the process. So for now we've disabled all automatic compression policies.
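Disabling the policies amounts to pausing the corresponding background jobs, roughly like this (the job id is the one that appears in the log further below):

-- Look up the compression policy jobs...
SELECT job_id, hypertable_name, scheduled
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_compression';

-- ...and pause one of them (1052 is the job id from the log below).
SELECT alter_job(1052, scheduled => false);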

recompress_chunk doesn't finish. It locks the table, with Postgres registering an ever-increasing number of sessions, and after almost 2h I have to stop the function because of this. This table is almost constantly queried.

We have at least one other database, albeit smaller, with the exact same problem: the background job inflates a couple of partially compressed chunks to their full size, blowing up the disk usage of the master and the streaming replica and causing issues on both systems.

TimescaleDB version affected

2.12.2

PostgreSQL version used

14.2

What operating system did you use?

Debian 10 x64

What installation method did you use?

Deb/Apt

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

2023-10-26 06:04:07 -03 [30519]WARNING:  decompressing chunk "_timescaledb_internal._hyper_91_1553_chunk" failed when compression policy is executed
2023-10-26 06:04:07 -03 [30519]DETAIL:  Message: (), Detail: ().
2023-10-26 06:04:07 -03 [30519]CONTEXT:  PL/pgSQL function _timescaledb_functions.policy_compression_execute(integer,integer,anyelement,integer,boolean,boolean) line 70 at RAISE
	SQL statement "CALL _timescaledb_functions.policy_compression_execute(
	        job_id, htid, lag_value::INTERVAL,
	        maxchunks, verbose_log, recompress_enabled
	      )"
	PL/pgSQL function _timescaledb_functions.policy_compression(integer,jsonb) line 51 at CALL
2023-10-26 06:04:07 -03 [30519]WARNING:  compressing chunk "_timescaledb_internal._hyper_91_1553_chunk" failed when compression policy is executed
2023-10-26 06:04:07 -03 [30519]DETAIL:  Message: (), Detail: ().
2023-10-26 06:04:07 -03 [30519]CONTEXT:  PL/pgSQL function _timescaledb_functions.policy_compression_execute(integer,integer,anyelement,integer,boolean,boolean) line 81 at RAISE
	SQL statement "CALL _timescaledb_functions.policy_compression_execute(
	        job_id, htid, lag_value::INTERVAL,
	        maxchunks, verbose_log, recompress_enabled
	      )"
	PL/pgSQL function _timescaledb_functions.policy_compression(integer,jsonb) line 51 at CALL

 2023-10-26 06:10:42 -03 [30519]WARNING:  compressing chunk "_timescaledb_internal._hyper_91_1850_chunk" failed when compression policy is executed
2023-10-26 06:10:42 -03 [30519]DETAIL:  Message: (deadlock detected), Detail: (Process 30519 waits for AccessExclusiveLock on relation 26960 of database 16402; blocked by process 32373.
	Process 32373 waits for AccessShareLock on relation 96361368 of database 16402; blocked by process 30519.).
2023-10-26 06:10:42 -03 [30519]CONTEXT:  PL/pgSQL function _timescaledb_functions.policy_compression_execute(integer,integer,anyelement,integer,boolean,boolean) line 56 at RAISE
	SQL statement "CALL _timescaledb_functions.policy_compression_execute(
	        job_id, htid, lag_value::INTERVAL,
	        maxchunks, verbose_log, recompress_enabled
	      )"
	PL/pgSQL function _timescaledb_functions.policy_compression(integer,jsonb) line 51 at CALL


2023-10-26 06:10:42 -03 [30519]LOG:  job 1052 (Compression Policy [1052]) exiting with success: execution time 2220932.13 ms

How can we reproduce the bug?

I don't have a simple reproduction of this bug to share.
fetchezar added the bug label Oct 31, 2023
fetchezar (Author) commented Oct 31, 2023

I tried "fixing" the partially compressed chunks by hand by decompressing them and then compressing them manually, to no avail.

decompress_chunk never finishes. I tried it on one of these chunks, which should inflate to 20GB from ~1GB, and let it run for approximately 20min. It never finished, and I saw the WAL on the master grow from 3GB to 50GB (!). It makes no sense, neither in size nor in the time it's taking.

The last time (on 2.10.0) I compressed one of these chunks manually because we needed some space, it took 7min in total. I imagine decompressing should be even faster.

(screenshot: 20231031_problem_chunk_mostly_dead)

After cancelling the decompression I'm left with this mostly dead chunk that I have to manually VACUUM for it to return to its normal size.

mkindahl self-assigned this Nov 1, 2023
mkindahl (Contributor) commented Nov 1, 2023

@fetchezar Thank you for the very detailed bug report!

As I read this, there seem to be multiple issues, but the main problem is that when incorporating new data into an existing chunk, it has to decompress the existing chunk and combine it with the new data before compressing again. This results in a lot of dead tuples in the chunk, which eats up disk space and forces you to run VACUUM on a regular basis to keep the disk usage down.

Is that a fair summary of the situation?

mkindahl removed their assignment Nov 1, 2023
fetchezar (Author) commented Nov 1, 2023

Not exactly. The recompress job doesn't finish successfully. These chunks remain in a partially compressed state because the decompression step never seems to finish, neither through recompress_chunk nor by manually calling decompress_chunk on them. It just blew up the WAL to 50GB, and I had to cancel it and then run VACUUM (FULL) on the chunk. The chunk remained in a partially compressed state.

To be clear: these tuples already existed before we upgraded from 2.10.0 to 2.12.2. This data didn't get inserted in the ~16h span between the upgrade and the next compression policy execution. It was already there.

The old data was already there; the compression policy job happened to run the next morning after we upgraded, and inflated two chunks. This was a problem but manageable, since we had space to spare in that database.

What caused problems was that, on that very same day at noon, the compression policy job ran in another database, which had less space available, and inflated two chunks there, blowing up a streaming replica we had.

So I'm at least quite sure we didn't have any problems inserting the data on 2.10.0, although for some reason it never got recompressed into the chunks. Then we upgraded to 2.12.2, and these problems started popping up when it tried to recompress this data.

I can't say with 100% certainty but I'm quite sure these extra uncompressed tuples were sitting there for at least a couple weeks.

Tindarid commented Nov 1, 2023

Hi @fetchezar, I'm curious about this issue (I'm also playing with recompress_chunk in the recent version, but don't have the problems described here):

  1. Have you tried doing VACUUM FULL on the compressed table of the chunk (not the chunk table)?
  2. Have you tried turning off the option suggested in [Bug]: Performance degradation (5x slower) on chunk compression after update from TimescaleDB 2.9.0 to 2.11.1 #5902? It can speed up recompression for your segmented hypertable.

akuzm (Member) commented Nov 9, 2023

As a workaround for the WAL blowing up, try dropping all indexes from the uncompressed chunk table before decompression. We'll have some improvements for decompression throughput in 2.13, although I can't say this will definitively solve the problem: #6287
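A sketch of listing the chunk's current indexes before deciding what to drop (the chunk name is illustrative, taken from the log above):

-- Indexes on the uncompressed chunk table.
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = '_timescaledb_internal'
  AND tablename = '_hyper_91_1553_chunk';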

fetchezar (Author) commented

@akuzm Hmm... We only have two indexes in these tables: the primary key index over (id int8, tstamp timestamptz) and the one TimescaleDB creates over the tstamp column.

So this would require dropping the primary key and the TimescaleDB created index over the tstamp column, then decompressing the chunk.

Then I should recompress it again, and then re-add the primary key and the index? Or what should the "manual recompression" procedure look like? Or should I just call recompress_chunk after dropping the pkey and the index?

fetchezar (Author) commented

@Tindarid As for 1, I don't understand what this would accomplish. It seems I can't even decompress the chunk, so what would vacuuming the compressed chunk accomplish in this case?

Same for point 2. I can't even decompress it, so I'm not sure compression here is the problem.

However, it may help the recompress_chunk call if it only decompresses the affected segments, but given that the uncompressed chunk's size ends up as if the chunk were decompressed entirely, it doesn't look like it.

akuzm (Member) commented Nov 9, 2023

Then I should recompress it again, then re-add the primary key and the index?

@fetchezar yes, you have to add them back manually if you need them. They won't be restored automatically. But if you plan to mostly keep this chunk compressed, maybe you don't need these indexes on the uncompressed chunk table at all.

For the recompression process, I think recompress_chunk() should work just as well as decompress_chunk() + compress_chunk() in this scenario, but the latter combo is better tested.
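A rough sketch of that manual route (the chunk name is illustrative, taken from the log above):

-- Decompress, recompress, then reclaim the dead tuples left behind
-- in the uncompressed chunk table.
SELECT decompress_chunk('_timescaledb_internal._hyper_91_1850_chunk');
SELECT compress_chunk('_timescaledb_internal._hyper_91_1850_chunk');
VACUUM (FULL, ANALYZE) _timescaledb_internal._hyper_91_1850_chunk;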

fetchezar (Author) commented Nov 21, 2023

@akuzm In the end I didn't have the courage to mess with TimescaleDB's internal tables.

However, I started recompressing the partially compressed chunks from oldest to newest, one at a time, in two databases, and somehow this time it worked fine. So now I have no more partially compressed chunks (and the data stream prevents ingestion of data older than ~8 days, so we shouldn't have any more inserts into compressed chunks).

Next step is to review the databases, enable the compression policies again and see if they don't blow up.

That being said, I noticed something strange: re-compressed chunks are nowhere near the size a normally compressed chunk has. For instance:

hypertable_name chunk_name compressed_chunk chunk_status before_size after_size detailed_size
public.sample _hyper_91_1850_chunk compress_hyper_92_1894_chunk 1 11 GB 835 MB 835 MB
public.sample _hyper_91_1833_chunk compress_hyper_92_1865_chunk 1 13 GB 1672 MB 1668 MB
public.sample _hyper_91_1820_chunk compress_hyper_92_1852_chunk 1 13 GB 1680 MB 1675 MB
public.sample _hyper_91_1804_chunk compress_hyper_92_1838_chunk 1 13 GB 1921 MB 1913 MB
public.sample _hyper_91_1780_chunk compress_hyper_92_1824_chunk 1 13 GB 836 MB 836 MB
public.sample _hyper_91_1769_chunk compress_hyper_92_1806_chunk 1 13 GB 826 MB 826 MB
public.sample _hyper_91_1757_chunk compress_hyper_92_1783_chunk 1 13 GB 1676 MB 1781 MB
public.sample _hyper_91_1744_chunk compress_hyper_92_1771_chunk 1 13 GB 1670 MB 1774 MB
public.sample _hyper_91_1734_chunk compress_hyper_92_1759_chunk 1 13 GB 1662 MB 883 MB
public.sample _hyper_91_1721_chunk compress_hyper_92_1747_chunk 1 13 GB 1679 MB 1784 MB
public.sample _hyper_91_1708_chunk compress_hyper_92_1736_chunk 1 13 GB 869 MB 869 MB
public.sample _hyper_91_1695_chunk compress_hyper_92_1723_chunk 1 13 GB 936 MB 936 MB
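Numbers like the before/after sizes above can be pulled with something along these lines (the exact query behind that table may differ):

-- Before/after compression sizes per chunk, similar to the table above.
SELECT chunk_name,
       pg_size_pretty(before_compression_total_bytes) AS before_size,
       pg_size_pretty(after_compression_total_bytes) AS after_size
FROM chunk_compression_stats('public.sample')
ORDER BY chunk_name;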

All of these chunks have roughly ~120 million samples each. The ~1.7GB chunks are the ones that were partially compressed and got re-compressed manually. The ~900MB chunks are the ones that are just compressed (i.e., they got filled up, the job compressed them, and that's it).

I didn't try every single one of them, but only one of the roughly 5-8 recompressed chunks went down to 900MB after a VACUUM (FULL, ANALYZE, VERBOSE, INDEX_CLEANUP). The rest remained in the 1.7GB range, no idea why. I have literally years of chunks with very stable sizes, and the system has been generating 13GB uncompressed / 900MB compressed chunks for months now. I don't understand why these are 2x the size after recompression, unless recompression does something differently.

I tried decompressing them, but it takes way too long (I think I left it running for 30min and the session was already blocking other operations, so I cancelled it), so I guess I'll wait for 2.13 before trying again.

antekresic (Contributor) commented

Hey @fetchezar, I've actually added a change which should improve your situation with the compression policy. Basically, it adds an optimization, already present in the recompress_chunk procedure, to make the policy a bit more performant.

As for your last comment about recompression resulting in double the compressed chunk size, it's something to do with the way we recompress the data and update the compressed chunk. At this point, the best option is to run VACUUM FULL on those chunks to reclaim the storage.

There should be further work on the recompression operation which should address this issue in the future.

Hope this helps!

fetchezar (Author) commented

At this point, best option is to run VACUUM FULL on those chunks to reclaim the storage.

Is it normal for VACUUM FULL not to reclaim the storage? Like I mentioned, of the 5-6 chunks I tried, only one got down to the expected size after a VACUUM FULL.

On the other hand, I imagine this change won't make it into 2.13.0? I'll try it out when it gets released. Thank you for your time!

antekresic (Contributor) commented

VACUUM FULL should reclaim the storage unless there are other queries that depend on the previous data which was overwritten, preventing vacuum from removing it.

It actually is going to make it into 2.13 release! 🎉

fetchezar (Author) commented

Ah, all right then, I'll try again some other time when there is less load on the server.

It actually is going to make it into 2.13 release! 🎉

Nice, we'll try it when it comes out then! Thanks!
