[Bug]: Compressed chunks getting inflated after upgrading to 2.12.2 #6255
Comments
I tried "fixing" the partially compressed chunks by hand by uncompressing them, then compressing them manually to no avail.
The last time (on 2.10.0) I tried compressing one of these chunks manually because we needed some space it took 7min in total to compress. I imagine decompressing should be even faster. After cancelling the decompression I'm left with this mostly dead chunk that I have to manually VACUUM for it to return to its normal size. |
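For context, a minimal sketch of that manual decompress/recompress/VACUUM cycle, using the standard TimescaleDB calls and a hypothetical chunk name:

```sql
-- Hypothetical chunk name; look it up with show_chunks() or timescaledb_information.chunks.
SELECT decompress_chunk('_timescaledb_internal._hyper_1_123_chunk');
SELECT compress_chunk('_timescaledb_internal._hyper_1_123_chunk');

-- Reclaim the dead tuples left behind by the decompress/compress cycle.
VACUUM FULL _timescaledb_internal._hyper_1_123_chunk;
```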
@fetchezar Thank you for the very detailed bug report! As I read this, there seem to be multiple issues, but the main problem is that when incorporating new data into an existing chunk, it has to decompress the existing chunk and combine it with the new data before compressing again. This results in a lot of dead tuples in the chunk, which eats up disk space and forces you to run VACUUM on a regular basis to keep the disk usage down. Is that a fair summary of the situation?
Not exactly. The recompress job doesn't finish successfully. These chunks remain in a partially compressed state because the decompression step never seems to finish, neither via recompress_chunk nor by manually calling decompress_chunk on them. It just blew up the WAL to 50GB, I had to cancel it and then run VACUUM (FULL) on the chunk, and the chunk remained in a partially compressed state.

To be clear: these tuples already existed before we upgraded from 2.10.0 to 2.12.2. This data didn't get inserted in the ~16h span between the upgrade and the next compression policy execution; it was already there. The old data was already there, the compression policy job happened to run the morning after we upgraded, and it inflated two chunks. This was a problem but manageable, since we had space to spare in that database. What caused real problems was that the very same day, at noon, the compression policy job ran in another database, which had less space available, inflated two chunks there, and that blew up a streaming replica we had.

So I'm quite sure we didn't have any problems inserting the data on 2.10.0, although for some reason it never got recompressed into the chunks. Then we upgraded to 2.12.2, and these problems started popping up when it tried to recompress this data. I can't say with 100% certainty, but I'm quite sure these extra uncompressed tuples had been sitting there for at least a couple of weeks.
Hi @fetchezar, curious about this issue (I'm also playing with recompress_chunk in the recent version, but don't have the problems described here):
As a workaround for the WAL blowing up, try dropping all indexes from the uncompressed chunk table before decompression. We'll have some improvements for decompression throughput in 2.13, although I can't say this will definitively solve the problem: #6287
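A sketch of that workaround, with hypothetical chunk, constraint, and index names (the real ones can be found via pg_indexes or \d on the chunk):

```sql
-- List the indexes on the uncompressed chunk table (hypothetical chunk name).
SELECT indexname
FROM pg_indexes
WHERE schemaname = '_timescaledb_internal'
  AND tablename  = '_hyper_1_123_chunk';

-- Drop them before decompressing; the primary key index goes away with its constraint.
ALTER TABLE _timescaledb_internal._hyper_1_123_chunk
  DROP CONSTRAINT "123_pkey";                                       -- hypothetical constraint name
DROP INDEX _timescaledb_internal."_hyper_1_123_chunk_tstamp_idx";   -- hypothetical index name

SELECT decompress_chunk('_timescaledb_internal._hyper_1_123_chunk');
```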
@akuzm Hmm... We only have two indexes in these tables: the primary key index over (id int8, tstamp timestamptz) and the one TimescaleDB creates over the tstamp column. So this would require dropping the primary key and the TimescaleDB-created index over the tstamp column, then decompressing the chunk. Then I should recompress it again, then re-add the primary key and the index? Or how should the "manual recompression" procedure look? Or should I just call recompress_chunk after dropping the pkey and the index?
@Tindarid As for 1, I don't understand what this would accomplish. It seems I can't even decompress the chunk, so what would vacuuming the compressed chunk accomplish in this case? Same for point 2: I can't even decompress it, so I'm not sure compression here is the problem. However, it may help on the recompress_chunk call if it only decompresses the affected segments, but given that the uncompressed chunk's size is left as if it were decompressed entirely, it doesn't look like it.
@fetchezar yes, you have to add them back manually if you need them; they won't be restored automatically. But if you plan to mostly keep this chunk compressed, maybe you don't need these indexes on the uncompressed chunk table at all. For the recompression process, I think
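A sketch of what adding the indexes back after recompression could look like, assuming a hypothetical chunk name and the (id, tstamp) primary key described above:

```sql
-- Recompress the chunk after the indexes were dropped (hypothetical chunk name).
SELECT compress_chunk('_timescaledb_internal._hyper_1_123_chunk');

-- Recreate the primary key and the tstamp index on the uncompressed chunk table,
-- if they are still wanted there.
ALTER TABLE _timescaledb_internal._hyper_1_123_chunk
  ADD PRIMARY KEY (id, tstamp);
CREATE INDEX ON _timescaledb_internal._hyper_1_123_chunk (tstamp DESC);
```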
@akuzm In the end I didn't have the courage to mess with TimescaleDB's internal tables. However, I started recompressing partially compressed chunks from oldest to newest, one at a time, in two databases, and somehow this time it worked fine. So now I have no more partially compressed chunks (and the data stream prevents ingestion of data older than ~8 days, so we shouldn't have more inserts into compressed chunks anymore). The next step is to review the databases, enable the compression policies again and see if they blow up. That being said, I noticed something strange. Re-compressed chunks have nowhere near the size a normally compressed chunk has, for instance:
All of these chunks have roughly ~120 million samples each. The ~1.7GB chunks are the ones that were partially compressed and got re-compressed manually. The ~900MB chunks are the ones that are just compressed (i.e., they got filled up, the job compressed them, and that's it). I didn't try with every single one of them, but only one of the 5-8 recompressed chunks went down to 900MB after a VACUUM FULL. Tried decompressing them but it takes way too long (I think I left it 30 min and the session was already blocking other operations, so I dropped it), so I guess I'll wait for 2.13 before trying again.
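For reference, one way to compare re-compressed chunks against normally compressed ones is TimescaleDB's compression statistics function; a minimal sketch assuming a hypothetical hypertable name my_table:

```sql
-- Hypothetical hypertable name; shows per-chunk size before and after compression.
SELECT chunk_name,
       pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes)  AS after
FROM chunk_compression_stats('my_table')
ORDER BY chunk_name;
```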
Hey @fetchezar, I've actually added a change which should improve your situation with the compression policy. Basically, it adds an optimization which is already present in

As for the last comment you made about recompression resulting in double the compressed chunk size, it's something to do with the way we recompress the data and update the compressed chunk. At this point, the best option is to run VACUUM FULL.

There should be further work on the recompression operation which should address this issue in the future. Hope this helps!
Is it normal for VACUUM FULL not to reclaim the storage? Like I mentioned, of the 5-6 chunks I tried, only one got down to the expected size after a VACUUM FULL. On the other hand, this change won't make it into 2.13.0, I imagine? I'll try it out when it gets released, thank you for your time!
VACUUM FULL should reclaim the storage unless there are other queries which depend on the previous data that was overwritten, preventing vacuum from removing it. It actually is going to make it into the 2.13 release! 🎉
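Not part of the original exchange, but for completeness: a plain-PostgreSQL sketch of how one might list long-running transactions that keep old row versions alive before retrying VACUUM FULL.

```sql
-- Long-running transactions hold back the xmin horizon and can keep old row versions
-- alive; list them before retrying VACUUM FULL.
SELECT pid, state, xact_start, backend_xmin, left(query, 80) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_start;
```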
Ah all right then, I'll try some other time when there is less load on the server.
Nice, we'll try it when it comes out then! Thanks!
What type of bug is this?
Unexpected error
What subsystems and features are affected?
Background worker, Compression
What happened?
Hi! I'm following up after reporting this on the Timescale Slack server. This issue prevents normal functioning of the automatic compression policies and safe ingestion of old data into the architecture, and due to the exponential increase in disk usage, it causes problems in keeping the database and its streaming replicas working.
After upgrading from 2.10.0 to 2.12.2, the compression policy worker appears to be "inflating" old compressed chunks without accomplishing anything, and failing on a deadlock while doing so. The behavior we see is the worker grabbing a partially compressed chunk, "inflating" the related uncompressed chunk table to its previously uncompressed size, and then failing to recompress anything, with a deadlock error showing up in the log.
In the Postgres log it seems I'm also missing some exception details from these two exception handlers here:
- timescaledb/sql/policy_internal.sql, line 115 (commit 98623ea)
- timescaledb/sql/policy_internal.sql, line 126 (commit 98623ea)
Looks like _message and _detail aren't set, unlike the block above here:
- timescaledb/sql/policy_internal.sql, line 101 (commit 98623ea)
Hence the log mentions the deadlock message in one case, but shows empty "()" details and messages in the others.
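For illustration only (this is not the actual contents of policy_internal.sql), a minimal PL/pgSQL sketch of the GET STACKED DIAGNOSTICS pattern the earlier block uses; without that step the variables stay NULL and the warning prints empty details:

```sql
DO $$
DECLARE
    _message text;
    _detail  text;
BEGIN
    PERFORM 1 / 0;  -- stand-in for the compression call that fails
EXCEPTION WHEN OTHERS THEN
    -- Without this step, _message and _detail stay NULL and the warning prints empty "()".
    GET STACKED DIAGNOSTICS
        _message = MESSAGE_TEXT,
        _detail  = PG_EXCEPTION_DETAIL;
    RAISE WARNING 'compressing chunk failed: % (%)', _message, _detail;
END;
$$;
```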
These are the statistics of one of the partially compressed chunks for example:
An uncompressed chunk is roughly 14GB: 4 days of data, ~30 million rows per day, ~120 million rows in total over the span of the chunk, which ends up compressed to 900MB. It's a very narrow schema; the table definition is the following, with the compression segmented by the id column:
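The actual table definition didn't survive into this copy of the report. As a stand-in, a minimal sketch consistent with the description above (hypothetical table and value column names; 4-day chunks, compression segmented by id):

```sql
-- Hypothetical names: a narrow (id, tstamp, value) schema with primary key (id, tstamp),
-- 4-day chunks, and compression segmented by id, ordered by tstamp.
CREATE TABLE sensor_data (
    id     int8        NOT NULL,
    tstamp timestamptz NOT NULL,
    value  float8      NOT NULL,
    PRIMARY KEY (id, tstamp)
);

SELECT create_hypertable('sensor_data', 'tstamp', chunk_time_interval => INTERVAL '4 days');

ALTER TABLE sensor_data SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'id',
    timescaledb.compress_orderby   = 'tstamp'
);
```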
The number of ids present in a single chunk is on the order of 300k, which gives us around 400 rows per signal id on average. These chunks see very, very few sporadic inserts of old data: a couple hundred rows inserted, compared to the existing ~120 million rows in the span of the entire chunk. From what I see, around 10 chunks are partially compressed, out of the couple hundred we have in total.
The business use case makes insertion of old data almost nonexistent (we continually ingest oil field installation sensor data, continuously polled from the field). Deletes/updates of old data are a use case we outright don't perform or support within the system. There is the odd case where some particular instrument is a little more complicated and has a log that could possibly contain some older values, but that's it. It's impossible that we've reinserted 4 days of values again over the same compressed chunk without noticing, like the statistics say (240 million inserts); I doubt the system could even handle it, so it makes no sense for the background job to try to inflate the chunk to its full size. At most it should inflate the affected segments, which for a couple hundred uncompressed samples inserted post-compression would mean probably fewer than a hundred id segments.
After the compression policy job is done with a partially compressed chunk (I verified the status flag is 9 in the chunks catalog, which I understand means compressed | partially_compressed), we end up with a compressed chunk holding the 120 million rows at 900MB, and an uncompressed 14GB chunk with only a couple hundred live rows. So practically all of it is dead tuples for Postgres.
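For reference, the status flag mentioned here lives in the internal chunk catalog; a read-only inspection query (a sketch only, since the internal catalog layout can change between versions):

```sql
-- Status 9 = compressed (1) + partially compressed (8). Internal catalog: inspect, don't modify.
SELECT schema_name, table_name, status
FROM _timescaledb_catalog.chunk
WHERE status = 9;
```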
If we perform a VACUUM (FULL) on the chunk, its size returns to normal, but we're still left with most of the data in the compressed chunk plus barely a couple hundred uncompressed rows in the adjacent uncompressed chunk. This means that if we enable the compression policy again and the background worker picks up this chunk, it tries to do exactly the same thing, inflating it to 14GB and blowing up our disk usage and itself in the process. So for now we've disabled all automatic compression policies.
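A sketch of how the policies can be paused and later re-enabled via the standard alter_job API (selecting the jobs by proc_name = 'policy_compression' is an assumption about how they are named here):

```sql
-- Find the compression policy jobs and check their state.
SELECT job_id, hypertable_name, scheduled
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_compression';

-- Pause them; re-enable later with scheduled => true.
SELECT alter_job(job_id, scheduled => false)
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_compression';
```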
recompress_chunk doesn't finish. It locks the table, with Postgres registering an ever-increasing number of sessions, and after almost 2h I had to stop the function because of this (the table is almost constantly queried).

We have at least one other database, albeit smaller, with the exact same problem: the background job inflates a couple of partially compressed chunks to their whole size, blowing up the disk usage of the master and the streaming replica, causing issues on both systems.
TimescaleDB version affected
2.12.2
PostgreSQL version used
14.2
What operating system did you use?
Debian 10 x64
What installation method did you use?
Deb/Apt
What platform did you run on?
On prem/Self-hosted
Relevant log output and stack trace
How can we reproduce the bug?
I don't have a simple reproduction of this bug to share.