[Bug]: delete_job() deadlocks #6152

Open
leppaott opened this issue Oct 5, 2023 · 4 comments
Labels
bgw The background worker subsystem, including the scheduler bug hacktoberfest telemetry

Comments

@leppaott

leppaott commented Oct 5, 2023

What type of bug is this?

Locking issue

What subsystems and features are affected?

Other

What happened?

Hello, we run the following code in our e2e tests to remove retention jobs before each test suite. Each test suite loads the extension and creates retention jobs on the same Postgres instance, so this cleanup prevents retention jobs from deleting items during the tests.

  'SELECT delete_job(jobs.job_id) FROM ' +
          '(SELECT job_id FROM timescaledb_information.jobs ' +
          "WHERE application_name ILIKE '%Retention%' OR application_name ILIKE '%Telemetry%') as jobs"

However, on rare occasions we get "deadlock detected" (see output below), so I thought I'd report this. Any idea of a better solution or workaround?
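For reference, the concatenated string above composes to roughly this single statement (same identifiers, just without the JS string concatenation):

SELECT delete_job(jobs.job_id)
FROM (SELECT job_id
      FROM timescaledb_information.jobs
      WHERE application_name ILIKE '%Retention%'
         OR application_name ILIKE '%Telemetry%') AS jobs;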

TimescaleDB version affected

2.11.0

PostgreSQL version used

15.3

What operating system did you use?

Debian/Docker

What installation method did you use?

Docker

What platform did you run on?

Other, Not applicable

Relevant log output and stack trace

Error removing TDB retention policies error: deadlock detected
        at /opt/atlassian/pipelines/agent/build/services/common/node_modules/pg/lib/client.js:526:17
        at processTicksAndRejections (node:internal/process/task_queues:95:5)
      length: 326,
      severity: 'ERROR',
      code: '40P01',
      detail: 'Process 1087 waits for AccessExclusiveLock on advisory lock [83469,1,0,29749]; blocked by process 1085.\n' +
        'Process 1085 waits for ShareRowExclusiveLock on relation 17290 of database 83469; blocked by process 1087.',
      hint: 'See server log for query details.',
      position: undefined,
      internalPosition: undefined,
      internalQuery: undefined,
      where: undefined,
      schema: undefined,
      table: undefined,
      column: undefined,
      dataType: undefined,
      constraint: undefined,
      file: 'deadlock.c',
      line: '1148',
      routine: 'DeadLockReport'
    }

How can we reproduce the bug?

Hard to reproduce locally, indeed; it happens on a lower-end CI machine with limited CPU cores.
@melicheradam

Hi @leppaott,

We have encountered something similar, but with compression jobs. Have you considered using pg_advisory_xact_lock?

BEGIN TRANSACTION;
SELECT pg_advisory_xact_lock(hashtext('job_delete')); -- or any other string or number as the lock key
-- your job-altering statements here
COMMIT;

This essentially removes all concurrency from the operation. You probably want to wrap all job-altering operations this way.
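A minimal sketch of that pattern applied to the cleanup query from the issue body (the 'job_delete' key is arbitrary; the only requirement is that every caller uses the same key):

BEGIN;
-- Transaction-scoped advisory lock: concurrent callers queue up here instead of
-- interleaving, and the lock is released automatically at COMMIT or ROLLBACK.
SELECT pg_advisory_xact_lock(hashtext('job_delete'));

-- The job-altering work, e.g. the retention/telemetry cleanup from above.
SELECT delete_job(jobs.job_id)
FROM (SELECT job_id
      FROM timescaledb_information.jobs
      WHERE application_name ILIKE '%Retention%'
         OR application_name ILIKE '%Telemetry%') AS jobs;

COMMIT;

Note that this serializes callers of the cleanup against each other; it does not, by itself, change which locks the background workers take.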

@konskov
Contributor

konskov commented Oct 9, 2023

Hi @leppaott, thank you for reaching out. I'm guessing that relation 17290 is bgw_job_stat and that one of the two processes involved in the deadlock is the telemetry job, not a retention policy. Could you confirm which processes those PIDs correspond to and which relation 17290 is? Thanks!
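For example, something along these lines should show both (the OID and PIDs below are the ones from the log output above, and pg_stat_activity only has rows for backends that are still alive):

-- which relation is 17290?
SELECT oid, relname FROM pg_class WHERE oid = 17290;

-- what are the two backends named in the deadlock report?
SELECT pid, application_name, backend_type, query
FROM pg_stat_activity
WHERE pid IN (1085, 1087);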

@mkindahl added the telemetry and bgw (background worker subsystem, including the scheduler) labels on Oct 10, 2023
@leppaott
Author

leppaott commented Oct 12, 2023

 detail: 'Process 213 waits for AccessExclusiveLock on advisory lock [25680,1,0,29749]; blocked by process 210.\n' +
        'Process 210 waits for ShareRowExclusiveLock on relation 17290 of database 25680; blocked by process 213.',
{ oid: 17290, relname: 'bgw_job_stat' }

@konskov you were right there. Is there a query to print the processes involved, or is the above good enough?

Telemetry does seem to be involved: I managed to reproduce this locally (only with repeated runs, though) after switching to the same CI image, which has telemetry enabled. I'm now trying TIMESCALEDB_TELEMETRY: 'off' on CI as well.

Edit: I don't see the issue anymore when telemetry is off and we don't try to delete the telemetry jobs.
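For reference, telemetry can apparently also be turned off as a GUC rather than through the container environment variable; a sketch, assuming superuser access:

-- instance-wide equivalent of TIMESCALEDB_TELEMETRY=off
ALTER SYSTEM SET timescaledb.telemetry_level = 'off';
SELECT pg_reload_conf();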

@melicheradam thanks for the suggestion, we'll try that if needed.

@leppaott
Author

Thanks @melicheradam, indeed this seems to work as expected. After updating to timescale/timescaledb-ha:pg16-ts2.13 I started to get the deadlock again... same bgw_job_stat indeed.
