[Bug]: delete_job() deadlocks #6152

Open
leppaott opened this issue Oct 5, 2023 · 4 comments
Labels
bgw The background worker subsystem, including the scheduler bug hacktoberfest telemetry

Comments

@leppaott

leppaott commented Oct 5, 2023

What type of bug is this?

Locking issue

What subsystems and features are affected?

Other

What happened?

Hello, we run the following code in our e2e tests to remove retention jobs before each test suite. Each test suite loads the extension and creates retention jobs on the same Postgres instance, so this cleanup prevents retention jobs from deleting items during the tests.

  'SELECT delete_job(jobs.job_id) FROM ' +
          '(SELECT job_id FROM timescaledb_information.jobs ' +
          "WHERE application_name ILIKE '%Retention%' OR application_name ILIKE '%Telemetry%') as jobs"

However, on rare occasions we get "deadlock detected" (see output below), so I thought I'd report this. Any idea of a better solution or workaround?
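For reference, the concatenated string above composes to roughly this single statement (same identifiers, just without the JS string concatenation):

SELECT delete_job(jobs.job_id)
FROM (SELECT job_id
      FROM timescaledb_information.jobs
      WHERE application_name ILIKE '%Retention%'
         OR application_name ILIKE '%Telemetry%') AS jobs;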

TimescaleDB version affected

2.11.0

PostgreSQL version used

15.3

What operating system did you use?

Debian/Docker

What installation method did you use?

Docker

What platform did you run on?

Other, Not applicable

Relevant log output and stack trace

Error removing TDB retention policies error: deadlock detected
        at /opt/atlassian/pipelines/agent/build/services/common/node_modules/pg/lib/client.js:526:17
        at processTicksAndRejections (node:internal/process/task_queues:95:5)
      length: 326,
      severity: 'ERROR',
      code: '40P01',
      detail: 'Process 1087 waits for AccessExclusiveLock on advisory lock [83469,1,0,29749]; blocked by process 1085.\n' +
        'Process 1085 waits for ShareRowExclusiveLock on relation 17290 of database 83469; blocked by process 1087.',
      hint: 'See server log for query details.',
      position: undefined,
      internalPosition: undefined,
      internalQuery: undefined,
      where: undefined,
      schema: undefined,
      table: undefined,
      column: undefined,
      dataType: undefined,
      constraint: undefined,
      file: 'deadlock.c',
      line: '1148',
      routine: 'DeadLockReport'
    }

How can we reproduce the bug?

Hard to reproduce locally, indeed; it happens on a lower-end CI machine with limited CPU cores.
@melicheradam

Hi @leppaott,

We have encountered something similar, but with compression jobs. Have you considered using pg_advisory_xact_lock?

BEGIN TRANSACTION;
SELECT pg_advisory_xact_lock(hashtext('job_delete')); -- or any other string or number as the lock key
-- your job-altering statements here
COMMIT;

This essentially removes all concurrency from the operation. You probably want to wrap all job-altering operations this way.
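A minimal sketch of that pattern applied to the cleanup query from the issue body (the 'job_delete' key is arbitrary; the only requirement is that every caller uses the same key):

BEGIN;
-- Transaction-scoped advisory lock: concurrent callers queue up here instead of
-- interleaving, and the lock is released automatically at COMMIT or ROLLBACK.
SELECT pg_advisory_xact_lock(hashtext('job_delete'));

-- The job-altering work, e.g. the retention/telemetry cleanup from above.
SELECT delete_job(jobs.job_id)
FROM (SELECT job_id
      FROM timescaledb_information.jobs
      WHERE application_name ILIKE '%Retention%'
         OR application_name ILIKE '%Telemetry%') AS jobs;

COMMIT;

Note that this serializes callers of the cleanup against each other; it does not, by itself, change which locks the background workers take.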

@konskov
Contributor

konskov commented Oct 9, 2023

Hi @leppaott, thank you for reaching out. I'm guessing that relation 17290 is bgw_job_stat and that one of the two processes involved in the deadlock is the telemetry job, not a retention policy. Could you confirm which processes those PIDs correspond to and which relation 17290 is? Thanks!
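For example, something along these lines should show both (the OID and PIDs below are the ones from the log output above, and pg_stat_activity only has rows for backends that are still alive):

-- which relation is 17290?
SELECT oid, relname FROM pg_class WHERE oid = 17290;

-- what are the two backends named in the deadlock report?
SELECT pid, application_name, backend_type, query
FROM pg_stat_activity
WHERE pid IN (1085, 1087);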

@mkindahl added the telemetry and bgw (background worker subsystem, including the scheduler) labels on Oct 10, 2023
@leppaott
Author

leppaott commented Oct 12, 2023

 detail: 'Process 213 waits for AccessExclusiveLock on advisory lock [25680,1,0,29749]; blocked by process 210.\n' +
        'Process 210 waits for ShareRowExclusiveLock on relation 17290 of database 25680; blocked by process 213.',
{ oid: 17290, relname: 'bgw_job_stat' }

@konskov you were right there. Is there a query to print the processes involved, or is the above good enough?

Telemetry does seem to be involved: I managed to reproduce this locally (only with repeated runs, though) after switching to the same CI image, which has telemetry enabled. I'm now trying TIMESCALEDB_TELEMETRY: 'off' on CI as well.

Edit: I don't see the issue anymore when telemetry is off and we don't try to delete the telemetry jobs.
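For reference, telemetry can apparently also be turned off as a GUC rather than through the container environment variable; a sketch, assuming superuser access:

-- instance-wide equivalent of TIMESCALEDB_TELEMETRY=off
ALTER SYSTEM SET timescaledb.telemetry_level = 'off';
SELECT pg_reload_conf();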

@melicheradam thanks for the suggestion, we'll try that if needed.

@leppaott
Author

Thanks @melicheradam, indeed this seems to work as expected. After updating to timescale/timescaledb-ha:pg16-ts2.13 I started to get the deadlock again... same bgw_job_stat indeed.
