query performance: need help optimizing #495

Open
spiicychknsandy opened this issue Jul 2, 2020 · 1 comment

@spiicychknsandy

I know this is not an issue for this repo, but I have looked in a lot of places and tried a lot of different things, and nothing is working.

I need to bulk update deleted_at for records in a table with 3M rows, and the query has to check 2 other tables that each have 5-10M records.

My query is

UPDATE public.study_data_events sde
SET deleted_at = current_timestamp
WHERE sde.id IN (
  SELECT s.id
  FROM public.study_data_events s
  LEFT JOIN public.study_data_values sdv on sdv.study_data_event_id = s.id and sdv.deleted_at is NULL
  LEFT JOIN public.intervention_phase_events ipe on ipe.intervention_event_id = s.id and ipe.deleted_at is NULL
  WHERE sdv.id is NULL
  AND ipe.id is NULL
  AND s.deleted_at is NULL
);
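
For reference, one of the other things I tried was expressing the same logic with NOT EXISTS instead of the LEFT JOIN / IS NULL pattern (sketched from memory, same tables and filters; the intent is the same anti-join: mark an event deleted only if it has no live values and no live phase events):

UPDATE public.study_data_events sde
SET deleted_at = current_timestamp
WHERE sde.deleted_at IS NULL
AND NOT EXISTS (
  SELECT 1
  FROM public.study_data_values sdv
  WHERE sdv.study_data_event_id = sde.id
  AND sdv.deleted_at IS NULL
)
AND NOT EXISTS (
  SELECT 1
  FROM public.intervention_phase_events ipe
  WHERE ipe.intervention_event_id = sde.id
  AND ipe.deleted_at IS NULL
);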

The EXPLAIN output looks like:

Update on study_data_events sde  (cost=48822.30..48822.62 rows=1 width=132)
   ->  Nested Loop  (cost=48822.30..48822.62 rows=1 width=132)
         ->  HashAggregate  (cost=48821.87..48821.88 rows=1 width=22)
               Group Key: s.id
               ->  Nested Loop Left Join  (cost=36390.15..48821.87 rows=1 width=22)
                     Filter: (sdv.id IS NULL)
                     ->  Hash Right Join  (cost=36389.71..48817.69 rows=1 width=16)
                           Hash Cond: (ipe.intervention_event_id = s.id)
                           Filter: (ipe.id IS NULL)
                           ->  Seq Scan on intervention_phase_events ipe  (cost=0.00..5637.57 rows=270448 width=14)
                                 Filter: (deleted_at IS NULL)
                           ->  Hash  (cost=29704.09..29704.09 rows=384610 width=10)
                                 ->  Seq Scan on study_data_events s  (cost=0.00..29704.09 rows=384610 width=10)
                                       Filter: (deleted_at IS NULL)
                     ->  Index Scan using study_data_values_study_data_event_id_idx on study_data_values sdv  (cost=0.43..3.31 rows=87 width=14)
                           Index Cond: (study_data_event_id = s.id)
                           Filter: (deleted_at IS NULL)
         ->  Index Scan using study_data_events_pkey on study_data_events sde  (cost=0.42..0.72 rows=1 width=106)
               Index Cond: (id = s.id)

We have this seq scan

->  Seq Scan on study_data_events s  (cost=0.00..29704.09 rows=384610 width=10)
    Filter: (deleted_at IS NULL)

that is very costly and I'm wondering what I can do about it.

I've tried adding a variety of indices on s.deleted_at (roughly the sort of thing sketched below), none of which has helped. Has anyone needed to filter a large dataset on deleted_at IS NULL?
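
For example, the kind of indices I experimented with were a plain index and a partial index on the deleted_at predicate (index names here are just illustrative):

CREATE INDEX study_data_events_deleted_at_idx
  ON public.study_data_events (deleted_at);

CREATE INDEX study_data_events_live_id_idx
  ON public.study_data_events (id)
  WHERE deleted_at IS NULL;

The planner still prefers the sequential scan, presumably because such a large fraction of the table matches deleted_at IS NULL (the estimate above is ~384k rows).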

@rishi-jisr

@spiicychknsandy Did you find any solution for this?
