Optimistic lock contention on HCA replicas #6648

Open

hannes-ucsc opened this issue Oct 23, 2024 · 8 comments
Assignees
Labels
- [priority] Medium
bug [type] A defect preventing use of the system as specified
indexer [subject] The indexer part of Azul
orange [process] Done by the Azul team
perf [subject] Performance, efficiency or cost
spike:3 [process] Spike estimate of three points

Comments

@hannes-ucsc
Member

hannes-ucsc commented Oct 23, 2024

The solution to #6582 fixed the omission of replicas for several low-cardinality, i.e., frequently referenced, entity types such as donor_organism, sequencing_protocol and library_preparation_protocol. This led to increased contention on the respective documents in the replica index.

[WARNING] 2024-10-23T14:36:17.525Z 6e6b10f5-9e45-5b8b-9b5a-4f6695def641 azul.indexer.index_service There was a conflict with document ReplicaCoordinates(entity=EntityReference(entity_type='sequencing_protocol', entity_id='571cc0c7-4dc2-443b-93f4-0ce4af08cf6d'), content_hash='a3a9f3a538a2a649690aa0974f4d1c070f6fc910'): ConflictError(409, 'version_conflict_engine_exception', {'error': {'root_cause': [{'type': 'version_conflict_engine_exception', 'reason': '[sequencing_protocol_571cc0c7-4dc2-443b-93f4-0ce4af08cf6d_a3a9f3a538a2a649690aa0974f4d1c070f6fc910]: version conflict, required seqNo [55840], primary term [1]. current document has seqNo [55843] and primary term [1]', 'index_uuid': 't2PQs1SoTMueXWOCuGtvzQ', 'shard': '16', 'index': 'azul_v2_prod_dcp43_replica'}], 'type': 'version_conflict_engine_exception', 'reason': '[sequencing_protocol_571cc0c7-4dc2-443b-93f4-0ce4af08cf6d_a3a9f3a538a2a649690aa0974f4d1c070f6fc910]: version conflict, required seqNo [55840], primary term [1]. current document has seqNo [55843] and primary term [1]', 'index_uuid': 't2PQs1SoTMueXWOCuGtvzQ', 'shard': '16', 'index': 'azul_v2_prod_dcp43_replica'}, 'status': 409}). Total # of errors: 1, retrying.
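
For context, the 409 comes from ES's optimistic concurrency control: each conditional write states the seqNo/primary term it last observed, and ES rejects the write if another writer has bumped the document in the meantime. A minimal, self-contained illustration of that mechanism (plain Python, not the actual indexer code):

class Doc:
    """Stand-in for an ES document with optimistic concurrency metadata."""

    def __init__(self):
        self.seq_no = 0
        self.body = {}

    def conditional_write(self, body, if_seq_no):
        # Mirror ES's version_conflict_engine_exception (HTTP 409): reject the
        # write if the document changed since the writer last read it.
        if if_seq_no != self.seq_no:
            raise RuntimeError(f'409: required seqNo [{if_seq_no}], '
                               f'current document has seqNo [{self.seq_no}]')
        self.body = body
        self.seq_no += 1


doc = Doc()
observed = doc.seq_no
doc.conditional_write({'content': 'a'}, if_seq_no=observed)  # first writer wins
doc.conditional_write({'content': 'b'}, if_seq_no=observed)  # second writer gets the 409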

During an attempted reindex of catalog dcp32, 2 million replica writes resulted in such a 409 response from ES, compared to one thousand on a previous reindex without the fix for #6582:

[screenshot: CleanShot 2024-10-23 at 08 35 15@2x]

The retry queue filled up with notifications and overall progress was slow:

[screenshot: CleanShot 2024-10-23 at 08 31 53@2x]

I tried pushing the retries to ES by setting retry_on_conflict, but this did not alleviate the issue much: while there were fewer conflicts, the 1 min ES client timeout kicked in more often.

Index: src/azul/indexer/document.py
===================================================================
diff --git a/src/azul/indexer/document.py b/src/azul/indexer/document.py
--- a/src/azul/indexer/document.py	(revision 0f4f04cf1d7f6f85a84329f733fdd6df2d8618d3)
+++ b/src/azul/indexer/document.py	(date 1729658882164)
@@ -1585,5 +1585,14 @@
             'upsert': super()._body(field_types)
         }
 
+    def to_index(self,
+                 catalog: Optional[CatalogName],
+                 field_types: CataloguedFieldTypes,
+                 bulk: bool = False
+                 ) -> JSON:
+        result = super().to_index(catalog, field_types, bulk)
+        k, v = 'retry_on_conflict', 10
+        return {**result, f'_{k}': v} if bulk else {**result, 'params': {k: v}}
+
 
 CataloguedContribution = Contribution[CataloguedEntityReference]
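
For reference, this is the shape that the two branches of the patched to_index produce; the dict content here is an illustrative stand-in, not the real document body:

result = {'upsert': {}, 'script': {}}  # stand-in for super().to_index(...)

# bulk path: the setting rides along as one of the client's underscore-prefixed
# keyword arguments in the bulk action
bulk_variant = {**result, '_retry_on_conflict': 10}

# non-bulk path: the setting becomes a query parameter of the single-document
# update request via 'params'
non_bulk_variant = {**result, 'params': {'retry_on_conflict': 10}}
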
@hannes-ucsc hannes-ucsc added the orange [process] Done by the Azul team label Oct 23, 2024
@hannes-ucsc
Member Author

hannes-ucsc commented Oct 23, 2024

The way the distinction between bulk and non-bulk leaks into the to_index method is annoying. The patch above shows that. The raw bulk API is quite clean but the Python ES client erases that by merging body and parameters using proprietary keyword arguments whose names start with an underscore. The bulk parameter to to_index should be removed and that method should return the keyword arguments to the client method. Another method should be used to translate to_index's return value to a bulk action, if and when required.
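
A rough sketch of that split, using hypothetical names and a simplified body rather than the actual Azul classes:

from typing import Any


class ReplicaSketch:
    """Hypothetical, simplified stand-in for the real document classes."""

    def _body(self) -> dict[str, Any]:
        return {'upsert': {}, 'script': {}}  # placeholder content

    def to_index(self) -> dict[str, Any]:
        # Keyword arguments for the non-bulk client method; no knowledge of
        # bulk vs. non-bulk leaks in here.
        return {'body': self._body(), 'retry_on_conflict': 10}

    def to_bulk_action(self) -> dict[str, Any]:
        # Translate the client kwargs into a bulk action, if and when required,
        # confining the underscore-prefixed names to this one method.
        kwargs = self.to_index()
        body = kwargs.pop('body')
        return {**body, **{f'_{k}': v for k, v in kwargs.items()}}


assert ReplicaSketch().to_bulk_action()['_retry_on_conflict'] == 10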

@hannes-ucsc
Member Author

hannes-ucsc commented Oct 23, 2024

I suspect that a partition of a bundle redundantly emits replicas that are also emitted by all other partitions of that bundle. That suspicion would need to be verified.

@hannes-ucsc hannes-ucsc changed the title Contention Optimistic lock contention on HCA replicas Oct 23, 2024
@hannes-ucsc
Member Author

Currently reindexing dcp43 with this hot-deployed temporary hotfix:

Index: deployments/prod/environment.py
===================================================================
diff --git a/deployments/prod/environment.py b/deployments/prod/environment.py
--- a/deployments/prod/environment.py	(revision 0f4f04cf1d7f6f85a84329f733fdd6df2d8618d3)
+++ b/deployments/prod/environment.py	(revision cebc97550db9409849f32456f9c51eef9347d164)
@@ -1280,5 +1280,5 @@
             'channel_id': 'C04JWDFCPFZ'  # #team-boardwalk-prod
         }),
 
-        'AZUL_ENABLE_REPLICAS': '1',
+        'AZUL_ENABLE_REPLICAS': '0',
     }

@hannes-ucsc
Member Author

hannes-ucsc commented Oct 23, 2024

Assignee to file PR against prod with the above temporary hotfix.

The PR should only reindex lm7 and lm8. dcp43 is currently being reindexed with this hotfix hot-deployed. I hope to get approval for dcp43 this week so that we can skip reindexing dcp42 and file the deletion of dcp42 as a permanent hotfix as well. We would then need to backport three hotfixes: this one, the lm8 fix (#6444) and the dcp42 removal.

@hannes-ucsc
Member Author

hannes-ucsc commented Oct 28, 2024

Donor (donor_organism) and protocol (…_protocol) replicas should be handled the same way HCA project and AnVIL dataset replicas are currently handled on develop. We've been using the term implicit hub to refer to these types of entities, but I would like to use hot entities instead. What distinguishes these types of entities from, say, specimens is not that their instances are only implicitly linked to a file (only HCA projects are implicitly linked), but that they have few instances that are referenced by many files. Moreover, neither projects nor datasets are currently hubs; datasets will only become hubs as part of #6529.

We already track the UUIDs of protocol and donor entities in contributions and aggregate documents, but we need to ensure that the aggregation of such references is not lossy, by failing the aggregation should the aggregator hit its limit.
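
A minimal sketch of that non-lossy behaviour, with hypothetical names rather than the actual aggregator classes:

class LossyAggregationError(Exception):
    """Raised instead of silently dropping hot-entity references."""


class EntityIdAccumulator:
    """Hypothetical accumulator for the UUIDs of referenced donors and protocols."""

    def __init__(self, max_size: int = 100):
        self.max_size = max_size
        self.entity_ids: set[str] = set()

    def accumulate(self, entity_id: str) -> None:
        self.entity_ids.add(entity_id)
        if len(self.entity_ids) > self.max_size:
            # Fail loudly: silently truncating the list of references would
            # orphan the replicas of the hubs that were dropped.
            raise LossyAggregationError(f'more than {self.max_size} referenced entities')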


hannes-ucsc added a commit that referenced this issue Oct 29, 2024
This change alleviates the contention (#6648) a little bit but doesn't resolve it. Once #6648 is fully addressed, this should significantly reduce the remaining bits of contention, of which there were just a few dozen incidents per reindex before we fixed the absence of donor and protocol replicas, which are the main drivers of the contention.
@achave11-ucsc
Member

@hannes-ucsc: "Spike to confirm my suspicion."

@achave11-ucsc achave11-ucsc added the spike:3 [process] Spike estimate of three points label Nov 5, 2024
@nadove-ucsc
Contributor

nadove-ucsc commented Nov 6, 2024

It is true that when a bundle is partitioned, many replicas are written by multiple partitions. An example is bundle e831adae-359a-5d18-bd51-485556a8f6cb, which was divided into 32 partitions that performed a total of 22,885 successful replica writes for only 7,829 unique replicas. All 33 notifications for this bundle succeeded without retrying.


CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-indexer-prod-contribute, /aws/lambda/azul-indexer-prod-contribute_retry
start-time: 2024-10-22T02:44:41.760Z
end-time: 2024-10-22T02:47:07.959Z
query-string:

fields @message
| filter @requestId in ["cc8799d-5b2e-595c-ba51-260c48a072c4", "a8bc4010-ec14-5d90-a961-1f81ffacd6f2", "e47401ea-89d2-5ff9-8227-5e7b9a6b3d03", "041abbb9-aaad-5727-9817-e428b9138708", "481a469f-ec7f-5eb4-b686-190effc21876", "4acb5666-04c2-560c-ab74-09f717705e2e", "116fb059-5dd6-54e7-8d54-efc84e631c78", "2155dd7a-616d-5abb-9897-82eb72d505e5", "dd5e79eb-6d46-5d0f-b054-267a7ef7aa34", "370c0a9d-fb0a-51c3-aa74-550ce8cb295c", "2cbb66bd-108b-5d54-8e0a-392c03331362", "93897a7f-bb35-51e2-af1f-25eea8f5e720", "5888cedb-56e5-5a67-a347-914a4524dca5", "6e8af882-8fcb-57a4-93e7-a7139c0adf86", "e28e002f-5226-5237-a5be-5fbe41e714f5", "5acc5634-38a0-5705-8b13-09a52b14eea4", "44320cfc-b0be-5aa9-835f-536da30657e3", "d29e2a2f-d494-593e-9a43-75bec75531a1", "56508491-3e45-5010-9425-db002d84272b", "1d0eb533-d2b6-529a-a7c1-eaa2c1772a05", "e7ef3234-cd53-5ba4-b4c8-6da5de9310c6", "986f0007-28ca-5cd9-a2ac-5f963b0bb193", "0927f60a-74cf-5c9b-b442-a99b677b6550", "5dd7d950-ba57-5c07-a232-86c1a9974efd", "84d5185d-0511-55c8-a0cc-47cc16a84c89", "8512f72f-8f82-5bbb-b1ac-a8be9c43a4e5", "acb35cdd-28e8-538d-83a2-034c3b29bbaf", "e14b9676-daf1-53a9-8283-c9e7c9456a28", "83788504-2b12-5508-a7fc-da9538d333ca", "63f8152b-c2ec-5a6b-af17-4abad12cca44", "c1ad935b-70ad-52d4-9290-ccca636ccb75", "1fc5ad3c-5343-56d2-a696-f2ad40829b98", "23ae4221-a546-50e3-ade7-1db0b687687c", "ecc8799d-5b2e-595c-ba51-260c48a072c4"]
  and @message like "Successfully wrote replica"
| parse @message "Successfully wrote replica of *" as coords
| stats count(*) as write_count by coords
| stats count(*) as replica_count by write_count
| sort replica_count desc

Here are the results for files only:


write_count replica_count
2 3104
3 1376
1 107
32 1

And for all entity types:


write_count replica_count
3 4130
2 3408
1 109
26 18
22 12
23 10
12 10
18 10
11 10
31 10
32 9
10 8
15 8
13 8
9 6
24 6
19 6
20 6
21 6
29 4
25 4
16 4
14 4
27 4
5 4
30 4
17 4
8 2
28 2
7 2

This shows that the most common outcome was for a given entity in this bundle to be written as a replica 2 or 3 times. However, some replicas were written only once, and it is impossible to determine from the logs whether any of these writes were truly redundant or not, as we do not log the hub IDs when writing replicas.
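
If we wanted to settle that on a future reindex, a hedged sketch of what the additional logging could look like (hypothetical helper, not the current index service code):

import logging

log = logging.getLogger(__name__)


def log_replica_write(coordinates: str, hub_ids: list[str]) -> None:
    # Hypothetical: include the hub IDs so that redundant writes of the same
    # replica by different partitions can be told apart in CloudWatch.
    log.info('Successfully wrote replica of %s for hubs %s',
             coordinates, sorted(hub_ids))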

@nadove-ucsc nadove-ucsc removed their assignment Nov 6, 2024
@achave11-ucsc
Member

achave11-ucsc commented Nov 6, 2024

@hannes-ucsc: "We won't address the redundant replica write issue at this time, but instead file a separate issue. Assignee to implement the solution outlined in my comment above. Since we don't have a personal deployment to test this at scale we need to be careful when promoting this.the PR for this should also include a separate commit that enables replicas in prod again."
