Optimistic lock contention on HCA replicas #6648
The way the distinction between bulk and non-bulk leaks into the
I suspect that one partition of a bundle redundantly emits replicas also emitted by all other partitions. That's a suspicion that would need to be verified.
Currently reindexing dcp43 with this hot-deployed temporary hotfix:

Index: deployments/prod/environment.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/deployments/prod/environment.py b/deployments/prod/environment.py
--- a/deployments/prod/environment.py (revision 0f4f04cf1d7f6f85a84329f733fdd6df2d8618d3)
+++ b/deployments/prod/environment.py (revision cebc97550db9409849f32456f9c51eef9347d164)
@@ -1280,5 +1280,5 @@
'channel_id': 'C04JWDFCPFZ' # #team-boardwalk-prod
}),
- 'AZUL_ENABLE_REPLICAS': '1',
+ 'AZUL_ENABLE_REPLICAS': '0',
}
Assignee to file PR against The PR should only reindex
Donor ( We already track the UUIDs of protocol and donor entities in contributions and aggregate documents, but we need to ensure that the aggregation of such references is not lossy, by failing the aggregation should the aggregator hit its limit (see the sketch below).
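A minimal sketch of the "fail instead of silently truncating" behavior described above. All names here (`MAX_REFS`, `AggregationLimitExceeded`, `aggregate_refs`) are hypothetical; Azul's actual aggregator API differs:

```python
# Hypothetical sketch, not Azul code: an aggregation step that refuses to
# silently drop entity references once its size limit is reached.

MAX_REFS = 100  # assumed limit on the number of aggregated references


class AggregationLimitExceeded(Exception):
    """Raised instead of truncating, so a lossy aggregate can never go unnoticed."""


def aggregate_refs(refs: list[str], limit: int = MAX_REFS) -> list[str]:
    deduped = sorted(set(refs))
    if len(deduped) > limit:
        # Failing here is the point: a truncated list of donor/protocol UUIDs
        # would silently break anything that resolves replicas through them.
        raise AggregationLimitExceeded(
            f'{len(deduped)} references exceed the limit of {limit}')
    return deduped
```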
This change alleviates the contention (#6648) a little bit but doesn't resolve it. Once #6648 is fully addressed, this should significantly reduce the remaining bits of contention, of which there were just a few dozen incidents per reindex before we fixed the absence of donor and protocol replicas, which are the main drivers of the contention.
@hannes-ucsc: "Spike to confirm my suspicion."
It is true that when a bundle is partitioned, many replicas are written by multiple partitions. An example is bundle
[CloudWatch Logs Insights results: replica write counts for that bundle, first for files only, then for all entity types]
This shows that the most common outcome was for a given entity in this bundle to be written as a replica 2 or 3 times. However, some replicas were written only once, and it is impossible to determine from the logs whether any of these writes were truly redundant, as we do not log the hub IDs when writing replicas.
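A toy illustration (hypothetical entity names and data, not Azul code) of why write counts of 2 or 3 per entity are the expected outcome: every partition emits replicas for the entities it references, including the donor and protocol entities shared by all partitions, so those low-cardinality entities are written once per partition:

```python
# Toy illustration with made-up entity names: each partition of a bundle
# emits replicas for every entity it references, and the shared donor and
# protocol entities are referenced by all partitions.
from collections import Counter

partitions = {
    'partition-0': ['file-1', 'file-2', 'donor-A', 'protocol-X'],
    'partition-1': ['file-3', 'file-4', 'donor-A', 'protocol-X'],
    'partition-2': ['file-5', 'donor-A', 'protocol-X'],
}

write_counts = Counter(
    entity
    for emitted in partitions.values()
    for entity in emitted
)
print(write_counts.most_common(3))
# [('donor-A', 3), ('protocol-X', 3), ('file-1', 1)] — the shared,
# low-cardinality entities are exactly the ones written multiple times.
```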
@hannes-ucsc: "We won't address the redundant replica write issue at this time, but instead file a separate issue. Assignee to implement the solution outlined in my comment above. Since we don't have a personal deployment to test this at scale, we need to be careful when promoting this. The PR for this should also include a separate commit that enables replicas in prod again."
The solution to #6582 fixed the omission of replicas for several types of low-cardinality, i.e., frequently referenced entities, such as `donor_organism`, `sequencing_protocol` and `library_preparation_protocol`. This led to increased contention on the respective documents in the replica index:

[WARNING] 2024-10-23T14:36:17.525Z 6e6b10f5-9e45-5b8b-9b5a-4f6695def641 azul.indexer.index_service There was a conflict with document ReplicaCoordinates(entity=EntityReference(entity_type='sequencing_protocol', entity_id='571cc0c7-4dc2-443b-93f4-0ce4af08cf6d'), content_hash='a3a9f3a538a2a649690aa0974f4d1c070f6fc910'): ConflictError(409, 'version_conflict_engine_exception', {'error': {'root_cause': [{'type': 'version_conflict_engine_exception', 'reason': '[sequencing_protocol_571cc0c7-4dc2-443b-93f4-0ce4af08cf6d_a3a9f3a538a2a649690aa0974f4d1c070f6fc910]: version conflict, required seqNo [55840], primary term [1]. current document has seqNo [55843] and primary term [1]', 'index_uuid': 't2PQs1SoTMueXWOCuGtvzQ', 'shard': '16', 'index': 'azul_v2_prod_dcp43_replica'}], 'type': 'version_conflict_engine_exception', 'reason': '[sequencing_protocol_571cc0c7-4dc2-443b-93f4-0ce4af08cf6d_a3a9f3a538a2a649690aa0974f4d1c070f6fc910]: version conflict, required seqNo [55840], primary term [1]. current document has seqNo [55843] and primary term [1]', 'index_uuid': 't2PQs1SoTMueXWOCuGtvzQ', 'shard': '16', 'index': 'azul_v2_prod_dcp43_replica'}, 'status': 409}). Total # of errors: 1, retrying.
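For context, a minimal sketch of the optimistic-locking write pattern that produces these 409s (not Azul's actual indexer code; the endpoint and document shape are made up): the writer reads the replica document, merges its contribution, and writes it back conditioned on the `_seq_no` and `_primary_term` it read, so a concurrent writer triggers a `version_conflict_engine_exception`:

```python
# Sketch of ES optimistic concurrency control, not Azul's indexer code.
# Endpoint and document shape are illustrative only.
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConflictError

es = Elasticsearch('http://localhost:9200')  # placeholder endpoint
index = 'azul_v2_prod_dcp43_replica'


def upsert_replica(doc_id: str, hub_id: str) -> None:
    while True:
        current = es.get(index=index, id=doc_id)
        hub_ids = set(current['_source'].get('hub_ids', []))
        merged = {**current['_source'], 'hub_ids': sorted(hub_ids | {hub_id})}
        try:
            es.index(index=index, id=doc_id, document=merged,
                     # The write only succeeds if nobody touched the document
                     # since we read it; otherwise ES answers with a 409.
                     if_seq_no=current['_seq_no'],
                     if_primary_term=current['_primary_term'])
            return
        except ConflictError:
            continue  # another writer won the race; re-read and retry
```

With many partitions of the same bundle racing on a handful of donor and protocol documents, a read-merge-write loop like this spins repeatedly, which is the contention described in this issue.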
During an attempted reindex of catalog dcp32, 2 million replica writes resulted in such a 409 response from ES, compared to one thousand on a previous reindex without the fix for #6582:
The retry queue filled up with notifications and overall progress was slow:
I tried pushing the retries to ES by setting `retry_on_conflict`, but this did not alleviate the issue much. While there were fewer conflicts, the ES client timeout of 1 min kicked in more often.
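`retry_on_conflict` is a standard parameter of the Elasticsearch update API; a sketch of what pushing the retries into ES looks like (illustrative script and names, not Azul's code):

```python
# Sketch only: let ES re-run a scripted update internally on version
# conflicts instead of surfacing every 409 to the client.
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')  # placeholder endpoint
index = 'azul_v2_prod_dcp43_replica'


def add_hub(doc_id: str, hub_id: str) -> None:
    es.update(
        index=index,
        id=doc_id,
        retry_on_conflict=5,  # ES retries the update on conflicts before failing
        script={
            'source': 'if (!ctx._source.hub_ids.contains(params.hub)) '
                      '{ ctx._source.hub_ids.add(params.hub) }',
            'params': {'hub': hub_id},
        },
    )
```

As noted above, this trades client-side conflict handling for longer-running requests: the retries happen inside a single update call, so a fixed client timeout (1 min here) is hit more often.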