
RFC: Append-only Indices #12886

Closed
sarthakaggarwal97 opened this issue Mar 24, 2024 · 29 comments
Labels: enhancement, Indexing, Roadmap:Cost/Performance/Scale
@sarthakaggarwal97 (Contributor) commented Mar 24, 2024

Is your feature request related to a problem? Please describe

OpenSearch today caters to various use cases such as log analytics, full-text search, metrics, observability, and security events. By default, any index created in OpenSearch allows updates and deletes on the ingested documents. While this flexibility serves many of the use cases above, time-series workloads such as logs, metrics, observability, and security events do not require update or delete operations. Updates and deletes are expensive: OpenSearch must look up the existing document, apply soft deletes, and consequently do additional work during merges, all of which can hinder overall performance. This cost can be avoided entirely for use cases that do not need these operations. Moreover, certain optimizations (listed below) become possible once we know the data will never be updated or deleted at the document level.

Disabling updates/deletes on an index's documents allows us to handle multiple use cases efficiently:

  • Onboard data structures optimized for append-only workloads: Recently, an RFC was opened to support pre-computed data structures like Star Tree, where any update or delete would require an expensive rebuild.
  • Support security-driven use cases: There have been requests from users to support indices whose documents are immutable. Such requests fall within use cases like audit logs, security logs, transactions, ledgers, etc., where the core requirement is to ensure documents cannot be changed or altered.
  • Optimizing index settings: We can tune the merge policy to allow faster access to more recent data, and we can support larger merged segment sizes (currently index.merge.policy.max_merged_segment defaults to 5gb). By preventing deletes and updates we avoid a chunk of merges, so these larger segments become viable merge candidates, allowing us to raise the 5gb limit.

Describe the solution you'd like

We propose to introduce the concept of append-only indices in OpenSearch to support the aforementioned use cases. By enforcing document immutability, we would deny any update or delete of a document. This will help reduce the memory footprint of indices (e.g. the version map) and also unlock future optimizations and features built on this restriction, e.g.:

  1. We can support automated rollovers with append-only indices.
  2. With the future support of Writable Warm, we can enable auto-migration of shards/segments instead of keeping all the segments/shards hot on data nodes.

Implementation details: TBU based on community feedback.
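
To make the proposal concrete, here is a minimal sketch of what index creation could look like from the Java client side, assuming a hypothetical index-level setting named index.append_only.enabled (the final setting name and API surface are exactly what this RFC should settle):

    import org.opensearch.action.admin.indices.create.CreateIndexRequest;
    import org.opensearch.client.Client;
    import org.opensearch.common.settings.Settings;

    // Sketch only: "index.append_only.enabled" is a placeholder setting name.
    static void createAppendOnlyIndex(Client client) {
        CreateIndexRequest request = new CreateIndexRequest("security-logs")
            .settings(Settings.builder()
                .put("index.append_only.enabled", true)   // hypothetical setting
                .build());
        client.admin().indices().create(request).actionGet();
    }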

Additional context

FAQs:

Q: How would it be different from data streams?
A: While Data Streams optimize automated rollover of time-series data, they still support all CRUD operations on the backing indices. With append-only mode we aim to provide specialized optimizations and security features for such indices as core functionality.

Q: What would be the APIs/features that we will not allow for append-only indices?
A: Some initial thoughts on the APIs/features we may not be able to support are:

  1. We will not support updates and deletes of documents in the index.
  2. The _doc API will be denied, to prevent indexing documents with a custom id as well as updates and deletes.
  3. The _split and _shrink APIs will be denied, to avoid removal of documents from the underlying shards of the source index.
@shwetathareja (Member) commented:

Thanks @sarthakaggarwal97 for the proposal. +1 on all the optimizations which can be applied under the hood for append-only indices.

Recently, I found that the Security plugin already has a way to configure immutable indices, and the definition looks similar:

public static final String SECURITY_COMPLIANCE_IMMUTABLE_INDICES = "plugins.security.compliance.immutable_indices";

FYI, to ensure there isn't any conflict between the two in terms of implementation later.

A couple of questions:

  1. Would append-only semantics be applicable for DataStreams?
  2. We can support automated rollovers with append-only indices.

Without the definition of an alias (pointing to the index) from users, automated rollover can't work. With DataStreams, it is possible as the DataStream itself provides that logical construct on which searches and indexing can be performed.

  1. _doc API will be denied to avoid document index with custom id

Though not a lot of users use custom doc ids with time-series workloads, some still may, and a custom id can be helpful in establishing consistency and debugging issues across their systems. Such users would not benefit from append-only semantics as much, since the version map would still be needed for request-id (doc-id) idempotency.

  1. We need to think in terms of user experience and how it ties back to RFC: Application Based Configuration Templates #12683

  2. With the future support of [RFC] Support for writable warm indices on Opensearch #12809, we can enable auto-migration of shards/segments instead of keeping all the segments/shards hot on data nodes.

Auto tiering/migration should be supported irrespective of append-only or updates. It would definitely be more efficient for append-only indices, but it can work for indices that take updates/deletes as well, with other factors such as update frequency defining the efficiency.

@sarthakaggarwal97 (Contributor, Author) commented:

@shwetathareja thank you for your comments!

FYI, to ensure there isn't any conflict between the two in terms of implementation later.

Yes, I came across the immutable indices. I agree that we should keep the implementations in sync to avoid conflicts. Moreover, we can also look back and see whether immutable indices can benefit from append-only semantics in OpenSearch core. I will have to dive into the Security plugin's implementation to do that.

Would append-only semantics be applicable for DataStreams?

Data streams allow all CRUD operations on their documents. I believe some of the suggested semantics/optimizations for append-only indices would be expensive for data streams.

Without the definition of an alias (pointing to the index) from users, automated rollover can't work

This is a valid point. If we want to support this, we should look to mandate an alias in case the user wants to perform automated rollovers with append-only indices.

Though not a lot of users use custom doc ids with time-series workloads, some still may, and a custom id can be helpful in establishing consistency and debugging issues across their systems

Denying custom ids helps us avoid shard skew; otherwise one shard or one segment may become a hotspot, and with the merge optimizations in place I am not sure we should allow it. I would like more input on this from the community.

We need to think in terms of user experience

Agreed, I am aligned on this. We should provide a ready-to-go template for append-only indices. Tagging @mgodwan to give more insights.

Auto tiering/migration should be supported irrespective of append-only or updates

Yes, this was added to highlight that append-only indices would handle auto-tiering efficiently: since we won't allow updates, we can also create bigger segments over time without worrying about them.

@reta (Collaborator) commented Mar 27, 2024

Data streams allow all CRUD operations on their documents. I believe some of the suggested semantics/optimizations for append-only indices would be expensive for data streams.

@sarthakaggarwal97 it looks to me like this is already possible with data streams?

Data streams are designed for use cases where existing data is rarely, if ever, updated. You cannot send update or deletion requests for existing documents directly to a data stream. Instead, use the update by query and delete by query APIs. - https://www.elastic.co/guide/en/elasticsearch/reference/7.10/data-streams.html

@sarthakaggarwal97 (Contributor, Author) commented:

@reta I just tried it, and it looks like we can update the documents in the backing indices of a datastream. So the backing indices are not truly append-only.

@reta (Collaborator) commented Mar 30, 2024

@reta I just tried it, and it looks like we can update the documents in the backing indices of a datastream. So the backing indices are not truly append-only.

@sarthakaggarwal97 thank you, that's by design (and in accordance with the documentation). I am wondering if providing the capability to have backing indices be truly append-only would be a natural improvement over data streams (in scope of this feature proposal)?

@shwetathareja (Member) commented Apr 2, 2024

Data streams allow all CRUD operations on their documents. I believe some of the suggested semantics/optimizations for append-only indices would be expensive for data streams.

@sarthakaggarwal97 append-only semantics should be applicable to Data Streams, as DS abstractions are meant for time-series workloads. In order not to introduce breaking changes, this should still be driven by configuration. Be it regular indices or data-stream-backed indices, append-only semantics should work for both.

@RS146BIJAY (Contributor) commented:

Thanks @shwetathareja and @reta for the feedback. Append-only indices will be supported through a configurable setting. If this setting is enabled, all update and delete operations on the index (UPDATE, DELETE, UPSERT, UPDATE BY QUERY, DELETE BY QUERY, etc.) will be blocked. Initially, we will also block indexing a document with a custom id, to ensure an index operation will always create a new document rather than updating an existing one. Additionally, append-only indices will not be enabled for data streams in the initial phase, although they can be extended to those use cases in the future.

@reta (Collaborator) commented Dec 4, 2024

Thanks @shwetathareja and @reta for the feedback. Append-only indices will be supported through a configurable setting.

Thanks @RS146BIJAY , I would expect it to be per index upon creation, right? (aka index setting).

@mgodwan (Member) commented Dec 4, 2024

Initially, we will also block indexing a document with a custom id, to ensure an index operation will always create a new document rather than updating an existing one.

How would retries work here?

@mgodwan (Member) commented Dec 4, 2024

Additionally, append-only indices will not be enabled for data streams in the initial phase

Since the proposal is to have an index setting to configure this, do you see issues with users trying to configure it for backing indices for data streams?

@RS146BIJAY (Contributor) commented:

Thanks @RS146BIJAY , I would expect it to be per index upon creation, right? (aka index setting).

Yes.

@RS146BIJAY (Contributor) commented Dec 4, 2024

How would retries work here?

Retries will be separate update requests for OpenSearch, right? So this will reject each of the update requests. This is similar to when a write block is present on the index and the user tries to index documents via retries: since writes are blocked, every write request is rejected.

I am thinking of implementing this as a configuration which blocks all forms of updates on an index.
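
For illustration, a rough sketch of what such a block could look like when validating write requests, reusing the hypothetical index.append_only.enabled setting from above (method and error messages are illustrative, not the actual implementation; update-by-query and delete-by-query would need equivalent checks on their own paths):

    import org.opensearch.action.DocWriteRequest;
    import org.opensearch.cluster.metadata.IndexMetadata;

    // Illustrative validation: reject mutating operations before they reach a shard.
    static void validateAppendOnly(DocWriteRequest<?> request, IndexMetadata indexMetadata) {
        boolean appendOnly = indexMetadata.getSettings()
            .getAsBoolean("index.append_only.enabled", false);  // hypothetical setting
        if (appendOnly == false) {
            return;
        }
        if (request.opType() == DocWriteRequest.OpType.UPDATE
            || request.opType() == DocWriteRequest.OpType.DELETE) {
            throw new IllegalArgumentException("[" + request.opType().getLowercase()
                + "] operations are not allowed on append-only index [" + request.index() + "]");
        }
        if (request.id() != null) {
            // Initial phase: block custom ids so an index op can never become an update.
            throw new IllegalArgumentException(
                "custom document ids are not allowed on append-only index [" + request.index() + "]");
        }
    }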

@mgodwan (Member) commented Dec 4, 2024

This is similar to when a write block is present on the index and the user tries to index documents via retries: since writes are blocked, every write request is rejected.

  1. This is different, since writes are not blocked, and a user who received an error may want to confirm whether the document was inserted. For append-only indices, we may need to define an experience that lets users understand whether the document was ingested; otherwise users may build retry flows without doc ids, which can lead to duplicate documents.
  2. There are internal retries in case of some transport errors, which retry primary shard requests. We need to think through how those will work.

@shwetathareja (Member) commented:

Thanks @RS146BIJAY ! It is reasonable to start with blocking all operations except document creation on regular indices. I am fine with solving Data Streams in the next phase, but wanted to know if there are specific challenges with Data Streams, as a DS is an abstraction over backing indices, which are like regular indices except for certain restrictions.

Regarding the handling of failure scenarios, including retries, it would be good to capture those explicitly as you start working on the low-level design. I agree with @mgodwan that internal retries should be handled properly, as opposed to explicit retries from the user. In the case of auto-generated document ids, any retry from the user will be treated as a new document (the same as today).

This is similar to when a write block is present on the index and the user tries to index documents via retries: since writes are blocked, every write request is rejected.

@RS146BIJAY Are you planning to introduce finer blocks at the operation level, similar to the read/write blocks at the index level?

@RS146BIJAY (Contributor) commented Dec 5, 2024

if there are specific challenges with Data Streams, as a DS is an abstraction over backing indices, which are like regular indices except for certain restrictions

At a high level, I do not see any specific issue with Data Streams, but I still need to validate whether this can break any specific use case.

This is similar to when a write block is present on the index and the user tries to index documents via retries: since writes are blocked, every write request is rejected.

Are you planning to introduce finer blocks at the operation level, similar to the read/write blocks at the index level?

No, sorry for the earlier confusion. There is a single append_only property (an index setting, set only once during index creation). If an index is append-only, all forms of update operations will be blocked on it and only new documents can be added to it.
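
A possible shape for that setting, sketched with OpenSearch's Setting API (the name is still a placeholder); Setting.Property.Final is what would make it settable only at index creation and immutable afterwards:

    import org.opensearch.common.settings.Setting;

    // Sketch: an index-scoped, final boolean setting.
    public static final Setting<Boolean> INDEX_APPEND_ONLY_ENABLED_SETTING =
        Setting.boolSetting(
            "index.append_only.enabled",   // placeholder name
            false,                         // default: regular, mutable index
            Setting.Property.IndexScope,
            Setting.Property.Final);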

a user who received an error may want to confirm whether the document was inserted

Thanks @mgodwan for pointing out this scenario. I need to dive deeper into how to handle retry scenarios, since we do not allow passing custom doc ids.

@RS146BIJAY (Contributor) commented Dec 19, 2024

For coordinator retries

In order to handle retries from the coordinator (in case of ConnectTransportException, NodeClosedException, etc.) for append_only indices, I am considering three approaches:

Approach 1: Handling retries at coordinator layer itself

A straightforward method to handle retries from the coordinator for append-only indices is to filter out documents that have already been indexed in OpenSearch before each retry. This can be achieved by making an additional transport call to search for all the documents passed in the bulk request, using their respective document ids, before retrying. If there is an entry for a document in the search response, we exclude that document from the retry call. This ensures OpenSearch tries to index a given document (with one document id) only once. We can use the responses from the search API calls to reconstruct the final response of the bulk API call for those documents.

Issue with this approach

  1. A significant issue with this approach is that the search call may not return recently indexed documents (due to eventual consistency). Those documents could then be indexed again during the retry, returning a validation error to the user even though the document was correctly indexed.
  2. Another issue is that the search transport call made to retrieve details about the bulk item documents may itself fail. These errors would have to be handled separately, further complicating the retry logic.

Approach 2: Handling retries at data node layer (at TransportShardBulkAction)

To avoid making additional transport calls from the coordinator to identify already indexed documents, we can handle retries at TransportShardBulkAction, before executing the bulk item request for each document on the primary shard. Before triggering indexing on the primary shard, we can create an IndexSearcher on it and query for the document using its id. If the IndexSearcher returns a result and this is a retry call, we skip calling executeBulkItemRequest for that document and construct a response using the result returned by the searcher.

Pros

  1. Avoids a transport call to check for already indexed documents.

Cons

  1. The searcher may still miss recently indexed documents (due to eventual consistency), so there is still a possibility that a retry bulk request is made for an already indexed document.

Approach 3: Handling retries at InternalEngine layer (Preferred Approach)

The best way to prevent the same document from being indexed again during a retry is to let the retry request pass through to the engine and handle it at the InternalEngine layer itself. During document indexing, we can use the isRetry flag passed in the indexing request to identify retries. To figure out whether a document is already indexed, we can use its current version (resolved via a searcher or the version map). In case the version of a document equals 1 and it is a retry request for an append-only index, we can reconstruct the response and skip indexing the document in Lucene as well as skip the translog entry for this request. Additional handling is needed in BulkPrimaryExecutionContext to skip the translog sync for this retry request.

Pros

  1. This approach avoids a transport call.
  2. It guarantees that a document is indexed into OpenSearch only once (even with retries configured), as we fetch version information using both the version map (a thread-safe map) and an IndexSearcher, ensuring real-time consistency.

For user retries

Since the user uses custom document ids to uniquely identify documents in a bulk indexing request, the only way to support this is by allowing custom doc ids to be passed in indexing requests. To ensure a document with the same id is indexed only once, we will use the document version (resolved via the version map or a searcher) to check whether a document with that id already exists. If the current version of the document is greater than 1, we will return a validation error indicating that update operations are not allowed for append-only indices.
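
A condensed sketch of the engine-level check described above, covering both the coordinator-retry and user-retry cases. Here resolveDocVersion(...) and indexIntoLucene(...) are stand-ins for the real InternalEngine internals (the lookup consults the thread-safe version map first and falls back to a Lucene searcher); all names are illustrative:

    import java.io.IOException;
    import org.opensearch.common.lucene.uid.Versions;
    import org.opensearch.index.engine.Engine;

    // Illustrative engine-level handling for an append-only index.
    Engine.IndexResult indexIntoAppendOnlyIndex(Engine.Index index) throws IOException {
        long currentVersion = resolveDocVersion(index.uid());  // version map, then searcher
        if (currentVersion == Versions.NOT_FOUND) {
            return indexIntoLucene(index);  // first time this id is seen: index normally
        }
        if (index.isRetry() && currentVersion == 1L) {
            // The first attempt already indexed this document: reconstruct a success
            // response and skip both Lucene and the translog for this request.
            return new Engine.IndexResult(1L, index.primaryTerm(), index.seqNo(), true);
        }
        // Anything else is a write to an existing id, i.e. an update: reject it.
        throw new IllegalArgumentException(
            "update operations are not allowed on append-only indices");
    }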

@RS146BIJAY (Contributor) commented Dec 19, 2024

Adding a POC link for append only indices with the third approach for retry handling: RS146BIJAY@5334da3

@RS146BIJAY (Contributor) commented Dec 19, 2024

@reta @shwetathareja @mgodwan @Bukhtawar Let me know what you think of the above approach for retries.

@mgodwan (Member) commented Jan 2, 2025

Thanks @RS146BIJAY for the thoughts on this one. I agree with the recommended approach of handling retries at the InternalEngine layer, as it avoids transport calls while ensuring operations are performed in a thread-safe manner.

@reta (Collaborator) commented Jan 3, 2025

The best way to prevent the same document from being indexed again during a retry is to let the retry request pass through to the engine and handle it at the InternalEngine layer itself.

Thanks @RS146BIJAY , I am wondering if the InternalEngine is the only place to make the retry decisions? For example, what if during the retry operation the node goes down or the primary shard changes? Shouldn't the coordinator step in and retry the operation then?

@RS146BIJAY (Contributor) commented Jan 5, 2025

Shouldn't the coordinator step in and retry the operation then?

Coordinator retries still occur when a node goes down, the primary shard changes, or for other failure scenarios. The concern we are addressing above is that if such a retry happens today, an update is also performed for all documents that were already indexed in the first attempt (performing a create and an update operation even for a document create request).

With the above change, we ensure that in such retry scenarios, for an append-only index, the InternalEngine will reconstruct and return the response using the already indexed document (from the first attempt), without updating the document in Lucene or making a translog entry. This ensures that coordinator retries do not create a new version of already indexed documents (by not allowing updates) on an append-only index.

Let me know if I have addressed the concern.

@shwetathareja (Member) commented:

@RS146BIJAY Approach 3 would be preferred, to avoid an extra transport call.

Let's say a bulk insert for docId1 succeeded, then the user sent another bulk insert for docId1, whose first attempt failed due to some issue, resulting in a retry. How will this retry attempt be handled with Approach 3?

@Bukhtawar (Collaborator) commented:

+1 on approach 3

@RS146BIJAY Approach 3 would be preferred, to avoid an extra transport call.

Let's say a bulk insert for docId1 succeeded, then the user sent another bulk insert for docId1, whose first attempt failed due to some issue, resulting in a retry. How will this retry attempt be handled with Approach 3?

Should we update the version map with the success entry only after a successful indexing operation?

@msfroh (Collaborator) commented Jan 7, 2025

Note that the problem with coordinator/user-level retries goes away with pull-based ingestion, since the primary shard directly controls whether an update is applied or not. (Of course, during recovery the primary shard needs to check each update from in-flight batches to see if they were included in the commit point.)

@RS146BIJAY (Contributor) commented Jan 9, 2025

Thanks @msfroh @shwetathareja @Bukhtawar @reta for the feedback.

Let's say a bulk insert for docId1 succeeded, then the user sent another bulk insert for docId1, whose first attempt failed due to some issue, resulting in a retry. How will this retry attempt be handled with Approach 3?

Currently, for a retry of the second request, we will reconstruct the response using the document that was indexed by the first request. Ideally, the response for the second request should have been a validation exception indicating that updates are not allowed for an append-only index, regardless of whether it is a retry.

The reason for this behaviour is that, at the InternalEngine layer, there is no way to differentiate between a retry of the first request (an index call) and a retry of the second request on docId1 (an update call). Due to this, for any retry call (be it an index or an update request), I am reconstructing the response using the already indexed document.

Should we update the version map with the success entry only after a successful indexing operation?

I think this may fail in some scenarios. When retrieving the version in the engine, we query both the version map and the actual Lucene index (using a searcher) when the version map does not have an entry for the docId. So while getting the version for the doc id, if indexing went through, we will still receive a valid version for the document. If we want to update the version map atomically (only when the coordinator receives a successful response), we would probably need to update the Lucene index atomically as well. Another drawback is that this approach would require a separate transport call from the coordinator to update the version map, since the version map is maintained in the InternalEngine.

I can think of three possible ways to ensure a consistent response is returned during indexing, even if there is a retry:

  1. The first approach is to not support passing a custom doc id at all during the indexing call. Since all indexing requests with a custom doc id would be rejected at the transport layer itself, the InternalEngine would only receive retry requests for documents being indexed into OpenSearch for the first time. In such cases, we can reconstruct the response using the already indexed document for any indexing retry request.
  2. The second approach is to let the user decide whether they are OK with receiving a stale document for a retry (for both indexing and update retries), configured via a setting on append-only indices. If the setting is enabled, we always reconstruct the response using the document indexed by the first request, for any type of indexing retry (be it a document create or a document update). If the setting is disabled, we return a validation exception for any retry, ensuring that a stale response is never returned for an indexing request.
  3. The third approach is, during a retry, to compare the document passed in the indexing request with the already indexed document to determine whether this is an index or an update request. If the document in the request matches the indexed document, it indicates the document was inserted in a previous attempt of the same request, and we reconstruct the response using the already indexed document. If they do not match, it indicates a retry of an update request, and we throw a validation exception in the response.

I think for the initial release we can keep things simple and go with Approach 1, by not allowing custom doc ids during the indexing call. Let me know if this makes sense.

@reta (Collaborator) commented Jan 10, 2025

Thanks @RS146BIJAY , I believe for the first case,

  1. The first approach is to not support passing a custom doc id at all during the indexing call. Since all indexing requests with a custom doc id would be rejected at the transport layer itself, the InternalEngine would only receive retry requests for documents being indexed into OpenSearch for the first time. In such cases, we can reconstruct the response using the already indexed document for any indexing retry request.

it is captured by @shwetathareja here #12886 (comment)

  1. The second approach is to let the user decide whether they are OK with receiving a stale document for a retry (for both indexing and update retries), configured via a setting on append-only indices. If the setting is enabled, we always reconstruct the response using the document indexed by the first request, for any type of indexing retry (be it a document create or a document update). If the setting is disabled, we return a validation exception for any retry, ensuring that a stale response is never returned for an indexing request.

Could we make it a per-bulk-request policy (if that makes sense)? Something like "conflict_on_retry": ignore | override | ...

  1. The third approach is, during a retry, to compare the document passed in the indexing request with the already indexed document to determine whether this is an index or an update request. If the document in the request matches the indexed document, it indicates the document was inserted in a previous attempt of the same request, and we reconstruct the response using the already indexed document. If they do not match, it indicates a retry of an update request, and we throw a validation exception in the response.

I think this could be a heavy hit (since we would have to fetch the whole document and compare it). Should we plan for it by adding something like an internal @hash field to store only the hash (that would allow us to compare documents quickly)?
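
For example, a digest could be computed once from the raw document source at parse time and stored in a hidden field, so a retry comparison never touches the full document (the field name and digest choice below are illustrative):

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Sketch: hash the document source once, at parse time, and store the result
    // in an internal field (e.g. "@hash"). On retry, comparing two 32-byte digests
    // replaces fetching and diffing whole documents.
    static byte[] documentHash(byte[] source) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("SHA-256").digest(source);
    }

    static boolean sameDocument(byte[] storedHash, byte[] incomingHash) {
        return MessageDigest.isEqual(storedHash, incomingHash);  // constant-time compare
    }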

@shwetathareja (Member) commented:

Thanks @RS146BIJAY for exploring different approaches for sending consistent responses during failure handling.

I agree with @reta: we can look into generating an internal hash of the document for comparison, and generate it during regular parsing of the document so that it doesn't result in double parsing. Comparing full documents is an expensive operation.

In order to make progress here:

  1. Open a tracking issue to introduce efficient document comparison.
  2. I am not in favor of option 2, as it seems we are passing an internal OpenSearch limitation on to the user.
  3. For option 1, in order not to limit custom document id support for regular indices, so that it doesn't become an adoption blocker for the feature (DataStream is already out of scope):
    a. For auto-generated doc ids, this is not an issue: we can assume there will only ever be system retries (never user retries) and that the document will always be the same during a retry. Don't throw a validation error, so that the user is not confused.
    b. For custom document ids, always throw a validation error, be it a retry or an actual update request. Fix this behavior once document comparison is available, and throw the validation error only when an actual update request is sent by the user.

@mgodwan (Member) commented Jan 14, 2025

Thanks @RS146BIJAY for listing down the possible approaches.

For approach 1, I think it is a good way to start, as it is a two-way-door decision. If we start by blocking the custom _id field for append-only indices (I don't know if there are many use cases that require both append-only and custom _id?), we essentially get a request-id to _id mapping, which ensures we know whether a retry is internal or external based on checks against the data available within the InternalEngine.

Approach 2 requires integration changes for users, as they may now need to handle an exception or provide an additional parameter to ensure the expected document has been ingested into OpenSearch, and it would lead to a breaking change if we later decide to solve this through another mechanism.

@shwetathareja (Member) commented:

Agreed, for append-only use cases using a custom _id may not be the first choice.
