[Meta][GCS] - Improvements and addition of new features to the GCS input #41107
Comments
Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)
cc: @narph, @andrewkroh
@andrewkroh, @efd6, please feel free to expand this issue by suggesting improvements or additions that you would like to see in the input moving forward.
Can you describe this feature in a bit more detail, please? I don't understand what the feature does or what its use case is.
By "SDK level", do you mean API calls into GCS? Those should retry, but I don't see a reason to make this user-configurable. What would a user do with these options? I think the input should simply retry failed list_objects calls continuously, with some back-off. Any failed get_object calls will naturally be retried on the next loop over the bucket's contents (assuming the input state does not positively indicate that the object was ingested).
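To make the suggestion above concrete, here is a minimal, language-agnostic sketch of "retry continuously with some back-off", written in Python for brevity. It is not the input's actual code: the helper name `call_with_backoff` and all of its parameters (`initial`, `maximum`, `multiplier`, `max_attempts`, `sleep`) are hypothetical, and `op` stands in for whatever performs the list call.

```python
import random
import time


def call_with_backoff(op, *, initial=1.0, maximum=60.0, multiplier=2.0,
                      max_attempts=None, sleep=time.sleep):
    """Call op() until it succeeds, sleeping with capped exponential back-off.

    max_attempts=None retries forever, matching the "retry continuously"
    suggestion; a finite value exists only to make the sketch testable.
    All names and defaults here are illustrative assumptions.
    """
    delay = initial
    attempt = 0
    while True:
        attempt += 1
        try:
            return op()
        except Exception:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            # Jitter: sleep a random amount up to the current delay so that
            # many workers do not retry in lockstep.
            sleep(random.uniform(0, delay))
            # Grow the delay geometrically, but never beyond the cap.
            delay = min(delay * multiplier, maximum)
```

A real implementation would likely retry only errors known to be transient rather than every exception; the sketch retries everything to stay short.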
Another item, possibly to be taken up in a separate issue, is that the input should not time out get_object calls where we are downloading and processing the stream of bytes. This operation can be slowed down by a number of factors (e.g. back-pressure), and having an arbitrary maximum operation timeout is not helpful.
@andrewkroh, if you look at this doc, state tracking via start offset will allow us to list pages of objects in lexicographic order. This will be more efficient and precise for state history if users also keep their bucket objects lexicographically ordered. Users will then have two ways to track state, which simply gives them more options. Switching between the two won't be allowed: once a user picks one state-tracking option, the other will be disabled. We will add a config option to enable this, with the necessary warnings about compatibility, and it will then be up to users to choose.
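As a rough illustration of the start-offset idea, the sketch below simulates a bucket as a sorted list of object names and resumes listing from a stored cursor. It is written in Python for brevity and does not call any GCS SDK; `list_page`, `poll`, and the sample object names are hypothetical. It assumes the listing offset is inclusive (names equal to or after the offset are returned), so the cursor name itself has to be skipped on resume.

```python
import bisect


def list_page(sorted_names, start_offset, page_size):
    """Mimic a listing with an inclusive start offset over sorted names."""
    i = bisect.bisect_left(sorted_names, start_offset)
    return sorted_names[i:i + page_size]


def poll(sorted_names, cursor, page_size=2):
    """Return (new_objects, new_cursor), resuming after the cursor name."""
    found = []
    while True:
        # Ask for one extra entry because the cursor name itself comes back.
        page = list_page(sorted_names, cursor, page_size + 1)
        # The offset is inclusive, so drop the already-processed cursor name.
        fresh = [name for name in page if name > cursor]
        if not fresh:
            return found, cursor
        found.extend(fresh)
        cursor = fresh[-1]


bucket = ["logs/2024-01-01", "logs/2024-01-02", "logs/2024-01-03"]
first, cursor = poll(bucket, "")

# Two objects arrive later: one sorts after the cursor, one before it.
bucket = sorted(bucket + ["logs/2024-01-04", "logs/2023-12-31"])
second, cursor = poll(bucket, cursor)
```

In this sketch the second poll picks up only `logs/2024-01-04`; the late-arriving `logs/2023-12-31` sorts before the cursor and is never listed. That is the compatibility caveat: the approach only works when new object names always sort after existing ones.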
I agree with this, so I will remove the concept of bucket_timeout and let the program context pass through the job object, which will directly solve this. The initial philosophy was to give users more control, but in this scenario it has proven not to be ideal. I already have a PR up: #41970
@andrewkroh, this was a direct request from users and product (elastic/integrations#11580): they want configurable retry policies and options. The current retry only covered failed jobs that we capture on our end and store in state. With a new PR, we are now giving users full control of the SDK retry options. I feel this will give users more control over how the API-level retries behave, rather than just relying on our default values.
Describe the enhancement: This issue tracks various improvements and new features for the GCS input.
Describe a specific use case for the enhancement or feature: With our GCS input slowly being adopted across various integrations, we need to make it more robust and optimal in terms of performance, failure tracking, and scalability. This meta issue was created to bring in the necessary changes over time.
Improvements:
New Features