[Meta][GCS] - Improvements and addition of new features to the GCS input #41107

Open · 7 of 14 tasks
ShourieG opened this issue Oct 3, 2024 · 9 comments

ShourieG (Contributor) commented Oct 3, 2024

Describe the enhancement: This issue is to track various improvements and addition of new features to the GCS input.

Describe a specific use case for the enhancement or feature: With our GCS input slowly being adopted across various integrations, we need to make it more robust and optimal in terms of performance, failure tracking, and scalability. This meta issue exists to bring in the necessary changes over time.

Improvements:

  • Add metrics to the GCS input. Separate issue here.
  • Segregate batch_size from the worker count (currently the worker count doubles as the batch_size used to distribute jobs evenly; see the sketch after this list).
  • Segregate the cursor save operation from event publishing, and add support for detecting the Elasticsearch acknowledgement signal and using it to update the cursor state.
  • Improve the documentation and explain the impact the polling operation has on scalability.
  • Remove bucket_timeout and pass the parent program context to bucket operations.
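
As a rough illustration of the batch_size/worker-count split, here is a minimal Go sketch; all the names (listPage, process, the parameters) are hypothetical and not the input's actual API:

```go
package sketch

import (
	"context"
	"sync"
)

type object struct{ name string }

// run decouples the listing page size (batchSize) from the number of
// concurrent workers, so tuning one no longer forces the other.
func run(ctx context.Context, batchSize, numWorkers int,
	listPage func(context.Context, int) []object,
	process func(context.Context, object)) {

	jobs := make(chan object)
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ { // worker count is sized independently
		wg.Add(1)
		go func() {
			defer wg.Done()
			for o := range jobs {
				process(ctx, o)
			}
		}()
	}
	go func() {
		defer close(jobs)
		for {
			page := listPage(ctx, batchSize) // batch_size only sizes the listing page
			if len(page) == 0 {
				return
			}
			for _, o := range page {
				select {
				case jobs <- o:
				case <-ctx.Done():
					return
				}
			}
		}
	}()
	wg.Wait()
}
```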

New Features

  • Add support for an SDK-level retry mechanism and make it user configurable.
  • Add support for filtering by prefix and glob expressions (see the sketch after this list).
  • Add support for GCS Pub/Sub, enabling horizontal scalability of the input.
  • Add support for more content types in the GCS input via content decoders:
    • JSON/NDJSON
    • CSV
    • PARQUET
    • TEXT
  • Add support for state tracking via an optional startOffset (user configurable, with certain ordering limitations).
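
For the prefix/glob item, the GCS Go SDK already exposes server-side filters on storage.Query; this is a minimal sketch (the bucket name and prefix are placeholders):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Prefix filters server-side; recent SDK versions also accept a
	// glob pattern via Query.MatchGlob (e.g. "logs/**/*.ndjson").
	it := client.Bucket("my-bucket").Objects(ctx, &storage.Query{
		Prefix: "logs/2024/",
	})
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(attrs.Name)
	}
}
```
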
elasticmachine (Collaborator) commented

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

ShourieG added the Epic label on Oct 3, 2024
ShourieG (Contributor, Author) commented Oct 4, 2024

cc: @narph, @andrewkroh

ShourieG changed the title from "[filebeat][Meta] - Improvements and addition of new features to the GCS input" to "[Meta][GCS] - Improvements and addition of new features to the GCS input" on Oct 4, 2024
ShourieG (Contributor, Author) commented Oct 4, 2024

@andrewkroh, @efd6 please feel free to expand this issue by suggesting improvements/additions that you would like to see in the input moving forward.

andrewkroh (Member) commented

> Add support for state tracking via an optional startOffset (user configurable, with certain ordering limitations).

Can you describe this feature in a bit more detail, please? I don't understand what the feature does or what the use case for it is.

andrewkroh (Member) commented

> Add support for an SDK-level retry mechanism and make it user configurable.

By "SDK level" does this mean API calls into GCS? Those should retry, but I don't see a reason to make this user-configurable. What would a user do with these options? Basically I think it should continuously retry failed list_objects calls with some back-off. And any failed get_object calls will naturally be retried on the next loop over the bucket's contents (assuming the input state does not positively indicate that the object was ingested).

andrewkroh (Member) commented

Another item, possibly to be taken up in a separate issue, is that the input should not time out get_object calls where we are downloading and processing the stream of bytes. This operation can be slowed down by a number of factors (e.g. back-pressure), and having an arbitrary maximum operation timeout is not helpful.

ShourieG (Contributor, Author) commented Dec 10, 2024

> Add support for state tracking via an optional startOffset (user configurable, with certain ordering limitations).

@andrewkroh, if you look at this doc, state tracking via a start offset will allow us to list pages of objects in lexicographic order. This will be more efficient and precise for state history if users also keep their bucket objects lexicographically ordered. Users will then have two ways to track state, which simply gives them more options. State switching won't be allowed, so if a user picks one state-tracking option, the other will be disabled. We will have a config option to enable this, with the necessary warnings about compatibility; it will then be up to the users to choose. A sketch follows below.
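
A minimal sketch of what this looks like with the GCS Go SDK's Query.StartOffset, assuming the saved cursor is the last processed object name (the function and parameter names are illustrative):

```go
package sketch

import (
	"context"

	"cloud.google.com/go/storage"
)

// listFrom resumes a listing at the saved offset: GCS returns object
// names in lexicographic order, and only names >= startOffset are listed.
func listFrom(ctx context.Context, client *storage.Client, bucket, startOffset string) *storage.ObjectIterator {
	return client.Bucket(bucket).Objects(ctx, &storage.Query{
		StartOffset: startOffset,
	})
}
```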

ShourieG (Contributor, Author) commented Dec 10, 2024

> Another item, possibly to be taken up in a separate issue, is that the input should not time out get_object calls where we are downloading and processing the stream of bytes. This operation can be slowed down by a number of factors (e.g. back-pressure), and having an arbitrary maximum operation timeout is not helpful.

I agree with this, hence I will be removing the concept of bucket_timeout and letting the program context pass through the job object, which directly solves this. The initial philosophy was to give users more control, but in this scenario that has proven not to be ideal.

Already have a PR up: #41970
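
Conceptually the change looks like this (a sketch, not the PR's actual code): the object reader inherits the input's parent context instead of a context.WithTimeout wrapper, so only input shutdown cancels a slow download:

```go
package sketch

import (
	"context"

	"cloud.google.com/go/storage"
)

// openObject opens the object with the input's parent context. With no
// WithTimeout wrapper, a download slowed by back-pressure is no longer
// aborted by an arbitrary bucket_timeout; only cancelling the parent
// context (i.e. input shutdown) interrupts it.
func openObject(parent context.Context, obj *storage.ObjectHandle) (*storage.Reader, error) {
	return obj.NewReader(parent)
}
```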

ShourieG (Contributor, Author) commented Dec 10, 2024

> Add support for an SDK-level retry mechanism and make it user configurable.
>
> By "SDK level" does this mean API calls into GCS? Those should retry, but I don't see a reason to make this user-configurable. What would a user do with these options? Basically I think it should continuously retry failed list_objects calls with some back-off. And any failed get_object calls will naturally be retried on the next loop over the bucket's contents (assuming the input state does not positively indicate that the object was ingested).

@andrewkroh, this was a direct request from users and product (elastic/integrations#11580): they want configurable retry policies and options. The current retry only covered failed jobs that we capture on our end and store in state. With a new PR, we are now giving users full control of the SDK retry options. I feel this will let users control how the API-level retries behave, rather than relying solely on our default values. A sketch follows below.
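
For reference, the GCS Go SDK exposes these knobs via BucketHandle.Retryer; here is a sketch of wiring user config into them (the config field names in the comments are illustrative, not the input's actual options):

```go
package sketch

import (
	"time"

	"cloud.google.com/go/storage"
	"github.com/googleapis/gax-go/v2"
)

// withUserRetries applies user-configured retry settings to a bucket
// handle. The parameters would come from hypothetical input config such
// as retry.initial_backoff_duration, retry.max_backoff_duration, and
// retry.backoff_multiplier.
func withUserRetries(bkt *storage.BucketHandle, initial, max time.Duration, multiplier float64) *storage.BucketHandle {
	return bkt.Retryer(
		storage.WithBackoff(gax.Backoff{
			Initial:    initial,
			Max:        max,
			Multiplier: multiplier,
		}),
		storage.WithPolicy(storage.RetryAlways), // also retry non-idempotent operations
	)
}
```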
