large file sizes causing OOMKills and timeouts #155
Hi, thanks for the issue. Generally we don't recommend ingesting such large files, but this should be possible; the lake writer logic just needs to be modified to be a bit more intelligent and file-size aware. Optimal Parquet sizes are 100-500MB, so it shouldn't need to bring more than that into memory at a time. I'd also like to see why it's ending up with that much data in the lake writer and not flushing earlier. Let me do some testing and update.
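The size-aware flushing described above could be sketched roughly like this (a hypothetical illustration, not matano's actual lake writer code; the names and the 256MB target are assumptions based on the 100-500MB Parquet guidance):

```python
# Sketch: buffer incoming chunks and flush once the buffered bytes reach a
# target Parquet file size, instead of accumulating unbounded data in memory.
TARGET_FILE_SIZE = 256 * 1024 * 1024  # mid-range of the 100-500MB guidance

class SizeAwareBuffer:
    def __init__(self, target_bytes=TARGET_FILE_SIZE):
        self.target_bytes = target_bytes
        self.chunks = []
        self.buffered = 0
        self.flushed = []  # stands in for files written out to the lake

    def add(self, chunk: bytes):
        self.chunks.append(chunk)
        self.buffered += len(chunk)
        if self.buffered >= self.target_bytes:
            self.flush()

    def flush(self):
        if self.chunks:
            self.flushed.append(b"".join(self.chunks))
            self.chunks, self.buffered = [], 0
```

With this shape, peak memory is bounded by roughly one target file size rather than by the total input size.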
awesome, thanks! this is the managed
Splitting would work, but we would probably want to support it out of the box in this case. I will take a closer look at the code; you can watch this issue.
Were you able to test this out? I tested 2GB uncompressed ALB logs in the linked PR.
hi all! i'm investigating using matano for some log ingestion, and some of the ALB log files i'm looking at are extremely large - 100MB compressed, multiple GB decompressed. we're running into resource exhaustion issues for memory usage, even after manually adjusting limits in the console to the maximum of 10240MB of memory. this happens in multiple lambdas, most notably the transform and writer.
the specific issues we're seeing in the writer are basically that it logs
INFO lake_writer: Starting 25 downloads from S3
and then 20s later it's killed by lambda for exceeding 10240 MB of memory used. can this 25 number be tuned or tweaked to take file size into account? the transformer and databatcher issues we were able to resolve by increasing the timeout and memory, which should be covered by #85 when it's included. i may be able to contribute this depending on how our discovery goes, but not sure how long it would be until that could happen.
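One way to make the download batching size-aware rather than a fixed count of 25 would be to group S3 objects by cumulative size (a sketch under assumed names; this is not the actual lake writer implementation):

```python
def batch_by_size(objects, max_batch_bytes):
    """Group (key, size) pairs so each batch's total stays under
    max_batch_bytes. A single object larger than the cap still gets
    its own batch rather than being dropped."""
    batch, batch_bytes = [], 0
    for key, size in objects:
        if batch and batch_bytes + size > max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(key)
        batch_bytes += size
    if batch:
        yield batch
```

S3 object sizes are available up front (e.g. from the event notification or a HEAD request), so the writer could cap how much it pulls into memory per pass instead of downloading a fixed number of files.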
from the investigation i've done into this problem for a custom processing solution, the "best" resolutions appear to be either loading the data and processing it as a stream rather than loading it all into memory at once, or having some sort of pre-processor that splits large files into smaller chunks before they get to the loader.
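The streaming option mentioned above could look roughly like this (illustrative only; in real use the file object would be the S3 response stream rather than an in-memory buffer):

```python
import gzip
import io

def stream_lines(compressed: bytes):
    """Decompress and yield one log line at a time instead of
    materializing the whole multi-GB payload in memory."""
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as gz:
        for line in io.TextIOWrapper(gz, encoding="utf-8"):
            yield line.rstrip("\n")
```

Because ALB logs are line-delimited, each line can be parsed and forwarded as it is decoded, so peak memory stays proportional to one line plus the decompressor's window rather than the decompressed file size.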
do y'all have any thoughts on the best path forward here, or if matano would ever consider handling situations like this where the inputs/batches cannot be processed due to size?