Loading a large file from AWS S3 gives "Required array length is too large" error #478

Open
msmygit opened this issue Jun 22, 2023 · 2 comments

msmygit (Collaborator) commented Jun 22, 2023

Command Executed:

export DSBULK_JAVA_OPTS="-Xmx10G"
./dsbulk load -k <keyspace> -t transactions -b secure-connect-<db_name>.zip -u <username> -p <password> -url "s3://path/to/transactions.csv?region=us-east-1"

Console Output:

Operation LOAD_20230622-155525-014753 failed unexpectedly: Required array length 2147483639 + 96 is too large.

Full Stacktrace:

2023-06-22 15:55:59 ERROR Operation LOAD_20230622-155525-014753 failed unexpectedly: Required array length 2147483639 + 96 is too large.
java.lang.OutOfMemoryError: Required array length 2147483639 + 96 is too large
        at java.base/jdk.internal.util.ArraysSupport.hugeLength(ArraysSupport.java:649)
        at java.base/jdk.internal.util.ArraysSupport.newLength(ArraysSupport.java:642)
        at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:100)
        at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:132)
        at software.amazon.awssdk.utils.IoUtils.toByteArray(IoUtils.java:48)
        at software.amazon.awssdk.core.sync.ResponseTransformer.lambda$toBytes$3(ResponseTransformer.java:175)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler$HttpResponseHandlerAdapter.transformResponse(BaseSyncClientHandler.java:218)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler$HttpResponseHandlerAdapter.handle(BaseSyncClientHandler.java:206)
        at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleSuccessResponse(CombinedResponseHandler.java:99)
        at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleResponse(CombinedResponseHandler.java:75)

FWIW, smaller CSV files (~200 MB) can be loaded with no problem, but CSVs larger than 1 GB hit Java heap space errors, hence the use of export DSBULK_JAVA_OPTS="-Xmx10G". Is there any other throttling available here? I tried a 187 GB CSV and a 2.3 GB CSV, and both ended with the same error.
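
For context on where the hard ceiling comes from: the stack trace shows the AWS SDK collecting the whole S3 object into a single byte array (ResponseTransformer.toBytes -> IoUtils.toByteArray -> ByteArrayOutputStream), and a Java byte array cannot exceed roughly Integer.MAX_VALUE bytes (~2 GiB) no matter how large -Xmx is. Below is a minimal sketch of the alternative, consuming the object as a stream instead of materializing it in memory; the bucket, key, and region are placeholders, not the real values from this issue.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

public class S3StreamingRead {
    public static void main(String[] args) throws Exception {
        try (S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build()) {
            GetObjectRequest req = GetObjectRequest.builder()
                    .bucket("my-bucket")                  // placeholder bucket name
                    .key("path/to/transactions.csv")      // placeholder key
                    .build();
            // getObject(req) returns a ResponseInputStream, so bytes are consumed
            // incrementally instead of being collected into one giant byte[].
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(s3.getObject(req), StandardCharsets.UTF_8))) {
                reader.lines().limit(5).forEach(System.out::println); // peek at a few lines
            }
        }
    }
}

This is only an illustration of the streaming idea, not the code path DSBulk's S3 support actually uses; the point is that a streamed read has no ~2 GiB ceiling, whereas a fully buffered read does.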

msmygit added the bug label Jun 22, 2023

msmygit (Collaborator, Author) commented Jun 22, 2023

@DavidTaylorCengage do you have any inputs here?

DavidTaylorCengage (Contributor) commented

I'm afraid I don't have any specific insights. We were dealing with a URL file that had up to about 10 million records in it; your 2+ billion records (living in just the one transactions.csv file, it seems?) are probably going to run into the limits of what standard Java data structures can support (Integer.MAX_VALUE being 2147483647).
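
A tiny standalone demo of that ceiling (not from this issue): Java arrays are indexed by int, so a single array cannot hold more than about Integer.MAX_VALUE elements, which is exactly the wall the ByteArrayOutputStream in the stack trace hit.

public class ArrayLimitDemo {
    public static void main(String[] args) {
        try {
            // Most JVMs refuse an allocation this close to Integer.MAX_VALUE outright,
            // independent of the configured heap size.
            byte[] buffer = new byte[Integer.MAX_VALUE];
            System.out.println(buffer.length);
        } catch (OutOfMemoryError e) {
            // Typically "Requested array size exceeds VM limit" or "Java heap space"
            System.out.println(e.getMessage());
        }
    }
}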

My recommendation would be to break your file up into smaller chunks. That's probably the easiest solution. If you're feeling bold, you could try to revisit my solution to make it more efficient or better utilize streaming.
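
For what it's worth, here is a rough sketch of that chunking approach: split the one big CSV into header-preserving pieces that each stay well under the ~2 GiB limit. The file names and chunk size are assumptions; adjust them to your data.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CsvSplitter {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("transactions.csv");   // assumed local copy of the big CSV
        long linesPerChunk = 5_000_000L;               // assumption: tune so each chunk stays small

        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String header = reader.readLine();
            if (header == null) {
                return; // empty file, nothing to split
            }
            int chunk = 0;
            long written = 0;
            BufferedWriter writer = newChunkWriter(chunk, header);
            String line;
            while ((line = reader.readLine()) != null) {
                if (written == linesPerChunk) {
                    writer.close();
                    writer = newChunkWriter(++chunk, header);
                    written = 0;
                }
                writer.write(line);
                writer.newLine();
                written++;
            }
            writer.close();
        }
    }

    // Each chunk gets its own copy of the header row so it is loadable on its own.
    private static BufferedWriter newChunkWriter(int chunk, String header) throws IOException {
        BufferedWriter w = Files.newBufferedWriter(
                Paths.get(String.format("transactions-%04d.csv", chunk)), StandardCharsets.UTF_8);
        w.write(header);
        w.newLine();
        return w;
    }
}

Each transactions-NNNN.csv can then be loaded with a separate dsbulk run pointing -url at that file.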

Also, in case you haven't already found it, #399 was where I implemented the S3 functionality. There may be more answers that can be gleaned from that PR. I don't think I did anything in particular with processing the records themselves; I just dealt with reading the list from S3 and passing it to the normal DSBulk operation. (It may be worth noting that I have no association with DataStax or DSBulk. I'm just a random dev who needed a new feature and decided to implement it himself.)

I hope that is at least a little helpful!
