Loading a large file from AWS S3 gives "Required array length is too large" error #478

Open
msmygit opened this issue Jun 22, 2023 · 2 comments

msmygit (Collaborator) commented Jun 22, 2023

Command Executed:

export DSBULK_JAVA_OPTS="-Xmx10G"
./dsbulk load -k <keyspace> -t transactions -b secure-connect-<db_name>.zip -u <username> -p <password> -url "s3://path/to/transactions.csv?region=us-east-1"

Console Output:

Operation LOAD_20230622-155525-014753 failed unexpectedly: Required array length 2147483639 + 96 is too large.

Full Stacktrace:

2023-06-22 15:55:59 ERROR Operation LOAD_20230622-155525-014753 failed unexpectedly: Required array length 2147483639 + 96 is too large.
java.lang.OutOfMemoryError: Required array length 2147483639 + 96 is too large
        at java.base/jdk.internal.util.ArraysSupport.hugeLength(ArraysSupport.java:649)
        at java.base/jdk.internal.util.ArraysSupport.newLength(ArraysSupport.java:642)
        at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:100)
        at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:132)
        at software.amazon.awssdk.utils.IoUtils.toByteArray(IoUtils.java:48)
        at software.amazon.awssdk.core.sync.ResponseTransformer.lambda$toBytes$3(ResponseTransformer.java:175)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler$HttpResponseHandlerAdapter.transformResponse(BaseSyncClientHandler.java:218)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler$HttpResponseHandlerAdapter.handle(BaseSyncClientHandler.java:206)
        at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleSuccessResponse(CombinedResponseHandler.java:99)
        at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleResponse(CombinedResponseHandler.java:75)

FWIW, smaller CSV files (~200 MB) can be loaded with no problem, but CSVs larger than 1 GB hit Java heap space errors, hence the use of export DSBULK_JAVA_OPTS="-Xmx10G". Is there any other throttling available here? I tried a 187 GB CSV and a 2.3 GB CSV, and both ended with the same error.
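
For context on where the hard ceiling comes from: the stack trace shows the AWS SDK collecting the whole S3 object into a single byte array (ResponseTransformer.toBytes -> IoUtils.toByteArray -> ByteArrayOutputStream), and a Java byte array cannot exceed roughly Integer.MAX_VALUE bytes (~2 GiB) no matter how large -Xmx is. Below is a minimal sketch of the alternative, consuming the object as a stream instead of materializing it in memory; the bucket, key, and region are placeholders, not the real values from this issue.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

public class S3StreamingRead {
    public static void main(String[] args) throws Exception {
        try (S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build()) {
            GetObjectRequest req = GetObjectRequest.builder()
                    .bucket("my-bucket")                  // placeholder bucket name
                    .key("path/to/transactions.csv")      // placeholder key
                    .build();
            // getObject(req) returns a ResponseInputStream, so bytes are consumed
            // incrementally instead of being collected into one giant byte[].
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(s3.getObject(req), StandardCharsets.UTF_8))) {
                reader.lines().limit(5).forEach(System.out::println); // peek at a few lines
            }
        }
    }
}

This is only an illustration of the streaming idea, not the code path DSBulk's S3 support actually uses; the point is that a streamed read has no ~2 GiB ceiling, whereas a fully buffered read does.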

msmygit added the bug label Jun 22, 2023

msmygit (Collaborator, Author) commented Jun 22, 2023

@DavidTaylorCengage do you have any inputs here?

DavidTaylorCengage (Contributor) commented

I'm afraid I don't have any specific insights. We were dealing with a URL file that had up to about 10 million records in it; your 2+ billion records (living in just the one transactions.csv file, it seems?) are probably going to run into the limits of what standard Java data structures can support (Integer.MAX_VALUE being 2147483647).
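
A tiny standalone demo of that ceiling (not from this issue): Java arrays are indexed by int, so a single array cannot hold more than about Integer.MAX_VALUE elements, which is exactly the wall the ByteArrayOutputStream in the stack trace hit.

public class ArrayLimitDemo {
    public static void main(String[] args) {
        try {
            // Most JVMs refuse an allocation this close to Integer.MAX_VALUE outright,
            // independent of the configured heap size.
            byte[] buffer = new byte[Integer.MAX_VALUE];
            System.out.println(buffer.length);
        } catch (OutOfMemoryError e) {
            // Typically "Requested array size exceeds VM limit" or "Java heap space"
            System.out.println(e.getMessage());
        }
    }
}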

My recommendation would be to break your file up into smaller chunks. That's probably the easiest solution. If you're feeling bold, you could try to revisit my solution to make it more efficient or better utilize streaming.
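
For what it's worth, here is a rough sketch of that chunking approach: split the one big CSV into header-preserving pieces that each stay well under the ~2 GiB limit. The file names and chunk size are assumptions; adjust them to your data.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CsvSplitter {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("transactions.csv");   // assumed local copy of the big CSV
        long linesPerChunk = 5_000_000L;               // assumption: tune so each chunk stays small

        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String header = reader.readLine();
            if (header == null) {
                return; // empty file, nothing to split
            }
            int chunk = 0;
            long written = 0;
            BufferedWriter writer = newChunkWriter(chunk, header);
            String line;
            while ((line = reader.readLine()) != null) {
                if (written == linesPerChunk) {
                    writer.close();
                    writer = newChunkWriter(++chunk, header);
                    written = 0;
                }
                writer.write(line);
                writer.newLine();
                written++;
            }
            writer.close();
        }
    }

    // Each chunk gets its own copy of the header row so it is loadable on its own.
    private static BufferedWriter newChunkWriter(int chunk, String header) throws IOException {
        BufferedWriter w = Files.newBufferedWriter(
                Paths.get(String.format("transactions-%04d.csv", chunk)), StandardCharsets.UTF_8);
        w.write(header);
        w.newLine();
        return w;
    }
}

Each transactions-NNNN.csv can then be loaded with a separate dsbulk run pointing -url at that file.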

Also, in case you haven't already found it, #399 was where I implemented the S3 functionality. There may be more answers that can be gleaned from that PR. I don't think I did anything in particular with processing the records themselves; I just dealt with reading the list from S3 and passing it to the normal DSBulk operation. (It may be worth noting that I have no association with DataStax or DSBulk. I'm just a random dev who needed a new feature and decided to implement it himself.)

I hope that is at least a little helpful!
