No matter what write throughput we configure (dynamodb.throughput.write) or what percentage we set (dynamodb.throughput.write.percent), our Hive job blows right past the configured limits. Ultimately the throughput ends up being determined entirely by the number of mappers (at least beyond a certain number of mappers).
Example:
Caused by: java.lang.RuntimeException: com.amazonaws.services.dynamodbv2.model.RequestLimitExceededException: Throughput exceeds the current throughput limit for your account. Please contact AWS Support at https://aws.amazon.com/support to request a limit increase (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: RequestLimitExceeded; Request ID: <>; Proxy: null)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:106)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:81)
at org.apache.hadoop.dynamodb.DynamoDBClient.writeBatch(DynamoDBClient.java:263)
at org.apache.hadoop.dynamodb.DynamoDBClient.putBatch(DynamoDBClient.java:220)
The connector apparently only retries ProvisionedThroughputExceededException, ThrottlingException, 500s, and 503s, so I don't think this exception gets the normal connector retries. It only gets the usual 4 Hive task retries and then fails.
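To make that classification concrete, here is a minimal sketch of the retry decision described above. This is not the connector's actual code (that lives in DynamoDBFibonacciRetryer.handleException); the class and method names are invented, but the error codes and status codes are the ones listed above.

```java
import com.amazonaws.AmazonServiceException;

// Sketch: which service exceptions the connector treats as retryable.
public class RetryClassifierSketch {

  static boolean isRetryable(Exception e) {
    if (e instanceof AmazonServiceException) {
      AmazonServiceException ase = (AmazonServiceException) e;
      String code = ase.getErrorCode();
      if ("ProvisionedThroughputExceededException".equals(code)
          || "ThrottlingException".equals(code)) {
        return true; // throttling: back off and retry
      }
      int status = ase.getStatusCode();
      if (status == 500 || status == 503) {
        return true; // transient server errors: retry
      }
    }
    // RequestLimitExceededException takes this path and is re-thrown,
    // so the write fails instead of backing off.
    return false;
  }
}
```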
I'm arguing that ideally we shouldn't have to care about exceeding max throughput at all. The connector should handle writes up to the maximum we specify and throttle itself as needed to stay within table and account throughput bounds.
The other thing is that throughput ends up being determined by the total number of input splits/mappers. Is it reasonable to change the library so that map tasks wait instead of writing once the configured throughput/percentage is hit? Right now it looks like there is a minimum of 1 item "per second", though AFAICS it's effectively 1 item per put API call, not 1 per second per task (see emr-dynamodb-connector/emr-dynamodb-hadoop/src/main/java/org/apache/hadoop/dynamodb/IopsController.java, line 58 at commit 40f605a). A sketch of the wait-instead-of-write idea follows.
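Here is a hypothetical sketch of what per-task waiting could look like: each map task takes a share of the configured table-level write rate and blocks before each batch until it has permits. The class name, the even split across tasks, and the use of Guava's RateLimiter are all my assumptions, not the connector's design.

```java
import com.google.common.util.concurrent.RateLimiter;

// Hypothetical per-task limiter: divide the table-level target write rate
// across map tasks and block before each batch write.
public class PerTaskWriteLimiter {
  private final RateLimiter limiter;

  // targetWriteUnits would come from dynamodb.throughput.write *
  // dynamodb.throughput.write.percent; totalMapTasks from the job config.
  public PerTaskWriteLimiter(double targetWriteUnits, int totalMapTasks) {
    // Allow fractional per-task rates: a small share means "wait longer
    // between items", not "write at least 1 item per request".
    this.limiter = RateLimiter.create(Math.max(targetWriteUnits / totalMapTasks, 0.01));
  }

  // Block until this task is allowed to consume the given write units.
  public void acquireWriteUnits(int writeUnits) {
    limiter.acquire(writeUnits);
  }
}
```

With something like this in place, adding mappers would stretch the waits rather than multiply the aggregate write rate.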
As we can see from here, if auto scaling is enabled, the configured write throughput is ignored. You can disable auto scaling via the config dynamodb.throughput.write.autoscaling=false. However, I don't think that config really helps the use case you mentioned.
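For reference, here is a sketch of setting those properties on a Hadoop job; the values are placeholders, and whether they are set here or via Hive session settings depends on your setup.

```java
import org.apache.hadoop.mapred.JobConf;

public class WriteThroughputConfigExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Disable write autoscaling so the configured rate is not ignored.
    conf.set("dynamodb.throughput.write.autoscaling", "false");
    // Target write capacity units (placeholder value).
    conf.set("dynamodb.throughput.write", "1000");
    // Fraction of the target to actually consume (placeholder value).
    conf.set("dynamodb.throughput.write.percent", "0.5");
  }
}
```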
Also, as we can see from here, RequestLimitExceededException is not included in throttleErrorCodes. At a minimum we can fix this issue by adding RequestLimitExceededException as a new entry so the connector retries when the application gets throttled, as sketched below.
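A standalone illustration of that one-line fix, assuming throttleErrorCodes is a set of error-code strings (the surrounding class here is invented for the example):

```java
import java.util.HashSet;
import java.util.Set;

public class ThrottleErrorCodesSketch {
  // Error codes already treated as throttling, per the discussion above,
  // plus the proposed new entry.
  static final Set<String> THROTTLE_ERROR_CODES = new HashSet<>();
  static {
    THROTTLE_ERROR_CODES.add("ProvisionedThroughputExceededException");
    THROTTLE_ERROR_CODES.add("ThrottlingException");
    THROTTLE_ERROR_CODES.add("RequestLimitExceededException"); // proposed addition
  }
}
```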
It does not look like that particular exception is retried; it's re-thrown from the else block in the handleException() method. I also don't see any max retry limit for the exceptions it does handle: https://github.com/awslabs/emr-dynamodb-connector/blob/e1c937ff5d55711c91b575d33fa0281bc5fb4323/emr-dynamodb-hadoop/src/main/java/org/apache/hadoop/dynamodb/DynamoDBFibonacciRetryer.java#L106
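To show what "retryable errors loop without a cap" means, here is a minimal sketch of an uncapped Fibonacci-backoff retry loop. It illustrates the pattern the retryer's name suggests, not the actual runWithRetry implementation.

```java
public class FibonacciRetrySketch {

  interface Call<T> {
    T run() throws Exception;
  }

  // Retry retryable errors forever with Fibonacci delays (1s, 1s, 2s, 3s, ...);
  // anything else escapes immediately. Note: no maximum retry count.
  static <T> T runWithRetry(Call<T> call, java.util.function.Predicate<Exception> retryable)
      throws Exception {
    long prev = 0, delaySeconds = 1;
    while (true) {
      try {
        return call.run();
      } catch (Exception e) {
        if (!retryable.test(e)) {
          throw e; // e.g. RequestLimitExceededException today
        }
        Thread.sleep(delaySeconds * 1000L);
        long next = prev + delaySeconds;
        prev = delaySeconds;
        delaySeconds = next;
      }
    }
  }
}
```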