Tow/implement new cassandra exception #7353
base: develop
…b.com/palantir/atlasdb into tow/implement-new-cassandra-exception
I have changed the integration test asserts for OneNodeDownDeleteTest.deleteAllTimestampsThrows() and deletingThrows(), as the exception we throw has changed. However, I searched our clients and none of them catch this exception type, so this is a safe change.
@mdaudali probably has more context on this - but is there a reason we wanted to wrap the rest of the exceptions (other than the TimedOut) in RuntimeExceptions / switch to the unchecked paradigm?
new: "parameter void com.palantir.atlasdb.keyvalue.api.InsufficientConsistencyException::<init>(java.lang.Throwable,\
  \ ===com.palantir.logsafe.Arg<?>[]===)"
justification: "Not a break as I have handle the implementation in the same\
  \ PR"
note: This would be true for truly internal APIs. However, this exception is actually used in the internal backup and restore product's tests. It's probably not a major issue, but we will want to make sure that we fix their tests and bump their Atlas version accordingly.
I did a Sourcegraph check and couldn't find anything. How would I see the usages in the "internal backup and restore product's tests" if not through Sourcegraph?
...sdb-cassandra/src/main/java/com/palantir/atlasdb/keyvalue/cassandra/WrappingQueryRunner.java
...sandra/src/main/java/com/palantir/atlasdb/keyvalue/cassandra/CassandraTimedOutException.java
From what you have said, I think you are asking why we throw an AtlasDbDependencyException (which extends RuntimeException) for all TExceptions that are not mapped to either InsufficientConsistencyException or TimedOutException? The decision was that, as Cassandra Thrift exceptions are in reality a dependency issue, it made sense to throw either the explicit error (e.g. InsufficientConsistencyException or TimedOutException) or the more general AtlasDbDependencyException, and to have them logically connected. I think we used an unchecked exception because we then wouldn't have to change multiple methods to declare methodName() throws X, and as AtlasDbDependencyException was already in the code there seemed to be a precedent for using a runtime exception. However, I suppose there is a design argument that Thrift exceptions are operational errors coming from the database due to client usage. I think that, because the next part of the work would be to adjust alta to catch and proactively react to these errors, ease of coding overrode the checked versus unchecked question?
To be upfront: I have not reviewed this PR yet - other than giving the guidance on wrapping the TimedOutException and the idea of having the CassandraTExceptions class to do it in.
Not specifically [haven't reviewed it yet] (but also, IIRC, and I will check properly later: don't we already wrap a bunch of the TExceptions in RuntimeExceptions via Throwables#somethingsomethingwrap?). For TExceptions in particular I thought it useful to handle them all centrally in the CassandraTExceptions class, or whatever it's called, so that other exceptions get wrapped without needing catch handlers for each individual type (but I haven't checked how we handle the unmapped types). Ah, on reviewing the PR, I realise that it is changing a bunch of checked exception places too.
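For readers following the thread, the central-mapping idea discussed above could look roughly like the sketch below. It is only an illustration under stated assumptions: every class here is a local stand-in (the real TException, UnavailableException and TimedOutException come from the Thrift bindings), the exception hierarchy is assumed, and the method name mapToUncheckedException is hypothetical rather than taken from the PR.

```java
// Illustrative sketch only: all types are local stand-ins, not the AtlasDB or Thrift classes.
final class TExceptionMappingSketch {
    private TExceptionMappingSketch() {}

    // Stand-ins for the checked Thrift exception types.
    static class TException extends Exception {
        TException(String message) {
            super(message);
        }
    }

    static class UnavailableException extends TException {
        UnavailableException(String message) {
            super(message);
        }
    }

    static class TimedOutException extends TException {
        TimedOutException(String message) {
            super(message);
        }
    }

    // Stand-ins for the unchecked AtlasDB exceptions; this hierarchy is an assumption.
    static class AtlasDbDependencyException extends RuntimeException {
        AtlasDbDependencyException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    static class InsufficientConsistencyException extends AtlasDbDependencyException {
        InsufficientConsistencyException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    static class CassandraTimedOutException extends AtlasDbDependencyException {
        CassandraTimedOutException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical central mapper: specific types first, generic fallback last.
    static AtlasDbDependencyException mapToUncheckedException(TException e) {
        if (e instanceof UnavailableException) {
            return new InsufficientConsistencyException("Not enough Cassandra nodes were available", e);
        }
        if (e instanceof TimedOutException) {
            return new CassandraTimedOutException("Cassandra query timed out", e);
        }
        // Unmapped Thrift exceptions fall through to the general dependency exception.
        return new AtlasDbDependencyException("Cassandra Thrift call failed", e);
    }
}
```

A single mapper like this keeps the "which Thrift type maps to which unchecked type" decision in one place, so callers do not need a catch handler per type. The original exception is always kept as the cause.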
I started reviewing, but this PR is doing a few too many things (and there's a fair bit to unpack here). Let's break it down to make it easier to review, please.
An example split:
- Implement CassandraTimedOutExceptions
- Implement CassandraTExceptions, and add tests.
- Modify the relevant unchecked exception locations.
(Be careful with the above: you don't always need to use CassandraTExceptions if you're providing a specific message for a given exception. E.g., if you're explicitly catching UnavailableException to provide the relevant exception message, you don't need to call CassandraTExceptions.)
- Consider whether we need to modify any of the checked exception locations (and ping me what your thoughts are for that)
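The caveat about explicit catches can be illustrated with a small sketch. This is not code from the PR: the types are local stand-ins, and the ThriftCall interface, the mapGeneric helper, the method name and the message text are all hypothetical. It shows a call site that catches UnavailableException itself so it can attach an operation-specific message, and defers only the remaining Thrift exceptions to a generic mapper.

```java
// Illustrative sketch only: stand-in types, hypothetical names and messages.
final class ExplicitCatchSketch {
    private ExplicitCatchSketch() {}

    static class TException extends Exception {
        TException(String message) {
            super(message);
        }
    }

    static class UnavailableException extends TException {
        UnavailableException(String message) {
            super(message);
        }
    }

    static class AtlasDbDependencyException extends RuntimeException {
        AtlasDbDependencyException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    static class InsufficientConsistencyException extends AtlasDbDependencyException {
        InsufficientConsistencyException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical abstraction over a Thrift call that may throw a checked exception.
    interface ThriftCall {
        void run() throws TException;
    }

    // Stand-in for the central mapper; used only for types without a dedicated handler.
    static AtlasDbDependencyException mapGeneric(TException e) {
        return new AtlasDbDependencyException("Cassandra Thrift call failed", e);
    }

    static void deleteWithSpecificMessage(ThriftCall call) {
        try {
            call.run();
        } catch (UnavailableException e) {
            // The call site knows what the operation needs, so it supplies the message directly.
            throw new InsufficientConsistencyException("Deleting requires all Cassandra nodes to be available", e);
        } catch (TException e) {
            throw mapGeneric(e);
        }
    }
}
```

Because the more specific catch clause comes first, the generic mapper is never consulted for UnavailableException here, which is the situation the reviewer describes.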
@@ -36,13 +36,13 @@ void testSetup(CassandraKeyValueService kvs) {

    @Test
    public void deletingThrows() {
-        assertThrowsInsufficientConsistencyExceptionAndDoesNotChangeCassandraSchema(
+        assertThrowsAtlasDbDependencyExceptionAndDoesNotChangeCassandraSchema(
This seems odd - why do we need to change this? We should still be throwing InsufficientConsistencyExceptions here.
@@ -137,8 +137,14 @@ private CassandraClient instrumentClient(Client rawClient) {
        return client;
    }

-    private Cassandra.Client getRawClientWithKeyspaceSet() throws TException {
-        Client ret = getRawClientWithTimedCreation();
+    private Cassandra.Client getRawClientWithKeyspaceSet() {
I don't think we should be changing the signature here - this should be an invisible refactor, and changing the fact that this no longer throws the checked exception is changing the signature.
Do double check whether you actually want to modify here to remap the exception, or at the caller (and let me know what you believe to be the case)
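One way to picture the "remap at the caller" option is the sketch below. It is a simplified illustration, not the PR's code: RawClient, setKeyspace and createClient are hypothetical stand-ins, and the real method takes no arguments. The helper keeps its `throws TException` signature, and the single caller translates the checked exception into an unchecked one.

```java
// Illustrative sketch only: stand-in types and hypothetical method names.
final class SignaturePreservingSketch {
    private SignaturePreservingSketch() {}

    static class TException extends Exception {
        TException(String message) {
            super(message);
        }
    }

    static class AtlasDbDependencyException extends RuntimeException {
        AtlasDbDependencyException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical stand-in for the raw Thrift client.
    interface RawClient {
        void setKeyspace(String keyspace) throws TException;
    }

    // The helper still declares the checked exception, so its signature is unchanged.
    static RawClient getRawClientWithKeyspaceSet(RawClient client, String keyspace) throws TException {
        client.setKeyspace(keyspace);
        return client;
    }

    // The caller is the one place where the checked exception becomes unchecked.
    static RawClient createClient(RawClient client, String keyspace) {
        try {
            return getRawClientWithKeyspaceSet(client, keyspace);
        } catch (TException e) {
            throw new AtlasDbDependencyException("Failed to set keyspace on Cassandra client", e);
        }
    }
}
```

Whether the remapping belongs in the helper or in the caller is exactly the open question in the comment above; the sketch only shows that the second option leaves the helper's contract untouched.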
General
Before this PR:
We were not catching TimedOutExceptions.
After this PR:
==COMMIT_MSG==
New Exception classes:
Amended how native Cassandra Thrift UnavailableExceptions are caught: they are now mapped to InsufficientConsistencyException via CassandraTExceptions.
==COMMIT_MSG==
Priority:
P2
Concerns / possible downsides (what feedback would you like?):
None
Is documentation needed?:
No
Compatibility
Does this PR create any API breaks (e.g. at the Java or HTTP layers) - if so, do we have compatibility?:
No
Does this PR change the persisted format of any data - if so, do we have forward and backward compatibility?:
No
The code in this PR may be part of a blue-green deploy. Can upgrades from previous versions safely coexist? (Consider restarts of blue or green nodes.):
Does this PR rely on statements being true about other products at a deployment - if so, do we have correct product dependencies on these products (or other ways of verifying that these statements are true)?:
Does this PR need a schema migration?
Testing and Correctness
What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?:
What was existing testing like? What have you done to improve it?:
If this PR contains complex concurrent or asynchronous code, is it correct? The onus is on the PR writer to demonstrate this.:
If this PR involves acquiring locks or other shared resources, how do we ensure that these are always released?:
Execution
How would I tell this PR works in production? (Metrics, logs, etc.):
Has the safety of all log arguments been decided correctly?:
Will this change significantly affect our spending on metrics or logs?:
How would I tell that this PR does not work in production? (monitors, etc.):
If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?:
If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC):
Scale
Would this PR be expected to pose a risk at scale? Think of the shopping product at our largest stack.:
Would this PR be expected to perform a large number of database calls, and/or expensive database calls (e.g., row range scans, concurrent CAS)?:
Would this PR ever, with time and scale, become the wrong thing to do - and if so, how would we know that we need to do something differently?:
Development Process
Where should we start reviewing?:
If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?:
Please tag any other people who should be aware of this PR:
@jeremyk-91
@raiju