This repository has been archived by the owner on Nov 14, 2024. It is now read-only.

Tow/implement new cassandra exception #7353

Open
wants to merge 23 commits into
base: develop
from

Conversation

tillyow
Contributor

@tillyow tillyow commented Oct 15, 2024

General

Before this PR:
We were not catching TimedOutExceptions.

After this PR:

==COMMIT_MSG==
New Exception classes:

  • CassandraTExceptions()
  • CassandraTimedOutException()

Changed how Cassandra's native Thrift UnavailableExceptions are caught: they are now mapped to InsufficientConsistencyException() via CassandraTExceptions().

==COMMIT_MSG==
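A minimal sketch of the mapping described in the commit message, for illustration only. The nested exception types are stand-ins for the Thrift and AtlasDB classes so the example compiles on its own; the class name CassandraTExceptionsSketch, the method name mapToUncheckedException, the exception messages, and the assumption that both specific exceptions extend AtlasDbDependencyException are hypothetical and not taken from the PR's code.

```java
import java.util.Objects;

// Illustrative sketch only: stand-in types mimic the Thrift and AtlasDB classes
// named in this PR so that the example is self-contained.
final class CassandraTExceptionsSketch {
    private CassandraTExceptionsSketch() {}

    // Stand-ins for the checked Thrift exceptions.
    static class TException extends Exception {
        TException(String message) {
            super(message);
        }
    }

    static class UnavailableException extends TException {
        UnavailableException(String message) {
            super(message);
        }
    }

    static class TimedOutException extends TException {
        TimedOutException(String message) {
            super(message);
        }
    }

    // Stand-ins for the unchecked AtlasDB exceptions (hierarchy is an assumption).
    static class AtlasDbDependencyException extends RuntimeException {
        AtlasDbDependencyException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    static class InsufficientConsistencyException extends AtlasDbDependencyException {
        InsufficientConsistencyException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    static class CassandraTimedOutException extends AtlasDbDependencyException {
        CassandraTimedOutException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical central mapping from a checked Thrift exception to an unchecked one. */
    static AtlasDbDependencyException mapToUncheckedException(TException cause) {
        Objects.requireNonNull(cause, "cause");
        if (cause instanceof UnavailableException) {
            return new InsufficientConsistencyException("Not enough Cassandra nodes were available.", cause);
        }
        if (cause instanceof TimedOutException) {
            return new CassandraTimedOutException("The Cassandra request timed out.", cause);
        }
        // Any other Thrift exception falls back to the general dependency exception.
        return new AtlasDbDependencyException("Unexpected Cassandra Thrift exception.", cause);
    }
}
```

In this sketch the original exception is always kept as the cause, so a caller can catch either the specific type or the general AtlasDbDependencyException without losing the Thrift details.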

Priority:
P2
Concerns / possible downsides (what feedback would you like?):
None
Is documentation needed?:
No

Compatibility

Does this PR create any API breaks (e.g. at the Java or HTTP layers) - if so, do we have compatibility?:
No
Does this PR change the persisted format of any data - if so, do we have forward and backward compatibility?:
No
The code in this PR may be part of a blue-green deploy. Can upgrades from previous versions safely coexist? (Consider restarts of blue or green nodes.):

Does this PR rely on statements being true about other products at a deployment - if so, do we have correct product dependencies on these products (or other ways of verifying that these statements are true)?:

Does this PR need a schema migration?

Testing and Correctness

What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?:

What was existing testing like? What have you done to improve it?:

If this PR contains complex concurrent or asynchronous code, is it correct? The onus is on the PR writer to demonstrate this.:

If this PR involves acquiring locks or other shared resources, how do we ensure that these are always released?:

Execution

How would I tell this PR works in production? (Metrics, logs, etc.):

Has the safety of all log arguments been decided correctly?:

Will this change significantly affect our spending on metrics or logs?:

How would I tell that this PR does not work in production? (monitors, etc.):

If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?:

If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC):

Scale

Would this PR be expected to pose a risk at scale? Think of the shopping product at our largest stack.:

Would this PR be expected to perform a large number of database calls, and/or expensive database calls (e.g., row range scans, concurrent CAS)?:

Would this PR ever, with time and scale, become the wrong thing to do - and if so, how would we know that we need to do something differently?:

Development Process

Where should we start reviewing?:

If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?:

Please tag any other people who should be aware of this PR:
@jeremyk-91
@raiju

@changelog-app

changelog-app bot commented Oct 15, 2024

Generate changelog in changelog/@unreleased

What do the change types mean?
  • feature: A new feature of the service.
  • improvement: An incremental improvement in the functionality or operation of the service.
  • fix: Remedies the incorrect behaviour of a component of the service in a backwards-compatible way.
  • break: Has the potential to break consumers of this service's API, inclusive of both Palantir services
    and external consumers of the service's API (e.g. customer-written software or integrations).
  • deprecation: Advertises the intention to remove service functionality without any change to the
    operation of the service itself.
  • manualTask: Requires the possibility of manual intervention (running a script, eyeballing configuration,
    performing database surgery, ...) at the time of upgrade for it to succeed.
  • migration: A fully automatic upgrade migration task with no engineer input required.

Note: only one type should be chosen.

How are new versions calculated?
  • ❗The break and manual task changelog types will result in a major release!
  • 🐛 The fix changelog type will result in a minor release in most cases, and a patch release version for patch branches. This behaviour is configurable in autorelease.
  • ✨ All others will result in a minor version release.

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

New Exception classes:

  • CassandraTExceptions()
  • CassandraTimedOutException()

Changed how Cassandra's native Thrift UnavailableExceptions are caught: they are now mapped to InsufficientConsistencyException() via CassandraTExceptions().

Check the box to generate changelog(s)

  • Generate changelog entry

@tillyow tillyow changed the base branch from develop to tow/atlasdb-dependency-log-safe October 16, 2024 10:43
@tillyow tillyow changed the base branch from tow/atlasdb-dependency-log-safe to develop October 16, 2024 10:44
@tillyow
Contributor Author

tillyow commented Oct 23, 2024

I have changed the integration test assertions for OneNodeDownDeleteTest.deleteAllTimestampsThrows() and deletingThrows() because the thrown exception has changed. I searched our clients and none of them catch this exception type, so this is a safe change.

Contributor

@jeremyk-91 jeremyk-91 left a comment


@mdaudali probably has more context on this - but is there a reason we wanted to wrap the rest of the exceptions (other than the TimedOut) in RuntimeExceptions / switch to the unchecked paradigm?

new: "parameter void com.palantir.atlasdb.keyvalue.api.InsufficientConsistencyException::<init>(java.lang.Throwable,\
\ ===com.palantir.logsafe.Arg<?>[]===)"
justification: "Not a break as I have handle the implementation in the same\
\ PR"
Contributor


note: This would be true for truly internal APIs. However, this exception is actually used in the internal backup and restore product's tests. It's probably not a major issue, but we will want to make sure that we fix their tests and bump their Atlas version accordingly.

Contributor Author


I did a Sourcegraph check and couldn't find anything. How would I see the usages in the "internal backup and restore product's tests" if not through Sourcegraph?

changelog/@unreleased/pr-7353.v2.yml Outdated Show resolved Hide resolved
@tillyow
Contributor Author

tillyow commented Oct 25, 2024

@mdaudali probably has more context on this - but is there a reason we wanted to wrap the rest of the exceptions (other than the TimedOut) in RuntimeExceptions / switch to the unchecked paradigm?

From what you have said, I think you are asking why we throw an AtlasDbDependencyException (which extends RuntimeException) for all TExceptions that are neither InsufficientConsistencyException nor TimedOutException? The reasoning was that Cassandra Thrift exceptions are, in reality, a dependency issue, so it made sense to throw either the explicit error (e.g. InsufficientConsistencyException or TimedOutException) or the more general AtlasDbDependencyException, while keeping them logically connected.

I think we used an unchecked exception because we then wouldn't have to change multiple methods to methodName() throws X, and since AtlasDbDependencyException already existed in the code there seemed to be a precedent for using runtime exceptions. However, I suppose there is a design argument that Thrift exceptions are operational errors coming from the database due to client usage. I think that, because the next part of the work would be to adjust alta to catch and proactively react to these errors, ease of coding overrode the checked versus unchecked question?

@tillyow tillyow requested a review from jeremyk-91 October 25, 2024 10:59
@mdaudali
Contributor

mdaudali commented Oct 30, 2024

@mdaudali probably has more context on this - but is there a reason we wanted to wrap the rest of the exceptions (other than the TimedOut) in RuntimeExceptions / switch to the unchecked paradigm?

To be upfront: I have not reviewed this PR yet - other than giving the guidance on wrapping the TimedOutException and the idea of having the CassandraTExceptions class to do it in.

but is there a reason we wanted to wrap the rest of the exceptions (other than the TimedOut) in RuntimeExceptions / switch to the unchecked paradigm?

Not specifically [haven't reviewed it yet] (but also, IIRC, and will check once better - don't we already wrap a bunch of the TExceptions in RuntimeExceptions via Throwables#somethingsomethingwrap?). For TExceptions in particular I thought it useful to handle them all centrally in the CassandraTExceptions class or whatever it's called, to handle wrapping other exceptions without having except handlers for each individual type (but how we handle the unmapped types I haven't checked.)

Ah, on reviewing the PR - I realise that the PR is changing a bunch of checked exception places too

@tillyow
Contributor Author

tillyow commented Oct 30, 2024

@mdaudali this PR is just a cleaner version of #7285, where we discussed the wrapping and copying of Throwables.wrapsomething(). I created a new PR to separate out the AtlasDbDependency stuff from the CassandraTException stuff.

Contributor

@mdaudali mdaudali left a comment


I started reviewing, but this PR is doing a few too many things (and there's a fair bit to unpack here). Let's break it down to make it easier to review, please.

An example split:

  1. Implement CassandraTimedOutExceptions
  2. Implement CassandraTExceptions, and add tests.
  3. Modify the relevant unchecked exception locations.
    (Be careful with the above, you don't always need to use CassandraTExceptions if you're providing a specific message for a given exception - e.g., if you're explicitly catching UnavailableException to provide the relevant exception message, you don't need to call CassandraTExceptions)
  4. Consider whether we need to modify any of the checked exception locations (and ping me what your thoughts are for that)
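The note under step 3 could look roughly like the following sketch: a handler that already catches UnavailableException to supply its own message throws the specific exception directly, and only the remaining TExceptions go through the central mapper. All types here are stand-ins so the example is self-contained; mapCentrally, ThriftCall, runDelete, and the messages are hypothetical names chosen for illustration, not the PR's actual code.

```java
// Illustrative sketch only: stand-in types replace the Thrift and AtlasDB classes.
final class SpecificBeforeGeneralSketch {
    private SpecificBeforeGeneralSketch() {}

    static class TException extends Exception {
        TException(String message) {
            super(message);
        }
    }

    static class UnavailableException extends TException {
        UnavailableException(String message) {
            super(message);
        }
    }

    static class AtlasDbDependencyException extends RuntimeException {
        AtlasDbDependencyException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    static class InsufficientConsistencyException extends AtlasDbDependencyException {
        InsufficientConsistencyException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical functional interface for a Thrift call that may throw a checked exception. */
    interface ThriftCall<T> {
        T call() throws TException;
    }

    /** Hypothetical stand-in for the central CassandraTExceptions mapper. */
    static AtlasDbDependencyException mapCentrally(TException cause) {
        return new AtlasDbDependencyException("Cassandra Thrift call failed.", cause);
    }

    static <T> T runDelete(ThriftCall<T> call) {
        try {
            return call.call();
        } catch (UnavailableException e) {
            // A call-site-specific message is provided, so the central mapper is not needed here.
            throw new InsufficientConsistencyException(
                    "Deleting requires all Cassandra nodes to be up and available.", e);
        } catch (TException e) {
            // Everything else is remapped centrally.
            throw mapCentrally(e);
        }
    }
}
```

The more specific catch clause has to come first; Java rejects a catch of a subclass placed after a catch of its superclass as unreachable.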

@@ -36,13 +36,13 @@ void testSetup(CassandraKeyValueService kvs) {

     @Test
     public void deletingThrows() {
-        assertThrowsInsufficientConsistencyExceptionAndDoesNotChangeCassandraSchema(
+        assertThrowsAtlasDbDependencyExceptionAndDoesNotChangeCassandraSchema(
Contributor


This seems odd - why do we need to change this? We should still be throwing InsufficientConsistencyExceptions here.

@@ -137,8 +137,14 @@ private CassandraClient instrumentClient(Client rawClient) {
         return client;
     }

-    private Cassandra.Client getRawClientWithKeyspaceSet() throws TException {
-        Client ret = getRawClientWithTimedCreation();
+    private Cassandra.Client getRawClientWithKeyspaceSet() {
Contributor


I don't think we should be changing the signature here - this should be an invisible refactor, and changing the fact that this no longer throws the checked exception is changing the signature.

Do double check whether you actually want to modify here to remap the exception, or at the caller (and let me know what you believe to be the case)
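One way to read the "remap at the caller" option, as a rough sketch: the private method keeps its throws TException signature, and the method that calls it translates the checked exception into an unchecked one. The types, the keyspace parameter, the createClient caller, and the messages below are hypothetical stand-ins for illustration; the real getRawClientWithKeyspaceSet() takes no arguments.

```java
// Illustrative sketch only: stand-in types replace the Thrift and AtlasDB classes.
final class RemapAtCallerSketch {
    private RemapAtCallerSketch() {}

    static class TException extends Exception {
        TException(String message) {
            super(message);
        }
    }

    static class AtlasDbDependencyException extends RuntimeException {
        AtlasDbDependencyException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical stand-in for the raw Cassandra client. */
    static final class Client {
        final String keyspace;

        Client(String keyspace) {
            this.keyspace = keyspace;
        }
    }

    // Signature unchanged: the method still declares the checked exception.
    static Client getRawClientWithKeyspaceSet(String keyspace) throws TException {
        if (keyspace == null || keyspace.isEmpty()) {
            throw new TException("set_keyspace failed: no keyspace given");
        }
        return new Client(keyspace);
    }

    // Hypothetical caller: the remapping to an unchecked exception happens here instead.
    static Client createClient(String keyspace) {
        try {
            return getRawClientWithKeyspaceSet(keyspace);
        } catch (TException e) {
            throw new AtlasDbDependencyException("Failed to create a Cassandra client.", e);
        }
    }
}
```

With this shape the refactor stays invisible to anything that calls getRawClientWithKeyspaceSet directly, which is the property the review comment asks for.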
