-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add related source code locations to errors #13664
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @eliaperantoni -- this is very cool. I left some comments, let me know what you think
@@ -131,6 +132,7 @@ pub enum DataFusionError { | |||
/// Errors from either mapping LogicalPlans to/from Substrait plans | |||
/// or serializing/deserializing protobytes to Substrait plans | |||
Substrait(String), | |||
Diagnostic(Diagnostic, Box<DataFusionError>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting idea -- one way to think about this is that it adds additional structured information to DataFusionError::Context
If we went with the DataFusion::Diagnostic
approach, do you think we would be able to deprecate / remove DataFusionError::Context
in a future release?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Yes, I think DataFusion::Diagnostic
can convey a superset of the information that DataFusion::Context
can. Any wrapping such as:
let error = DataFusionError::Context(message, Box::new(error));
can be converted to:
let error = DataFusionError::Diagnostic(
Diagnostic {
entries: vec![
DiagnosticEntry {
span: Span::empty(),
kind: DiagnosticEntryKind::Error,
message,
}
]
},
Box::new(error)
);
And of course, we can provide a DataFusionError::with_simple_diagnostic
function to avoid the boilerplate. At that point, DataFusion::Context
could be removed.
This also enables progressively adding Span
information to what was previously simply a string message.
datafusion/common/src/column.rs
Outdated
pub struct Column { | ||
/// relation/table reference. | ||
pub relation: Option<TableReference>, | ||
/// field/column name. | ||
pub name: String, | ||
#[derivative( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an interesting pattern -- it is different than the way sqlparser did it (which is to effectively wrap the span with a struct that is ignored when comparing eq, hash, etc
Did you consider a similar approach?
pub struct Column {
...
// Wrapped span that does not contribute to PartialEq, Eq, Hash, etc
pub span: DiagnosticOnly<Span>
}
I think this approach is less verbose and might be easier to understand for the casual reader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also see WithSpan
has a similar approach (but that is for wrapping things with Spans 🤔 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed interesting. 😅 I was going to use derivative for this problem in sqlparser initially actually, but I wasn't sure if you would want to add a new dependency. If that is not a concern, I think using derivative for cases like this is pretty nice as it is also clear at the usage site that it is being ignored, unlike AttachedToken
which kind of hides the behavior. If a wrapper type is preferred, I now think actually maybe the struct in sqlparser should have been called DiagnosticOnly as well 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the AttachedToken
approach that we took in sqlparser
could be a functionally equivalent alternative. And I understand that not everybody might be familiar with derivative
.
What I don't 100% love about it is that the PartialEq
implementation on AttachedToken
is a bit "impure". In that, it serves the one specific purpose of making it ignored when used in a struct field, but prevents you from actually comparing two instances of it because they would always are equal (the implementation returns always true
), which is something you might want to do at some point. Especially because the name AttachedToken
doesn't clearly convey that intent, and does more than just being a passthrough for PartialEq
since its main purpose is tying together a Span
and a Token
.
When looking at a struct that uses it:
#[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)]
pub struct Select {
/// Token for the `SELECT` keyword
pub select_token: AttachedToken,
// ...
}
it's not clear to me that AttachedToken
is being ignored. I only realise that when I go look at the implementation of PartialEq
.
Overall, the derivative
approach to me more clearly lets me keep the sensible implementation of PartialEq
for Span
, while also conveying that "when used as a field in this particular struct, I want it to not influence the struct's comparison". But in other structs, I might want it to influence the comparison.
I personally find this approach more readable and flexible, but I'm open to converting to a wrapper type :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also see WithSpan has a similar approach (but that is for wrapping things with Spans 🤔 )
Ah yes good catch! I could definitely have used derivative
for WithSpan
too. And about WithSpan
: do you like the general idea of it?
It's meant to be a "sidecar" to get a Span
all the way to where it's needed to create a meaningful diagnostic, by attaching it to values that are sort-of-related, but wouldn't make a lot of sense to add as a struct field.
For example, the two DataType
s in a binary expression come from the two expressions in the SQL query, so they (in some sense) relate to a portion of the code. But putting span
as a field of DataType
sounds wrong.
Maybe a better name could be SpanRelated
or something like that?.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my perspective some design goals should be:
- Make it easy for someone reading the DataFusion source code to understand what is going on
- If people are not interested in the span / diagnostic information they can ignore it
From my perspective, using new crates does reduce the amount of code in the DataFusion crate, but can often increase cognitive load (as now to understand the DataFusion code you need to understand the crates)
I think the AttachedToken approach that we took in sqlparser could be a functionally equivalent alternative. And I understand that not everybody might be familiar with derivative.
Yeah, that is why I think AttachedToken might be better in this case
datafusion/common/src/with_span.rs
Outdated
|
||
use sqlparser::tokenizer::Span; | ||
|
||
/// A wrapper type for entitites that somehow refer to a location in the source |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider a trait like was done in sqlparser that retrieves spans? For example
pub trait Spanned {
/// return the span for this object, if known
fn span(&self) -> Option<&Span>
}
And then we would implement that trait for various DataFusion / Arrow tyoes (like DataType
, etc)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would likely require implementing specific wrapper types like DataTypeWithSpan
or something to attach Span information to DataTypes
I also see the need to somehow plumb through the span information into things like coerce-types 🤔
Though maybe now would be a good time to make a proper structure for coercing types, (struct TypeCoercer
or something) where we could attach the spans 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider a trait like was done in sqlparser that retrieves spans?
Yes, but to implement one such trait I'd have to put the data in DataType
. And I think that's a bit pushing the limits of what DataType
is meant to do. I quite like the idea of the WithSpan
(can rename) functioning like a sidecar, to limit big changes in the code.
Maybe we could have a Spanned
trait in datafusion as well, that is automatically implemented for all WithSpan<T>
, and also by types that natively carry a Span
?
Though maybe now would be a good time to make a proper structure for coercing types, (struct TypeCoercer or something) where we could attach the spans 🤔
Not too familiar about the history of that component. But I think that's just one example, and there's probably many more cases where changes like that would be needed to accommodate the presence of Span
s.
@alamb here's a few examples of what the new diagnostics that I implemented so far can do: What are you thoughts so far? Note that the rendering here is not part of my PR, but it consumes the We think that this could be quite nicely integrated with Perhaps this kind of quality of life features can also be related to #13525, as it would make it easier for consumers of Datafusion to provide a nicer experience to their end users. |
Sorry for my radio silence here -- I am reviewing this now. Thank you for your patience |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry again for the delay
I spent some more time reviewing this PR, but did not get through it entirely
I think one of the largest design decisions is "where to store the spans"?
There is a tension between trying to keep full source provenance information and overburdening the existing code. I am musing here, but I am hoping to find some balance between
- Getting something useful in to DataFusion that we can incrementally update
- Ensuring we have a good long term roadmap.
The basic idea of attaching the spans to existing LogicalPlan
and Expr
s (which is what this PR seems to do) makes sense to me in theory, but in practice it feels like it may be hugely disruptive (many changes to existing structs) and the information propagation will likely be to hard to maintain and test -- all code that transforms/rewrites would have to propagate the spans
I also added a span: Span to Column. This is the first step of a (probably) long process of enriching the types used during logical planning with the Spans from the parsed AST nodes.
What do you think about initially focusing on locations from the errors out of the sql planner (SqlToRel
) and postpone adding Spans to the plan nodes?
This would allow us to set up a good pattern for attaching Diagnostics to errors and reportin them. Once that infrastructure is solidified we can start figuring out a pattern to plumb the source information into other areas (like type coercions errors)?
datafusion/common/src/column.rs
Outdated
pub struct Column { | ||
/// relation/table reference. | ||
pub relation: Option<TableReference>, | ||
/// field/column name. | ||
pub name: String, | ||
#[derivative( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my perspective some design goals should be:
- Make it easy for someone reading the DataFusion source code to understand what is going on
- If people are not interested in the span / diagnostic information they can ignore it
From my perspective, using new crates does reduce the amount of code in the DataFusion crate, but can often increase cognitive load (as now to understand the DataFusion code you need to understand the crates)
I think the AttachedToken approach that we took in sqlparser could be a functionally equivalent alternative. And I understand that not everybody might be familiar with derivative.
Yeah, that is why I think AttachedToken might be better in this case
datafusion/common/src/column.rs
Outdated
/// Attaches a [`Span`] to the [`Column`], i.e. its location in the source | ||
/// SQL query. | ||
pub fn with_span(mut self, span: Span) -> Self { | ||
self.spans = vec![span]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this perhaps append a span to the Column rather than overwrite it?
@@ -113,6 +117,10 @@ pub struct DFSchema { | |||
field_qualifiers: Vec<Option<TableReference>>, | |||
/// Stores functional dependencies in the schema. | |||
functional_dependencies: FunctionalDependencies, | |||
/// The location in the source code where the fields are defined (e.g. in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems somewhat strange to me as the schema may not be defined in the query
Maybe we can avoid adding this as part of V1
0506a80
to
91f4e3c
Compare
@alamb I'm very sorry for the delay. I think your points about not wanting to change the logical types are valid, and I confess it was a bit of a pain to make all the changes to make my previous iteration compile and route the spans through. I tried again, this time using your suggestion of:
I think this was a lot easier and pollutes the source code a lot less. So far I've implemented:
I think the "column not found" is the most interesting, since when I get an unresolved i.e. since I can't store data in the logical node, when I get a troublesome one I have to transform the AST tree again to figure out which AST node produces the problematic logical node. That works okay. Performance might not be great but it only happens in case of errors, so I think it might be okay. I'd like to hear your opinion. But then I started implementing the "column missing from GROUP BY clause" error and that was a bit more difficult (or unreadable) because there's many more layers of functions in between that which has access the AST tree, and the one that checks a single logical expr in the SELECT part of the query and throws the error. I could of course pass down the AST tree, but need I need a |
After the PoC that I did in https://github.com/eliaperantoni/datafusion/tree/eper/inline-diagnostics, in this PR I'm attempting to build a more robust solution, with the goal of eventually merging it.
Still, so far I've only added diagnostics to the
Cannot coerce arithmetic expression .. to valid types
error to get feedback and approval on the general implementation. That way I can make tweaks quickly, before I go extend this to the rest of the repo.Which issue does this PR close?
Closes #13662.
What changes are included in this PR?
I introduce the
Diagnostic
,DiagnosticEntry
, andDiagnosticEntryKind
types to enrich errors with better user-facing information, mainly by referring to the portions of the SQL code that caused it.A
DiagnosticEntry
is used to refer to one particular portion of SQL code and describe its involvement in the error with a string message and akind
. ADiagnostic
is simply a collection ofDiagnosticEntry
.For example, in
SELECT a_string + an_integer
we'd have:kind=Error
message=Incompatible types in binary expression
Wrapping the wholea_string + an_integer
expression.kind=Note
message=Is of type Utf8
Wrapping just the left side.kind=Note
message=Is of type Int64
Wrapping just the right side.A new
DataFusionError::Diagnostic
variant is used to attachDiagnostic
to errors. You can callwith_diagnostic
on anyDataFusionError
to do so. They can later be retrieved by callingDataFusionError::get_diagnostics
.It is possible to attach multiple
Diagnostic
while an error is being returned along the function call stack. The rationale here is that different functions might have different levels of detail, but they might all have useful diagnostics to add. e.g. the function that computes the output type of an expression has no idea about why it's being called, but calling functions have no idea about the intricacies of the error. They all know different things.I also added a
span: Span
toColumn
. This is the first step of a (probably) long process of enriching the types used during logical planning with theSpan
s from the parsed AST nodes.Added
WithSpan<T>
. The idea is that we want to getSpan
information down to functions called deep in the call stack, but without breaking existing code too much so that we can ensure a smooth and incremental transition toDiagnostic
-ed errors.WithSpan<T>
was my attempt to do so because anyT
implementsInto<WithSpan<T>>
by simply attachingSpan::empty
(but of course, if aSpan
is available it should be used instead). This means that any time a function wants to start spawningDiagnostic
-ed errors, it can change some of its arguments by wrapping their types inWithSpan<T>
and existing calls will keep on working.WithSpan<T>
should be used when the argument type is something that can loosely be traced back to a location in the source code, but is not really part of any AST node, or anything that is isomorphic to it. A good example isDataType
which is taken bysignature(DataType, Operator, DataType) -> DataType
to get the output type of a binary expression. That is the function that spawns the error, and the caller doesn't have enough information to provide a goodDiagnostic
. But at the same time, it would be (in my opinion) cumbersome to addSpan
arguments to the function. So instead, the function can wrap theDataType
s for the two sides of the expression inWithSpan<T>
. Some call points can be updated to pass aSpan
-enrichedWithSpan<DataType>
, and all the others will keep on working without changes required.Are these changes tested?
Not yet, while this a draft to gather feedback.
Are there any user-facing changes?
Yes. The user can call
DataFusionError::get_diagnostics
to get an iterator over all theDiagnostic
that have been attached to the error chain. Those contain source code locations that relate to the error, and each location comes with a message.