Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add related source code locations to errors #13664

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

eliaperantoni
Copy link

After the PoC that I did in https://github.com/eliaperantoni/datafusion/tree/eper/inline-diagnostics, in this PR I'm attempting to build a more robust solution, with the goal of eventually merging it.

Still, so far I've only added diagnostics to the Cannot coerce arithmetic expression .. to valid types error to get feedback and approval on the general implementation. That way I can make tweaks quickly, before I go extend this to the rest of the repo.

Which issue does this PR close?

Closes #13662.

What changes are included in this PR?

I introduce the Diagnostic, DiagnosticEntry, and DiagnosticEntryKind types to enrich errors with better user-facing information, mainly by referring to the portions of the SQL code that caused it.

A DiagnosticEntry is used to refer to one particular portion of SQL code and describe its involvement in the error with a string message and a kind. A Diagnostic is simply a collection of DiagnosticEntry.

For example, in SELECT a_string + an_integer we'd have:

  • Entry kind=Error message=Incompatible types in binary expression Wrapping the whole a_string + an_integer expression.
  • Entry kind=Note message=Is of type Utf8 Wrapping just the left side.
  • Entry kind=Note message=Is of type Int64 Wrapping just the right side.

A new DataFusionError::Diagnostic variant is used to attach Diagnostic to errors. You can call with_diagnostic on any DataFusionError to do so. They can later be retrieved by calling DataFusionError::get_diagnostics.

It is possible to attach multiple Diagnostic while an error is being returned along the function call stack. The rationale here is that different functions might have different levels of detail, but they might all have useful diagnostics to add. e.g. the function that computes the output type of an expression has no idea about why it's being called, but calling functions have no idea about the intricacies of the error. They all know different things.

I also added a span: Span to Column. This is the first step of a (probably) long process of enriching the types used during logical planning with the Spans from the parsed AST nodes.

Added WithSpan<T>. The idea is that we want to get Span information down to functions called deep in the call stack, but without breaking existing code too much so that we can ensure a smooth and incremental transition to Diagnostic-ed errors. WithSpan<T> was my attempt to do so because any T implements Into<WithSpan<T>> by simply attaching Span::empty (but of course, if a Span is available it should be used instead). This means that any time a function wants to start spawning Diagnostic-ed errors, it can change some of its arguments by wrapping their types in WithSpan<T> and existing calls will keep on working.

WithSpan<T> should be used when the argument type is something that can loosely be traced back to a location in the source code, but is not really part of any AST node, or anything that is isomorphic to it. A good example is DataType which is taken by signature(DataType, Operator, DataType) -> DataType to get the output type of a binary expression. That is the function that spawns the error, and the caller doesn't have enough information to provide a good Diagnostic. But at the same time, it would be (in my opinion) cumbersome to add Span arguments to the function. So instead, the function can wrap the DataTypes for the two sides of the expression in WithSpan<T>. Some call points can be updated to pass a Span-enriched WithSpan<DataType>, and all the others will keep on working without changes required.

Are these changes tested?

Not yet, while this a draft to gather feedback.

Are there any user-facing changes?

Yes. The user can call DataFusionError::get_diagnostics to get an iterator over all the Diagnostic that have been attached to the error chain. Those contain source code locations that relate to the error, and each location comes with a message.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Dec 5, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @eliaperantoni -- this is very cool. I left some comments, let me know what you think

@@ -131,6 +132,7 @@ pub enum DataFusionError {
/// Errors from either mapping LogicalPlans to/from Substrait plans
/// or serializing/deserializing protobytes to Substrait plans
Substrait(String),
Diagnostic(Diagnostic, Box<DataFusionError>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting idea -- one way to think about this is that it adds additional structured information to DataFusionError::Context

If we went with the DataFusion::Diagnostic approach, do you think we would be able to deprecate / remove DataFusionError::Context in a future release?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Yes, I think DataFusion::Diagnostic can convey a superset of the information that DataFusion::Context can. Any wrapping such as:

let error = DataFusionError::Context(message, Box::new(error));

can be converted to:

let error = DataFusionError::Diagnostic(
	Diagnostic {
		entries: vec![
			DiagnosticEntry {
				span: Span::empty(),
				kind: DiagnosticEntryKind::Error,
				message,
			}
		]
	},
	Box::new(error)
);

And of course, we can provide a DataFusionError::with_simple_diagnostic function to avoid the boilerplate. At that point, DataFusion::Context could be removed.

This also enables progressively adding Span information to what was previously simply a string message.

pub struct Column {
/// relation/table reference.
pub relation: Option<TableReference>,
/// field/column name.
pub name: String,
#[derivative(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an interesting pattern -- it is different than the way sqlparser did it (which is to effectively wrap the span with a struct that is ignored when comparing eq, hash, etc

Did you consider a similar approach?

pub struct Column {
...
  // Wrapped span that does not contribute to PartialEq, Eq, Hash, etc
  pub span: DiagnosticOnly<Span>
}

I think this approach is less verbose and might be easier to understand for the casual reader

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also see WithSpan has a similar approach (but that is for wrapping things with Spans 🤔 )

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is indeed interesting. 😅 I was going to use derivative for this problem in sqlparser initially actually, but I wasn't sure if you would want to add a new dependency. If that is not a concern, I think using derivative for cases like this is pretty nice as it is also clear at the usage site that it is being ignored, unlike AttachedToken which kind of hides the behavior. If a wrapper type is preferred, I now think actually maybe the struct in sqlparser should have been called DiagnosticOnly as well 😂

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the AttachedToken approach that we took in sqlparser could be a functionally equivalent alternative. And I understand that not everybody might be familiar with derivative.

What I don't 100% love about it is that the PartialEq implementation on AttachedToken is a bit "impure". In that, it serves the one specific purpose of making it ignored when used in a struct field, but prevents you from actually comparing two instances of it because they would always are equal (the implementation returns always true), which is something you might want to do at some point. Especially because the name AttachedToken doesn't clearly convey that intent, and does more than just being a passthrough for PartialEq since its main purpose is tying together a Span and a Token.

When looking at a struct that uses it:

#[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)]
pub struct Select {
    /// Token for the `SELECT` keyword
    pub select_token: AttachedToken,
	// ...
}

it's not clear to me that AttachedToken is being ignored. I only realise that when I go look at the implementation of PartialEq.

Overall, the derivative approach to me more clearly lets me keep the sensible implementation of PartialEq for Span, while also conveying that "when used as a field in this particular struct, I want it to not influence the struct's comparison". But in other structs, I might want it to influence the comparison.

I personally find this approach more readable and flexible, but I'm open to converting to a wrapper type :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also see WithSpan has a similar approach (but that is for wrapping things with Spans 🤔 )

Ah yes good catch! I could definitely have used derivative for WithSpan too. And about WithSpan: do you like the general idea of it?

It's meant to be a "sidecar" to get a Span all the way to where it's needed to create a meaningful diagnostic, by attaching it to values that are sort-of-related, but wouldn't make a lot of sense to add as a struct field.

For example, the two DataTypes in a binary expression come from the two expressions in the SQL query, so they (in some sense) relate to a portion of the code. But putting span as a field of DataType sounds wrong.

Maybe a better name could be SpanRelated or something like that?.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective some design goals should be:

  1. Make it easy for someone reading the DataFusion source code to understand what is going on
  2. If people are not interested in the span / diagnostic information they can ignore it

From my perspective, using new crates does reduce the amount of code in the DataFusion crate, but can often increase cognitive load (as now to understand the DataFusion code you need to understand the crates)

I think the AttachedToken approach that we took in sqlparser could be a functionally equivalent alternative. And I understand that not everybody might be familiar with derivative.

Yeah, that is why I think AttachedToken might be better in this case


use sqlparser::tokenizer::Span;

/// A wrapper type for entitites that somehow refer to a location in the source
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider a trait like was done in sqlparser that retrieves spans? For example

pub trait Spanned {
  /// return the span for this object, if known
  fn span(&self) -> Option<&Span>
}

And then we would implement that trait for various DataFusion / Arrow tyoes (like DataType, etc)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would likely require implementing specific wrapper types like DataTypeWithSpan or something to attach Span information to DataTypes

I also see the need to somehow plumb through the span information into things like coerce-types 🤔

Though maybe now would be a good time to make a proper structure for coercing types, (struct TypeCoercer or something) where we could attach the spans 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider a trait like was done in sqlparser that retrieves spans?

Yes, but to implement one such trait I'd have to put the data in DataType. And I think that's a bit pushing the limits of what DataType is meant to do. I quite like the idea of the WithSpan (can rename) functioning like a sidecar, to limit big changes in the code.

Maybe we could have a Spanned trait in datafusion as well, that is automatically implemented for all WithSpan<T>, and also by types that natively carry a Span?

Though maybe now would be a good time to make a proper structure for coercing types, (struct TypeCoercer or something) where we could attach the spans 🤔

Not too familiar about the history of that component. But I think that's just one example, and there's probably many more cases where changes like that would be needed to accommodate the presence of Spans.

@eliaperantoni
Copy link
Author

eliaperantoni commented Dec 9, 2024

@alamb here's a few examples of what the new diagnostics that I implemented so far can do:

image

image

What are you thoughts so far?

Note that the rendering here is not part of my PR, but it consumes the Diagnostic data that is now produced by datafusion.

We think that this could be quite nicely integrated with datafusion-dft and datafusion-clito provide a richer experience for end users.

Perhaps this kind of quality of life features can also be related to #13525, as it would make it easier for consumers of Datafusion to provide a nicer experience to their end users.

@github-actions github-actions bot added the proto Related to proto crate label Dec 11, 2024
@alamb
Copy link
Contributor

alamb commented Dec 12, 2024

Sorry for my radio silence here -- I am reviewing this now. Thank you for your patience

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry again for the delay

I spent some more time reviewing this PR, but did not get through it entirely

I think one of the largest design decisions is "where to store the spans"?

There is a tension between trying to keep full source provenance information and overburdening the existing code. I am musing here, but I am hoping to find some balance between

  1. Getting something useful in to DataFusion that we can incrementally update
  2. Ensuring we have a good long term roadmap.

The basic idea of attaching the spans to existing LogicalPlan and Exprs (which is what this PR seems to do) makes sense to me in theory, but in practice it feels like it may be hugely disruptive (many changes to existing structs) and the information propagation will likely be to hard to maintain and test -- all code that transforms/rewrites would have to propagate the spans

I also added a span: Span to Column. This is the first step of a (probably) long process of enriching the types used during logical planning with the Spans from the parsed AST nodes.

What do you think about initially focusing on locations from the errors out of the sql planner (SqlToRel) and postpone adding Spans to the plan nodes?

This would allow us to set up a good pattern for attaching Diagnostics to errors and reportin them. Once that infrastructure is solidified we can start figuring out a pattern to plumb the source information into other areas (like type coercions errors)?

pub struct Column {
/// relation/table reference.
pub relation: Option<TableReference>,
/// field/column name.
pub name: String,
#[derivative(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective some design goals should be:

  1. Make it easy for someone reading the DataFusion source code to understand what is going on
  2. If people are not interested in the span / diagnostic information they can ignore it

From my perspective, using new crates does reduce the amount of code in the DataFusion crate, but can often increase cognitive load (as now to understand the DataFusion code you need to understand the crates)

I think the AttachedToken approach that we took in sqlparser could be a functionally equivalent alternative. And I understand that not everybody might be familiar with derivative.

Yeah, that is why I think AttachedToken might be better in this case

/// Attaches a [`Span`] to the [`Column`], i.e. its location in the source
/// SQL query.
pub fn with_span(mut self, span: Span) -> Self {
self.spans = vec![span];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this perhaps append a span to the Column rather than overwrite it?

@@ -113,6 +117,10 @@ pub struct DFSchema {
field_qualifiers: Vec<Option<TableReference>>,
/// Stores functional dependencies in the schema.
functional_dependencies: FunctionalDependencies,
/// The location in the source code where the fields are defined (e.g. in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems somewhat strange to me as the schema may not be defined in the query

Maybe we can avoid adding this as part of V1

@github-actions github-actions bot removed logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) proto Related to proto crate labels Jan 17, 2025
@eliaperantoni
Copy link
Author

@alamb I'm very sorry for the delay. I think your points about not wanting to change the logical types are valid, and I confess it was a bit of a pain to make all the changes to make my previous iteration compile and route the spans through. I tried again, this time using your suggestion of:

What do you think about initially focusing on locations from the errors out of the sql planner (SqlToRel) and postpone adding Spans to the plan nodes?

I think this was a lot easier and pollutes the source code a lot less. So far I've implemented:

  • Table not found
  • Unqualified column not found
  • Qualified column not found
  • Mismatch in number of columns in set operation

I think the "column not found" is the most interesting, since when I get an unresolved datafusion::Column, I now have to walk down a tree of sqlparser::Expr and fine one that produces a matching datafusion::Column.

i.e. since I can't store data in the logical node, when I get a troublesome one I have to transform the AST tree again to figure out which AST node produces the problematic logical node.

That works okay. Performance might not be great but it only happens in case of errors, so I think it might be okay.

I'd like to hear your opinion.

But then I started implementing the "column missing from GROUP BY clause" error and that was a bit more difficult (or unreadable) because there's many more layers of functions in between that which has access the AST tree, and the one that checks a single logical expr in the SELECT part of the query and throws the error. I could of course pass down the AST tree, but need I need a DFSchema and a PlannerContext to normalise the identifiers and correctly go from AST to logical nodes. I find that a bit cumbersome and error prone. I feel like it would be so much easier if we could agree on a way to store Span in the logical Column.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add related source code locations to errors
3 participants