-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: allow naming/aliasing relations #649
feat: allow naming/aliasing relations #649
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you feel about making this something like:
map<string, string> metadata = 6;
Then you could have {"alias": "catalog.table.rel_alias"}
. We could define some canonical keys (e.g. alias
) so that different producers / consumers have interoperability.
I ask this because there have been different asks for this kind of "relation metadata" information. For example, some users would like to attach "the SQL representation of the relation" so that they can use it for debugging. Others have wanted to attach a "unique id", again, that they can use for debugging.
Rather than add a bunch of optional properties would it be better to have one dictionary of optimal metadata that can be attached to relations?
proto/substrait/algebra.proto
Outdated
// Optional name (alias) for this relation, if provided, should be unique across the root. | ||
// Safe to ignore, or can be used for qualifying the relation or debugging. | ||
// Repeated to allow for multiple levels of naming, e.g. catalog/schema/table | ||
repeated string rel_names = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an alias for the relation? Or for the fields in the relation? If it is an alias for the relation should this maybe just be repeated string alias = 5;
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alias for the relation. Yeah, naming is hard 😅 I tried to align with the "names" in ReadTable rel, but yes I'd be happy with just "alias".
That'd work for me! I can see the benefit of having custom metadata as well. Just for my usecase I'd really need the "alias" key to be canonical given I need it in both Spark and DataFusion - so I wonder if the best way to define those canonical keys would still be to have some protobuf message, with the canonical keys + the extension point (like the map you suggested)? Or is there some reason not to have it in the message / some better place to have it? I guess technically one could also wrap it all in some SimpleExtension kind of thing, and then provide the canonical keys as extensions in this repo 🤔 |
I think my main motivation here is for users to be able to add new metadata fields for experimentation and trials without requiring a spec change. By canonical keys + extension point do you mean something like...
I'm not opposed to it. However...
What if there were a section in the site that listed the canonical keys. This way the protobuf doesn't need to change when new keys are added: |
Exactly - I modified this PR to show that. I feel like that'd be a good compromise between allowing experimentation while also providing some guarantees of structure. (I'm not strict about the specific names, so if there's better ideas happy to rename!)
I feel this could end up being too fragile - it might be easier to break the list of keys on the website, people might use keys for one reason before the same key might be added to the website for other reasons, etc. I'd prefer this proposal where there's a clear divide between optiona-but-defined-metadata and totally-custom-metadata. It could still work, I guess, but would not be my preferred option. |
proto/substrait/algebra.proto
Outdated
@@ -52,6 +53,22 @@ message RelCommon { | |||
substrait.extensions.AdvancedExtension advanced_extension = 10; | |||
} | |||
} | |||
|
|||
// Optional metadata that the producer may attach to the relation. | |||
// Consumer should hold no expectations about the existance metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about: "The consumer should not rely on the metadata for the evaluation of the plan."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might even want to mention that both the consumer and the producer shouldn't rely on this.
In other words:
- As a consumer, that wants to handle plans from multiple producers, be prepared to handle plans that are missing this metadata
- As a producer, that wants plans to be run by multiple consumers, understand that your consumer might ignore this metadata
proto/substrait/algebra.proto
Outdated
// Any custom metadata that the producer wants to attach to the relation. | ||
// Consumer should hold no expectations about the existence, or lack of, | ||
// or meaning of, of any specific key, unless the producer is known. | ||
map<string, string> custom = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having string values is probably the right choice here as it means you can't stuff arbitrary protobufs here (that's the job on a protobuf extension anyway).
proto/substrait/algebra.proto
Outdated
// Consumer should hold no expectations about the existance metadata. | ||
message Metadata { | ||
// Name (alias) for this relation. | ||
// If provided, must be unique across the relations in the plan/root. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically it only needs to be unique within the subplan that consumes it. But this is a safer qualification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hate to bike shed on names but if it must be unique across all relations then would id
be a better name to use here than alias
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd have a slight preference for "alias" rather than "id", as "alias" is the name used in SQL (eg. https://www.w3schools.com/sql/sql_alias.asp) and also "id" sounds like something people might expect to be able to refer to later on in the (substrait) plan, which we probably don't want. But I can be convinced otherwise :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's stick with alias then.
proto/substrait/algebra.proto
Outdated
// Safe to ignore, or can be used for e.g. qualifying the relation (see e.g. | ||
// Spark's SubqueryAlias), or debugging. | ||
// Repeated to allow for multiple levels of naming, e.g. catalog/schema/table. | ||
repeated string alias = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having this separate is the only part left of the PR that could be contentious. By allowing this to qualify the relation this is participating in the evaluation of the plan. Naming a relation isn't going to be enough to fix Datafusion's duplicate field errors -- the equivalent of SELECT x+2, x+2 will still trigger them.
To play devil's advocate somewhat.
With the
To David's point, giving users an extension point to add arbitrary strings into the system feels like it starts with string names and string uuids and ends with string encoded binary messages. I do like the idea of having canonical metadata information, like aliases, but maybe they could be baked into advanced extensions like: message AdvancedExtension {
// An optimization is helpful information that don't influence semantics. May
// be ignored by a consumer.
google.protobuf.Any optimization = 1;
// An enhancement alter semantics. Cannot be ignored by a consumer.
google.protobuf.Any enhancement = 2;
// metadata provides additional information which must not alter semantics
// metadata can assist with debugging and overall plan readability
// can be ignore by consumers
repeated google.protobuf.Any metadata = 2;
} and then we can provide messages in the core spec like message AliasMetadata {
// Name (alias) for this relation.
// If provided, must be unique across the relations in the plan/root.
// Safe to ignore, or can be used for e.g. qualifying the relation (see e.g.
// Spark's SubqueryAlias), or debugging.
// Repeated to allow for multiple levels of naming, e.g. catalog/schema/table.
repeated string alias = 1;
} Which let's us provide simple metadata but then also allows users (and us) to do things like: message DebuggingMetadata {
// Complex debugging info
...
} This also avoid the issues of breaking key changes because there are no keys involved, just messages. |
I do like @vbarua 's suggestion. I would argue that we should couple it with changes in the site docs (add a "standard advanced extensions" section or "standard metadata" section). Otherwise it is not clear from the protobuf that |
One potential issue I see with advanced extensions the way they are currently defined is that we only support one advanced extension and one of the three underlying types. We probably should have repeated advanced extensions included wherever we currently have one. |
@Blizzara will the datafusion consumer eventually also support substrait plans that do not include the naming/aliasing? If not, it seems like there will still be a big interoperability gap. |
It should support that already? So the |
I was thinking more along the lines of the |
@richtia this is only a problem in some specific and not-that-common cases (mainly virtual tables with same column names and self-joins iirc). Plans not involving those would be supported w/o the additional metadata also in the future. That said, I've been thinking about this and I think I want to propose changes into DataFusion to handle these cases w/o the aliasing. I still think the aliasing would be useful feature for metadata/logging/debugging reasons, but if I manage to make DF more resilient that decouples this discussion from my immediate needs, so we can take all the time here to find a good solution :) |
Re having the metadata be an AdvancedExtension, that sounds reasonable to me - however I'm not very familiar with those, and I had the same concern as @EpsilonPrime:
|
This is a concern for I agree that
Ok, I understand now. Yes, I agree that DataFusion needs to be able to handle self-joins (and other scenarios) without requiring the incoming plan have an alias. |
This PR was discussed for quite a while in the community meeting (maybe we can link the video once its up on YouTube). I'll try and summarize (based on my biased and imperfect memory):
|
My personal opinion is split between several approaches
Regardless of which approach we take I think we should make |
Both Spark and DataFusion allow composite name (Spark iirc has a list, DataFusion has three fields for catalog/schema/name. That said, those make sense when coming from a ReadRel, but dunno if anyone would ever use those within the plan. I'd expect using just a single string would be okay for all usages.
I'd assume so, yes.
Yep, same for DataFusion. I used the repeated string here since it's just more generalizable (a consumer can always combine the list into dotted name, for example), but I'm happy to go with a single string as well - maybe that makes things simpler. |
I'd vote for one of these. I think having it in |
I updated the PR to be just the Also, FWIW, apache/datafusion#11049 should resolve the original problem I had on DataFusion side. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This style works for me. We don't want extra fields growing without bounds but I think this property is common enough to be justified.
@jacques-n / @vbarua / @EpsilonPrime can one of you take another look at this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
Same goal as in #648 - to support what Spark's and DataFusion's SubqueryAlias relation does.
As Substrait mostly only referes to columns by their index, there is no inherent need for table name/qualifiers within Substrait. However, some consumers, e.g. DataFusion, require column names to be either unique or qualified for joins, which is troublesome w/o the possibility to qualify relations.
Also, for debugging failed plans and for roundtrip testing of X -> Substrait -> X conversions, it would be convenient to have proper, human-readable names to refer to.
Closes #571