Super-fy the Super Columnar format doc #5399
afa80b7 to 1f5311c
Super Columnar is a file format based on
the [super data model](zed.md) where data is stacked to form columns.
I simplified the opening lines a fair amount. A couple things I'll point out:
- When I sat down to start working on this file, there were only three uses of the word "vector" left, and I think I recall hearing in the past that we wanted to phase that word out entirely if possible, so I've taken the liberty of trying to finish that off here.
- Previously the text indicated that the columns were specifically for holding "a stream of records", but Super Columnar can handle streams of primitive values and other top-level non-record types, hence my deletion of that text.

But if there's a feeling I've gone too far here, please speak up!
There is one column stream for each top-level type encountered in the input where
each column stream is encoded according to its type. For top-level complex types,
the embedded elements are encoded recursively in additional column streams
as described below. For example,
-a record column encodes a "presence" vector encoding any null value for
+a [record column](#record-column) encodes a [presence column](#presence-columns) encoding any null value for
Another spot I'm de-"vector"-ing. Since the related section was already called "presence columns" this seemed like a safe change.
each field then encodes each non-null field recursively, whereas
-an array column encodes a "lengths" vector and encodes each
+an [array column](#array-column) encodes a sequence of "lengths" and encodes each
The third de-"vector"-ing. I did pause here because my normal instinct would be to go from "lengths vector" to "lengths column", but the section on array column was already written to describe "lengths" as a "sequence of values", so I went that way. No doubt we could go in either direction, but I wanted to make sure it was consistent across both places.
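For readers following the thread, the "sequence of lengths" idea can be illustrated with a minimal sketch. It assumes a simplified in-memory layout (one flat list of element values plus one list of per-row lengths); the function names are hypothetical and this is not the on-disk Super Columnar encoding.

```python
# Illustrative only: a simplified "lengths + flattened values" layout for an
# array column. Function names and the in-memory layout are assumptions.

def encode_array_column(rows):
    """Split a list of arrays into a lengths sequence and a flat values list."""
    lengths = [len(row) for row in rows]
    values = [v for row in rows for v in row]
    return lengths, values


def decode_array_column(lengths, values):
    """Rebuild the original arrays by slicing the flat values by each length."""
    rows, pos = [], 0
    for n in lengths:
        rows.append(values[pos:pos + n])
        pos += n
    return rows


lengths, values = encode_array_column([[1, 2], [], [3, 4, 5]])
restored = decode_array_column(lengths, values)
```

Here the lengths are what let a reader split the flat values back into rows, which is why describing them as a "sequence of values" rather than a "vector" loses nothing.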
These N super types are defined by the first N values of the reassembly stream
and are encoded as a null value of the indicated super type.
A super type's integer position in this sequence defines its identifier
-encoded in the super column (defined below). This identifier is called
+encoded in the [super column](#the-super-column). This identifier is called
Figured I'd let the hyperlink do the "defined below" talking.
run-length encoding of the locations of column values in their respective rows,
-when there are null values (as described below).
+when there are null values.
Letting the hyperlink a couple lines above do the "as described below" talking.
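As a rough illustration of what "run-length encoding of the locations of column values" could look like, here is a minimal sketch. It assumes runs that alternate between present and null counts, starting with a (possibly zero-length) present run; that convention and the function names are assumptions for illustration, not the layout the format doc specifies.

```python
# Illustrative only: run-length encode which rows hold a value versus null.
# The alternating-run convention (present first) is an assumption.

def encode_presence(column):
    """Return (runs, values): alternating present/null run counts and the
    non-null values in order. The first run always counts present values."""
    runs, values = [], []
    present, count = True, 0
    for v in column:
        is_present = v is not None
        if is_present != present:
            runs.append(count)      # close the current run
            present, count = is_present, 0
        count += 1
        if is_present:
            values.append(v)
    runs.append(count)              # close the final run
    return runs, values


def decode_presence(runs, values):
    """Rebuild the column, emitting None for rows in null runs."""
    out, it, present = [], iter(values), True
    for n in runs:
        for _ in range(n):
            out.append(next(it) if present else None)
        present = not present
    return out


runs, values = encode_presence([7, 8, None, None, 9])
restored = decode_presence(runs, values)
```

Only the non-null values are stored, and the runs say where they belong, so a column with no nulls needs no meaningful presence information at all.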
Regarding my comments, I realized you're just doing vng.
@@ -118,7 +118,7 @@ replication easy to support and deploy.
The cloud objects that comprise a lake, e.g., data objects,
commit history, transaction journals, partial aggregations, etc.,
are stored as Zed data, i.e., either as [row-based Super Binary](../formats/bsup.md)
Zed?
@@ -163,7 +163,7 @@ The human-readable format of Zed is called [ZSON](../formats/jsup.md)

ZSON is nice because it has a comprehensive type system and you can
go from ZSON to an efficient binary row format ([Super Binary](../formats/bsup.md))
-and columnar ([VNG](../formats/vng.md)) --- and vice versa ---
+and columnar ([Super Columnar](../formats/csup.md)) --- and vice versa ---
with complete fidelity and no loss of information. In this tour,
we'll stick to ZSON (though for large data sets,
ZSON?
In the same spirit as #5368.
I'm guessing this document will be subject to significant overhaul in the near future regardless, but since it's linked to from the top-level README I figured a first pass of de-Zedding was justified ASAP.