Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Super-fy the Super Columnar format doc #5399

Merged
merged 1 commit into from
Oct 31, 2024
Merged

Super-fy the Super Columnar format doc #5399

merged 1 commit into from
Oct 31, 2024

Conversation

philrz
Copy link
Contributor

@philrz philrz commented Oct 31, 2024

In the same spirit as #5368.

I'm guessing this document will be subject to significant overhaul in the near future regardless, but since it's linked to from the top-level README I figured a first pass of de-Zedding was justified ASAP.

Comment on lines +8 to +9
Super Columnar is a file format based on
the [super data model](zed.md) where data is stacked to form columns.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified the opening lines a fair amount. A couple things I'll point out:

  1. When I sat down to start working on this file, there were only three uses of the word "vector" left, and I think I recall hearing in the past that we wanted to phase that word out entirely if possible, so I've taken the liberty of trying to finish that off here.

  2. Previously the text indicated that the columns were specifically for holding "a stream of records", but Super Columnar can handle streams of primitive values and other top-level non-record types, hence my deletion of that text.

But if there's a feeling I've gone too far here, please speak up!


There is one column stream for each top-level type encountered in the input where
each column stream is encoded according to its type. For top-level complex types,
the embedded elements are encoded recursively in additional column streams
as described below. For example,
a record column encodes a "presence" vector encoding any null value for
a [record column](#record-column) encodes a [presence column](#presence-columns) encoding any null value for
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another spot I'm de-"vector"-ing. Since the related section was already called "presence columns" this seemed like a safe change.

each field then encodes each non-null field recursively, whereas
an array column encodes a "lengths" vector and encodes each
an [array column](#array-column) encodes a sequence of "lengths" and encodes each
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The third de-"vector"-ing. I did pause here because my normal instinct would be go from "lengths vector" to "lengths column", but the section on array column was already written to describe "lengths" as a "sequence of values", so I went that way. No doubt we could go in either direction, but I wanted to make sure it was consistent across both places.


These N super types are defined by the first N values of the reassembly stream
and are encoded as a null value of the indicated super type.
A super type's integer position in this sequence defines its identifier
encoded in the super column (defined below). This identifier is called
encoded in the [super column](#the-super-column). This identifier is called
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Figured I'd let the hyperlink do the "defined below" talking.

run-length encoding of the locations of column values in their respective rows,
when there are null values (as described below).
when there are null values.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Letting the hyperlink a couple lines above do the "as described below" talking.

@philrz philrz self-assigned this Oct 31, 2024
@philrz philrz requested a review from a team October 31, 2024 00:48
@philrz philrz marked this pull request as ready for review October 31, 2024 00:48
Copy link
Collaborator

@mccanne mccanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding my comments I realized you're just doing vng

@@ -118,7 +118,7 @@ replication easy to support and deploy.
The cloud objects that comprise a lake, e.g., data objects,
commit history, transaction journals, partial aggregations, etc.,
are stored as Zed data, i.e., either as [row-based Super Binary](../formats/bsup.md)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zed?

@@ -163,7 +163,7 @@ The human-readable format of Zed is called [ZSON](../formats/jsup.md)

ZSON is nice because it has a comprehensive type system and you can
go from ZSON to an efficient binary row format ([Super Binary](../formats/bsup.md))
and columnar ([VNG](../formats/vng.md)) --- and vice versa ---
and columnar ([Super Columnar](../formats/csup.md)) --- and vice versa ---
with complete fidelity and no loss of information. In this tour,
we'll stick to ZSON (though for large data sets,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ZSON?

@philrz philrz merged commit 210836e into main Oct 31, 2024
4 checks passed
@philrz philrz deleted the docs-csup-format branch October 31, 2024 02:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants