Super-fy the Super Columnar format doc #5399

philrz · 2024-10-31T00:17:48Z

In the same spirit as #5368.

I'm guessing this document will be subject to significant overhaul in the near future regardless, but since it's linked to from the top-level README, I figured a first pass of de-Zedding was justified ASAP.

philrz · 2024-10-31T00:39:25Z

docs/formats/csup.md

+Super Columnar is a file format based on
+the [super data model](zed.md) where data is stacked to form columns.


I simplified the opening lines a fair amount. A couple things I'll point out:

When I sat down to start working on this file, there were only three uses of the word "vector" left, and I think I recall hearing in the past that we wanted to phase that word out entirely if possible, so I've taken the liberty of trying to finish that off here.

Previously the text indicated that the columns were specifically for holding "a stream of records", but Super Columnar can handle streams of primitive values and other top-level non-record types, hence my deletion of that text.

But if there's a feeling I've gone too far here, please speak up!

philrz · 2024-10-31T00:40:46Z

docs/formats/csup.md


 There is one column stream for each top-level type encountered in the input where
 each column stream is encoded according to its type.  For top-level complex types,
 the embedded elements are encoded recursively in additional column streams
 as described below.  For example,
-a record column encodes a "presence" vector encoding any null value for
+a [record column](#record-column) encodes a [presence column](#presence-columns) encoding any null value for


Another spot I'm de-"vector"-ing. Since the related section was already called "presence columns" this seemed like a safe change.
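
For readers unfamiliar with the idea, here is a minimal sketch of how a field's values could be split into a presence sequence plus the non-null values. The function names `split_field` and `join_field` are hypothetical and chosen for illustration. This is not the CSUP on-disk layout, only the logical idea the sentence in the diff describes.

```python
from typing import Any, Optional


def split_field(values: list[Optional[Any]]) -> tuple[list[bool], list[Any]]:
    """Split one field's values into a presence list and its non-null values.

    Hypothetical helper: presence[i] is True when row i holds a value.
    """
    presence = [v is not None for v in values]
    non_null = [v for v in values if v is not None]
    return presence, non_null


def join_field(presence: list[bool], non_null: list[Any]) -> list[Optional[Any]]:
    """Rebuild the original values by consuming non_null wherever presence is True."""
    it = iter(non_null)
    return [next(it) if present else None for present in presence]


presence, non_null = split_field([1, None, 3, None, None])
restored = join_field(presence, non_null)
```

Only the non-null values need to be encoded recursively; the presence information is enough to put the nulls back in the right rows.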

philrz · 2024-10-31T00:43:05Z

docs/formats/csup.md

 each field then encodes each non-null field recursively, whereas
-an array column encodes a "lengths" vector and encodes each
+an [array column](#array-column) encodes a sequence of "lengths" and encodes each


The third de-"vector"-ing. I did pause here because my normal instinct would be to go from "lengths vector" to "lengths column", but the section on the array column was already written to describe "lengths" as a "sequence of values", so I went that way. No doubt we could go in either direction, but I wanted to make sure it was consistent across both places.
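
To make the "sequence of lengths" wording concrete, here is a small sketch of the general technique: a series of arrays is stored as one lengths sequence plus one flattened values sequence. The names `encode_arrays` and `decode_arrays` are hypothetical, and the sketch ignores nulls, compression, and the recursive encoding of element types.

```python
from typing import Any


def encode_arrays(arrays: list[list[Any]]) -> tuple[list[int], list[Any]]:
    """Flatten arrays into (lengths, values). Hypothetical illustration only."""
    lengths = [len(a) for a in arrays]
    values = [v for a in arrays for v in a]
    return lengths, values


def decode_arrays(lengths: list[int], values: list[Any]) -> list[list[Any]]:
    """Rebuild the arrays by slicing values according to each length."""
    out, pos = [], 0
    for n in lengths:
        out.append(values[pos:pos + n])
        pos += n
    return out


lengths, values = encode_arrays([[1, 2], [], [3]])
round_trip = decode_arrays(lengths, values)
```

Whether the doc calls `lengths` a column or a sequence, the data is the same: one integer per array in the input.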

philrz · 2024-10-31T00:44:12Z

docs/formats/csup.md


 These N super types are defined by the first N values of the reassembly stream
 and are encoded as a null value of the indicated super type.
 A super type's integer position in this sequence defines its identifier
-encoded in the super column (defined below).  This identifier is called
+encoded in the [super column](#the-super-column).  This identifier is called


Figured I'd let the hyperlink do the "defined below" talking.
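
As a quick illustration of the quoted passage, a type's identifier is simply its integer position in the sequence of types. The sketch below assumes types are represented as plain strings and uses a hypothetical helper name, `assign_type_ids`. The real format encodes each type as a null value of that type in the reassembly stream.

```python
def assign_type_ids(types_in_order: list[str]) -> dict[str, int]:
    """Map each type to its position in the sequence (hypothetical helper)."""
    return {t: i for i, t in enumerate(types_in_order)}


# Type strings here are illustrative stand-ins, not actual encoded values.
type_ids = assign_type_ids(["int64", "{a:string}", "[int64]"])
```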

philrz · 2024-10-31T00:45:17Z

docs/formats/csup.md

 run-length encoding of the locations of column values in their respective rows,
-when there are null values (as described below).
+when there are null values.


Letting the hyperlink a couple lines above do the "as described below" talking.
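
For context, run-length encoding of presence can be sketched as below. The convention used here is an assumption made for illustration and is not taken from the diff: runs alternate between present and not-present counts, and the first run counts present values. The function names are hypothetical too.

```python
def presence_runs(presence: list[bool]) -> list[int]:
    """Encode a presence list as alternating run lengths.

    Assumption: the first run counts present values, so a list that
    starts with a null yields a leading run of 0.
    """
    runs = []
    expect = True  # which state the current run is counting
    count = 0
    for p in presence:
        if p == expect:
            count += 1
        else:
            runs.append(count)
            expect = not expect
            count = 1
    runs.append(count)
    return runs


def expand_runs(runs: list[int]) -> list[bool]:
    """Invert presence_runs under the same alternating convention."""
    out: list[bool] = []
    present = True
    for n in runs:
        out.extend([present] * n)
        present = not present
    return out


runs = presence_runs([True, True, False, True])
leading_null = presence_runs([False, True])
```

Long stretches of all-present or all-null rows collapse to a single integer each, which is why this step only matters when null values are present.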

mccanne

Regarding my comments: I realized you're just doing vng.

mccanne · 2024-10-31T02:08:15Z

docs/commands/zed.md

@@ -118,7 +118,7 @@ replication easy to support and deploy.
 The cloud objects that comprise a lake, e.g., data objects,
 commit history, transaction journals, partial aggregations, etc.,
 are stored as Zed data, i.e., either as [row-based Super Binary](../formats/bsup.md)


mccanne · 2024-10-31T02:08:51Z

docs/tutorials/zq.md

@@ -163,7 +163,7 @@ The human-readable format of Zed is called [ZSON](../formats/jsup.md)

 ZSON is nice because it has a comprehensive type system and you can
 go from ZSON to an efficient binary row format ([Super Binary](../formats/bsup.md))
-and columnar ([VNG](../formats/vng.md)) --- and vice versa ---
+and columnar ([Super Columnar](../formats/csup.md)) --- and vice versa ---
 with complete fidelity and no loss of information.  In this tour,
 we'll stick to ZSON (though for large data sets,


Super-fy the Super Columnar format doc

1f5311c

philrz force-pushed the docs-csup-format branch from afa80b7 to 1f5311c Compare October 31, 2024 00:33

philrz commented Oct 31, 2024

philrz self-assigned this Oct 31, 2024

philrz requested a review from a team October 31, 2024 00:48

philrz marked this pull request as ready for review October 31, 2024 00:48

mccanne approved these changes Oct 31, 2024

philrz merged commit 210836e into main Oct 31, 2024
4 checks passed

philrz deleted the docs-csup-format branch October 31, 2024 02:53
