diff --git a/previews/PR119/.documenter-siteinfo.json b/previews/PR119/.documenter-siteinfo.json
index e4f07d3..a1fb615 100644
--- a/previews/PR119/.documenter-siteinfo.json
+++ b/previews/PR119/.documenter-siteinfo.json
@@ -1 +1 @@
-{"documenter":{"julia_version":"1.10.5","generation_timestamp":"2024-08-28T20:19:28","documenter_version":"1.6.0"}}
\ No newline at end of file
+{"documenter":{"julia_version":"1.10.5","generation_timestamp":"2024-08-28T20:41:26","documenter_version":"1.6.0"}}
\ No newline at end of file
diff --git a/previews/PR119/arrow-concepts/index.html b/previews/PR119/arrow-concepts/index.html
index 3d30f17..89617fb 100644
--- a/previews/PR119/arrow-concepts/index.html
+++ b/previews/PR119/arrow-concepts/index.html
@@ -1,2 +1,2 @@
-
If you're a newcomer to Legolas.jl, please familiarize yourself with the tour before diving into this documentation.
Legolas.jl's target (de)serialization format, Arrow, already features wide cross-language adoption, enabling Legolas-serialized tables to be seamlessly read into many non-Julia environments. This documentation section contains conventions related to Legolas-serialized Arrow tables that may be observable by generic Legolas-unaware Arrow consumers.
Legolas defines a special field legolas_schema_qualified that Legolas-aware Arrow writers may include in an Arrow table's table-level metadata to indicate a particular Legolas schema with which the table complies.
Arrow tables which include this field are considered to "support Legolas schema discovery" and are referred to as "Legolas-discoverable", since Legolas consumers may employ this field to automatically match the table against available application-layer Legolas schema definitions.
When writing a Legolas-discoverable Arrow table to a file, prefer using the file extension *.<schema name>.arrow. For example, if the file's table's full Legolas schema version identifier is baz.supercar@1>bar.automobile@1, use the file extension *.baz.supercar.arrow.
This document was generated with Documenter.jl version 1.6.0 on Wednesday 28 August 2024. Using Julia version 1.10.5.
If you're a newcomer to Legolas.jl, please familiarize yourself with the tour before diving into this documentation.
Legolas.jl's target (de)serialization format, Arrow, already features wide cross-language adoption, enabling Legolas-serialized tables to be seamlessly read into many non-Julia environments. This documentation section contains conventions related to Legolas-serialized Arrow tables that may be observable by generic Legolas-unaware Arrow consumers.
Legolas defines a special field legolas_schema_qualified that Legolas-aware Arrow writers may include in an Arrow table's table-level metadata to indicate a particular Legolas schema with which the table complies.
Arrow tables which include this field are considered to "support Legolas schema discovery" and are referred to as "Legolas-discoverable", since Legolas consumers may employ this field to automatically match the table against available application-layer Legolas schema definitions.
When writing a Legolas-discoverable Arrow table to a file, prefer using the file extension *.<schema name>.arrow. For example, if the file's table's full Legolas schema version identifier is baz.supercar@1>bar.automobile@1, use the file extension *.baz.supercar.arrow.
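The file-naming convention above can be sketched in a few lines of plain Julia. This is only an illustration: `suggested_filename_sketch` and its `stem` argument are hypothetical names that are not part of Legolas.jl, and the sketch assumes the schema name is simply the text before the first `@` of the first `>`-separated segment.

```julia
# Hypothetical helper (not part of Legolas.jl): build a file name of the form
# `<stem>.<schema name>.arrow` from a schema version identifier.
function suggested_filename_sketch(stem::AbstractString, identifier::AbstractString)
    # The first `>`-separated segment identifies the table's own schema version.
    own_segment = first(split(identifier, '>'))
    # The schema name is everything before the `@` in that segment.
    schema_name = first(split(own_segment, '@'))
    return string(stem, '.', schema_name, ".arrow")
end

println(suggested_filename_sketch("garage", "baz.supercar@1>bar.automobile@1"))
```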
At its core, Legolas.jl provides a lightweight, expressive set of mechanisms/patterns for wrangling Tables.jl-compliant values in a manner that enables schema composability, extensibility and a few nice utilities on top.
The package originated from code developed internally at Beacon to wrangle heterogeneous Arrow datasets, and is thus probably mostly useful for folks in a similar situation. If you're curating tabular datasets and you'd like to build shared Julia tools atop the schemas therein, then Legolas.jl may be worth checking out.
Technically, Legolas.jl's core @schema/@version functionality is agnostic to (de)serialization and could be useful for anybody who wants to wrangle Tables.jl-compliant values.
Otherwise, with regard to (de)serialization-specific functionality, Beacon has put effort into ensuring Legolas.jl works well with Arrow.jl "by default" simply because we're heavy users of the Arrow format. There's nothing stopping users from composing the package with JSON3.jl or other packages.
Many of Legolas' current choices on this front stem from refactoring efforts undertaken as part of this pull request, and directly resulted from a design mini-investigation associated with those efforts.
At its core, Legolas.jl provides a lightweight, expressive set of mechanisms/patterns for wrangling Tables.jl-compliant values in a manner that enables schema composability, extensibility and a few nice utilities on top.
The package originated from code developed internally at Beacon to wrangle heterogeneous Arrow datasets, and is thus probably mostly useful for folks in a similar situation. If you're curating tabular datasets and you'd like to build shared Julia tools atop the schemas therein, then Legolas.jl may be worth checking out.
Technically, Legolas.jl's core @schema/@version functionality is agnostic to (de)serialization and could be useful for anybody who wants to wrangle Tables.jl-compliant values.
Otherwise, with regard to (de)serialization-specific functionality, Beacon has put effort into ensuring Legolas.jl works well with Arrow.jl "by default" simply because we're heavy users of the Arrow format. There's nothing stopping users from composing the package with JSON3.jl or other packages.
Many of Legolas' current choices on this front stem from refactoring efforts undertaken as part of this pull request, and directly resulted from a design mini-investigation associated with those efforts.
A type representing a particular version of a Legolas schema. The relevant name (a Symbol) and version (an Integer) are surfaced as type parameters, allowing them to be utilized for dispatch.
For more details and examples, please see Legolas.jl/examples/tour.jl and the "Schema-Related Concepts/Conventions" section of the Legolas.jl documentation.
The constructor SchemaVersion{name,version}() will throw an ArgumentError if version is negative.
A type representing a particular version of a Legolas schema. The relevant name (a Symbol) and version (an Integer) are surfaced as type parameters, allowing them to be utilized for dispatch.
For more details and examples, please see Legolas.jl/examples/tour.jl and the "Schema-Related Concepts/Conventions" section of the Legolas.jl documentation.
The constructor SchemaVersion{name,version}() will throw an ArgumentError if version is negative.
@version RecordType begin
declared_field_expression_1
declared_field_expression_2
⋮
@@ -27,14 +27,14 @@
FooV1{Float32}: (x = 1, y = 2.0f0)
julia> FooV1(x=1, y="bad")
-ERROR: TypeError: in FooV1, in _y_T, expected _y_T<:Real, got Type{String}
This macro will throw a Legolas.SchemaVersionDeclarationError if:
The provided RecordType does not follow the $(Prefix)V$(n) format, where Prefix was previously associated with a given schema by a prior @schema declaration.
There are no declared field expressions, duplicate fields are declared, or a given declared field expression is invalid.
(if a parent is specified) The @version declaration does not comply with its parent's @version declaration, or the parent hasn't yet been declared at all.
Note that this macro expects to be evaluated within top-level scope.
For more details and examples, please see Legolas.jl/examples/tour.jl and the "Schema-Related Concepts/Conventions" section of the Legolas.jl documentation.
return an n element Vector{SchemaVersion} whose ith element is SchemaVersion(names[i], versions[i]).
Throws an ArgumentError if the provided string is not a valid schema version identifier.
For details regarding valid schema version identifiers and their structure, see the "Schema-Related Concepts/Conventions" section of the Legolas.jl documentation.
Return this Legolas.SchemaVersion's fully qualified schema version identifier. This string is serialized as the "legolas_schema_qualified" field value in table metadata for tables written via Legolas.write.
type::Union{Symbol,Expr}: the declared field's declared type constraint
parameterize::Bool: whether or not the declared field is exposed as a parameter
statement::Expr: the declared field's full assignment statement (as processed by @version, not necessarily as written)
Note that declaration is primarily intended to be used for interactive discovery purposes, and does not include the contents of declaration(parent(sv)).
Return a Vector{Pair{Symbol,Union{Type,Missing}}} of all of ts's violations with respect to sv.
This function's notion of "violation" is defined by Legolas.find_violation, which immediately returns the first violation found; prefer to use that function instead of find_violations in situations where you only need to detect any violation instead of all violations.
Return the "maximal supertype" of T that is accepted by sv when evaluating a field of type >:T for schematic compliance via Legolas.find_violation; see that function's docstring for an explanation of this function's use in context.
SchemaVersion authors may overload this function to broaden particular type constraints that determine schematic compliance for their SchemaVersion, without needing to broaden the type constraints employed by their SchemaVersion's record type.
Legolas itself defines the following default overloads:
accepted_field_type(::SchemaVersion, T::Type) = T
+ERROR: TypeError: in FooV1, in _y_T, expected _y_T<:Real, got Type{String}
This macro will throw a Legolas.SchemaVersionDeclarationError if:
The provided RecordType does not follow the $(Prefix)V$(n) format, where Prefix was previously associated with a given schema by a prior @schema declaration.
There are no declared field expressions, duplicate fields are declared, or a given declared field expression is invalid.
(if a parent is specified) The @version declaration does not comply with its parent's @version declaration, or the parent hasn't yet been declared at all.
Note that this macro expects to be evaluated within top-level scope.
For more details and examples, please see Legolas.jl/examples/tour.jl and the "Schema-Related Concepts/Conventions" section of the Legolas.jl documentation.
return an n element Vector{SchemaVersion} whose ith element is SchemaVersion(names[i], versions[i]).
Throws an ArgumentError if the provided string is not a valid schema version identifier.
For details regarding valid schema version identifiers and their structure, see the "Schema-Related Concepts/Conventions" section of the Legolas.jl documentation.
Return this Legolas.SchemaVersion's fully qualified schema version identifier. This string is serialized as the "legolas_schema_qualified" field value in table metadata for tables written via Legolas.write.
type::Union{Symbol,Expr}: the declared field's declared type constraint
parameterize::Bool: whether or not the declared field is exposed as a parameter
statement::Expr: the declared field's full assignment statement (as processed by @version, not necessarily as written)
Note that declaration is primarily intended to be used for interactive discovery purposes, and does not include the contents of declaration(parent(sv)).
Return a Vector{Pair{Symbol,Union{Type,Missing}}} of all of ts's violations with respect to sv.
This function's notion of "violation" is defined by Legolas.find_violation, which immediately returns the first violation found; prefer to use that function instead of find_violations in situations where you only need to detect any violation instead of all violations.
Return the "maximal supertype" of T that is accepted by sv when evaluating a field of type >:T for schematic compliance via Legolas.find_violation; see that function's docstring for an explanation of this function's use in context.
SchemaVersion authors may overload this function to broaden particular type constraints that determine schematic compliance for their SchemaVersion, without needing to broaden the type constraints employed by their SchemaVersion's record type.
Legolas itself defines the following default overloads:
accepted_field_type(::SchemaVersion, T::Type) = T
accepted_field_type(::SchemaVersion, ::Type{Any}) = Any
accepted_field_type(::SchemaVersion, ::Type{UUID}) = Union{UUID,UInt128}
accepted_field_type(::SchemaVersion, ::Type{Symbol}) = Union{Symbol,AbstractString}
accepted_field_type(::SchemaVersion, ::Type{String}) = AbstractString
accepted_field_type(sv::SchemaVersion, ::Type{<:Vector{T}}) where T = AbstractVector{<:(accepted_field_type(sv, T))}
accepted_field_type(::SchemaVersion, ::Type{Vector}) = AbstractVector
-accepted_field_type(sv::SchemaVersion, ::Type{Union{T,Missing}}) where {T} = Union{accepted_field_type(sv, T),Missing}
Outside of these default overloads, this function should only be overloaded against specific SchemaVersions that are authored within the same module as the overload definition; to do otherwise constitutes type piracy and should be avoided.
If validate is true, Legolas.read will attempt to extract a Legolas.SchemaVersion from the deserialized Arrow.Table's metadata and use Legolas.validate to verify that the table's Tables.Schema complies with the extracted Legolas.SchemaVersion before returning the table.
Note that io_or_path may be any type that supports Base.read(io_or_path)::Vector{UInt8}.
Return f(x) unless x isa Union{Nothing,Missing}, in which case return missing.
This is particularly useful when handling values from Arrow.Table, whose null values may present as either missing or nothing depending on how the table itself was originally constructed.
Construct T(x) unless x is of type T, in which case return x itself. Useful in conjunction with the lift function for types that don't have a constructor accepting instances of themselves (e.g. T(::T)).
Examples
julia> using Legolas: construct
+accepted_field_type(sv::SchemaVersion, ::Type{Union{T,Missing}}) where {T} = Union{accepted_field_type(sv, T),Missing}
Outside of these default overloads, this function should only be overloaded against specific SchemaVersions that are authored within the same module as the overload definition; to do otherwise constitutes type piracy and should be avoided.
If validate is true, Legolas.read will attempt to extract a Legolas.SchemaVersion from the deserialized Arrow.Table's metadata and use Legolas.validate to verify that the table's Tables.Schema complies with the extracted Legolas.SchemaVersion before returning the table.
Note that io_or_path may be any type that supports Base.read(io_or_path)::Vector{UInt8}.
Return f(x) unless x isa Union{Nothing,Missing}, in which case return missing.
This is particularly useful when handling values from Arrow.Table, whose null values may present as either missing or nothing depending on how the table itself was originally constructed.
Construct T(x) unless x is of type T, in which case return x itself. Useful in conjunction with the lift function for types that don't have a constructor accepting instances of themselves (e.g. T(::T)).
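The described behavior of lift and construct can be approximated in plain Julia. The sketch below is a minimal stand-in written only from the descriptions above, not Legolas.jl's source; the names `lift_sketch`, `construct_sketch`, and the `Meters` type are hypothetical.

```julia
# Hypothetical stand-ins written from the docstring descriptions above.

# Return f(x) unless x is `nothing` or `missing`, in which case return `missing`.
lift_sketch(f, x) = x isa Union{Nothing,Missing} ? missing : f(x)

# Return x itself if it is already a T; otherwise call the constructor T(x).
construct_sketch(::Type{T}, x) where {T} = x isa T ? x : T(x)

# A type with no constructor that accepts its own instances:
# `Meters(Meters(1.0))` would throw, since a Meters cannot be converted to Float64.
struct Meters
    value::Float64
end

# Combining the two: null-like values pass through as `missing`,
# existing Meters values are returned untouched, and raw numbers are wrapped.
to_meters(x) = lift_sketch(v -> construct_sketch(Meters, v), x)

println(to_meters(2.5))
println(to_meters(Meters(1.0)))
println(to_meters(nothing))
```

This pairing is convenient when a column read from an Arrow.Table may contain already-converted values, raw values, and nulls in either representation.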
Return a new AbstractRecord with the same schema version as record, whose fields are computed via Tables.rowmerge(record; fields_to_merge...). The returned record is constructed by passing these merged fields to the AbstractRecord constructor that matches the type of the input record.
Gather rows from tables into a unified cross-table index along column_name. Returns a Dict whose keys are the unique values of column_name across tables, and whose values are tuples of the form:
The provided extract function is used to extract rows from each table; it takes as input a table and a Vector{Int} of row indices, and returns the corresponding subtable. The default definition is sufficient for DataFrames tables.
Note that this function may internally call Tables.columns on each input table, so it may be slower and/or require more memory if any(!Tables.columnaccess, tables).
Note that we intend to eventually migrate this function from Legolas.jl to a more appropriate package.
Return a Dict whose keys are the set of all elements across all provided collections, and whose values are the indices that locate each corresponding element across all provided collections.
Specifically, locations(collections)[k][i] will return a Vector{Int} whose elements are the index locations of k in collections[i]. If !(k in collections[i]), this Vector{Int} will be empty.
Return a new AbstractRecord with the same schema version as record, whose fields are computed via Tables.rowmerge(record; fields_to_merge...). The returned record is constructed by passing these merged fields to the AbstractRecord constructor that matches the type of the input record.
Gather rows from tables into a unified cross-table index along column_name. Returns a Dict whose keys are the unique values of column_name across tables, and whose values are tuples of the form:
The provided extract function is used to extract rows from each table; it takes as input a table and a Vector{Int} of row indices, and returns the corresponding subtable. The default definition is sufficient for DataFrames tables.
Note that this function may internally call Tables.columns on each input table, so it may be slower and/or require more memory if any(!Tables.columnaccess, tables).
Note that we intend to eventually migrate this function from Legolas.jl to a more appropriate package.
Return a Dict whose keys are the set of all elements across all provided collections, and whose values are the indices that locate each corresponding element across all provided collections.
Specifically, locations(collections)[k][i] will return a Vector{Int} whose elements are the index locations of k in collections[i]. If !(k in collections[i]), this Vector{Int} will be empty.
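To make the shape of the returned Dict concrete, here is a small plain-Julia sketch of the behavior described above. It is not Legolas.jl's implementation: `locations_sketch` is a hypothetical name, and the sketch assumes that "index locations" means 1-based positions as produced by `enumerate`.

```julia
# Hypothetical re-creation of the described behavior; not Legolas.jl's source.
function locations_sketch(collections::Tuple)
    n = length(collections)
    results = Dict{Any,NTuple{n,Vector{Int}}}()
    for (i, collection) in enumerate(collections)
        for (j, item) in enumerate(collection)
            # Every key gets one (initially empty) index vector per collection.
            entry = get!(() -> ntuple(_ -> Int[], n), results, item)
            # Record that `item` occurs at position j of the i-th collection.
            push!(entry[i], j)
        end
    end
    return results
end

locs = locations_sketch((["a", "b", "a"], ["b", "c"]))
println(locs["a"])  # positions of "a" in each of the two collections
```

Here `locs["a"][2]` is an empty vector because `"a"` does not occur in the second collection, matching the `!(k in collections[i])` case described above.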
This function is useful when table has built-in deserialize-on-access or conversion-on-access behavior (like Arrow.Table) and you'd like to pay such access costs upfront before repeatedly accessing the table.
Note that we intend to eventually migrate this function from Legolas.jl to a more appropriate package.
This function is useful when table has built-in deserialize-on-access or conversion-on-access behavior (like Arrow.Table) and you'd like to pay such access costs upfront before repeatedly accessing the table.
Note that we intend to eventually migrate this function from Legolas.jl to a more appropriate package.
Legolas defines "schema version identifiers" as strings of the form:
name@version where:
name is a lowercase alphanumeric string and may include the special characters . and -.
version is a non-negative integer.
or, x>y where x and y are valid schema version identifiers and > denotes "extends from".
A schema version identifier is said to be fully qualified if it includes the identifiers of all ancestors of the particular schema version that it directly identifies.
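The grammar above is small enough to check with a regular expression. The following is a minimal sketch in plain Julia, not Legolas.jl's parser: `SEGMENT_PATTERN` and `parse_identifier_sketch` are hypothetical names, the sketch assumes no whitespace around `>`, and it checks syntax only (it cannot tell whether an identifier is fully qualified, since that depends on the declared ancestry).

```julia
# Hypothetical helpers illustrating the identifier grammar; not Legolas.jl's implementation.

# One `name@version` segment: a name made of lowercase alphanumerics plus `.` and `-`,
# followed by `@` and a non-negative integer version.
const SEGMENT_PATTERN = r"^([a-z0-9.\-]+)@(\d+)$"

# Split an identifier on `>` ("extends from") and return `name => version` pairs,
# ordered from the identified schema version to its most distant listed ancestor.
# Throws an ArgumentError if any segment does not match the grammar.
function parse_identifier_sketch(id::AbstractString)
    return map(split(id, '>')) do segment
        m = match(SEGMENT_PATTERN, segment)
        m === nothing && throw(ArgumentError("invalid schema version identifier segment: $(repr(segment))"))
        String(m.captures[1]) => parse(Int, m.captures[2])
    end
end

println(parse_identifier_sketch("baz.supercar@1>bar.automobile@1"))
```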
Schema authors should follow the below conventions when choosing the name of a new schema:
Include a namespace. For example, assuming the schema is defined in a package Foo.jl, foo.automobile is good, automobile is bad.
Prefer singular over plural. For example, foo.automobile is good, foo.automobiles is bad.
Don't "overqualify" a schema name with ancestor-derived information that is better captured by the fully qualified identifier of a specific schema version. For example, bar.automobile should be preferred over bar.foo.automobile, since bar.automobile@1>foo.automobile@1 is preferable to bar.foo.automobile@1>foo.automobile@1. Similarly, baz.supercar should be preferred over baz.automobile.supercar, since baz.supercar@1>bar.automobile@1 is preferable to baz.automobile.supercar@1>bar.automobile@1.
While it is fairly established practice to semantically version source code, the world of data/artifact versioning is a bit more varied. As presented in the tour, each Legolas.SchemaVersion carries a single version integer. The central rule that governs Legolas' schema versioning approach is:
Do not introduce a change to an existing schema version that might cause existing compliant data to become non-compliant; instead, incorporate the intended change in a new schema version whose version number is one greater than the previous version number.
A schema author must introduce a new schema version if any of the following changes are introduced:
A new type-constrained and/or value-constrained field is declared. In other words, for the introduction of a new declared field to be non-breaking, the new field's type constraint must be ::Any and it may not feature a value-constraining or value-transforming assignment expression.
An existing declared field's type or value constraints are tightened.
An existing declared field is renamed.
If any of the above breaking changes are made to an existing schema version, instead of introducing a new schema version, subtle downstream breakage may occur. For example, if a new type/value-constrained field is declared, previously compliant tables containing a field with the same name might accidentally become non-compliant if existing values violate the new constraints. Similarly, downstream schema version extensions may have already declared a field with the same name, but with constraints that are incompatible with the new constraints.
One benefit of Legolas' approach is that multiple schema versions may be defined in the same codebase, e.g. there's nothing that prevents @version(FooV1, ...) and @version(FooV2, ...) from being defined and utilized simultaneously. The source code that defines any given Legolas schema version and/or consumes/produces Legolas tables is presumably already semantically versioned, such that consumer/producer packages can determine their compatibility with each other in the usual manner via interpreting major/minor/patch increments.
Note that it is preferable to avoid introducing new versions of an existing schema, if possible, in order to minimize code/data churn for downstream producers/consumers. Thus, authors should prefer conservative field type restrictions from the get-go. Remember: loosening a field type restriction is not a breaking change, but tightening one is.