From b4068e10ebd3c88d5eeeaff2e7522ca68e09b1d0 Mon Sep 17 00:00:00 2001
From: "Documenter.jl"
Date: Wed, 28 Aug 2024 20:19:34 +0000
Subject: [PATCH] build based on 4fa33a9

---
 previews/PR119/.documenter-siteinfo.json  |  2 +-
 previews/PR119/arrow-concepts/index.html  |  2 +-
 previews/PR119/faq/index.html             |  2 +-
 previews/PR119/index.html                 | 10 +++++-----
 previews/PR119/schema-concepts/index.html |  2 +-
 previews/PR119/upgrade/index.html         |  2 +-
 6 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/previews/PR119/.documenter-siteinfo.json b/previews/PR119/.documenter-siteinfo.json
index b835d19..e4f07d3 100644
--- a/previews/PR119/.documenter-siteinfo.json
+++ b/previews/PR119/.documenter-siteinfo.json
@@ -1 +1 @@
-{"documenter":{"julia_version":"1.10.5","generation_timestamp":"2024-08-28T19:52:05","documenter_version":"1.6.0"}}
\ No newline at end of file
+{"documenter":{"julia_version":"1.10.5","generation_timestamp":"2024-08-28T20:19:28","documenter_version":"1.6.0"}}
\ No newline at end of file
diff --git a/previews/PR119/arrow-concepts/index.html b/previews/PR119/arrow-concepts/index.html
index 920705e..3d30f17 100644
--- a/previews/PR119/arrow-concepts/index.html
+++ b/previews/PR119/arrow-concepts/index.html
@@ -1,2 +1,2 @@
-Arrow-Related Concepts/Conventions · Legolas

Arrow-Related Concepts/Conventions

Note

If you're a newcomer to Legolas.jl, please familiarize yourself with the tour before diving into this documentation.

Legolas.jl's target (de)serialization format, Arrow, already features wide cross-language adoption, enabling Legolas-serialized tables to be seamlessly read into many non-Julia environments. This documentation section contains conventions related to Legolas-serialized Arrow tables that may be observable by generic Legolas-unaware Arrow consumers.

Supporting Legolas Schema Discovery In Arrow Tables

Legolas defines a special field legolas_schema_qualified that Legolas-aware Arrow writers may include in an Arrow table's table-level metadata to indicate a particular Legolas schema with which the table complies.

Arrow tables which include this field are considered to "support Legolas schema discovery" and are referred to as "Legolas-discoverable", since Legolas consumers may employ this field to automatically match the table against available application-layer Legolas schema definitions.

If present, the legolas_schema_qualified field's value must be a fully qualified schema version identifier.

Arrow File Naming Conventions

When writing a Legolas-discoverable Arrow table to a file, prefer using the file extension *.<schema name>.arrow. For example, if the file's table's full Legolas schema version identifier is baz.supercar@1>bar.automobile@1, use the file extension *.baz.supercar.arrow.

+Arrow-Related Concepts/Conventions · Legolas

Arrow-Related Concepts/Conventions

Note

If you're a newcomer to Legolas.jl, please familiarize yourself with the tour before diving into this documentation.

Legolas.jl's target (de)serialization format, Arrow, already features wide cross-language adoption, enabling Legolas-serialized tables to be seamlessly read into many non-Julia environments. This documentation section contains conventions related to Legolas-serialized Arrow tables that may be observable by generic Legolas-unaware Arrow consumers.

Supporting Legolas Schema Discovery In Arrow Tables

Legolas defines a special field legolas_schema_qualified that Legolas-aware Arrow writers may include in an Arrow table's table-level metadata to indicate a particular Legolas schema with which the table complies.

Arrow tables which include this field are considered to "support Legolas schema discovery" and are referred to as "Legolas-discoverable", since Legolas consumers may employ this field to automatically match the table against available application-layer Legolas schema definitions.

If present, the legolas_schema_qualified field's value must be a fully qualified schema version identifier.

Arrow File Naming Conventions

When writing a Legolas-discoverable Arrow table to a file, prefer using the file extension *.<schema name>.arrow. For example, if the file's table's full Legolas schema version identifier is baz.supercar@1>bar.automobile@1, use the file extension *.baz.supercar.arrow.
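
As a rough illustration of this naming convention, the sketch below derives a file name from a fully qualified schema version identifier using only Base Julia. The helper name `legolas_filename` and its `stem` argument are hypothetical and are not part of Legolas.jl.

```julia
# Hypothetical helper (not a Legolas.jl function): build "<stem>.<schema name>.arrow"
# from a schema version identifier such as "baz.supercar@1>bar.automobile@1".
function legolas_filename(stem::AbstractString, identifier::AbstractString)
    # The first segment (before the first '>') names the schema the table directly complies with.
    leading_segment = first(split(identifier, '>'))
    # Drop the "@version" suffix to keep only the schema name.
    schema_name = strip(first(split(leading_segment, '@')))
    return string(stem, '.', schema_name, ".arrow")
end

legolas_filename("cars", "baz.supercar@1>bar.automobile@1")  # "cars.baz.supercar.arrow"
```

Only the schema name appears in the extension; the version number and ancestor identifiers stay in the table's metadata.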

diff --git a/previews/PR119/faq/index.html b/previews/PR119/faq/index.html
index 5693922..ee2ee07 100644
--- a/previews/PR119/faq/index.html
+++ b/previews/PR119/faq/index.html
@@ -1,2 +1,2 @@
-FAQ · Legolas

FAQ

What is the point of Legolas.jl? Who benefits from using it?

At its core, Legolas.jl provides a lightweight, expressive set of mechanisms/patterns for wrangling Tables.jl-compliant values in a manner that enables schema composability, extensibility, and a few nice utilities on top.

The package originated from code developed internally at Beacon to wrangle heterogeneous Arrow datasets, and is thus probably mostly useful for folks in a similar situation. If you're curating tabular datasets and you'd like to build shared Julia tools atop the schemas therein, then Legolas.jl may be worth checking out.

Why does Legolas.jl support Arrow as a (de)serialization target, but not, say, JSON?

Technically, Legolas.jl's core @schema/@version functionality is agnostic to (de)serialization and could be useful for anybody who wants to wrangle Tables.jl-compliant values.

Otherwise, with regard to (de)serialization-specific functionality, Beacon has put effort into ensuring Legolas.jl works well with Arrow.jl "by default" simply because we're heavy users of the Arrow format. There's nothing stopping users from composing the package with JSON3.jl or other packages.

Why are Legolas.jl's generated record types defined the way that they are? For example, why is the version number hardcoded in the type name?

Many of Legolas' current choices on this front stem from refactoring efforts undertaken as part of this pull request, and directly resulted from a design mini-investigation associated with those efforts.

+FAQ · Legolas

FAQ

What is the point of Legolas.jl? Who benefits from using it?

At its core, Legolas.jl provides a lightweight, expressive set of mechanisms/patterns for wrangling Tables.jl-compliant values in a manner that enables schema composability, extensibility, and a few nice utilities on top.

The package originated from code developed internally at Beacon to wrangle heterogeneous Arrow datasets, and is thus probably mostly useful for folks in a similar situation. If you're curating tabular datasets and you'd like to build shared Julia tools atop the schemas therein, then Legolas.jl may be worth checking out.

Why does Legolas.jl support Arrow as a (de)serialization target, but not, say, JSON?

Technically, Legolas.jl's core @schema/@version functionality is agnostic to (de)serialization and could be useful for anybody who wants to wrangle Tables.jl-compliant values.

Otherwise, with regard to (de)serialization-specific functionality, Beacon has put effort into ensuring Legolas.jl works well with Arrow.jl "by default" simply because we're heavy users of the Arrow format. There's nothing stopping users from composing the package with JSON3.jl or other packages.

Why are Legolas.jl's generated record types defined the way that they are? For example, why is the version number hardcoded in the type name?

Many of Legolas' current choices on this front stem from refactoring efforts undertaken as part of this pull request, and directly resulted from a design mini-investigation associated with those efforts.

diff --git a/previews/PR119/index.html b/previews/PR119/index.html
index 5245bcb..f8a06de 100644
--- a/previews/PR119/index.html
+++ b/previews/PR119/index.html
@@ -1,5 +1,5 @@
-API Documentation · Legolas

API Documentation

Note

If you're a newcomer to Legolas.jl, please familiarize yourself with the tour before diving into this documentation.

Legolas Schemas

Legolas.SchemaVersionType
Legolas.SchemaVersion{name,version}

A type representing a particular version of a Legolas schema. The relevant name (a Symbol) and version (an Integer) are surfaced as type parameters, allowing them to be utilized for dispatch.

For more details and examples, please see Legolas.jl/examples/tour.jl and the "Schema-Related Concepts/Conventions" section of the Legolas.jl documentation.

The constructor SchemaVersion{name,version}() will throw an ArgumentError if version is negative.

See also: Legolas.@schema

source
Legolas.@schemaMacro
@schema "name" Prefix

Declare a Legolas schema with the given name. Types generated by subsequent @version declarations for this schema will be prefixed with Prefix.

For more details and examples, please see Legolas.jl/examples/tour.jl.

source
Legolas.@versionMacro
@version RecordType begin
+API Documentation · Legolas

API Documentation

Note

If you're a newcomer to Legolas.jl, please familiarize yourself with the tour before diving into this documentation.

Legolas Schemas

Legolas.SchemaVersionType
Legolas.SchemaVersion{name,version}

A type representing a particular version of a Legolas schema. The relevant name (a Symbol) and version (an Integer) are surfaced as type parameters, allowing them to be utilized for dispatch.

For more details and examples, please see Legolas.jl/examples/tour.jl and the "Schema-Related Concepts/Conventions" section of the Legolas.jl documentation.

The constructor SchemaVersion{name,version}() will throw an ArgumentError if version is negative.

See also: Legolas.@schema

source
Legolas.@schemaMacro
@schema "name" Prefix

Declare a Legolas schema with the given name. Types generated by subsequent @version declarations for this schema will be prefixed with Prefix.

For more details and examples, please see Legolas.jl/examples/tour.jl.

source
Legolas.@versionMacro
@version RecordType begin
     declared_field_expression_1
     declared_field_expression_2
     ⋮
@@ -27,14 +27,14 @@
 FooV1{Float32}: (x = 1, y = 2.0f0)
 
 julia> FooV1(x=1, y="bad")
-ERROR: TypeError: in FooV1, in _y_T, expected _y_T<:Real, got Type{String}

This macro will throw a Legolas.SchemaVersionDeclarationError if:

  • The provided RecordType does not follow the $(Prefix)V$(n) format, where Prefix was previously associated with a given schema by a prior @schema declaration.
  • There are no declared field expressions, duplicate fields are declared, or a given declared field expression is invalid.
  • (if a parent is specified) The @version declaration does not comply with its parent's @version declaration, or the parent hasn't yet been declared at all.

Note that this macro expects to be evaluated within top-level scope.

For more details and examples, please see Legolas.jl/examples/tour.jl and the "Schema-Related Concepts/Conventions" section of the Legolas.jl documentation.

source
Legolas.is_valid_schema_nameFunction
Legolas.is_valid_schema_name(x::AbstractString)

Return true if x is a valid schema name, return false otherwise.

Valid schema names are lowercase, alphanumeric, and may contain hyphens or periods.

source
Legolas.parse_identifierFunction
Legolas.parse_identifier(id::AbstractString)

Given a valid schema version identifier id of the form:

$(names[1])@$(versions[1]) > $(names[2])@$(versions[2]) > ... > $(names[n])@$(versions[n])

return an n element Vector{SchemaVersion} whose ith element is SchemaVersion(names[i], versions[i]).

Throws an ArgumentError if the provided string is not a valid schema version identifier.

For details regarding valid schema version identifiers and their structure, see the "Schema-Related Concepts/Conventions" section of the Legolas.jl documentation.

source
Legolas.identifierFunction
Legolas.identifier(::Legolas.SchemaVersion)

Return this Legolas.SchemaVersion's fully qualified schema version identifier. This string is serialized as the "legolas_schema_qualified" field value in table metadata for tables written via Legolas.write.

source
Legolas.parentFunction
Legolas.parent(sv::Legolas.SchemaVersion)

Return the Legolas.SchemaVersion instance that corresponds to sv's declared parent.

source
Legolas.declared_fieldsFunction
Legolas.declared_fields(sv::Legolas.SchemaVersion)

Return a NamedTuple{...,Tuple{Vararg{DataType}}} whose fields take the form:

<name of field declared by `sv`> = <field's type>

If sv has a parent, the returned fields will include declared_fields(parent(sv)).

source
Legolas.declarationFunction
Legolas.declaration(sv::Legolas.SchemaVersion)

Return a Pair{String,Vector{NamedTuple}} of the form

schema_version_identifier::String => declared_field_infos::Vector{Legolas.DeclaredFieldInfo}

where DeclaredFieldInfo has the fields:

  • name::Symbol: the declared field's name
  • type::Union{Symbol,Expr}: the declared field's declared type constraint
  • parameterize::Bool: whether or not the declared field is exposed as a parameter
  • statement::Expr: the declared field's full assignment statement (as processed by @version, not necessarily as written)

Note that declaration is primarily intended to be used for interactive discovery purposes, and does not include the contents of declaration(parent(sv)).

source
Legolas.declaredFunction
Legolas.declared(sv::Legolas.SchemaVersion{name,version})

Return true if the schema version name@version has been declared via @version in the current Julia session; return false otherwise.

source
Legolas.find_violationFunction
Legolas.find_violation(ts::Tables.Schema, sv::Legolas.SchemaVersion)

For each field f::F declared by sv:

  • Define A = Legolas.accepted_field_type(sv, F)
  • If f::T is present in ts, ensure that T <: A or else immediately return f::Symbol => T::DataType.
  • If f isn't present in ts, ensure that Missing <: A or else immediately return f::Symbol => missing::Missing.

Otherwise, return nothing.

To return all violations instead of just the first, use Legolas.find_violations.

See also: Legolas.validate, Legolas.complies_with, Legolas.find_violations.

source
Legolas.find_violationsFunction
Legolas.find_violations(ts::Tables.Schema, sv::Legolas.SchemaVersion)

Return a Vector{Pair{Symbol,Union{Type,Missing}}} of all of ts's violations with respect to sv.

This function's notion of "violation" is defined by Legolas.find_violation, which immediately returns the first violation found; prefer to use that function instead of find_violations in situations where you only need to detect any violation instead of all violations.

See also: Legolas.validate, Legolas.complies_with, Legolas.find_violation.

source
Legolas.accepted_field_typeFunction
Legolas.accepted_field_type(sv::Legolas.SchemaVersion, T::Type)

Return the "maximal supertype" of T that is accepted by sv when evaluating a field of type >:T for schematic compliance via Legolas.find_violation; see that function's docstring for an explanation of this function's use in context.

SchemaVersion authors may overload this function to broaden particular type constraints that determine schematic compliance for their SchemaVersion, without needing to broaden the type constraints employed by their SchemaVersion's record type.

Legolas itself defines the following default overloads:

accepted_field_type(::SchemaVersion, T::Type) = T
+ERROR: TypeError: in FooV1, in _y_T, expected _y_T<:Real, got Type{String}

This macro will throw a Legolas.SchemaVersionDeclarationError if:

  • The provided RecordType does not follow the $(Prefix)V$(n) format, where Prefix was previously associated with a given schema by a prior @schema declaration.
  • There are no declared field expressions, duplicate fields are declared, or a given declared field expression is invalid.
  • (if a parent is specified) The @version declaration does not comply with its parent's @version declaration, or the parent hasn't yet been declared at all.

Note that this macro expects to be evaluated within top-level scope.

For more details and examples, please see Legolas.jl/examples/tour.jl and the "Schema-Related Concepts/Conventions" section of the Legolas.jl documentation.

source
Legolas.is_valid_schema_nameFunction
Legolas.is_valid_schema_name(x::AbstractString)

Return true if x is a valid schema name, return false otherwise.

Valid schema names are lowercase, alphanumeric, and may contain hyphens or periods.

source
Legolas.parse_identifierFunction
Legolas.parse_identifier(id::AbstractString)

Given a valid schema version identifier id of the form:

$(names[1])@$(versions[1]) > $(names[2])@$(versions[2]) > ... > $(names[n])@$(versions[n])

return an n element Vector{SchemaVersion} whose ith element is SchemaVersion(names[i], versions[i]).

Throws an ArgumentError if the provided string is not a valid schema version identifier.

For details regarding valid schema version identifiers and their structure, see the "Schema-Related Concepts/Conventions" section of the Legolas.jl documentation.

source
Legolas.identifierFunction
Legolas.identifier(::Legolas.SchemaVersion)

Return this Legolas.SchemaVersion's fully qualified schema version identifier. This string is serialized as the "legolas_schema_qualified" field value in table metadata for tables written via Legolas.write.

source
Legolas.parentFunction
Legolas.parent(sv::Legolas.SchemaVersion)

Return the Legolas.SchemaVersion instance that corresponds to sv's declared parent.

source
Legolas.declared_fieldsFunction
Legolas.declared_fields(sv::Legolas.SchemaVersion)

Return a NamedTuple{...,Tuple{Vararg{DataType}}} whose fields take the form:

<name of field declared by `sv`> = <field's type>

If sv has a parent, the returned fields will include declared_fields(parent(sv)).

source
Legolas.declarationFunction
Legolas.declaration(sv::Legolas.SchemaVersion)

Return a Pair{String,Vector{NamedTuple}} of the form

schema_version_identifier::String => declared_field_infos::Vector{Legolas.DeclaredFieldInfo}

where DeclaredFieldInfo has the fields:

  • name::Symbol: the declared field's name
  • type::Union{Symbol,Expr}: the declared field's declared type constraint
  • parameterize::Bool: whether or not the declared field is exposed as a parameter
  • statement::Expr: the declared field's full assignment statement (as processed by @version, not necessarily as written)

Note that declaration is primarily intended to be used for interactive discovery purposes, and does not include the contents of declaration(parent(sv)).

source
Legolas.declaredFunction
Legolas.declared(sv::Legolas.SchemaVersion{name,version})

Return true if the schema version name@version has been declared via @version in the current Julia session; return false otherwise.

source
Legolas.find_violationFunction
Legolas.find_violation(ts::Tables.Schema, sv::Legolas.SchemaVersion)

For each field f::F declared by sv:

  • Define A = Legolas.accepted_field_type(sv, F)
  • If f::T is present in ts, ensure that T <: A or else immediately return f::Symbol => T::DataType.
  • If f isn't present in ts, ensure that Missing <: A or else immediately return f::Symbol => missing::Missing.

Otherwise, return nothing.
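
The steps above can be sketched in plain Julia as follows. This is only an illustration of the documented control flow, not Legolas' implementation: `accepted` is a hypothetical stand-in for Legolas.accepted_field_type, `first_violation` is a hypothetical name, and a plain `Dict` of column names to types stands in for a Tables.Schema.

```julia
# Hypothetical stand-in for Legolas.accepted_field_type: accept each declared type as-is,
# except that a declared `String` accepts any `AbstractString`.
accepted(T::Type) = T
accepted(::Type{String}) = AbstractString

# `declared` is a collection of `name => declared_type` pairs;
# `present` maps the column names found in a table to their element types.
function first_violation(declared, present)
    for (name, F) in declared
        A = accepted(F)
        if haskey(present, name)
            T = present[name]
            # A present field must have a type accepted for the declared type.
            T <: A || return name => T
        else
            # An absent field is only acceptable if `missing` is allowed.
            Missing <: A || return name => missing
        end
    end
    return nothing
end

declared = [:id => Int, :label => String, :note => Union{String,Missing}]

first_violation(declared, Dict(:id => Int, :label => SubString{String}))  # nothing
first_violation(declared, Dict(:id => Float64, :label => String))         # :id => Float64
first_violation(declared, Dict(:id => Int))                               # :label => missing
```

In the first call the absent `note` field is not a violation because its declared type admits `missing`; in the last call the absent `label` field is a violation because `String` does not.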

To return all violations instead of just the first, use Legolas.find_violations.

See also: Legolas.validate, Legolas.complies_with, Legolas.find_violations.

source
Legolas.find_violationsFunction
Legolas.find_violations(ts::Tables.Schema, sv::Legolas.SchemaVersion)

Return a Vector{Pair{Symbol,Union{Type,Missing}}} of all of ts's violations with respect to sv.

This function's notion of "violation" is defined by Legolas.find_violation, which immediately returns the first violation found; prefer to use that function instead of find_violations in situations where you only need to detect any violation instead of all violations.

See also: Legolas.validate, Legolas.complies_with, Legolas.find_violation.

source
Legolas.accepted_field_typeFunction
Legolas.accepted_field_type(sv::Legolas.SchemaVersion, T::Type)

Return the "maximal supertype" of T that is accepted by sv when evaluating a field of type >:T for schematic compliance via Legolas.find_violation; see that function's docstring for an explanation of this function's use in context.

SchemaVersion authors may overload this function to broaden particular type constraints that determine schematic compliance for their SchemaVersion, without needing to broaden the type constraints employed by their SchemaVersion's record type.

Legolas itself defines the following default overloads:

accepted_field_type(::SchemaVersion, T::Type) = T
 accepted_field_type(::SchemaVersion, ::Type{Any}) = Any
 accepted_field_type(::SchemaVersion, ::Type{UUID}) = Union{UUID,UInt128}
 accepted_field_type(::SchemaVersion, ::Type{Symbol}) = Union{Symbol,AbstractString}
 accepted_field_type(::SchemaVersion, ::Type{String}) = AbstractString
 accepted_field_type(sv::SchemaVersion, ::Type{<:Vector{T}}) where T = AbstractVector{<:(accepted_field_type(sv, T))}
 accepted_field_type(::SchemaVersion, ::Type{Vector}) = AbstractVector
-accepted_field_type(sv::SchemaVersion, ::Type{Union{T,Missing}}) where {T} = Union{accepted_field_type(sv, T),Missing}

Outside of these default overloads, this function should only be overloaded against specific SchemaVersions that are authored within the same module as the overload definition; to do otherwise constitutes type piracy and should be avoided.

source

Validating/Writing/Reading Legolas Tables

Legolas.extract_schema_versionFunction
Legolas.extract_schema_version(table)

Attempt to extract Arrow metadata from table via Arrow.getmetadata(table).

If Arrow metadata is present and contains "legolas_schema_qualified" => s, return first(parse_identifier(s))

Otherwise, return nothing.

source
Legolas.writeFunction
Legolas.write(io_or_path, table, sv::SchemaVersion; validate::Bool=true, kwargs...)

Write table to io_or_path, inserting the appropriate legolas_schema_qualified field in the written out Arrow metadata.

If validate is true, Legolas.validate(Tables.schema(table), sv) will be invoked before the table is written out to io_or_path.

Any other provided kwargs are forwarded to an internal invocation of Arrow.write.

Note that io_or_path may be any type that supports Base.write(io_or_path, bytes::Vector{UInt8}).

source
Legolas.readFunction
Legolas.read(io_or_path; validate::Bool=true)

Read and return an Arrow.Table from io_or_path.

If validate is true, Legolas.read will attempt to extract a Legolas.SchemaVersion from the deserialized Arrow.Table's metadata and use Legolas.validate to verify that the table's Tables.Schema complies with the extracted Legolas.SchemaVersion before returning the table.

Note that io_or_path may be any type that supports Base.read(io_or_path)::Vector{UInt8}.

source
Legolas.tobufferFunction
Legolas.tobuffer(args...; kwargs...)

A convenience function that constructs a fresh io::IOBuffer, calls Legolas.write(io, args...; kwargs...), and returns seekstart(io).

Analogous to the Arrow.tobuffer function.

source

Utilities

Legolas.liftFunction
lift(f, x)

Return f(x) unless x isa Union{Nothing,Missing}, in which case return missing.

This is particularly useful when handling values from Arrow.Table, whose null values may present as either missing or nothing depending on how the table itself was originally constructed.

See also: construct

source
lift(f)

Returns a curried function, x -> lift(f,x)

source
Legolas.constructFunction
construct(T::Type, x)

Construct T(x) unless x is of type T, in which case return x itself. Useful in conjunction with the lift function for types that lack a constructor accepting instances of themselves (e.g. T(::T)).

Examples

julia> using Legolas: construct
+accepted_field_type(sv::SchemaVersion, ::Type{Union{T,Missing}}) where {T} = Union{accepted_field_type(sv, T),Missing}

Outside of these default overloads, this function should only be overloaded against specific SchemaVersions that are authored within the same module as the overload definition; to do otherwise constitutes type piracy and should be avoided.

source

Validating/Writing/Reading Legolas Tables

Legolas.extract_schema_versionFunction
Legolas.extract_schema_version(table)

Attempt to extract Arrow metadata from table via Arrow.getmetadata(table).

If Arrow metadata is present and contains "legolas_schema_qualified" => s, return first(parse_identifier(s))

Otherwise, return nothing.

source
Legolas.writeFunction
Legolas.write(io_or_path, table, sv::SchemaVersion; validate::Bool=true, kwargs...)

Write table to io_or_path, inserting the appropriate legolas_schema_qualified field in the written out Arrow metadata.

If validate is true, Legolas.validate(Tables.schema(table), sv) will be invoked before the table is written out to io_or_path.

Any other provided kwargs are forwarded to an internal invocation of Arrow.write.

Note that io_or_path may be any type that supports Base.write(io_or_path, bytes::Vector{UInt8}).

source
Legolas.readFunction
Legolas.read(io_or_path; validate::Bool=true)

Read and return an Arrow.Table from io_or_path.

If validate is true, Legolas.read will attempt to extract a Legolas.SchemaVersion from the deserialized Arrow.Table's metadata and use Legolas.validate to verify that the table's Tables.Schema complies with the extracted Legolas.SchemaVersion before returning the table.

Note that io_or_path may be any type that supports Base.read(io_or_path)::Vector{UInt8}.

source
Legolas.tobufferFunction
Legolas.tobuffer(args...; kwargs...)

A convenience function that constructs a fresh io::IOBuffer, calls Legolas.write(io, args...; kwargs...), and returns seekstart(io).

Analogous to the Arrow.tobuffer function.

source

Utilities

Legolas.liftFunction
lift(f, x)

Return f(x) unless x isa Union{Nothing,Missing}, in which case return missing.

This is particularly useful when handling values from Arrow.Table, whose null values may present as either missing or nothing depending on how the table itself was originally constructed.

See also: construct

source
lift(f)

Returns a curried function, x -> lift(f,x)

source
Legolas.constructFunction
construct(T::Type, x)

Construct T(x) unless x is of type T, in which case return x itself. Useful in conjunction with the lift function for types that lack a constructor accepting instances of themselves (e.g. T(::T)).

Examples

julia> using Legolas: construct
 
 julia> construct(Float64, 1)
 1.0
@@ -49,7 +49,7 @@
 Some(Some(1))
 
 julia> lift(construct(Some), Some(1))
-Some(1)
source
Legolas.record_mergeFunction
record_merge(record::AbstractRecord; fields_to_merge...)

Return a new AbstractRecord with the same schema version as record, whose fields are computed via Tables.rowmerge(record; fields_to_merge...). The returned record is constructed by passing these merged fields to the AbstractRecord constructor that matches the type of the input record.

source
Legolas.gatherFunction
Legolas.gather(column_name, tables...; extract=((table, idxs) -> view(table, idxs, :)))

Gather rows from tables into a unified cross-table index along column_name. Returns a Dict whose keys are the unique values of column_name across tables, and whose values are tuples of the form:

(rows_matching_key_in_table_1, rows_matching_key_in_table_2, ...)

The provided extract function is used to extract rows from each table; it takes as input a table and a Vector{Int} of row indices, and returns the corresponding subtable. The default definition is sufficient for DataFrames tables.

Note that this function may internally call Tables.columns on each input table, so it may be slower and/or require more memory if any(!Tables.columnaccess, tables).

Note that we intend to eventually migrate this function from Legolas.jl to a more appropriate package.

source
Legolas.locationsFunction
locations(collections::Tuple)

Return a Dict whose keys are the set of all elements across all provided collections, and whose values are the indices that locate each corresponding element across all provided collections.

Specifically, locations(collections)[k][i] will return a Vector{Int} whose elements are the index locations of k in collections[i]. If !(k in collections[i]), this Vector{Int} will be empty.

For example:

julia> Legolas.locations((['a', 'b', 'c', 'f', 'b'],
+Some(1)
source
Legolas.record_mergeFunction
record_merge(record::AbstractRecord; fields_to_merge...)

Return a new AbstractRecord with the same schema version as record, whose fields are computed via Tables.rowmerge(record; fields_to_merge...). The returned record is constructed by passing these merged fields to the AbstractRecord constructor that matches the type of the input record.

source
Legolas.gatherFunction
Legolas.gather(column_name, tables...; extract=((table, idxs) -> view(table, idxs, :)))

Gather rows from tables into a unified cross-table index along column_name. Returns a Dict whose keys are the unique values of column_name across tables, and whose values are tuples of the form:

(rows_matching_key_in_table_1, rows_matching_key_in_table_2, ...)

The provided extract function is used to extract rows from each table; it takes as input a table and a Vector{Int} of row indices, and returns the corresponding subtable. The default definition is sufficient for DataFrames tables.

Note that this function may internally call Tables.columns on each input table, so it may be slower and/or require more memory if any(!Tables.columnaccess, tables).

Note that we intend to eventually migrate this function from Legolas.jl to a more appropriate package.

source
Legolas.locationsFunction
locations(collections::Tuple)

Return a Dict whose keys are the set of all elements across all provided collections, and whose values are the indices that locate each corresponding element across all provided collections.

Specifically, locations(collections)[k][i] will return a Vector{Int} whose elements are the index locations of k in collections[i]. If !(k in collections[i]), this Vector{Int} will be empty.

For example:

julia> Legolas.locations((['a', 'b', 'c', 'f', 'b'],
                           ['d', 'c', 'e', 'b'],
                           ['f', 'a', 'f']))
 Dict{Char, Tuple{Vector{Int64}, Vector{Int64}, Vector{Int64}}} with 6 entries:
@@ -58,4 +58,4 @@
   'c' => ([3], [2], [])
   'd' => ([], [1], [])
   'e' => ([], [3], [])
-  'b' => ([2, 5], [4], [])

This function is useful as a building block for higher-level tabular operations that require indexing/grouping along specific sets of elements.

source
Legolas.materializeFunction
Legolas.materialize(table)

Return a fully deserialized copy of table.

This function is useful when table has built-in deserialize-on-access or conversion-on-access behavior (like Arrow.Table) and you'd like to pay such access costs upfront before repeatedly accessing the table.

Note that we intend to eventually migrate this function from Legolas.jl to a more appropriate package.

source
+  'b' => ([2, 5], [4], [])

This function is useful as a building block for higher-level tabular operations that require indexing/grouping along specific sets of elements.

source
Legolas.materializeFunction
Legolas.materialize(table)

Return a fully deserialized copy of table.

This function is useful when table has built-in deserialize-on-access or conversion-on-access behavior (like Arrow.Table) and you'd like to pay such access costs upfront before repeatedly accessing the table.

Note that we intend to eventually migrate this function from Legolas.jl to a more appropriate package.

source
diff --git a/previews/PR119/schema-concepts/index.html b/previews/PR119/schema-concepts/index.html
index 43602a4..08d91b4 100644
--- a/previews/PR119/schema-concepts/index.html
+++ b/previews/PR119/schema-concepts/index.html
@@ -1,3 +1,3 @@
 Schema-Related Concepts/Conventions · Legolas

Schema-Related Concepts/Conventions

Note

If you're a newcomer to Legolas.jl, please familiarize yourself with the tour before diving into this documentation.

Schema Version Identifiers

Legolas defines "schema version identifiers" as strings of the form:

  • name@version where:
    • name is a lowercase alphanumeric string and may include the special characters . and -.
    • version is a non-negative integer.
  • or, x>y where x and y are valid schema version identifiers and > denotes "extends from".

A schema version identifier is said to be fully qualified if it includes the identifiers of all ancestors of the particular schema version that it directly identifies.
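
The identifier grammar above can be sketched as a small validator. This is a minimal illustration in plain Julia, not Legolas' own parser: the names SEGMENT_PATTERN, is_valid_identifier, and parse_identifier are hypothetical and exist only for this example.

```julia
# Hypothetical helpers for illustration only; not part of Legolas' API.

# One `name@version` segment: lowercase alphanumerics plus `.` and `-`,
# then `@`, then a non-negative integer.
const SEGMENT_PATTERN = r"^[a-z0-9.-]+@[0-9]+$"

# An identifier is valid if every `>`-separated segment is a valid `name@version`.
function is_valid_identifier(id::AbstractString)
    return all(segment -> occursin(SEGMENT_PATTERN, segment), split(id, '>'))
end

# Split an identifier into (name, version) pairs: the identified schema
# version first, followed by the ancestors it lists.
function parse_identifier(id::AbstractString)
    is_valid_identifier(id) || throw(ArgumentError("invalid schema version identifier: $id"))
    return map(split(id, '>')) do segment
        name, version = split(segment, '@')
        (name = String(name), version = parse(Int, version))
    end
end

is_valid_identifier("foo.automobile@1")                    # true
is_valid_identifier("bar.automobile@1>foo.automobile@1")   # true
is_valid_identifier("foo.automobile")                      # false: no version
```

Note that this sketch only checks syntax; deciding whether an identifier is fully qualified additionally requires knowing each schema version's actual ancestors.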

Schema authors should follow the conventions below when choosing the name of a new schema:

  1. Include a namespace. For example, assuming the schema is defined in a package Foo.jl, foo.automobile is good, automobile is bad.
  2. Prefer singular over plural. For example, foo.automobile is good, foo.automobiles is bad.
  3. Don't "overqualify" a schema name with ancestor-derived information that is better captured by the fully qualified identifier of a specific schema version. For example, bar.automobile should be preferred over bar.foo.automobile, since bar.automobile@1>foo.automobile@1 is preferable to bar.foo.automobile@1>foo.automobile@1. Similarly, baz.supercar should be preferred over baz.automobile.supercar, since baz.supercar@1>bar.automobile@1 is preferable to baz.automobile.supercar@1>bar.automobile@1.

Schema Versioning: You Break It, You Bump It

While it is fairly established practice to semantically version source code, the world of data/artifact versioning is a bit more varied. As presented in the tour, each Legolas.SchemaVersion carries a single version integer. The central rule that governs Legolas' schema versioning approach is:

Do not introduce a change to an existing schema version that might cause existing compliant data to become non-compliant; instead, incorporate the intended change in a new schema version whose version number is one greater than the previous version number.

A schema author must introduce a new schema version if any of the following changes are introduced:

  • A new type-constrained and/or value-constrained field is declared. In other words, for the introduction of a new declared field to be non-breaking, the new field's type constraint must be ::Any and it may not feature a value-constraining or value-transforming assignment expression.
  • An existing declared field's type or value constraints are tightened.
  • An existing declared field is renamed.

If any of the above breaking changes is made directly to an existing schema version, rather than introduced via a new schema version, subtle downstream breakage may occur. For example, if a new type/value-constrained field is declared, previously compliant tables containing a field with the same name might accidentally become non-compliant if existing values violate the new constraints. Similarly, downstream schema version extensions may have already declared a field with the same name, but with constraints that are incompatible with the new constraints.

One benefit of Legolas' approach is that multiple schema versions may be defined in the same codebase, e.g. there's nothing that prevents @version(FooV1, ...) and @version(FooV2, ...) from being defined and utilized simultaneously. The source code that defines any given Legolas schema version and/or consumes/produces Legolas tables is presumably already semantically versioned, such that consumer/producer packages can determine their compatibility with each other in the usual manner via interpreting major/minor/patch increments.

Note that it is preferable to avoid introducing new versions of an existing schema, if possible, in order to minimize code/data churn for downstream producers/consumers. Thus, authors should prefer conservative field type restrictions from the get-go. Remember: loosening a field type restriction is not a breaking change, but tightening one is.
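
To make the breakage rule concrete, here is a deliberately simplified model of compliance in plain Julia. It is a sketch under stated assumptions, not Legolas' implementation: a "schema version" is modeled as a mapping from declared field names to type constraints, an absent field is treated as missing, and the function name complies is hypothetical.

```julia
# Hypothetical, simplified model of schema compliance; not Legolas' implementation.
# A row complies if every declared field's value (or `missing`, if the field
# is absent) satisfies the declared type constraint.
complies(row::NamedTuple, constraints) =
    all(get(row, field, missing) isa T for (field, T) in constraints)

declared_v1 = Dict(:id => Int)
existing_row = (id = 1, color = "red")   # carries an undeclared `color` field

# Declaring `color::Any` on the existing version keeps old data compliant ...
loosely_declared = Dict(:id => Int, :color => Any)
# ... but a constrained `color` field does not, even if it allows `missing`.
tightly_declared = Dict(:id => Int, :color => Union{Missing,Int})

complies(existing_row, declared_v1)        # true
complies(existing_row, loosely_declared)   # true
complies(existing_row, tightly_declared)   # false: "red" is not an Int
```

The last call shows why such a field belongs in a new schema version: data written against the old declaration may already hold values that the new constraint rejects.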

Important Expectations Regarding Custom Field Assignments

Schema authors should ensure that their @version declarations meet two important expectations so that generated record types behave as intended:

  1. Custom field assignments should preserve the idempotency of record type constructors.
  2. Custom field assignments should not observe mutable non-local state.
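
The two expectations can be illustrated with plain Julia functions standing in for record constructors. This is only a sketch: good_record, bad_record, stateful_record, and COUNTER are hypothetical names, and real record types would be generated by @version rather than written by hand.

```julia
# Hypothetical stand-ins for generated record constructors.

# Idempotent: normalizing an already-normalized value changes nothing.
good_record(fields) = (name = lowercase(fields.name),)

# Not idempotent: each pass through the constructor appends another suffix.
bad_record(fields) = (name = fields.name * "_v1",)

# Observes mutable non-local state: the result depends on how often it was called.
const COUNTER = Ref(0)
function stateful_record(fields)
    COUNTER[] += 1
    return (name = fields.name, id = COUNTER[])
end

fields = (name = "Foo",)
good_record(good_record(fields)) == good_record(fields)   # true
bad_record(bad_record(fields)) == bad_record(fields)      # false
stateful_record(fields) == stateful_record(fields)        # false
```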

Thus, given a Legolas-generated record type R, the following should hold for all valid values of fields:

R(R(fields)) == R(fields)
-R(fields) == R(fields)
+R(fields) == R(fields)
diff --git a/previews/PR119/upgrade/index.html b/previews/PR119/upgrade/index.html
index e09917d..663f02c 100644
--- a/previews/PR119/upgrade/index.html
+++ b/previews/PR119/upgrade/index.html
@@ -1,2 +1,2 @@
-Upgrading from v0.4 to v0.5 · Legolas

Upgrading from Legolas v0.4 to v0.5

This guide is incomplete; please add to it if you encounter items which would help other upgraders along their journey.

See here for a comprehensive log of changes from Legolas v0.4 to Legolas v0.5.

Some main changes to be aware of

  • In Legolas v0.4, every Legolas.Row field's type was available as a type parameter of Legolas.Row; for example, the type of a field y specified as y::Real in a Legolas.@row declaration would be surfaced like Legolas.Row{..., NamedTuple{(...,:y,...),Tuple{...,typeof(y),...}}}. In Legolas v0.5, the schema version author controls which fields have their types surfaced as type parameters in Legolas-generated record types via the field::(<:F) syntax in Legolas.@version.
    • Additionally, to include type parameters associated with fields in a parent schema, they must be re-declared in the child schema. For example, the package LegolasFlux declares a ModelV1 version with a field weights::(<:Union{Missing,Weights}). LegolasFlux includes an example with a schema extension DigitsRowV1 which extends ModelV1. This @version call must re-declare the field weights to be parametric in order for the DigitsRowV1 struct to also have a type parameter for this field.
  • In Legolas v0.4, @row-generated Legolas.Row constructors accepted and propagated any non-schema-declared fields provided by the caller. In Legolas v0.5, @version-generated record type constructors will discard any non-schema-declared fields provided by the caller. When upgrading code that formerly "implicitly extended" a given schema version by propagating non-declared fields, it is advisable to instead explicitly declare a new extension of the schema version to capture the propagated fields as declared fields; or, if it makes more sense for a given use case, one may instead define a new schema version that adds these propagated fields as declared fields directly to the schema (likely declared as ::Union{Missing,T} to allow them to be missing).
  • Before Legolas v0.5, the documented guidance for schema authors surrounding new fields' impact on schema version breakage was misleading, implying that adding a new declared field to an existing schema version is non-breaking if the field's type allowed for Missing values. This is incorrect. For clarity, adding a new declared field to an existing schema version is a breaking change unless the field's type and value are both completely unconstrained in the declaration, i.e. the field's type constraint must be ::Any and may not feature a value-constraining or value-transforming assignment expression.

Deserializing old tables with Legolas v0.5

Generally, tables serialized with earlier versions of Legolas can be deserialized with Legolas v0.5, making it only a "code-breaking" change, rather than a "data-breaking" change. However, to be sure, it is strongly suggested to have reference tests in which checked-in (pre-Legolas v0.5) serialized tables are deserialized and verified.

Additionally, serialized Arrow tables containing nested Legolas-v0.4-defined Legolas.Row values (i.e. a table that contains a row that has a field that is, itself, a Legolas.Row value, or contains such values) require special handling to deserialize under Legolas v0.5, if you wish users to be able to deserialize them with Legolas.read using the Legolas-v0.5-ready version of your package. Note that these tables are still deserializable as plain Arrow tables regardless, so it may not be worthwhile to provide a bespoke deprecation/compatibility pathway in the Legolas-v0.5-ready version of your package unless your use case merits it (i.e. the impact surface would be high for your package's users).

If you would like to provide such a pathway, though:

Recall that under Legolas v0.4, @row-generated Legolas.Row constructors may accept and propagate arbitrary non-schema-declared fields, whereas Legolas v0.5's @version-generated record types may only contain schema-declared fields. Therefore, one must decide what to do with any non-declared fields present in serialized Legolas.Row values upon deserialization. A common approach is to implement a deprecation/compatibility pathway within the relevant surrounding @version declaration. For example, this LegolasFlux example uses a function compat_config to handle old Legolas.Row values, but does not add any handling for non-declared fields, which will be discarded if present. If one did not want non-declared fields to be discarded, these fields could be handled by throwing an error or warning, or defining a schema version extension that captured them, or defining a new version of the relevant schema to capture them (e.g. adding a field like extras::Union{Missing, NamedTuple}).

+Upgrading from v0.4 to v0.5 · Legolas

Upgrading from Legolas v0.4 to v0.5

This guide is incomplete; please add to it if you encounter items which would help other upgraders along their journey.

See here for a comprehensive log of changes from Legolas v0.4 to Legolas v0.5.

Some main changes to be aware of

  • In Legolas v0.4, every Legolas.Row field's type was available as a type parameter of Legolas.Row; for example, the type of a field y specified as y::Real in a Legolas.@row declaration would be surfaced like Legolas.Row{..., NamedTuple{(...,:y,...),Tuple{...,typeof(y),...}}}. In Legolas v0.5, the schema version author controls which fields have their types surfaced as type parameters in Legolas-generated record types via the field::(<:F) syntax in Legolas.@version.
    • Additionally, to include type parameters associated with fields in a parent schema, they must be re-declared in the child schema. For example, the package LegolasFlux declares a ModelV1 version with a field weights::(<:Union{Missing,Weights}). LegolasFlux includes an example with a schema extension DigitsRowV1 which extends ModelV1. This @version call must re-declare the field weights to be parametric in order for the DigitsRowV1 struct to also have a type parameter for this field.
  • In Legolas v0.4, @row-generated Legolas.Row constructors accepted and propagated any non-schema-declared fields provided by the caller. In Legolas v0.5, @version-generated record type constructors will discard any non-schema-declared fields provided by the caller. When upgrading code that formerly "implicitly extended" a given schema version by propagating non-declared fields, it is advisable to instead explicitly declare a new extension of the schema version to capture the propagated fields as declared fields; or, if it makes more sense for a given use case, one may instead define a new schema version that adds these propagated fields as declared fields directly to the schema (likely declared as ::Union{Missing,T} to allow them to be missing).
  • Before Legolas v0.5, the documented guidance for schema authors surrounding new fields' impact on schema version breakage was misleading, implying that adding a new declared field to an existing schema version is non-breaking if the field's type allowed for Missing values. This is incorrect. For clarity, adding a new declared field to an existing schema version is a breaking change unless the field's type and value are both completely unconstrained in the declaration, i.e. the field's type constraint must be ::Any and may not feature a value-constraining or value-transforming assignment expression.

Deserializing old tables with Legolas v0.5

Generally, tables serialized with earlier versions of Legolas can be deserialized with Legolas v0.5, making it only a "code-breaking" change, rather than a "data-breaking" change. However, to be sure, it is strongly suggested to have reference tests in which checked-in (pre-Legolas v0.5) serialized tables are deserialized and verified.

Additionally, serialized Arrow tables containing nested Legolas-v0.4-defined Legolas.Row values (i.e. a table that contains a row that has a field that is, itself, a Legolas.Row value, or contains such values) require special handling to deserialize under Legolas v0.5, if you wish users to be able to deserialize them with Legolas.read using the Legolas-v0.5-ready version of your package. Note that these tables are still deserializable as plain Arrow tables regardless, so it may not be worthwhile to provide a bespoke deprecation/compatibility pathway in the Legolas-v0.5-ready version of your package unless your use case merits it (i.e. the impact surface would be high for your package's users).

If you would like to provide such a pathway, though:

Recall that under Legolas v0.4, @row-generated Legolas.Row constructors may accept and propagate arbitrary non-schema-declared fields, whereas Legolas v0.5's @version-generated record types may only contain schema-declared fields. Therefore, one must decide what to do with any non-declared fields present in serialized Legolas.Row values upon deserialization. A common approach is to implement a deprecation/compatibility pathway within the relevant surrounding @version declaration. For example, this LegolasFlux example uses a function compat_config to handle old Legolas.Row values, but does not add any handling for non-declared fields, which will be discarded if present. If one did not want non-declared fields to be discarded, these fields could be handled by throwing an error or warning, or defining a schema version extension that captured them, or defining a new version of the relevant schema to capture them (e.g. adding a field like extras::Union{Missing, NamedTuple}).
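
As a rough sketch of the last option (capturing non-declared fields in an extras field), the following plain-Julia helper separates a row's declared fields from everything else. The function split_declared and the sample row are hypothetical and not part of Legolas or LegolasFlux; a real compatibility pathway would live in or around the relevant @version declaration.

```julia
# Hypothetical helper, not part of Legolas: keep the fields a schema version
# declares and gather all remaining fields under a single `extras` field.
function split_declared(row::NamedTuple, declared::Tuple{Vararg{Symbol}})
    kept_names = Tuple(k for k in keys(row) if k in declared)
    extra_names = Tuple(k for k in keys(row) if !(k in declared))
    declared_fields = NamedTuple{kept_names}(row)
    # Use `missing` when there is nothing to capture, matching a field
    # declared as `extras::Union{Missing, NamedTuple}`.
    extras = isempty(extra_names) ? missing : NamedTuple{extra_names}(row)
    return merge(declared_fields, (extras = extras,))
end

old_row = (x = 1, y = "a", note = "legacy")
split_declared(old_row, (:x, :y))
# (x = 1, y = "a", extras = (note = "legacy",))
```

Capturing the leftovers explicitly keeps the old data recoverable while still producing a value that only contains declared fields.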