refactor package to remove sharp edges for schema authors/users (particularly `@row` / `Row`) and improve API (#54)

Co-authored-by: Seth Chapman <[email protected]>
Co-authored-by: Alex Arslan <[email protected]>
Co-authored-by: Eric Hanson <[email protected]>
4 people authored Oct 27, 2022
1 parent 3a97820 commit 1561ef4
Showing 16 changed files with 1,581 additions and 809 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/CI.yml
Original file line number Diff line number Diff line change
@@ -15,7 +15,7 @@ jobs:
matrix:
version:
- '1'
- '1.3'
- '1.6'
os:
- ubuntu-latest
arch:
8 changes: 5 additions & 3 deletions Project.toml
@@ -1,21 +1,23 @@
name = "Legolas"
uuid = "741b9549-f6ed-4911-9fbf-4a1c0c97f0cd"
authors = ["Beacon Biosignals, Inc."]
version = "0.4.0"
version = "0.5.0"

[deps]
Arrow = "69666777-d1a9-59fb-9406-91d4454c9d45"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
UUIDs = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"

[compat]
Arrow = "2"
DataFrames = "1"
Tables = "1.4"
julia = "1.3"
julia = "1.6"

[extras]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
UUIDs = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"

[targets]
test = ["Test", "DataFrames"]
test = ["Test", "DataFrames", "UUIDs"]
4 changes: 2 additions & 2 deletions docs/make.jl
@@ -5,8 +5,8 @@ makedocs(modules=[Legolas],
sitename="Legolas",
authors="Beacon Biosignals, Inc.",
pages=["API Documentation" => "index.md",
"Tips For Schema Authors" => "schema.md",
"Legolas Table Specification" => "specification.md",
"Schema-Related Concepts/Conventions" => "schema-concepts.md",
"Arrow-Related Concepts/Conventions" => "arrow-concepts.md",
"FAQ" => "faq.md"])

deploydocs(repo="github.com/beacon-biosignals/Legolas.jl.git",
19 changes: 19 additions & 0 deletions docs/src/arrow-concepts.md
@@ -0,0 +1,19 @@
# Arrow-Related Concepts/Conventions

!!! note

    If you're a newcomer to Legolas.jl, please familiarize yourself with the [tour](https://github.com/beacon-biosignals/Legolas.jl/blob/main/examples/tour.jl) before diving into this documentation.

Legolas.jl's target (de)serialization format, [Arrow](https://arrow.apache.org/), already features wide cross-language adoption, enabling Legolas-serialized tables to be seamlessly read into many non-Julia environments. This documentation section contains conventions related to Legolas-serialized Arrow tables that may be observable by generic Legolas-unaware Arrow consumers.

## Supporting Legolas Schema Discovery In Arrow Tables

Legolas defines a special field `legolas_schema_qualified` that Legolas-aware Arrow writers may include in an Arrow table's table-level metadata to indicate a particular Legolas schema with which the table complies.

Arrow tables which include this field are considered to "support Legolas schema discovery" and are referred to as "Legolas-discoverable", since Legolas consumers may employ this field to automatically match the table against available application-layer Legolas schema definitions.

If present, the `legolas_schema_qualified` field's value must be a [fully qualified schema version identifier](@ref schema_version_identifier_specification).

## Arrow File Naming Conventions

When writing a Legolas-discoverable Arrow table to a file, prefer using the file extension `*.<schema name>.arrow`. For example, if the file's table's full Legolas schema version identifier is `baz.supercar@1>bar.automobile@1`, use the file extension `*.baz.supercar.arrow`.
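
As a rough illustration of this naming convention, the sketch below derives a file name from a fully qualified schema version identifier using plain string handling. The helper `discoverable_filename` and the file stem `"cars"` are hypothetical and are not part of Legolas.jl; the sketch only assumes that the identifier's first `>`-separated segment names the table's own schema version.

```julia
# Hypothetical helper for illustration only; not part of Legolas.jl's API.
function discoverable_filename(stem::AbstractString, identifier::AbstractString)
    # The first segment of a fully qualified identifier refers to the table's
    # own schema version; later segments refer to its ancestors.
    child = first(split(identifier, '>'))
    # Drop the `@version` suffix to recover the schema name.
    schema_name = first(split(child, '@'))
    return string(stem, '.', schema_name, ".arrow")
end

discoverable_filename("cars", "baz.supercar@1>bar.automobile@1")  # "cars.baz.supercar.arrow"
```

Only the schema name appears in the file name here; the version and ancestry remain recoverable from the table's `legolas_schema_qualified` metadata field.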
10 changes: 7 additions & 3 deletions docs/src/faq.md
@@ -2,12 +2,16 @@

## What is the point of Legolas.jl? Who benefits from using it?

At its core, Legolas.jl provides a lightweight, expressive set of mechanisms/patterns for generating `Tables.AbstractRow` types in a manner that enables schema composability, extensibility and a few nice utilties on top.
At its core, Legolas.jl provides a lightweight, expressive set of mechanisms/patterns for wrangling Tables.jl-compliant values in a manner that enables schema composability, extensibility and a few nice utilities on top.

The package originated from code developed internally at Beacon to wrangle heterogeneous Arrow datasets, and is thus probably mostly useful for folks in a similar situation. If you're curating tabular datasets and you'd like to build shared Julia tools atop the schemas therein, then Legolas.jl may be worth checking out.

## Why does Legolas.jl support Arrow as a (de)serialization target, but not, say, JSON?

Technically, Legolas.jl's core `Row`/`Schema` functionality is totally agnostic to (de)serialization and could be useful for anybody who wants to generate new `Tables.AbstractRow` types.
Technically, Legolas.jl's core `@schema`/`@version` functionality is agnostic to (de)serialization and could be useful for anybody who wants to wrangle Tables.jl-compliant values.

Otherwise, with regards to (de)serialization-specific functionality, Beacon has put effort into ensuring Legolas.jl works well with [Arrow.jl](https://github.com/JuliaData/Arrow.jl) "by default" simply because we're heavy users of the Arrow format. There's nothing stopping users from composing the package with [JSON3.jl](https://github.com/quinnj/JSON3.jl) or other packages.
Otherwise, with regards to (de)serialization-specific functionality, Beacon has put effort into ensuring Legolas.jl works well with [Arrow.jl](https://github.com/JuliaData/Arrow.jl) "by default" simply because we're heavy users of the Arrow format. There's nothing stopping users from composing the package with [JSON3.jl](https://github.com/quinnj/JSON3.jl) or other packages.

## Why are Legolas.jl's generated record types defined the way that they are? For example, why is the version number hardcoded in the type name?

Many of Legolas' current choices on this front stem from refactoring efforts undertaken as part of [this pull request](https://github.com/beacon-biosignals/Legolas.jl/pull/54), and directly resulted from a [design mini-investigation](https://gist.github.com/jrevels/fdfe939109bee23566d425440b7c759e) associated with those efforts.
30 changes: 19 additions & 11 deletions docs/src/index.md
@@ -1,29 +1,37 @@
# API Documentation

If you're a newcomer to Legolas.jl, please familiarize yourself with via the [tour](https://github.com/beacon-biosignals/Legolas.jl/blob/master/examples/tour.jl) before diving into this documentation.
!!! note

    If you're a newcomer to Legolas.jl, please familiarize yourself with the [tour](https://github.com/beacon-biosignals/Legolas.jl/blob/main/examples/tour.jl) before diving into this documentation.

```@meta
CurrentModule = Legolas
```

## Legolas `Schema`s and `Row`s
## Legolas `Schema`s

```@docs
Legolas.@row
Legolas.Row
Legolas.Schema
Legolas.SchemaVersion
Legolas.@schema
Legolas.@version
Legolas.is_valid_schema_name
Legolas.schema_name
Legolas.schema_version
Legolas.schema_qualified_string
Legolas.schema_parent
Legolas.parse_identifier
Legolas.name
Legolas.version
Legolas.identifier
Legolas.parent
Legolas.required_fields
Legolas.declaration
Legolas.declared
Legolas.find_violation
Legolas.complies_with
Legolas.validate
```

## Validating/Writing/Reading Legolas Tables

```@docs
Legolas.extract_schema
Legolas.validate
Legolas.extract_schema_version
Legolas.write
Legolas.read
```
52 changes: 52 additions & 0 deletions docs/src/schema-concepts.md
@@ -0,0 +1,52 @@
# Schema-Related Concepts/Conventions

!!! note

    If you're a newcomer to Legolas.jl, please familiarize yourself with the [tour](https://github.com/beacon-biosignals/Legolas.jl/blob/main/examples/tour.jl) before diving into this documentation.

## [Schema Version Identifiers](@id schema_version_identifier_specification)

Legolas defines "schema version identifiers" as strings of the form:

- `name@version` where:
    - `name` is a lowercase alphanumeric string and may include the special characters `.` and `-`.
    - `version` is a non-negative integer.
- or, `x>y` where `x` and `y` are valid schema version identifiers and `>` denotes "extends from".

A schema version identifier is said to be *fully qualified* if it includes the identifiers of all ancestors of the particular schema version that it directly identifies.
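
The identifier grammar above can be sketched with a small regex-based parser. This is an illustrative approximation written in plain Julia, not Legolas.jl's own parsing logic; `SEGMENT_PATTERN` and `split_identifier` are hypothetical names. Note that a purely syntactic check like this cannot tell whether an identifier is fully qualified, since that depends on the schema version's actual ancestry.

```julia
# Hypothetical names for illustration only; not part of Legolas.jl's API.
# One `name@version` segment: lowercase alphanumerics plus `.` and `-`,
# followed by `@` and a non-negative integer.
const SEGMENT_PATTERN = r"^([a-z0-9.-]+)@([0-9]+)\z"

# Split an identifier into `(name, version)` pairs, child first.
# Returns `nothing` if any `>`-separated segment is malformed.
function split_identifier(id::AbstractString)
    pairs = Tuple{String,Int}[]
    for segment in split(id, '>')
        m = match(SEGMENT_PATTERN, segment)
        m === nothing && return nothing
        push!(pairs, (String(m.captures[1]), parse(Int, m.captures[2])))
    end
    return pairs
end

split_identifier("baz.supercar@1>bar.automobile@1")  # [("baz.supercar", 1), ("bar.automobile", 1)]
split_identifier("Foo@1")                            # nothing (uppercase is not allowed)
```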

Schema authors should follow the below conventions when choosing the name of a new schema:

1. Include a namespace. For example, assuming the schema is defined in a package Foo.jl, `foo.automobile` is good, `automobile` is bad.
2. Prefer singular over plural. For example, `foo.automobile` is good, `foo.automobiles` is bad.
3. Don't "overqualify" a schema name with ancestor-derived information that is better captured by the fully qualified identifier of a specific schema version. For example, `bar.automobile` should be preferred over `bar.foo.automobile`, since `bar.automobile@1>foo.automobile@1` is preferable to `bar.foo.automobile@1>foo.automobile@1`. Similarly, `baz.supercar` should be preferred over `baz.automobile.supercar`, since `baz.supercar@1>bar.automobile@1` is preferable to `baz.automobile.supercar@1>bar.automobile@1`.

## Schema Versioning: You Break It, You Bump It

While it is fairly established practice to [semantically version source code](https://semver.org/), the world of data/artifact versioning is a bit more varied. As presented in the tour, each `Legolas.SchemaVersion` carries a single version integer. The central rule that governs Legolas' schema versioning approach is:

**Do not introduce a change to an existing schema version that might cause existing compliant data to become non-compliant; instead, incorporate the intended change in a new schema version whose version number is one greater than the previous version number.**

For example, a schema author must introduce a new schema version for any of the following changes:

- A new type-restricted required field is added to the schema.
- An existing required field's type restriction is tightened.
- An existing required field is renamed.

One benefit of Legolas' approach is that multiple schema versions may be defined in the same codebase, e.g. there's nothing that prevents `@version(FooV1, ...)` and `@version(FooV2, ...)` from being defined and utilized simultaneously. The source code that defines any given Legolas schema version and/or consumes/produces Legolas tables is presumably already semantically versioned, such that consumer/producer packages can determine their compatibility with each other in the usual manner via interpreting major/minor/patch increments.

Note that it is preferable to avoid introducing new versions of an existing schema, if possible, in order to minimize code/data churn for downstream producers/consumers. Thus, authors should prefer conservative field type restrictions from the get-go. Remember: loosening a field type restriction is not a breaking change, but tightening one is.

## Important Expectations Regarding Custom Field Assignments

Schema authors should ensure that their `@version` declarations meet two important expectations so that generated record types behave as intended:

1. Custom field assignments should preserve the [idempotency](https://en.wikipedia.org/wiki/Idempotence) of record type constructors.
2. Custom field assignments should not observe mutable non-local state.

Thus, given a Legolas-generated record type `R`, the following should hold for all valid values of `fields`:

```jl
R(R(fields)) == R(fields)
R(fields) == R(fields)
```
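
To make these two expectations concrete, here is a minimal sketch that uses a hand-written struct as a stand-in for a generated record type. `Measurement` and its normalizing constructor are hypothetical; real Legolas record types are generated by `@version`, and this sketch does not reproduce how they are implemented.

```julia
# Hypothetical stand-in for a generated record type; not produced by `@version`.
struct Measurement
    label::String
    value::Float64
end

# Field-wise equality, so that the checks below compare contents.
Base.:(==)(a::Measurement, b::Measurement) = a.label == b.label && a.value == b.value

# Stand-in for custom field assignments: trim and lowercase `label`, convert `value`.
# Normalizing an already-normalized label is a no-op, and nothing here reads
# mutable non-local state, so both expectations hold.
Measurement(fields) = Measurement(lowercase(strip(fields.label)), Float64(fields.value))

fields = (label = "  Heart Rate ", value = 72)
@assert Measurement(Measurement(fields)) == Measurement(fields)  # idempotent
@assert Measurement(fields) == Measurement(fields)               # deterministic
```

By contrast, an assignment that appended a suffix to `label` on every construction would break the first property, and one that read the current time or a global counter would break the second.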
41 changes: 0 additions & 41 deletions docs/src/schema.md

This file was deleted.

14 changes: 0 additions & 14 deletions docs/src/specification.md

This file was deleted.


2 comments on commit 1561ef4

@jrevels
Member Author


@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/71186

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v0.5.0 -m "<description of version>" 1561ef499f061a593cdaf4666de7bb04533015ae
git push origin v0.5.0
