Machine readable schema - request #418

Closed
augean opened this issue Jan 12, 2024 · 11 comments

Comments

@augean

augean commented Jan 12, 2024

XML and JSON schemas are usually machine-readable
Requesting the same for GEDCOM 7

Please see the attached files, for an example of a machine-readable schema that I created for GEDCOM 5.1.1
This allows me to create new GEDCOM files easily, and is very easy for tools to interact with

Please could we have a machine-readable GEDCOM 7 schema? (Please use the attached files as an example.)
A single file (or even multiple files) that allows us to parse GEDCOM easily.
ged.5.1.1.txt
PrimSection.txt
structure

thanks

@tychonievich
Collaborator

7.0 has a machine-readable schema.

The line syntax is defined in the spec using ABNF, which is automatically extracted as grammar.abnf. At least one of the public gedcom parsers uses this grammar to parse lines.
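As a rough illustration of parsing lines from that grammar, here is a minimal Python sketch. The regular expression is a hand-written simplification of the line rule (level, optional cross-reference identifier, tag, optional value), not a copy of grammar.abnf. The names `LINE_RE`, `GedcomLine`, and `parse_line` are hypothetical, and the value is not checked against the pointer or escaping rules.

```python
import re
from typing import NamedTuple, Optional

# Simplified, hand-written approximation of the line grammar (an assumption,
# not the official grammar.abnf): Level, optional Xref, Tag, optional value.
LINE_RE = re.compile(
    r"(?P<level>0|[1-9][0-9]*)"              # no leading zeros
    r" (?:(?P<xref>@[A-Z0-9_]+@) )?"         # optional cross-reference id
    r"(?P<tag>[A-Z][A-Z0-9_]*|_[A-Z0-9_]+)"  # standard or extension tag
    r"(?: (?P<value>[^\r\n]*))?"             # optional payload, not validated
)


class GedcomLine(NamedTuple):
    level: int
    xref: Optional[str]
    tag: str
    value: Optional[str]


def parse_line(text: str) -> GedcomLine:
    """Split one GEDCOM line into its parts, or raise ValueError."""
    match = LINE_RE.fullmatch(text.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a well-formed line: {text!r}")
    return GedcomLine(
        int(match["level"]), match["xref"], match["tag"], match["value"]
    )
```

For example, `parse_line("0 @I1@ INDI")` yields level 0, xref `@I1@`, tag `INDI`, and no value, while a line with a leading-zero level such as `01 NAME x` is rejected.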

The structure hierarchy is defined in the spec using a machine-readable variant of the metasyntax created for 5.0, which is automatically extracted as grammar.gedstruct. The structure hierarchy is also converted to a different machine-readable form as a set of YAML files hosted in several places including the URI of each structure type (e.g. https://gedcom.io/terms/v7/ABBR) and in a separate repository of both standard and extension structures (GEDCOM-registries). Multiple public gedcom parsers and development aids use one or both of these to parse and validate structure hierarchies.

These machine-parseable formats are not perfect (for example, we lack a machine-parseable way of marking something as deprecated) and we'd welcome suggestions on how to improve them. I did not look at your attached files closely enough to know if you have features the standard currently lacks.

@dthaler
Collaborator

dthaler commented Jan 18, 2024

Discussion in GEDCOM Steering Committee 1/18/2024:
We have a machine-readable schema.
We have machine readable positive test cases in the GEDCOM.io repository.
We currently don't have machine readable negative test cases, such as appear in PrimSection.txt
Would others find that useful to have somewhere?

@dthaler
Collaborator

dthaler commented Jan 18, 2024

Closing since original question has been answered, and follow-up discussion can be done in #422

@dthaler dthaler closed this as completed Jan 18, 2024
@augean
Author

augean commented Jan 19, 2024

1) All the comments are stripped out of the machine-readable schema.
The comments are VERY important to keep in; the schema is very difficult to use without them.
(Please see ged.5.1.1.txt, where I maintained the comments in the machine-readable form.)

2) There is no machine-readable file with regular expressions and comments defining the primitive types.
Please see my PrimSection.txt, where I have the primitive types along with descriptions and regular expressions (and examples!).

3) The spec is fragmented across too many different files, making it very complex to parse.
(Please see the attached files, where I used just 2 files.)

Citing the above 3 reasons, I think the schema is not fully machine-readable:
- very important information, such as comments, is left out of the machine-readable version
- the regular expressions, which are critical, are left out of any machine-readable version
- the spec is fragmented across too many files

Please review the attached ged.5.1.1.txt and PrimSection.txt
which shows how the above issues could be fixed, and allow us to have a fully machine-readable GEDCOM 7 spec

@augean
Author

augean commented Jan 19, 2024

also, please advise, is it possible to reopen the issue?
I don't want to make a nuisance of myself, but I think the underlying issues are not resolved (see above)
At present issues are closed without any input from me, who originally logged the issue
Github doesn't allow me to reopen
Thanks !!!

@tychonievich
Collaborator

tychonievich commented Jan 19, 2024

    • Why are comments important for machine-readability? Is the machine reading them? How? What's the use-case that makes this important?
    • The YAML files have the specifications included in machine-readable form.
    • The entire specification itself, in both markdown and HTML formats, is also machine-readable, with the character-level and structure-level metasyntaxes inside markdown fenced code blocks with languages abnf and gedstruct and HTML pre elements with class="sourceCode abnf" and class="sourceCode gedstruct", respectively.
    • The character-level grammar, including that of the datatypes we define, is in grammar.abnf. Several datatypes are not readily regex-ready (you yourself define a non-regex metasyntax "swapex"); we chose the industry-standard context-free grammar notation ABNF instead.
    • A few 7.0 additions (Media Type and Language) are defined in external specifications which we do not replicate to avoid the possibility of going out of sync with those standards. We also assume that any application that cares what format these have is also consulting those external standards anyway to understand their meaning.
    • All machine-readable parts are in two files: grammar.gedstruct and grammar.abnf.
    • If you want machine-readable copies of the human-targeted text and structure information in one file, you can get that by running cat extracted_files/tags/* > all.yaml.
    • If you want machine-readable copies of the entire spec in one file, you can get that by running cat specification/gedcom-*md > specification.md; character-level syntax is delimited by blocks that start "```abnf" and end "```" and structure-level metasyntax is delimited by blocks that start "```gedstruct" and end "```".
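The fenced-block extraction described in the list above can be sketched in a few lines of Python. This is a minimal illustration, assuming the fences appear exactly as described (an opening fence immediately followed by the language name, and a bare closing fence); `extract_fenced_blocks` is a hypothetical helper, not part of any GEDCOM tooling.

```python
from typing import List, Optional

FENCE = "`" * 3  # the three-backtick markdown fence marker


def extract_fenced_blocks(markdown: str, lang: str) -> List[str]:
    """Return the body of every fenced block tagged with `lang`."""
    blocks: List[str] = []
    current: Optional[List[str]] = None  # lines of the block being read
    for line in markdown.splitlines():
        stripped = line.strip()
        if current is None:
            # Only an opening fence with the requested language starts a block.
            if stripped == FENCE + lang:
                current = []
        elif stripped == FENCE:
            # A bare fence closes the block currently being collected.
            blocks.append("\n".join(current))
            current = None
        else:
            current.append(line)
    return blocks
```

Calling it with `"abnf"` and then `"gedstruct"` on the concatenated specification.md would, under those assumptions, give the two kinds of grammar fragments as separate lists.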

We closed the issue because everything you asked for (machine-readability) is already provided. I still believe that's the case, but you've asked for more things (regular expressions and comments) so I'll re-open it for now to see if further conversation prompts identifying an issue that we should resolve.

@tychonievich tychonievich reopened this Jan 19, 2024
@augean
Author

augean commented Jan 20, 2024

thanks for the feedback, I will take a further look
But comments are very important, as they are used in genealogy tools, which are built off machine-readable schemas
I just think that we should maintain the comments in the machine-readable version.

for example: in the Augean tool, I use comments extensively when editing GEDCOM

comments

@augean
Author

augean commented Jan 20, 2024

The YAML files work fine, thanks, I was able to parse all YAML files
So please ignore my comment about too many files,

so, two issues would be

  • adding regular expressions to the YAML files to show valid values
  • adding the comments to grammar.gedstruct - This would really help tool builders, as we present these comments to users.

@dthaler
Collaborator

dthaler commented Jan 25, 2024

Discussion 1/25/2024:
We believe there are three separate issues worth discussing/pursuing here:

  1. The discussion above should be linked, e.g., on https://gedcom.io/tools/ since we're not sure the grammar.gedstruct or grammar.abnf files are easily discoverable
  2. Machine readable data type information should be discoverable for data types that aren't GEDCOM specific (e.g., IANA lang tags and media types). There may be machine readable mechanisms available on IANA that we can point to.
  3. It would be helpful to add user-facing descriptions to the YAML files, but today there are none because none is in the GEDCOM spec and the YAML files are all derived from extracted information.
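As an illustration of item 2, one machine-readable source that exists is the IANA Language Subtag Registry, which is published as plain text in the record-jar format described in RFC 5646: records are separated by `%%` lines, fields are `Name: value` pairs, and indented lines continue the previous field. The sketch below parses that format from an inline sample; `parse_record_jar` and `language_subtags` are hypothetical names, and the sample is a tiny illustrative excerpt rather than the real registry file.

```python
from typing import Dict, List, Optional, Set

Record = Dict[str, List[str]]  # field name -> all values (fields may repeat)

SAMPLE = """\
%%
Type: language
Subtag: en
Description: English
%%
Type: language
Subtag: es
Description: Spanish
Description: Castilian
%%
Type: script
Subtag: Latn
Description: Latin
"""


def parse_record_jar(text: str) -> List[Record]:
    """Parse record-jar text into a list of records."""
    records: List[Record] = []
    current: Record = {}
    last_field: Optional[str] = None
    for line in text.splitlines():
        if line.strip() == "%%":
            # Record separator: store the finished record, start a new one.
            if current:
                records.append(current)
            current, last_field = {}, None
        elif line[:1] in (" ", "\t") and last_field is not None:
            # Indented line: continuation of the previous field's value.
            current[last_field][-1] += " " + line.strip()
        elif ":" in line:
            name, _, value = line.partition(":")
            last_field = name.strip()
            current.setdefault(last_field, []).append(value.strip())
    if current:
        records.append(current)
    return records


def language_subtags(text: str) -> Set[str]:
    """Collect the subtags of all records whose Type is 'language'."""
    return {
        record["Subtag"][0]
        for record in parse_record_jar(text)
        if record.get("Type") == ["language"] and "Subtag" in record
    }
```

A tool could use such a set to check the primary language subtag of a payload, leaving the full language-tag syntax to the external standard.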

We don't think we need "regular expressions" per se because they can be derived from ABNF and because there are multiple different regex syntaxes used by various tools and libraries, so even if we picked one style, others would have to convert them anyway.
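To illustrate the derivation for a single datatype, here is a sketch in Python. The ABNF in the comment is the Time rule as recalled from the 7.0 specification and should be treated as an assumption to check against grammar.abnf. The regex uses the syntax of Python's `re` module, and `TIME_RE` and `is_time` are hypothetical names.

```python
import re

# Assumed ABNF (verify against grammar.abnf before relying on it):
#   Time     = hour ":" minute [":" second ["." fraction]] [%s"Z"]
#   hour     = digit / ("0" / "1") digit / "2" ("0" / "1" / "2" / "3")
#   minute   = ("0" / "1" / "2" / "3" / "4" / "5") digit
#   second   = ("0" / "1" / "2" / "3" / "4" / "5") digit
#   fraction = 1*digit
# Each rule is translated by hand into one regex fragment.
HOUR = r"(?:[0-9]|[01][0-9]|2[0-3])"
MINUTE = r"[0-5][0-9]"
SECOND = r"[0-5][0-9]"
FRACTION = r"[0-9]+"

# ABNF optional groups [ ... ] become (?: ... )? in this dialect.
TIME_RE = re.compile(
    rf"{HOUR}:{MINUTE}(?::{SECOND}(?:\.{FRACTION})?)?Z?"
)


def is_time(text: str) -> bool:
    """Return True if the whole string matches the assumed Time rule."""
    return TIME_RE.fullmatch(text) is not None
```

Other regex dialects spell non-capturing groups and whole-string matching differently, which is the portability concern raised above.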

Please let us know if we are missing anything or if you have other feedback.

@augean
Author

augean commented Jan 26, 2024

User descriptions in YAML files will help a lot - thanks !!!
Regular expressions in each YAML file would be the icing on the cake, but are not essential
Listing the files that are supposed to be machine-readable will help, as I was originally confused by this. Thanks !!!

dthaler pushed a commit to dthaler/GEDCOM.io that referenced this issue Feb 1, 2024
dthaler added a commit to FamilySearch/GEDCOM.io that referenced this issue Feb 1, 2024
* Link to extracted files from tools page

Addresses part of FamilySearch/GEDCOM#418

Signed-off-by: Dave Thaler <[email protected]>

* Update _pages/tools.md

* Update _pages/tools.md

---------

Signed-off-by: Dave Thaler <[email protected]>
Co-authored-by: Dave Thaler <[email protected]>
@dthaler
Collaborator

dthaler commented Oct 29, 2024

Current status as of 29 OCT 2024 GEDCOM Steering Committee meeting:

  1. The discussion above should be linked, e.g., on https://gedcom.io/tools/ since we're not sure the grammar.gedstruct or grammar.abnf files are easily discoverable

Done in FamilySearch/GEDCOM.io#142

  2. Machine readable data type information should be discoverable for data types that aren't GEDCOM specific (e.g., IANA lang tags and media types). There may be machine readable mechanisms available on IANA that we can point to.

Done in #437

  3. It would be helpful to add user-facing descriptions to the YAML files, but today there are none because none is in the GEDCOM spec and the YAML files are all derived from extracted information.

This is somewhat different from the original discussion in this issue so created issue #564 to track this.
We are closing this issue now and will just use that one to track further discussion on help text.
