There is no standard way of adding texts to Bilara. In the past, we processed the whole Pali canon, and the process evolved over time. So long as the data format is good, it doesn't matter how it is created, but we should offer a reasonable story for adding new texts.
Most of the fundamentals have been built, so it is a matter of lining them all up and testing the whole pipeline.
preparing HTML files
There is a spec for creating HTML files. It is designed for Sanskrit but will work for anything.
https://github.com/suttacentral/bilara-data/wiki/Sanskrit-text-preparation
Here is an explainer for certain details:
https://github.com/suttacentral/bilara-data/wiki/Overlapping-(text-critical)-markup-in-Bilara
The basic idea is that we use punctuation to segment the text, then wrap it up in HTML as specified.
converting html to tsv
Next we convert the HTML file to TSV. There's a script for this already, although it is not bug-free.
https://github.com/sc-voice/bilara-html-tsv
The basic point of this is to cleanly separate the data types, add segment numbers, and ready the text for the next step.
convert tsv to bilara-data via bilara i/o
We then use our Bilara i/o utility to convert the TSV files to bilara-data.
https://github.com/suttacentral/bilara-data/wiki/Bilara-io
This consumes a properly formed TSV file and exports it directly as JSON to the relevant bilara-data folders.
why tsv?
This is basically because it is what Bilara i/o was designed to use. There's no particular reason there needs to be an intermediate step here; we could go directly from HTML to JSON. One advantage of TSV, however, is in debugging. When things go wrong, we can inspect and edit in a spreadsheet, which is super handy for this sort of thing.
pipeline
I suggest that we use a new dedicated repo, such as /suttacentral/bilara-data-preparation
Use the same unpublished/published branches as on bilara-data.
User adds texts in bilara-HTML to the unpublished branch.
User makes a PR when they are ready.
When the PR is accepted, it triggers a GitHub Action (GA).
The GA runs bilara-html-tsv.js to convert to TSV, then Bilara i/o to convert to JSON for bilara-data, and adds it to the relevant repo.
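To make the TSV-to-JSON stage of this pipeline more concrete, here is a minimal Python sketch. It is not Bilara i/o itself: the function name `tsv_to_segment_maps`, the column names (`segment_id`, `root`, `html`), and the sample segment IDs are assumptions made for illustration only. It shows the general shape of the conversion, where each data column of the TSV becomes its own mapping from segment ID to string.

```python
import csv
import io
import json


def tsv_to_segment_maps(tsv_text, id_column="segment_id"):
    """Turn TSV text into one {segment_id: value} mapping per data column.

    Hypothetical helper: the column layout is assumed, not taken from Bilara i/o.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    maps = {}
    for row in reader:
        seg_id = row[id_column]
        for column, value in row.items():
            if column is None or column == id_column or not value:
                continue  # skip stray extra cells, the ID column, and empty cells
            maps.setdefault(column, {})[seg_id] = value
    return maps


# Made-up sample data standing in for the output of the HTML-to-TSV step.
sample_tsv = (
    "segment_id\troot\thtml\n"
    "demo1:0.1\tFirst segment. \t<h1>{}</h1>\n"
    "demo1:1.1\tSecond segment.\t<p>{}</p>\n"
)

segment_maps = tsv_to_segment_maps(sample_tsv)
for column, mapping in segment_maps.items():
    # In a real pipeline each mapping would be written to its own JSON file.
    print(column, json.dumps(mapping, ensure_ascii=False, indent=2))
```

Keeping each data type in a separate mapping mirrors the point made above about cleanly separating the data types before they reach bilara-data.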
Note that bilara-html-tsv runs on Node (I think) and Bilara i/o is Python. Let's see how complex it is to rewrite them to work together more nicely. Maybe do both as Go?
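Short of a rewrite, one option would be a thin driver that runs each tool as a subprocess and stops at the first failure. The sketch below assumes nothing about either tool's command line: the stage commands are supplied by the caller, and the `run_stages` name is hypothetical. The demo uses the Python interpreter itself as a stand-in for both stages.

```python
import subprocess
import sys


def run_stages(stages):
    """Run each (name, argv) stage in order; raise if a stage exits non-zero."""
    completed = []
    for name, argv in stages:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"stage {name!r} failed: {result.stderr.strip()}")
        completed.append((name, result.stdout.strip()))
    return completed


# Stand-in commands only: the real invocations of bilara-html-tsv and
# Bilara i/o are not specified here and would replace these argv lists.
demo_stages = [
    ("html-to-tsv", [sys.executable, "-c", "print('tsv written')"]),
    ("tsv-to-json", [sys.executable, "-c", "print('json written')"]),
]

outputs = run_stages(demo_stages)
print(outputs)
```

A driver like this keeps the two tools in their current languages, at the cost of requiring both runtimes wherever the pipeline runs.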