Fall 2017 publication thread #11

Closed
ctschroeder opened this issue Oct 30, 2017 · 26 comments

@ctschroeder
Member

ctschroeder commented Oct 30, 2017

Please use this thread to track our Fall 2017 corpora publication process.

Data freeze: November 9, 2017

Corpora to publish + reviewers:

  • ap (Carrie & Beth)
  • pseudo-theophilus (new, Beth)
  • johannes.canons (new, Carrie): ready!
  • victor (martyrdoms) (new, Beth)
  • dirt (new, Carrie)

Reviewers, please be sure to:

@ctschroeder
Member Author

ctschroeder commented Nov 7, 2017

@amir-zeldes Apa Johannes document FA 29-30 is done EXCEPT for some additional information (folio #s) needed for the idno metadatum. You should be able to test the TEI converter on it, though. Please see my thread here gucorpling/gitdox#54 about the converter, first. Thanks!!
ETA: it's now ready. Got the info I needed from Diliana.

@ctschroeder
Member Author

@amir-zeldes AP and Apa Johannes are DONE, except for two AP sayings for which we are still waiting on answers to queries; those sayings are from outside contributors and are marked "review."

There are a TON of AP. I edited a few that were already published but needed edits, updated the versioning, and committed. However, this means that we have some AP in SGML format, some in Excel, and some in both. Amir, let me know if you have questions about these. I think the rule of thumb is: if there is an SGML file in the gitdox folder, use that; don't use an Excel file unless there is no SGML for it in gitdox. Alternatively, I can go through and systematically commit every AP to GitHub in the gitdox folder. Let me know!
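The rule of thumb above could be sketched roughly as follows. This is only an illustration: the function name, the folder names, and the file names are hypothetical, and it assumes that the SGML and Excel versions of a document share the same base name.

```python
from pathlib import PurePosixPath


def choose_sources(gitdox_sgml, excel):
    """Pick one source file per document, preferring SGML from gitdox.

    Both arguments are lists of path strings. Documents are matched by
    base name (an assumption; the thread does not say how files are named).
    """
    # Start with every Excel file as a fallback...
    chosen = {PurePosixPath(p).stem: p for p in excel}
    # ...then let any SGML file in gitdox override the Excel version.
    chosen.update({PurePosixPath(p).stem: p for p in gitdox_sgml})
    return chosen


# Hypothetical file names, for illustration only.
sources = choose_sources(
    gitdox_sgml=["gitdox/AP.001.sgml"],
    excel=["excel/AP.001.xlsx", "excel/AP.002.xlsx"],
)
for doc, path in sorted(sources.items()):
    print(doc, "->", path)
```

Here the first document resolves to its SGML file, and the second falls back to Excel because no SGML version exists for it.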

@amir-zeldes
Member

Oh, no worries, I'm not going to export from Excel or SGML files - it'll all happen directly from GitDox based on document status (published/to_publish). If you could quickly verify that the statuses are correct, I can attempt the first conversion. I'll have a look at Johannes first maybe.

@ctschroeder
Member Author

ctschroeder commented Nov 8, 2017 via email

@amir-zeldes
Member

Should I convert AP without those two then or wait?

@ctschroeder
Member Author

ctschroeder commented Nov 9, 2017 via email

@amir-zeldes
Member

amir-zeldes commented Nov 9, 2017 via email

@ctschroeder
Member Author

Dirt is ready!! OMG this is a lot of material. Also @amir-zeldes I saw your email about TEI but could not get to it with everything else. Will get to it in the morning.

@amir-zeldes
Member

OK, shenoute.dirt is now in ANNIS as well, accessible with your logins. Let me know if everything looks OK (it only had the same issues of pb_xml:id and the TEI column, which I removed)

@ctschroeder
Member Author

re dirt:
@cluckmarq if you have a login to ANNIS you can see your Dirt file online.

@amir-zeldes a couple of things:
Document Metadata

  • filename is "dirt" in ANNIS (?)
  • license is showing the HTML code, not the link

Text & annotation

  • what are we missing in the annotations so that the linguistic analysis view looks wrong? translation spans?
  • likewise do we need a paragraph span?

Otherwise I think Dirt looks good.

@ctschroeder
Member Author

Re johannes.canons:

Document metadata

  • As with Dirt, the document name is off; it reads "johannes". I think something is not converting correctly from the document name in GitDox.
  • license & source_info likewise display HTML code, not links

Text and annotation

  • @eplatte is that first N in line 1 supposed to look like that with a line under it or something?
  • as with dirt: what are we missing in the annotations so that the linguistic analysis view looks wrong? translation spans? it would be nice if the stylesheet could show chapters or verses or break by verses? Do we need p or translation spans the same size as chapter/verse? I know we do not want to have 500 different stylesheets. Since we are moving to versification in all corpora as we (re)publish (except perhaps AP, but we could add them there, too), then perhaps we can decide on a versification stylesheet and just have that one, migrating corpora to using that one as we (re)publish?

Thanks! Let me know if we should annotate anything differently based on this conversation.

@amir-zeldes
Member

OK, I'm on the doc name and license issues. I figured out the naming problem, which is a bug in the TreeTagger module in SNP. It looks fixable, but might have repercussions I don't understand. I opened an issue here:

korpling/treetagger-emf-api#1

Basically, in stripping off the extension of the filename, it just removes everything after the first dot. The quickest fix is to not have dots in filenames, but ultimately (after the release) I'd like to see this fixed. I think not putting corpus.NAME as the name is consistent with our older corpora though, and it's kind of redundant. Would everyone be OK with the document being called: GL71-74 and the corpus name being shenoute.dirt? I think that's actually cleaner than shenoute.dirt > shenoute.dirt.GL71-74.
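To make the difference concrete, here is a minimal Python sketch of the behavior described above. It is not the TreeTagger module's actual code (which is not shown in this thread); the function names and the example filename are hypothetical. It only contrasts cutting at the first dot with cutting at the last one.

```python
def strip_at_first_dot(filename: str) -> str:
    """Mimic the reported bug: keep only what precedes the first dot."""
    return filename.split(".", 1)[0]


def strip_extension(filename: str) -> str:
    """Drop only the final extension, keeping earlier dots in the name."""
    stem, dot, _ext = filename.rpartition(".")
    # With no dot at all, rpartition returns empty stem and separator,
    # so fall back to the unchanged filename.
    return stem if dot else filename


name = "johannes.canons.FA29-30.xml"  # hypothetical filename for illustration
print(strip_at_first_dot(name))  # johannes
print(strip_extension(name))     # johannes.canons.FA29-30
```

The first result matches the truncated document name seen in ANNIS; the second is what a fix that only removes the last extension would produce.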

@amir-zeldes
Member

The hyperlink issue also seems to be an internal SNP thing due to our new workflow. Again, there is no quick solution, but I can simply 'un-escape' the escaped characters (> etc.) manually in the ANNIS files for this release. In the future, though, it's a problem we'll need to solve.
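For illustration, un-escaping a metadata value could look something like this in Python. This is a sketch under assumptions: the thread does not say how the ANNIS files were actually edited, and the function name and sample value are hypothetical. `html.unescape` is from the Python standard library.

```python
import html


def unescape_metadata(value: str) -> str:
    """Turn escaped markup such as &lt;a ...&gt; back into a real tag."""
    return html.unescape(value)


# Hypothetical escaped license value, for illustration only.
escaped = '&lt;a href="https://example.org/license"&gt;License&lt;/a&gt;'
print(unescape_metadata(escaped))
# <a href="https://example.org/license">License</a>
```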

Let me know about the document names, and if we're OK with GL71-74 etc., then I can reimport everything with those 2 problems fixed.

@ctschroeder
Member Author

ctschroeder commented Nov 10, 2017 via email

@amir-zeldes
Member

We could wait for the SNP problem to be solved, but that would delay the release... or we could manually change them back in all output documents, which is a bit irritating.

But I'm not sure we should want to: our previous corpora don't do this (so documents in Eagerness are called GL29 etc.), and the folder (or corpus in ANNIS, repo) uniquely identifies the corpus anyway. Did you want to retroactively change names in all corpora to include the doc name?

@eplatte
Member

eplatte commented Nov 10, 2017

Johannes:
I think what's happening with the initial ⲛs in Johannes is a visualization problem. The punctuation and the ⲛ are two separate characters that are combining in the visualization for some reason. I'm happy to try to fix it in GitDox if anyone has any suggestions, but those are the correct characters according to Diliana's transcription.
Also, I know the lines in Johannes look odd, but they're all right!
As for the linguistic analysis view, this document has p spans that are the same as verses, but no translation. Again, let me know if there's something I can fix.

@ctschroeder
Member Author

Victor should be done. I am not sure the layer names are correct, but they are understandable.

Also Amir, I know you're busy, but when you get a chance, let us know what to do to change the spans so the normalized and analytical visualizations are more sensible.

@amir-zeldes
Member

Sure, norm responds to p to make paragraph breaks (which may never be inside a bound group) and analytical responds to translation. The simplest solution for the analytical view without adding a variant stylesheet is to add translation="..." in the relevant spans. Otherwise, we can have a variant stylesheet that responds to another annotation (technically no problem, it just potentially causes confusion if we ever mix up stylesheets or actually add translations).

@ctschroeder
Member Author

Hi. I have added translation spans so the analytic views should look ok now.
For Victor, Dirt, Johannes: can we have a chapter view visualization that is basically the normalized view, except that the text is broken up by chapter instead of by p span, with the verse numbers also included? So something like:
(ch 1) (1) here is some coptic. (2) here is some more coptic. (3) we sure love coptic.
(ch 2) (1) here is a new chapter of coptic. (2) boatloads of coptic
I'm open to suggestions.
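As a rough sketch of the layout requested above (not an ANNIS stylesheet, just the grouping logic), the following Python groups verses by chapter and prints them in that format. The function name and the tuple-based input are assumptions made for illustration.

```python
from itertools import groupby


def render_chapter_view(verses):
    """Render (chapter, verse, text) tuples, one output line per chapter.

    Assumes the tuples are already in document order, so consecutive
    items with the same chapter number belong together.
    """
    lines = []
    for chapter, group in groupby(verses, key=lambda v: v[0]):
        body = " ".join(f"({verse}) {text}" for _, verse, text in group)
        lines.append(f"(ch {chapter}) {body}")
    return "\n".join(lines)


verses = [
    (1, 1, "here is some coptic."),
    (1, 2, "here is some more coptic."),
    (1, 3, "we sure love coptic."),
    (2, 1, "here is a new chapter of coptic."),
    (2, 2, "boatloads of coptic"),
]
print(render_chapter_view(verses))
```

Run on the sample data, this reproduces the two example lines given in the comment.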

@ctschroeder
Member Author

Sorry: Victor, Dirt, Johannes, Ps-Theophilus. Since you haven't published a test run of Ps-Theophilus, maybe you could try the visualization with that one first?
My goal is to have versification for everything, so we would gradually be moving all the texts over to that view.
Also, can you keep the manuscript # in the visualization, in square brackets?

@amir-zeldes
Member

OK, theo, dirt, johannes and victor are now online and ready for inspection in ANNIS. TEI also converts no problem except for a modeling issue Carrie and I are discussing, but basically they all validate, suggesting there are no issues with the underlying annotations at this point.

AP is sadly riddled with little things, so I'll plug away at that next; the other ones would be ready for a complete release on my end.

@ctschroeder
Member Author

ctschroeder commented Nov 15, 2017 via email

@amir-zeldes
Member

Thanks, no, I need to fix errors and re-run SNP each time, so you can't (if I find something systematic I'll let you know).

What I need from you and @eplatte is just a green light for the other corpora that are in ANNIS right now. If they check out, we just need AP to release!

@eplatte
Member

eplatte commented Nov 15, 2017 via email

@amir-zeldes
Member

Thanks! No need to overdo it though, it can all wait!

@ctschroeder
Member Author

PUBLISHED! yay

3 participants