This repo contains the Annotated Index to Cantos I-LXXI by Edwards and Vasse in raw text form and also in JSON format. The conversion from txt to json was mostly done using ChatGPT (a laborious process we could pipeline given financial resources to use powerful models via API). The code in this repo either cleans the data (e.g. OCR errors) or makes small measurements on its content (e.g. language frequencies).
Having the annotated index in JSON format allows fast querying. We could turn this repo into a small lambda server with an API and have the Annotated Index accessible as a micro-service.
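As a sketch of the "fast querying" the JSON format allows, the snippet below loads entries and filters them by canto number. The schema here is an assumption (a list of entries, each with a "name", a "detail", and "references" given as "canto/page" strings such as "80/86"); the real files may differ.

```python
import json

# Assumed schema: a list of entries, each with "name", "detail",
# and "references" given as "canto/page" strings such as "80/86".
SAMPLE = json.loads("""
[
  {"name": "ARLES", "detail": "City in Provence", "references": ["80/86"]},
  {"name": "Cuba",  "detail": "", "references": ["38/41", "80/90"]}
]
""")

def entries_for_canto(entries, canto):
    """Return entries whose reference list mentions the given canto number."""
    wanted = str(canto)
    return [e for e in entries
            if any(r.split("/")[0] == wanted for r in e["references"])]

print([e["name"] for e in entries_for_canto(SAMPLE, 80)])  # -> ['ARLES', 'Cuba']
```

In a real micro-service this filter would run behind the lambda's API handler rather than over an in-memory list.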
The raw text was generated by Amazon Textract (OCR); the source images were produced by physically photographing a copy of the Annotated Index (to be confirmed).
The raw text was first formatted into JSON by ChatGPT. Note that ChatGPT introduced some details omitted by Edwards and Vasse. For example, the entry "Cuba" has canto and page numbers but no detail; ChatGPT decided to supply some detail itself: "Island country in the Caribbean." (Likewise "Egypt", "England", ...)
The goal here is to provide a database and API that can return all references in the Annotated Index concerning any given Canto. We also want to be able to look up a reference and discover where it appears in The Cantos.
Though the An.Index is an outdated example of annotation for The Cantos, we intend to aggregate all the data for use rather than choosing only the most up-to-date and accurate sources. The reasoning behind this can be read ...
The An.Index capitalises words in an entry's detail to point at another entry. We want to be able to connect these entries somehow.
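A crude first pass at connecting these entries, assuming the capitalised cross-references appear as all-caps tokens (as in "so as at ARLES"), is to extract such tokens from the detail and treat them as candidate links. This heuristic is a guess and will produce false positives (e.g. Roman numerals).

```python
import re

# Candidate cross-references: runs of two or more capital letters.
# An assumption based on examples like "so as at ARLES"; needs checking.
CAPS = re.compile(r"\b[A-Z]{2,}\b")

def candidate_links(detail):
    return CAPS.findall(detail)

print(candidate_links("(It) so as at ARLES. (See: Inferno, 9, 112)."))  # -> ['ARLES']
```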
The An.Index uses square-bracketed terms to denote terms that do not actually appear in a Canto but might be a fundamental reference of another entry in a Canto. (Check the An.Index preface for this.)
There are also Canto Number / Page Numbers in square brackets. Investigate these.
Entries are not direct quotes from The Cantos: Edwards and Vasse reorder them into formats better suited for reference, and because an entry can refer to multiple instances, it cannot always be a direct quote. At some point in the pipeline we need to provide the direct quote for highlighting. It might be good to have this in the per-Canto data.
We also have instances of, e.g., "See Appendix A."
We also have instances of invalid brackets in page numbers, for example "a/b]".
Question: Do page numbers with hyphens survive the parse?
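One quick way to check, assuming the "canto/page" reference format used elsewhere in these notes, is a pattern that explicitly allows a hyphenated page range; anything the pattern rejects would be dropped by a stricter parse.

```python
import re

# Assumed formats: "80/86" and hyphenated ranges like "80/86-7".
PAGE_REF = re.compile(r"^(\d+)/(\d+)(?:-(\d+))?$")

for ref in ["80/86", "80/86-7", "a/b]"]:
    m = PAGE_REF.match(ref)
    print(ref, "->", m.groups() if m else "no match")
```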
There also seem to be some special characters, for example "\n", that slip in. We need to understand our handling of these. At first these looked like OCR mistakes requiring correction, but in fact they appear to be true to The Cantos; we might just want to use "/" instead.
How can we tell if we have missed any entries from TAI? We need to investigate what data might have been dropped. For example, we lost "commerciabili bene", possibly because we already had "commerciabili"?
Details contain something like "Various... mentioned..." for cases where keys clash.
If an entry string ends in ":", we should remove that character.
Lots of confusion between O and Q. If we know where we are in the alphabet we can correct the entries fairly easily.
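Because the index is alphabetical, our current position in the alphabet tells us which reading of a leading O/Q is right. The helper below is a minimal sketch of that idea for the first letter of a key only; O/Q confusion elsewhere in a key would need more context.

```python
# Sketch: if we are still in the 'O' section of the index, a key beginning
# with 'Q' is almost certainly a mis-read 'O', and vice versa.
def fix_oq(key, current_letter):
    if key and key[0] in "OQ" and current_letter in "OQ" and key[0] != current_letter:
        return current_letter + key[1:]
    return key

print(fix_oq("Qdysseus", "O"))  # -> "Odysseus"
```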
poundian#142 suggests counting the number of translations by language.
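A sketch of that count, assuming the index marks translations with a leading language tag such as "(It)", "(Fr)", "(Gr)" (as in the "si com' ad Arli" entry quoted below):

```python
import re
from collections import Counter

# Assumed convention: a parenthesised two- or three-letter language tag
# such as "(It)" opens a translated detail.
LANG_TAG = re.compile(r"\(([A-Z][a-z]{1,2})\)")

def language_counts(details):
    counts = Counter()
    for d in details:
        counts.update(LANG_TAG.findall(d))
    return counts

print(language_counts(["(It) so as at ARLES.", "(Fr) song.", "(It) x"]))
```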
We can also count the number of references by Page Number length & see if we have any insight into the most used material. This count can be extended by observing references in Entry Details of other entries. We might want to count how many entries are referenced by other entries, i.e. secondary references that accumulate attention on the primary entry. Then there are also non-obvious references, such as 'See Inferno' which makes no mention of 'Dante' (entry: "si com' ad Arli: 80/86: (It) so as at ARLES. (See: Inferno, 9, 112).")
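The secondary-reference count described above could be sketched as below, again assuming cross-references appear as all-caps tokens in details; entries is taken here as a key-to-detail mapping, which is an assumption about the parsed shape.

```python
import re
from collections import Counter

CAPS = re.compile(r"\b[A-Z]{2,}\b")

def secondary_reference_counts(entries):
    """Count how often each entry key is named in other entries' details.

    entries: dict mapping entry key -> detail string (assumed shape).
    """
    counts = Counter()
    keys_upper = {k.upper(): k for k in entries}
    for key, detail in entries.items():
        for token in CAPS.findall(detail):
            target = keys_upper.get(token)
            if target and target != key:
                counts[target] += 1
    return counts
```

Non-obvious references like "See Inferno" (with no mention of "Dante") would still slip through; those need a hand-built alias table.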
So we might want to store this JSON parse in DynamoDB. That would allow us to query it from multiple different programs and maintain one updatable dataset. TAI is so far behind TCP that I don't know if it can form a basis for better annotation, although TAI might have more entries than TCP?
Obviously if we have a db we especially need to handle having duplicate keys.
Also if we have a db we can have an API that can return us e.g. per-Canto annotations. Part of that API might handle error reporting & correction. We might want to provide a way for users to suggest there is an error in an entry. Given a hit, we can manually check with a copy of the book.
So we want to extract Canto number from Page Numbers. We can pretty much discard the page & perform a search for the line, as we have in the past.
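Extracting the canto number from a "canto/page" reference is a small parse; this sketch keeps the canto, discards the page, and returns None for unparseable tokens like ChatGPT's "Unknown".

```python
import re

def canto_of(ref):
    """Return the canto number from a 'canto/page' reference, or None."""
    m = re.match(r"(\d+)/", ref)
    return int(m.group(1)) if m else None

print(canto_of("80/86"))   # -> 80
print(canto_of("Unknown")) # -> None
```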
We also want to make a distinction for implicit references (or fundamental annotations, as previously named).
entry: {
    references: [
        {
            canto,
            line
        },
        ...
    ],
    detail: ...
}
How do we do implicit references? They don't have a line, right?
And we might want to extend the above to:
references: [
    {
        canto,
        line,
        match_string
    }
]
as we've had in the past as well.
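One way to model the extended record, and to answer the implicit-reference question above (implicit references have no line), is to make line and match_string optional and treat their absence as implicitness. The class name and shape are illustrative, not settled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Reference:
    """A single reference; implicit references carry no line or match string."""
    canto: int
    line: Optional[int] = None
    match_string: Optional[str] = None

    @property
    def implicit(self) -> bool:
        return self.line is None

print(Reference(canto=80, line=86, match_string="so as at Arles").implicit)  # -> False
print(Reference(canto=80).implicit)  # -> True
```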
Because this dataset won't change much, we shouldn't mind creating some extra tables to speed up lookup for references for a specific canto.
i.e.

"I": [
    entry,
    entry_2,
    ...
],
"II": ...
Or we might want

"I": {
    references: [entry, entry_2, ...],
    indirect_references: [...]
}
We can use these tables whenever dealing with a specific Canto, and can then lookup entries for their details from the main table later.
These are included here in case anyone ever needs to check what issues might have been removed or created in the data.
Common examples of the '1' error are "1S", "T1", "t1", "a1", ... (These examples will make searching easier.) Actually, distinguishing between "l" and "i" might need the hardcopy.
Duplicate: There are some flaws in the Textract output and the data will need cleaning at some point. One common example is the substitution of '1' for 'i' or 'I'. A common example is "1S", but I'm sure there are more.
Often "11" is found in place of " (or \").
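A hedged sketch of the OCR fixes noted above: "11" standing in for a double quote, and '1' standing in for 'I'/'i' in tokens like "1S", "T1", "t1". These patterns are guesses from the examples and need checking against the hardcopy before being run over the data.

```python
import re

def fix_ocr(text):
    text = text.replace("11", '"')                  # 11 -> "
    text = re.sub(r"\b1S\b", "IS", text)            # 1S -> IS
    text = re.sub(r"([A-Za-z])1\b", r"\1i", text)   # T1 -> Ti, a1 -> ai
    return text

print(fix_ocr('T1 and t1, 11quoted11, 1S'))  # -> Ti and ti, "quoted", IS
```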
"missing information"
is also used. (Replaced with empty string)
ChatGPT uses the token 'Unknown' for Page Number where it has failed to parse. We need to go through and fix these too. (Replaced with empty string)
Sometimes "N/A"
is used for an empty entry, sometimes null
. (Replaced with empty string)
I am now appending "(1)", "(2)", ... to distinguish same keys. Sometimes square brackets are added to keys, e.g. "Alfonso" and "[Alfonso]", to avoid duplicates; I have caught a few of these.