KB importer #123

Open
piconti opened this issue Dec 19, 2023 · 2 comments

piconti commented Dec 19, 2023

Implement the importer for the KB data, which is in DIDL-ALTO format, based on the sample data provided.

@piconti piconti self-assigned this Dec 19, 2023

piconti commented Dec 21, 2023

Update after the first implementation of the KB importer.

The main functions in kb.detect.py and kb.classes.py have been implemented and work on the provided samples.
However, during the implementation, some specificities of KB's format (in particular the DIDL format) were identified.
Some of them may warrant further questions to KB, to ensure the importer is ready and robust enough for larger-scale data.
Others will require adjustments once more information is available, and are open to discussion on how we should handle them.

These specificities are the following:

  • File structure:
    • In the provided sample, the files were not separated by journal (only by year > month > day > issue_identifier). An index .tsv file was provided to link each journal to the paths of its issues and their publication dates.
    • While a top-level directory separating the data by journal would be ideal (journal > year > month > day > issue_identifier), the present file structure can work at larger scale, as long as a similar index .csv or .tsv file is provided. Indeed, the current detect_issues function uses this index to identify which issues belong to which journal and to filter the issues to import.
  • Journal/Title aliases or IDs
    • Most of the data providers use some sort of alias or human-readable ID for their various titles, but none has been communicated or found for KB yet.
    • Currently, a function mapping each journal to an 8-digit ID has been implemented. However, since these IDs do not convey any information about the journal, and KB's collection comprises around 1000 journals, this cannot be a viable long-term solution.
    • If KB has an internal human-readable alias system, we could use it. Otherwise we could develop an alias system of our own, but the journal titles come in a large variety of formats, and some are very similar to each other. As a result, finding a systematic approach that generates unique aliases could prove tricky.
  • Segmented Images
    • After multiple attempts, I was not able to find any segmented area corresponding to illustrations or images in the current samples. A few illustrations appear on the pages, but no corresponding item was found in the DIDL or ALTO files.
    • We will need to ask KB whether their OLR also segments images. If so, we could also ask for an example of a more illustrated issue, in order to write the code handling images.
  • Content-item ordering
    • In KB's OLR, all the segmented items or articles are numbered at the issue level. This numbering has been used for now to number the created content items, but it appears that the items can be shuffled and do not necessarily follow a logical page-to-page ordering.
    • Another issue (#74) already covers this subject. Adding a reading order to all canonical data could be a solution, but it is still to be worked on.
  • Content item types
    • We currently have a list of content_item_types that we use in the canonical format. In KB's data some items are classified as "Familial message" (announcements of weddings, wedding anniversaries, or parties).
    • This type could be added to the current list of content_item_types if it's found to be relevant.
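
To make the index-based approach from the "File structure" point concrete, here is a minimal sketch of how issues could be grouped by journal from such an index file. It is not the actual detect_issues implementation: the function name `group_issues_by_journal`, the column names `title`, `path` and `date`, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import date


def group_issues_by_journal(index_file, wanted_titles=None):
    """Map each journal title to a date-sorted list of (date, path) pairs.

    Assumes a tab-separated index with hypothetical columns
    'title', 'path' and 'date' (ISO format, YYYY-MM-DD).
    """
    issues = defaultdict(list)
    for row in csv.DictReader(index_file, delimiter="\t"):
        # Optional filter: only keep the journals we want to import.
        if wanted_titles is not None and row["title"] not in wanted_titles:
            continue
        issues[row["title"]].append((date.fromisoformat(row["date"]), row["path"]))
    # Sort each journal's issues chronologically (ties broken by path).
    return {title: sorted(entries) for title, entries in issues.items()}


# Hypothetical index content, following the year > month > day > issue_identifier layout.
sample_index = io.StringIO(
    "title\tpath\tdate\n"
    "Journal A\t1923/05/02/issue_b\t1923-05-02\n"
    "Journal B\t1923/05/01/issue_c\t1923-05-01\n"
    "Journal A\t1923/05/01/issue_a\t1923-05-01\n"
)

issues = group_issues_by_journal(sample_index, wanted_titles={"Journal A"})
for issue_date, path in issues["Journal A"]:
    print(issue_date.isoformat(), path)
```

With this shape, the directory layout itself never needs to encode the journal, which is why the flat year > month > day structure remains workable as long as the index is delivered alongside the data.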


piconti commented Feb 22, 2024

We have a response from KB.

  • They don't use aliases, so we should create a list of aliases.
  • There are very few images in the data. Once we have a larger dump, I can look for them.
  • They have agreed to provide the new dump in the provided file structure

TODO as a result:

  • Create mapping of aliases for KB
  • Adapt code to new file structure of data
  • Find images/illustrations in the data and implement specific code to handle them
  • Finalize importer code based on pilot
  • Comment & document
  • Merge into master
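
For the alias mapping, one possible starting point is sketched below. It is only an illustration of one systematic approach, not a decided scheme: the helper names `slugify` and `build_aliases`, the 12-character limit, the numeric-suffix rule for collisions, and the sample titles are all assumptions. Near-identical titles end up with aliases that differ only by a suffix, so the output would still need manual review.

```python
import re
import unicodedata


def slugify(title, max_len=12):
    """Lowercase, strip accents and non-alphanumerics, then truncate."""
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]", "", ascii_title.lower())[:max_len]


def build_aliases(titles, max_len=12):
    """Return a {title: alias} mapping in which every alias is unique.

    Titles are processed in sorted order so the result is deterministic.
    Colliding slugs receive a numeric suffix (2, 3, ...), which means an
    alias can be slightly longer than max_len.
    """
    aliases = {}
    used = set()
    for title in sorted(set(titles)):
        base = slugify(title, max_len) or "untitled"
        alias, n = base, 1
        while alias in used:
            n += 1
            alias = f"{base}{n}"
        used.add(alias)
        aliases[title] = alias
    return aliases


# Hypothetical titles, including two that differ only by punctuation.
sample_titles = ["Nieuwe Courant", "Nieuwe Courant.", "Écho du Jour"]
for title, alias in build_aliases(sample_titles).items():
    print(f"{alias}\t{title}")
```

Processing titles in sorted order keeps the mapping stable across runs, provided the set of titles does not change. Once aliases are in use, the mapping should probably be stored in a file rather than regenerated.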

@piconti piconti pinned this issue May 1, 2024