Merge pull request #53 from impresso/RERO2-acquisition
Added Rero2 importer
Showing 222 changed files with 1,290,493 additions and 2,703 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
[submodule "text_importer/impresso-schemas"] | ||
path = text_importer/impresso-schemas | ||
url = https://github.com/impresso/impresso-schemas.git | ||
branch = schemas-update | ||
branch = master |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
# Minimal makefile for Sphinx documentation | ||
# | ||
|
||
# You can set these variables from the command line, and also | ||
# from the environment for the first two. | ||
SPHINXOPTS ?= | ||
SPHINXBUILD ?= sphinx-build | ||
SOURCEDIR = . | ||
BUILDDIR = _build | ||
|
||
# Put it first so that "make" without argument is like "make help". | ||
help: | ||
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) | ||
|
||
.PHONY: help Makefile | ||
|
||
# Catch-all target: route all unknown targets to Sphinx using the new | ||
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). | ||
%: Makefile | ||
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) |
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: 2b52c971ce904e79e72e8bf9b79c1404 | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
Architecture | ||
============ | ||
|
||
things to mention: | ||
|
||
- canonical IDs for pages, issues | ||
- how files are packaged into compressed JSON lines archives | ||
- page images are expected to be served by an image server | ||
- how things are processed in parallel using ``dask`` | ||
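The packaging point above can be illustrated with a minimal sketch. It assumes bz2 as the compression codec and uses made-up record fields and IDs; none of these are confirmed by this document, and the actual schema is defined in impresso-schemas. The helper names `write_jsonl_bz2` and `read_jsonl_bz2` are hypothetical.

```python
import bz2
import json
import os
import tempfile


def write_jsonl_bz2(records, path):
    """Write an iterable of dicts to `path`, one JSON document per line."""
    with bz2.open(path, "wt", encoding="utf-8") as archive:
        for record in records:
            archive.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl_bz2(path):
    """Yield the dicts stored in a bz2-compressed JSON lines file."""
    with bz2.open(path, "rt", encoding="utf-8") as archive:
        for line in archive:
            if line.strip():  # tolerate a trailing empty line
                yield json.loads(line)


# Hypothetical page records (assumption: field names are illustrative only).
pages = [
    {"id": "XYZ-1900-01-02-a-p0001", "text": "première page"},
    {"id": "XYZ-1900-01-02-a-p0002", "text": "zweite Seite"},
]

archive_path = os.path.join(tempfile.mkdtemp(), "XYZ-1900-pages.jsonl.bz2")
write_jsonl_bz2(pages, archive_path)
restored = list(read_jsonl_bz2(archive_path))
```

Keeping one JSON document per line means an archive can be streamed record by record without loading the whole file into memory.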
|
||
Processing | ||
---------- | ||
|
||
.. automodule:: text_importer.importers.core | ||
   :members: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
Write your own importer | ||
======================= | ||
|
||
TL;DR | ||
----- | ||
|
||
Writing a custom importer is easy and entails implementing two | ||
pieces of code: | ||
|
||
1. functions that find the data to be imported; | ||
2. classes that handle the data format you'd like to import. | ||
|
||
**TODO**: Give an example. | ||
|
||
**TODO**: How to structure the code. | ||
|
||
Minting canonical IDs | ||
--------------------- | ||
|
||
TBD | ||
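Pending the content of this section, here is a minimal sketch of what minting could look like. The ID layout used here (newspaper acronym, ISO date, edition letter, then a zero-padded page suffix) is an assumption for illustration, not something this document specifies, and both function names are hypothetical.

```python
from datetime import date


def mint_issue_id(newspaper, issue_date, edition="a"):
    """Build an issue ID such as 'XYZ-1900-01-02-a' (assumed layout)."""
    return f"{newspaper}-{issue_date.isoformat()}-{edition}"


def mint_page_id(issue_id, page_number):
    """Derive a page ID from its issue ID with a 4-digit page suffix (assumed layout)."""
    return f"{issue_id}-p{page_number:04d}"


issue_id = mint_issue_id("XYZ", date(1900, 1, 2))
page_id = mint_page_id(issue_id, 3)
print(issue_id)  # XYZ-1900-01-02-a
print(page_id)   # XYZ-1900-01-02-a-p0003
```

Deriving page IDs from the issue ID keeps the two levels consistent and lets the issue be recovered from any page ID by dropping the suffix.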
|
||
Detecting data to import | ||
------------------------ | ||
|
||
- the importer needs to know which data should be imported | ||
- information about the newspaper contents is often encoded in | ||
  folder names etc., so it needs to be extracted and made explicit | ||
|
||
|
||
For example: :py:func:`~text_importer.importers.olive.detect.olive_detect_issues` | ||
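As a rough illustration of making folder-encoded information explicit, the sketch below walks a hypothetical `<base_dir>/<journal>/<yyyy>/<mm>/<dd>` layout and returns one record per day folder. The layout, the `IssueDir` record, its field names, and the default edition `"a"` are all assumptions; this is not the implementation of `olive_detect_issues`.

```python
import os
import tempfile
from collections import namedtuple
from datetime import date

# Hypothetical record type; field names are illustrative only.
IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path"])


def detect_issues(base_dir):
    """Return one IssueDir per `<journal>/<yyyy>/<mm>/<dd>` folder under base_dir."""
    issues = []
    for journal in sorted(os.listdir(base_dir)):
        journal_dir = os.path.join(base_dir, journal)
        if not os.path.isdir(journal_dir):
            continue
        for root, dirs, _files in os.walk(journal_dir):
            dirs.sort()  # deterministic traversal order
            parts = os.path.relpath(root, journal_dir).split(os.sep)
            if len(parts) != 3:
                continue  # only year/month/day folders describe an issue
            try:
                issue_date = date(*(int(part) for part in parts))
            except ValueError:
                continue  # skip folders that do not encode a valid date
            issues.append(IssueDir(journal, issue_date, "a", root))
    return issues


# Build a tiny throwaway directory tree to exercise the function.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "XYZ", "1900", "01", "02"))
os.makedirs(os.path.join(base, "XYZ", "1900", "01", "notes"))  # ignored: not a day
detected = detect_issues(base)
```

Returning plain records rather than parsed issues keeps detection cheap, so the list can be filtered before any file is actually read.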
|
||
Subclassing abstract classes | ||
---------------------------- | ||
|
||
.. autoclass:: text_importer.importers.classes.NewspaperIssue | ||
   :members: | ||
|
||
.. autoclass:: text_importer.importers.classes.NewspaperPage | ||
   :members: |
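The subclassing pattern can be sketched with stand-in base classes. `BasePage` and `BaseIssue` below are hypothetical and defined inside the example itself; they are not the API of `NewspaperIssue` or `NewspaperPage`, whose actual methods are listed in the generated reference above. The plain-text "format" is likewise invented for illustration.

```python
from abc import ABC, abstractmethod


class BasePage(ABC):
    """Hypothetical stand-in for an abstract page class."""

    def __init__(self, page_id, number):
        self.id = page_id
        self.number = number

    @abstractmethod
    def parse(self):
        """Return the page content as a JSON-serialisable dict."""


class BaseIssue(ABC):
    """Hypothetical stand-in for an abstract issue class."""

    def __init__(self, issue_id):
        self.id = issue_id
        self.pages = self.find_pages()

    @abstractmethod
    def find_pages(self):
        """Return the list of page objects belonging to this issue."""


class PlainTextPage(BasePage):
    """Format-specific page: here the 'format' is just a text string."""

    def __init__(self, page_id, number, text):
        super().__init__(page_id, number)
        self.text = text

    def parse(self):
        return {"id": self.id, "lines": self.text.splitlines()}


class PlainTextIssue(BaseIssue):
    """Format-specific issue that knows how to locate its own pages."""

    def __init__(self, issue_id, page_texts):
        self.page_texts = page_texts  # set before the base class looks for pages
        super().__init__(issue_id)

    def find_pages(self):
        return [
            PlainTextPage(f"{self.id}-p{n:04d}", n, text)
            for n, text in enumerate(self.page_texts, start=1)
        ]


issue = PlainTextIssue("XYZ-1900-01-02-a", ["Title\nBody text", "Second page"])
parsed = [page.parse() for page in issue.pages]
```

The base classes fix the workflow (find pages, then parse each one), while a new importer only supplies the format-specific parts.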
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
TextImporter | ||
============ | ||
|
||
Available importers | ||
------------------- | ||
|
||
|
||
- :py:mod:`oliveimporter.py`: Olive XML OCR of `RERO <https://www.rero.ch/>`_ | ||
- :py:mod:`reroimporter.py`: Mets/ALTO flavor of `RERO <https://www.rero.ch/>`_ | ||
- :py:mod:`luximporter.py`: Mets/ALTO flavor of the Bibliothèque nationale du Luxembourg | ||
|
||
|
||
.. toctree:: | ||
   :maxdepth: 1 | ||
   :caption: Importers' APIs: | ||
|
||
   importers/olive | ||
   importers/mets-alto | ||
   importers/lux | ||
   importers/rero | ||
|
||
Command-line interface | ||
---------------------- | ||
|
||
.. note:: All importers share the same command-line interface; only a few options | ||
   are importer-specific (see documentation below). | ||
|
||
.. automodule:: text_importer.importers.generic_importer | ||
|
||
|
||
Configuration files | ||
------------------- | ||
|
||
todo | ||
|
||
Utilities | ||
--------- | ||
|
||
.. automodule:: text_importer.utils | ||
   :members: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
BNL Mets/Alto importer | ||
====================== | ||
|
||
This importer extends the generic Mets/Alto importer, and it was developed to | ||
handle OCR newspaper data provided by the BNL. | ||
|
||
Custom classes | ||
-------------- | ||
|
||
.. automodule:: text_importer.importers.lux.classes | ||
   :members: | ||
|
||
Detect functions | ||
---------------- | ||
|
||
.. automodule:: text_importer.importers.lux.detect | ||
   :members: | ||
|
||
Helper methods | ||
-------------- | ||
|
||
.. automodule:: text_importer.importers.lux.helpers | ||
   :members: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
Generic Mets/Alto importer | ||
========================== | ||
|
||
A backbone for any Mets/Alto importer. | ||
|
||
Abstract classes | ||
---------------- | ||
|
||
.. automodule:: text_importer.importers.mets_alto.classes | ||
   :members: | ||
|
||
Mets parsing | ||
------------ | ||
|
||
.. automodule:: text_importer.importers.mets_alto.mets | ||
   :members: | ||
|
||
Alto parsing | ||
------------ | ||
|
||
.. automodule:: text_importer.importers.mets_alto.alto | ||
   :members: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
Olive XML importer | ||
================== | ||
Custom classes | ||
-------------- | ||
|
||
.. automodule:: text_importer.importers.olive.classes | ||
   :members: | ||
|
||
Detect functions | ||
---------------- | ||
|
||
.. automodule:: text_importer.importers.olive.detect | ||
   :members: | ||
|
||
Olive parsers | ||
------------- | ||
|
||
.. automodule:: text_importer.importers.olive.parsers | ||
   :members: | ||
|
||
Helper methods | ||
-------------- | ||
|
||
.. automodule:: text_importer.importers.olive.helpers | ||
   :members: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
RERO Mets/Alto importer | ||
======================= | ||
|
||
This importer extends the generic Mets/Alto importer, and it was developed to | ||
handle OCR newspaper data provided by RERO in Mets/Alto format (the rest of | ||
the data is in Olive format). | ||
|
||
Custom classes | ||
-------------- | ||
|
||
.. automodule:: text_importer.importers.rero.classes | ||
   :members: | ||
|
||
Detect functions | ||
---------------- | ||
|
||
.. automodule:: text_importer.importers.rero.detect | ||
   :members: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
.. Impresso TextImporter documentation master file, created by | ||
   sphinx-quickstart on Mon Aug 12 14:50:13 2019. | ||
   You can adapt this file completely to your liking, but it should at least | ||
   contain the root `toctree` directive. | ||
|
||
Welcome to Impresso TextImporter's documentation! | ||
================================================= | ||
|
||
The Impresso TextImporter is a library and a collection of scripts to import | ||
newspaper data from a variety of formats (e.g. Olive XML, various flavors of Mets/Alto XML, etc.) | ||
into `Impresso's JSON format <https://github.com/impresso/impresso-schemas>`_. | ||
|
||
.. toctree:: | ||
   :maxdepth: 2 | ||
   :caption: Contents: | ||
|
||
   install | ||
   architecture | ||
   importers | ||
   custom_importer | ||
|
||
|
||
|
||
Indices and tables | ||
================== | ||
|
||
* :ref:`genindex` | ||
* :ref:`modindex` | ||
* :ref:`search` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
Installation | ||
============ |