Merge pull request #53 from impresso/RERO2-acquisition
Added Rero2 importer
Matteo Romanello authored Aug 19, 2019
2 parents 9733c55 + 10fd51b commit 5d22cd5
Showing 222 changed files with 1,290,493 additions and 2,703 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -9,3 +9,5 @@ text_importer/data/temp/
text_importer/data/cahier-de-charges-mets-alto2016
dask-worker-space/
worker-*
.idea/
docs/node_modules
2 changes: 1 addition & 1 deletion .gitmodules
@@ -1,4 +1,4 @@
[submodule "text_importer/impresso-schemas"]
path = text_importer/impresso-schemas
url = https://github.com/impresso/impresso-schemas.git
-branch = schemas-update
+branch = master
2 changes: 2 additions & 0 deletions Pipfile
@@ -26,6 +26,8 @@ jupyter = "*"
isort = "*"
tox = "*"
bokeh = "*"
sphinx = "*"
sphinx-rtd-theme = "*"

[requires]
python_version = "3.6"
879 changes: 521 additions & 358 deletions Pipfile.lock

Large diffs are not rendered by default.

53 changes: 44 additions & 9 deletions README.md
@@ -2,23 +2,58 @@

## Purpose

-Import the data from Olive OCR XML files into a canonical JSON format defined by the Impresso project (see [documentation of schemas](./README_schemata.md)).
+Import the data from various OCR formats (Olive XML, Mets/Alto, etc.) into a canonical JSON format defined by the Impresso project (see [documentation of schemas](https://github.com/impresso/impresso-schemas)).

## Input data

-A sample of the input data for this script can be found in [sample_data/](sample_data/) (data for Gazette de Lausanne (GDL), Feb 2-5 1900).
+A sample of the input data for this script can be found in [sample_data/](sample_data/).

-## Usage
+## Development settings

-Run the script sequentially:
+**Version**

-impresso-txt-importer --input-dir=text_importer/data/sample_data/ --output-dir=text_importer/data/out/ --temp-dir=text_importer/data/tmp/ --image-dir="/Volumes/project_impresso/images/" --filter="journal=IMP" --log-file=text_importer/data/import_test.log
+`3.6`

-or in parallel:
+**Documentation**

-impresso-txt-importer --input-dir=text_importer/data/sample_data/ --output-dir=text_importer/data/out/ --temp-dir=text_importer/data/tmp/ --image-dir="/Volumes/project_impresso/images/" --filter="journal=IMP" --log-file=text_importer/data/import_test.log --parallelize
+Python docstring style https://pythonhosted.org/an_example_pypi_project/sphinx.html

-For further info about the usage, see:
+Sphinx configuration file (`docs/conf.py`) generated with:

-impresso-txt-importer --help
+sphinx-quickstart --ext-githubpages

To compile the documentation:

```bash
cd docs/
make html
```

To view locally:

Install `http-server` (a Node.js package):

npm install http-server -g

Then:

cd docs
http-server

And you'll be able to browse it at <http://127.0.0.1:8080>.



**Testing**

Python pytest framework: https://pypi.org/project/pytest/

Tox: https://tox.readthedocs.io/en/latest/

**Passing arguments**

Docopt: http://docopt.org/

**Style**

4 space indentation
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
Binary file added docs/_build/doctrees/architecture.doctree
Binary file not shown.
Binary file added docs/_build/doctrees/custom_importer.doctree
Binary file not shown.
Binary file added docs/_build/doctrees/environment.pickle
Binary file not shown.
Binary file added docs/_build/doctrees/importers.doctree
Binary file not shown.
Binary file added docs/_build/doctrees/importers/lux.doctree
Binary file not shown.
Binary file added docs/_build/doctrees/importers/mets-alto.doctree
Binary file not shown.
Binary file added docs/_build/doctrees/importers/olive.doctree
Binary file not shown.
Binary file added docs/_build/doctrees/importers/rero.doctree
Binary file not shown.
Binary file added docs/_build/doctrees/index.doctree
Binary file not shown.
Binary file added docs/_build/doctrees/install.doctree
Binary file not shown.
4 changes: 4 additions & 0 deletions docs/_build/html/.buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 2b52c971ce904e79e72e8bf9b79c1404
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file added docs/_build/html/.nojekyll
Empty file.
15 changes: 15 additions & 0 deletions docs/_build/html/_sources/architecture.rst.txt
@@ -0,0 +1,15 @@
Architecture
============

things to mention:

- canonical IDs for pages, issues
- how files are packaged into compressed JSON lines archives
- page images are expected to be served by an image server
- how things are processed in parallel using ``dask``
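
As a rough illustration of the packaging point above, the sketch below writes a few page records to a compressed JSON Lines file and reads them back using only the standard library. The choice of bz2 compression, the record fields, the file name and the IDs are all assumptions made for illustration; they are not taken from the library's actual format or API.

```python
import bz2
import json
import os
import tempfile


def write_jsonl_bz2(records, path):
    """Write an iterable of dicts as one JSON document per line, bz2-compressed."""
    with bz2.open(path, "wt", encoding="utf-8") as archive:
        for record in records:
            archive.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl_bz2(path):
    """Yield the dicts stored in a bz2-compressed JSON Lines file."""
    with bz2.open(path, "rt", encoding="utf-8") as archive:
        for line in archive:
            if line.strip():  # skip blank lines, if any
                yield json.loads(line)


# Hypothetical page records; the field names are illustrative only.
pages = [
    {"id": "GDL-1900-02-02-a-p0001", "text": "..."},
    {"id": "GDL-1900-02-02-a-p0002", "text": "..."},
]

archive_path = os.path.join(tempfile.mkdtemp(), "GDL-1900-pages.jsonl.bz2")
write_jsonl_bz2(pages, archive_path)
restored = list(read_jsonl_bz2(archive_path))
```

Because each line is an independent JSON document, an archive like this can be read one record at a time without loading the whole file into memory.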

Processing
----------

.. automodule:: text_importer.importers.core
   :members:
39 changes: 39 additions & 0 deletions docs/_build/html/_sources/custom_importer.rst.txt
@@ -0,0 +1,39 @@
Write your own importer
=======================

TL;DR
-----

Writing a custom importer is easy and entails two pieces of code:

1. functions that find the data to be imported;
2. classes that handle the data format you'd like to import.

**TODO**: Give an example.

**TODO**: How to structure the code.

Minting canonical IDs
---------------------

TBD

Detecting data to import
------------------------

- the importer needs to know which data should be imported
- information about the newspaper contents is often encoded in
  folder names etc., so it needs to be extracted and made explicit


For example: :py:func:`~text_importer.importers.olive.detect.olive_detect_issues`
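
To make the idea of extracting metadata from folder names concrete, here is a minimal sketch. It assumes a hypothetical directory layout of the form ``<journal>/<yyyy>/<mm>/<dd>/<edition>`` and an assumed canonical ID pattern such as ``GDL-1900-02-02-a``; the names ``IssueDir``, ``parse_issue_dir``, ``canonical_issue_id`` and ``canonical_page_id`` are invented for this example and are not the library's API.

```python
from collections import namedtuple
from datetime import date
from pathlib import PurePosixPath

# Hypothetical container; the real importer may use a different structure.
IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path"])


def parse_issue_dir(path):
    """Extract issue metadata from a path ending in journal/yyyy/mm/dd/edition."""
    parts = PurePosixPath(path).parts
    if len(parts) < 5:
        raise ValueError(f"not an issue directory: {path}")
    journal, year, month, day, edition = parts[-5:]
    issue_date = date(int(year), int(month), int(day))
    return IssueDir(journal, issue_date, edition, path)


def canonical_issue_id(issue):
    """Build an issue ID such as GDL-1900-02-02-a (assumed format)."""
    return f"{issue.journal}-{issue.date.isoformat()}-{issue.edition}"


def canonical_page_id(issue, page_number):
    """Build a page ID by appending a zero-padded page number (assumed format)."""
    return f"{canonical_issue_id(issue)}-p{page_number:04d}"


issue = parse_issue_dir("sample_data/GDL/1900/02/02/a")
issue_id = canonical_issue_id(issue)
first_page_id = canonical_page_id(issue, 1)
```

Parsing the date into a ``datetime.date`` rather than keeping strings means that malformed folder names (for example a month of ``13``) fail early with a ``ValueError``.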

Subclassing abstract classes
----------------------------

.. autoclass:: text_importer.importers.classes.NewspaperIssue
   :members:

.. autoclass:: text_importer.importers.classes.NewspaperPage
   :members:
40 changes: 40 additions & 0 deletions docs/_build/html/_sources/importers.rst.txt
@@ -0,0 +1,40 @@
TextImporter
============

Available importers
-------------------


- :py:mod:`oliveimporter.py`: Olive XML OCR of `RERO <https://www.rero.ch/>`_
- :py:mod:`reroimporter.py`: Mets/ALTO flavor of `RERO <https://www.rero.ch/>`_
- :py:mod:`luximporter.py`: Mets/ALTO flavor of the Bibliothèque nationale du Luxembourg (BNL)


.. toctree::
   :maxdepth: 1
   :caption: Importers' APIs:

   importers/olive
   importers/mets-alto
   importers/lux
   importers/rero

Command-line interface
----------------------

.. note:: All importers share the same command-line interface; only a few options
   are importer-specific (see documentation below).

.. automodule:: text_importer.importers.generic_importer


Configuration files
-------------------

todo

Utilities
---------

.. automodule:: text_importer.utils
   :members:
23 changes: 23 additions & 0 deletions docs/_build/html/_sources/importers/lux.rst.txt
@@ -0,0 +1,23 @@
BNL Mets/Alto importer
======================

This importer extends the generic Mets/Alto importer, and it was developed to
handle OCR newspaper data provided by the BNL.

Custom classes
--------------

.. automodule:: text_importer.importers.lux.classes
   :members:

Detect functions
----------------

.. automodule:: text_importer.importers.lux.detect
   :members:

Helper methods
--------------

.. automodule:: text_importer.importers.lux.helpers
   :members:
22 changes: 22 additions & 0 deletions docs/_build/html/_sources/importers/mets-alto.rst.txt
@@ -0,0 +1,22 @@
Generic Mets/Alto importer
==========================

A backbone for any Mets/Alto importer.

Abstract classes
----------------

.. automodule:: text_importer.importers.mets_alto.classes
   :members:

Mets parsing
------------

.. automodule:: text_importer.importers.mets_alto.mets
   :members:

Alto parsing
------------

.. automodule:: text_importer.importers.mets_alto.alto
   :members:
25 changes: 25 additions & 0 deletions docs/_build/html/_sources/importers/olive.rst.txt
@@ -0,0 +1,25 @@
Olive XML importer
==================
Custom classes
--------------

.. automodule:: text_importer.importers.olive.classes
   :members:

Detect functions
----------------

.. automodule:: text_importer.importers.olive.detect
   :members:

Olive parsers
-------------

.. automodule:: text_importer.importers.olive.parsers
   :members:

Helper methods
--------------

.. automodule:: text_importer.importers.olive.helpers
   :members:
18 changes: 18 additions & 0 deletions docs/_build/html/_sources/importers/rero.rst.txt
@@ -0,0 +1,18 @@
RERO Mets/Alto importer
=======================

This importer extends the generic Mets/Alto importer, and it was developed to
handle OCR newspaper data provided by RERO in Mets/Alto format (the rest of
the data is in Olive format).

Custom classes
--------------

.. automodule:: text_importer.importers.rero.classes
   :members:

Detect functions
----------------

.. automodule:: text_importer.importers.rero.detect
   :members:
29 changes: 29 additions & 0 deletions docs/_build/html/_sources/index.rst.txt
@@ -0,0 +1,29 @@
.. Impresso TextImporter documentation master file, created by
   sphinx-quickstart on Mon Aug 12 14:50:13 2019.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to Impresso TextImporter's documentation!
=================================================

The Impresso TextImporter is a library and a collection of scripts to import
newspaper data from a variety of formats (e.g. Olive XML, various flavors of Mets/Alto XML, etc.)
into `Impresso's JSON format <https://github.com/impresso/impresso-schemas>`_.

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   install
   architecture
   importers
   custom_importer



Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
2 changes: 2 additions & 0 deletions docs/_build/html/_sources/install.rst.txt
@@ -0,0 +1,2 @@
Installation
============