The tools in this repository can be used for bulk upload of Thoth publishers' back catalogues to Internet Archive via the Thoth Dissemination Service.
This README records the steps taken to upload the OBP and punctum back catalogues to the Thoth Archiving Network collection on 2022-11-28/29.
At the time of this upload, the tools were contained in a subfolder `iabulkupload` of the Thoth Dissemination Service repository `thoth-dissemination`. The wording of the steps reflects this.
See also the README for the Thoth Dissemination Service itself.
- Check out clean version of Thoth Dissemination Service v0.1.0 to parent folder `thoth-dissemination`.
- Ensure that the appropriate Internet Archive credentials are present in `../config.env`.
- In parent folder `thoth-dissemination`, build Thoth Dissemination Service v0.1.0 docker image with name `testdissem` by running `docker build . -t testdissem`.
- Ensure that the desired publisher Thoth IDs (and short names) are present in `./obtain_work_ids.py`.
- Create lists of Thoth IDs of works to be uploaded by running `./obtain_work_ids.py`.
- For each list, start the upload process by running `./bulkupload.sh [publisher]_list.txt 2>> disseminator.log`.
- Check `./disseminator.log` for any `ERROR` messages. If necessary, cancel the upload process using Ctrl+C. Once errors are resolved, the upload process can be re-started (successfully uploaded work IDs will be skipped).
- Once the upload process completes, check that all work IDs present in the `./[publisher]_list.txt` files also appear in `./uploaded.txt`.
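
The last two checks can be automated. The following is a minimal sketch, not part of the repository: the function names `read_ids`, `missing_work_ids` and `error_lines` are hypothetical, and it assumes that the list files and `uploaded.txt` hold one work ID per line and that failures appear in the log as lines containing the literal text `ERROR`.

```python
from pathlib import Path


def read_ids(path):
    """Return the set of non-empty, whitespace-stripped lines in a file.

    Assumption: each line of the file holds exactly one work ID.
    """
    lines = Path(path).read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def missing_work_ids(list_paths, uploaded_path="uploaded.txt"):
    """Map each list file to the sorted work IDs that are absent from uploaded_path."""
    uploaded = read_ids(uploaded_path)
    return {str(path): sorted(read_ids(path) - uploaded) for path in list_paths}


def error_lines(log_path="disseminator.log"):
    """Return the log lines that contain the substring 'ERROR'.

    Assumption: error messages are marked with the literal text 'ERROR'.
    """
    lines = Path(log_path).read_text().splitlines()
    return [line for line in lines if "ERROR" in line]
```

Run from the folder containing the list files, `missing_work_ids(sorted(Path(".").glob("*_list.txt")))` would report, per list file, any work IDs still to be uploaded; an empty list for every file suggests the upload is complete.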
Instead of filling out `../config.env` in step 2, credentials can be set as environment variables if some changes are made to `./bulkupload.sh`. In place of line 30 (`docker run --rm testdissem ./disseminator.py --work $work_id --platform InternetArchive`), do either of the following:
- Pass the credentials directly to the docker container as environment variables: `docker run --env ia_s3_secret=[xxx] --env ia_s3_access=[yyy] --rm testdissem ./disseminator.py --work $work_id --platform InternetArchive`
- Use the undockerised run method given in the comment in line 33, having set the credentials as environment variables in the shell (`export ia_s3_secret=[xxx]; export ia_s3_access=[yyy]`): `python3 ../disseminator.py --work $work_id --platform InternetArchive`
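
As an illustration of the first option, the sketch below assembles the same `docker run` command from credentials already exported in the shell, and fails early if either is unset. It is a hypothetical helper rather than code from the repository: the name `build_docker_command` and the constant `CREDENTIAL_VARS` are assumptions, while the variable names `ia_s3_secret` and `ia_s3_access`, the image name `testdissem` and the command arguments are taken from the steps above.

```python
import os

# Credential variable names as used in the commands above.
CREDENTIAL_VARS = ("ia_s3_secret", "ia_s3_access")


def build_docker_command(work_id, environ=os.environ, image="testdissem"):
    """Return the docker run argument list for uploading one work.

    Raises KeyError if a credential variable is unset or empty, so that
    a bulk run stops before attempting any upload without credentials.
    """
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))

    command = ["docker", "run"]
    for name in CREDENTIAL_VARS:
        # Pass each credential to the container as NAME=value.
        command += ["--env", f"{name}={environ[name]}"]
    command += [
        "--rm", image,
        "./disseminator.py", "--work", work_id,
        "--platform", "InternetArchive",
    ]
    return command
```

The resulting list could then be handed to `subprocess.run(build_docker_command(work_id), check=True)`. Passing an argument list rather than a single string avoids shell quoting problems if a credential contains special characters.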