Curation and ingest tools, scripts, source
NO Data goes in this repository
Must have programs: Python 2.7
Useful programs: iTerm2 (easier interface than Terminal), Sourcetree (Git GUI for managing local clones), Sublime Text (.csv and .json editor)
You need an /output directory in /tools (which also contains /scripts)
Open iTerm window in folder /Users/[username]/Documents/GitHub/curation/tools
Create folder for completed .csv files (e.g., /HD/temp)
- Install pip, if you are using Python 2 < 2.7.9 or Python 3 < 3.4.
- Install virtualenv package
pip install virtualenv
- In your project root folder, create a Python 2 virtual environment
virtualenv venv
- Activate your virtual env
source venv/bin/activate
Run the following command to install required dependencies:
pip install -r requirements2.txt
Duplicate config/default.json and rename it credentials.json:
cp config/default.json config/credentials.json
Open credentials.json and update your credentials:
{
"username":"YOUR_USERNAME",
"password":"YOUR_PASSWORD"
}
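The ingest scripts presumably read these values at runtime; here is a minimal sketch of loading the file yourself, assuming it lives at config/credentials.json relative to where you run the scripts (adjust the path to match your checkout):

import json

# Path is an assumption based on the step above; adjust if credentials.json
# lives somewhere else in your checkout.
with open("config/credentials.json") as f:
    credentials = json.load(f)

username = credentials["username"]
password = credentials["password"]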
Not complete: This folder contains a work in progress for describing the Databrary API using Swagger IO.
Contains all current templates for Databrary ingest. All files can be generated automatically with the fields from fields.py and required entries from volume.json
ingest_template.xlsx: Excel spreadsheet for distributing to contributors, with two worksheets, sessions and participants. Contributors will input session metadata (including the filename and location for the video files they wish to ingest) and participant metadata in a format suitable for ingesting into Databrary.
participants_template.csv & sessions_template.csv: csv formats for each worksheet in ingest_template.xlsx
- Make sure dates are in MM/DD/YYYY format
- Open .csv files in Sublime and convert line endings to Unix
- Make sure text IDs have leading/padding zeros
- Do not include file_position_1 if not using clips
- Release must be in BOTH the session and participant .csv files
- Make sure filepath does NOT start with "/"
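These checks are easy to miss by hand; below is a minimal pre-flight sketch, with the caveat that the column names date and file_1 are placeholders rather than the actual template headers:

import csv
import re

DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{4}$")  # MM/DD/YYYY

def has_unix_line_endings(path):
    # Templates should be saved with Unix (\n) line endings only.
    with open(path, "rb") as f:
        return b"\r" not in f.read()

def is_valid_date(value):
    return bool(DATE_RE.match(value))

def is_relative_path(value):
    # File paths must not start with "/".
    return not value.startswith("/")

# Example scan of a sessions file; substitute the real template headers.
path = "sessions_template.csv"
if not has_unix_line_endings(path):
    print("convert %s to Unix line endings" % path)
with open(path) as f:
    for row in csv.DictReader(f):
        if "date" in row and not is_valid_date(row["date"]):
            print("bad date: %r" % row["date"])
        if "file_1" in row and not is_relative_path(row["file_1"]):
            print("leading slash in path: %r" % row["file_1"])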
JSON Schema file which defines constraints, datatypes, accepted values, and JSON structure for metadata to be ingested into Databrary. Each ingest is validated against this schema before being written to the Databrary database. The official version is here.
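If you want to check an ingest file locally before uploading, the jsonschema package can validate it against the schema; this is a sketch under the assumption that both files are plain JSON on disk, with placeholder paths (the server performs its own validation regardless):

import json
import jsonschema  # pip install jsonschema

# Placeholder paths: the ingest file is whatever csv2json.py wrote to /output,
# and volume.json lives wherever the schema is kept in your checkout.
with open("output/bergtest.json") as f:
    ingest = json.load(f)
with open("spec/volume.json") as f:
    schema = json.load(f)

# Raises jsonschema.exceptions.ValidationError with details on failure.
jsonschema.validate(ingest, schema)
print("ingest JSON conforms to the schema")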
trimOpf.py: Script that trims OPF files found in an ingest JSON file according to the onset and offset of assets belonging to the same container. If a volume ID, username, and password for a Databrary volume are provided, the script will attempt to upload the trimmed OPF file.
- Usage:
python trimOpf.py PATH_TO_JSON_FILE -c COLUMNS_TO_EDIT
- You can also use the script to trim a single OPF file:
python trimOpf.py PATH_TO_OPF -f opf -on ONSET_IN_MS -off OFFSET_IN_MS -c COLUMNS_TO_EDIT
- OPF trim with upload:
python trimOpf.py PATH_TO_OPF -v VOLUME_ID -f opf -on ONSET_IN_MS -off OFFSET_IN_MS -c COLUMNS_TO_EDIT
- Note: if the columns list is not specified, the script will consider all columns in the OPF spreadsheet
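Conceptually, trimming keeps only the cells that overlap the requested window and clamps their times to it; the sketch below illustrates that idea on plain (onset, offset, value) tuples and is not how trimOpf.py actually parses the OPF format:

def trim_cells(cells, onset_ms, offset_ms):
    # Keep cells that overlap [onset_ms, offset_ms] and clamp them to it.
    # cells is a list of (onset, offset, value) tuples in milliseconds; this
    # is an illustrative data shape, not the real OPF representation.
    trimmed = []
    for start, end, value in cells:
        if end < onset_ms or start > offset_ms:
            continue  # cell lies entirely outside the window
        trimmed.append((max(start, onset_ms), min(end, offset_ms), value))
    return trimmed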
prepareCSV.py: Script that can be used to download a volume's metadata in CSV format and build paths to the files located on the server. The script generates an SQL query that needs to be run on the Databrary server prior to the ingest. Generated files will be found in the input folder.
- Usage: Download and generate sessions and participants files:
python prepareCSV.py -s SOURCE_VOLUME -t TARGET_VOLUME
- If you have your own curated CSV file and would like to use the script (skipping the download phase), add the -f [--file] argument:
python prepareCSV.py -f FILE_PATH -s SOURCE_VOLUME -t TARGET_VOLUME
csv2json.py: This is the main ingest script. It takes the session and/or participant .csv files for any given dataset and converts them into a .json file (located in the /output folder), which can then be uploaded to https://nyu.databrary.org/volume/{VOLUME_ID}/ingest to start the ingest process. Select Run to run the ingest, leave both check boxes blank to only check the JSON, and select Overwrite to overwrite existing session data.
- Usage (traditional ingest - pre-assisted curation):
python csv2json.py -s {path to session csv file} -p {path to participant csv file} -f {output JSON name} -n {Full name of volume on Databrary (must match)}
Example: Users-MacBook-Pro:scripts user$ python csv2json.py -s /temp/sessions_template_test.csv -p /temp/participants_template_test.csv -f bergtest -n "ACLEW Project"
- Usage (assisted curation):
python csv2json.py -a -s {path to session csv file} -p {path to participant csv file} -f {output JSON name} -n {Full name of volume on Databrary (must match)}
Note: the participant file is optional if you only want to add session metadata. However, you cannot have ParticipantID in the session file if you are omitting a participant file.
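At its core the conversion is a join of participant rows onto session rows, nested into one JSON document per volume; the sketch below is a rough illustration only, since the real output structure is defined by the ingest schema and the headers in fields.py (the paths and the ParticipantID header are taken from the examples above):

import csv
import json

def load_rows(path):
    with open(path) as f:
        return list(csv.DictReader(f))

sessions = load_rows("temp/sessions_template_test.csv")
participants = load_rows("temp/participants_template_test.csv")

# Index participant rows by ParticipantID so each session row can pull in
# its matching participant record.
by_id = dict((p.get("ParticipantID"), p) for p in participants)

volume = {"name": "ACLEW Project", "sessions": []}
for s in sessions:
    entry = dict(s)
    pid = s.get("ParticipantID")
    if pid in by_id:
        entry["participant"] = by_id[pid]
    volume["sessions"].append(entry)

with open("output/bergtest.json", "w") as f:
    json.dump(volume, f, indent=2)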
assisted.py: Script that can be used to pull rows related to assisted curation uploads from an instance of the Databrary database. Currently does not connect to production.
make_templates.py: run in order to generate templates in $CURATION_HOME/spec/templates
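A hedged sketch of what that generation amounts to, assuming fields.py exposes ordered lists of column names (SESSION_FIELDS and PARTICIPANT_FIELDS are placeholder attribute names, not necessarily the ones fields.py defines):

import csv
import fields  # the spreadsheet header definitions described below

def write_template(path, headers):
    # Write a one-row CSV containing only the column headers.
    with open(path, "w") as f:
        csv.writer(f).writerow(headers)

write_template("spec/templates/sessions_template.csv", fields.SESSION_FIELDS)
write_template("spec/templates/participants_template.csv", fields.PARTICIPANT_FIELDS)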
various scripts for supporting ingest and curation operations
./openproject/update.py: this script will pull all new volumes into our OpenProject tracker.
- Usage:
- Enter the Python virtual environment:
source ~/curation/tools/scripts/venv2/bin/activate
- ssh to www (which should be port forwarded)
- In ~/curation/tools/scripts, run
python -m utils.openproject.update
to see which new volumes will be added, and
python -m utils.openproject.update -r
to add those new volumes to OpenProject
csv_helpers.py: some helpful functions for routine CSV operations in preparing an ingest
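Two examples of the kind of routine operations such helpers cover (the function names here are illustrative, not necessarily the ones csv_helpers.py defines):

from datetime import datetime

def pad_id(value, width=3):
    # Zero-pad a text ID so "7" becomes "007" (see the template checklist above).
    return str(value).zfill(width)

def to_mmddyyyy(value, source_format="%Y-%m-%d"):
    # Reformat a date string into the MM/DD/YYYY format the templates expect.
    return datetime.strptime(value, source_format).strftime("%m/%d/%Y")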
dbclient.py: db client module for connecting to an instance of a database
fields.py: module for all spreadsheet headers for Databrary ingest spreadsheets. Used to generate template spreadsheets
./videos: a few scripts for checking video integrity
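One common way to do such a check is to ask ffmpeg to decode the file and report any errors; the sketch below uses the standard ffmpeg CLI and is a general technique, not necessarily what the scripts in ./videos do:

import subprocess

def video_has_errors(path):
    # Decode the whole file to the null muxer and capture anything ffmpeg
    # prints at the "error" log level; a clean file produces no output.
    proc = subprocess.Popen(
        ["ffmpeg", "-v", "error", "-i", path, "-f", "null", "-"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    _, stderr = proc.communicate()
    return proc.returncode != 0 or bool(stderr.strip())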
./analysis: mostly one-off scripts for various projects integrating with Databrary.