A Django web app for viewing and validating Cree recordings done in Maskwacîs.
Requires Python 3.10, Pipenv, and ffmpeg.
To install dependencies and set up for development (e.g., on your laptop):

    make install-dev

This will also set up the git pre-commit hook.
You will also need:

- `crk.zhfst`, from building `lang-crk` with `--enable-spellers`
- `private/metadata.csv`, from downloading the “Master Recordings MetaData” document on Google Drive as CSV
- some sessions from `sapir:/data/av/backup-mwe/sessions` in `/data/sessions`; `2015-03-23-__-___-_` (6 GB) is known to work.
NOTE: Before you continue, you may want to run

    export PIPENV_VENV_IN_PROJECT=True

so that you can find the virtual environment later!
To install dependencies and set up for production (e.g., on the server):

    make install-prod

Find configuration files and templates in `private/`.
To install dependencies and set up for testing (e.g., on Travis-CI):

    make install-test
The app needs to know:

- where the raw recording sessions are
- where to find the "Recordings Master MetaData" CSV file
- where to place transcoded audio files
- where to place the SQLite3 database
- [production-only] where to collect (i.e., copy) static files

These are configured in a file called `.env` in the root of this repository.
NOTE: `make install-*` should have created a `.env` for you to append to.

The additional configuration should follow this template:

    RECVAL_SESSIONS_DIR=/absolute/path/to/recording/sessions/
    RECVAL_METADATA_PATH=/absolute/path/to/master-recordings-metadata.csv
    RECVAL_SQLITE_DB_PATH=/absolute/path/to/sqlite3/database.sqlite3
    STATIC_ROOT=/absolute/path/to/static/files/directory/  # (production-only)
    [email protected]
    SMTP_PASS="password"

Replace the paths as appropriate.
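
If you want to double-check those values before going further, a quick sanity check along these lines can help. This is only a convenience sketch: it parses plain `KEY=VALUE` lines itself (the app reads `.env` through its own settings machinery), and the variable names are the ones from the template above.

```python
# check_env.py -- quick sanity check of the .env template above (illustrative only)
from pathlib import Path


def parse_env(env_file: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring comments and blank lines."""
    values: dict[str, str] = {}
    for raw in env_file.read_text().splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"')
    return values


env = parse_env(Path(".env"))
for var in ("RECVAL_SESSIONS_DIR", "RECVAL_METADATA_PATH"):
    path = env.get(var, "")
    status = "ok" if path and Path(path).exists() else "MISSING"
    print(f"{var}: {path or '(not set)'} [{status}]")

# RECVAL_SQLITE_DB_PATH may not exist yet; `migrate` creates it later.
print("RECVAL_SQLITE_DB_PATH:", env.get("RECVAL_SQLITE_DB_PATH", "(not set)"))
```
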
`RECVAL_SESSIONS_DIR` points to the directory containing all of the recording sessions. Each entry in `RECVAL_SESSIONS_DIR/` is a subdirectory named in the normalized session format. Each subdirectory contains the `.eaf` and `.wav` files of the recording session.
For example, if I have sessions for 2018-01-01 and 2018-01-07am in `/data/av/sessions`, then I will have the following line in `.env`:

    RECVAL_SESSIONS_DIR=/data/av/sessions

And if I run the following commands, I should see the directory listing for my sessions directory:

    . .env
    ls -F $RECVAL_SESSIONS_DIR
    2018-01-01-__-___-_/
    2018-01-07-AM-___-_/
If I then inspect the directory for 2018-01-07am:

    ls -F $RECVAL_SESSIONS_DIR/2018-01-07-AM-___-_/

I should get a directory containing files like this:

    Track 2_001.eaf
    Track 2_001.wav
    Track 3_001.eaf
    Track 3_001.wav
    Track 4_001.eaf
    Track 4_001.wav
    ...
NOTE: The `*.wav` files may be in a subdirectory called `${SESSION_NAME}_Recorded` if the recordings were done with Adobe Audition. In this example, the `.wav` files would be in `2018-01-07am_Recorded`.
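
For instance, a helper along these lines will find a session's `.wav` files whether or not the recordings went through Adobe Audition. This is just a sketch for orientation; `librecval` has its own extraction logic.

```python
from pathlib import Path


def find_wav_files(session_dir: Path) -> list[Path]:
    """Return the session's .wav files, looking both in the session directory
    itself and in any Adobe Audition '*_Recorded' subdirectory."""
    wavs = sorted(session_dir.glob("*.wav"))
    for recorded_dir in session_dir.glob("*_Recorded"):
        wavs.extend(sorted(recorded_dir.glob("*.wav")))
    return wavs


# Example (path is hypothetical):
print(find_wav_files(Path("/data/av/sessions/2018-01-07-AM-___-_")))
```
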
`RECVAL_METADATA_PATH` should point to the "Master Recording MetaData" file, obtained from Google Drive and downloaded as a CSV file. This file is explained more thoroughly in "Creating the database for the first time".
`RECVAL_SQLITE_DB_PATH` is the absolute path to the SQLite3 database, where all the validation data and user data will be stored. This file is read from and written to a lot. It should not become terribly large, however; at most, it might grow to 1 GiB, but I expect it to stay much, much lower.
Make sure this path is configured before running:

    pipenv run python ./manage.py migrate

Note: this does not need to be configured in development mode.
`STATIC_ROOT` is where to collect (i.e., copy) static assets. This is needed to place CSS and JavaScript in the right place so that the static web server can find them.

Set this to a path that your web server can... well, serve from! Whenever you change any static files, or update Django, remember to run:

    python manage.py collectstatic

This will copy all of the various static files to the configured directory.

See more: https://docs.djangoproject.com/en/2.1/ref/settings/#std:setting-STATIC_ROOT
`SMTP_USER` is the email address used to contact admins in certain scenarios. You may have to ask someone for this. `SMTP_PASS` is the plain-text password for `SMTP_USER`.
Before you import any data, you need the "Master Recording MetaData" (sic) spreadsheet, available on Google Drive. This should have been downloaded using the `./init` script from earlier, but if you still don't have it, download it manually as a CSV file and save it as `$RECVAL_METADATA_PATH`.
Then, initialize the database using the following command:

    pipenv run python manage.py migrate

This will create all the necessary tables in the SQLite3 database.
Some speakers have pre-saved data associated with them. To load this data into the database, run:

    pipenv run python manage.py loaddata speaker_info

This will load the speaker information found in `validation/management/fixtures/`.
In order to import recordings on Sapir, type the following:

    pipenv run python manage.py importrecordings

This will automatically scan `$RECVAL_SESSIONS_DIR/` (defined in `.env`). `$RECVAL_SESSIONS_DIR/` should be a directory filled with directories (or symbolic links to directories) with filenames in the form of:

    {ISOdate}-{AM/PM}-{Location or '___'}-{Subsession or '_'}
For example, on Sapir, I might have something like this:

    $ export RECVAL_SESSIONS_DIR=/data/av/backup-mwe/sessions
    $ ls -1 $RECVAL_SESSIONS_DIR
    2015-05-08-AM-___-_
    2016-05-16-PM-___-_
    2016-11-28-AM-US-2
    2017-05-25-AM-DS-2
    2018-04-18-AM-KCH-_
    ...
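
If you want to check whether a directory name follows this normalized format, a regular expression along these lines matches the examples above. This is only a sketch; the canonical parsing lives in `librecval`.

```python
import re

# ISO date, AM/PM (or '__'), a short location code (or '___'), and a
# numeric subsession (or '_'), joined by hyphens.
SESSION_DIR_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"
    r"-(?P<time_of_day>AM|PM|__)"
    r"-(?P<location>[A-Z]{2,3}|___)"
    r"-(?P<subsession>\d+|_)$"
)

for name in ("2015-05-08-AM-___-_", "2016-11-28-AM-US-2", "2018-01-01-__-___-_"):
    match = SESSION_DIR_PATTERN.match(name)
    print(name, "->", match.groupdict() if match else "no match")
```
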
Each directory should have `*.eaf` files paired with a `*.wav` file:

    $ ls -F1 $RECVAL_SESSIONS_DIR/2015-05-08-AM-___-_/
    2015-05-08-01.eaf
    2015-05-08-01.wav
    2015-05-08-02.eaf
    2015-05-08-02.wav
    2015-05-08-03.eaf
    2015-05-08-03.wav
    ...
Some entries have very close spellings that can be auto-validated. This only needs to be done once, after the initial data import. Auto-validate entries by running:

    python manage.py autoval
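
The exact matching criteria are whatever the `autoval` command implements; the general idea is that transcriptions differing only in trivial ways can be accepted automatically. Purely as an illustration of that idea (not the actual implementation):

```python
# Illustration only: one possible notion of "very close spellings".
def trivially_equal(a: str, b: str) -> bool:
    """Treat transcriptions as equal if they differ only in case,
    surrounding whitespace, or hyphenation."""
    def normalize(s: str) -> str:
        return s.strip().lower().replace("-", "")
    return normalize(a) == normalize(b)


print(trivially_equal("Kiskisiw", "kiskisiw"))    # True
print(trivially_equal("nikiskisin", "kiskisiw"))  # False
```
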
NOTE: this is not relevant when in development mode or when `DEBUG=True`.

On the production server, you must copy all of the static files to a single folder where the web server (e.g., Apache, Nginx) can serve them without even consulting the Django app.

To copy the files, ensure `STATIC_ROOT` is configured properly, then run:

    pipenv run python manage.py collectstatic
To access the admin panel, you'll need at least one admin user. To create one, run the following command and follow the prompts:

    pipenv run python manage.py createsuperuser

For more info, see Django's documentation on `createsuperuser`.
To run a development server, use the following:

    pipenv run python manage.py runserver

The main site should be available at http://localhost:8000/. The admin interface should be available at http://localhost:8000/admin.

The server can also be run through Docker on a machine with Docker installed:

    docker-compose up --build
To run the tests, type:

    make test

This will run the mypy static type checker, and then pytest.

To edit the tests for `librecval`, see `tests/`. For Django tests, look inside the `validation/tests` directory.
There are also Cypress integration tests that can be run with:

    npx cypress open

This opens the interactive Cypress testing window. Select "Run integration specs" to run all tests.

To run Cypress tests against a test database, use:

    make integration-test

This will call `ci_run_cypress.py` and create a test DB for you!

Note: DO NOT try to run Cypress tests on production: this will validate some entries due to the nature of the tests.
There are two API calls. Both API calls return a list of matched recordings, with each recording having the following name/value pairs:

- `wordform`: the wordform that the recording corresponds to. Internally, this corresponds to `Phrase.transcription`.
- `speaker`: the speaker's short code (e.g., something like `"ROS"` or `"JER"`)
- `speaker_name`: the speaker's full name
- `gender`: the speaker's gender, either `"M"` or `"F"` (all our speakers identify as male or female)
- `dialect`: the region of Cree that this person speaks
- `recording_url`: absolute URI to the audio, encoded as AAC in an MP4 container (a `*.m4a` file). This can be used in an `<audio>` tag.
- `speaker_bio_url`: absolute URI to the speaker's biography.
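
For reference, the shape of each recording object can be written down like this in Python. This is a sketch of the documented fields (plus the `anonymous` flag that appears in the example responses below), not code from this repository.

```python
from typing import TypedDict


class RecordingResult(TypedDict):
    """One matched recording, as returned by the API."""
    wordform: str          # matches Phrase.transcription exactly
    speaker: str           # short code, e.g. "ROS" or "JER"
    speaker_name: str      # the speaker's full name
    gender: str            # "M" or "F"
    dialect: str           # region of Cree the speaker speaks
    recording_url: str     # AAC audio in an MP4 container (*.m4a)
    speaker_bio_url: str   # link to the speaker's biography
    anonymous: bool        # present in responses; see examples below
```
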
URL:

    /api/bulk_search?q={wordform-1}&q={wordform-2}&...&q={wordform-n}

Here, `{wordform-i}` is a wordform that you want to get recordings for. The spelling must exactly match `Phrase.transcription`. You can search for an arbitrary number of wordforms; however, note that services may not process large requests (see "Errors" below).
The response is an object with two name/value pairs:

    {
      "matched_recordings": [...],
      "not_found": [...]
    }

- `matched_recordings` will be an array of recording results (see above).
- `not_found` is an array of strings; each string is a wordform that has no recordings.
speech-db will not explicitly issue errors; however, application servers and proxies may drop or reject large requests. If speech-db is running under an application server (e.g., uwsgi), you may get 502 Bad Gateway errors when too many wordforms are requested. The fix is to request fewer wordforms (possibly by batching requests).
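
For example, a client could batch its queries along these lines. This is a sketch using the `requests` library; the base URL and the batch size of 30 are assumptions, not documented limits.

```python
import requests

BASE_URL = "https://speech-db.altlab.app"  # adjust to your deployment


def bulk_search(wordforms: list[str], batch_size: int = 30) -> dict:
    """Query /api/bulk_search in batches, merging the responses."""
    merged = {"matched_recordings": [], "not_found": []}
    for start in range(0, len(wordforms), batch_size):
        batch = wordforms[start:start + batch_size]
        # requests encodes a list value as repeated parameters:
        # /api/bulk_search?q=...&q=...&q=...
        response = requests.get(f"{BASE_URL}/api/bulk_search", params={"q": batch})
        response.raise_for_status()
        payload = response.json()
        merged["matched_recordings"].extend(payload["matched_recordings"])
        merged["not_found"].extend(payload["not_found"])
    return merged


results = bulk_search(["kiskisiw", "nikiskisin", "fhqwhgads"])
print(len(results["matched_recordings"]), "recordings;", results["not_found"], "not found")
```
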
Searching for kiskisiw, nikiskisin, and fhqwhgads (a non-word):

    GET /api/bulk_search?q=kiskisiw&q=nikiskisin&q=fhqwhgads HTTP/1.1

Result:
{
"matched_recordings": [
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "F",
"recording_url": "http://speech-db.altlab.app/media/audio/a7714df80b4e3d404a0ff9e6d743221285f228721c922262f9036f67e26d5edc_dkoSdSS.m4a",
"speaker": "ANN",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/ANN.html",
"speaker_name": "Annette Lee",
"wordform": "kiskisiw"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "F",
"recording_url": "http://speech-db.altlab.app/media/audio/8dc6c2b21a597130366e57a6a2b4b35806493e77e91b7a19563dad786752c58d_Udjqiuq.m4a",
"speaker": "ANN",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/ANN.html",
"speaker_name": "Annette Lee",
"wordform": "kiskisiw"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "M",
"recording_url": "http://speech-db.altlab.app/media/audio/d6291252925f447789e55ccc3f19c46c66ec2cbefacb2090224dec4ff01766bb_ojSf0sB.m4a",
"speaker": "JER",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/JER.html",
"speaker_name": "Jerry Roasting",
"wordform": "kiskisiw"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "M",
"recording_url": "http://speech-db.altlab.app/media/audio/627382176984bbe1dc97ce241327ac28919bf9ffa0be7dc7502fd1fb2f7b25f5_GOlSlBM.m4a",
"speaker": "JER",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/JER.html",
"speaker_name": "Jerry Roasting",
"wordform": "kiskisiw"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "F",
"recording_url": "http://speech-db.altlab.app/media/audio/2580def01f7b6808d88bb23dadf169a2f9f474cbdb7f46bb267b5653a013b459_rwd7jQ0.m4a",
"speaker": "MAR",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/MAR.html",
"speaker_name": "Mary-Jean Littlechild",
"wordform": "kiskisiw"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "F",
"recording_url": "http://speech-db.altlab.app/media/audio/acf9b0d65a7a3e72f67ca6cecc5c2f4e6160f558d8724bc53d18618b096b533f_hZ02vZp.m4a",
"speaker": "MAR",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/MAR.html",
"speaker_name": "Mary-Jean Littlechild",
"wordform": "kiskisiw"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "F",
"recording_url": "http://speech-db.altlab.app/media/audio/2198ff134107ff474115cdb48ee36f88c17168e215fe424da4cc414bab0f4582_MGuwvND.m4a",
"speaker": "LOU",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/LOU.html",
"speaker_name": "Louise Wildcat",
"wordform": "nikiskisin"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "F",
"recording_url": "http://speech-db.altlab.app/media/audio/e8dd10846601861c1e9c4edf944b1e4da7670ed669fe027bd0b27e0af954e031_jIrtglj.m4a",
"speaker": "LOU",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/LOU.html",
"speaker_name": "Louise Wildcat",
"wordform": "nikiskisin"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "M",
"recording_url": "http://speech-db.altlab.app/media/audio/1576034645a6b765ab40275a67390fa197ee3c7ed8cc949907d132df3c0f1c9e_ybXLQLW.m4a",
"speaker": "GOR",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/GOR.html",
"speaker_name": "kîsikâw",
"wordform": "nikiskisin"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "M",
"recording_url": "http://speech-db.altlab.app/media/audio/a86939d888ff908d091538b2c1a3dc4fb383167b781b5dddddd4253342b0dace_SIlbUPl.m4a",
"speaker": "GOR",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/GOR.html",
"speaker_name": "kîsikâw",
"wordform": "nikiskisin"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "F",
"recording_url": "http://speech-db.altlab.app/media/audio/7353dda3d48799325ee62de0eceb4b50839382cfcf0ebf96c70d84fd37881201_PYYtyur.m4a",
"speaker": "ROS",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/ROS.html",
"speaker_name": "Rose Makinaw",
"wordform": "nikiskisin"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "F",
"recording_url": "http://speech-db.altlab.app/media/audio/dac08c374354594b7a77195daa4fd2e12a88acc5acf4e4894dd612354c1e7a92_XnZV0UY.m4a",
"speaker": "ROS",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/ROS.html",
"speaker_name": "Rose Makinaw",
"wordform": "nikiskisin"
}
],
"not_found": [
"fhqwhgads"
]
}
URL:

    /recording/_search/{query}

Here, `{query}` is replaced by one or more Cree wordforms, written in SRO. If more than one wordform is supplied, the wordforms must be separated by a single comma (`,`).
This will return a JSON array of recordings of those wordforms, in the format documented above.

It will respond with HTTP 404 if no recordings match the query. Note that a recording's speaker must have a non-null gender before the recording appears in search results, so make sure all `Speaker` instances have a non-null value for `gender`!
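
A call to this endpoint might look like the following. This is a sketch using the `requests` library; the base URL is an assumption about your deployment.

```python
import requests

BASE_URL = "https://speech-db.altlab.app"  # adjust to your deployment


def search_recordings(*wordforms: str) -> list[dict]:
    """GET /recording/_search/{query}, where the query is a comma-separated
    list of wordforms. Returns an empty list when nothing matches (HTTP 404)."""
    query = ",".join(wordforms)
    response = requests.get(f"{BASE_URL}/recording/_search/{query}")
    if response.status_code == 404:
        return []
    response.raise_for_status()
    return response.json()


for recording in search_recordings("nikiskisin", "kiskisiw"):
    print(recording["speaker"], recording["wordform"], recording["recording_url"])
```
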
Finding recordings of 'nikiskisin' and 'kiskisiw':

    GET /recording/_search/nikiskisin,kiskisiw HTTP/1.1

Assuming there are six recordings for `nikiskisin` and none for `kiskisiw`, this will return:
[
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "F",
"recording_url": "http://sapir.artsrn.ualberta.ca/validation/recording/7353dda3d48799325ee62de0eceb4b50839382cfcf0ebf96c70d84fd37881201.m4a",
"speaker": "ROS",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/ROS.html",
"speaker_name": "Rose Makinaw",
"wordform": "nikiskisin"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "F",
"recording_url": "http://sapir.artsrn.ualberta.ca/validation/recording/dac08c374354594b7a77195daa4fd2e12a88acc5acf4e4894dd612354c1e7a92.m4a",
"speaker": "ROS",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/ROS.html",
"speaker_name": "Rose Makinaw",
"wordform": "nikiskisin"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "M",
"recording_url": "http://sapir.artsrn.ualberta.ca/validation/recording/1576034645a6b765ab40275a67390fa197ee3c7ed8cc949907d132df3c0f1c9e.m4a",
"speaker": "GOR",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/GOR.html",
"speaker_name": "kîsikâw",
"wordform": "nikiskisin"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "M",
"recording_url": "http://sapir.artsrn.ualberta.ca/validation/recording/a86939d888ff908d091538b2c1a3dc4fb383167b781b5dddddd4253342b0dace.m4a",
"speaker": "GOR",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/GOR.html",
"speaker_name": "kîsikâw",
"wordform": "nikiskisin"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "F",
"recording_url": "http://sapir.artsrn.ualberta.ca/validation/recording/2198ff134107ff474115cdb48ee36f88c17168e215fe424da4cc414bab0f4582.m4a",
"speaker": "LOU",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/LOU.html",
"speaker_name": "Louise Wildcat",
"wordform": "nikiskisin"
},
{
"anonymous": false,
"dialect": "Maskwacîs",
"gender": "F",
"recording_url": "http://sapir.artsrn.ualberta.ca/validation/recording/e8dd10846601861c1e9c4edf944b1e4da7670ed669fe027bd0b27e0af954e031.m4a",
"speaker": "LOU",
"speaker_bio_url": "https://www.altlab.dev/maskwacis/Speakers/LOU.html",
"speaker_name": "Louise Wildcat",
"wordform": "nikiskisin"
}
]
These are necessary in order to run Persephone and Simple4All against the current set of recordings.

The following steps should be performed from within the `pipenv shell`.

First, populate the database by running:

    python manage.py migrate
    python manage.py importrecordings

Then generate the `.wav` files:

    python manage.py importrecordings --wav --skip-db

This saves all the recording snippets to the `./audio` directory, unless otherwise specified (not recommended for this task).
Now auto-validate the recordings (you can skip this step if you just want the transcription files for the raw data):

    python manage.py autoval

Next, create the transcription files by running:

    python manage.py writetranscriptions

This should create:

- a new folder for each speaker code, e.g., `./audio/LOU`
- a copy of each `.wav` file in the speaker folder, e.g., `./audio/LOU/wav/audio_id.wav`
- transcription files to be used by Persephone, e.g., `./audio/LOU/label/audio_id.txt`
- transcription files to be used by Simple4All, e.g., `./audio/LOU/s4a/audio_id.txt`

If there are auto-validated entries in the database, it will also create:

- transcription files for the auto-validated data to be used by Persephone, e.g., `./audio/LOU/auto-val/label/audio_id.txt`
- transcription files for the auto-validated data to be used by Simple4All, e.g., `./audio/LOU/auto-val/audio_id.txt`
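
To eyeball the result, something like this will summarize what was written per speaker. It's a convenience sketch that assumes the default `./audio` output directory and the layout listed above.

```python
from pathlib import Path

audio_root = Path("./audio")
for speaker_dir in sorted(p for p in audio_root.iterdir() if p.is_dir()):
    wavs = len(list((speaker_dir / "wav").glob("*.wav")))
    labels = len(list((speaker_dir / "label").glob("*.txt")))
    s4a = len(list((speaker_dir / "s4a").glob("*.txt")))
    print(f"{speaker_dir.name}: {wavs} wav, {labels} Persephone labels, {s4a} Simple4All files")
```
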
It's short for "Recordings validation". I didn't want to type that out all the time, and I didn't want to put spaces, hyphens, or underscores in the name when referring to it.
`librecval` is a library for extracting recordings, transcoding recordings to a normalized format, and combining the recordings with metadata, as well as representing recording data in a framework-agnostic way. This library should be devoid of references to, or dependencies on, any web framework or database backend.
It's the Django project for the recording validation site. This aggregates all of the Django apps under one deployable website.
Copyright (C) 2021-2024 Eddie Antonio Santos [email protected], Jolene Poulin [email protected], Felipe Bañados Schwerter [email protected]
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.