initial script for automating the creation of a controlled testing en… #2057

Open
wants to merge 8 commits into main
71 changes: 71 additions & 0 deletions create_oovs.sh
@@ -0,0 +1,71 @@
#!/bin/bash
set -e

stag=1
Collaborator:
unused?


if [ $# -eq 0 ]; then
echo "\n This script prepares a controlled testing environment for OOV handling."
echo -e "\n Usage: \n $0 <data.csv> \n"
exit 1
fi

step=1
Collaborator:
I guess this shouldn't be hardcoded? Or is it meant as a development tool, so you iterate on the parts as you get them working?
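One possible sketch of making it configurable (STEP here is an assumed, illustrative environment variable, not something the script defines):

step=${STEP:-1}   # start from step 1 unless the caller overrides STEP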

data=$1
nj=$(nproc)
mkdir -p tmp/
mkdir -p tmp/lm
mkdir -p tmp/results

# Data preparation: split the vocabulary into 10% (which will later represent OOVs)
# and the remaining 90%, used to compose a corpus for LM generation
echo "Step 1: Preparing Data"
if [ $step -le 1 ]; then

# Extract corpus unique vocabularies
Collaborator:
Suggested change
-# Extract corpus unique vocabularies
+# Extract corpus vocabulary (unique words)

xsv select transcript $data > tmp/data.txt
Collaborator:
xsv select will output a header line, which means the word "transcript" will end up in the extracted vocabulary.
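A minimal sketch of dropping that header before building the vocabulary (assuming piping through tail -n +2 is acceptable here):

# skip the CSV header row emitted by xsv select
xsv select transcript "$data" | tail -n +2 > tmp/data.txt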

sed 's/ /\n/g' tmp/data.txt | sort | uniq -c | sort -nr > tmp/vocab.txt
grep -o . tmp/vocab.txt | sort -u > tmp/alphabet.txt

# Pick the least frequent 10% vocabularies to represent OOVs
Collaborator:
Suggested change
-# Pick the least frequent 10% vocabularies to represent OOVs
+# Pick the least frequent 10% words to build OOV set

oov_count=$(wc tmp/vocab.txt | awk '{print int($0*0.1)}')
Collaborator:
Use wc -l so the intent (counting lines) is communicated up front.

Suggested change
-oov_count=$(wc tmp/vocab.txt | awk '{print int($0*0.1)}')
+oov_count=$(wc -l tmp/vocab.txt | awk '{print int($0*0.1)}')

Collaborator:
Size of OOV set should be a parameter (with default).
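A sketch of what that could look like, assuming an optional second positional argument with a 10% default (the names are illustrative):

# fraction of the vocabulary to treat as OOV; defaults to 10%
oov_fraction=${2:-0.1}
oov_count=$(wc -l < tmp/vocab.txt | awk -v f="$oov_fraction" '{print int($0*f)}')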

tail -$oov_count tmp/vocab.txt | awk '{print $2}'> tmp/oov_words
grep -wFf tmp/oov_words tmp/data.txt > tmp/oov_sents
Collaborator:
The grep manual says empty lines in the pattern file match every input line, so we should either make sure the input CSV doesn't contain any empty transcripts or remove them before this step.
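One way to guard against that, as a sketch: strip empty lines from the pattern file before it is used.

# keep only non-empty patterns so grep -wFf cannot match every line
grep . tmp/oov_words > tmp/oov_words.clean
grep -wFf tmp/oov_words.clean tmp/data.txt > tmp/oov_sents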


# Exclude OOVs from the text corpus
grep -vf tmp/oov_sents tmp/data.txt > tmp/scorer_corpus.txt
gzip -c tmp/scorer_corpus.txt > tmp/scorer_corpus.txt.gz
Collaborator:
FWIW: there is no need to gzip scorer-creation corpora; it's an optional feature.
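If the gzip step were dropped, the plain-text corpus could be fed straight to the LM generation step below, assuming generate_lm.py accepts an uncompressed file (which the comment above implies); only the input path changes:

python3 data/lm/generate_lm.py --input_txt tmp/scorer_corpus.txt \
    --output_dir tmp/lm --top_k 500000 --kenlm_bins kenlm/build/bin \
    --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
    --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback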

grep -vf tmp/oov_sents $data > tmp/scorer_corpus.csv

# Prepare OOV csv for testing purposes (to assess imporvements on it)
Collaborator:
Suggested change
-# Prepare OOV csv for testing purposes (to assess imporvements on it)
+# Prepare OOV CSV for testing purposes (to assess improvements on it)

grep -wFf tmp/oov_sents tmp/data.txt > tmp/oov_corpus.txt
grep -wFf tmp/oov_sents $data | sed '1 i\wav_filename,wav_filesize,transcript' > tmp/oov_corpus.csv
Collaborator:
This sed command doesn't work on macOS:

sed: 1: "1 i\wav_filename,wav_fi ...": extra characters after \ at the end of i command

Can we make it portable to BSD sed? This fix worked for me:

Suggested change
-grep -wFf tmp/oov_sents $data | sed '1 i\wav_filename,wav_filesize,transcript' > tmp/oov_corpus.csv
+echo "wav_filename,wav_filesize,transcript" > tmp/oov_corpus.csv
+grep -wFf tmp/oov_sents $data >> tmp/oov_corpus.csv


fi

# Generate LM
echo "Step 2: Generaing Language Model"
Collaborator:
Suggested change
-echo "Step 2: Generaing Language Model"
+echo "Step 2: Generating Language Model"

if [ $step -le 2 ]; then
python3 data/lm/generate_lm.py --input_txt tmp/scorer_corpus.txt.gz \
--output_dir tmp/lm --top_k 500000 --kenlm_bins kenlm/build/bin \
--arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
--binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback

./native_client/generate_scorer_package --alphabet tmp/alphabet.txt \
--lm tmp/lm/lm.binary --vocab tmp/lm/vocab-500000.txt \
--package kenlm.scorer --default_alpha 0.931289039105002 \
--default_beta 1.1834137581510284
fi

# Evaluate
echo "Step 3: Evaluating using scorer"
if [ $step -le 3 ]; then
echo "Evaluating on OOV testing set."
python -m coqui_stt_training.evaluate --test_files tmp/oov_corpus.csv \
--test_output_file tmp/results/oov_results.json --scorer_path native_client/kenlm.scorer \
Collaborator:
The native_client/kenlm.scorer path should be kenlm.scorer, according to the command in the step above, right? And it should probably be changed to tmp/kenlm.scorer to keep all of the script's outputs contained in that folder.
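A sketch of keeping the two steps consistent, with tmp/kenlm.scorer as an assumed location: step 2 would package the scorer into tmp/,

./native_client/generate_scorer_package --alphabet tmp/alphabet.txt \
    --lm tmp/lm/lm.binary --vocab tmp/lm/vocab-500000.txt \
    --package tmp/kenlm.scorer --default_alpha 0.931289039105002 \
    --default_beta 1.1834137581510284

and step 3 would then pass --scorer_path tmp/kenlm.scorer to both evaluate calls.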

--checkpoint_dir /home/aya/work/tmp/AM/coqui-stt-1.1.0-checkpoint --test_batch_size $nj
Collaborator:
Checkpoint path should be made into a parameter.
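A sketch of one way to do that, assuming an optional CHECKPOINT_DIR environment variable that defaults to the current path:

# near the top of the script: let the caller override the checkpoint location
checkpoint_dir=${CHECKPOINT_DIR:-/home/aya/work/tmp/AM/coqui-stt-1.1.0-checkpoint}

# the evaluate calls in step 3 then become:
python -m coqui_stt_training.evaluate --test_files tmp/oov_corpus.csv \
    --test_output_file tmp/results/oov_results.json --scorer_path native_client/kenlm.scorer \
    --checkpoint_dir "$checkpoint_dir" --test_batch_size $nj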


echo "Evaluating on original testing set."
python -m coqui_stt_training.evaluate --test_files tmp/scorer_corpus.csv \
--test_output_file tmp/results/samples.json --scorer_path native_client/kenlm.scorer \
--checkpoint_dir /home/aya/work/tmp/AM/coqui-stt-1.1.0-checkpoint --test_batch_size $nj
fi