From 65102c1c386e3aa60f851948f627797fe08e05d8 Mon Sep 17 00:00:00 2001 From: Alan Ng <15185920+alanngnet@users.noreply.github.com> Date: Mon, 1 Apr 2024 22:14:27 -0500 Subject: [PATCH] Update README.md --- README.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/README.md b/README.md index a12344d..706338d 100644 --- a/README.md +++ b/README.md @@ -112,11 +112,11 @@ Arguments to pass to the script: There are two different hparams.yaml files, each used at different stages. -1. The one located in the folder you provide on the command line to tools.extract_csi_features is used only by that script. +1. The one located in the folder you provide on the command line to tools.extract_csi_features.py is used only by that script. | key | value | | --- | --- | -|add_noise| Original CoverHunter provided the example of:
{
  "prob": 0.75,
  "sr": 16000,
  "chunk": 3,
  "name": "cqt_with_asr_noise",
  "noise_path": "dataset/asr_as_noise/dataset.txt"
}
However, the CoverHunter repo did not include whatever might supposed to be in "dataset/asr_as_noise/dataset.txt" file nor does the CoverHunter research paper describe it. If that path does not exist in your project folder structure, then tools.extract_csi_features will just skip the stage of adding noise augmentation. At least for training successfully on Covers80, noise augmentation doesn't seem to be needed.| +|add_noise| Original CoverHunter provided the example of:
{
  `prob`: 0.75,
  `sr`: 16000,
  `chunk`: 3,
  `name`: "cqt_with_asr_noise",
  `noise_path`: "dataset/asr_as_noise/dataset.txt"
}
However, the CoverHunter repo did not include whatever is supposed to be in the "dataset/asr_as_noise/dataset.txt" file, nor does the CoverHunter research paper describe it. If that path does not exist in your project folder structure, then tools.extract_csi_features will just skip the stage of adding noise augmentation. At least for training successfully on Covers80, noise augmentation doesn't seem to be needed.| | aug_speed_mode | list of ratios used in tools.extract_csi_features for speed augmentation of your raw training data. Example: [0.8, 0.9, 1.0, 1.1, 1.2] means use 80%, 90%, 100%, 110%, and 120% speed variants of your original audio data.| | train-sample_data_split | percent of training data to reserve for validation aka "train-sample" expressed as a fraction of 1. Example for 10%: 0.1 | | train-sample_unseen | percent of song_ids from training data to reserve exclusively for validation aka "train-sample" expressed as a fraction of 1. Example for 2%: 0.02 | @@ -130,32 +130,32 @@ There are two different hparams.yaml files, each used at different stages. | key | value | | --- | --- | | covers80 | Test dataset for model evaluation purposes. "covers80" is the only example provided with the original CoverHunter.
Subparameters:
`query_path`: "data/covers80/full.txt"
`ref_path`: "data/covers80/full.txt"
`every_n_epoch_to_dev`: 1 # validate after every n epoch
These can apparently be the same path as `train_path` for doing self-similarity evaluation.| -| dev_path | Compare train_path and train_sample_path. This dataset is used in each epoch to run the same validation calculation as with the train_sample_path. But these results are used for the early_stopping_patience calculation. Presumably one should include both classes and samples that were excluded from both train_path and train_sample_path. | -| query_path | TBD: | -| ref_path | TBD: can apparently be the same path as train_path. Presumably for use during model evaluation and inference. | +| dev_path | Compare `train_path` and `train_sample_path`. This dataset is used in each epoch to run the same validation calculation as with the `train_sample_path`. But these results are used for the `early_stopping_patience` calculation. Presumably one should include both classes and samples that were excluded from both `train_path` and `train_sample_path`. | +| query_path | TBD: can apparently be the same path as `train_path`. Presumably for use during model evaluation and inference. | +| ref_path | TBD: can apparently be the same path as `train_path`. Presumably for use during model evaluation and inference. | | train_path | path to a JSON file containing metadata about the data to be used for model training (See full.txt below for details) | -| train_sample_path | path to a JSON file containing metadata about the data to be used for model validation. Compare dev_path above. Presumably one should include a balanced distribution of samples that are *not* included in the train_path dataset, but do include samples for the classes represented in the train_path dataset.(See full.txt below for details) +| train_sample_path | path to a JSON file containing metadata about the data to be used for model validation. Compare `dev_path` above. Presumably one should include a balanced distribution of samples that are *not* included in the `train_path` dataset, but do include samples for the classes represented in the `train_path` dataset. (See full.txt below for details) ### Training parameters | key | value | | --- | --- | | batch_size | Usual "batch size" meaning in the field of machine learning. An important parameter to experiment with. | -| chunk_frame | list of numbers used with mean_size. CoverHunter's covers80 config used [1125, 900, 675]. "chunk" references in this training script seem to be the chunks described in the time-domain pooling strategy part of their paper, not the chunks discussed in their coarse-to-fine alignment strategy. See chunk_s. | -| chunk_s | duration of a chunk_frame in seconds. Apparently you are supposed to manually calculate chunk_s = chunk_frame / frames-per-second * mean_size. I'm not sure why the script doesn't just calculate this itself using CQT hop-size to get frames-per-second? | +| chunk_frame | list of numbers used with `mean_size`. CoverHunter's covers80 config used [1125, 900, 675]. "chunk" references in this training script seem to be the chunks described in the time-domain pooling strategy part of their paper, not the chunks discussed in their coarse-to-fine alignment strategy. See `chunk_s`. | +| chunk_s | duration of a `chunk_frame` in seconds. Apparently you are supposed to manually calculate `chunk_s` = `chunk_frame` / frames-per-second * `mean_size`. I'm not sure why the script doesn't just calculate this itself using CQT hop-size to get frames-per-second?
| | cqt: hop_size: | Fine-grained time resolution, measured as duration in seconds of each CQT spectrogram slice of the audio data. CoverHunter's covers80 setting is 0.04 with a comment "1s has 25 frames". 25 frames per second is hard-coded as an assumption into CoverHunter in various places. | | data_type | "cqt" (default) or "raw" or "mel". Unknown whether CoverHunter actually implemented anything but CQT-based training | | device | 'mps' or 'cuda', corresponding to your GPU hardware and PyTorch library support. Theoretically 'cpu' could work but untested and probably of no value. | | early_stopping_patience | how many epochs to wait for validation loss to improve before early stopping | -| mean_size | See chunk_s above. An integer used to multiply chunk lengths to define the length of the feature chunks used in many stages of the training process. | +| mean_size | See `chunk_s` above. An integer used to multiply chunk lengths to define the length of the feature chunks used in many stages of the training process. | | mode | "random" (default) or "defined". Changes behavior when loading training data in chunks in AudioFeatDataset. "random" described in CoverHunter code as "cut chunk from feat from random start". "defined" described as "cut feat with 'start/chunk_len' info from line"| | m_per_class | From CoverHunter code comments: "m_per_class must divide batch_size without any remainder" and: "At every iteration, this will return m samples per class. For example, if dataloader's batch-size is 100, and m = 5, then 20 classes with 5 samples iter will be returned." | -| spec_augmentation | spectral(?) augmentation settings, used to generate temporary data augmentation on the fly during training. CoverHunter settings were:
random_erase:
  prob: 0.5
  erase_num: 4
roll_pitch:
  prob: 0.5
  shift_num: 12 | +| spec_augmentation | spectral(?) augmentation settings, used to generate temporary data augmentation on the fly during training. CoverHunter settings were:
`random_erase`:
  `prob`: 0.5
  `erase_num`: 4
`roll_pitch`:
  `prob`: 0.5
  `shift_num`: 12 | ### Model parameters | key | value | | --- | --- | | embed_dim | 128 | -| encoder | # model-encode
Subparameters:
attention_dim: 256 # "the hidden units number of position-wise feed-forward"
output_dims: 128
num_blocks: 6 # number of decoder blocks | +| encoder | # model-encode
Subparameters:
`attention_dim`: 256 # "the hidden units number of position-wise feed-forward"
`output_dims`: 128
`num_blocks`: 6 # number of decoder blocks | | input_dim | 96 | @@ -195,11 +195,11 @@ That bug didn't prevent successful training, but fixing the bug did, until I dis | filename | comments | |---|---| | cqt_feat subfolder | Numpy array files of the CQT data for each file listed in full.txt. Needed by train.py | -| data.init.txt | Copy of dataset.txt after sorting by ‘utt’ and de-duping. Not used by train.py | +| data.init.txt | Copy of dataset.txt after sorting by `utt` and de-duping. Not used by train.py | | dev.txt | A subset of full.txt generated by the `_split_data_by_song_id()` function intended for use by train.py as the `dev` dataset. | | dev-only-song-ids.txt | Text file listing one song_id per line for each song_id that the train/val/test splitting function held out from train/val to be used exclusively in the test aka "dev" dataset. This file can be used by `eval_testset.py` to mark those samples in the t-SNE plot. | | full.txt | See above detailed description. Contains the entire dataset you provided in the input file. | -| song_id.map | Text file, with 2 columns per line, separated by a space, sorted alphabetically by the first column. First column is a distinct "song" string taken from dataset.txt. Second column is the "song_id" value assigned to that "song." | +| song_id.map | Text file, with 2 columns per line, separated by a space, sorted alphabetically by the first column. First column is a distinct "song" string taken from dataset.txt. Second column is the `song_id` value assigned to that "song." | | sp_aug subfolder | Sox-modified wav speed variants of the raw training .wav files, at the speeds defined in hparams.yaml. Not used by train.py | | sp_aug.txt | Copy of data.init.txt but with addition of 1 new row for each augmented variant created in sp_aug/*.wav. Not used by train.py. | | train.txt | A subset of full.txt generated by the `_split_data_by_song_id()` function intended for use by train.py as the `train` dataset. |
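
For orientation, here is a minimal sketch of how the feature-extraction keys documented in the tables above might sit together in the hparams.yaml passed to tools.extract_csi_features.py. The key names and example values are the ones quoted in the table; the block-style YAML nesting of `add_noise` is just one equivalent way of writing the dict shown there, and the 0.1 / 0.02 split values are the table's own examples rather than recommended settings.

```yaml
# Sketch of the hparams.yaml read by tools.extract_csi_features.py
# (key names and values as documented in the table above; adjust for your data)
add_noise:
  prob: 0.75
  sr: 16000
  chunk: 3
  name: "cqt_with_asr_noise"
  noise_path: "dataset/asr_as_noise/dataset.txt"  # noise stage is skipped if this path doesn't exist
aug_speed_mode: [0.8, 0.9, 1.0, 1.1, 1.2]  # 80%-120% speed variants of the raw audio
train-sample_data_split: 0.1   # reserve 10% of training data for validation aka "train-sample"
train-sample_unseen: 0.02      # hold out 2% of song_ids exclusively for validation
```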
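
A similar sketch for the training-stage hparams.yaml, using only the covers80 values the tables above attribute to CoverHunter. The dataset paths, `batch_size`, `m_per_class`, `mean_size`, `early_stopping_patience`, and the resulting `chunk_s` are illustrative placeholders (the 100 / 5 pair simply mirrors the example quoted from CoverHunter's code comments), not confirmed settings.

```yaml
# Illustrative training-stage hparams.yaml; placeholder values are marked as such
device: "mps"                   # or "cuda", to match your GPU and PyTorch support
data_type: "cqt"
mode: "random"

train_path: "data/covers80/train.txt"                # placeholder full.txt-style file
train_sample_path: "data/covers80/train-sample.txt"  # placeholder validation split
dev_path: "data/covers80/dev.txt"                    # placeholder; feeds early stopping
query_path: "data/covers80/full.txt"                 # can apparently match train_path
ref_path: "data/covers80/full.txt"                   # can apparently match train_path
covers80:
  query_path: "data/covers80/full.txt"
  ref_path: "data/covers80/full.txt"
  every_n_epoch_to_dev: 1        # validate after every n epochs

batch_size: 100                  # placeholder; mirrors CoverHunter's 100/5 example
m_per_class: 5                   # must divide batch_size without remainder
early_stopping_patience: 10      # placeholder
chunk_frame: [1125, 900, 675]
mean_size: 3                     # placeholder multiplier; see chunk_s below
# chunk_s = chunk_frame / frames-per-second * mean_size,
# e.g. 1125 / 25 * 3 = 135 with hop_size 0.04 (25 frames per second)
chunk_s: 135                     # placeholder; follows from the placeholder mean_size
cqt:
  hop_size: 0.04                 # "1s has 25 frames"
spec_augmentation:
  random_erase:
    prob: 0.5
    erase_num: 4
  roll_pitch:
    prob: 0.5
    shift_num: 12
embed_dim: 128
input_dim: 96
encoder:                         # model-encode
  attention_dim: 256             # "the hidden units number of position-wise feed-forward"
  output_dims: 128
  num_blocks: 6
```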