GitHub - venturamor/TRBLLmaker-NLP: Songs Meaning Generating

TRBLLmaker - ReadMe

Genius API:

API client (Songify) created via https://genius.com/api-clients/new
App website: http://example.com/
Client ID:

Client Key (secret):

Client Access Token:

Relevant helpful websites:
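
As a minimal sketch of how the access token is used: the Genius API accepts it as a bearer token in the Authorization header. The helper name `build_search_request` and the example query are illustrative and not taken from this repository; the function only builds the request and does not send it.

```python
import urllib.parse
import urllib.request

GENIUS_API_BASE = "https://api.genius.com"


def build_search_request(query, access_token):
    """Build (but do not send) a GET request to the Genius search endpoint.

    `build_search_request` is a hypothetical helper, not part of this repo.
    """
    url = f"{GENIUS_API_BASE}/search?" + urllib.parse.urlencode({"q": query})
    # The client access token is sent as a bearer token.
    headers = {"Authorization": f"Bearer {access_token}"}
    return urllib.request.Request(url, headers=headers)


request = build_search_request("Kendrick Lamar", "YOUR_ACCESS_TOKEN")
```

To actually run the search, pass `request` to `urllib.request.urlopen` and parse the JSON body of the response.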

Data Extraction and preparation:

data_extraction - extracts song data and metadata through the Genius API, by genre and by artist (chosen_artists). Results are saved as pickles (db_pickles/artist or db_pickles/genre).
data_arrangement - gathers all extracted data into a unique set (db_pickles/final).
prepare_data - organizes the data in dataframe format (./jsons) and splits it into train, test and validation sets (./data).
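
The arrangement and split steps above can be sketched roughly as follows. This is not the code in data_arrangement or prepare_data; the `id` key, the 80/10/10 ratios and the function names are assumptions made for illustration.

```python
import random


def merge_unique(batches, key="id"):
    """Merge several lists of song records, keeping one record per key.

    The "id" key is an assumed field name, not confirmed by the repo.
    """
    seen = {}
    for batch in batches:
        for song in batch:
            seen.setdefault(song[key], song)  # first occurrence wins
    return list(seen.values())


def split_songs(songs, train=0.8, test=0.1, seed=0):
    """Shuffle and split into train / test / validation (ratios are assumed)."""
    shuffled = list(songs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return {
        "train": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        "validation": shuffled[n_train + n_test:],  # whatever remains
    }


by_artist = [{"id": 1, "title": "a"}, {"id": 2, "title": "b"}]
by_genre = [{"id": 2, "title": "b"}, {"id": 3, "title": "c"}]
unique_songs = merge_unique([by_artist, by_genre])
splits = split_songs(unique_songs)
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing models trained at different times.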

Dataset

Working with HuggingFace Dataset format.

TRBLL_dataset - our dataset class. It loads the JSON files located in ./data, as set in the config (train_args).
The dataset is a DatasetDict with train, test and validation splits.
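
For orientation, JSON splits can be loaded into a DatasetDict with the generic `json` loader of the `datasets` library. This is a sketch, not the TRBLL_dataset implementation: the file names `train.json`, `test.json` and `validation.json` and both helper names are assumptions.

```python
from pathlib import Path


def build_data_files(data_dir="./data"):
    """Map split names to JSON paths (file names are assumed, not confirmed)."""
    base = Path(data_dir)
    return {
        split: str(base / f"{split}.json")
        for split in ("train", "test", "validation")
    }


def load_splits(data_dir="./data"):
    """Load the splits as a DatasetDict; requires `pip install datasets`."""
    from datasets import load_dataset  # imported lazily, only needed here

    return load_dataset("json", data_files=build_data_files(data_dir))
```

Calling `load_splits()` returns an object indexed by split name, for example `load_splits()["train"]`, provided the files exist.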

Data Exploration

Before splitting into train, test and validation, we can:
- Print statistics of songs by length, genre, artist, etc.
- Plot a word cloud of the song lyrics.
- Plot a word cloud of the sentences in the song lyrics that are annotated.
- Plot a word cloud of the annotated sentences.
- Print statistics from the zero-shot.
- Check the correlation between page ranking and other features.

After splitting into train, test and validation, we can:
- Print out several sentences with their annotations.
- Print statistics of sentences with annotations by length (both song and annotation).
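
A small standard-library sketch of the length statistics and of the word counts that a word cloud is built from. It is not the code in data_exploration.py; the function names and the simple tokenization are assumptions.

```python
import re
from collections import Counter
from statistics import mean


def length_stats(texts):
    """Word-count statistics for a non-empty list of texts."""
    lengths = [len(text.split()) for text in texts]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }


def word_frequencies(texts, top_n=10):
    """Most common lowercase words; these counts are what a word cloud plots."""
    words = []
    for text in texts:
        words.extend(re.findall(r"[a-z']+", text.lower()))
    return Counter(words).most_common(top_n)


sample = ["la la land", "la vie"]
stats = length_stats(sample)
top_words = word_frequencies(sample, top_n=1)
```

The same two functions can be applied separately to lyric sentences and to annotations to compare their length distributions.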

After looking at the data:

We can see that many of the annotations contain the artist's name.
Some annotations rely on previous songs.
Some annotations rely on the full lyrics.
Some annotations contain noise such as:
- https
Some songs are in other languages (Russian, Spanish, French).
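
A rough cleaning sketch for the observations above, using only regular expressions. The function names are hypothetical, and the script check only detects Cyrillic text, so it would flag Russian lyrics but not Spanish or French ones; a real language filter would need a language-identification tool.

```python
import re

# Matches full URLs as well as a bare leftover "http"/"https" token.
URL_RE = re.compile(r"https?://\S+|\bhttps?\b")
CYRILLIC_RE = re.compile(r"[\u0400-\u04FF]")


def strip_urls(text):
    """Remove URLs and stray 'https' tokens, then normalize whitespace."""
    return " ".join(URL_RE.sub(" ", text).split())


def mentions_artist(annotation, artist):
    """Case-insensitive check for the artist name inside an annotation."""
    return artist.lower() in annotation.lower()


def has_cyrillic(text):
    """Rough check for Cyrillic script (covers Russian only)."""
    return CYRILLIC_RE.search(text) is not None
```

These checks can be used either to clean examples in place or to count how many examples are affected before deciding what to drop.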

Future work:

Feed a paragraph and an annotation to a model and get the sentence that the annotation is talking about.
Feed a paragraph and a sentence to a model and get the annotation that explains the sentence.
Feed a paragraph and information about the artist to a model and get the sentence that the annotation is talking about.
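
One way to frame these three tasks for a text-to-text model is as (input, target) string pairs. The sketch below is purely illustrative: the task names, the `paragraph:` / `sentence:` / `annotation:` / `artist:` prefixes and the function itself are assumptions, not formats used by this repository.

```python
def make_example(task, paragraph, sentence, annotation, artist_info=""):
    """Return an (input_text, target_text) pair for one of three assumed tasks."""
    if task == "find_sentence":
        # paragraph + annotation -> the sentence being annotated
        return f"paragraph: {paragraph} annotation: {annotation}", sentence
    if task == "generate_annotation":
        # paragraph + sentence -> the annotation for that sentence
        return f"paragraph: {paragraph} sentence: {sentence}", annotation
    if task == "find_sentence_with_artist":
        # paragraph + artist information -> the sentence being annotated
        return f"paragraph: {paragraph} artist: {artist_info}", sentence
    raise ValueError(f"unknown task: {task}")


source, target = make_example(
    "generate_annotation",
    paragraph="First line. Second line.",
    sentence="Second line.",
    annotation="An explanation of the second line.",
)
```

Keeping all tasks in one function makes it easy to train on a mix of them with a single dataset.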

Problems:

The annotations contain many names and a lot of the artists' history.
- Solutions:
  - Use NER (named entity recognition) and replace the names with generic words.
  - Remove examples with names.
  - Insert the name of the artist together with the sentence.
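
The three solutions can be approximated without a NER model when the artist name is already known from the song metadata. The sketch below uses plain string matching as a stand-in for NER, so it only catches the known artist name and not other people mentioned in an annotation. All function names, the `annotation` key and the placeholder text are assumptions.

```python
import re


def mask_artist(annotation, artist, placeholder="the artist"):
    """Replace the known artist name with a generic phrase (not real NER)."""
    pattern = re.compile(re.escape(artist), flags=re.IGNORECASE)
    return pattern.sub(placeholder, annotation)


def drop_named(examples, artist):
    """Remove examples whose annotation mentions the artist name."""
    return [
        ex for ex in examples
        if artist.lower() not in ex["annotation"].lower()
    ]


def prepend_artist(sentence, artist):
    """Give the model the artist name together with the sentence."""
    return f"{artist}: {sentence}"


examples = [
    {"annotation": "Here Drake recalls his youth"},
    {"annotation": "A line about lost love"},
]
masked = mask_artist(examples[0]["annotation"], "drake")
kept = drop_named(examples, "Drake")
```

Masking keeps every example but loses some context, while dropping keeps the text intact at the cost of a smaller dataset.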

Folders and files:

- .idea
- T5
- data_exploration
- extract_songs
- not_relevant
- predictions_before_after
- trained_model_checkpoint_lyrics_meaning_with_metadata_bs_32
- transformers
- wandb
- .gitignore
- ReadMe.md
- TRBLL_dataset.py
- app.py
- config.yaml
- config_parser.py
- data_exploration.py
- data_extraction.py
- evaluate_models.py
- finetuning_script.py
- finetuning_script_batch.py
- inference_results.pkl
- inference_results_old.pkl
- investigate_results.py
- post_evaluation.py
- prepare_data.py
- prompts.py
- requirements.txt
- run_gpt.sh
- run_inference.sh
- set_config.py
- try_gpt_mor.py
