
Generate concise podcast summaries from episodes, condensing key information into short, grammatical, and user-friendly snippets for efficient content evaluation on smartphone screens.


heyhimansh/PodSnap.AI


🎙️ Abstractive Podcast Summarization 📄

This repository contains a third-year project carried out for the Software Engineering (CO301) course at Delhi Technological University.

Description

This project aims to produce a good abstractive summary of podcast transcripts obtained from the Spotify Podcast Dataset. The task was originally proposed for the TREC Podcast Track 2020, where the objective was to provide a short text summary that a user might read when deciding whether to listen to a podcast. The summary should accurately convey the content of the podcast, be human-readable, and be short enough to be read quickly on a smartphone screen.

Dataset

The Spotify Podcast Dataset is the first large-scale set of podcasts with transcripts to be released publicly. It contains over 100,000 transcribed podcast episodes, comprising raw audio files, their transcripts, and metadata. The transcripts were generated with Google Cloud Platform's Speech-to-Text API.

While no ground truth summaries are provided in the dataset, the episode descriptions written by the podcast creators serve as proxies for summaries, and are used for training supervised models.

More information on how to access the dataset is available in the podcasts-no-audio-13GB folder.

Solution proposed

In our solution, an extractive module selects salient chunks from the transcript, which serve as the input to an abstractive summarizer. The summarizer is a BART model, which uses an encoder-decoder architecture. The creator-provided descriptions undergo extensive pre-processing to select a subset of the corpus that is suitable for training the supervised model. The figure below summarizes the steps of our method. For a better understanding of the proposed solution, take a look at the notebook and the report.
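The extractive step can be illustrated with a minimal sketch. The scoring rule here (average transcript-wide word frequency per chunk) is an assumption chosen for simplicity, not the criterion used in this project; see the notebook and the report for the actual method. The function names, the stopword list, and the parameters `top_k` and `sentences_per_chunk` are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"the", "a", "an", "to", "we", "about", "my", "this", "for",
             "and", "of", "in", "is", "it", "i", "you", "that"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_into_chunks(transcript, sentences_per_chunk=3):
    """Split on sentence-ending punctuation and group sentences into chunks."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", transcript) if s.strip()]
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]


def select_salient_chunks(transcript, top_k=2, sentences_per_chunk=3):
    """Keep the top_k highest-scoring chunks, in their original order."""
    chunks = split_into_chunks(transcript, sentences_per_chunk)
    word_freq = Counter(tokenize(transcript))

    def score(chunk):
        words = tokenize(chunk)
        if not words:
            return 0.0
        # Average transcript-wide frequency of the words in this chunk.
        return sum(word_freq[w] for w in words) / len(words)

    best = sorted(range(len(chunks)), key=lambda i: score(chunks[i]), reverse=True)[:top_k]
    return " ".join(chunks[i] for i in sorted(best))


example = ("Welcome to the show. Today we talk about coffee. Coffee beans matter. "
           "Coffee roasting changes coffee flavor. My dog barked this morning. "
           "Thanks for listening.")
print(select_salient_chunks(example, top_k=3, sentences_per_chunk=1))
```

The selected text would then be passed to the abstractive summarizer in place of the full transcript, which keeps the input closer to the length that BART can handle.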

Model

The bart-large-cnn model was fine-tuned for 3 epochs using filtered transcripts as input. The final model, which we call bart-large-finetuned-filtered-spotify-podcast-summ, has been uploaded to the Hugging Face Hub 🤗.

It can be used for summarization as follows:

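from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
summarizer = pipeline("summarization", model="gmurro/bart-large-finetuned-filtered-spotify-podcast-summ", tokenizer="gmurro/bart-large-finetuned-filtered-spotify-podcast-summ")
# podcast_transcript must be a string holding the (filtered) episode transcript.
summary = summarizer(podcast_transcript, min_length=39, max_length=250)
# The pipeline returns a list with one dict per input text.
print(summary[0]['summary_text'])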

Alternatively, you can run the summarization script, passing a transcript file as an argument:

python compute_summary.py transcript_example.txt

Results

BERTScore was chosen as the semantic metric for evaluating the results on the test set. As the table below shows, our model outperforms the bart-large-cnn baseline:

Model                  Precision  Recall  F1 Score
bart-large-cnn         0.8103     0.7941  0.8018
bart-large-finetuned   0.8401     0.8093  0.8240
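For readers unfamiliar with the metric, the sketch below shows how BERTScore precision, recall, and F1 are derived from token embeddings by greedy cosine matching. It is a simplified illustration under stated assumptions: the vectors are toy 2-dimensional stand-ins for contextual embeddings, and the optional idf weighting and baseline rescaling are omitted. It is not the evaluation code used for the table above, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bertscore_from_embeddings(candidate, reference):
    """Greedy-matching precision, recall, and F1 over token embeddings.

    candidate, reference: non-empty lists of token vectors (toy stand-ins
    for the contextual embeddings a pretrained model would produce).
    """
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: two generated tokens, one reference token.
generated = [[1.0, 0.0], [0.0, 1.0]]
description = [[1.0, 0.0]]
p, r, f1 = bertscore_from_embeddings(generated, description)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

When scores are reported over a test set, F1 is typically computed per example and then averaged, so a reported F1 need not equal the harmonic mean of the averaged precision and recall.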

This is an example of the prediction made by the fine-tuned model:


CREATOR-PROVIDED DESCRIPTION:  
    In this episode, I talk about how we have to give up perfection in order to grow in our relationship with God.
    It's not about perfection, it's about growing as we walk on the path to perfection.
GENERATED SUMMARY:
    In this episode I talk about the idea of Perfection and how it has the ability to steal all of our joy in this life — if we let it.
    I go into detail about a revelation I had after walking away from my coaching career and how badly I need Jesus.

Resources & Libraries

  • Transformers 4.19.4
  • TensorFlow 2.9.1
  • Datasets 2.3.1
  • Tokenizers 0.12.1

Versioning

We use Git for versioning.

Group members

Roll No.      Name      Surname  Email              Username
2K21/CO/200   HIMANSHU           [email protected]  heyHimansh
2K21/CO/184   HARSHIT   CHOPRA   [email protected]  MadVIJ

Show some ❤️ by giving a ⭐ to this repo
