
Python notebook script that downloads podcast episodes; transcribes, classifies, and summarises each episode; and writes the results to Google BigQuery


rittmananalytics/podcast_transcriber_openai_whisper_bigquery


OpenAI Whisper API-Powered Podcast Transcription, Labelling and Remixing

This Jupyter notebook script automates the process of downloading, transcribing, and analyzing podcast episodes. It uses OpenAI's Whisper for transcription, GPT-4 for content analysis, and stores the results in Google BigQuery.

Features

  • Downloads podcast episodes from an RSS feed
  • Transcribes audio using OpenAI's Whisper model
  • Classifies episodes based on content
  • Labels speakers in the transcript
  • Generates a summary of insights, opinions, and trends discussed in the episode
  • Extracts notable quotes from the guest speaker
  • Stores all processed data in Google BigQuery

Prerequisites

Before using this notebook, you need to have:

  1. An OpenAI API key
  2. A Google Cloud Platform account with BigQuery enabled
  3. A service account key file for Google Cloud Platform

Required Libraries

The script uses the following Python libraries:

  • requests
  • pandas
  • feedparser
  • pydub
  • openai
  • google-cloud-bigquery

You can install these libraries using pip:

    pip install requests pandas feedparser pydub openai google-cloud-bigquery

Note: You also need FFmpeg installed on your system for audio processing.

Configuration

Before running the notebook, you need to configure the following:

  1. OpenAI API Key: Replace 'your-openai-api-key-here' with your actual OpenAI API key.

  2. Google Cloud Service Account Key: Replace 'path/to/your/service-account-key.json' with the path to your service account key file.

  3. BigQuery Dataset and Table: Set the dataset_name and table_name variables to your desired values.

  4. Podcast Host Name: Set the podcast_host_name variable to the name of the podcast's host.

  5. Podcast RSS Feed: Set the feed_url variable to the RSS feed URL of the podcast you want to process.

  6. Number of Episodes: Set the num_episodes variable to the number of recent episodes you want to process.
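The configuration steps above amount to setting a handful of variables near the top of the notebook. A minimal sketch, using the variable names from this README (the placeholder values are assumptions; replace them with your own):

```python
# Configuration values the notebook expects. Variable names follow this
# README; the example values are placeholders, not real credentials.
OPENAI_API_KEY = "your-openai-api-key-here"
SERVICE_ACCOUNT_KEY_PATH = "path/to/your/service-account-key.json"

dataset_name = "podcast_data"                 # BigQuery dataset to write to
table_name = "episode_transcripts"            # BigQuery table to write to
podcast_host_name = "Your Host Name"          # host of the podcast
feed_url = "https://example.com/podcast/rss"  # RSS feed to process
num_episodes = 5                              # recent episodes to process
```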

How It Works

  1. The script fetches podcast episode information from the provided RSS feed.
  2. For each episode:
    • It downloads the audio file
    • Transcribes the audio using OpenAI's Whisper
    • Classifies the episode content using GPT-4
    • Labels speakers in the transcript
    • Generates a summary and extracts insights using GPT-4
  3. All processed data is then stored in the specified BigQuery table.
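The per-episode steps above can be sketched as a few small functions. This is an illustrative outline, not the notebook's exact code: the function names and prompt wording are assumptions, and the OpenAI client is passed in rather than constructed here.

```python
# A minimal sketch of the per-episode pipeline: download, transcribe
# with Whisper, then classify with GPT-4. Prompts are illustrative.

def download_audio(audio_url, dest_path):
    """Download the episode's audio file to a local path."""
    import urllib.request
    urllib.request.urlretrieve(audio_url, dest_path)
    return dest_path

def transcribe(client, audio_path):
    """Transcribe audio using OpenAI's Whisper API (whisper-1)."""
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text

def build_classification_prompt(transcript):
    """Prompt asking GPT-4 to classify the episode's content."""
    return (
        "Classify the topic of this podcast episode transcript "
        "in a few words:\n\n" + transcript[:4000]
    )

def classify(client, transcript):
    """Classify the episode content using GPT-4."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": build_classification_prompt(transcript)}],
    )
    return resp.choices[0].message.content
```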

Usage

  1. Open the Jupyter notebook in your preferred environment (e.g., JupyterLab, Google Colab).
  2. Configure the variables as described in the Configuration section.
  3. Run all cells in the notebook.

The script will process the specified number of episodes and store the results in BigQuery. You can then query this data for further analysis or use in other applications.
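As a sketch of that follow-on analysis, here is one way to query the results table with the google-cloud-bigquery client. The column names come from the Output section below; the project/dataset/table identifiers are whatever you configured earlier.

```python
# Query the results table for the episodes with the longest transcripts.
# The SQL builder is pure; the client call needs GCP credentials.

def build_sql(project_id, dataset_name, table_name, limit=5):
    """Build a query ranking episodes by transcript length."""
    return (
        "SELECT title, published, LENGTH(transcript) AS transcript_chars\n"
        f"FROM `{project_id}.{dataset_name}.{table_name}`\n"
        "ORDER BY transcript_chars DESC\n"
        f"LIMIT {limit}"
    )

def longest_episodes(project_id, dataset_name, table_name, limit=5):
    """Run the query and return the results as a pandas DataFrame."""
    from google.cloud import bigquery  # needs google-cloud-bigquery installed
    client = bigquery.Client(project=project_id)
    return client.query(build_sql(project_id, dataset_name,
                                  table_name, limit)).to_dataframe()
```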

Output

The script creates a BigQuery table with the following columns:

  • title
  • link
  • description
  • published
  • audio_url
  • transcript
  • classification
  • summary_and_insights

Limitations and Considerations

  • Processing time can be significant, especially for longer episodes or when processing many episodes at once.
  • Be mindful of API usage costs for both OpenAI and Google Cloud Platform.
  • Ensure you have appropriate permissions and comply with the terms of service for all APIs and services used.
  • The script currently does not handle pagination for RSS feeds with a large number of episodes.

Podcast Transcriber

This Jupyter notebook generates simulated podcast conversations about data analytics, business intelligence, and modern data stack technologies. It uses OpenAI's GPT-4 model to create realistic discussions between a host and various guests, based on profiles generated from actual podcast transcripts.

Features

  • Retrieves podcast data from a BigQuery table
  • Identifies guests from podcast episode titles
  • Generates detailed guest profiles using GPT-4
  • Creates diverse discussion topics related to data analytics
  • Simulates conversations between the host and selected guests
  • Stores guest profiles and simulated conversations in BigQuery tables
  • Outputs conversations in both JSON and HTML formats

Prerequisites

  • Google Cloud Platform account with BigQuery enabled
  • OpenAI API key
  • Python 3.7+

Required Libraries

  • google-cloud-bigquery
  • pandas
  • openai
  • json (standard library, no install needed)
  • random (standard library, no install needed)

Setup

  1. Clone this repository to your local machine or Google Colab environment.
  2. Install the required libraries:

    pip install google-cloud-bigquery pandas openai

  3. Set up your Google Cloud credentials and OpenAI API key as environment variables, or update the placeholders in the notebook.
  4. Update the BigQuery project ID, dataset name, and table names in the notebook to match your GCP setup.

Usage

  1. Run the notebook cells in order.
  2. The script will:
  • Create necessary BigQuery tables if they don't exist
  • Retrieve podcast data from the specified BigQuery table
  • Identify guests and generate their profiles
  • Generate 10 diverse discussion topics
  • Simulate conversations for each topic
  • Store the results in BigQuery tables
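Two of the steps above, prompting GPT-4 for a conversation and producing the HTML version, can be sketched as follows. The prompt wording and function names are illustrative assumptions, not the notebook's exact code.

```python
import json

def build_conversation_prompt(topic, host_name, guest_profiles):
    """Prompt asking GPT-4 to simulate a podcast discussion (illustrative)."""
    return (
        f"Simulate a podcast conversation between host {host_name} and "
        f"these guests about '{topic}'. Return a JSON array of turns, "
        "each an object with 'speaker' and 'text' fields.\n\n"
        f"Guest profiles: {json.dumps(guest_profiles)}"
    )

def conversation_to_html(turns):
    """Render a conversation (list of speaker/text dicts) as simple HTML."""
    rows = [f"<p><strong>{t['speaker']}:</strong> {t['text']}</p>"
            for t in turns]
    return "\n".join(rows)
```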

BigQuery Table Schemas

Guest Profiles Table

  • guest_name (STRING): Name of the guest
  • profile (STRING): JSON string containing the guest's profile

Simulated Conversations Table

  • topic (STRING): The discussion topic
  • guests (STRING): JSON array of guest names
  • guest_profiles (STRING): JSON object containing profiles of participating guests
  • conversation (STRING): JSON array of the simulated conversation
  • html_conversation (STRING): HTML-formatted version of the conversation

Customization

  • Adjust the number of simulated conversations by changing the range in the main execution loop.
  • Modify the generate_topic function to focus on specific areas of interest.
  • Tweak the select_relevant_guests function to change how guests are selected for each conversation.
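As a starting point for that last tweak, here is a naive stand-in for select_relevant_guests. It is a sketch of one possible selection strategy (keyword overlap with a random fallback), not the notebook's actual implementation.

```python
import random

def select_relevant_guests(topic, guest_profiles, k=2):
    """Pick up to k guests whose profile text shares a word with the
    topic; fall back to a random sample if none match. Illustrative only."""
    topic_words = {w.lower() for w in topic.split()}
    relevant = [name for name, profile in guest_profiles.items()
                if topic_words & set(profile.lower().split())]
    pool = relevant or list(guest_profiles)
    return random.sample(pool, min(k, len(pool)))
```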

Notes

  • The quality and diversity of the simulated conversations depend on the input data and the performance of the GPT-4 model.
  • Be mindful of API usage costs when running this notebook, especially when generating a large number of conversations.
  • Ensure compliance with OpenAI's use-case policies and your organization's data handling guidelines.
