This is a screenplay parser that extracts dialogues between characters. However, it only extracts a dialogue if the second character has a parenthetical. The scripts are crawled from http://www.imsdb.com/.
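To make the extraction rule concrete, here is a minimal sketch of how such a rule could look. It is not the code in this repository. It assumes a common screenplay layout: an indented character cue in capitals, an optional parenthetical on the next line, then the speech up to the next blank line. The names `parse_speeches` and `exchanges_with_parenthetical` are hypothetical, and indented scene headings or all-caps action lines would be mistaken for cues.

```python
import re

# Assumed layout (not taken from the repository): an indented, all-caps cue,
# an optional "(parenthetical)" line, then speech lines until a blank line.
CUE = re.compile(r"^\s+([A-Z][A-Z .'\-]+?)\s*$")
PAREN = re.compile(r"^\s*\((.+)\)\s*$")


def parse_speeches(script_text):
    """Return a list of (character, parenthetical_or_None, speech) tuples."""
    speeches = []
    lines = script_text.splitlines()
    i = 0
    while i < len(lines):
        cue = CUE.match(lines[i])
        if not cue:
            i += 1
            continue
        character = cue.group(1).strip()
        i += 1
        # An optional parenthetical may directly follow the cue.
        parenthetical = None
        if i < len(lines):
            paren = PAREN.match(lines[i])
            if paren:
                parenthetical = paren.group(1)
                i += 1
        # Collect speech lines until the next blank line.
        spoken = []
        while i < len(lines) and lines[i].strip():
            spoken.append(lines[i].strip())
            i += 1
        if spoken:
            speeches.append((character, parenthetical, " ".join(spoken)))
    return speeches


def exchanges_with_parenthetical(speeches):
    """Keep consecutive speeches by two different characters
    where the second speech carries a parenthetical."""
    pairs = []
    for first, second in zip(speeches, speeches[1:]):
        if first[0] != second[0] and second[1] is not None:
            pairs.append((first, second))
    return pairs
```

Calling `exchanges_with_parenthetical(parse_speeches(text))` on the plain text of a script returns the matching two-speech exchanges. Real scripts vary a lot in layout, so the two regular expressions would need tuning per source.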
- Create a new environment
- Clone the repository
- Install the dependencies:
  pip install -r requirements.txt
- Run scrapy: go to the brickset-scraper folder and run this in your terminal:
  scrapy runspider scraper.py --output=data/names_links.json
  This will generate data/names_links.json.
- Run:
  python json_parser.py data/names_links.json
  This will read names_links.json and will create all_name_script.txt. This new txt file has a movie name and a link to its script for each movie in the json file. Note that each script takes 1-2 seconds.
- Run:
  python html_list_parser.py
  This will read all_name_script.txt and will generate all_dialogues.txt. This file has all the relevant dialogues from the movie scripts.
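As an illustration of the JSON-to-text step, here is a minimal sketch of turning a Scrapy JSON export into a name-and-link listing. It is not the repository's json_parser.py, which appears to do more work per movie. Scrapy's JSON output is an array of items, but the `name` and `link` keys, the tab-separated output format, and the function names `entries_to_lines` and `convert` are assumptions made for this example.

```python
import json


def entries_to_lines(entries):
    """Turn scraped items into 'name<TAB>link' lines, skipping incomplete items.

    The 'name' and 'link' keys are assumed, not taken from the repository.
    """
    lines = []
    for entry in entries:
        name = (entry.get("name") or "").strip()
        link = (entry.get("link") or "").strip()
        if name and link:
            lines.append(f"{name}\t{link}")
    return lines


def convert(json_path, txt_path):
    """Read a JSON array of items and write one line per complete item.

    Returns the number of lines written.
    """
    with open(json_path, encoding="utf-8") as src:
        entries = json.load(src)
    lines = entries_to_lines(entries)
    with open(txt_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(lines) + ("\n" if lines else ""))
    return len(lines)
```

For example, `convert("data/names_links.json", "all_name_script.txt")` would write one line per movie that has both fields set.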
You need to have
- BeautifulSoup
- Scrapy
- Python 3 or above
- Jupyter Notebook
Kamil Veli Toraman: kvtoraman
There is no licence for now. You can use it as you please. This code uses a rule-based algorithm to parse movie scripts. If you know a better way, please let me know :)
- This is the result of a 2-month internship at the Data Science Lab, KAIST.