Code supporting the "Monitoring Hate Speech in the US Media" project of Prof. Babak Bahador's group in the GW School of Media and Public Affairs.
Note that there is a `requirements.txt` file, so running this program requires a Python environment with the libraries in `requirements.txt` installed. For those new to setting up Python environments, A Hitchhiker's Guide to Python provides advice and several different ways to accomplish this.
```
usage: python vopd.py [-h] [--window WINDOW] [--context CONTEXT]
                      [--subjectfile SUBJECTFILE] [--keywordfile KEYWORDFILE]
                      [--normalizefile NORMALIZEFILE] [--mode MODE]
                      [--verbose]
                      transcript

positional arguments:
  transcript            filepath to transcript PDF or directory, or (where
                        mode==tweets) path to SFM extract Excel file

optional arguments:
  -h, --help            show this help message and exit
  --window WINDOW       number of words that subject and keyword must be
                        within (default = 5)
  --context CONTEXT     number of words before and after subject and keyword
                        to extract (default = 20)
  --subjectfile SUBJECTFILE
                        subject list file (default = subjects.csv)
  --keywordfile KEYWORDFILE
                        keyword list file (default = keywords.csv)
  --normalizefile NORMALIZEFILE
                        normalize terms file (default = normalize_terms.csv)
  --mode MODE           processing mode, either pdf, tweets, or email
                        (default = pdf)
  --verbose             verbose output during execution
```
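A command-line interface like the one above maps naturally onto Python's `argparse` module. The sketch below shows how such options would be declared; it is an illustration of the documented interface, not necessarily the exact code in `vopd.py`:

```python
import argparse

def build_parser():
    """Build an argument parser matching the documented vopd.py CLI."""
    parser = argparse.ArgumentParser(prog="vopd.py")
    parser.add_argument("transcript",
                        help="transcript PDF or directory, or SFM extract when --mode tweets")
    parser.add_argument("--window", type=int, default=5,
                        help="max words between subject and keyword")
    parser.add_argument("--context", type=int, default=20,
                        help="words of context to extract around a match")
    parser.add_argument("--subjectfile", default="subjects.csv")
    parser.add_argument("--keywordfile", default="keywords.csv")
    parser.add_argument("--normalizefile", default="normalize_terms.csv")
    parser.add_argument("--mode", choices=["pdf", "tweets", "email"], default="pdf")
    parser.add_argument("--verbose", action="store_true")
    return parser
```

With this parser, `python vopd.py --mode tweets extract.xlsx` would yield `mode == "tweets"` with all other options at their defaults.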
PDF transcript files must be named using the following pattern:

```
MM_DD_YYYY_NNN_Name Of The Show.pdf
```

where:

- `MM_DD_YYYY` is the date of the show
- `NNN` is the show code/number

Any separator character is okay, but the positions of the values are important.
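Because only the positions of the values matter, the date and show code can be recovered with a short regular expression. The helper below is an illustrative sketch, not the parser actually used in `vopd.py`:

```python
import re
from datetime import date

# Matches MM_DD_YYYY_NNN_Name Of The Show.pdf; \D allows any
# single non-digit separator between the leading fields.
FILENAME_PATTERN = re.compile(
    r"^(\d{2})\D(\d{2})\D(\d{4})\D(\w{3})\D(.+)\.pdf$"
)

def parse_transcript_filename(filename):
    """Return (show_date, show_code, show_name) from a transcript filename."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized transcript filename: {filename}")
    month, day, year, code, name = match.groups()
    return date(int(year), int(month), int(day)), code, name
```

For example, `parse_transcript_filename("01_15_2019_123_Some Show.pdf")` would return the date 2019-01-15, show code `123`, and show name `Some Show`.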
SFM extract files must be Excel files output by Social Feed Manager with columns as per https://sfm.readthedocs.io/en/latest/data_dictionary.html?highlight=export#twitter-dictionary
Email extract files must be Excel files with the following columns:
- Date
- From
- Sender
- Message
The program writes its output to `extracts-[pdf OR tweets OR email].csv`, which contains all instances of a keyword and a subject found within "n" words of each other, where "n" is the configured window size. Note that if this file already exists, it will be appended to. If you wish to overwrite, simply delete or rename it first.
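The window-and-context matching rule can be sketched as follows. This is a simplified illustration of the documented behavior, not the actual `vopd.py` code (which, for instance, also applies term normalization):

```python
def find_extracts(words, subjects, keywords, window=5, context=20):
    """Return (subject, keyword, snippet) tuples for each subject word
    with a keyword within `window` words, plus `context` words of
    surrounding text."""
    extracts = []
    for i, word in enumerate(words):
        if word.lower() not in subjects:
            continue
        # Look for a keyword within `window` words on either side.
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if words[j].lower() in keywords:
                # Extract `context` words before and after the pair.
                start = max(0, min(i, j) - context)
                end = min(len(words), max(i, j) + context + 1)
                extracts.append((words[i], words[j], " ".join(words[start:end])))
    return extracts
```

With `window=5`, the subject "president" and the keyword "killer" in "the killer attacked the president today" are three words apart, so the pair would be extracted together with its surrounding context.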
The `recycle_keywords.py` utility takes:

- a coding file (default `coding.csv`) (**currently only works for PDF extracts**)
- a keywords file (default `keywords.csv`)
- a normalize_terms file (default `normalize_terms.csv`)
It scans through the coding file, looking for keyword severity scores assigned by the human coder and for new keywords the coder has added. It then updates the scores of existing keywords in the keywords file (using the mode of the human-assigned severity scores) and appends the new keywords.
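The "mode of human-assigned severity scores" step might look like the sketch below; the function name and data shapes are hypothetical, chosen only to illustrate the idea:

```python
from collections import Counter

def update_keyword_scores(keyword_scores, coded_scores):
    """Set each keyword's score to the most common (modal) score the
    human coders assigned; keywords not yet in keyword_scores are added.

    keyword_scores: dict mapping keyword -> current severity score
    coded_scores:   dict mapping keyword -> list of human-assigned scores
    """
    for keyword, scores in coded_scores.items():
        if scores:
            # most_common(1) returns the modal score; ties break by
            # first occurrence in the list of coded scores.
            keyword_scores[keyword] = Counter(scores).most_common(1)[0][0]
    return keyword_scores
```

For example, if coders scored "thug" as 3, 3, and 2, its stored score becomes 3, and a keyword seen only in the coding file is added with its modal score.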