UPDATE: WhoPaysWriters.com asked that their data not be posted on a third-party site, so the datasets have been removed. Please email me with any questions.
A data scrape and analysis of WhoPaysWriters.com. A summary of the results can be found here. Collected for an article in the Columbia Journalism Review. Questions and suggestions for improvement are welcome: [email protected].
WhoPaysWriters.com is an anonymous platform where freelance journalists post details about their compensation. There were approximately 3000 submissions to the site from 2012 to 2018, making it the largest publicly available dataset of its kind. Journalists not only submit their pay, but also include information about their rights, their relationship with the editor, and other contextual data.
This script creates three kinds of CSVs:
- publications.csv, which lists all publications scraped from the opening webpage.
- A CSV created for each publication's page under the `data` folder.
- allData_raw.csv, which is one CSV of everything in `data`.

It requires that the user download ChromeDriver in addition to its Python packages.
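As a rough illustration of the last step, combining the per-publication CSVs into a single file, here is a minimal sketch using pandas. It is not the script's actual code: the function name `combine_publication_csvs` and the `sourceFile` column are hypothetical, and it assumes every file in the folder is a CSV with a header row.

```python
from pathlib import Path

import pandas as pd


def combine_publication_csvs(data_dir, out_path):
    """Concatenate every CSV in data_dir into one CSV at out_path."""
    frames = []
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        frame = pd.read_csv(csv_path)
        # Hypothetical extra column: remember which file each row came from.
        frame["sourceFile"] = csv_path.name
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"no CSV files found in {data_dir}")
    # Columns missing from some files are filled with NaN.
    combined = pd.concat(frames, ignore_index=True, sort=False)
    combined.to_csv(out_path, index=False)
    return combined
```

Writing the combined file outside the folder being read (for example `combine_publication_csvs("data", "allData_raw.csv")`) keeps a second run from picking up its own output.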
Cleans data for analysis. Other than normal cleaning, here are some decisions made:
- I replaced most `other` entries with NaNs.
- I dropped everything with fewer than 100 words.
- I dropped all `fiction` and `poetry` entries.
- I removed entries for 2019.
- Potential spam and unreasonable outliers are cut. They are addressed on a case-by-case basis.

This notebook creates allData_clean.csv, which is ultimately used for analysis.
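The rule-based decisions above could be expressed in pandas roughly as follows. This is a sketch under assumptions, not the notebook's code: the column names `wordCount`, `storyType`, `year`, and `rights` are hypothetical stand-ins for whatever the raw data actually uses, and the case-by-case spam and outlier removal is left out because it cannot be reduced to a rule.

```python
import pandas as pd


def clean_submissions(raw, other_columns=("rights",)):
    """Apply the rule-based cleaning steps to a raw submissions table."""
    df = raw.copy()
    # Replace "other" with NaN in the chosen categorical columns.
    for col in other_columns:
        df[col] = df[col].mask(df[col] == "other")
    # Drop pieces under 100 words. Assumption: rows with a missing
    # word count fail this comparison and are dropped as well.
    df = df[df["wordCount"] >= 100]
    # Drop fiction and poetry entries.
    df = df[~df["storyType"].isin(["fiction", "poetry"])]
    # Remove entries for 2019.
    df = df[df["year"] != 2019]
    return df.reset_index(drop=True)
```

Working on a copy leaves the raw table untouched, so the effect of each rule can be checked by comparing row counts before and after.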
Explores most 2-variable relationships and creates appropriate graphs for study. Also creates publications_rank.csv, which uses rankings from `totalPaid`, `wordRate`, `daysToBePaid`, and `paymentDifficulty` to rank publications with more than 7 submissions.
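One way such a ranking could be built is sketched below. The README does not say how the four rankings are combined, so everything beyond the four column names and the "more than 7 submissions" cutoff is an assumption: the `publication` column name, the use of per-publication medians, treating `paymentDifficulty` as numeric with lower values being better, and averaging the four ranks into one overall rank.

```python
import pandas as pd

# Assumed direction for each metric: True means a larger value is better.
HIGHER_IS_BETTER = {
    "totalPaid": True,
    "wordRate": True,
    "daysToBePaid": False,
    "paymentDifficulty": False,
}


def rank_publications(df, min_submissions=7):
    """Rank publications with more than min_submissions entries."""
    grouped = df.groupby("publication")
    counts = grouped.size()
    eligible = counts[counts > min_submissions].index
    # Assumption: summarize each publication by its median on every metric.
    medians = grouped[list(HIGHER_IS_BETTER)].median().loc[eligible]
    ranks = pd.DataFrame(index=medians.index)
    for col, higher in HIGHER_IS_BETTER.items():
        # Rank 1 is best: the largest value when higher is better,
        # the smallest value otherwise.
        ranks[col + "Rank"] = medians[col].rank(ascending=not higher, method="min")
    # Assumption: the overall rank comes from the mean of the four ranks.
    ranks["meanRank"] = ranks.mean(axis=1)
    ranks["overallRank"] = ranks["meanRank"].rank(method="min")
    return ranks.sort_values("overallRank")
```

Medians are used here because a single unusually large payment would otherwise dominate a publication with only a handful of submissions; a mean would be a one-line change.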