Extract text from thousands of PDF documents containing Facebook ad information, as released by the House Intelligence Committee, and transform it into a structured database.

amkessler/intelcmte_facebookads_extract

intelcmte_facebookads_extract

Extract text from PDF documents containing Facebook ad information and build a structured database.

Getting the raw documents

A sample of the documents is included in the pdfs directory. The entire collection of 3,000+ PDF files is available at https://intelligence.house.gov/social-media-content/social-media-advertisements.htm

To run the extraction code on the entire collection, place all of the files found at the link above into the pdfs directory on your local machine. To keep the size of this GitHub repo manageable, only a sample of about 100 files is included here.
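
Before running the extraction, it can help to confirm that the files landed in the right place. The following is a minimal sketch, not part of this repo's code: it assumes Python is available, and the helper name `list_pdfs` is hypothetical. It only counts the PDF files sitting directly inside the pdfs directory and does not perform any text extraction.

```python
from pathlib import Path


def list_pdfs(directory="pdfs"):
    """Return the PDF files found directly inside `directory`, sorted by name.

    `list_pdfs` is a hypothetical helper for illustration; it is not part of
    this repo's extraction code.
    """
    folder = Path(directory)
    if not folder.is_dir():
        # A missing directory is reported as "no files" rather than an error.
        return []
    return sorted(
        path
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    pdf_files = list_pdfs()
    print(f"Found {len(pdf_files)} PDF files in the pdfs directory")
```

With only the bundled sample in place, the count should be about 100. After copying in the full collection, it should be above 3,000.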
