The current Status Report is in the main folder; start by at least skimming it.
#### RESTful interface to the MongoDB database

In the RESTful Interface folder you will find the complete file structure required to run a Chrome browser app that queries a MongoDB database holding all of the project's ~4 million JSON documents. Once you have installed the files, the instructions for running the project are under the Instructions tab of the main web page, HTAinterface.html, which you can load directly into Chrome (Ctrl+O). These are the most current instructions and will be updated as the project evolves. The Status Report has a section covering some of the technical details of Bottle, jQuery and Ajax.
#### The Status Report

Status Report.pdf, in the main folder, contains:

- a comprehensive explanation of the dataset
- examples of analyses done with this dataset
- a list of references to other healthcare-related Twitter analyses
- instructions for using Amazon Web Services
- sample programs using this file with Python, R and MongoDB
- technical details of the RESTful interface
#### Complete dataset of the tweets for this project

Numerous files were created in the course of this project. They have been archived in an Amazon S3 bucket, where they can be viewed and downloaded: http://healthcare-twitter-analysis.com.s3-website-us-west-1.amazonaws.com/

All of the tweets for this project have been processed and consolidated into a single file, HTA_noduplicates.gz, which can be found by entering the file name in the bucket's search box.
Each of the ~4 million rows in this file is a tweet in JSON format. Every record contains the following information:

- All the Twitter data, in exactly the JSON format of the original
- Unix time stamp
- Data from the original files:
  - originating file name
  - score
  - author screen name
  - URLs
In addition, 60% of the records have geographic information:

- Latitude & longitude
- Country name & ISO2 country code
- City
- For country code "US":
  - Zipcode
  - Telephone area code
  - Square miles inside the zipcode
  - 2010 Census population of the zipcode
  - County & FIPS code
  - State name & USPS abbreviation
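As a sketch of how these fields might be used, the snippet below tallies records that carry coordinates and collects the US-only zipcode field. Note that the exact key names (`latitude`, `country_code`, `zipcode`) are assumptions here for illustration, not confirmed by the dataset documentation:

```python
import json

# Two hypothetical records; the geographic key names below are assumed
# for illustration and may differ in the real file.
rows = [
    '{"text": "flu season", "latitude": 40.7, "longitude": -74.0,'
    ' "country_code": "US", "zipcode": "10001"}',
    '{"text": "hospital visit"}',
]

geo_count = 0
us_zipcodes = []
for row in rows:
    tweet = json.loads(row)
    # only ~60% of real records carry coordinates
    if "latitude" in tweet and "longitude" in tweet:
        geo_count += 1
        # US records carry the extra zipcode-level fields
        if tweet.get("country_code") == "US":
            us_zipcodes.append(tweet["zipcode"])

print(geo_count, us_zipcodes)
```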
The basic technique for using this file in Python is the following:

```python
import json

# open the decompressed file (HTA_noduplicates.gz unpacks to this)
with open("HTA_noduplicates.json", "r") as f:
    # parse each row in turn from JSON and process it
    for row in f:
        tweet = json.loads(row)
        text = tweet["text"]  # text of the original tweet
        ...  # etc.
```
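Since the archived file is gzip-compressed, it can also be read without decompressing it first, using Python's standard gzip module. This sketch writes a tiny one-line stand-in file (sample.gz) in place of the real HTA_noduplicates.gz so it can run anywhere:

```python
import gzip
import json

# Write a one-line stand-in for HTA_noduplicates.gz
with gzip.open("sample.gz", "wt", encoding="utf-8") as f:
    f.write('{"text": "sample tweet"}\n')

# "rt" mode yields decoded text lines, so each row goes straight to json.loads
texts = []
with gzip.open("sample.gz", "rt", encoding="utf-8") as f:
    for row in f:
        tweet = json.loads(row)
        texts.append(tweet["text"])

print(texts)
```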
The Status Report includes instructions for loading the JSON text file into a MongoDB database collection. I keep mine on an external hard drive and start the MongoDB server as follows:

```shell
mongod --dbpath "E:\HTA"
```
The database is HTA and the collection is grf, so the Python code looks like this:

```python
from pymongo import MongoClient

# connect to MongoDB
# ==================
client = MongoClient()  # assumes the MongoDB server is already running
db = client["HTA"]      # reference the database
tweets = db.grf         # reference the collection

for tweet in tweets.find():
    text = tweet["text"]
    if tweet["geo"]:
        ...
```
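When only the geotagged records are needed, the `if` test in the loop above can instead be pushed into the query itself with MongoDB's standard `$ne` operator. The sketch below shows the query document and, so that it can run without a live server, mimics the server-side match against plain dicts:

```python
# Query document: matches records whose "geo" field is not null.
# With a live server this replaces the if-test in the loop:
#     for tweet in tweets.find({"geo": {"$ne": None}}):
#         ...
geo_query = {"geo": {"$ne": None}}

def matches(doc, query):
    # Minimal stand-in for the server-side $ne comparison
    field, cond = next(iter(query.items()))
    return doc.get(field) != cond["$ne"]

docs = [
    {"text": "near the hospital", "geo": {"coordinates": [40.7, -74.0]}},
    {"text": "no location", "geo": None},
]
matched = [d["text"] for d in docs if matches(d, geo_query)]
print(matched)
```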