Releases: grfiv/healthcare_twitter_analysis

Reverse geo-tagging included; duplicates removed

10 Sep 14:35

All of the tweets for this project have been processed and consolidated into a single file that can be downloaded with this link:

Each of the 4 million rows in this file is a tweet in JSON format containing the following information:

  • All the Twitter data in exactly the JSON format of the original
  • Unix time stamp
  • All the Topsy data
    • originating file name
    • score
    • author screen name
    • URLs

60% of the records also have geographic information:

  • Latitude & Longitude
  • Country name & ISO2 country code
  • City
  • For country code "US"
    • Zipcode
    • Telephone area code
    • Square miles inside the zipcode
    • 2010 Census population of the zipcode
    • County & FIPS code
    • State name & USPS abbreviation
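As a rough way to check the geographic coverage on your own copy of the file, the sketch below counts the rows that carry a reverse geo-tagging field. It assumes that the `geo_reverse` field named in the earlier release notes is what marks a geo-tagged record in the consolidated file; the function name `geo_coverage` and the sample records are invented for illustration.

```python
import json


def geo_coverage(lines, field="geo_reverse"):
    """Return (tagged, total) for an iterable of JSON-encoded tweets.

    `field` is assumed to be the key that holds the reverse geo-tagging data.
    """
    tagged = total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        total += 1
        if tweet.get(field):  # key present and not empty
            tagged += 1
    return tagged, total


# Tiny in-memory sample; these records are invented, not taken from the dataset.
sample = [
    '{"text": "a", "geo_reverse": {"country": "United States"}}',
    '{"text": "b"}',
]
tagged, total = geo_coverage(sample)
print(f"{tagged} of {total} rows have geographic information")
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `with open("HTA_noduplicates.json") as f: geo_coverage(f)`, so the file is never loaded into memory all at once.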

The basic technique for using this file in Python is the following:

import json

with open("HTA_noduplicates.json", "r") as f:
    # parse each row (one JSON-encoded tweet per line) into a Python dict and process it
    for row in f:
        tweet = json.loads(row)
        text  = tweet["text"]      # text of original tweet
        ...                        # etc.

Python provides powerful analytical and plotting features, but R is also very handy. R does not work well with a dataset this large, but Python can be used to create a targeted subset file that R (or Excel, or anything else, for that matter) can read.
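One way to build such a subset is to stream the file row by row and write only the matching tweets to a CSV file. The following is a minimal sketch, not the project's own code: the function name `write_subset`, the keyword filter, and the choice of columns are illustrative, and `id_str` is the standard Twitter field for the tweet ID, assumed here to be present in each row.

```python
import csv
import io
import json


def write_subset(lines, out, keyword):
    """Write tweets whose text contains `keyword` to `out` as CSV; return the count."""
    writer = csv.writer(out)
    writer.writerow(["id", "text"])  # header row for R's read.csv or Excel
    kept = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        tweet = json.loads(line)
        text = tweet.get("text", "")
        if keyword.lower() in text.lower():
            # flatten embedded newlines so each tweet stays on one CSV row
            writer.writerow([tweet.get("id_str", ""), text.replace("\n", " ")])
            kept += 1
    return kept


# In-memory demonstration; these records are invented, not taken from the dataset.
sample = [
    '{"id_str": "1", "text": "Flu season is here"}',
    '{"id_str": "2", "text": "Nothing relevant"}',
]
out = io.StringIO()
kept = write_subset(sample, out, "flu")
print(kept, "tweet(s) written")
```

With the real file, pass the open input handle as `lines` and an output handle opened with `open("subset.csv", "w", newline="")` as `out`; the resulting file can then be loaded in R with `read.csv("subset.csv")`.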

For long-running jobs, I used Amazon Web Services' EC2 running Ubuntu 14.04, accessed via PuTTY and WebSCP; for local processing, I used a Windows 7 laptop with the data on a terabyte external hard drive.

The Status Report in the main repo contains:

  • a comprehensive explanation of the dataset
  • examples of analyses done with this dataset
  • a list of references to other healthcare-related Twitter analyses
  • instructions for using Amazon Web Services
  • sample programs using this file with Python, R and MongoDB.

Completed reverse geo-tagging

26 Aug 12:20

The code included in this release covers the three-step process:

  • add the Twitter JSON
  • update the ["geo"]["coordinates"] fields with latitude and longitude
  • add the ["geo_reverse"] field containing country, city, zip, etc.
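Once the three steps have run, a tweet's location can be read back from those two fields. The sketch below is a hypothetical helper, not part of the release code. It assumes the coordinates are stored in latitude, longitude order, as the step description above suggests, and it returns the `geo_reverse` dictionary as-is because its inner key names are not listed here; the sample records are invented.

```python
def location_of(tweet):
    """Return (latitude, longitude, geo_reverse dict), or None if the tweet is not geo-tagged."""
    geo = tweet.get("geo") or {}  # "geo" may be missing or null
    coords = geo.get("coordinates")
    if not coords or len(coords) != 2:
        return None
    lat, lon = coords  # assumed order: latitude first, then longitude
    return lat, lon, tweet.get("geo_reverse") or {}


# Invented sample records for illustration only.
tagged = {
    "text": "x",
    "geo": {"type": "Point", "coordinates": [40.7, -74.0]},
    "geo_reverse": {"country": "United States"},
}
untagged = {"text": "y", "geo": None}

print(location_of(tagged))
print(location_of(untagged))
```

Returning `None` for untagged tweets lets callers filter with a simple `if location_of(tweet):` check instead of catching `KeyError` or `TypeError` on every row.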