bet4a/russian-troll-tweets-by-author

3 million Russian troll tweets—by author!

All data in this repository is directly derived from fivethirtyeight/russian-troll-tweets. If that’s not how you got here, please start there first.

This version of the dataset incorporates a variety of minor enhancements and changes, but its primary focus is the separation of author data from tweet data.

One author, many tweets

In the original dataset, every row represents a tweet. Most of the columns contain data unique to each tweet, but five columns pertain to the author: external_author_id, author, account_type, account_category, and new_june_2018. With some rare exceptions for external_author_id, these columns remain constant for every tweet sent by a particular author.

This leads to a lot of duplication. For example, the data captured 8,499 tweets from the handle “AtlantaOnline”. That author has an external_author_id of 2944766250, is categorized as a “NewsFeed” account_category with a “local” account_type, and is not marked as new_june_2018. This data really only needs to be stored once—but instead, it’s replicated for each of AtlantaOnline’s 8,499 tweets. All that duplication leads to larger file sizes.

Depending on what you’re doing, having the author data embedded within the tweet rows might not be ideal for your workflow.

In this repo, tweet data has been separated from author data. An author info CSV lists all the author data in one file, while a different CSV of tweets exists for each author. Here’s a quick rundown of the columns you’ll find in each kind of file. A lot of them are unchanged from the FiveThirtyEight dataset.
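The separation described above can be sketched in a few lines of Python. This is only an illustration, not the process actually used to build this repo: the sample rows are made up (only the AtlantaOnline author values come from the example above), the `0` encoding for new_june_2018 is an assumption, and `split_rows` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# The five author-level columns named in the original dataset.
AUTHOR_COLUMNS = [
    "external_author_id",
    "author",
    "account_type",
    "account_category",
    "new_june_2018",
]

# Made-up sample in the flat "one row per tweet" layout (tweet text is invented).
FLAT_CSV = """external_author_id,author,content,account_type,account_category,new_june_2018
2944766250,ATLANTAONLINE,first example tweet,local,NewsFeed,0
2944766250,ATLANTAONLINE,second example tweet,local,NewsFeed,0
"""


def split_rows(flat_rows):
    """Split flat rows into one record per author plus per-author tweet lists."""
    authors = {}
    tweets_by_author = defaultdict(list)
    for row in flat_rows:
        author_id = row["author"]  # simplification: ignores the rare ID conflicts
        # Store the author-level values only once per author.
        authors.setdefault(
            author_id,
            {"author_id": author_id, **{col: row[col] for col in AUTHOR_COLUMNS}},
        )
        # Keep only tweet-level columns, plus a key pointing back to the author.
        tweet = {k: v for k, v in row.items() if k not in AUTHOR_COLUMNS}
        tweet["author_id"] = author_id
        tweets_by_author[author_id].append(tweet)
    return authors, tweets_by_author


authors, tweets_by_author = split_rows(csv.DictReader(io.StringIO(FLAT_CSV)))
print(len(authors), "author;", len(tweets_by_author["ATLANTAONLINE"]), "tweets")
```

Each tweet keeps an author_id so it can be joined back to its author record when needed.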

Author columns

Column Definition
author_id A unique identifier for an author. In the vast majority of cases this is the same as the author name, but in a handful of cases, there are tweets by the same author but their external_author_id differs. Out of an abundance of caution, this dataset lists them as separate authors. For these cases, the author_id is the concatenation of the author name and their external_author_id.
author The handle sending the tweet. In the original dataset, these are ALLCAPS. I’ve matched up authors with the list from the Nov 2018 House Intelligence Committee dataset, which retains the original capitalization for users’ handles.
external_author_id Same as the original dataset, but values written in floating-point notation, such as 9.06000000000e+17, have been converted to plain integers (906000000000000000).
account_type No change.
account_category No change.
new_june_2018 No change.
congress_2017_id The November 2017 House Intelligence Committee data included user_ids in addition to handle names; those IDs are included here.
in_congress_2017 Boolean indicator whether the user was listed in the November 2017 House Intelligence Committee dataset.
in_congress_2018 Boolean indicator whether the user was listed in the June 2018 House Intelligence Committee dataset.
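As an illustration of the external_author_id cleanup, here is a minimal sketch of converting scientific notation to a plain integer string. It is one possible approach, not necessarily how the values in this repo were converted, and `plain_id` is a hypothetical helper name.

```python
from decimal import Decimal


def plain_id(raw):
    """Return an ID as a plain integer string, expanding scientific notation."""
    raw = raw.strip()
    if not raw:
        return ""  # leave missing IDs blank
    # Decimal parses "9.06000000000e+17" exactly, so no binary float rounding occurs.
    return str(int(Decimal(raw)))


print(plain_id("9.06000000000e+17"))
print(plain_id("2944766250"))
```

IDs that are already plain integers pass through unchanged.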

Tweet columns

Column Definition
author_id Corresponds to the author_id from the authors.csv table. (It’s also the same as the filename.)
content No change.
region No change.
language No change.
publish_date Datetimes are reformatted to mostly match the ISO 8601 standard—for example, 2016-11-29 14:05:00. This format makes sorting much easier. By default, all the tweet CSV files are sorted by this column in ascending order (earlier tweets appear first).
harvested_date Datetimes are reformatted, as above.
following No change.
followers No change.
updates No change.
post_type In the original, this was either blank, RETWEET or QUOTE_TWEET. In this data, it’s blank, R or Q.
retweet No change.
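The publish_date and post_type changes could be reproduced roughly as follows. This is a sketch under assumptions: it presumes the original datetimes look like `11/29/2016 14:05` (month/day/year, no seconds), so check that against the source data first. `reformat_datetime` and `POST_TYPE_CODES` are hypothetical names.

```python
from datetime import datetime

# Assumed layout of datetimes in the original dataset, e.g. "11/29/2016 14:05".
ORIGINAL_FORMAT = "%m/%d/%Y %H:%M"

# Shortened post_type codes as described in the table above.
POST_TYPE_CODES = {"": "", "RETWEET": "R", "QUOTE_TWEET": "Q"}


def reformat_datetime(raw):
    """Convert an original-style datetime to 'YYYY-MM-DD HH:MM:SS'."""
    return datetime.strptime(raw, ORIGINAL_FORMAT).strftime("%Y-%m-%d %H:%M:%S")


converted = [reformat_datetime(d) for d in ["11/29/2016 14:05", "2/3/2016 9:00"]]

# In this format, plain string sorting is also chronological sorting.
print(sorted(converted))
print(POST_TYPE_CODES["QUOTE_TWEET"])
```

Because the most significant fields come first and every field is zero-padded, sorting the strings sorts the tweets by time.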

Other minor enhancements

In the original data, there are a handful of users having multiple floating-point external_author_ids, with the variation seeming to be due to rounding. For example, there are some tweet entries from the user “blackunitymarch” showing an external_author_id of 7.69000000000e+17, and others having an external_author_id of 7.69363000000e+17.

For these cases, I’ve taken the more precise number, discarding the other one, and treated it as a single author. So there is a single entry in the authors.csv file for “blackunitymarch”, and a single file for that author’s tweets.
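One way to pick the "more precise" of two rounded IDs is to compare how many significant digits each one carries. The sketch below shows that idea only; it is an assumption about how such a rule could be automated, not a record of how the choice was made here, and both function names are hypothetical.

```python
from decimal import Decimal


def significant_digits(raw):
    """Count significant digits, ignoring trailing zeros in the mantissa."""
    return len(Decimal(raw).normalize().as_tuple().digits)


def most_precise(ids):
    """Return the ID string that carries the most significant digits."""
    return max(ids, key=significant_digits)


candidates = ["7.69000000000e+17", "7.69363000000e+17"]
print(most_precise(candidates))
```

For the “blackunitymarch” values above, this keeps 7.69363000000e+17, which has six significant digits against three.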

About this dataset

How it was made

I wish I had documented my process, but I didn’t. It was largely hacking things together, full of trial-and-error and lots of intermediate steps that probably weren’t necessary. The main tools I used were:

  • csvkit: “A suite of utilities for converting to and working with CSV, the king of tabular file formats.” Built atop Python, these command-line tools were invaluable. Their documentation has a remarkably well-written tutorial for getting started.
  • SQLite3: csvkit includes a csvsql command for generating a SQL table and importing CSV data into it. But some of the column data types it auto-infers when creating tables aren’t optimal. I found it necessary to create tables and indices manually first, then use csvsql only for importing data (not auto-generating tables).
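The "create the table first, then load the data" approach can be illustrated with Python's built-in sqlite3 module in place of csvsql. Everything here is a hypothetical sketch: the schema, the index, and the sample row are assumptions for illustration, not the tables actually used for this repo.

```python
import csv
import io
import sqlite3

# Made-up single-row sample in the shape of the author info file.
AUTHORS_CSV = """author_id,author,external_author_id,account_type,account_category,new_june_2018
AtlantaOnline,AtlantaOnline,2944766250,local,NewsFeed,0
"""

conn = sqlite3.connect(":memory:")

# Declare column types by hand rather than relying on inference.
# Keeping external_author_id as TEXT avoids any float-style reinterpretation.
conn.execute(
    """
    CREATE TABLE authors (
        author_id          TEXT PRIMARY KEY,
        author             TEXT NOT NULL,
        external_author_id TEXT,
        account_type       TEXT,
        account_category   TEXT,
        new_june_2018      INTEGER
    )
    """
)
conn.execute("CREATE INDEX idx_authors_category ON authors (account_category)")

# Load the CSV rows into the pre-built table.
rows = list(csv.DictReader(io.StringIO(AUTHORS_CSV)))
conn.executemany(
    "INSERT INTO authors VALUES (:author_id, :author, :external_author_id, "
    ":account_type, :account_category, :new_june_2018)",
    rows,
)
conn.commit()

stored = conn.execute(
    "SELECT external_author_id, typeof(external_author_id) FROM authors"
).fetchone()
print(stored)
```

Declaring the ID column as TEXT up front is the kind of decision that type inference can get wrong for very large numeric identifiers.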

Final thoughts

This project is a slice-and-dice reorganization of the data from the fivethirtyeight/russian-troll-tweets project. It is not a replacement. If you’re not already familiar with this data, you should begin by reading about where it came from and how it was derived. Then you should look at the official data repo from FiveThirtyEight; its README file is chock-full of valuable information.

There is a very good chance you’ll be perfectly satisfied with how the data is structured over there. If so, hooray! But if you find yourself wishing that the tweets were separated by author, or that having tweet-related data separated from author-related data would make your life a bit easier… then this is the repo for you.

Problems, questions, suggestions, angry criticisms about how I’m doing this all wrong? Please don’t hesitate to file an issue or message me! And feel free to submit a pull request if you’re extra-generous and have any improvements to contribute.

About

Restructuring of the FiveThirtyEight Russian Troll Tweets dataset.
