All data in this repository is directly derived from fivethirtyeight/russian-troll-tweets. If that’s not how you got here, please start there first.
This version of the dataset incorporates a variety of minor enhancements and changes, but its primary focus is the separation of author data from tweet data.
In the original dataset, every row represents a tweet. Most of the columns contain data unique to each tweet, but five columns pertain to the author: `external_author_id`, `author`, `account_type`, `account_category`, and `new_june_2018`. With some rare exceptions for `external_author_id`, these columns remain constant for every tweet sent by a particular author.
This leads to a lot of duplication. For example, the data captured 8,499 tweets from the handle “AtlantaOnline”. That author has an `external_author_id` of 2944766250, is categorized as a “NewsFeed” `account_category` with a “local” `account_type`, and is not marked as `new_june_2018`. This data really only needs to be stored once, but instead it’s replicated for each of AtlantaOnline’s 8,499 tweets. All that duplication leads to larger file sizes.
Depending on what you’re doing, having the author data embedded within the tweet rows might not be ideal for your workflow.
In this repo, tweet data has been separated from author data: a single authors.csv file lists all the author data, while each author’s tweets live in a CSV file of their own.
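For a rough sense of the transformation, here’s a minimal pandas sketch. It is *not* the actual pipeline I used (see the notes on process below), and the input filename and output paths are just illustrative:

```python
import pandas as pd
from pathlib import Path

# The five author-level columns from the original dataset.
AUTHOR_COLS = ["external_author_id", "author", "account_type",
               "account_category", "new_june_2018"]

# Read everything as strings so the big numeric IDs aren't mangled.
tweets = pd.read_csv("IRAhandle_tweets_1.csv", dtype=str)

# Author table: one row per author instead of one copy per tweet.
# (This glosses over the handful of handles with multiple IDs,
# discussed further down.)
authors = tweets[AUTHOR_COLS].drop_duplicates()
authors.to_csv("authors.csv", index=False)

# Tweet tables: one file per author, author-level columns stripped out.
Path("tweets").mkdir(exist_ok=True)
tweet_cols = ["author"] + [c for c in tweets.columns if c not in AUTHOR_COLS]
for handle, group in tweets.groupby("author"):
    group[tweet_cols].to_csv(Path("tweets") / f"{handle}.csv", index=False)
```

In the real files, the per-author key is the `author_id` described below rather than the raw handle; the sketch only shows the shape of the split.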
Here’s a quick rundown of the columns you’ll find in each kind of file. A lot of them are unchanged from the FiveThirtyEight dataset. First, the columns in authors.csv:

Column | Definition |
---|---|
`author_id` | A unique identifier for an author. In the vast majority of cases this is the same as the author name, but in a handful of cases there are tweets from the same author whose `external_author_id` differs. Out of an abundance of caution, this dataset lists them as separate authors; for these cases, the `author_id` is the concatenation of the author name and their `external_author_id`. |
`author` | The handle sending the tweet. In the original dataset, these are ALLCAPS. I’ve matched up authors with the list from the June 2018 House Intelligence Committee dataset, which retains the original capitalization of users’ handles. |
`external_author_id` | Same as the original dataset, but floating-point formatting like `9.06000000000e+17` has been converted to regular form (`906000000000000000`). |
`account_type` | No change. |
`account_category` | No change. |
`new_june_2018` | No change. |
`congress_2017_id` | The Nov 2017 House Intelligence Committee data included a `user_id` alongside each handle name; that ID is included here. |
`in_congress_2017` | Boolean indicating whether the user was listed in the November 2017 House Intelligence Committee dataset. |
`in_congress_2018` | Boolean indicating whether the user was listed in the June 2018 House Intelligence Committee dataset. |
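One practical note if you load authors.csv with pandas: read the ID columns as strings, or type inference can hand the large IDs right back to you in floating-point form. For example (column names as in the table above):

```python
import pandas as pd

# A column with any blank values gets inferred as float64, and IDs
# larger than 2**53 silently lose precision (906000000000000000 can
# come back as 9.06e+17). Reading as str sidesteps the whole issue.
authors = pd.read_csv(
    "authors.csv",
    dtype={"external_author_id": str, "congress_2017_id": str},
)
```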
And the columns in each per-author tweet CSV:

Column | Definition |
---|---|
`author_id` | Corresponds to the `author_id` from the authors.csv table. (It’s also the same as the filename.) |
`content` | No change. |
`region` | No change. |
`language` | No change. |
`publish_date` | Datetimes are reformatted to mostly match the ISO 8601 standard, e.g. `2016-11-29 14:05:00`. This format makes sorting much easier. By default, all the tweet CSV files are sorted by this column in ascending order (earlier tweets appear first). |
`harvested_date` | Datetimes are reformatted, as above. |
`following` | No change. |
`followers` | No change. |
`updates` | No change. |
`post_type` | In the original, this was either blank, `RETWEET`, or `QUOTE_TWEET`. In this data, it’s blank, `R`, or `Q`. |
`retweet` | No change. |
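Because the two kinds of files share `author_id`, recombining them is a single join. Here’s a quick sketch (the tweet filename is just an example):

```python
import pandas as pd

authors = pd.read_csv("authors.csv", dtype=str)
tweets = pd.read_csv("AtlantaOnline.csv", dtype=str)  # filename = author_id

# Reattach the author-level columns with a plain join on author_id.
full = tweets.merge(authors, on="author_id", how="left")

# ISO-8601-style datetimes sort chronologically even as plain strings,
# which is what makes the reformatted publish_date easy to work with.
full = full.sort_values("publish_date")

# Expand the shortened post_type codes back to the original labels.
full["post_type"] = full["post_type"].map({"R": "RETWEET", "Q": "QUOTE_TWEET"})
```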
In the original data, there are a handful of users with multiple floating-point `external_author_id`s, with the variation seemingly due to rounding. For example, some tweet entries from the user “blackunitymarch” show an `external_author_id` of `7.69000000000e+17`, while others show `7.69363000000e+17`.
For these cases, I’ve taken the more precise number, discarded the other, and treated them as a single author. So there is a single entry in the authors.csv file for “blackunitymarch”, and a single file for that author’s tweets.
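If you want to reproduce that choice, the gist is to keep whichever value carries the most significant digits, and to parse with `decimal` rather than `float` so none of the 18 digits get rounded away. A rough reconstruction (the helper name is mine, not from the original process):

```python
from decimal import Decimal

def most_precise_id(float_ids):
    """Keep the ID with the most significant digits."""
    def sig_digits(s):
        # normalize() strips trailing zeros, leaving only significant digits.
        return len(Decimal(s).normalize().as_tuple().digits)
    best = max(float_ids, key=sig_digits)
    # Decimal preserves all 18 digits; float() would round them off.
    return int(Decimal(best))

most_precise_id(["7.69000000000e+17", "7.69363000000e+17"])
# -> 769363000000000000
```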
I wish I had documented my process, but I didn’t. It was largely a matter of hacking things together, full of trial and error and intermediate steps that probably weren’t necessary. The main tools I used were:
- csvkit: “A suite of utilities for converting to and working with CSV, the king of tabular file formats.” Built atop Python, these command-line tools were invaluable. Their documentation has a remarkably well-written tutorial for getting started.
- SQLite3: csvkit includes a `csvsql` command for generating a SQL table and exporting CSV data into it. But some of the column data types it auto-infers when creating tables aren’t optimal. I found it necessary to create tables and indices manually first, then use `csvsql` only for exporting data (not auto-generating tables). A sketch of that create-first workflow follows below.
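The create-first half of that workflow looked roughly like this; the schema here is abridged and illustrative, using Python’s built-in sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect("tweets.db")
# Declare column types and indices by hand instead of letting
# csvsql infer them. (Column list abridged for illustration.)
conn.executescript("""
CREATE TABLE tweets (
    author_id    TEXT,
    content      TEXT,
    publish_date TEXT,  -- ISO-8601-ish strings sort chronologically
    post_type    TEXT
);
CREATE INDEX idx_tweets_author ON tweets (author_id);
CREATE INDEX idx_tweets_date   ON tweets (publish_date);
""")
conn.commit()
conn.close()
```

With the table already in place, something like `csvsql --db sqlite:///tweets.db --insert --no-create tweets.csv` handles just the data import (flags from memory; check `csvsql --help`).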
This project is a slice-and-dice reorganization of the data from the fivethirtyeight/russian-troll-tweets project. It is not a replacement. If you’re not already familiar with this data, you should begin by reading about where it came from and how it was derived. Then you should look at the official data repo from FiveThirtyEight; its README file is chock-full of valuable information.
There is a very good chance you’ll be perfectly satisfied with how the data is structured over there. If so, hooray! But if you find yourself wishing the tweets were separated by author, or thinking that having tweet data separated from author data would make your life a bit easier… then this is the repo for you.
Problems, questions, suggestions, angry criticisms about how I’m doing this all wrong? Please don’t hesitate to file an issue or message me! And feel free to submit a pull request if you’re feeling extra generous and have improvements to contribute.