This codebase relates to the GabLeaks release from Distributed Denial of Secrets. Although distribution of the Gab data itself is limited, this helper code is public.
The original release is a PostgreSQL database dump with four tables: accounts, statuses, gabgroups, and verifications. In the 'statuses', 'accounts', and 'gabgroups' tables, however, most of the relevant details are stored as JSON in a single 'data' column.
The mixed JSON / SQL format is not the most convenient for analysis. This script expands the data, creating new statuses_expanded, accounts_expanded, and gabgroups_expanded tables with individual typed fields instead of one big JSON blob. The schemas are as follows:
statuses_expanded
Field | Type |
---|---|
id | bigint |
bookmark_collection_id | bigint |
card | jsonb |
content | text |
created_at | timestamp with time zone |
emojis | jsonb |
expires_at | timestamp with time zone |
favourited | boolean |
favourites_count | bigint |
group_ | jsonb |
has_quote | boolean |
in_reply_to_account_id | bigint |
in_reply_to_id | bigint |
language | text |
media_attachments | jsonb |
mentions | jsonb |
pinnable | boolean |
pinnable_by_group | boolean |
plain_markdown | text |
poll | jsonb |
quote | jsonb |
quote_of_id | bigint |
reblog | jsonb |
reblogged | boolean |
reblogs_count | bigint |
replies_count | bigint |
revised_at | text |
rich_content | text |
sensitive | boolean |
spoiler_text | text |
tags | jsonb |
url | text |
visibility | text |
accounts_expanded
Field | Type |
---|---|
id | bigint |
text | |
password | text |
name | text |
bot | boolean |
url | text |
note | text |
avatar | text |
emojis | jsonb |
fields | jsonb |
header | text |
is_pro | boolean |
locked | boolean |
is_donor | boolean |
created | timestamp with time zone |
is_investor | boolean |
is_verified | boolean |
display_name | text |
avatar_static | text |
header_static | text |
statuses_count | bigint |
followers_count | bigint |
following_count | bigint |
is_flagged_as_spam | boolean |
gabgroups_expanded
Field | Type |
---|---|
id | int |
password | text |
url | text |
slug | jsonb |
tags | jsonb |
title | text |
created_at | timestamp with time zone |
is_private | boolean |
is_visible | boolean |
description | text |
is_archived | boolean |
has_password | boolean |
member_count | int |
group_category | jsonb |
cover_image_url | text |
description_html | text |
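To illustrate what expanding a JSON blob into typed fields can look like, here is a minimal sketch in pure Python. It is not the code from expand.py: the function names, the chosen subset of statuses_expanded columns, the converter choices, and the example record are all assumptions made for illustration.

```python
import json
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def as_jsonb(value):
    """Re-serialize a nested structure so it could be stored in a jsonb column."""
    return json.dumps(value)


# Hypothetical subset of the statuses_expanded columns. The column names come
# from the schema above; the converters are assumptions for this sketch.
STATUS_FIELDS = {
    "id": int,                      # bigint
    "content": str,                 # text
    "created_at": parse_timestamp,  # timestamp with time zone
    "favourites_count": int,        # bigint
    "in_reply_to_id": int,          # bigint, often null
    "sensitive": bool,              # boolean (JSON true/false is already bool)
    "tags": as_jsonb,               # jsonb
}


def expand_status(raw_json):
    """Turn one JSON blob into a dict of typed column values.

    Missing or null keys become None so they would map to SQL NULL.
    """
    data = json.loads(raw_json)
    row = {}
    for column, convert in STATUS_FIELDS.items():
        value = data.get(column)
        row[column] = None if value is None else convert(value)
    return row


# Made-up example record, not taken from the dataset.
example = json.dumps({
    "id": "12345",
    "content": "<p>hello</p>",
    "created_at": "2020-01-02T03:04:05.000Z",
    "favourites_count": 3,
    "in_reply_to_id": None,
    "sensitive": False,
    "tags": [{"name": "example"}],
})
row = expand_status(example)
print(row["id"], row["created_at"].isoformat(), row["in_reply_to_id"])
```

Nested values such as tags stay as JSON, which matches the jsonb columns in the schemas above, while scalar values become plain integers, strings, booleans, and timestamps.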
First, ensure you already have the dataset loaded into PostgreSQL. Installing and configuring Postgres, creating a database and user, and importing the database contents from a SQL dump are out of scope for this README.
Second, install dependencies for this project:
pip3 install -r requirements.txt
Or install manually with:
pip3 install psycopg2 tqdm
Finally, launch the script like:
./expand.py <sql_host> <sql_user> <sql_db_name>
For example:
./expand.py localhost postgres gableaks
You'll be prompted for your Postgres user's password, and the script should take care of everything from there. A progress bar is included. With about 39 million statuses, expanding the statuses table took a little over two hours in development tests.
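Once the script has finished, one way to confirm that the expanded tables exist is to connect with the same host, user, and database name and count the rows. The following is a hedged sketch rather than part of expand.py: the helper names (parse_args, connection_params, main) and the prompt text are hypothetical, and it assumes psycopg2 is installed and the statuses_expanded table has been created.

```python
import argparse
import getpass


def parse_args(argv=None):
    """Parse the same three positional arguments as the usage line above."""
    parser = argparse.ArgumentParser(description="Count expanded statuses")
    parser.add_argument("sql_host")
    parser.add_argument("sql_user")
    parser.add_argument("sql_db_name")
    return parser.parse_args(argv)


def connection_params(args, password):
    """Build keyword arguments accepted by psycopg2.connect()."""
    return {
        "host": args.sql_host,
        "user": args.sql_user,
        "dbname": args.sql_db_name,
        "password": password,
    }


def main(argv=None):
    """Prompt for the password, connect, and print the expanded row count."""
    # Imported here so the helpers above can be used without psycopg2.
    import psycopg2

    args = parse_args(argv)
    password = getpass.getpass(f"Password for {args.sql_user}: ")
    conn = psycopg2.connect(**connection_params(args, password))
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM statuses_expanded")
            print("expanded statuses:", cur.fetchone()[0])
    finally:
        conn.close()


# To run against a live database, call for example:
#   main(["localhost", "postgres", "gableaks"])
```

The password is read with getpass so that it does not appear on the command line or in shell history, which mirrors the prompt described above.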