Change to Twitter archive format. #229

garrettc · 2022-02-11T10:44:39Z

It looks like Twitter have changed the format of their archives (possibly related to #227?):

$ ./manage.py import_twitter_tweets --path=/Users/garrettc/temp/twitter-2022-02-06/

CommandError: Expected to find a directory at '/Users/garrettc/temp/twitter-2022-02-06/data/js/tweets' containing JSON files

<archive_root>/data now contains a whole bunch of js files, with a single tweet.js file that contains your public timeline and a corresponding tweet_media directory with the associated media.

As a quick test, in the vain hope that they'd only changed the directory structure, I moved tweet.js into a js/tweets/ directory and re-ran the command, but it bails out with:

Traceback (most recent call last):
  File "/Users/garrettc/sandbox/polytechnic/env/lib/python3.9/site-packages/ditto/twitter/ingest.py", line 102, in _get_data_from_file
    tweets_data = json.loads("".join(lines))
  File "/opt/homebrew/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/homebrew/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 25 column 4 (char 676)

So I'm guessing there are more fundamental changes afoot here. I don't have an example of an old archive to narrow down what else has changed.

The "new" tweet.js structure is as follows:

window.YTD.tweet.part0 = [
  {
    "tweet" : {
      "retweeted" : false,
      "source" : "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
      "entities" : {
        "hashtags" : [ ],
        "symbols" : [ ],
        "user_mentions" : [ ],
        "urls" : [ ]
      },
      "display_text_range" : [
        "0",
        "68"
      ],
      "favorite_count" : "0",
      "id_str" : "200873",
      "truncated" : false,
      "retweet_count" : "0",
      "id" : "200873",
      "created_at" : "Fri Nov 24 21:02:53 +0000 2006",
      "favorited" : false,
      "full_text" : "Some random string of words",
      "lang" : "en"
    }
  }
]
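The file is a JavaScript assignment rather than plain JSON, so the part before the `=` has to be removed before `json.loads()` can parse it. A minimal sketch of that step, assuming the whole file is a single `<name> = <JSON>` assignment as in the sample above; `load_archive_js` is a hypothetical helper, not part of Ditto:

```python
import json


def load_archive_js(text):
    """Parse the contents of a Twitter archive .js file into Python data.

    Assumes the file is one JavaScript assignment, `<name> = <JSON>`.
    (Hypothetical helper for illustration; not Ditto's own code.)
    """
    # Everything before the first "=" is the JavaScript variable name;
    # everything after it should be plain JSON.
    _, separator, payload = text.partition("=")
    if not separator:
        raise ValueError("Expected a '<name> = <JSON>' assignment")
    return json.loads(payload)


sample = """window.YTD.tweet.part0 = [
  {
    "tweet" : {
      "id_str" : "200873",
      "full_text" : "Some random string of words"
    }
  }
]"""

# Each list item wraps the tweet's data in a "tweet" key, so unwrap it.
tweets = [item["tweet"] for item in load_archive_js(sample)]
print(tweets[0]["id_str"])
```

Splitting on the first `=` is safe only because the variable name comes before any JSON string that might itself contain an `=`.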

(On the upside, the Pinboard and Last.fm imports went perfectly. I'm having some issues with Flickr, but I think that's to do with the number of photos I have. I'm doing some more investigating and then I'll report back.)


philgyford · 2022-02-11T10:53:25Z

Thanks - I'll have a look.

I did have an issue with importing my Flickr photos when running it on a Digital Ocean 512MB Droplet, although it worked OK on my MacBook (#148). If the issue is something like that, I guess that adding an option to continue an import from whatever point it failed might be the simplest workaround (assuming I have no idea how to fix the memory issue). But we can continue that discussion on that issue if necessary.

Glad to hear Pinboard and Last.fm went OK!

philgyford · 2022-02-11T16:17:53Z

The twitter archive format is remarkably different from what it was when I wrote the code. Here's a screenshot of the old download:

And here's the new one (I think that tweet.js file contains all a user's tweets; mine is 25.6MB):

The old month-dated .js files used to start like this:

Grailbird.data.tweets_2006_11 = 
 [ {
  "source" : "\u003Ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003ETwitter Web Client\u003C\/a\u003E",
  "entities" : {
    "user_mentions" : [ ],
    "media" : [ ],
    "hashtags" : [ ],
    "urls" : [ ]
  },
  "geo" : { },
  "id_str" : "467973",
  "text" : "Woke up one minute before the alarm, and now feel unusually awake.",
  "id" : 467973,
  "created_at" : "2006-11-30 00:00:00 +0000",
  "user" : {
    "name" : "Phil Gyford",
    "screen_name" : "philgyford",
    "protected" : false,
    "id_str" : "12552",
    "profile_image_url_https" : "https:\/\/pbs.twimg.com\/profile_images\/1167616130\/james_200208_300x300_normal.jpg",
    "id" : 12552,
    "verified" : false
  }
},

And the new tweet.js file starts like this:

window.YTD.tweet.part0 = [
  {
    "tweet" : {
      "retweeted" : false,
      "source" : "<a href=\"http://www.cloudhopper.com/\" rel=\"nofollow\">Twitter SMS</a>",
      "entities" : {
        "hashtags" : [ ],
        "symbols" : [ ],
        "user_mentions" : [ ],
        "urls" : [ ]
      },
      "display_text_range" : [
        "0",
        "35"
      ],
      "favorite_count" : "0",
      "id_str" : "71534",
      "truncated" : false,
      "retweet_count" : "0",
      "id" : "71534",
      "created_at" : "Fri Nov 17 16:47:58 +0000 2006",
      "favorited" : false,
      "full_text" : "Supermarket so bright hurt my eyes.",
      "lang" : "en"
    }
  },
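One difference visible in the two samples is the `created_at` value: `2006-11-30 00:00:00 +0000` in the old files and `Fri Nov 17 16:47:58 +0000 2006` in the new one. A rough sketch of parsing either form with the standard library, assuming these are the only two formats that occur; `parse_created_at` is a hypothetical name:

```python
from datetime import datetime

# Format strings inferred from the two samples in this thread (assumption:
# no other variants appear in real archives).
NEW_FORMAT = "%a %b %d %H:%M:%S %z %Y"  # e.g. "Fri Nov 17 16:47:58 +0000 2006"
OLD_FORMAT = "%Y-%m-%d %H:%M:%S %z"     # e.g. "2006-11-30 00:00:00 +0000"


def parse_created_at(value):
    """Return a timezone-aware datetime for either created_at format."""
    for fmt in (NEW_FORMAT, OLD_FORMAT):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognised created_at value: {value!r}")
```

Note that `%a` and `%b` depend on the current locale, so this assumes an English (or C) locale.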

Anyway, just a very quick cursory look so far.

In preparation for importing the "new" 2019 format of Twitter archives. For #229

So that we can keep it working, for those with older existing Twitter Archive downloads, while adding a newer ingester for those with 2019+ downloads. For #229

Some tweet JSON has the `display_text_range` set as strings, not ints, e.g. `["0", "140"]` rather than `[0, 140]`, particularly when the JSON has come from the downloaded Twitter archive.

Some tweet JSON also has the `["entities"][<kind>]["indices"]` set as strings, not ints, e.g. `["0", "9"]` rather than `[0, 9]`, particularly when the JSON has come from the downloaded Twitter archive.

For #229
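A minimal sketch of that kind of normalisation, assuming each entity is a dict in a list under `entities[<kind>]` and may carry an `indices` pair; `coerce_indices` is a hypothetical name and this is not necessarily how Ditto does it:

```python
def coerce_indices(tweet):
    """Return a copy of the tweet dict with string offsets converted to ints.

    Hypothetical helper for illustration. It leaves the input untouched.
    """
    fixed = dict(tweet)

    if "display_text_range" in fixed:
        fixed["display_text_range"] = [int(n) for n in fixed["display_text_range"]]

    if "entities" in fixed:
        entities = {}
        for kind, items in fixed["entities"].items():
            # Rebuild each entity, converting "indices" where it is present.
            entities[kind] = [
                {**item, "indices": [int(n) for n in item["indices"]]}
                if "indices" in item
                else item
                for item in items
            ]
        fixed["entities"] = entities

    return fixed
```

Calling `int()` on values that are already ints is harmless, so the same function can be applied to data from the API and from the archive.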

* Added the `Version2TweetIngester` class, which collates the Twitter user data from three separate archive files, and the tweet data from the single large `tweet.js` file, and passes all that to the saver. We add a note to the user data to make it clear, when it's saved to the database as "raw" data, that it was compiled by this code and doesn't come directly from the API/archive.
* Adjusted the `TweetSaver` class so that it can be passed data about a Twitter user separately. The API, and the previous Twitter archive, included the user data within each tweet's data. But now, presumably to save space, the individual tweets' JSON doesn't include the user data. So we now pass the `TweetSaver` the user data as a separate object.
* Added tests for the `Version2TweetIngester`.

Still to do:

* Waiting for an archive of a private Twitter account, in order to see what the structure of the `protected-history.js` file is like, so that we can correctly set the privacy status of the account.
* Given the 2019+ archive includes media files for the tweets, we may as well import all those in the Ingester as `Media` objects.

For #229
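As a rough illustration of the collating step only: the sketch below merges several already-parsed dicts into one user dict and adds a marker noting that it was compiled. The function name, the `ditto_note` key and its wording are all hypothetical; the real `Version2TweetIngester` may structure this differently.

```python
def collate_user_data(*sources):
    """Merge dicts of user data into one, marking the result as compiled.

    Hypothetical helper: later sources override earlier ones on key clashes.
    """
    user = {}
    for source in sources:
        user.update(source)
    # Hypothetical key and wording, flagging that this is not raw API data.
    user["ditto_note"] = "Compiled from archive files, not fetched from the API"
    return user


# Made-up example values standing in for data read from separate files.
user = collate_user_data(
    {"id_str": "12552", "screen_name": "philgyford"},
    {"name": "Phil Gyford"},
    {"verified": False},
)
```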

* The downloaded archive includes all the media files associated with a user's tweets, so we can import them relatively easily.
* We import the MP4s Twitter uses to display animated GIFs, and the image files for JPGs/PNGs. We don't import video files that were uploaded as such because we don't currently include those when fetching media files from the API, so this is to remain consistent.
* When we fetch media files for animated GIFs, we fetch both the MP4 and a JPG of it. Although we have the path for both in the tweet data in the archive, only the MP4 is present in the `tweet_media` directory, so we only import that.

For #229

…data (a) because we can't get it from the downloaded archive https://twittercommunity.com/t/download-archive-does-not-include-current-protected-status/166622 and (b) because that value should be set when saving the Account object, which fetches the User data from Twitter before the import. For #229

* To account for new procedures for setting up a Twitter App and applying for Extended Permissions, which are required to access the v1.1 API
* And to document the two versions of the import management command.

For #229

* Make passing "private" in when saving Twitter User data optional (because it's not in the downloaded archive of data)
* Cope with the fact some idiot (me) decided to pass either a dict *or* a boolean from a method.

For #229

For #229

philgyford · 2022-02-14T11:55:12Z

@garrettc OK, hopefully import_twitter_tweets will work with the new format now!

During all this I've realised that, for new Apps, your Twitter dev account will need to apply for "Extended access" in order to access the v1.1 API, which is what Ditto uses. I've no idea how long that takes to happen.

Let me know how the import goes!

garrettc · 2022-02-14T12:33:31Z

I just put in my request for elevated access, and they approved it straight away! (I must have a kind face), so I'll start testing the new importer.

I've started work on the Flickr importer improvements. I had to take a bit of time to get my head around ArgumentParser and mutually exclusive arguments, but I think I've got something that might not destroy the universe. More on that later.

garrettc · 2022-02-14T13:01:24Z

Worked perfectly!

(env) [garrettc - 12:45] polytechnic $ python manage.py import_twitter_tweets --path=/Users/garrettc/temp/twitter-2022-02-06
Imported 32011 tweets from 1 file, and 1043 media files
(env) [garrettc - 12:50] polytechnic $

philgyford · 2022-02-14T14:22:31Z

Excellent!

philgyford added app:twitter bug labels Feb 11, 2022

garrettc mentioned this issue Feb 11, 2022

Intermittent 500 errors from Flickr API during initial fetch #230

Closed

philgyford self-assigned this Feb 11, 2022

philgyford added a commit that referenced this issue Feb 13, 2022

Move test_ingest.py to test_ingest_v1.py

1eb5445

In preparation for importing the "new" 2019 format of Twitter archives. For #229

philgyford added a commit that referenced this issue Feb 13, 2022

Make existing TweetIngester Version1TweetIngester

81d3c0a

So that we can keep it working, for those with older existing Twitter Archive downloads, while adding a newer ingester for those with 2019+ downloads. For #229

philgyford added a commit that referenced this issue Feb 14, 2022

Please the linter

6aca66d

For #229

philgyford mentioned this issue Feb 14, 2022

Support Django 4.0 and python 3.10; Support import of "new" Twitter archive downloads #231

Merged

philgyford closed this as completed in #231 Feb 14, 2022
