Support Django 4.0 and python 3.10; Support import of "new" Twitter archive downloads #231

Merged
merged 22 commits into from
Feb 14, 2022

philgyford

  • Add support for Django 4.0
  • Ensure support for python 3.10
  • Drop support for Django 2.2 and 3.1
  • Require django-taggit v2.0.0
  • Add support for importing the Twitter archive format that was introduced sometime in early 2019. Retain support for the previous format via an argument on the `import_tweets` command. When used with the new archive format, the updated command also imports the media files that the archive includes.

Fixes #229
For #225
For #223

It *should* probably all work once django-taggit releases a version
that supports Django 4.0 and python 3.10.

See jazzband/django-taggit#776

For #223
Remembered that I removed it because it won't work with django-taggit
2.0 that is needed for Django 4.0.
No, the sequence doesn't go 3.8, 3.9, 4.0...
In preparation for importing the "new" 2019 format of Twitter archives.

For #229
So that we can keep it working, for those with older existing Twitter
Archive downloads, while adding a newer ingester for those with 2019+
downloads.

For #229
Some tweet JSON has `display_text_range` set as strings, not ints,
e.g. `["0", "140"]` rather than `[0, 140]`. This happens particularly
when the JSON has come from the downloaded Twitter archive.

And

Some tweet JSON has `["entities"][<kind>]["indices"]` set as
strings, not ints, e.g. `["0", "9"]` rather than `[0, 9]`.
This happens particularly when the JSON has come from the downloaded
Twitter archive.
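
A minimal sketch of the kind of normalisation the two fixes above describe, assuming a tweet is a plain `dict` as loaded from JSON. The function name `coerce_int_ranges` is hypothetical and is not taken from this codebase:

```python
def coerce_int_ranges(tweet):
    """Convert string offsets such as ["0", "140"] to ints, in place.

    Hypothetical helper for illustration; not the project's actual code.
    """
    if "display_text_range" in tweet:
        tweet["display_text_range"] = [int(n) for n in tweet["display_text_range"]]

    # Each kind of entity (hashtags, urls, ...) is expected to be a list of dicts.
    for items in tweet.get("entities", {}).values():
        if not isinstance(items, list):
            continue
        for item in items:
            if isinstance(item, dict) and "indices" in item:
                item["indices"] = [int(n) for n in item["indices"]]
    return tweet


example = coerce_int_ranges(
    {
        "display_text_range": ["0", "140"],
        "entities": {"hashtags": [{"text": "django", "indices": ["0", "9"]}], "urls": []},
    }
)
```

Values that are already ints pass through `int()` unchanged, so the same function can be applied to JSON from the API and from the archive.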

For #229
* Added the `Version2TweetIngester` class which collates the twitter
  user data from three separate archive files, and the tweet data from
  the single large `tweet.js` file, and passes all that to the saver.
  We add a note to the user data to make it clear - when it's saved to
  the database as "raw" data - that it was compiled by this code, and
  doesn't come directly from the API/archive.

* Adjusted the `TweetSaver` class so that it can be passed data about
  a Twitter user separately. The API, and the previous Twitter archive,
  included the user data within each tweet's data. But now, presumably
  to save space, the individual tweets' JSON doesn't include the user
  data, so we pass it to the `TweetSaver` as a separate object.

* Added tests for the `Version2TweetIngester`.
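
As a rough illustration of the collation step described above, here is a sketch that is not the `Version2TweetIngester` itself. It assumes each archive `.js` file is a JavaScript assignment whose right-hand side is a JSON array, and that the user data can be merged as flat dicts. The function names, the `"note"` key and the sample data are all hypothetical:

```python
import json


def load_archive_js(text):
    """Parse an archive .js file, assuming it looks like `<name> = [ ...JSON... ]`."""
    # Skip the assumed JavaScript assignment prefix before the first "[".
    return json.loads(text[text.index("["):])


def collate_user_data(*parts):
    """Merge several partial user dicts and flag the result as compiled."""
    user = {}
    for part in parts:
        user.update(part)
    # Hypothetical key: marks the data as assembled here, not raw API output.
    user["note"] = "Compiled from several archive files; not raw API data."
    return user


sample = 'window.YTD.account.part0 = [ {"account": {"username": "bob", "accountId": "123"}} ]'
account = load_archive_js(sample)[0]["account"]
user = collate_user_data(account, {"description": "Hello"})
```

The resulting `user` dict can then be handed to the saver as a separate object, alongside the tweets parsed from `tweet.js` in the same way.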

Still to do:

* Waiting for an archive of a private twitter account, in order to see
  what the structure of the `protected-history.js` file is like, so that
  we can correctly set the privacy status of the account.

* Given the 2019+ archive includes media files for the tweets, we may
  as well import all those in the Ingester as `Media` objects.

For #229
* The downloaded archive includes all the media files associated
  with a user's tweets, so we can import them relatively easily.

* We import the MP4s Twitter uses to display animated GIFs, and the
  image files for JPGs/PNGs. We don't import video files that were
  uploaded as such because we don't currently include those when
  fetching media files from the API, so this keeps the two consistent.

* When we fetch media files for animated GIFs, we fetch both the MP4
  and a JPG of it. Although we have the path for both in the tweet data
  in the archive, only the MP4 is present in the `tweet_media` directory
  so we only import that.
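
The selection rules above could be sketched roughly as follows. This assumes media items follow the v1.1 API shape (`type`, `media_url_https`, `video_info.variants`) and, as a further unverified assumption, that files in `tweet_media` are named `<tweet id>-<basename of the media URL>`. The function name and the sample URLs are hypothetical:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def archive_media_filename(tweet_id, media):
    """Return the assumed filename in tweet_media, or None if we skip this item."""
    kind = media.get("type")
    if kind == "photo":
        url = media["media_url_https"]
    elif kind == "animated_gif":
        # Only the MP4 is present in the archive, not the JPG still.
        url = media["video_info"]["variants"][0]["url"]
    else:
        # Uploaded videos are skipped, matching what is fetched from the API.
        return None
    # Assumed naming convention: "<tweet id>-<basename of URL>".
    return f"{tweet_id}-{PurePosixPath(urlparse(url).path).name}"


photo = {"type": "photo", "media_url_https": "https://pbs.twimg.com/media/ABC.jpg"}
gif = {
    "type": "animated_gif",
    "video_info": {"variants": [{"url": "https://video.twimg.com/tweet_video/XYZ.mp4"}]},
}
video = {"type": "video"}
```

Returning `None` for unsupported types lets the caller skip those items without special-casing each media kind.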

For #229
To be in sync with what they're called on Twitter these days.

And change link to Twitter developer portal.
I'm sure this isn't an ideal way, but it's better than not doing anything
which is what was happening before.
…data

(a) because we can't get it from the downloaded archive https://twittercommunity.com/t/download-archive-does-not-include-current-protected-status/166622 and (b)
because that value should be set when saving the Account object, which
fetches the User data from Twitter before the import.

For #229
* To account for new procedures for setting up a Twitter App and applying
  for Extended Permissions, which are required to access the v1.1 API

* And to document the two versions of the import management command.

For #229
* Make passing "private" in when saving Twitter User data optional (because it's not in the downloaded archive of data)
* Cope with the fact some idiot (me) decided to pass either a dict *or* a boolean from a method.

For #229
@coveralls

Coverage Status

Coverage decreased (-0.2%) to 93.867% when pulling 6aca66d on v2 into 591bdce on main.

@philgyford philgyford merged commit fce848f into main Feb 14, 2022
Linked issue: Change to Twitter archive format.