Remove (or prevent) duplicate media files #37

sjacks26 · 2019-10-18T18:49:50Z

Right now, getmedia.py downloads every media file it can. But because of retweets and reshared images and other media, we might not need to keep every copy.
Instead, we might want to assign a unique identifier to each distinct media file, then link that identifier to every tweet in which the media appears.

If that's the case, we might hash each media file to get an identifier based on its exact byte content; store all unique hashes somewhere (perhaps alongside the main data collection, like we do with stream limit messages); compare each new media file's hash against the list of existing hashes; and retain a file only if its hash is new.
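
A minimal sketch of that flow, to make the proposal concrete. It is not how getmedia.py works today. The function names (`file_sha256`, `register_media`, `load_index`, `save_index`), the JSON index layout, and the choice of SHA-256 are all assumptions made for illustration. Note that a content hash only catches byte-identical copies; a resized or re-encoded version of the same image would get a different hash.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_index(index_path):
    """Load the hash index (hypothetical JSON file), or start an empty one."""
    index_path = Path(index_path)
    if index_path.exists():
        return json.loads(index_path.read_text())
    return {}


def save_index(index, index_path):
    """Write the hash index back to disk as JSON."""
    Path(index_path).write_text(json.dumps(index, indent=2))


def register_media(index, media_path, tweet_id):
    """Link tweet_id to the media file's hash; drop the file if it is a duplicate.

    Returns the hash that identifies the media content.
    """
    media_path = Path(media_path).resolve()
    media_hash = file_sha256(media_path)
    entry = index.get(media_hash)
    if entry is None:
        # First time this content is seen: keep the file and record it.
        index[media_hash] = {"path": str(media_path), "tweet_ids": [tweet_id]}
    else:
        # Known content: link the tweet, then remove the redundant copy.
        if tweet_id not in entry["tweet_ids"]:
            entry["tweet_ids"].append(tweet_id)
        if Path(entry["path"]) != media_path:
            media_path.unlink()
    return media_hash
```

With something like this, each downloaded file would be passed to `register_media` along with its tweet ID, and the index would be saved next to the main data collection. Checking for a known hash before downloading is not possible with this approach, since the hash requires the file's bytes, so this removes duplicates rather than preventing the download.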

sjacks26 added the enhancement label Oct 18, 2019

sjacks26 self-assigned this Oct 18, 2019
