Remove (or prevent) duplicate media files #37

Open
sjacks26 opened this issue Oct 18, 2019 · 0 comments
@sjacks26

Right now, getmedia.py downloads every media file it can. But because retweets and reshares repeat the same images and other media, we probably don't need to keep every copy.
Instead, we might want to create a unique identifier for each distinct media file, then link that identifier to every tweet in which the media appears.
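
One possible shape for that link, as a minimal sketch: a mapping from a media identifier to the set of tweet IDs it appears in. The names `MediaIndex`, `link`, and `tweets_for` are hypothetical and not taken from getmedia.py.

```python
from collections import defaultdict


class MediaIndex:
    """Hypothetical index: media identifier -> IDs of tweets that contain it."""

    def __init__(self):
        self._tweets_by_media = defaultdict(set)

    def link(self, media_id, tweet_id):
        """Record that the media appears in the tweet.

        Returns True if this media identifier had not been seen before.
        """
        is_new = media_id not in self._tweets_by_media
        self._tweets_by_media[media_id].add(tweet_id)
        return is_new

    def tweets_for(self, media_id):
        """Return the sorted tweet IDs linked to a media identifier."""
        return sorted(self._tweets_by_media.get(media_id, ()))
```

With something like this, a retweet that carries an already-seen image would only add a tweet ID to an existing entry rather than a second copy of the file.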

If that's the case, we could hash each media file to get a fingerprint of its exact bytes; store all unique hashes somewhere (perhaps alongside the main data collection, like we do with stream limit messages); compare each new file's hash against the stored hashes; and retain only the files whose hash we haven't seen before.
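
A rough sketch of those steps, under some assumptions: SHA-256 as the hash, a plain text file with one hash per line as the store, and the function names `file_sha256`, `load_hashes`, and `keep_if_unique`. None of these come from the existing code.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_hashes(store_path):
    """Load known hashes (one per line) from the hypothetical store file."""
    if not os.path.exists(store_path):
        return set()
    with open(store_path, "r", encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def keep_if_unique(media_path, known_hashes, store_path):
    """Keep the file if its hash is new; otherwise delete the duplicate copy.

    Returns a (hash, kept) tuple.
    """
    media_hash = file_sha256(media_path)
    if media_hash in known_hashes:
        os.remove(media_path)  # duplicate: drop this copy
        return media_hash, False
    known_hashes.add(media_hash)
    with open(store_path, "a", encoding="utf-8") as handle:
        handle.write(media_hash + "\n")
    return media_hash, True
```

Note that a byte-level hash only matches files that are byte-for-byte identical, so a re-encoded or resized copy of the same image would still count as a new file under this approach.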

sjacks26 self-assigned this Oct 18, 2019