Skip to content
This repository has been archived by the owner on Nov 30, 2020. It is now read-only.

Architecture of Twint scrapper which allow download tweets on many instances without api restrictions

Notifications You must be signed in to change notification settings

markowanga/twint-distributed

Repository files navigation

Twint-Distributed

No long supported

I have many problems with twint. I decided to stop developing the library. If you liked my solution, maybe you will be interested in my library - https://github.com/markowanga/stweet.

Description

Sometimes there is a need to scrap many enormous tweet data in short time. This project help to do this task. Solution is based on Twint — popular tool to scrap twitter data.

Image of architecture

Main concepts

  • Prepare architecture of microservices, which is scalable and can be distributed for many machines
  • Divide single scrap tasks for small task
  • Support that wne worker have error and the elementary task can be repeated on other instance
  • Workaround twitter limit, which disallow to download many data from one ip address
  • All data are gathered into one location
  • Use docker whenever possible

How it works

  1. User add commands to scrap by HTTP request
  2. As a request result, server add commands to RabbitMQ for scrap data, the time bounds can be divided for small intervals
  3. Workers get the messages from RabbitMQ to scrap data — they do this job
  4. When elementary task has been finished the data is upload to server
  5. Server save all received data to central storage

About

Architecture of Twint scrapper which allow download tweets on many instances without api restrictions

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published