WhatsApp 'Export Chat' Scraper
tarunima edited this page Jun 12, 2020 · 1 revision
WhatsApp lets group chats be exported via the 'Export Chat' feature. One way to get content out of public WhatsApp groups is to use this feature and then parse the exported text file.
The exported text file has a name of the form "WhatsApp Chat with...". This is not a unique identifier: if two WhatsApp groups have the same name, the text files exported from these groups will also have the same name.
There are three incremental steps to building this scraper:
- Extract content from an exported chat text file.
- Handle text files that are created from the same WhatsApp group at a set frequency.
- Extract content from text files from multiple groups exported at a set frequency.
Extract content from an exported chat text file
- Export the chat to Google Drive: the text file and a number of media items will be uploaded to the selected Google Drive folder.
- Read the file using OAuth.
- Parse the text file: extract the timestamp, message, and media item name.
- All text messages that come from the same person in a series should be linked as one message.
- When a message contains both a media item and text, the link between the two should be retained.
- Upload media items to S3 and metadata to MongoDB.
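The parsing steps above could be sketched as follows. This is a minimal illustration, not the project's implementation. It assumes an Android-style export where each entry looks like `12/06/20, 22:15 - Alice: Hello` and media entries look like `IMG-20200612-WA0001.jpg (file attached)`; real exports differ by platform and locale, so both patterns would need adjusting. The names `Message`, `parse_chat`, `HEADER_RE`, and `MEDIA_RE` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, List

# Assumed entry shape: "<date>, <time> - <rest>". Illustrative only.
HEADER_RE = re.compile(
    r"^(?P<ts>\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?: ?[AaPp][Mm])?) - (?P<rest>.*)$"
)
# Assumed media marker: "<file name> (file attached)". Illustrative only.
MEDIA_RE = re.compile(r"^(?P<name>\S+\.\w+) \(file attached\)$")


@dataclass
class Message:
    timestamp: str                      # timestamp of the first entry in the series
    sender: str
    texts: List[str] = field(default_factory=list)
    media: List[str] = field(default_factory=list)


def parse_chat(lines: Iterable[str]) -> List[Message]:
    """Group consecutive entries from the same sender into one Message."""
    messages: List[Message] = []
    current = None          # Message currently being built, if any
    last_was_text = False   # whether the previous entry added text
    for raw in lines:
        line = raw.rstrip("\r\n")
        header = HEADER_RE.match(line)
        if header is None:
            # No timestamp: treat as a continuation (multi-line text or a media caption).
            if current is not None and line:
                if last_was_text:
                    current.texts[-1] += "\n" + line
                else:
                    current.texts.append(line)
                    last_was_text = True
            continue
        sender, sep, body = header.group("rest").partition(": ")
        if not sep:
            # No "sender: " part: assumed to be a system notice; it ends the series.
            current = None
            continue
        if current is None or current.sender != sender:
            current = Message(timestamp=header.group("ts"), sender=sender)
            messages.append(current)
        media = MEDIA_RE.match(body)
        if media:
            current.media.append(media.group("name"))
            last_was_text = False
        else:
            current.texts.append(body)
            last_was_text = True
    return messages
```

Keeping `texts` and `media` on the same `Message` object is one simple way to retain the link between a media item and the text sent with it.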
Handle text files that are created from the same WhatsApp group at a set frequency
- Every time a chat is exported with media items, 10K rows are exported. Consecutive files created from the same group chat will therefore overlap.
- Find an efficient way to identify the new rows, or the point where they start, in the most recently exported file.
- Scrape the new rows using the code written in the previous section ('Extract content from an exported chat text file').
- Delete media items and text files older than 'n' iterations from Google Drive.
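One possible way to locate the new rows is to look for the largest suffix of the previous export that is also a prefix of the latest export, and keep everything after it. The sketch below assumes both files are already loaded as lists of rows and that the latest export is a later window over the same chat; `find_new_rows` is a hypothetical helper name. If rows were deleted or edited between exports the overlap may not be found, in which case this sketch treats every row as new.

```python
from typing import List


def find_new_rows(old_rows: List[str], new_rows: List[str]) -> List[str]:
    """Return the rows in new_rows that follow the overlap with old_rows."""
    if not old_rows or not new_rows:
        return list(new_rows)
    first = new_rows[0]
    # Try every position in the old export where the first new row occurs,
    # starting with the earliest (which gives the longest possible overlap).
    for i, row in enumerate(old_rows):
        if row != first:
            continue
        overlap = len(old_rows) - i
        if overlap <= len(new_rows) and old_rows[i:] == new_rows[:overlap]:
            return list(new_rows[overlap:])
    # No overlap found: assume the whole export is new.
    return list(new_rows)
```

For example, with `old_rows = ["a", "b", "c", "d"]` and `new_rows = ["c", "d", "e", "f"]`, the overlap is `["c", "d"]` and only `["e", "f"]` would be passed on to the parser.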
Extract content from text files from multiple groups exported at a set frequency
- The first step is to figure out a mechanism to set a unique ID for every group chat. This could be based on the Google Drive folder name or the timestamp at which the group was first scraped.
- Loop through all folders and scrape each one using the scraper written in the previous section ('Handle text files that are created from the same WhatsApp group at a set frequency').
- Create a log of group conversations that were updated on a particular run.
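The multi-group loop could look roughly like the sketch below. It is only an illustration under several assumptions: each group is exported to its own Google Drive folder with a unique folder name, a `registry` dictionary mapping folder names to group IDs is persisted between runs, and `scrape_folder` is a caller-supplied function that scrapes one folder and returns the number of new rows. All names here (`make_group_id`, `run_scrape`, `registry`, `scrape_folder`) are hypothetical.

```python
import hashlib
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable, List, Optional


def make_group_id(folder_name: str, first_scraped_at: datetime) -> str:
    """Derive an ID from the folder name and the first-scrape timestamp."""
    key = f"{folder_name}|{first_scraped_at.isoformat()}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


def run_scrape(
    folders: Iterable[str],
    registry: Dict[str, str],
    scrape_folder: Callable[[str], int],
    now: Optional[datetime] = None,
) -> List[dict]:
    """Scrape every folder and return a log of the groups updated in this run."""
    now = now or datetime.now(timezone.utc)
    log = []
    for folder in folders:
        # Assign an ID the first time a folder is seen; reuse it afterwards.
        if folder not in registry:
            registry[folder] = make_group_id(folder, now)
        new_rows = scrape_folder(folder)
        if new_rows > 0:
            log.append({
                "group_id": registry[folder],
                "folder": folder,
                "new_rows": new_rows,
                "run_at": now.isoformat(),
            })
    return log
```

Because the ID is fixed the first time a folder is seen, it stays stable across runs even though the exported file names themselves are not unique.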