The goal of this project is to analyse the chat data between my girlfriend and me using statistical methods, graph theory, and other data science techniques.
Please note that this project is presented as knitted interactive .html reports. You can download them from the ./reports directory of this GitHub repo, or view them hosted on my website at: https://www.derecksnotes.com/sharing/data-science-portfolio/ds-NLP-network-inference/
This project is split into three main sub-projects, plus a bonus. The project structure is as follows:
- Text mining, statistical, and exploratory analysis: derecksnotes.com: text-analysis.html
- Word association network inference: derecksnotes.com: word-network-inference.html
- Word association/network topological analysis and visualisation: derecksnotes.com: word-network-analysis.html
- Bonus - interactive visualisation of network: derecksnotes.com: word-network-interactive.html
For a hosted presentation on my site visit: derecksnotes.com: ds-NLP-network-inference
Work on this project is done with:
- Modified multicore R language v4.0.2.
- RStudio v1.2.5019.
- "rwhatsapp" package is used for reading in the data.
- The statistical analysis done is inspired by its documentation and the O'Reilly book Text Mining with R by Julia Silge & David Robinson.
This chat data was collected between 02 February 2020 and 20 September 2020.
The data shown and used here is a composite of chat data extracted from WhatsApp and Telegram, since my girlfriend and I chat on both applications. The Telegram data was exported as JSON by Telegram Lite and then converted to a .txt format readable by "rwhatsapp", using a custom NodeJS application I wrote for this purpose. You can find the Telegram to WhatsApp converter here on GitHub as a separate project: dereckdemezquita/tl-telegram-data-formatter
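As a rough illustration of the conversion described above, here is a minimal sketch in R (not the NodeJS code of the actual converter). The field names `date`, `from`, and `text`, the sample messages, and the `dd/mm/yyyy, hh:mm - Author: message` line layout are all assumptions made for this example; the real export and the layout expected by "rwhatsapp" may differ.

```r
# Hypothetical stand-in for a parsed Telegram JSON export, e.g. what
# jsonlite::fromJSON(path, simplifyVector = FALSE)$messages might return.
# The field names (date, from, text) are assumptions for this sketch.
messages <- list(
  list(date = "2020-02-02T14:05:00", from = "Alice", text = "Hello"),
  list(date = "2020-02-02T14:06:30", from = "Bob",   text = "Hi there")
)

# Turn one message into a single WhatsApp-style export line.
# The "dd/mm/yyyy, hh:mm - Author: message" layout is assumed here.
to_whatsapp_line <- function(msg) {
  stamp <- as.POSIXct(msg$date, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
  sprintf("%s - %s: %s", format(stamp, "%d/%m/%Y, %H:%M"), msg$from, msg$text)
}

chat_lines <- vapply(messages, to_whatsapp_line, character(1))

# Write the converted chat to a temporary .txt file.
out_path <- file.path(tempdir(), "telegram-as-whatsapp.txt")
writeLines(chat_lines, out_path)
```

The resulting .txt file could then be combined with the WhatsApp export before reading everything in with "rwhatsapp".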
To extract WhatsApp data, go to the relevant chat, tap the top bar of the chat showing the name of the group/person, and find the option called "Export Chat". To export Telegram data I recommend using "Telegram Lite" for Mac or an Android app: go to settings, find the advanced options, and select your desired exports, making sure to check export to JSON. All of this is shown in the screenshots below:
Skills demonstrated in this project:
- Experience with, and advanced use of, the R language, R Notebooks, and Git/GitHub.
- Handling and cleaning unstructured data.
- Statistical analysis of text data, and natural language processing.
- Network inference with a custom-made algorithm.
- Graph theory; topological analysis.
- Graph theory; application of clustering algorithms:
  - Edge betweenness (Girvan-Newman).
  - Label propagation community detection.
  - Fast greedy modularity optimisation.
  - K-core decomposition.
- Storytelling and clean presentation of a complex, real-world personal dataset.
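For readers who want to try the clustering methods named in the list above, here is a minimal sketch using the "igraph" R package. It is only an illustration: using "igraph" is an assumption, and the built-in Zachary karate club graph stands in for the word association network, which is not public. It is not the analysis code from the reports.

```r
library(igraph)

# Stand-in graph: Zachary's karate club (34 vertices, undirected, simple).
# The project's own word association network is not used here.
g <- make_graph("Zachary")

# Community detection with three of the listed algorithms.
eb <- cluster_edge_betweenness(g)  # Girvan-Newman edge betweenness
lp <- cluster_label_prop(g)        # label propagation (randomised)
fg <- cluster_fast_greedy(g)       # fast greedy modularity optimisation

# K-core decomposition: the coreness of each vertex.
cores <- coreness(g)

# Compare the number of communities and the modularity of each partition.
results <- data.frame(
  method      = c("edge_betweenness", "label_prop", "fast_greedy"),
  communities = c(length(eb), length(lp), length(fg)),
  modularity  = c(modularity(eb), modularity(lp), modularity(fg))
)
print(results)
print(table(cores))
```

Label propagation is randomised, so its partition can change between runs; calling `set.seed()` beforehand makes it reproducible.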
Here are some of the biggest hits (plots) from this project:
This data has been collected with the express informed consent of all participants, including that of my girlfriend - she's found this whole project quite cute and funny. Please note that this data has been cleaned and censored for privacy reasons. Only results from the analysis of the data are shown; the raw data will not be made publicly available, so privacy infringement is kept to a minimum.