
The goal of this project is to analyse chat data with my girlfriend; apply statistical methods, graph theory, and other data science techniques.


dereckmezquita/ds-NLP-network-inference


ds-NLP-network-inference

The goal of this project is to analyse chat data with my girlfriend; apply statistical methods, graph theory, and other data science techniques.

Please note that this project is presented as knitted interactive .html reports. You can download them from the ./reports directory of this GitHub repo, or view them hosted on my website at: https://www.derecksnotes.com/sharing/data-science-portfolio/ds-NLP-network-inference/

This project is broken up into three big sub-projects. The project structure is as follows:

For a hosted presentation on my site visit: derecksnotes.com: ds-NLP-network-inference

Work on this project is done with:

Chat data

This chat data was collected between 02 February 2020 and 20 September 2020.

The data shown and used here is a composite of chat data extracted from WhatsApp and Telegram; my girlfriend and I chat on both applications. The Telegram data was converted from the JSON format exported by Telegram Lite to a .txt format that "rwhatsapp" can read, using a custom NodeJS application I wrote for this purpose. You can find the Telegram to WhatsApp converter on GitHub as a separate project: dereckdemezquita/tl-telegram-data-formatter
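The conversion step described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the NodeJS converter from the linked project. The JSON field names (`messages`, `type`, `date`, `from`, `text`) and the WhatsApp-style line layout `DD/MM/YYYY, HH:MM - Name: message` are assumptions about the two export formats, which can vary by app version and locale; the sample data is invented.

```python
import json
from datetime import datetime


def flatten_text(text):
    """Return message text as one string.

    Assumption: the export stores text either as a plain string or as a
    list mixing strings and dicts that carry a "text" key (e.g. links).
    """
    if isinstance(text, str):
        return text
    parts = []
    for piece in text:
        parts.append(piece if isinstance(piece, str) else piece.get("text", ""))
    return "".join(parts)


def telegram_to_whatsapp_lines(export):
    """Convert a parsed Telegram export into WhatsApp-style text lines."""
    lines = []
    for msg in export.get("messages", []):
        if msg.get("type") != "message":
            continue  # skip service entries such as calls or pins
        body = flatten_text(msg.get("text", ""))
        if not body:
            continue  # skip messages that carry no text (e.g. media only)
        sender = msg.get("from") or "Unknown"
        stamp = datetime.fromisoformat(msg["date"]).strftime("%d/%m/%Y, %H:%M")
        lines.append(f"{stamp} - {sender}: {body}")
    return lines


# Invented sample in the assumed export layout.
sample_json = """
{
  "name": "Example chat",
  "messages": [
    {"id": 1, "type": "message", "date": "2020-02-02T10:15:00",
     "from": "Alice", "text": "good morning"},
    {"id": 2, "type": "service", "date": "2020-02-02T10:15:30",
     "action": "phone_call"},
    {"id": 3, "type": "message", "date": "2020-02-02T10:16:00",
     "from": "Bob",
     "text": ["see ", {"type": "link", "text": "https://example.com"}, " now"]}
  ]
}
"""

lines = telegram_to_whatsapp_lines(json.loads(sample_json))
print("\n".join(lines))
```

Writing the resulting lines to a .txt file would give a single file in one chat layout, which is the kind of input the README says is passed to "rwhatsapp".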

Chat extraction

To extract WhatsApp data, go to the relevant chat, tap the top bar of the chat showing the name of the group/person, and find the option called "Export Chat". To export Telegram data, I recommend using "Telegram Lite" for Mac or an Android app: go to settings, find the advanced options, and select your desired exports; make sure to check export to JSON. All of this is shown in the screenshots below:

TLDR (too long didn't read)

Skills demonstrated in this project:

  1. Experience with, and advanced use of, the R language, R Notebooks, and Git/GitHub.
  2. Handling and cleaning unstructured data.
  3. Statistical analysis of text data, and natural language processing.
  4. Network inference with a custom-made algorithm.
  5. Graph theory; topological analysis.
  6. Graph theory; application of clustering algorithms:
    • Edge betweenness (Girvan-Newman).
    • Label propagation.
    • Fast greedy modularity optimisation.
    • K-core decomposition.
  7. Storytelling and clean presentation of a complex real-world personal dataset.
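To make the network items in the list above concrete, here is a small sketch of a word network followed by k-core decomposition. It is written in Python and is not the project's R code. The README does not describe the custom network inference algorithm, so the rule used here (link two words when they occur in the same message) is only an assumed stand-in, and the messages are invented.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(messages):
    """Build an undirected word graph.

    Assumed rule: two words are linked if they appear in the same message.
    """
    graph = defaultdict(set)
    for message in messages:
        words = sorted(set(message.lower().split()))
        for a, b in combinations(words, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def core_numbers(graph):
    """K-core decomposition by repeatedly peeling the lowest-degree node.

    A node's core number is the largest k such that the node belongs to a
    subgraph in which every node has at least k neighbours.
    """
    degree = {node: len(nbrs) for node, nbrs in graph.items()}
    remaining = set(graph)
    core = {}
    k = 0
    while remaining:
        # Ties are broken alphabetically so the peeling order is deterministic.
        node = min(remaining, key=lambda n: (degree[n], n))
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for nbr in graph[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core


# Invented messages, for illustration only.
messages = ["good morning love", "good night love", "morning coffee"]
graph = cooccurrence_graph(messages)
core = core_numbers(graph)
for word in sorted(core):
    print(word, core[word])
```

In this toy graph "coffee" is attached to a single word and sits in the 1-core, while the other four words form a densely connected 2-core. The same idea, applied to a much larger graph, separates a tightly knit core vocabulary from peripheral words.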

Here are some of the biggest hits (plots) from this project:

text-analysis.Rmd/html

word-network-analysis.Rmd/html

Community detection

Disclaimer and ethical considerations

This data was collected with the express informed consent of all participants, including my girlfriend, who has found this whole project quite cute and funny. Please note that the data has been cleaned and censored for privacy reasons. Only results from the analysis are shown; the raw data will not be made publicly available, which keeps privacy infringement to a minimum.
