Introduction to Cultural Data Science - Analysing Højskolesangbogen

Overview

Jakob Grøhn Damgaard, January 2021
This repository contains the full exam project for the course Introduction to Cultural Data Science at the Faculty of Arts at Aarhus University.

Project Abstract

Comprising more than 600 songs and hymns, the Danish Højskolesangbog could provide fruitful insight into the historical evolution of Danish song culture. However, robust quantitative research on this material using modern language processing techniques requires a tidy, digital data set. As no such data set is currently available, this study assembles a digital corpus of the songs by web scraping højskolesangbogen.dk. The data set is then used to analyse and visualise the historical development in the use of religious language in the songs. The study concludes that the frequency of words tied to Christianity has generally declined, albeit with a few notable exceptions.

Animation of development in use of religious language

Software

All data operations were performed on a 2020 MacBook Pro 13", 2 GHz Quad-Core Intel Core i5, 16 GB RAM, running macOS Catalina (10.15.6).
The following software was used:

Python (3.8.5)
R (4.0.2)
RStudio (1.3.1093)
Visual Studio Code (1.52.1)
Jupyter Extension (1.0.0) for Visual Studio Code

Chrome extension software:

1. SelectorGadget (1.1.1)
2. Link Clipper (2.4.1)

Files

This repository is structured as follows:

Data folder:

clean_song_data_with_word_counts.csv - CSV file containing the full, cleaned data frame produced by the Python script
song_urls.csv - CSV file containing a list of URLs linking to the web pages for each individual song
song_vocabulary.csv - CSV file containing a list of all unique tokens in the song lyrics
song_vocabulary.pkl - Pickle file containing a list of all unique tokens in the song lyrics

Visualisations folder:

aggregated_15.png - Plot showing development in religious language use when songs have been grouped and aggregated across 15 songs
aggregated_songs_15_religious_intensity.gif - GIF showing unsmoothed development in religious language use when songs have been grouped and aggregated across 15 songs
interval_plot.gif - GIF showing development in religious language use when songs have been grouped and aggregated across 6 time intervals
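
The grouping behind the aggregated plots can be illustrated with a minimal Python sketch. It is not the code from the RMarkdown script: it assumes that songs are sorted by year, split into consecutive groups of a fixed size, and that the share of religious words is pooled within each group. The function name, the toy word counts and the two-word religious word list are all hypothetical.

```python
def aggregate_by_group(songs, religious_words, group_size=15):
    """Pool religious word share over consecutive groups of songs.

    songs: list of (year, word_counts) pairs, where word_counts maps word -> count.
    Returns a list of (mean_year, religious_share) pairs, one per group.
    """
    ordered = sorted(songs, key=lambda song: song[0])  # oldest song first
    result = []
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        total = sum(sum(counts.values()) for _, counts in group)
        hits = sum(
            n
            for _, counts in group
            for word, n in counts.items()
            if word in religious_words
        )
        mean_year = sum(year for year, _ in group) / len(group)
        result.append((mean_year, hits / total if total else 0.0))
    return result


# Toy data (made up for illustration, not taken from the songbook).
toy_songs = [
    (1900, {"gud": 1, "sol": 3}),
    (1850, {"gud": 2, "himmel": 2}),
    (1950, {"sol": 4}),
    (1800, {"gud": 3, "jord": 1}),
]
toy_religious_words = {"gud", "himmel"}  # hypothetical word list

groups = aggregate_by_group(toy_songs, toy_religious_words, group_size=2)
print(groups)
```

With a group size of 2, the four toy songs form two groups, and the religious share drops from the older group to the newer one.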

analysing_hojskolesangbogen.ipynb - Jupyter notebook containing code for automated web scraping and preprocessing of the scraped data into a tidy data set
analysing_hojskolesangbogen.rmd - RMarkdown script containing analysis and visualisations of development in religious language use
requirements.txt - TXT file containing requirements for running *analysing_hojskolesangbogen.ipynb* script locally
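
As a rough illustration of the scraping step, the sketch below extracts the text inside an element with a given CSS class from an HTML string. It is not the notebook's code: it uses Python's built-in html.parser module instead of Beautiful Soup so that it runs without extra packages, it makes no network request, and the class name "song-lyrics" and the sample HTML are hypothetical rather than taken from højskolesangbogen.dk.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Collect the text found inside elements carrying a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # greater than 0 while inside a target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, target_class):
    """Return the text inside the target elements, one chunk per line."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    return "\n".join(parser.chunks)


# Hypothetical page structure, for illustration only.
sample_html = (
    '<html><body><h1 class="song-title">Toy title</h1>'
    '<div class="song-lyrics"><p>First line<br>Second line</p></div>'
    "<p>Footer</p></body></html>"
)
lyrics = extract_text(sample_html, "song-lyrics")
print(lyrics)
```

In a real run, the HTML string would come from downloading each URL listed in song_urls.csv, and the class name would have to match the actual page markup.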

Metadata

The following metadata list explains the columns in the full cleaned data set, clean_song_data_with_word_counts.csv, produced by the analysing_hojskolesangbogen.ipynb notebook:

songwriter - This column contains the name of the primary songwriter - Type: string
year_written - This column contains the year the song was written - Type: numeric
composer - This column contains the name of the primary composer - Type: string
year_composed - This column contains the year the song was composed - Type: numeric
lyrics - This column contains the song lyrics - Type: string
title - This column contains the song title - Type: string
columns numbered 9-13326 - These columns contain word counts; each column represents a unique word in the vocabulary - Type: numeric
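
To make the word count columns concrete, here is a minimal sketch of how such columns could be derived from the lyrics column. It uses only the Python standard library and is an illustration under stated assumptions, not the notebook's implementation: the tokenisation rule (lowercased runs of Danish letters), the function names and the toy lyrics are all hypothetical.

```python
import re
from collections import Counter


def tokenize(lyrics):
    """Lowercase the lyrics and split them into word tokens (incl. æ, ø, å)."""
    return re.findall(r"[a-zæøå]+", lyrics.lower())


def build_count_rows(songs):
    """Return the sorted vocabulary and one row of word counts per song."""
    counts = [Counter(tokenize(song["lyrics"])) for song in songs]
    vocabulary = sorted(set().union(*counts))  # every unique token, once
    rows = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, rows


# Toy lyrics (made up for illustration, not taken from the songbook).
toy_songs = [
    {"title": "Song A", "lyrics": "Gud er god og Gud er stor"},
    {"title": "Song B", "lyrics": "Skoven er grøn"},
]
vocabulary, rows = build_count_rows(toy_songs)
print(vocabulary)
print(rows)
```

Each entry in vocabulary corresponds to one word count column, and each row holds the counts for one song, with zeros for words the song does not contain.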

Execution and requirements

The RMarkdown file analysing_hojskolesangbogen.rmd can be executed directly in the desktop version of RStudio (1.3.1093) as long as base R (4.0.2) has been installed. All packages are installed, loaded and managed using the package manager 'pacman'.

The package requirements for running the analysing_hojskolesangbogen.ipynb file locally are listed in the requirements.txt file. Alternatively, the code can be executed more easily in Google Colab, in which case no packages have to be installed locally. This only requires an active Google account. You can access the script using the following link:
https://colab.research.google.com/drive/1DggRL25M4LWOkSxmtHa7Mlqv9c010Hv0#scrollTo=dhG1vgE54K1-

Package and software references

Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.

Chandra, R. V., & Varanasi, B. S. (2015). Python requests essentials. Packt Publishing Ltd.

Van Rossum, G., & Drake Jr., F. L. (1995). Python Reference Manual. Centrum voor Wiskunde en Informatica Amsterdam.

R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Richardson, L. (2019). Beautiful Soup Documentation. 84.

Ooms, J. (2018). gifski: Highest Quality GIF Encoder. R package version 0.8.6. https://CRAN.R-project.org/package=gifski

Pedersen, T. L. & Robinson, D. (2020). gganimate: A Grammar of Animated Graphics. R package version 1.0.7. https://CRAN.R-project.org/package=gganimate

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.

Rinker, T. W. & Kurkiewicz, D. (2017). pacman: Package Management for R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman

Varoquaux, G., & Grisel, O. (2009). Joblib: running python function as pipeline jobs. packages.python.org/joblib.

Walt, S. V. D., Colbert, S. C., & Varoquaux, G. (2011). The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2), 22-30.

Wickham, H., Francois, R., Henry, L., & Müller, K. (2015). dplyr: A grammar of data manipulation. R package version 0.4, 3.

Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
