-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path05_004_collecting-additional-data.Rmd
198 lines (160 loc) · 6.34 KB
/
05_004_collecting-additional-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
---
title: "Collecting song and artist data"
description: |
Annotating the data downloaded from Spotify with more specific and descriptive data using Spotify's API.
author:
- first_name: "Joshua"
last_name: "Cook"
url: https://joshuacook.netlify.app
orcid_id: 0000-0001-9815-6879
date: "`r Sys.Date()`"
output:
distill::distill_article:
toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, comment = "#>")
library(memoise)
library(mustashe)
library(jhcutils)
library(magrittr)
library(tidyverse)
# To shut-up `summarise()`.
options(dplyr.summarise.inform = FALSE)
cache_path <- here::here(".memoise")
mustashe::use_here(silent = TRUE)
walk(list.files("lib", pattern = "R$", full.names = TRUE), source)
```
Spotify actually has another API for collecting additional details about playlists, albums, artists, and tracks.
(It also contains endpoints for a user to control their own playlists and song playback, but those are not relevant for this project.)
Below, I collect as much data as possible about songs I listen to using this API.
## Accessing the API and caching
To access the API, I used the ['spotifyr'](https://www.rcharlie.com/spotifyr/) R package.
The instructions are setting up a developer account and aquiring access keys from Spotify are detailed in the documentation for 'spotifyr'.
To make development and reproduction a bit faster, I wrapped all of the 'spotifyr' functions I used in `memoise::memoise()` from the ['memoise'](http://memoise.r-lib.org) package.
I also cached the long-running data frame creation steps using the ['mustashe'](https://jhrcook.github.io/mustashe/) package.
```{r}
get_playlist_memo <- memoise(
spotifyr::get_playlist,
cache = cache_filesystem(cache_path)
)
get_my_playlists_memo <- memoise(
spotifyr::get_my_playlists,
cache = cache_filesystem(cache_path)
)
get_track_memo <- memoise(
spotifyr::get_track,
cache = cache_filesystem(cache_path)
)
get_track_audio_features_memo <- memoise(
spotifyr::get_track_audio_features,
cache = cache_filesystem(cache_path)
)
get_track_audio_analysis_memo <- memoise(
spotifyr::get_track_audio_analysis,
cache = cache_filesystem(cache_path)
)
```
## Data collection
Track and artist information must be queried using their unique IDs assigned by Spotify.
Annoyingly, the personal data I have already downloaded does *not* contain these IDs with the tracks.
Therefore, I instead used the API to collect all of my playlist information, extracted the songs (and their IDs) from the playlists, and used those to query.
Because of this limitation, I was not able to access the detailed information from every song in my streaming history, but this will get most of them and all of my most popular tracks.
### List of my playlists
In the following code, I collected the names and IDs of all of my playlists.
A sample of the resultant data frame is presented.
```{r, echo=TRUE}
my_playlists <- get_my_playlists_memo(limit = 50) %>%
as_tibble() %>%
janitor::clean_names() %>%
select(name, id, uri, tracks_total)
```
```{r}
display_head(my_playlists)
```
### Playlist information
The playlist IDs were then used to get all of the information about the playlists.
The `data` column of the `playlist_df` data frame contains a data frame for each playlist with the track and album information.
```{r, echo=TRUE, message=FALSE}
get_playlist_info <- function(playlist_id) {
get_playlist_memo(playlist_id = playlist_id)$tracks$items %>%
as_tibble() %>%
janitor::clean_names() %>%
select(
track_name, track_id, track_uri, track_artists, track_duration_ms,
track_popularity, track_album_name, track_album_id, track_album_uri,
track_album_release_date, track_album_release_date_precision,
track_explicit
)
}
stash("playlist_df", depends_on = c("my_playlists"), {
playlist_df <- my_playlists %>%
mutate(data = map(id, get_playlist_info))
})
```
```{r}
saveRDS(playlist_df, here::here("data", "playlist_detailed_info.rds"))
rmarkdown::paged_table(head(playlist_df))
```
### Track information
Then, I gathered the track names and IDs from each playlist and queried the API to obtain as much information from Spotify as possible.
For each track, Spotify has what they call ["features"](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/) and ["analysis"](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-analysis/) (links are to the API documentation).
The "features" data tend to be quantitative characteristics of the track such as *danceability*, *acousticness*, and *liveness*.
The "analysis" data are various time decompositions of the songs at different scales such as into their beats, segments, etc.
The following code shows the functions that collect and clean the data returned by the API calls.
```{r, echo=TRUE}
clean_track_info <- function(x) {
tibble(
album_id = x$album$id,
album_name = x$album$name,
duration = x$duration_ms,
explicit = x$explicit,
popularity = x$popularity,
album_info = list(x$album),
release_date = x$album$release_date,
release_date_precision = x$album$release_date_precision
)
}
clean_track_features <- function(x) {
janitor::clean_names(x) %>%
select(-id, -uri)
}
clean_track_analysis <- function(x) {
x[-c(1:2)] %>%
map(list) %>%
as_tibble()
}
get_track_data <- function(id) {
x <- bind_cols(
get_track_memo(id = id) %>% clean_track_info(),
get_track_audio_features_memo(id = id) %>% clean_track_features(),
get_track_audio_analysis_memo(id = id) %>% clean_track_analysis()
)
}
get_track_data <- memoise(get_track_data, cache = cache_filesystem(cache_path))
```
Each track is run through `get_track_data()` to get and organize all of the data.
The resulting data frame is cached for future analysis and some columns are displayed below as examples.
```{r, warning=FALSE, message=FALSE}
stash("track_info_df", depends_on = "playlist_df", {
track_info_df <- playlist_df %>%
select(data) %>%
unnest(data) %>%
select(track_name, track_id) %>%
distinct() %>%
mutate(track_data = map(track_id, get_track_data))
})
```
```{r}
save_f <- here::here("data", "track_detailed_info.rds")
if (!file.exists(save_f)) {
track_info_df %>%
unnest(track_data) %>%
saveRDS(save_f)
}
set.seed(0)
track_info_df %>%
unnest(track_data) %>%
sample_n(10) %>%
rmarkdown::paged_table()
```