Capstone Project at General Assembly London, Data Science Immersive, 08/12/19
- Technologies Used
- Packages Used
- Methods Used
- Introduction
- Gathering & Cleaning the Data
- Feature Engineering
- EDA
- Modelling
- Evaluation
- Final Thoughts and Plans for the Future
- Python 3.0
- Jupyter Notebook
- Pandas
- Scikit-learn
- Matplotlib
- Numpy
- Web Scraping
- Data Visualization
- Regression
I've always loved music. I was watching a YouTube video on what made Lil Nas X's video so popular. It describes how he used memes, search engine optimisation, remixes (which count towards chart placement), TikTok, hopping on Red Dead Redemption's cowboy theme, and classifying the song as 'country' on iTunes and SoundCloud in order to manoeuvre the recommendation algorithms rather than trying to compete with songs in today's oversaturated hip-hop genre. All of these were ingenious tactics which helped his song break records for weeks at #1.
I set out to try and uncover whether there are any attributes of a song or an artist which make it destined to be popular.
For this project the data was obtained in two parts and then merged together:
Billboard's ranking method for the Top 100 is excellent because its ranking policy has stayed relevant as the methods of discovering and purchasing music have changed.
- 1958-1991: ranking determined by ratio of singles sales and airplay
- 1991: Billboard begins collecting sales data digitally (using SoundScan) for quicker and more accurate charts
- 1998: Billboard drops requirement that song must be released as a single to appear on the chart
- 2005: Digital downloads (iTunes) included
- 2012: On-demand streaming services (Spotify, Rhapsody) included
- 2013: Video views (YouTube) included
I used guoguo12's Billboard API to acquire information about the top 100 songs since 1958, including the week ID and chart position.
The Billboard data needed extensive cleaning. The largest issue was that when I passed an artist's name from the Billboard API into the Spotify API, the search returned no results if the name was too long, so I had to remove featured artists. This was simple if there was a 'featuring'. However, '&', commas, 'and' and slashes were all used interchangeably. Therefore, I searched through all artists containing a comma (or one of the other separators), and if the name appeared elsewhere then I assumed that it was a solo artist. I manually sorted through the remaining instances where a comma appeared and made some other lists of exceptions.
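The cleaning rule described above could be sketched roughly as follows. This is an illustration of one reading of the heuristic, not the project's actual code: the regular expressions, the `primary_artist` function and the `known_artists` set are all hypothetical names introduced here.

```python
import re

# "featuring"/"feat."/"ft." is assumed to always introduce a guest artist.
FEATURING = re.compile(r"\s+(?:featuring|feat\.?|ft\.?)\s+.*$", flags=re.IGNORECASE)
# Ambiguous separators: they may join collaborators or be part of one act's name.
AMBIGUOUS = re.compile(r"\s*(?:,|&|/|\band\b)\s*", flags=re.IGNORECASE)


def primary_artist(credit, known_artists):
    """Return the lead artist from a Billboard-style credit string.

    known_artists is a set of lower-cased names that appear on their own
    elsewhere in the data (a hypothetical helper structure).
    """
    credit = FEATURING.sub("", credit).strip()
    # If the whole credit is a known act (e.g. a band), keep it intact.
    if credit.lower() in known_artists:
        return credit
    first = AMBIGUOUS.split(credit, maxsplit=1)[0].strip()
    # Only split when the first part is itself a known act.
    if first.lower() in known_artists:
        return first
    # Otherwise leave the credit unchanged for manual review.
    return credit
```

Credits that match neither rule are returned unchanged, which mirrors the manual pass over the remaining cases.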
I used Spotipy, a lightweight Python library for the Spotify Web API. With Spotipy you get full access to all of the music data provided by the Spotify platform. Plus, it's much easier to use! I extracted Spotify’s musical components (e.g. danceability, tempo, duration) as well as information about the artist (number of followers, genre) for each respective Billboard track, before merging it with the Billboard data.
N.B. There were cases where the Billboard and Spotify artist names didn't match up, so I created a dictionary of such entries and used the Python package FuzzyWuzzy to see if they were adequately similar.
The Spotify data was mostly clean, but Spotify's genre classification system provided additional challenges. The streaming service categorizes artists into over 1,300 specific, and often unheard of, music genres (anybody familiar with "zydeco"?). As the genre tags can only be acquired from the artist parameter, this creates a problem. It means that the genre label is not specific to each track but to the artist as a whole; therefore the tracks of an eclectic artist who spans several genres (one artist had 22 genre tags) are likely to be mislabelled.
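As a rough illustration of this kind of name check, here is a minimal sketch that uses the standard library's `difflib.SequenceMatcher` as a stand-in for FuzzyWuzzy, so it runs without extra dependencies. The function names and the threshold of 85 are assumptions for the example, not values from the project.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity of two artist names on a 0-100 scale, ignoring case."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return round(100 * ratio)


def names_match(billboard_name, spotify_name, threshold=85):
    # threshold is a hypothetical cut-off chosen for illustration
    return name_similarity(billboard_name, spotify_name) >= threshold
```

Pairs that score below the cut-off would be the ones worth checking by hand.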
I used a two-step process to translate Spotify's genres to my own genre definition. First, I Count-Vectorised the whole column to see which terms appeared the most. I then manually sorted through them to make a list of real genres. For example, I ignored 'new' (stemming from 'new wave').
- 'r&b', 'rock', 'metal', 'grunge', 'punk', 'pop', 'house', 'electronic', 'trance', 'dance', 'country', 'folk', 'jazz', 'blues', 'soul', 'disco', 'funk', 'trap', 'rap', 'freestyle', 'indie', 'classical', 'ska', 'reggae', 'dancehall', 'adult standards', 'hip hop'
I then Count-Vectorised again with these as column names. I wrote a short Python script to 'vote' on which genre to place the artist in. For instance, Spotify classifies Drake as "pop rap", "indie r&b", "alternative hip hop", and "hip hop". According to our mapping system, three of those genres fall under rap/hip-hop and one under R&B. Thus, Drake goes under rap/hip-hop.
I made a separate notebook for cleaning the genre tags.
Once the data was clean, I ran a quick linear regression to see roughly what the cross-validated score was; at 0.22 I realised that this project needed a lot more work.
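The voting step can be illustrated with a small sketch in plain Python. The keyword list is the one given above; the `GENRE_FAMILY` grouping and the `vote_genre` function are hypothetical and only approximate the mapping described, and plain substring counting stands in for the Count-Vectoriser.

```python
from collections import Counter

# Genre keywords taken from the list above.
GENRE_KEYWORDS = [
    "r&b", "rock", "metal", "grunge", "punk", "pop", "house", "electronic",
    "trance", "dance", "country", "folk", "jazz", "blues", "soul", "disco",
    "funk", "trap", "rap", "freestyle", "indie", "classical", "ska", "reggae",
    "dancehall", "adult standards", "hip hop",
]

# Hypothetical grouping of related keywords into one family.
GENRE_FAMILY = {"rap": "rap/hip-hop", "hip hop": "rap/hip-hop", "trap": "rap/hip-hop"}


def vote_genre(spotify_tags):
    """Return the genre family with the most keyword hits, or None if no hits."""
    votes = Counter()
    for tag in spotify_tags:
        padded = f" {tag.lower()} "
        for keyword in GENRE_KEYWORDS:
            # Space padding gives a whole-word match, so "trap" is not counted as "rap".
            if f" {keyword} " in padded:
                votes[GENRE_FAMILY.get(keyword, keyword)] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


drake_tags = ["pop rap", "indie r&b", "alternative hip hop", "hip hop"]
winner = vote_genre(drake_tags)
```

With the Drake tags, rap/hip-hop collects three votes and wins. Ties fall to whichever family was counted first, so a real script would need an explicit tie-break rule.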
I engineered some new features, to significantly improve the predictive power of my model:
- ‘track longevity’
- ‘artist familiarity’
- 'peak chart position'
- ‘time since first charting’
- 'percentage of genre dominance' - this had no impact on predicting popularity
After engineering the all-important new features, my highest cross-validated score was 0.68, a huge increase.
My initial EDA looked at the correlation between the musical components and Spotify popularity. I found that loudness was the most highly correlated with the target at 0.35; overall, not much correlation. Further, the highest correlation amongst the musical components was, not surprisingly, between 'Loudness' and 'Energy' at 0.69. Correlation between my engineered features and the target was more promising. I found that 'Time Since Release', 'Number of Spotify Followers' and 'Artist Familiarity' all had a correlation score of over 0.5, showing that there was at least some correlation.
Next I examined a histogram of all the musical features. I observed that most of them were not normally distributed and therefore a Power Transformer might be needed during the modelling stage.
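To make the engineered features more concrete, here is a minimal sketch of how they might be derived from weekly chart rows. The exact definitions are not spelled out in this write-up, so the ones below (weeks on chart, best rank, number of distinct charting tracks, days since first charting) are assumptions, and the rows are made-up toy data.

```python
from collections import defaultdict
from datetime import date

# Toy chart rows: (chart week, artist, track, rank). Values are invented.
rows = [
    (date(2019, 1, 5), "Artist A", "Song 1", 40),
    (date(2019, 1, 12), "Artist A", "Song 1", 12),
    (date(2019, 1, 19), "Artist A", "Song 1", 3),
    (date(2019, 2, 2), "Artist A", "Song 2", 55),
    (date(2019, 1, 12), "Artist B", "Song 3", 90),
]


def engineer_features(rows, as_of):
    """Aggregate weekly chart rows into one feature dict per (artist, track)."""
    weeks = defaultdict(list)
    artist_tracks = defaultdict(set)
    for week, artist, track, rank in rows:
        weeks[(artist, track)].append((week, rank))
        artist_tracks[artist].add(track)

    features = {}
    for (artist, track), entries in weeks.items():
        first_week = min(week for week, _ in entries)
        features[(artist, track)] = {
            "track_longevity": len(entries),                   # weeks on the chart
            "peak_rank": min(rank for _, rank in entries),     # best (lowest) position
            "artist_familiarity": len(artist_tracks[artist]),  # distinct charting tracks
            "time": (as_of - first_week).days,                 # days since first charting
        }
    return features


features = engineer_features(rows, as_of=date(2019, 12, 8))
```

In practice this kind of aggregation would more likely be a Pandas `groupby`; plain Python is used here only to keep the sketch dependency-free.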
Then I graphically illustrated, with a timeseries, trends in the musical components of tracks.
An important finding was that Spotify gives higher popularity rankings to new releases and to artists that have released new music recently. Therefore my target variable, Spotify popularity, was skewed higher for more recent tracks. This is why I engineered the feature 'Time Since Release', which massively increased my models' predictive power. Here is Spotify's description of how popularity is calculated:
“The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past.”
At this point I dropped the column 'spot_artist_pop' because it's derived from the Spotify track popularity.
Finally, there were some complications in removing outliers as some tracks were classified as double their BPM (Beats Per Minute).
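One common heuristic for double-time tempo readings is to halve any value above a plausible ceiling. The sketch below shows that idea only; it is not necessarily what was done in this project, and the `fold_tempo` name and the 180 BPM ceiling are assumptions.

```python
def fold_tempo(bpm, upper=180.0):
    """Halve tempos above `upper`, assuming they were detected at double time."""
    # upper is a hypothetical ceiling; genuinely fast tracks would be halved too.
    while bpm > upper:
        bpm /= 2
    return bpm
```

Because a fixed ceiling also catches songs that really are that fast, any tracks changed this way are worth a manual spot check.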
My final features were:
- 'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'duration', 'key', 'mode', 'time_signature', 'spot_followers', 'track_longevity', 'peak_rank', 'time'
My highest cross-validated R2 score was 0.72, obtained by grid searching over a Random Forest. Here were my top feature importances:
- spot_followers 0.378432
- track_longevity 0.209470
- time 0.174249
- artist_familiarity 0.035067
- duration 0.022495
- liveness 0.019045
I also used Elastic Net CV, a regularised linear regression, to get an R2 score of 0.54. The advantage of a linear model is, of course, that the coefficients are directly interpretable:
- time -11.411248
- peak_rank -5.101219
- track_longevity 3.552898
- spot_followers 2.793421
- liveness -0.667895
- loudness 0.532260
By examining the coefficients I found that the variables: 'Time Since First Charting', 'Number of Spotify followers', 'Artist Chart Familiarity' and 'Track Chart Longevity' had the largest impact on track popularity.
In conclusion, if you are an artist aiming to get your songs into automatically curated playlists, or you wonder how to rank higher in Spotify's popularity index, make sure you release songs frequently to stay relevant in the Spotify world. Further, if an artist wants a high Spotify score, they should focus their efforts on gaining more followers rather than getting views on other platforms such as YouTube. Finally, although 'Duration' and 'Loudness' seem to have a small impact on a track's popularity, musical components appear to be largely irrelevant in predicting what makes a song popular.
Due to time constraints I focused on the musical components and artist information derived from the Spotify API. However there are several other potentially brilliant predictors of popularity I'm excited to add:
I also want to investigate homogeneity in music. It seems as if the Billboard Hot 100 will continue to converge musically; give it enough time and we’ll all be listening to the same thing.