
Playlists from multiple tracks #72

Conversation

@SimonTeixidor (Contributor)

Hi! I found blissify a few weeks ago, and it's pretty great. Thanks for developing this!

One feature that I was missing (and I see it's been discussed on this issue tracker, #54) is the ability to create playlists from multiple tracks, so I decided to implement that. While doing that, I also ended up making some small unrelated changes.
This PR contains 4 commits:

Enable playlists based on multiple songs

This commit introduces a new distance metric, using the extended isolation forest algorithm, which allows us to create playlists from a set of songs instead of from a single song. From what I can tell, this kind of algorithm is usually used to detect outliers, but here I use it in the opposite way (that is, find the songs that are least likely to be outliers). So far, the results seem very good in my opinion.
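
(For illustration, a minimal sketch of the idea rather than the PR's actual code: it assumes the `extended-isolation-forest` crate's `Forest::from_slice`/`score` API, an illustrative 20-dimensional feature vector, and a hypothetical helper name. Lower anomaly score means "closer to the seed songs".)

```rust
use extended_isolation_forest::{Forest, ForestOptions};

// Illustrative feature-vector length, standing in for bliss' analysis size.
const N: usize = 20;

/// Sort candidate songs so that the ones least likely to be outliers
/// (relative to the seed songs) come first. Songs are reduced to
/// (name, features) pairs for the sake of the sketch.
fn rank_by_seed_songs(seeds: &[[f64; N]], candidates: &mut [(String, [f64; N])]) {
    let opts = ForestOptions {
        n_trees: 1000,
        sample_size: seeds.len().min(200), // clamp if we only have a few seed songs
        max_tree_depth: None,
        extension_level: 10,
    };
    let forest = Forest::from_slice(seeds, &opts).expect("failed to fit the forest");
    // Lower anomaly score = more "normal" with respect to the seed set.
    candidates.sort_by(|(_, a), (_, b)| forest.score(a).total_cmp(&forest.score(b)));
}
```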

Remove *_by_key family of functions

A lot of functionality in this library was duplicated, with functions like foo and foo_by_key doing basically the same thing. This commit gets rid of this duplication by making the functions generic over AsRef<Song>, and by implementing that trait for LibrarySong<Song>.
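
(As an illustration of the pattern, a minimal, self-contained sketch with stand-in types and a hypothetical function name; the real Song and LibrarySong live in bliss itself.)

```rust
// Stand-ins for bliss' own types, just to keep the sketch self-contained.
struct Song;
struct LibrarySong<T> {
    bliss_song: Song,
    extra_info: T,
}

impl<T> AsRef<Song> for LibrarySong<T> {
    fn as_ref(&self) -> &Song {
        &self.bliss_song
    }
}

impl AsRef<Song> for Song {
    fn as_ref(&self) -> &Song {
        self
    }
}

// One generic function replaces a `foo` / `foo_by_key` pair (hypothetical name):
// it accepts `&[Song]`, `&[LibrarySong<T>]`, or anything else viewable as a Song.
fn first_as_song<S: AsRef<Song>>(songs: &[S]) -> Option<&Song> {
    songs.first().map(|s| s.as_ref())
}

fn main() {
    let plain = vec![Song];
    let wrapped = vec![LibrarySong { bliss_song: Song, extra_info: "play count: 3" }];
    assert!(first_as_song(&plain).is_some());
    assert!(first_as_song(&wrapped).is_some());
}
```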

Get rid of circular dependency between playlist and song

I didn't like how the playlist module depends on the song module, and the other way around. This kind of circular dependency is a bit confusing, and in this case it was only used to provide some helper functions on Song. In my opinion it's better to get rid of those helper functions.

Unify distance metrics

Make functions in the playlist module generic over a new trait, DistanceMetricBuilder. This allows us to deduplicate code, as there is no longer any need to treat distance functions that operate on a single pair of songs (like euclidean distance) differently from distance functions that operate on sets of songs (like the extended isolation forest). For metrics that operate on a single pair of songs, we use the sum of the pairwise distances when measuring the distance to a set of songs.
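
(Roughly, the shape of the idea; this is a simplified sketch rather than the PR's exact trait definitions, using `ndarray::Array1` for feature vectors.)

```rust
use ndarray::Array1;

/// A metric measures how far `vector` is from a reference set fixed at build time.
trait DistanceMetric {
    fn distance(&self, vector: &Array1<f32>) -> f32;
}

/// A builder turns a set of reference songs (their feature vectors) into a metric.
trait DistanceMetricBuilder {
    fn build(&self, reference: &[Array1<f32>]) -> Box<dyn DistanceMetric>;
}

fn euclidean(a: &Array1<f32>, b: &Array1<f32>) -> f32 {
    (a - b).mapv(|x| x * x).sum().sqrt()
}

/// Adapter: a plain pairwise function becomes a set metric by summing the
/// distances to every reference song.
struct SumOfPairwise {
    reference: Vec<Array1<f32>>,
    func: fn(&Array1<f32>, &Array1<f32>) -> f32,
}

impl DistanceMetric for SumOfPairwise {
    fn distance(&self, vector: &Array1<f32>) -> f32 {
        self.reference.iter().map(|r| (self.func)(r, vector)).sum()
    }
}

impl DistanceMetricBuilder for fn(&Array1<f32>, &Array1<f32>) -> f32 {
    fn build(&self, reference: &[Array1<f32>]) -> Box<dyn DistanceMetric> {
        Box::new(SumOfPairwise { reference: reference.to_vec(), func: *self })
    }
}

fn main() {
    let seeds = vec![Array1::from(vec![0.0_f32, 0.0]), Array1::from(vec![1.0, 1.0])];
    let builder: fn(&Array1<f32>, &Array1<f32>) -> f32 = euclidean;
    let metric = builder.build(&seeds);
    println!("distance to the set: {}", metric.distance(&Array1::from(vec![0.5_f32, 0.5])));
}
```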

I've been using a local build of blissify with these changes applied for a little while, and it seems to work so far. I'll be happy to make a PR to blissify as well if you would like to merge these changes. Let me know what you think.

* Enable playlists based on multiple songs

This commit introduces functionality for generating playlists based on a
set of songs.

For good performance, I also introduce a new distance measure, extended
isolation distance.

While the previous distance metrics, euclidean distance and cosine
distance, could be made to measure distance to a set of songs, the
performance will not be as good.

* Remove *_by_key family of functions

Remove code duplication by making these functions generic over
AsRef<Song> instead of having separate versions.

* Get rid of circular dependency between playlist and song

The playlist module depends on the song module, which in turn depends on
the playlist module. This is confusing, and in this case also completely
unnecessary.

* Unify distance metrics

Allow euclidean distance and cosine distance to handle a set of songs as
input. In doing so, we can also remove the ability to generate playlists
from a single song, as that is just a special case of generating a
playlist from many songs anyway.
@Polochon-street (Owner)

Hi!

Thanks a lot for this contribution, it would bring a really, really nice feature to bliss. I would also be very much interested in you submitting blissify's counterpart, as I'd be very happy to try it out!

Would it be possible to split this PR into 4 PRs, one for each commit? It'd be easier for me to review, and merging them one by one avoids the whole PR having trouble getting through if there are small blocking questions here and there :)

Side-note - do you mind if I add your github handle to the CHANGELOG, so you're credited properly?

Cheers!

@SimonTeixidor (Contributor, Author)

Here's what I'm running locally for blissify: Polochon-street/blissify-rs#53. I didn't really go through those changes as thoroughly, so please see it more as WIP.

I think there may be dependencies between the commits, so I'd prefer not to split them into separate pull requests. Maybe you can have a look, and if there is something you disagree with, I'll make an effort to break that out?

Sure, feel free to add my handle to the changelog :)

@Polochon-street (Owner) left a comment

Alright, just did a short review :) I'll take a better look later today, but I think it's a good start.
Thanks for submitting the blissify PR also, I'll take a look at it :)

Cheers!

Resolved review threads: src/library.rs (outdated), src/playlist.rs (outdated), and two more on src/library.rs
@Polochon-street (Owner)

I had some more time to review, and it looks great!
But I would rather not part with the *_by_key functions, as they would be useful (see https://github.com/Polochon-street/bliss-rs/pull/72/files#diff-294648967369087cf3619793bd50043a2b39b5f481523d7c1636cc3e8d572a5bL55-L56) if people want to implement bliss for audio players whose song structs contain more than bliss' Song struct.

I.e. if a player has a song struct like

```rust
struct PlayerSong {
    title: Option<String>,
    custom_metadata1: Option<Metadata1>,
    custom_metadata2: Option<Metadata2>,
    bliss_song: Option<Song>,
}
```

then all of our closest_to_songs are basically useless. What do you think?

@SimonTeixidor (Contributor, Author)

> if a player has a song struct like
>
> ```rust
> struct PlayerSong {
>     title: Option<String>,
>     custom_metadata1: Option<Metadata1>,
>     custom_metadata2: Option<Metadata2>,
>     bliss_song: Option<Song>,
> }
> ```
>
> then all of our closest_to_songs are basically useless. What do you think?

Hmm, but how would you use the *_by_key functions for this struct? The key function has the signature Fn(&T) -> Song, and the only implementation I can think of would be bliss_song.unwrap().clone(), which seems like a bad idea. If bliss_song were not optional, you could impl AsRef<Song> for PlayerSong, like I do with LibrarySong here: https://github.com/Polochon-street/bliss-rs/pull/72/files#diff-47af127d675e3dc21b43c34222f42b9290334e88eb5c7ae0fc8b7539e3ecb794R341
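
(For illustration, a minimal sketch of that suggestion, assuming a non-optional bliss_song field; `bliss_audio::Song` is bliss' own Song type, and the metadata types are hypothetical.)

```rust
use bliss_audio::Song;

struct Metadata1;
struct Metadata2;

struct PlayerSong {
    title: Option<String>,
    custom_metadata1: Option<Metadata1>,
    custom_metadata2: Option<Metadata2>,
    bliss_song: Song, // not optional, so a &Song can always be handed out
}

impl AsRef<Song> for PlayerSong {
    fn as_ref(&self) -> &Song {
        &self.bliss_song
    }
}
```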

@Polochon-street (Owner)

That's a fair point :D then I think these comments need to go https://github.com/Polochon-street/bliss-rs/pull/72/files#diff-294648967369087cf3619793bd50043a2b39b5f481523d7c1636cc3e8d572a5bR110-R111, and this needs fixing #72 (comment), and we're good to go!

@SimonTeixidor (Contributor, Author)

> then I think these comments need to go https://github.com/Polochon-street/bliss-rs/pull/72/files#diff-294648967369087cf3619793bd50043a2b39b5f481523d7c1636cc3e8d572a5bR110-R111

Sorry, that link is not working for me (it just takes me to the whole diff of this PR).

> and this needs fixing #72 (comment), and we're good to go!

I pushed a test case.

@Polochon-street (Owner)

> Sorry, that link is not working for me (it just takes me to the whole diff of this PR).

Oh, my bad - it's because GitHub doesn't display the diff of playlist.rs (it's too big). It's the comment on lines 110-111 of playlist.rs, which says

```rust
/// Sort songs with a key extraction function, useful for when you have a
/// structure like `CustomSong { bliss_song: Song, something_else: bool }`
```

(which is not relevant anymore), it's also in a couple of other lines in the file

Resolved review thread: src/playlist.rs (outdated)
@toofar (Contributor) left a comment

Ohhh, exciting!

I went through this with the goal of understanding what this new distance metric was so I could reason about it. So I have some questions on that, if you wouldn't mind humouring me. I haven't actually pulled down the code and started experimenting yet, but hopefully today or tomorrow I will have time to.

Would you mind explaining a little more about how the isolation forest works in this case? As I understand it, it partitions the points in the training data into a set of partitions, encoded as a b-tree, with the goal of isolating the maximum number of points, and it does this for multiple trees as a Monte Carlo method to get a better fit. Then to score data it puts the point being scored into each tree and sees how many nested partitions it is in (e.g. its depth in the tree).

Questions:

  1. in the case which I assume is the common case where the training set is much more closely related than the whole library (e.g. you select a few similar songs to create the playlist from), what would the distribution of points and partitions look like here? I assume the partitions would be clustered around the selected songs and the vast majority of the library would be lightly partitioned (e.g. "anomalies"), does that sound right?
  2. regarding the forest option sample_size, I see we are using min(selected_songs.len(), 200), where 200 is the default sample size from upstream. The paper talks about subsampling to avoid swamping and masking. I assume we don't need to worry about that since we are using a specialised sample and the majority of the points we are going to score are going to be anomalies? And we should make sure all of the selected songs are represented in the initial partitioning for, well for the obvious reason of having them all included in the initial partitioning exercise.
  3. regarding the forest option n_trees I have two questions as I'm having a bit of trouble reasoning about this
    a) what are the pros and cons of having a large number of trees (the default is 150) for a small data set like a handful of songs? Are a lot of them essentially redundant and we can use a smaller set to speed things up? Or maybe they aren't redundant and there will be many short trees with wildly different partitions?
    b) does/should the number of trees have any relation to the number of dimensions the points are in? Or the number of features in this case. It seems in a non-extended isolation forest they have one tree per feature, and in this extended one where each partition is in multiple dimensions I think they keep using multiple trees to account for the larger option space available for these partitions. In the paper they say "path lengths usually converge well before 100 trees", and their test for that looks to be over a 135 point gaussian distribution, which may or may not apply in this case. If you don't have any thoughts on this and are just sticking with the defaults, I guess my motivating question would be whether playlist generation is significantly faster with less trees or not?

I would be interested in seeing a visualization of what the partitioning looked like in a small 2d dataset. Like if I just chose two songs, would this make it degrade so that there was a very small partitioned area between the points and everything outside of that, even if they were spatially close to one of the two points, had an equal path length? And if we had five points spaced equally apart (like the points of a star), would a point at the center of the cluster be treated as more normal or more anomalous compared to a point the same distance away from one of the outer points in the cluster, but outside of the cluster? Or even between two points on the edge of the cluster?

Also, maybe a question for @Polochon-street, I can't remember what the features that bliss uses look like but if there are some that are on a much different scale than others this issue on the original repo points out that you might want to normalize them to be on the same scale: sahandha/eif#31

```rust
    for FunctionDistanceMetric<'a, F>
{
    fn distance(&self, vector: &Array1<f32>) -> f32 {
        self.state.iter().map(|v| (self.func)(v, vector)).sum()
```
@toofar (Contributor) commented on Mar 30, 2024

If I understand this correctly (and I'm not sure I do, the signatures of these generic trait implementations are a bit much for a rust neophyte like me) this is going to compute the distance from vector (a multidimensional point) to each of the points saved in state at initialization time, and then sum them.
Is sum the right operation to be combining these individual distances? I guess I naively would expect some kind of average. Maybe it doesn't matter here? Or is this one of those situations where there are multiple ways to do it with their own arguments for and against and if you want something different you should implement your own builder?
Edit: I tested and I don't think it makes a difference

nit: I think some of these variables could have more meaningful names, "vector" and "state" are pretty generic

@SimonTeixidor (Contributor, Author)

You have understood the implementation correctly. Given a set of points stored in state, this implementation will return the sum of the distances between vector and each point in state. Only for distance metrics that measure the distance between two points though, so this implementation is not relevant for the extended isolation forest metric or other metrics that can measure the distance between vector and state directly.

That works out to be the same as the average (mean) distance, just off by a factor 1/state.len() which shouldn't matter for sorting. I guess the median distance could be interesting to look at though?

Another approach is to compute some kind of centroid of state (perhaps that's what you mean by the average?), and then measure the distance to that. No idea if that gives better results, but perhaps something worth experimenting with?

In general though, for the case where state contains more than one item, I would avoid using euclidean and cosine distance, so I didn't really attempt to optimize this case.
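
(A toy check of the "same up to a constant factor" point above, with one-dimensional "songs" and absolute difference as the pairwise distance; names are hypothetical.)

```rust
fn main() {
    let reference = [1.0_f32, 4.0, 9.0]; // the "seed" set
    let candidates = [2.0_f32, 7.0, 3.5];

    let sum_dist = |x: f32| reference.iter().map(|r| (r - x).abs()).sum::<f32>();
    let mean_dist = |x: f32| sum_dist(x) / reference.len() as f32;

    let mut by_sum = candidates.to_vec();
    by_sum.sort_by(|a, b| sum_dist(*a).total_cmp(&sum_dist(*b)));

    let mut by_mean = candidates.to_vec();
    by_mean.sort_by(|a, b| mean_dist(*a).total_cmp(&mean_dist(*b)));

    // mean = sum / n with n constant, so the ordering is identical.
    assert_eq!(by_sum, by_mean);
}
```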

```rust
n32(s1.custom_distance(&s2, &distance)) < distance_threshold.unwrap_or(0.05)
let s1 = s1.as_ref();
let s2 = s2.as_ref();
let vector = [s1.analysis.as_arr1()];
```
@toofar (Contributor)

hmm, any thoughts as to how well an isolation forest will perform here with just one input point?

@SimonTeixidor (Contributor, Author)

Not really, probably best to use one of the other distance metrics here.

@SimonTeixidor (Contributor, Author)

@toofar

> I went through this with the goal of understanding what this new distance metric was so I could reason about it. So I have some questions on that, if you wouldn't mind humouring me. I haven't actually pulled down the code and started experimenting yet, but hopefully today or tomorrow I will have time to.
>
> Would you mind explaining a little more about how the isolation forest works in this case? As I understand it, it partitions the points in the training data into a set of partitions, encoded as a b-tree, with the goal of isolating the maximum number of points, and it does this for multiple trees as a Monte Carlo method to get a better fit. Then to score data it puts the point being scored into each tree and sees how many nested partitions it is in (e.g. its depth in the tree).

I'm definitely no expert on the extended isolation forest. I set out to find some suitable metric, and somehow I bumped into this one which already had a Rust implementation.

In general, I wanted a metric that:

  • Does not make any assumptions on the distribution of the data (see Mahalanobis' distance for example)
  • Has good runtime performance
  • Gives subjectively good results
  • Is relatively easy to implement, or already has a Rust implementation

So, for your questions, I think your guess is mostly as good as mine :)

> Questions:
>
> 1. in the case which I assume is the common case where the training set is much more closely related than the whole library (e.g. you select a few similar songs to create the playlist from), what would the distribution of points and partitions look like here? I assume the partitions would be clustered around the selected songs and the vast majority of the library would be lightly partitioned (e.g. "anomalies"), does that sound right?

I have no idea what the common case is, to be honest. I guess if you run this system on a large and diverse catalogue (say, the entire catalogue of spotify) and select something quite specific (say jazz piano trios), it would be true that the vast majority of the library would be anomalies. But I'm guessing that most private collections are much more specialised, so the same training set could actually match a large portion of the library instead?

> 2. regarding the forest option sample_size, I see we are using min(selected_songs.len(), 200), where 200 is the default sample size from upstream. The paper talks about subsampling to avoid swamping and masking. I assume we don't need to worry about that since we are using a specialised sample and the majority of the points we are going to score are going to be anomalies? And we should make sure all of the selected songs are represented in the initial partitioning for, well for the obvious reason of having them all included in the initial partitioning exercise.

> 3. regarding the forest option n_trees I have two questions as I'm having a bit of trouble reasoning about this
>     a) what are the pros and cons of having a large number of trees (the default is 150) for a small data set like a handful of songs? Are a lot of them essentially redundant and we can use a smaller set to speed things up? Or maybe they aren't redundant and there will be many short trees with wildly different partitions?
>     b) does/should the number of trees have any relation to the number of dimensions the points are in? Or the number of features in this case. It seems in a non-extended isolation forest they have one tree per feature, and in this extended one where each partition is in multiple dimensions I think they keep using multiple trees to account for the larger option space available for these partitions. In the paper they say "path lengths usually converge well before 100 trees", and their test for that looks to be over a 135 point gaussian distribution, which may or may not apply in this case. If you don't have any thoughts on this and are just sticking with the defaults, I guess my motivating question would be whether playlist generation is significantly faster with less trees or not?

These are good questions, and worth exploring. Unless I made a mistake, I don't think I set any of these options in this library though. We just accept a ForestOptions with whatever parameters and use those as is (well, except for sample size which is reduced in case we don't have enough samples). My idea was that we could experiment a bit with these parameters in blissify, perhaps we could even expose them to end users (seeing as blissify already exposes other such technical details in the cli).
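
(Concretely, the sample-size adjustment described above could look something like this; a sketch with a hypothetical helper name, reusing the crate's ForestOptions fields.)

```rust
use extended_isolation_forest::ForestOptions;

/// Pass the caller's options through untouched, except that sample_size can
/// never exceed the number of seed songs we actually have.
fn clamp_options(requested: &ForestOptions, n_seed_songs: usize) -> ForestOptions {
    ForestOptions {
        n_trees: requested.n_trees,
        sample_size: requested.sample_size.min(n_seed_songs),
        max_tree_depth: requested.max_tree_depth,
        extension_level: requested.extension_level,
    }
}
```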

> I would be interested in seeing a visualization of what the partitioning looked like in a small 2d dataset. Like if I just chose two songs, would this make it degrade so that there was a very small partitioned area between the points and everything outside of that, even if they were spatially close to one of the two points, had an equal path length? And if we had five points spaced equally apart (like the points of a star), would a point at the center of the cluster be treated as more normal or more anomalous compared to a point the same distance away from one of the outer points in the cluster, but outside of the cluster? Or even between two points on the edge of the cluster?

I'm also interested in this, but haven't gotten around to experimenting with it yet. I can ping you here if I make something.

> Also, maybe a question for @Polochon-street, I can't remember what the features that bliss uses look like but if there are some that are on a much different scale than others this issue on the original repo points out that you might want to normalize them to be on the same scale: sahandha/eif#31

I think bliss already normalizes the features, see https://github.com/Polochon-street/bliss-rs/blob/master/src/chroma.rs#L32 for example.

@SimonTeixidor (Contributor, Author)

> (which is not relevant anymore), it's also in a couple of other lines in the file

Thanks. I've removed the one you spotted, I didn't spot any other such comments though... Where did you see the others?

@Polochon-street (Owner)

> Also, maybe a question for @Polochon-street, I can't remember what the features that bliss uses look like but if there are some that are on a much different scale than others this issue on the original repo points out that you might want to normalize them to be on the same scale: sahandha/eif#31
>
> I think bliss already normalizes the features, see https://github.com/Polochon-street/bliss-rs/blob/master/src/chroma.rs#L32 for example.

I can confirm normalization is done, albeit in a very crude way - so enhancing this is definitely worth exploring. From what I remember, chroma normalization is the most unsatisfying one. But yes, there is normalization!
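
(For reference, the generic technique being discussed; a minimal min-max scaling sketch, not bliss' actual normalization code.)

```rust
/// Rescale every feature column to [0, 1] so that features living on very
/// different scales don't dominate the forest's splits.
fn min_max_normalize(features: &mut [Vec<f32>]) {
    let dims = match features.first() {
        Some(f) => f.len(),
        None => return,
    };
    for d in 0..dims {
        let min = features.iter().map(|f| f[d]).fold(f32::INFINITY, f32::min);
        let max = features.iter().map(|f| f[d]).fold(f32::NEG_INFINITY, f32::max);
        if max > min {
            for f in features.iter_mut() {
                f[d] = (f[d] - min) / (max - min);
            }
        }
    }
}
```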

> Thanks. I've removed the one you spotted, I didn't spot any other such comments though... Where did you see the others?

Thanks. I can't see any other one, so I think once this is fixed #72 (comment), we're good to go!

@SimonTeixidor force-pushed the multi-track-playlist-source branch 2 times, most recently from 91787dd to f61eec1 (April 1, 2024, 15:47)
Review thread on src/playlist.rs (outdated):

```rust
// These songs contain analysis of actual music. Tracks 1-3 are a recording of Mozart's
// piano concerto no. 19, 9-11 are a recording of Mozart's piano concerto no. 23, and
// tracks 4-8 are Miles Davis' "Kind Of Blue".
let first_song = Song {
```
@Polochon-street (Owner)

nit, but I think maybe naming them mozart_training_1, mozart_training_2, miles_davis_1, etc would be a bit more self-descriptive (especially useful when they're used towards the end)

@SimonTeixidor (Contributor, Author)

Agreed, I named them.

Review thread on src/playlist.rs (outdated):

```rust
// We train the algorithm on the Mozart concertos, and the expectation is that the tracks
// from the Miles Davis record will end up last.
let opts = ForestOptions {
    n_trees: 1000,
    sample_size: 200,
    max_tree_depth: None,
    extension_level: 10,
};
closest_to_songs(
    &[&first_song, &second_song, &third_song, &ninth_song, &tenth_song, &eleventh_song],
    &mut songs,
    &opts,
);
```
@Polochon-street (Owner)

could we have the closest_to_songs call without one of the Mozart tracks, to make sure that even if it's not trained on that specific song, it still ranks better than Miles'?

Don't sweat it if it doesn't work though, I think the test is still useful enough as it is 😄

@SimonTeixidor (Contributor, Author)

Yep, that seems to work. I train on one of the concertos, and it still works for ~8k runs of the test here.

@SimonTeixidor force-pushed the multi-track-playlist-source branch from f61eec1 to e57ca4c (April 1, 2024, 18:02)
@Polochon-street (Owner)

awesome! Thanks a lot for your patience and all your fixes, merging ✌️

@Polochon-street merged commit 435284c into Polochon-street:master on Apr 1, 2024. 2 checks passed.
Polochon-street pushed a commit that referenced this pull request Apr 3, 2024
* Enable playlists based on multiple songs

This commit introduces functionality for generating playlists based on a
set of songs.

For good performance, I also introduce a new distance measure, extended
isolation distance.

While the previous distance metrics, euclidean distance and cosine
distance, could be made to measure distance to a set of songs, the
performance will not be as good.

* Remove *_by_key family of functions

Remove code duplication by making these functions generic over
AsRef<Song> instead of having separate versions.

* Get rid of circular dependency between playlist and song

The playlist module depends on the song module, which in turn depends on
the playlist module. This is confusing, and in this case also completely
unnecessary.

* Unify distance metrics

Allow euclidean distance and cosine distance to handle a set of songs as
input. In doing so, we can also remove the ability to generate playlists
from a single song, as that is just a special case of generating a
playlist from many songs anyway.

* Review comment: Improve documentation for single song use case

* Add test for ForestOption as DistanceMetricBuilder.

* Remove unnecessary double-reference

* Add missing test assertion.

* Remove comment that is no longer relevant.

* Fix extended isolation forest test case

* Document suggestions and limitations for distance metric

* cargo fmt