
Playlists from multiple tracks #72

Conversation

@SimonTeixidor (Contributor)

Hi! I found blissify a few weeks ago, and it's pretty great. Thanks for developing this!

One feature that I was missing (and I see it's been discussed on this issue tracker, #54) is the ability to create playlists from multiple tracks, so I decided to implement that. While doing that, I also ended up making some small unrelated changes.
This PR contains 4 commits:

Enable playlists based on multiple songs

This commit introduces a new distance metric, using the extended isolation forest algorithm, which allows us to create playlists from a set of songs instead of from a single song. From what I can tell, this kind of algorithm is usually used to detect outliers, but here I use it in the opposite way (that is, find the songs that are least likely to be outliers). So far, the results seem very good in my opinion.
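
(For illustration, a minimal sketch of the idea rather than the PR's actual code: it assumes the `extended-isolation-forest` crate's `Forest::from_slice`/`score` API, an illustrative 20-dimensional feature vector, and a hypothetical helper name. Lower anomaly score means "closer to the seed songs".)

```rust
use extended_isolation_forest::{Forest, ForestOptions};

// Illustrative feature-vector length, standing in for bliss' analysis size.
const N: usize = 20;

/// Sort candidate songs so that the ones least likely to be outliers
/// (relative to the seed songs) come first. Songs are reduced to
/// (name, features) pairs for the sake of the sketch.
fn rank_by_seed_songs(seeds: &[[f64; N]], candidates: &mut [(String, [f64; N])]) {
    let opts = ForestOptions {
        n_trees: 1000,
        sample_size: seeds.len().min(200), // clamp if we only have a few seed songs
        max_tree_depth: None,
        extension_level: 10,
    };
    let forest = Forest::from_slice(seeds, &opts).expect("failed to fit the forest");
    // Lower anomaly score = more "normal" with respect to the seed set.
    candidates.sort_by(|(_, a), (_, b)| forest.score(a).total_cmp(&forest.score(b)));
}
```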

Remove *_by_key family of functions

A lot of functionality in this library was duplicated, with functions like foo and foo_by_key doing basically the same thing. This commit gets rid of this duplication by making the functions generic over AsRef<Song>, and by implementing that trait for LibrarySong<Song>.
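
(As an illustration of the pattern, a minimal, self-contained sketch with stand-in types and a hypothetical function name; the real Song and LibrarySong live in bliss itself.)

```rust
// Stand-ins for bliss' own types, just to keep the sketch self-contained.
struct Song;
struct LibrarySong<T> {
    bliss_song: Song,
    extra_info: T,
}

impl<T> AsRef<Song> for LibrarySong<T> {
    fn as_ref(&self) -> &Song {
        &self.bliss_song
    }
}

impl AsRef<Song> for Song {
    fn as_ref(&self) -> &Song {
        self
    }
}

// One generic function replaces a `foo` / `foo_by_key` pair (hypothetical name):
// it accepts `&[Song]`, `&[LibrarySong<T>]`, or anything else viewable as a Song.
fn first_as_song<S: AsRef<Song>>(songs: &[S]) -> Option<&Song> {
    songs.first().map(|s| s.as_ref())
}

fn main() {
    let plain = vec![Song];
    let wrapped = vec![LibrarySong { bliss_song: Song, extra_info: "play count: 3" }];
    assert!(first_as_song(&plain).is_some());
    assert!(first_as_song(&wrapped).is_some());
}
```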

Get rid of circular dependency between playlist and song

I didn't like how the playlist module depends on the song module, and the other way around. This kind of circular dependency is a bit confusing, and in this case it was only used to provide some helper functions on Song. In my opinion it's better to get rid of those helper functions.

Unify distance metrics

Make functions in the playlist module generic over a new trait, DistanceMetricBuilder. This allows us to deduplicate code, as there is no longer any need to treat distance functions that operate on a single pair of songs (like euclidean distance) differently from distance functions that operate on sets of songs (like the extended isolation forest). For metrics that operate on a single pair of songs, we use the sum of the pairwise distances when measuring the distance to a set of songs.
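
(Roughly, the shape of the idea; this is a simplified sketch rather than the PR's exact trait definitions, using `ndarray::Array1` for feature vectors.)

```rust
use ndarray::Array1;

/// A metric measures how far `vector` is from a reference set fixed at build time.
trait DistanceMetric {
    fn distance(&self, vector: &Array1<f32>) -> f32;
}

/// A builder turns a set of reference songs (their feature vectors) into a metric.
trait DistanceMetricBuilder {
    fn build(&self, reference: &[Array1<f32>]) -> Box<dyn DistanceMetric>;
}

fn euclidean(a: &Array1<f32>, b: &Array1<f32>) -> f32 {
    (a - b).mapv(|x| x * x).sum().sqrt()
}

/// Adapter: a plain pairwise function becomes a set metric by summing the
/// distances to every reference song.
struct SumOfPairwise {
    reference: Vec<Array1<f32>>,
    func: fn(&Array1<f32>, &Array1<f32>) -> f32,
}

impl DistanceMetric for SumOfPairwise {
    fn distance(&self, vector: &Array1<f32>) -> f32 {
        self.reference.iter().map(|r| (self.func)(r, vector)).sum()
    }
}

impl DistanceMetricBuilder for fn(&Array1<f32>, &Array1<f32>) -> f32 {
    fn build(&self, reference: &[Array1<f32>]) -> Box<dyn DistanceMetric> {
        Box::new(SumOfPairwise { reference: reference.to_vec(), func: *self })
    }
}

fn main() {
    let seeds = vec![Array1::from(vec![0.0_f32, 0.0]), Array1::from(vec![1.0, 1.0])];
    let builder: fn(&Array1<f32>, &Array1<f32>) -> f32 = euclidean;
    let metric = builder.build(&seeds);
    println!("distance to the set: {}", metric.distance(&Array1::from(vec![0.5_f32, 0.5])));
}
```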

I've been using a local build of blissify with these changes applied for a little while, and it seems to work so far. I'll be happy to make a PR to blissify as well if you would like to merge these changes. Let me know what you think.

* Enable playlists based on multiple songs

This commit introduces functionality for generating playlists based on a
set of songs.

For good performance, I also introduce a new distance measure, extended
isolation distance.

While the previous distance metrics, euclidean distance and cosine
distance, could be made to measure distance to a set of songs, the
performance will not be as good.

* Remove *_by_key family of functions

Remove code duplication by making these functions generic over
AsRef<Song> instead of having separate versions.

* Get rid of circular dependency between playlist and song

The playlist module depends on the song module, which in turn depends on
the playlist module. This is confusing, and in this case also completely
unnecessary.

* Unify distance metrics

Allow euclidean distance and cosine distance to handle a set of songs as
input. In doing so, we can also remove the ability to generate playlists
from a single song, as that is just a special case of generating a
playlist from many songs anyway.
@Polochon-street (Owner)

Hi!

Thanks a lot for this contribution, it would bring a really, really nice feature to bliss. I would also be very much interested in you submitting blissify's counterpart, as I'd be very happy to try it out!

Would it be possible to split this PR into 4 PRs, one for each commit? It'd be easier for me to review, and merging them one by one avoids the whole PR having trouble getting through if there are small blocking questions here and there :)

Side-note - do you mind if I add your github handle to the CHANGELOG, so you're credited properly?

Cheers!

@SimonTeixidor (Contributor, Author)

Here's what I'm running locally for blissify: Polochon-street/blissify-rs#53. I didn't really go through those changes as thoroughly, so please see it more as WIP.

I think there may be dependencies between the commits, so I'd prefer not to split them into separate pull requests. Maybe you can have a look, and if there is something you disagree with, I'll make an effort to break that out?

Sure, feel free to add my handle to the changelog :)

@Polochon-street (Owner) left a comment

Alright, just did a short review :) I'll take a better look later today, but I think it's a good start.
Thanks for submitting the blissify PR also, I'll take a look at it :)

Cheers!

Resolved review threads: src/library.rs (outdated), src/playlist.rs (outdated), and two more on src/library.rs
@Polochon-street (Owner)

I had some more time to review, and it looks great!
But I would rather not part with the *_by_key functions, as they would be useful (see https://github.com/Polochon-street/bliss-rs/pull/72/files#diff-294648967369087cf3619793bd50043a2b39b5f481523d7c1636cc3e8d572a5bL55-L56) if people want to implement bliss for audio players whose song structs contain more than bliss' Song struct.

I.e. if a player has a song struct like

```rust
struct PlayerSong {
    title: Option<String>,
    custom_metadata1: Option<Metadata1>,
    custom_metadata2: Option<Metadata2>,
    bliss_song: Option<Song>,
}
```

then all of our closest_to_songs are basically useless. What do you think?

@SimonTeixidor (Contributor, Author)

> if a player has a song struct like
>
> ```rust
> struct PlayerSong {
>     title: Option<String>,
>     custom_metadata1: Option<Metadata1>,
>     custom_metadata2: Option<Metadata2>,
>     bliss_song: Option<Song>,
> }
> ```
>
> then all of our closest_to_songs are basically useless. What do you think?

Hmm, but how would you use the *_by_key functions for this struct? The key function has the signature Fn(&T) -> Song, and the only implementation I can think of would be bliss_song.unwrap().clone(), which seems like a bad idea. If bliss_song were not optional, you could impl AsRef<Song> for PlayerSong, like I do with LibrarySong here: https://github.com/Polochon-street/bliss-rs/pull/72/files#diff-47af127d675e3dc21b43c34222f42b9290334e88eb5c7ae0fc8b7539e3ecb794R341
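
(For illustration, a minimal sketch of that suggestion, assuming a non-optional bliss_song field; `bliss_audio::Song` is bliss' own Song type, and the metadata types are hypothetical.)

```rust
use bliss_audio::Song;

struct Metadata1;
struct Metadata2;

struct PlayerSong {
    title: Option<String>,
    custom_metadata1: Option<Metadata1>,
    custom_metadata2: Option<Metadata2>,
    bliss_song: Song, // not optional, so a &Song can always be handed out
}

impl AsRef<Song> for PlayerSong {
    fn as_ref(&self) -> &Song {
        &self.bliss_song
    }
}
```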

@Polochon-street (Owner)

That's a fair point :D then I think these comments need to go https://github.com/Polochon-street/bliss-rs/pull/72/files#diff-294648967369087cf3619793bd50043a2b39b5f481523d7c1636cc3e8d572a5bR110-R111, and this needs fixing #72 (comment), and we're good to go!

@SimonTeixidor (Contributor, Author)

> then I think these comments need to go https://github.com/Polochon-street/bliss-rs/pull/72/files#diff-294648967369087cf3619793bd50043a2b39b5f481523d7c1636cc3e8d572a5bR110-R111

Sorry, that link is not working for me (it just takes me to the whole diff of this PR).

> and this needs fixing #72 (comment), and we're good to go!

I pushed a test case.

@Polochon-street (Owner)

> Sorry, that link is not working for me (it just takes me to the whole diff of this PR).

Oh, my bad - it's because GitHub doesn't display the diff of playlist.rs (it's too big). It's the comment on lines 110-111 of playlist.rs, which says

```rust
/// Sort songs with a key extraction function, useful for when you have a
/// structure like `CustomSong { bliss_song: Song, something_else: bool }`
```

(which is not relevant anymore), it's also in a couple of other lines in the file

Resolved review thread: src/playlist.rs (outdated)
@toofar (Contributor) left a comment

Ohhh, exciting!

I went through this with the goal of understanding what this new distance metric was so I could reason about it. So I have some questions on that, if you wouldn't mind humouring me. I haven't actually pulled down the code and started experimenting yet, but hopefully today or tomorrow I will have time to.

Would you mind explaining a little more about how the isolation forest works in this case? As I understand it, it partitions the points in the training data into a set of partitions, encoded as a b-tree, with the goal of isolating the maximum number of points, and it does this for multiple trees as a Monte Carlo method to get a better fit. Then to score data it puts the point being scored into each tree and sees how many nested partitions it is in (e.g. its depth in the tree).

Questions:

  1. in the case which I assume is the common case where the training set is much more closely related than the whole library (e.g. you select a few similar songs to create the playlist from), what would the distribution of points and partitions look like here? I assume the partitions would be clustered around the selected songs and the vast majority of the library would be lightly partitioned (e.g. "anomalies"), does that sound right?
  2. regarding the forest option sample_size, I see we are using min(selected_songs.len(), 200), where 200 is the default sample size from upstream. The paper talks about subsampling to avoid swamping and masking. I assume we don't need to worry about that since we are using a specialised sample and the majority of the points we are going to score are going to be anomalies? And we should make sure all of the selected songs are represented in the initial partitioning for, well for the obvious reason of having them all included in the initial partitioning exercise.
  3. regarding the forest option n_trees I have two questions as I'm having a bit of trouble reasoning about this
    a) what are the pros and cons of having a large number of trees (the default is 150) for a small data set like a handful of songs? Are a lot of them essentially redundant and we can use a smaller set to speed things up? Or maybe they aren't redundant and there will be many short trees with wildly different partitions?
    b) does/should the number of trees have any relation to the number of dimensions the points are in? Or the number of features in this case. It seems in a non-extended isolation forest they have one tree per feature, and in this extended one where each partition is in multiple dimensions I think they keep using multiple trees to account for the larger option space available for these partitions. In the paper they say "path lengths usually converge well before 100 trees", and their test for that looks to be over a 135 point gaussian distribution, which may or may not apply in this case. If you don't have any thoughts on this and are just sticking with the defaults, I guess my motivating question would be whether playlist generation is significantly faster with less trees or not?

I would be interested in seeing a visualization of what the partitioning looked like in a small 2d dataset. Like if I just chose two songs, would this make it degrade so that there was a very small partitioned area between the points and everything outside of that, even if they were spatially close to one of the two points, had an equal path length? And if we had five points spaced equally apart (like the points of a star), would a point at the center of the cluster be treated as more normal or more anomalous compared to a point the same distance away from one of the outer points in the cluster, but outside of the cluster? Or even between two points on the edge of the cluster?

Also, maybe a question for @Polochon-street, I can't remember what the features that bliss uses look like but if there are some that are on a much different scale than others this issue on the original repo points out that you might want to normalize them to be on the same scale: sahandha/eif#31

```rust
    for FunctionDistanceMetric<'a, F>
{
    fn distance(&self, vector: &Array1<f32>) -> f32 {
        self.state.iter().map(|v| (self.func)(v, vector)).sum()
```
@toofar (Contributor) commented on Mar 30, 2024

If I understand this correctly (and I'm not sure I do, the signatures of these generic trait implementations are a bit much for a rust neophyte like me) this is going to compute the distance from vector (a multidimensional point) to each of the points saved in state at initialization time, and then sum them.
Is sum the right operation to be combining these individual distances? I guess I naively would expect some kind of average. Maybe it doesn't matter here? Or is this one of those situations where there are multiple ways to do it with their own arguments for and against and if you want something different you should implement your own builder?
Edit: I tested and I don't think it makes a difference

nit: I think some of these variables could have more meaningful names, "vector" and "state" are pretty generic

@SimonTeixidor (Contributor, Author)

You have understood the implementation correctly. Given a set of points stored in state, this implementation will return the sum of the distances between vector and each point in state. Only for distance metrics that measure the distance between two points though, so this implementation is not relevant for the extended isolation forest metric or other metrics that can measure the distance between vector and state directly.

That works out to be the same as the average (mean) distance, just off by a factor 1/state.len() which shouldn't matter for sorting. I guess the median distance could be interesting to look at though?

Another approach is to compute some kind of centroid of state (perhaps that's what you mean by the average?), and then measure the distance to that. No idea if that gives better results, but perhaps something worth experimenting with?

In general though, for the case where state contains more than one item, I would avoid using euclidean and cosine distance, so I didn't really attempt to optimize this case.
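
(A toy check of the "same up to a constant factor" point above, with one-dimensional "songs" and absolute difference as the pairwise distance; names are hypothetical.)

```rust
fn main() {
    let reference = [1.0_f32, 4.0, 9.0]; // the "seed" set
    let candidates = [2.0_f32, 7.0, 3.5];

    let sum_dist = |x: f32| reference.iter().map(|r| (r - x).abs()).sum::<f32>();
    let mean_dist = |x: f32| sum_dist(x) / reference.len() as f32;

    let mut by_sum = candidates.to_vec();
    by_sum.sort_by(|a, b| sum_dist(*a).total_cmp(&sum_dist(*b)));

    let mut by_mean = candidates.to_vec();
    by_mean.sort_by(|a, b| mean_dist(*a).total_cmp(&mean_dist(*b)));

    // mean = sum / n with n constant, so the ordering is identical.
    assert_eq!(by_sum, by_mean);
}
```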

```rust
n32(s1.custom_distance(&s2, &distance)) < distance_threshold.unwrap_or(0.05)
let s1 = s1.as_ref();
let s2 = s2.as_ref();
let vector = [s1.analysis.as_arr1()];
```
@toofar (Contributor)

hmm, any thoughts as to how well an isolation forest will perform here with just one input point?

@SimonTeixidor (Contributor, Author)

Not really, probably best to use one of the other distance metrics here.

@SimonTeixidor (Contributor, Author)

@toofar

> I went through this with the goal of understanding what this new distance metric was so I could reason about it. So I have some questions on that, if you wouldn't mind humouring me. I haven't actually pulled down the code and started experimenting yet, but hopefully today or tomorrow I will have time to.
>
> Would you mind explaining a little more about how the isolation forest works in this case? As I understand it, it partitions the points in the training data into a set of partitions, encoded as a b-tree, with the goal of isolating the maximum number of points, and it does this for multiple trees as a Monte Carlo method to get a better fit. Then to score data it puts the point being scored into each tree and sees how many nested partitions it is in (e.g. its depth in the tree).

I'm definitely no expert on the extended isolation forest. I set out to find some suitable metric, and somehow I bumped into this one which already had a Rust implementation.

In general, I wanted a metric that:

  • Does not make any assumptions on the distribution of the data (see Mahalanobis' distance for example)
  • Has good runtime performance
  • Gives subjectively good results
  • Is relatively easy to implement, or already has a Rust implementation

So, for your questions, I think your guess is mostly as good as mine :)

> Questions:
>
> 1. in the case which I assume is the common case where the training set is much more closely related than the whole library (e.g. you select a few similar songs to create the playlist from), what would the distribution of points and partitions look like here? I assume the partitions would be clustered around the selected songs and the vast majority of the library would be lightly partitioned (e.g. "anomalies"), does that sound right?

I have no idea what the common case is, to be honest. I guess if you run this system on a large and diverse catalogue (say, the entire catalogue of spotify) and select something quite specific (say jazz piano trios), it would be true that the vast majority of the library would be anomalies. But I'm guessing that most private collections are much more specialised, so the same training set could actually match a large portion of the library instead?

> 2. regarding the forest option sample_size, I see we are using min(selected_songs.len(), 200), where 200 is the default sample size from upstream. The paper talks about subsampling to avoid swamping and masking. I assume we don't need to worry about that since we are using a specialised sample and the majority of the points we are going to score are going to be anomalies? And we should make sure all of the selected songs are represented in the initial partitioning for, well for the obvious reason of having them all included in the initial partitioning exercise.

> 3. regarding the forest option n_trees I have two questions as I'm having a bit of trouble reasoning about this
>     a) what are the pros and cons of having a large number of trees (the default is 150) for a small data set like a handful of songs? Are a lot of them essentially redundant and we can use a smaller set to speed things up? Or maybe they aren't redundant and there will be many short trees with wildly different partitions?
>     b) does/should the number of trees have any relation to the number of dimensions the points are in? Or the number of features in this case. It seems in a non-extended isolation forest they have one tree per feature, and in this extended one where each partition is in multiple dimensions I think they keep using multiple trees to account for the larger option space available for these partitions. In the paper they say "path lengths usually converge well before 100 trees", and their test for that looks to be over a 135 point gaussian distribution, which may or may not apply in this case. If you don't have any thoughts on this and are just sticking with the defaults, I guess my motivating question would be whether playlist generation is significantly faster with less trees or not?

These are good questions, and worth exploring. Unless I made a mistake, I don't think I set any of these options in this library though. We just accept a ForestOptions with whatever parameters and use those as is (well, except for sample size which is reduced in case we don't have enough samples). My idea was that we could experiment a bit with these parameters in blissify, perhaps we could even expose them to end users (seeing as blissify already exposes other such technical details in the cli).
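
(Concretely, the sample-size adjustment described above could look something like this; a sketch with a hypothetical helper name, reusing the crate's ForestOptions fields.)

```rust
use extended_isolation_forest::ForestOptions;

/// Pass the caller's options through untouched, except that sample_size can
/// never exceed the number of seed songs we actually have.
fn clamp_options(requested: &ForestOptions, n_seed_songs: usize) -> ForestOptions {
    ForestOptions {
        n_trees: requested.n_trees,
        sample_size: requested.sample_size.min(n_seed_songs),
        max_tree_depth: requested.max_tree_depth,
        extension_level: requested.extension_level,
    }
}
```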

> I would be interested in seeing a visualization of what the partitioning looked like in a small 2d dataset. Like if I just chose two songs, would this make it degrade so that there was a very small partitioned area between the points and everything outside of that, even if they were spatially close to one of the two points, had an equal path length? And if we had five points spaced equally apart (like the points of a star), would a point at the center of the cluster be treated as more normal or more anomalous compared to a point the same distance away from one of the outer points in the cluster, but outside of the cluster? Or even between two points on the edge of the cluster?

I'm also interested in this, but haven't gotten around to experimenting with it yet. I can ping you here if I make something.

> Also, maybe a question for @Polochon-street, I can't remember what the features that bliss uses look like but if there are some that are on a much different scale than others this issue on the original repo points out that you might want to normalize them to be on the same scale: sahandha/eif#31

I think bliss already normalizes the features, see https://github.com/Polochon-street/bliss-rs/blob/master/src/chroma.rs#L32 for example.

@SimonTeixidor (Contributor, Author)

> (which is not relevant anymore), it's also in a couple of other lines in the file

Thanks. I've removed the one you spotted, I didn't spot any other such comments though... Where did you see the others?

@Polochon-street (Owner)

> Also, maybe a question for @Polochon-street, I can't remember what the features that bliss uses look like but if there are some that are on a much different scale than others this issue on the original repo points out that you might want to normalize them to be on the same scale: sahandha/eif#31
>
> I think bliss already normalizes the features, see https://github.com/Polochon-street/bliss-rs/blob/master/src/chroma.rs#L32 for example.

I can confirm normalization is done, albeit in a very crude way - so enhancing this is definitely worth exploring. From what I remember, chroma normalization is the most unsatisfying one. But yes, there is normalization!
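
(For reference, the generic technique being discussed; a minimal min-max scaling sketch, not bliss' actual normalization code.)

```rust
/// Rescale every feature column to [0, 1] so that features living on very
/// different scales don't dominate the forest's splits.
fn min_max_normalize(features: &mut [Vec<f32>]) {
    let dims = match features.first() {
        Some(f) => f.len(),
        None => return,
    };
    for d in 0..dims {
        let min = features.iter().map(|f| f[d]).fold(f32::INFINITY, f32::min);
        let max = features.iter().map(|f| f[d]).fold(f32::NEG_INFINITY, f32::max);
        if max > min {
            for f in features.iter_mut() {
                f[d] = (f[d] - min) / (max - min);
            }
        }
    }
}
```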

> Thanks. I've removed the one you spotted, I didn't spot any other such comments though... Where did you see the others?

Thanks. I can't see any other one, so I think once this is fixed #72 (comment), we're good to go!

@SimonTeixidor force-pushed the multi-track-playlist-source branch 2 times, most recently from 91787dd to f61eec1 (April 1, 2024, 15:47)
Review thread on src/playlist.rs (outdated):

```rust
// These songs contain analysis of actual music. Tracks 1-3 are a recording of Mozart's
// piano concerto no. 19, 9-11 are a recording of Mozart's piano concerto no. 23, and
// tracks 4-8 are Miles Davis' "Kind Of Blue".
let first_song = Song {
```
@Polochon-street (Owner)

nit, but I think maybe naming them mozart_training_1, mozart_training_2, miles_davis_1, etc would be a bit more self-descriptive (especially useful when they're used towards the end)

@SimonTeixidor (Contributor, Author)

Agreed, I named them.

Review thread on src/playlist.rs (outdated):

```rust
// We train the algorithm on the Mozart concertos, and the expectation is that the tracks
// from the Miles Davis record will end up last.
let opts = ForestOptions {
    n_trees: 1000,
    sample_size: 200,
    max_tree_depth: None,
    extension_level: 10,
};
closest_to_songs(
    &[&first_song, &second_song, &third_song, &ninth_song, &tenth_song, &eleventh_song],
    &mut songs,
    &opts,
);
```
@Polochon-street (Owner)

could we have the closest_to_songs call without one of the Mozart tracks, to make sure that even if it's not trained on that specific song, it still ranks better than Miles'?

Don't sweat it if it doesn't work though, I think the test is still useful enough as it is 😄

@SimonTeixidor (Contributor, Author)

Yep, that seems to work. I train on one of the concertos, and it still works for ~8k runs of the test here.

@SimonTeixidor force-pushed the multi-track-playlist-source branch from f61eec1 to e57ca4c (April 1, 2024, 18:02)
@Polochon-street (Owner)

awesome! Thanks a lot for your patience and all your fixes, merging ✌️

@Polochon-street merged commit 435284c into Polochon-street:master on Apr 1, 2024. 2 checks passed.
Polochon-street pushed a commit that referenced this pull request Apr 3, 2024
* Enable playlists based on multiple songs

This commit introduces functionality for generating playlists based on a
set of songs.

For good performance, I also introduce a new distance measure, extended
isolation distance.

While the previous distance metrics, euclidean distance and cosine
distance, could be made to measure distance to a set of songs, the
performance will not be as good.

* Remove *_by_key family of functions

Remove code duplication by making these functions generic over
AsRef<Song> instead of having separate versions.

* Get rid of circular dependency between playlist and song

The playlist module depends on the song module, which in turn depends on
the playlist module. This is confusing, and in this case also completely
unnecessary.

* Unify distance metrics

Allow euclidean distance and cosine distance to handle a set of songs as
input. In doing so, we can also remove the ability to generate playlists
from a single song, as that is just a special case of generating a
playlist from many songs anyway.

* Review comment: Improve documentation for single song use case

* Add test for ForestOption as DistanceMetricBuilder.

* Remove unnecessary double-reference

* Add missing test assertion.

* Remove comment that is no longer relevant.

* Fix extended isolation forest test case

* Document suggestions and limitations for distance metric

* cargo fmt