-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Playlists from multiple tracks #72
Playlists from multiple tracks #72
Conversation
This commit introduces functionality for generating playlist based on a set of songs. For good performance, I also introduce a new distance measure, extended isolation distance. While the previous distance metrics, euclidean distance and cosine distance, could be made to measure distance to a set of songs, the performance will not be as good.
Remove code duplication by making these functions generic over AsRef<Song> instead of having separate versions.
The playlist module depends on the song module, which in turn depends on the playlist module. This is confusing, and in this case also completely unecessary.
Allow euclidean distance and cosine distance to handle a set of songs as input. In doing so, we can also remove the ability to generate playlists from a single song, as that is just a special case of generating a playlist from many songs anyway.
Hi! Thanks a lot for this contribution, it would bring a really really nice feature to bliss. I would also be very much interested in you submitting the blissify's counterpart, as I'd be very happy to try it out! Would it be possible to split this PR into 4 PRs, one for each commit? It'd be easier for me to review, and merging them one by one avoids the whole PR having trouble getting through if there are small blocking questions here and there :) Side-note - do you mind if I add your github handle to the CHANGELOG, so you're credited properly? Cheers! |
Here's what I'm running locally for blissify: Polochon-street/blissify-rs#53. I didn't really go through those changes as thoroughly, so please see it more as WIP. I think there may be dependencies between the commits, so I'd prefer not split them into separate pull requests. Maybe you can have a look, and if there is something you disagree with I'll make an effort to break that out? Sure, feel free to add my handle to the changelog :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, just did a short review :) I'll take a better look later today, but I think it's a good start.
Thanks for submitting the blissify PR also, I'll take a look at it :)
Cheers!
I had some more time to review, and it looks great! I.e. if a player has a song struct like
then all of our |
Hmm, but how would you use the |
That's a fair point :D then I think these comments need to go https://github.com/Polochon-street/bliss-rs/pull/72/files#diff-294648967369087cf3619793bd50043a2b39b5f481523d7c1636cc3e8d572a5bR110-R111, and this needs fixing #72 (comment), and we're good to go! |
Sorry, that link is not working for me (it just takes me to the while diff of this PR).
I pushed a test case. |
oh, my bad, it's because it doesn't display the diff of
(which is not relevant anymore), it's also in a couple of other lines in the file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohhh, exciting!
I went though this with the goal of understanding what this new distance metric was so I could reason about it. So I have some questions on that if you wouldn't mind humouring me. I haven't actually pulled down the code and starting experimenting yet, but hopefully today or tomorrow I will have time to.
Would you mind explaining a little more about how the isolation forest works in this case? As I understand it it partitions the points in the training data into a set of partitions, encoded as a b-tree, with the goal of isolating the maximum number of points, and it does this for multiple trees as a monte carlo method to get a better fit. Then to score data it puts the point being scored into each tree and sees how many nested partitions it is in (eg its depth in the tree).
Questions:
- in the case which I assume is the common case where the training set is much more closely related than the whole library (eg you select a few similar songs to create the playlist from) how would the distribution of points and partitions look like here? I assume the partitions would be clustered around the selected songs and the vast majority of the library would be lightly partition (eg "anomalies"), does that sound right?
- regarding the forest option
sample_size
, I see we are usingmin(selected_songs.len(), 200)
, where 200 is the default sample size from upstream. The paper talks about subsampling to avoid swamping and masking. I assume we don't need to worry about that since we are using a specialised sample and the majority of the points we are going to score are going to be anomalies? And we should make sure all of the selected songs are represented in the initial partitioning for, well for the obvious reason of having them all included in the initial partitioning exercise. - regarding the forest option
n_trees
I have two questions as I'm having a bit of trouble reasoning about this
a) what are the pros and cons of having a large number of trees (the default is 150) for a small data set like a handful of songs? Are a lot of them essentially redundant and we can use a smaller set to speed things up? Or maybe they aren't redundant and there will be many short trees with wildly different partitions?
b) does/should the number of trees have any relation to the number of dimensions the points are in? Or the number of features in this case. It seems in a non-extended isolation forest they have one tree per feature, and in this extended one where each partition is in multiple dimensions I think they keep using multiple trees to account for the larger option space available for these partitions. In the paper they say "path lengths usually converge well before 100 trees", and their test for that looks to be over a 135 point gaussian distribution, which may or may not apply in this case. If you don't have any thoughts on this and are just sticking with the defaults, I guess my motivating question would be whether playlist generation is significantly faster with less trees or not?
I would be interested in seeing a visualization of what the partitioning looked like in a small 2d dataset. Like if I just chose two songs would this make it degrade so that that there was a very small partitioned area between the points and everything outside of that, even if they were spatially close to one of the two points, had an equal path length? And if we had five points spaced equally apart (like the points of a star) would a point at the center of the cluster be treated as more normal or more anomalous compared to a point the same distance away from one of the outer points in the cluster, but outside of the cluster? Or even between to points on the edge of the cluster?
Also, maybe a question for @Polochon-street, I can't remember what the features that bliss uses look like but if there are some that are on a much different scale than others this issue on the original repo points out that you might want to normalize them to be on the same scale: sahandha/eif#31
for FunctionDistanceMetric<'a, F> | ||
{ | ||
fn distance(&self, vector: &Array1<f32>) -> f32 { | ||
self.state.iter().map(|v| (self.func)(v, vector)).sum() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand this correctly (and I'm not sure I do, the signatures of these generic trait implementations are a bit much for a rust neophyte like me) this is going to compute the distance from vector
(a multidimensional point) to each of the points saved in state
at initialization time, and then sum them.
Is sum
the right operation to be combining these individual distances? I guess I naively would expect some kind of average. Maybe it doesn't matter here? Or is this one of those situations where there are multiple ways to do it with their own arguments for and against and if you want something different you should implement your own builder?
Edit: I tested and I don't think it makes a difference
nit: I think some of these variables could have more meaningful names, "vector" and "state" are pretty generic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have understood the implementation correctly. Given a set of points stored in state
, this implementation will return the sum of the distances between vector
and each point in state
. Only for distance metrics that measure the distance between two points though, so this implementation is not relevant for the extended isolation forest metric or other metrics that can measure the distance between vector
and state
directly.
That works out to be the same as the average (mean) distance, just off by a factor 1/state.len()
which shouldn't matter for sorting. I guess the median distance could be interesting to look at though?
Another approach is to compute some kind of centroid of state
(perhaps that's what you mean by the average?), and then measure the distance to that. No idea if that gives better results, but perhaps something worth experimenting with?
In general though, for the case where state
contains more than one item, I would avoid using euclidean and cosine distance, so I didn't really attempt to optimize this case.
n32(s1.custom_distance(&s2, &distance)) < distance_threshold.unwrap_or(0.05) | ||
let s1 = s1.as_ref(); | ||
let s2 = s2.as_ref(); | ||
let vector = [s1.analysis.as_arr1()]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, any thoughts as to how well an isolation forest will perform here with just one input point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, probably best to use one of the other distance metrics here.
I'm definitely no expert on the extended isolation forest. I set out to find some suitable metric, and somehow I bumped into this one which already had a Rust implementation. In general, I wanted a metric that:
So, for your questions, I think your guess is mostly as good as mine :)
I have no idea what the common case is, to be honest. I guess if you run this system on a large and diverse catalogue (say, the entire catalogue of spotify) and select something quite specific (say jazz piano trios), it would be true that the vast majority of the library would be anomalies. But I'm guessing that most private collections are much more specialised, so the same training set could actually match a large portion of the library instead?
These are good questions, and worth exploring. Unless I made a mistake, I don't think I set any of these options in this library though. We just accept a
I'm also interested in this, but haven't gotten around to experimenting with it yet. I can ping you here if I make something.
I think bliss already normalizes the features, see https://github.com/Polochon-street/bliss-rs/blob/master/src/chroma.rs#L32 for example. |
Thanks. I've removed the one you spotted, I didn't spot any other such comments though... Where did you see the others? |
I can confirm normalization is done, albeit in a very crude way - so enhancing this is definitely worth exploring. From what I remember, chroma normalization is the most unsatisfying one. But yes, there is normalization!
Thanks. I can't see any other one, so I think once this is fixed #72 (comment), we're good to go! |
91787dd
to
f61eec1
Compare
src/playlist.rs
Outdated
// These songs contains analysis of actual music. Tracks 1-3 are recording of Mozart's | ||
// piano concerto no. 19, 9-11 are a recording of Mozart's piano concerto no. 23, and | ||
// tracks 4-8 are Miles Davis' "Kind Of Blue". | ||
let first_song = Song { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, but I think maybe naming them mozart_training_1
, mozart_training_2
, miles_davis_1
, etc would be a bit more self-descriptive (especially useful when they're used towards the end)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I named them.
src/playlist.rs
Outdated
// We train the algorithm on the Mozart concertos, and the expectation is that the tracks | ||
// from the Miles Davis record will end up last. | ||
let opts = ForestOptions { n_trees: 1000, sample_size: 200, max_tree_depth: None, extension_level: 10 }; | ||
closest_to_songs(&[&first_song, &second_song, &third_song, &ninth_song, &tenth_song, &eleventh_song], &mut songs, &opts); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we have the closest_to_songs
call without one of the Mozart tracks, to make sure that even if it's not trained on that specific song, it still ranks better than Miles'?
Don't sweat it if it doesn't work though, I think the test is still useful enough as it is 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that seems to work. I train on one of the concertos, and it still works for ~8k runs of the test here.
f61eec1
to
e57ca4c
Compare
awesome! Thanks a lot for your patience and all your fixes, merging ✌️ |
* Enable playlists based on multiple songs This commit introduces functionality for generating playlist based on a set of songs. For good performance, I also introduce a new distance measure, extended isolation distance. While the previous distance metrics, euclidean distance and cosine distance, could be made to measure distance to a set of songs, the performance will not be as good. * Remove *_by_key family of functions Remove code duplication by making these functions generic over AsRef<Song> instead of having separate versions. * Get rid of circular dependency between playlist and song The playlist module depends on the song module, which in turn depends on the playlist module. This is confusing, and in this case also completely unecessary. * Unify distance metrics Allow euclidean distance and cosine distance to handle a set of songs as input. In doing so, we can also remove the ability to generate playlists from a single song, as that is just a special case of generating a playlist from many songs anyway. * Review comment: Improve documentation for single song use case * Add test for ForestOption as DistanceMetricBuilder. * Remove unecessary double-reference * Add missing test assertion. * Remove comment that is no longer relevant. * Fix extended isolation forest test case * Document suggestions and limitations for distance metric * cargo fmt
* Enable playlists based on multiple songs This commit introduces functionality for generating playlist based on a set of songs. For good performance, I also introduce a new distance measure, extended isolation distance. While the previous distance metrics, euclidean distance and cosine distance, could be made to measure distance to a set of songs, the performance will not be as good. * Remove *_by_key family of functions Remove code duplication by making these functions generic over AsRef<Song> instead of having separate versions. * Get rid of circular dependency between playlist and song The playlist module depends on the song module, which in turn depends on the playlist module. This is confusing, and in this case also completely unecessary. * Unify distance metrics Allow euclidean distance and cosine distance to handle a set of songs as input. In doing so, we can also remove the ability to generate playlists from a single song, as that is just a special case of generating a playlist from many songs anyway. * Review comment: Improve documentation for single song use case * Add test for ForestOption as DistanceMetricBuilder. * Remove unecessary double-reference * Add missing test assertion. * Remove comment that is no longer relevant. * Fix extended isolation forest test case * Document suggestions and limitations for distance metric * cargo fmt
Hi! I found blissify a few weeks ago, and it's pretty great. Thanks for developing this!
One feature that I was missing (and I see it's been discussed on this issue tracker, #54) is the ability to create playlists from multiple tracks, so I decided to implement that. While doing that, I also ended up making some small unrelated changes.
This PR contains 4 commits:
Enable playlists based on multiple songs
This commit introduces a new distance metric, using the extended isolation forest algorithm, which allows us to create playlists from a set of songs instead of from a single song. From what I can tell, this kind of algorithm is usually used to detect outliers, but here I use it in the opposite way (that is, find the songs that are least likely to be outliers). So far, the results seem very good in my opinion.
Remove
*_by_key
family of functionsA lot of functionality in this library was duplicated, with functions like
foo
andfoo_by_key
doing basically the same thing. This commit gets rid of this duplication by making the functions generic overAsRef<Song>
, and by implementing that trait forLibrarySong<Song>
.Get rid of circular dependency between playlist and song
I didn't like how the playlist module depends on the song module, and the other way around. This kind of circular dependency is a bit confusing, and in this case it was only used to provide some helper functions on
Song
. In my opinion it's better to get rid of those helper functions.Unify distance metrics
Make functions in the playlist module generic over a new trait,
DistanceMetricBuilder
. This allows us to deduplicate code as there is no longer any need to treat distance functions that operate on a single pair of songs (like euclidean distance) any different from distance functions that operate on sets of songs (like extended isolation forest). For metrics that operate on a single set of songs, we use the sum of distances when those metrics are used to measure the distance to a set of songs.I've been using a local build of blissify with these changes applied for a little while, and it seems to work so far. I'll be happy to make a PR to blissify as well if you would like to merge these changes. Let me know what you think.