-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement image embedding API #19
Comments
Perhaps it’s time to revisit this (post icassp)? |
Yes. Here are my thoughts for how we could address this without significant change to the API. Basically, for all of the functions we'd add a
This makes it backwards compatible while allowing for the image embeddings to be extracted. My fear is that adding Thoughts? |
I agree this would be the most backwards compatible option, but I'm not convinced it's better than breaking the current API in favor of transitioning to separate functions for audio/image. My concern with maintaining the current API is that it can create a lot of ambiguity as to which embedding is being extracted, and having shared parameters but with different allowed values only makes things worse. The reality is that many users don't read the docs and just explore the API via trial/error, and my intuition is that having separate The major downside to this is that we break backwards compatibility, but that's what version control is for :) We can just make a new release (we can even bump to 1.0.0 if we want to strictly follow the rules of semantic versioning), and there shouldn't be too much confusion. Our user base is still relatively small, so now's a good time to make breaking changes. I'm not too worried about the API being cluttered, it's 8 function calls instead of 4 which is still tiny, and if every pair of functions is distinguished by having ...unless we want to have built in support for processing video directly (and returning both the audio and image embeddings from a video file). This could be supported e.g. by a 3rd family of function calls for Thoughts? |
You raise a good point. Maybe backwards compatibility isn't a very important consideration at this point. Sure, let's refactor under a new v1.0.0 release with the separate I guess with a new v1.0.0 release in mind, we should consider any other possible API changes we might want to include. Maybe we should include the more flexible pooling size interface in @bensteers's PR. I've received a few questions about including the whole AVC model. Though I'm not super convinced if this something we should worry about right now. Regarding video, this would be a nice feature to have I think. If we're going to have a 1.0.0 release I think this would be a good thing to include. We can reuse some of the |
OK, sounds like we're in agreement about revamping the API for a 1.0.0 release. I think we should spend a moment on nomenclature, especially if we want to support audio, video, and joint audio-visual processing. For audio-specific functions, I think it's a no brainer, and the
It becomes trickier for the visual frame embeddings: we could use either So if we ignore my beef, we could use The final question is how to name methods that extract both from a video file. If we use Thoughts? |
I'm on board changing I'm more on board with using Thoughts? |
ok cool, I think it'll be clearer and also cleaner to have only 1 method start with get, especially once we have 3 copies of each method.
So, in principle I agree, but... there's a difference between an image embedding (learned on images, e.g. ImageNet) and a video-frame embedding (learned from video frames). The literature suggests these aren't the same and don't necessary generalize well from one to the other. So the problem with using
For API consistency I'd be in favor of implementing all 4 methods for video, even if some of them just end up being wrappers around get_audio and get_image under the hood. I think it's cleaner and more consistent that way (and probably more streamlined for those interested in extracting both embeddings). |
Okay, one more question. For processing videos, do we want an option to have a different hop size for the audio than the "hop size" of the images (effectively the framerate)? |
Hmm without actually asking people, a priori my inclination is not to support different hop sizes for processing videos. I.e., if you're processing a video, you'll get time-aligned audio/video embeddings controlled via a single hop size parameter. If you want separate hop size then it makes sense to require the user to use the separate |
Also, if we're doing a major revision, would it be a good moment to consider #22 ? |
I think we can still forgo an OOP interface before getting more feedback from users. |
Sure, sounds reasonable to me. I've asked around about naming. People responded that |
Another comment was that we might want to avoid |
Something else to consider is a batch processing mode. i.e. making more efficient use of the GPU by predicting multiple files at once. I'm putting together something like this for BirdVoxClassify, so I can adapt it for openl3. |
That seems like a different issue though, right? Worth including for a major revision, but perhaps we should open a new issue for this. |
Deferred to #26 |
Proposed updated API:
Thoughts? |
Generally I think this looks great. A couple of comments:
|
Yeah, the issue of diverging nomenclature was why I opted for
What would we expect |
Re: 1, I'm not a big fan of Re: 2, ok, fair enough. |
I'd prefer |
Regarding videos, I've been thinking about whether we should use opencv which is maybe overkill but is the most popular library for video processing. However, in order to load audio from the videos we'll probably need to use |
|
OK, let's do that then. Regarding video loading, I think the lighter we can keep the repo the better. Whatever our dependencies are people will install them (or rather, pip will install them), so I don't think it matters too much if the dependency is super popular, what matters most is that it's (1) light and (2) well supported (with expectation that it'll continue to be supported long term). |
FYI: being addressed in #29 |
Closed by #37. |
Add the image embedding API to the library. This should be fairly similar to the existing audio API. I'll add a candidate interface once I've given it more thought.
The text was updated successfully, but these errors were encountered: