Face detection #342
Just tried this out - really really cool!! 🚀 What's the plan for extracting all videos, and storing faces that are the same person, and grouping the same videos together?
I'd assume something along the lines of:
That way we'd have a pre-computed list of actors, along with associated videos. Could then let the user add more metadata.
It looks like a lot of the face recognition work is done for us:
Current intention for the future features, not all will be in this PR:
These are the initial features -- should be pretty easy to implement. Once it's set and released, I will consider others. Before that's done -- I'm open to more suggestions / recommendations for changes, etc. Part of the reason I'm aiming for those 6 things is that they are easy from the code point of view (re-using the same functionality a lot). We'll see how it goes.
I'm confused about two things:
Just from my point of view, I think we should aim (not necessarily right now), for extracting all faces, grouping faces (regardless of the video they came from) within a threshold, and allowing users to name that face, selecting which will show all videos that face appears in. What are your thoughts on this? 😄 Just want to be on the same page! 📖
Re: filmstrips of faces - I actually think this would be very cool to show all the faces of one actor, but I think the storage of these should be individual files for the reasons above (multiple actors per video) - combining them in-app would be my go-to idea. It would also be more interesting I think, to shuffle up the face order, i.e. not have all the faces from video1, then video2, etc... Clicking a face would still bring up the relevant video though. 😄
At the moment I'm not doing person identification, only face extraction. I'm unsure why many of the features exist in this app -- people request things, other things I suspect are desired, and others are just easy enough to add that I might as well. Once I have person identification it may be useful to do something with that, but for now, just extracting faces is enough. With identification, as you suggest, we could then auto-tag each video with that face found. But I'm unclear how resource-heavy it is to do all these operations. Face detection is fast and straightforward, so I'm doing that as a first step.
I'd be quite interested to implement the face identification as above after this, but I'm just afraid of needing to change the underlying structure of the data (storing jpgs as filmstrips vs faces, etc.). 🤔 If you'd be okay with me changing up some of these things after this merges, I'm happy to work on this! 😄
I think you're very right that it makes sense to think through the future functionality so I don't have to break backwards compatibility when adding another feature. This is a major enough feature that I could bump VHA to version 3 (good for communicating publicly that there are big new features 🤷‍♂️ ) ... but this is beside the point -- I don't want to break compatibility with an earlier release if I can help it (and a bit of planning will help). I think before I continue with this PR, I'll try out the face recognition to see how well it works (how much time it takes, etc) ... I know Picasa was able to do it a decade ago (feature released in 2008 😱 ) so maybe it will work fast enough with this library on today's computers 🤞 All ideas and thoughts welcome 👍
I'm all for keeping the data as flexible as possible, and as far as I can tell, the best storage would be:
This would allow for the current feature (just concat all indexes for a given file hash), and allow indexing any face by video-hash and id in the future for any features. For example (pseudocode):
This possible class would allow listing videos with an actor, moving faces to correct actor groupings etc. Just keeping the future in mind! 🚀 😄
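The pseudocode from this comment is not preserved in the thread as captured here. As a rough sketch only, a class along the lines described (faces indexed by video hash and id, with movable actor groupings) might look like the following. Every name in it (`FaceEntry`, `FaceIndex`, `actorId`) is hypothetical and not taken from the app's code.

```typescript
// Hypothetical sketch: none of these names come from the Video Hub App codebase.
interface FaceEntry {
  videoHash: string; // hash of the video the face was extracted from
  id: number;        // index of the face within that video
  actorId?: number;  // actor grouping, once one has been assigned
}

class FaceIndex {
  private faces: FaceEntry[] = [];

  add(videoHash: string, id: number): void {
    this.faces.push({ videoHash, id });
  }

  // Move a face into (or between) actor groupings.
  assign(videoHash: string, id: number, actorId: number): void {
    for (const face of this.faces) {
      if (face.videoHash === videoHash && face.id === id) {
        face.actorId = actorId;
      }
    }
  }

  // All faces for one video in index order (the "concat all indexes" case).
  facesForVideo(videoHash: string): FaceEntry[] {
    return this.faces
      .filter(f => f.videoHash === videoHash)
      .sort((a, b) => a.id - b.id);
  }

  // Hashes of every video in which the given actor appears, without repeats.
  videosWithActor(actorId: number): string[] {
    const hashes: string[] = [];
    for (const f of this.faces) {
      if (f.actorId === actorId && hashes.indexOf(f.videoHash) === -1) {
        hashes.push(f.videoHash);
      }
    }
    return hashes;
  }
}
```

Keeping `actorId` optional means extraction can run first and grouping can be added (or redone) later without touching the stored faces.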
btw -- if you're still occasionally doing things with your |
I played around with this and I think I've decided that a face filmstrip is still a good idea. I'm not sure who would use it, but I suspect there may be people who would want to see a filmstrip of just faces. The way the app is set up, it can consume a filmstrip very easily, so very little code will need to be changed. I am uneasy about extracting each face as its own file, because rather than 1 strip per file, we may have as many as 100 image files (imagine a long video with many screens extracted and many faces per screen). The upshot is very little code will need to change:
I'm pretty sure this will work out well -- I'll try to get a rough draft by end of Sunday 🤞
Thinking forward to potential(!) face recognition functionality: I imagine the primary goal is to be able to ask the question "find all videos that have this person" (even if you have never manually entered the person's name anywhere). The library I'm using does facial recognition by generating a vector with 128 values: So (as far as I can tell) the app would manually have to have a database of vectors to compare against (maybe support file inside The process might be something like this:
At this point, we can drop any face (even a photo from the computer) and the app will find the face(s) that are most-likely matches to it (we manually find the closest match(es) to the picked photo). All this is tricky, and we'll have to think through all the cross-linking, so we can update things in the future (add new people, merge people and have all the references update automatically, without bugs). Lots to think through! The important part (decision at this point) is whether having a |
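To make the "find the closest match(es)" step concrete, here is a minimal sketch that ranks stored face vectors by Euclidean distance to a query vector. The `StoredFace` shape and the function names are assumptions for illustration; the thread has not settled on a storage format.

```typescript
// Hypothetical shape; the real storage format is undecided in this thread.
interface StoredFace {
  videoHash: string;
  descriptor: number[]; // 128 values per face, per the comment above
}

// Straight-line distance between two vectors of equal length.
function euclideanDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("descriptor lengths differ");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Return the `count` stored faces closest to the query, nearest first.
function closestMatches(query: number[], stored: StoredFace[], count: number) {
  return stored
    .map(face => ({ face, distance: euclideanDistance(query, face.descriptor) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, count);
}
```

This is a plain linear scan, so its cost grows with the number of stored faces; whether that is fast enough is exactly what the experiments mentioned later in the thread are meant to find out.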
It would be a great bonus if the resulting set of identified people (map from vector to name) can be stored separately / exported. This way, there will not be a need to re-label the same individuals in another hub! The user interface for tagging people, merging faces that are of the same person, and telling the app that some faces don't belong is going to be very challenging 😓
I am hoping the face recognition process may avoid having to do clustering of any kind. The comment above the previous one might not describe the process we'll end up using. This routine works only if it's sensible to (experimentally find and) use some sort of a rough cutoff for what face is "close enough" that the app considers the photo to be of the same person as found before. Perhaps, we will simply go through each video, one at a time, following some version of this routine:
At this point we'll have a new folder Eventually the user will be able to add names to each 'person' and merge people (by naming them the same name perhaps). When merging, we will just go through the ImageElement list and update the "persons" array with new number (e.g. The global person list will perhaps have a shape like this:
This way it would be easy to find all videos with a particular person -- just show all the videos in the 'videos' array.
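The exact shape proposed for the global person list is not preserved in this thread. As a sketch under that caveat, a person list with a `videos` array plus a merge step that rewrites each element's `persons` array could look like this. `Person`, `ImageElementLike`, and `mergePersons` are hypothetical names; only the `videos` and `persons` arrays are taken from the comments above.

```typescript
// Hypothetical shapes, not the app's actual interfaces.
interface Person {
  id: number;
  name: string;
  videos: string[]; // hashes of the videos this person appears in
}

interface ImageElementLike {
  hash: string;
  persons: number[]; // ids of the people found in this video
}

// Merge person `fromId` into `intoId`: union the video lists, rewrite
// references on every element, and return the list without `fromId`.
function mergePersons(
  people: Person[],
  elements: ImageElementLike[],
  fromId: number,
  intoId: number
): Person[] {
  const from = people.filter(p => p.id === fromId)[0];
  const into = people.filter(p => p.id === intoId)[0];
  if (!from || !into || fromId === intoId) {
    return people; // nothing to merge
  }
  for (const hash of from.videos) {
    if (into.videos.indexOf(hash) === -1) {
      into.videos.push(hash);
    }
  }
  for (const el of elements) {
    const updated: number[] = [];
    for (const id of el.persons) {
      const mapped = id === fromId ? intoId : id;
      if (updated.indexOf(mapped) === -1) {
        updated.push(mapped);
      }
    }
    el.persons = updated;
  }
  return people.filter(p => p.id !== fromId);
}
```

Finding all videos for a person is then just reading that person's `videos` array, as described above.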
I feel like this is getting a little overcomplicated - we should keep it as simple as possible. 😄 I strongly suggest we do one of these rather than storing a filmstrip:
I have two reasons for this:
And lastly, along with either of the options above, I suggest we store all facial recognition data to disk, with the corresponding face (database, files or otherwise). I know it might seem wasteful, but I am 100% for thinking ahead for the future, and this way, any new features will already have the facial data computed from the first extraction. There's no point filtering the data now, if we don't know exactly what we'll want in the future. And 128 numbers isn't exactly a lot of data these days... 🤣 Imagine we implement facial recognition using algorithm X, but then realise that algorithm Y is vastly superior in results and runtime. If we have stored all the data we extracted the first time, it's just a simple reprocessing of this data (which would take seconds), rather than needing to increment a major version number and remove backwards compatibility. I know it's more difficult to do, but implementing your feature by stitching together X face jpgs will be much better in the long run, rather than deciding a format now that is restrictive, and either hacking solutions later, or risking losing compatibility. Hope that makes sense! 😄 👍
Actually, now that I think about it some more, why don't we just store the face coordinates for each filmstrip, and the recognition data? 🤔 For now, you could just have a class which stores:
Then it should be easy to implement your feature, and we don't have to store any new images! 💡 Just for each filmstrip in "face mode" - when you scroll across the frame, just use CSS-magic to zoom into each face, like we do along the X axis to view each screenshot? And to find the faces for each filmstrip, just filter the above class by video hash, and order the resulting faces by X value, followed by Y value! 😄
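The list of fields for that class is not preserved here, so the following is only a guess at a minimal version: one record per face holding its position within the filmstrip and its recognition vector, plus the "filter by video hash, order by X then Y" lookup described above. All field and function names are hypothetical.

```typescript
// Hypothetical record; the field list is assumed, not taken from the thread.
interface FaceRecord {
  videoHash: string;
  x: number;            // left edge of the face within the filmstrip, in pixels
  y: number;            // top edge of the face within the filmstrip, in pixels
  width: number;
  height: number;
  descriptor: number[]; // recognition data for this face
}

// Faces belonging to one filmstrip, ordered by X value, then by Y value.
function facesForFilmstrip(all: FaceRecord[], videoHash: string): FaceRecord[] {
  return all
    .filter(f => f.videoHash === videoHash) // filter() copies, so `all` is not reordered
    .sort((a, b) => a.x - b.x || a.y - b.y);
}
```

Because only coordinates are stored, the existing filmstrip image is reused and "face mode" becomes a matter of cropping or zooming to each record's rectangle in the UI.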
Thank you for your thoughts 🤝 Thank you for clarifying which features we are discussing. The facestrip idea is easy to release because I already have most of the code done. What you were correctly pointing out is we don't want to get into a dead-end and have to undo things for the face recognition feature. What I describe above (in the last comment) is an attempt to spell out the details of how the code would run. I'm currently unsure how face recognition would work (seems like the library we use only gives us vectors per face, and we need to compare the Euclidean distance to check if the face is similar to anything in our database). So we'll need some 'database'. Face coordinates (face vectors) need to be extracted only once, and once we know which person is in the video, we can just store the person's I feel like once a person exists (we have face coordinates for the person), when in an appropriate view (perhaps Details View?) each video will have a headshot of all people found in the video. Different videos will have the same headshot shown (so each person will have a single .jpg assigned to them). So we don't need to have a .jpg for every face found -- only one per person. I'm unsure (will run experiments today) how quickly we can compare a face against, say, 1,000 'persons' ... this may change how I think about the problem. I'll write more later (having lunch now), I know I've not responded to everything you said yet 👍
Important note: looks like if we just do face detection, many faces get detected, but if we run the same script but ask the face-api.js library to also generate the 128-number face vector, it detects fewer faces (probably because they are not aligned well enough or something). This may mean that we might want to do two passes -- one for just detecting faces for a good filmstrip, and a second pass for just facial recognition. I have the code now that will extract the largest image for the single-face preview for each video file too: whyboris/extract-faces-node#2 I'm still experimenting to see how well simple vector comparison works (speed etc -- may need to change my approach to the whole feature depending on how things go). I'll keep everyone posted here 🙆‍♂️
I have some experience with face recognition so I'm sharing my thoughts hoping they will be of use.
In my experiments this is possible. I normally use a threshold of 0.6 for Euclidean distance (so
I think averaging the embeddings might be a bad idea. Since a face can have many different instances due to lighting, position, rotation, wearing glasses vs not wearing glasses, etc, it's possible for the same face to have widely different embeddings. I would therefore recommend saving all the embeddings separately so it's easier to find matches. A downside is the additional processing required when matching, but calculating Euclidean distance is not that costly. In my face recognition applications I save every embedding and when a new image is processed I compare the extracted embeddings with every embedding in the database (no performance issues for ~2000 embeddings, but YMMV). For videos the number of embeddings is of course a lot bigger so maybe in the end you have to cluster the embeddings somehow for performance reasons. Still, I would then refrain from clustering to a single embedding per person, but instead save multiple. Looking forward to seeing more progress on this! Would be amazing to filter videos by person!
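A small sketch of the approach recommended above: keep several embeddings per person, compare a new embedding against every stored one, and accept the nearest only if it falls under the cutoff. The 0.6 value comes from the comment above; treating a distance below it as "same person" is an assumption here, and `LabeledEmbedding` and `identify` are hypothetical names.

```typescript
// Hypothetical shape: several embeddings may share one personId.
interface LabeledEmbedding {
  personId: number;
  embedding: number[];
}

// Cutoff from the comment above; assumed to mean "below this is a match".
const MATCH_THRESHOLD = 0.6;

function distance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Compare against every stored embedding; return the person owning the
// nearest one if it is within the threshold, otherwise null (unknown face).
function identify(
  query: number[],
  db: LabeledEmbedding[],
  threshold: number = MATCH_THRESHOLD
): number | null {
  let bestId: number | null = null;
  let best = Infinity;
  for (const entry of db) {
    const d = distance(query, entry.embedding);
    if (d < best) {
      best = d;
      bestId = entry.personId;
    }
  }
  return best < threshold ? bestId : null;
}
```

Storing several embeddings per person lets a face match whichever stored variant (lighting, angle, glasses) it happens to be closest to, at the cost of one distance computation per stored embedding.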
Thank you Steven! I came across the I wish averaging embedding was a good idea -- in my short experiment, I made pairwise comparisons of 5 photos of the same person from video frames I extracted (10 total comparisons) and only 3 were below I compared the average of one person's face to an average of another person's face, they were different. I'll experiment with this some more when I have time to see how much mileage I'll get out of it, even against your warning 😅 In case of 'averaging' photos, when you do that with different people, you'll get an 'average person', but I'm hoping that averaging vectors does something different 🤞 especially since I'm aiming to average across the same person 🤷‍♂️ All feedback/comments/advice is very welcome 🙇 My current experiment continues here: whyboris/extract-faces-node#2
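For anyone wanting to repeat this kind of experiment, here is a minimal sketch of the two operations involved: all pairwise distances within a set of descriptors, and the element-wise average of the set. The function names are made up for illustration and this is not the code from the linked experiment.

```typescript
function euclideanDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Distance for every unordered pair: n descriptors give n * (n - 1) / 2 values.
function pairwiseDistances(descriptors: number[][]): number[] {
  const out: number[] = [];
  for (let i = 0; i < descriptors.length; i++) {
    for (let j = i + 1; j < descriptors.length; j++) {
      out.push(euclideanDistance(descriptors[i], descriptors[j]));
    }
  }
  return out;
}

// Element-wise mean of equally long descriptors (the "average face" vector).
function meanDescriptor(descriptors: number[][]): number[] {
  if (descriptors.length === 0) {
    throw new Error("need at least one descriptor");
  }
  const mean: number[] = [];
  for (let i = 0; i < descriptors[0].length; i++) {
    let sum = 0;
    for (const d of descriptors) {
      sum += d[i];
    }
    mean.push(sum / descriptors.length);
  }
  return mean;
}
```

Five descriptors give 5 × 4 / 2 = 10 pairs, which matches the 10 comparisons mentioned above. Comparing `meanDescriptor` outputs for two people with `euclideanDistance` reproduces the "average vs average" check.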
I've not forgotten about this feature -- I just want to finish releasing version |
May as well note here: the vectorizations of faces will be stored in a separate |
🤞 Hoping to resume this feature in 2024 🤞
Face detection and extraction works 😎
App builds without crashing -- 125MB file 👌 not too big.
Currently built app doesn't seem to extract photos because the weights folder needs to be moved elsewhere within the installed directory (e.g. Program Files/Video Hub App 2/...
I'll update the branch later today once I fix that up.
Will close #341
🚀 Will release version 2.2.0 once this feature is done
update: this feature got delayed to release version 3.0.0 🤷 ... I might get back to it later this year 😅