Outlier detection #120
This can be used by biigle/maia, too.
Adding annotation feature vectors poses a problem with the current approach of a separate vector database (for MAIA). If the vector database gets an added So I'm now thinking about putting all the feature vectors back into the regular database. This way we could use joins and foreign key constraints to get label IDs without duplication and also automatically delete items when they are no longer needed. Annotation feature vectors can be added/modified/deleted with the same logic that handles the annotation thumbnails. The main reason to separate the vector database from the main database was that the backups could be separated, too. I don't want to frequently back up a 100 GB database every 10 minutes. So I'm now experimenting with |
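As a rough illustration of the join and foreign key idea above, here is a minimal sketch. It uses SQLite from the Python standard library as a stand-in for the real database, and all table and column names (`annotations`, `annotation_labels`, `feature_vectors`) are hypothetical, not the actual schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE annotations (id INTEGER PRIMARY KEY);
        CREATE TABLE annotation_labels (
            id INTEGER PRIMARY KEY,
            annotation_id INTEGER NOT NULL
                REFERENCES annotations(id) ON DELETE CASCADE,
            label_id INTEGER NOT NULL
        );
        -- One vector per annotation; label IDs are NOT duplicated here.
        CREATE TABLE feature_vectors (
            annotation_id INTEGER PRIMARY KEY
                REFERENCES annotations(id) ON DELETE CASCADE,
            vector TEXT NOT NULL
        );
        """
    )
    return conn


def seed_demo_rows(conn):
    """Insert two annotations, three labels and two placeholder vectors."""
    conn.executemany("INSERT INTO annotations (id) VALUES (?)", [(1,), (2,)])
    conn.executemany(
        "INSERT INTO annotation_labels (annotation_id, label_id) VALUES (?, ?)",
        [(1, 10), (1, 11), (2, 20)],
    )
    conn.executemany(
        "INSERT INTO feature_vectors (annotation_id, vector) VALUES (?, ?)",
        [(1, "[0.1, 0.2]"), (2, "[0.3, 0.4]")],
    )
    conn.commit()


def label_ids_by_vector(conn):
    """Get the label IDs of each feature vector with a join."""
    rows = conn.execute(
        """
        SELECT fv.annotation_id, al.label_id
        FROM feature_vectors AS fv
        JOIN annotation_labels AS al ON al.annotation_id = fv.annotation_id
        ORDER BY fv.annotation_id, al.label_id
        """
    )
    return rows.fetchall()


def delete_annotation(conn, annotation_id):
    """Deleting the annotation removes its labels and vector via the cascade."""
    conn.execute("DELETE FROM annotations WHERE id = ?", (annotation_id,))
    conn.commit()
```

With this layout, deleting an annotation needs no extra cleanup code for the feature vectors, which is the property the comment above is after.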
Dumping with I'll now try a combination of |
Here is a possible strategy to migrate the existing setup:
Here is what I found:
The migration to create the vector extension had a later timestamp but it needs to be executed before.
If there were duplicates, the data would be returned as object instead of the expected array.
The observers would also fire when a new annotation was created. In this case the copy feature vector job should not be dispatched. Now with the event, the copy job is only dispatched when a label is attached to an existing annotation.
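The difference between the observer and the event can be sketched roughly as follows. This is a simplified Python model, not the actual implementation: the `Dispatcher` class and the function names are hypothetical, and only the `CopyFeatureVector` job name is taken from this thread.

```python
class Dispatcher:
    """Hypothetical stand-in for a job queue; it only records dispatched jobs."""

    def __init__(self):
        self.dispatched = []

    def dispatch(self, job, annotation_id, label_id):
        self.dispatched.append((job, annotation_id, label_id))


def on_label_attached(dispatcher, annotation_id, label_id):
    """Event listener: copy the existing feature vector for the new label."""
    dispatcher.dispatch("CopyFeatureVector", annotation_id, label_id)


def create_annotation(dispatcher, annotation_id, label_id):
    """Create a new annotation together with its first label.

    No event is emitted here, so no copy job is dispatched. A model
    observer on the label would have fired in this case, too.
    """
    return {"id": annotation_id, "label_ids": [label_id]}


def attach_label(dispatcher, annotation, label_id):
    """Attach a label to an existing annotation and emit the event."""
    annotation["label_ids"].append(label_id)
    on_label_attached(dispatcher, annotation["id"], label_id)
```

The event is emitted explicitly from the one code path that needs it, whereas an observer cannot tell the two situations apart without extra checks.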
Keep SVG generation call in GenerateImage/VideoAnnotationPatch due to changes in #120.
Resolves #88
Notes:
- Update the feature vector Python script to enable a "single-file" mode where it reads a single file and outputs the feature vector to stdout. This can be used in GenerateAnnotationPatch to avoid reading and writing additional files.
- Maybe leave the implementation with the CSV file exclusive to MAIA after all? Remove the Trait again.
- `sim-sort-thumbs` branch. This method uses the approximated bounding box of the annotation instead of the whole thumbnail to generate FV. Maybe this can be used to generate FV for remote volumes whereas we generate from original files for locally stored data. It's hard to determine how well the sorting works with this as it looks ok but is not identical to the sorting based on original files.
- `generate-missing` command that submits one job per file. The command should be made more intelligent so it checks for missing data file by file and groups submitted jobs (with `$only` annotations) by file.
- `sim-sort-thumbs` seem to work quite well compared to the "real thing" (I was finally able to compare the sorting on real data). So we can use this to initialize all remote volumes.
- `generate-missing` with the new "ProcessAnnotatedFile" jobs.
- `sim-sort-thumbs`.
- Implement synchronization between the regular database and the vector database: if an annotation changes, update all feature vectors of the annotation; if a label changes, add/remove a feature vector (there can be several places where this happens).
- Enable index from the beginning so it doesn't take long to compute for LabelBOT.
- Unclear how the index should work (with partitioned tables etc.)
- Make call to CopyFeatureVector in Largo save controller more efficient (copy in batches with insert?)
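Two of the notes above (grouping the submitted jobs by file, and copying in batches) can be sketched as follows. This is a minimal Python illustration under assumptions: the function names, the job payload shape with `file_id` and `only` keys, and the batch size are all hypothetical and not taken from the actual code.

```python
from collections import defaultdict
from itertools import islice


def group_missing_by_file(missing):
    """Group annotations with missing data into one job payload per file.

    `missing` is an iterable of (file_id, annotation_id) pairs. Each
    returned payload lists only the annotations that need processing,
    mirroring the idea of a per-file job restricted to some annotations.
    """
    grouped = defaultdict(list)
    for file_id, annotation_id in missing:
        grouped[file_id].append(annotation_id)

    return [
        {"file_id": file_id, "only": sorted(annotation_ids)}
        for file_id, annotation_ids in sorted(grouped.items())
    ]


def batched(items, size):
    """Yield lists of at most `size` items, e.g. for one insert per batch."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk
```

For example, `group_missing_by_file([(2, 5), (1, 3), (2, 4)])` yields one payload for file 1 and one for file 2, and `batched(ids, 1000)` would let a copy run as a few bulk inserts instead of one query per feature vector.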