The line in question: `dejavu/dejavu/fingerprint.py`, line 98 in d2b8761.
The maximum filter loops over all positions and, at each one, takes the maximum of all values in a diamond-like area centered at that position. The time complexity of the naive implementation is O(N*M*W^2), where N is the number of windows, M is the number of frequency ranges, and W is the window size (`PEAK_NEIGHBORHOOD_SIZE` = 20).
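For illustration, the naive computation is roughly the following sketch (hypothetical code, not dejavu's actual implementation):

```python
import numpy as np

def naive_diamond_max(arr2D, w):
    """Maximum filter with a diamond (L1-ball) footprint of radius w.

    Every position scans its whole O(w^2) neighborhood, so the total
    work for an N x M spectrogram is O(N * M * w^2).
    """
    n, m = arr2D.shape
    out = np.empty_like(arr2D)
    for i in range(n):
        for j in range(m):
            best = arr2D[i, j]
            for di in range(-w, w + 1):
                for dj in range(-w, w + 1):
                    # Diamond footprint: |di| + |dj| <= w, clipped to the array.
                    if abs(di) + abs(dj) <= w and 0 <= i + di < n and 0 <= j + dj < m:
                        best = max(best, arr2D[i + di, j + dj])
            out[i, j] = best
    return out
```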
Using a data structure known as a monotonic queue, it should be possible to reduce the time complexity to linear, i.e. O(N*M), dividing the filtering time by up to W^2 = 400.
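For reference, here is a minimal sketch of the 1-D sliding-window maximum with a monotonic queue (illustrative code, not part of this PR; the function name and boundary handling are my own):

```python
from collections import deque

def sliding_window_max(values, w):
    """Maximum over the centered window [j - w, j + w] for every j, in O(len(values)).

    The deque stores indices whose values form a decreasing sequence,
    so the front is always the index of the current window's maximum.
    """
    n = len(values)
    out = [0] * n
    dq = deque()
    for i in range(n + w):
        if i < n:
            # Pop smaller trailing values: they can never be a maximum again.
            while dq and values[dq[-1]] <= values[i]:
                dq.pop()
            dq.append(i)
        j = i - w  # the window centered at j is complete once i == j + w
        if j >= 0:
            # Drop indices that slid out of the window's left edge.
            while dq[0] < j - w:
                dq.popleft()
            out[j] = values[dq[0]]
    return out

print(sliding_window_max([1, 3, 2, 5, 4], 1))  # [3, 3, 5, 5, 5]
```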
The algorithm has been implemented in scipy as `scipy.ndimage.filters.maximum_filter1d`. By applying it twice, first along the X axis and then along the Y axis, we get the maximum over a rectangle.
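As a rough sketch of the two-pass idea (illustrative; the array shape and radius are placeholders, not this PR's actual diff, and the comparison is against scipy's rectangular 2-D filter, not dejavu's diamond footprint):

```python
import numpy as np
from scipy.ndimage import maximum_filter, maximum_filter1d

# Placeholder spectrogram; dejavu's real arr2D and PEAK_NEIGHBORHOOD_SIZE = 20
# would stand in for these values.
arr2D = np.random.default_rng(0).random((256, 512))
size = 2 * 20 + 1  # full window width for a neighborhood of radius 20

# One 2-D rectangular maximum filter...
rect = maximum_filter(arr2D, size=size)

# ...equals two linear-time 1-D passes, one per axis.
two_pass = maximum_filter1d(maximum_filter1d(arr2D, size, axis=0), size, axis=1)

assert np.array_equal(rect, two_pass)
```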
The drawback is that this rectangular filter is similar to, but different from, the original diamond-shaped one, and the impact on accuracy is unknown. It is technically possible to compute the original maximum filter in linear time, but it is harder to implement.

A simple benchmark shows that the time to scan all 5 files in the mp3 folder dropped from 2m55s to only 40s, a 4.5x improvement. Note that the number of fingerprints dropped from 494217 to 353834 due to the different maximum filter; the impact of this is unknown. I expect some further improvement from removing the binary erosion, but on my machine the bottleneck seems to have already moved from Python to MySQL. For indexing purposes it might be better to use an embedded database, such as RocksDB, to fully squeeze out performance.