-
Notifications
You must be signed in to change notification settings - Fork 3
/
algorithm-for-hashing-high-dimensional-float-vectors.html
169 lines (94 loc) · 11.6 KB
/
algorithm-for-hashing-high-dimensional-float-vectors.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
<!-- Copyright by Vitali Fedulov ([email protected]) 2015-2021 -->
<!DOCTYPE html>
<html lang="en">
<head>
<title>Algorithm for hashing float vectors</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="icon" type="image/png" href="images/favicon.png">
<link rel="stylesheet" type="text/css" href="css/index.css">
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto">
</head>
<body>
<!-- Header -->
<div class="header">
<a href = "index.html" class="logo"><h1>Similar Pictures Search</h1></a>
<h4 class="logo-statement">Find near duplicates on your computer</h4>
</div>
<div class="nav">
<!-- Set manually class "active" for corresponding nav-tab. -->
<div class="nav-tab"><a href="index.html">Home</a></div>
<div class="nav-tab"><a href="duplicate-image-search-example.html">Demo</a></div>
<div class="nav-tab active"><a href="research.html">Research</a></div>
<div class="nav-tab"><a href="tools.html">Tools</a></div>
<div class="nav-tab"><a href="about.html">About</a></div>
<div class="line"></div> <!-- Gray line placeholder. Should stay here. -->
</div>
<!-- Blog organization:
1. Home page navigation links to the latest blog-[date].html file. The pages must never
change file name/address, so if linked by other sites they stay constant.
2. Each blog page will later have a list of all articles linked by html injection,
e.g. secondary navigation menu. Place in upper part of the blog as TOC.
-->
<!-- Contents -->
<p class="published">[email protected], <time datetime="2021-12-07">7 December 2021</time></p>
<h3 id="2021-12-07">Float N-dimensional vector hashing</h3> <!-- Id used for URL anchors.-->
<p>The scope of the article is fast approximate vector comparison in N-dimensional spaces. A method to hash float vectors is presented. The method has been developed in the context of large scale image comparison.</p>
<p>Hash tables allow fast search. But not every value is suitable for hashing. For example how could you hash a person's height? There would be a near infinite number of hashes to find equal ones for comparison. Hashing in higher dimensions is even more challenging.</p>
<p>Here I demonstrate a method to hash numerical Float values and other values of high density, such as Int, by using binary bucket trees. In the method each component of a vector is projected into 1 or 2 buckets per dimension. Bucket numbers form hashes by appending them sequentially dimension by dimension.</p>
<p>In general, in each dimension, buckets can have variable widths. For example, if values come from interval (-Infinity, +Infinity), we could split this infinite space into finite K buckets with some kind of distribution reflecting the data (bell-curve or other). It might be good to analyze each dimension of your vectors to get the distribution curve, and split each axis so that each bucket contains approximately the same number of values.</p>
<p>Value distribution is important, but in the method below I assume that buckets have equal widths and that we consider a fixed interval of float values [min, max] per dimension. For example, (0.0, 300.0) cm is a safe interval to measure humans, and 10 cm could be the bucket width.</p>
<p>To be able to hash/compare a float value, we need its bucket number. But it is not sufficient to check that a float value falls into a specific bucket. What if it falls exactly at a bucket split point? Or one value falls near the split on one side, and another similar value falls on the other side? Besides, those values typically contain measurement errors. Therefore we will introduce an uncertainty neighbourhood of a value: (x-ɛ, x+ɛ).</p>
<p>Let's start with 1-dimensional vector (figure below). The bucket number for the left point is uncertain:</p>
<figure class="blog-img-1">
<img src="images/21.12.07-buckets-01.gif" alt="1-dimensional buckets">
</figure>
<p>If a float value falls near the center of a bucket, we would feel safe to allocate a single bucket for this value. But if the value finds itself near the bucket border, as in the picture above, the nearby bucket should also be included. By introducing the uncertainty interval of different sizes around the float value, we will get more than one bucket corresponding to such value:</p>
<figure class="blog-img-1">
<img src="images/21.12.07-buckets-02.gif" alt="1-dimensional buckets with uncertainty interval">
</figure>
<p>In the 1st case of the figure above 2 buckets are "activated" by the uncertainty interval. Therefore 2 bucket numbers will be necessary to represent one value.</p>
<p>In the 2nd case, the epsilon interval covers many buckets to represent one float value. This is impractical. An ideal bucket width must be larger than 2ɛ. Then a float value will fall into maximum 2 buckets. There is no "free lunch" to have all values in 1 bucket only - some will luckily do, but some will require 2 buckets. This is the price to be able to compare values, as explained above.</p>
<p>From now on we assume that bucket width is larger than 2ɛ.</p>
<p>As showed above, the uncertainty requires 2 buckets "reserved" in 1-dimensional space. If bucket width is much larger than ɛ, there will be few values with their uncertainty intervals overlaying 2 buckets, because probability of a value to be near a bucket border will be smaller. In such configuration most values will require only 1 bucket number to be correctly represented.</p>
<p>Now, let's consider higher dimensional space:</p>
<figure class="blog-img-1">
<img src="images/21.12.07-buckets-03.gif" alt="2-dimensional hypercubes">
<figcaption>Buckets for 2 dimensions, value uncertainty intervals,<br>and corresponding hypercubes</figcaption>
</figure>
<p>In N-space buckets are defined the same way as for one dimension, that is in relation to bucket borders. In 2 dimensions (figure above) buckets are vertical and horizontal stripes between black lines. The uncertainty regions are colored pink. In 3D buckets become infinite slices, and so on for higher dimensions.</p>
<p><b>Method essence:</b> Bucket intersections form hypercubes, which correspond to the final hashes to be ultimately used to compare float vector values. The hashes are de-facto hypercube numbers.</p>
<p>Assuming that bucket width is larger than 2ɛ, for one dimension we will get 1-2 final hashes. For 2D we will get 1-4 hashes. For 3D we will get 1-8 hashes. And so on.</p>
<p>Below is shown the smallest number of buckets necessary in 2D case (one for each dimension), because each uncertainty interval falls inside one bucket.</p>
<figure class="blog-img-1">
<img src="images/21.12.07-buckets-04.gif" alt="2-dimensional hypercubes">
<figcaption>The case of minimal number of buckets necessary in 2D</figcaption>
</figure>
<p>With each additional dimension the number of possible hypercubes will double. Therefore one N-dimensional vector will be represented by 1 to 2<sup>N</sup> final hashes.</p>
<h4>How to construct hashes</h4>
<p>Let's assume that one character of a hash is a bucket number, and that we have 10 buckets per dimension numbered from 0 to 9.</p>
<p>In each dimension we always have 1 or 2 bucket numbers.</p>
<p>Let's consider a 2D case with some random bucket numbers. In the 1st dimension we have bucket number [2] or numbers [2, 3]. In the 2nd dimension [7] or [7, 8]. Now, to create full hashes, we append numbers of the 1st dimension to the numbers of the 2nd dimension, thus getting at minimum 1 hash: "27", and at maximum 4 hashes: "27", "28", "37", "38" (each to each number of the dimensions).</p>
<p>Such binary tree approach allows to continue with additional dimensions to construct larger hashes for N-dimensional float vectors. The final hash length is always N. For every additional dimension we combine hashes from previous dimensions with the current one. To continue from the previous example, if in the 3rd dimension there is only one bucket 5, the final hashes at maximum will be "275", "285", "375", "385".</p>
<p>Just to repeat: in the "most economic" case there will be only 1 hash in the final set, and in the worse case there will be 2<sup>N</sup> hashes. Such hash set represents one N-vector for search and comparison.</p>
<h4>How to search and compare</h4>
<p>Every hash is de-facto a hypercube for nearest-neighbour search or comparison. Two vectors with same hash(es) are equal or near equal.</p>
<p>For hash lookup database, we do not need to store the whole set of hashes per N-vector. The only hash we need to store is the one corresponding to the central hypercube, which contains the value point.</p>
<p>But for a query we need all hashes from a set. Thus the query hash (set) is "fuzzy" because of uncertainty, but a recorded value hash is not.</p>
<p>It should be possible to inverse the approach and store the whole hash set of the vector, while only using the central hash for a lookup query. Anyway, no "free lunch" here: one-to-many or many-to-one principle must hold on search.</p>
<h4>How the curse becomes a blessing</h4>
<p>Since the hash set size may double with each additional dimension, it might seem impractical to search/compare for large N. Already for 10 dimensions we may get up to 2<sup>10</sup> = 1024 hashes in the set! The curse of dimensionality?</p>
<p>To address the problem, as already mentioned for 1D case, we can choose larger bucket widths. Assuming very large bucket width and reasonably small ɛ, there will be only few cases of our value being near bucket borders, therefore only a few branches of the binary tree will be necessary. As a result we will typically get small hash sets with a few rare exceptions.</p>
<p>If bucket width is large, the similarity approximation is rough, but could be sufficient for initial grouping of similar vectors. After that Euclidean distance can be used to check for higher precision within the group. The Euclidean distance threshold can be set to ɛ.</p>
<p>To increase precision we could also add additional dimensions N+1, N+2 etc. As a result even the most basic dimension split into only 2 buckets may be enough. Then the curse of dimensionality becomes a blessing: number of hypercubes will grow fast and will discretize N-space quite granularly. For example having only 4 buckets per dimension in 10-dimensional space will give us 4<sup>10</sup>> 1 million hashes. And for 8 buckets over 1 billion hashes.</p>
<p>Another way to fight the curse of dimensionality is to split N-dimensional vector to smaller vectors, and generate hash sets for each sub-vector. Then later intersect lookup results for those sub-sets.</p>
<p>The demonstrated bucket approach will work best when vector components are independent variables.</p>
<h4>Algorithm code</h4>
<p>Written in Golang <a href="https://github.com/vitali-fedulov/hyper">float vector hash (Github)</a>.</p>
<h4>Possible optimization</h4>
<p>The demonstrated method uses uncertainty planes to identify additional hypercubes. A better method could be using an epsilon sphere to select the hypercubes. This should reduce the number of hashes in the query set.</p>
<br>
<p class="published">This is my original research. Please treat it as such for references, provided you do not find an earlier similar research by others, which is possible. Sometimes solutions have to be rediscovered, because either papers are hard to find, or they are difficult to understand.</p>
<div class = "footer">Copyright © 2021 Vitali Fedulov. All rights reserved.</div>
</body>
</html>