|
|
# K-Approximate-Nearest-Neighbor Search with LSH
|
|
|
|
|
|
An implementation of approximate k-nearest-neighbor search with locality-sensitive hashing (LSH). Given a set of reference points and a set of query points, this will compute the k approximate nearest neighbors of each query point in the reference set.
|
|
|
|
|
|
# Available Predicates
|
|
|
|
|
|
* [initModel/8](/PrologMethods/Geometry/lsh#initmodel8)
|
|
|
* [computeRecall/7](/PrologMethods/Geometry/lsh#computerecall7)
|
|
|
* [searchWithQuery/12](/PrologMethods/Geometry/lsh#searchwithquery12)
|
|
|
* [searchNoQuery/9](/PrologMethods/Geometry/lsh#searchnoquery9)
|
|
|
* [train/8](/PrologMethods/Geometry/lsh#train8)
|
|
|
|
|
|
---
|
|
|
|
|
|
[links/resources](/PrologMethods/Geometry/lsh#connected-linksresources)
|
|
|
|
|
|
## **_initModel/8_**
|
|
|
|
|
|
Initiatzes the model and trains it.
|
|
|
|
|
|
```prolog
|
|
|
%% part of the predicate definition
|
|
|
initModel( +pointer(float_array), +integer, +integer,
|
|
|
+integer, +integer, +float32, +integer, +integer).
|
|
|
```
|
|
|
|
|
|
### Parameters
|
|
|
| Name | Type | Description | Default |
|
|
|
|------|------|-------------|---------|
|
|
|
| referenceSet | +matrix | Matrix containing the reference dataset. | - |
|
|
|
| numProj | +integer | Number of projections in each hash table (anything between 10-50 might be a decent choice). | 10-50 |
|
|
|
| numTables | +integer | Total number of hash tables (anything between 10-20 should suffice). | 10-20 |
|
|
|
| hashWidth | +float | The width of hash for every table. If 0 (the default) is provided, then the hash width is automatically obtained by computing the average pairwise distance of 25 pairs. This should be a reasonable upper bound on the nearest-neighbor distance in general. | 0.0 |
|
|
|
| secondHashSize | +integer | The size of the second hash table. This should be a large prime number. | 99901 |
|
|
|
| bucketSize | +integer | The size of the bucket in the second hash table. This is the maximum number of points that can be hashed into single bucket. A value of 0 indicates that there is no limit (so the second hash table can be arbitrarily large—be careful!). | 500 |
|
|
|
|
|
|
---
|
|
|
|
|
|
## **_computeRecall/7_**
|
|
|
|
|
|
Compute the recall (% of neighbors found) given the neighbors returned by searchWithQuery/12 or searchNoQuery/9 and a "ground truth" set of neighbors.
|
|
|
|
|
|
```prolog
|
|
|
%% part of the predicate definition
|
|
|
computeRecall( +pointer(float_array), +integer, +integer,
|
|
|
+pointer(float_array), +integer, +integer,
|
|
|
[-float32]).
|
|
|
```
|
|
|
|
|
|
### Parameters
|
|
|
| Name | Type | Description | Default |
|
|
|
|------|------|-------------|---------|
|
|
|
| foundNeighbors | +matrix | Set of neighbors to compute recall of. | - |
|
|
|
| realneighbors | +matrix | Set of "ground truth" neighbors to compute recall against. | - |
|
|
|
| recall | -float | The recall percentage retured as a value between \[0,1\]. | - |
|
|
|
|
|
|
---
|
|
|
|
|
|
## **_searchWithQuery/12_**
|
|
|
|
|
|
Compute the nearest neighbors of the points in the given query set and store the output in the given matrices.
|
|
|
|
|
|
The matrices will be set to the size of n columns by k rows, where n is the number of points in the query dataset and k is the number of neighbors being searched for.
|
|
|
|
|
|
```prolog
|
|
|
%% part of the predicate definition
|
|
|
searchWithQuery( +pointer(float_array), +integer, +integer,
|
|
|
+integer,
|
|
|
-pointer(float_array), -integer, -integer,
|
|
|
-pointer(float_array), -integer, -integer,
|
|
|
+integer, +integer).
|
|
|
```
|
|
|
|
|
|
### Parameters
|
|
|
| Name | Type | Description | Default |
|
|
|
|------|------|-------------|---------|
|
|
|
| querySet | +matrix | Set of query points. | - |
|
|
|
| k | +integer | Number of neighbors to search for. | 0 |
|
|
|
| resultingNeighbors | -matrix | Matrix storing lists of neighbors for each query point. | - |
|
|
|
| distances | -matrix | Matrix storing distances of neighbors for each query point. | - |
|
|
|
| numTablesToSearch | +integer | This parameter allows the user to have control over the number of hash tables to be searched. This allows the user to pick the number of tables it can afford for the time available without having to build hashing for every table size. By default, this is set to zero in which case all tables are considered. | 0 |
|
|
|
| T | +integer | The number of additional probing bins to examine with multiprobe LSH. If T = 0, classic single-probe LSH is run (default). | 0 |
|
|
|
|
|
|
---
|
|
|
|
|
|
## **_searchNoQuery/9_**
|
|
|
|
|
|
Compute the nearest neighbors and store the output in the given matrices.
|
|
|
|
|
|
The matrices will be set to the size of n columns by k rows, where n is the number of points in the query dataset and k is the number of neighbors being searched for.
|
|
|
|
|
|
```prolog
|
|
|
%% part of the predicate definition
|
|
|
searchNoQuery( +integer,
|
|
|
-pointer(float_array), -integer, -integer,
|
|
|
-pointer(float_array), -integer, -integer,
|
|
|
+integer, +integer).
|
|
|
```
|
|
|
|
|
|
### Parameters
|
|
|
| Name | Type | Description | Default |
|
|
|
|------|------|-------------|---------|
|
|
|
| k | +integer | Number of neighbors to search for. | 0 |
|
|
|
| resultingNeighbors | -matrix | Matrix storing lists of neighbors for each query point. | - |
|
|
|
| distances | -matrix | Matrix storing distances of neighbors for each query point. | - |
|
|
|
| numTablesToSearch | +integer | This parameter allows the user to have control over the number of hash tables to be searched. This allows the user to pick the number of tables it can afford for the time available without having to build hashing for every table size. By default, this is set to zero in which case all tables are considered. | 0 |
|
|
|
| T | +integer | The number of additional probing bins to examine with multiprobe LSH. If T = 0, classic single-probe LSH is run (default). | 0 |
|
|
|
|
|
|
---
|
|
|
|
|
|
---
|
|
|
|
|
|
## **_train/8_**
|
|
|
|
|
|
Train the LSH model on the given dataset.
|
|
|
|
|
|
```prolog
|
|
|
%% part of the predicate definition
|
|
|
train( +pointer(float_array), +integer, +integer,
|
|
|
+integer, +integer, +float32, +integer, +integer).
|
|
|
```
|
|
|
|
|
|
### Parameters
|
|
|
| Name | Type | Description | Default |
|
|
|
|------|------|-------------|---------|
|
|
|
| referenceSet | +matrix | Matrix containing the reference dataset. | - |
|
|
|
| numProj | +integer | Number of projections in each hash table (anything between 10-50 might be a decent choice). | 10-50 |
|
|
|
| numTables | +integer | Total number of hash tables (anything between 10-20 should suffice). | 10-20 |
|
|
|
| hashWidth | +float | The width of hash for every table. If 0 (the default) is provided, then the hash width is automatically obtained by computing the average pairwise distance of 25 pairs. This should be a reasonable upper bound on the nearest-neighbor distance in general. | 0.0 |
|
|
|
| secondHashSize | +integer | The size of the second hash table. This should be a large prime number. | 99901 |
|
|
|
| bucketSize | +integer | The size of the bucket in the second hash table. This is the maximum number of points that can be hashed into single bucket. A value of 0 indicates that there is no limit (so the second hash table can be arbitrarily large—be careful!). | 500 |
|
|
|
|
|
|
---
|
|
|
|
|
|
# Connected Links/Resources
|
|
|
|
|
|
If you want a more detailed explanation, then go to the python documentation. There is most of the time a good explanation on how the methods work and what the parameters do.
|
|
|
|
|
|
* [MLpack::lsh_C++\_documentation](https://www.mlpack.org/doc/stable/doxygen/classmlpack_1_1neighbor_1_1LSHSearch.html)
|
|
|
* [MLpack::lsh_Python_documentation](https://www.mlpack.org/doc/stable/python_documentation.html#local_coordinate_coding)
|
|
|
|
|
|
added some of the links from the python documentation
|
|
|
|
|
|
* knn
|
|
|
* [Locality-sensitive hashing on Wikipedia](https://en.wikipedia.org/wiki/Locality-sensitive_hashing)
|
|
|
* [Locality-sensitive hashing scheme based on p-stable distributions(pdf)](http://mlpack.org/papers/lsh.pdf) |
|
|
\ No newline at end of file |