K-Approximate-Nearest-Neighbor Search with LSH
An implementation of approximate k-nearest-neighbor search with locality-sensitive hashing (LSH). Given a set of reference points and a set of query points, this will compute the k approximate nearest neighbors of each query point in the reference set.
:- use_module('path/to/.../src/methods/adaboost/adaboost.pl').
%% usage example
TrainData = [5.1,3.5,1.4, 4.9,3.0,1.4, 4.7,3.2,1.3, 4.6,3.1,1.5],
TestData = [3,2,0, 5,1,4, 0,0,4, 3,3,5, 0,5,5, 2,5,5],
lsh_initAndTrainModel(TrainData, 3, 20, 10, 0.0, 99901, 500),
lsh_searchWithQuery(TestData, 3, 3, NeighborResultsList, _, DistancesList, _, 0, 0).
Available Predicates
lsh_initAndTrainModel/7
Initiatzes the model and trains it.
%% predicate definition
lsh_initAndTrainModel(ReferenceList, ReferenceRows, NumProj, NumTables, HashWidth, SecondHashSize, BucketSize) :-
NumProj >= 0,
NumTables >= 0,
HashWidth >= 0.0,
SecondHashSize > 0,
BucketSize > 0,
convert_list_to_float_array(ReferenceList, ReferenceRows, array(Xsize, Xrownum, X)),
initAndTrainModelI(X, Xsize, Xrownum, NumProj, NumTables, HashWidth, SecondHashSize, BucketSize).
%% foreign c++ predicate definition
foreign(initAndTrainModel, c, initAndTrainModelI( +pointer(float_array), +integer, +integer,
+integer, +integer, +float32, +integer, +integer)).
Parameters
Name | Type | Description | Default |
---|---|---|---|
referenceSet | +matrix | Matrix containing the reference dataset. | - |
numProj | +integer | Number of projections in each hash table (anything between 10-50 might be a decent choice). | 10-50 |
numTables | +integer | Total number of hash tables (anything between 10-20 should suffice). | 10-20 |
hashWidth | +float | The width of hash for every table. If 0 (the default) is provided, then the hash width is automatically obtained by computing the average pairwise distance of 25 pairs. This should be a reasonable upper bound on the nearest-neighbor distance in general. | 0.0 |
secondHashSize | +integer | The size of the second hash table. This should be a large prime number. | 99901 |
bucketSize | +integer | The size of the bucket in the second hash table. This is the maximum number of points that can be hashed into single bucket. A value of 0 indicates that there is no limit (so the second hash table can be arbitrarily large—be careful!). | 500 |
lsh_computeRecall/5
Compute the recall (% of neighbors found) given the neighbors returned by searchWithQuery/12 or searchNoQuery/9 and a "ground truth" set of neighbors.
%% predicate definition
lsh_computeRecall(FoundNeighborsList, FoundNeighborsRows, RealNeighborsList, RealNeighborsRows, Percentage) :-
convert_list_to_float_array(FoundNeighborsList, FoundNeighborsRows, array(Xsize, Xrownum, X)),
convert_list_to_float_array(RealNeighborsList, RealNeighborsRows, array(Ysize, Yrownum, Y)),
computeRecallI(X, Xsize, Xrownum, Y, Ysize, Yrownum, Percentage).
%% foreign c++ predicate definition
foreign(computeRecall, c, computeRecallI( +pointer(float_array), +integer, +integer,
+pointer(float_array), +integer, +integer,
[-float32])).
Parameters
Name | Type | Description | Default |
---|---|---|---|
foundNeighbors | +matrix | Set of neighbors to compute recall of. | - |
realneighbors | +matrix | Set of "ground truth" neighbors to compute recall against. | - |
recall | -float | The recall percentage retured as a value between [0,1]. | - |
lsh_searchWithQuery/9
Compute the nearest neighbors of the points in the given query set and store the output in the given matrices.
The matrices will be set to the size of n columns by k rows, where n is the number of points in the query dataset and k is the number of neighbors being searched for.
%% predicate definition
lsh_searchWithQuery(QueryList, QueryRows, K, ResultingNeighborsList, YCols, DistancesList, ZCols, NumTablesToSearch, T) :-
K > 0,
NumTablesToSearch >= 0,
T >= 0,
convert_list_to_float_array(QueryList, QueryRows, array(Xsize, Xrows, X)),
searchWithQueryI(X, Xsize, Xrows, K, Y, YCols, YRows, Z, ZCols, ZRows, NumTablesToSearch, T),
convert_float_array_to_2d_list(Y, YCols, YRows, ResultingNeighborsList),
convert_float_array_to_2d_list(Z, ZCols, ZRows, DistancesList).
%% foreign c++ predicate definition
foreign(searchWithQuery, c, searchWithQueryI( +pointer(float_array), +integer, +integer,
+integer,
-pointer(float_array), -integer, -integer,
-pointer(float_array), -integer, -integer,
+integer, +integer)).
Parameters
Name | Type | Description | Default |
---|---|---|---|
querySet | +matrix | Set of query points. | - |
k | +integer | Number of neighbors to search for. | 0 |
resultingNeighbors | -matrix | Matrix storing lists of neighbors for each query point. | - |
distances | -matrix | Matrix storing distances of neighbors for each query point. | - |
numTablesToSearch | +integer | This parameter allows the user to have control over the number of hash tables to be searched. This allows the user to pick the number of tables it can afford for the time available without having to build hashing for every table size. By default, this is set to zero in which case all tables are considered. | 0 |
T | +integer | The number of additional probing bins to examine with multiprobe LSH. If T = 0, classic single-probe LSH is run (default). | 0 |
lsh_searchNoQuery/7
Compute the nearest neighbors and store the output in the given matrices.
The matrices will be set to the size of n columns by k rows, where n is the number of points in the query dataset and k is the number of neighbors being searched for.
%% predicate definition
lsh_searchNoQuery(K, ResultingNeighborsList, YCols, DistancesList, ZCols, NumTablesToSearch, T) :-
K > 0,
NumTablesToSearch >= 0,
T >= 0,
searchNoQueryI(K, Y, YCols, YRows, Z, ZCols, ZRows, NumTablesToSearch, T),
convert_float_array_to_2d_list(Y, YCols, YRows, ResultingNeighborsList),
convert_float_array_to_2d_list(Z, ZCols, ZRows, DistancesList).
%% foreign c++ predicate definition
foreign(searchNoQuery, c, searchNoQueryI( +integer,
-pointer(float_array), -integer, -integer,
-pointer(float_array), -integer, -integer,
+integer, +integer)).
Parameters
Name | Type | Description | Default |
---|---|---|---|
k | +integer | Number of neighbors to search for. | 0 |
resultingNeighbors | -matrix | Matrix storing lists of neighbors for each query point. | - |
distances | -matrix | Matrix storing distances of neighbors for each query point. | - |
numTablesToSearch | +integer | This parameter allows the user to have control over the number of hash tables to be searched. This allows the user to pick the number of tables it can afford for the time available without having to build hashing for every table size. By default, this is set to zero in which case all tables are considered. | 0 |
T | +integer | The number of additional probing bins to examine with multiprobe LSH. If T = 0, classic single-probe LSH is run (default). | 0 |
Connected Links/Resources
If you want a more detailed explanation, then go to the python documentation. There is most of the time a good explanation on how the methods work and what the parameters do.
added some of the links from the python documentation