Quick nearest neighbor search in the 150-dimensional space

Question:

I want to create a database using any RDBMS. It will have a table with approximately 150 columns. The objective is to perform nearest neighbor search against other objects, i.e. NNS in 150-dimensional space.

I already tried some obvious methods like L1 or L2 distances, but of course they take a lot of time on tables with many rows. I also looked at KD-trees (note: I did not test them) and PG-Strom, but neither is a good solution for data with many dimensions.

Can I somehow improve the speed of the described search using mathematical methods (like KD-trees) or technical methods (like PG-Strom)?

I am willing to use any RDBMS that allows speeding up the NNS, but MySQL and PostgreSQL are the most suitable DBMSs for me.

Answer:

PostgreSQL 9.6 using cube

First install the cube extension:

CREATE EXTENSION cube;
Now we will create an n-dimensional space with 100,000 points in 50 dimensions. In addition we'll add a GiST index.

CREATE TABLE space_nd
AS
  SELECT i, cube(array_agg(random()::float)) AS c
  FROM generate_series(1,1e5) AS i
  CROSS JOIN LATERAL generate_series(1,50) AS x
  GROUP BY i;

CREATE INDEX ON space_nd USING gist ( c );
ANALYZE space_nd;

Now we will generate a single point and use the <-> operator to find the nearest point using Euclidean distance.

WITH points AS (
  SELECT cube(array_agg(random()::float)) AS c
  FROM generate_series(1,50) AS x
)
SELECT space_nd.c <-> points.c AS dist,
       cube_distance(space_nd.c, points.c)
FROM space_nd
CROSS JOIN points
ORDER BY space_nd.c <-> points.c
LIMIT 1;

PostgreSQL 9.6+ supports other distance operators over cube, all of which can use the GiST index we created. Namely,

a <-> b float8  Euclidean distance between a and b.
a <#> b float8  Taxicab (L-1 metric) distance between a and b.
a <=> b float8  Chebyshev (L-inf metric) distance between a and b.
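For intuition, the three operators correspond to the following metrics. A small NumPy sketch, outside the database (the values and variable names are purely illustrative):

```python
import numpy as np

a = np.array([0.0, 0.0, 0.0])
b = np.array([3.0, 4.0, 12.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # what <-> computes
taxicab = np.sum(np.abs(a - b))            # what <#> computes (L-1)
chebyshev = np.max(np.abs(a - b))          # what <=> computes (L-inf)

print(euclidean, taxicab, chebyshev)  # 13.0 19.0 12.0
```

Note that ORDER BY with a distance operator is what lets the planner use the GiST index; calling the equivalent function form does not.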

That said, there is one caveat, from the cube documentation:

To make it harder for people to break things, there is a limit of 100 on the number of dimensions of cubes. This is set in cubedata.h if you need something bigger.

You ask for 150 dimensions. That may present a minor complication.

Consider performing dimension reduction first (e.g. Principal Component Analysis).

Then you are doing NNS in a small number of dimensions, with higher performance.

You can use PL/R to perform PCA inside Postgres if needed.
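Outside the database, the reduction step can be sketched with NumPy via SVD (illustrative only; the sizes are made up, and in practice you would fit the projection once, then apply it to both the stored vectors and every query point):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 150))  # 1000 points in 150 dimensions

# PCA via SVD: center the data, then project onto the top components.
mean = X.mean(axis=0)
Xc = X - mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 50  # target dimensionality, safely below cube's 100-dimension limit
components = Vt[:k]            # (k, 150) projection matrix
X_reduced = Xc @ components.T  # (1000, k) vectors to store as cubes

# A query point must be reduced with the same mean and components.
q = rng.normal(size=150)
q_reduced = (q - mean) @ components.T

print(X_reduced.shape, q_reduced.shape)  # (1000, 50) (50,)
```

The reduced vectors then fit within cube's dimension limit and can be indexed and searched as shown above.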

Take a look at https://github.com/a-mma/AquilaDB; it is a vector database for storing feature vectors along with JSON metadata. Keep it alongside your RDBMS and use the metadata to maintain cross references between the two.

Have a look at FLANN and OpenCV.

Unfortunately I am not aware of an integration of either into an RDBMS. But there is, for example, integration of chemical structure information with Postgres, so in principle it can be done.
