Skip to content

Short-circuit HNSW search for similarity-based vector queries? #15869

@kaivalnp

Description

@kaivalnp

Description

Spinoff from #15836

A KNN query short-circuits the HNSW search if the "expected" number of nodes visited is >= number of filtered nodes.

A similarity-based vector query (i.e. [Byte|Float]VectorSimilarityQuery) attempts to find all vectors with a score above a threshold (for Euclidean similarity, this can be imagined as all vectors within a radius of the query vector).

Assuming document vectors are evenly spread out across the n-dimensional space, should vector similarity scores form a normal distribution?

If so, can we estimate the proportion of nodes visited using area under the curve (from resultSimilarity -> ) of a normal distribution? (and apply the same short circuit logic)

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions