#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Filename: annoy_the_vectors.py
# Version: 1.0.0
# Author: Jeoi Reqi

"""
This script demonstrates the usage of the Annoy library for vector search.

Vector search is a common task in machine learning and information retrieval, where the goal is to find similar vectors in a large dataset efficiently.
Annoy is a library designed for approximate nearest neighbor search, which can be useful for tasks such as recommendation systems, clustering, and data exploration.
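
For comparison, an exact nearest neighbor search can be written as a brute-force scan that measures the distance from the query to every vector. The sketch below uses only the standard library (Python 3.8+ for math.dist); `exact_nearest` is a hypothetical helper for illustration and is not part of Annoy. Its cost grows linearly with the dataset size, which is the cost that approximate methods try to avoid.

```python
import heapq
import math

def exact_nearest(data, query, k):
    # Pair every point with its Euclidean distance to the query: O(n) work per query.
    scored = ((math.dist(point, query), i) for i, point in enumerate(data))
    # Keep the k smallest distances, sorted ascending, and return (index, distance) pairs.
    return [(i, d) for d, i in heapq.nsmallest(k, scored)]

points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.5, 0.5, 0.5), (0.49, 0.5, 0.5)]
result = exact_nearest(points, (0.5, 0.5, 0.5), 2)
print(result)
```

A scan like this can also serve as ground truth when checking how many of the true neighbors an approximate index returns.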

Steps:

1. Generating Random Vectors:
   - It generates a specified number of random 3-dimensional vectors, drawn from three tight clusters centered on [1, 0, 0], [0, 1, 0], and [0, 0, 1].
     These vectors serve as the dataset for the nearest neighbor search.

2. Combining and Shuffling Vectors:
   - It combines the generated vectors with additional "needle" vectors, a small cluster around [0.5, 0.5, 0.5] that the nearest neighbor search is expected to find.
   - After combining, it shuffles the vectors so that a vector's position in the dataset does not reveal which cluster it came from.

3. Creating Annoy Index:
   - It creates an Annoy index, a data structure optimized for efficient nearest neighbor search in high-dimensional spaces.
   - Annoy indexes are built using a configurable number of trees, which partition the vector space to speed up the search process.

4. Adding Vectors to the Index:
   - It adds the generated vectors to the Annoy index, enabling fast lookup of nearest neighbors for any given query vector.

5. Building the Index:
   - It builds the Annoy index with the configured number of trees, which finalizes the data structure and makes it ready for search operations. After the index is built, no further items can be added.

6. Executing Query and Retrieving Nearest Neighbors:
   - It executes a query using the sample vector [0.5, 0.5, 0.5] and retrieves the nearest neighbors from the Annoy index.
   - Annoy returns the indices of the nearest neighbors and their distances from the query vector; the script then looks up the corresponding vectors by index.
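
The tree-based partitioning described in step 3 can be sketched in pure Python. This is a simplified illustration, not Annoy's actual implementation: it assumes each split uses the hyperplane equidistant from two randomly sampled points, and `build_tree`, `find_leaf`, and `search` are hypothetical names chosen for this example.

```python
import math
import random

def build_tree(points, ids, leaf_size, rng):
    # Small buckets become leaves: plain lists of item ids.
    if len(ids) <= leaf_size:
        return ids
    # Sample two distinct items and split by the hyperplane equidistant from them.
    a, b = (points[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    offset = sum(n * (x + y) / 2 for n, x, y in zip(normal, a, b))
    left, right = [], []
    for i in ids:
        side = sum(n * x for n, x in zip(normal, points[i]))
        (left if side < offset else right).append(i)
    # Degenerate split (for example, duplicate points): stop and keep a leaf.
    if not left or not right:
        return ids
    return (normal, offset,
            build_tree(points, left, leaf_size, rng),
            build_tree(points, right, leaf_size, rng))

def find_leaf(node, query):
    # Internal nodes are tuples; follow the side of each hyperplane the query falls on.
    while isinstance(node, tuple):
        normal, offset, left, right = node
        side = sum(n * q for n, q in zip(normal, query))
        node = left if side < offset else right
    return node

def search(forest, points, query, k):
    # Pool the candidate ids from one leaf per tree, then rank them by exact distance.
    candidates = set()
    for tree in forest:
        candidates.update(find_leaf(tree, query))
    return sorted(candidates, key=lambda i: math.dist(points[i], query))[:k]

rng = random.Random(0)
points = [[rng.random() for _ in range(3)] for _ in range(500)]
forest = [build_tree(points, list(range(len(points))), 16, rng) for _ in range(10)]
query = [0.5, 0.5, 0.5]
found = search(forest, points, query, 5)
print(found)
```

In this sketch, more trees mean more candidate leaves per query, which tends to improve recall at the cost of build time and memory. That is the trade-off num_trees controls in the script.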

Requirements:
- Python 3.6 or later: The script uses f-strings, which require Python 3.6+.
- NumPy: NumPy is used for generating random vectors and handling numerical operations efficiently.
- Annoy: The Annoy library is required for creating and querying the index used for nearest neighbor search.

Usage:
1. Ensure you have Python 3.6 or later installed on your system.

2. Install the NumPy and Annoy libraries using pip:
   pip install numpy annoy

3. Run the script with the Python 3 interpreter:
   python3 annoy_the_vectors.py

Expected Output Example:
- The script prints progress messages as it runs, followed by the results of the nearest neighbor search. The indices and distances differ from run to run because the vectors are generated randomly without a fixed seed.

    Generating 3000000 vectors...
    Combining and shuffling vectors...
    Number of generated vectors: 3000000
    Creating Annoy index...
    Adding 3000000 vectors to the index...
    Building the index...
    Executing the query...
    Query Vector: [0.5 0.5 0.5]
    Nearest Neighbors Indices: [2744143, 2798537, 1748512, 2073859, 1004853, 1099524, 1132982, 1368920, 797050, 892618]
    Nearest Neighbors Distances: [0.00034156814217567444, 0.0008897104999050498, 0.0009991193655878305, 0.0010703052394092083, 0.001080756657756865, 0.001241203281097114, 0.0013004967477172613, 0.001486903172917664, 0.0015034251846373081, 0.0015795603394508362]

    Nearest Neighbors:
    Neighbor: [0.4998232  0.49986833 0.49973908] Distance: 0.00034156814217567444
    Neighbor: [0.5008271  0.50022185 0.4997585 ] Distance: 0.0008897104999050498
    Neighbor: [0.5008262  0.4999582  0.49943972] Distance: 0.0009991193655878305
    Neighbor: [0.49939016 0.49921262 0.49960798] Distance: 0.0010703052394092083
    Neighbor: [0.5006552  0.49915966 0.5001806 ] Distance: 0.001080756657756865
    Neighbor: [0.50122076 0.4999057  0.50020355] Distance: 0.001241203281097114
    Neighbor: [0.4987392  0.49969807 0.4998973 ] Distance: 0.0013004967477172613
    Neighbor: [0.5011176  0.49953958 0.49913403] Distance: 0.001486903172917664
    Neighbor: [0.4997517  0.5014215  0.49957818] Distance: 0.0015034251846373081
    Neighbor: [0.5008414  0.50001013 0.49866322] Distance: 0.0015795603394508362
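
The reported distances are Euclidean, matching the metric the script passes to AnnoyIndex. As a quick sanity check, the first distance in the sample output can be approximately reproduced from the printed neighbor vector. A small residual is expected, because the printed coordinates are rounded.

```python
import math

query = [0.5, 0.5, 0.5]
neighbor = [0.4998232, 0.49986833, 0.49973908]  # first neighbor in the sample output
reported = 0.00034156814217567444               # distance printed for that neighbor

# Euclidean distance: square root of the sum of squared coordinate differences.
recomputed = math.dist(query, neighbor)
print(recomputed, abs(recomputed - reported))
```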

Note:
- Running this script shows how Annoy can be used for efficient vector search. Building 100 trees over 3,000,000 vectors can take a while and use a lot of memory; lowering num_vectors_total or num_trees gives a quicker run.
"""

import numpy as np
from annoy import AnnoyIndex

# Step 1: Creating Vectors
num_vectors_total = 3000000
num_needles = 10000

num_vectors_per_set = (num_vectors_total - num_needles) // 3

print(f"Generating {num_vectors_total} vectors...")
v1 = np.random.normal(loc=[1, 0, 0], scale=[0.01, 0.01, 0.01], size=(num_vectors_per_set, 3)).astype(np.float32)
v2 = np.random.normal(loc=[0, 1, 0], scale=[0.01, 0.01, 0.01], size=(num_vectors_per_set, 3)).astype(np.float32)
v3 = np.random.normal(loc=[0, 0, 1], scale=[0.01, 0.01, 0.01], size=(num_vectors_per_set, 3)).astype(np.float32)
needles = np.random.normal(loc=[0.5, 0.5, 0.5], scale=[0.01, 0.01, 0.01], size=(num_needles, 3)).astype(np.float32)

# Floor division can leave the total up to 2 vectors short; top up the needles to compensate
shortfall = num_vectors_total - (len(v1) + len(v2) + len(v3) + len(needles))
if shortfall > 0:
    extra = np.random.normal(loc=[0.5, 0.5, 0.5], scale=[0.01, 0.01, 0.01], size=(shortfall, 3)).astype(np.float32)
    needles = np.concatenate((needles, extra))

# Step 2: Combining and Shuffling Vectors
print("Combining and shuffling vectors...")
data = np.concatenate((v1, v2, v3, needles))
np.random.shuffle(data)

# Ensure the correct number of vectors is generated
print("Number of generated vectors:", len(data))  # Debug print statement
assert len(data) == num_vectors_total, f"Error: Number of vectors ({len(data)}) does not match expected ({num_vectors_total})"

# Step 3: Creating Annoy Index
print("Creating Annoy index...")
num_dimensions = 3
num_trees = 100
annoy_index = AnnoyIndex(num_dimensions, metric='euclidean')

# Adding vectors to the index
print(f"Adding {len(data)} vectors to the index...")
for i, vector in enumerate(data):
    annoy_index.add_item(i, vector)

# Building the index
print("Building the index...")
annoy_index.build(num_trees)

# Step 4: Executing the Query
print("Executing the query...")
query = np.array([0.5, 0.5, 0.5], dtype=np.float32)
num_neighbors = 10
neighbors, distances = annoy_index.get_nns_by_vector(query, num_neighbors, include_distances=True)

# Debugging: Print intermediate values
print("Query Vector:", query)
print("Nearest Neighbors Indices:", neighbors)
print("Nearest Neighbors Distances:", distances)

# Step 5: Printing the Result
print("\nNearest Neighbors:")
for neighbor, distance in zip(neighbors, distances):
    print("Neighbor:", data[neighbor], "Distance:", distance)