Python253

annoy_the_vectors

May 3rd, 2024
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Filename: annoy_the_vectors.py
# Version: 1.0.0
# Author: Jeoi Reqi

"""
This script demonstrates approximate nearest neighbor search with the Annoy library.

Vector search is a common task in machine learning and information retrieval: given a query vector, find the most similar vectors in a large dataset efficiently.
Annoy is a library for approximate nearest neighbor search, useful for tasks such as recommendation systems, clustering, and data exploration.

Steps:

1. Generating Random Vectors:
  - Generates random 3-dimensional vectors in three tight clusters centered near [1, 0, 0], [0, 1, 0], and [0, 0, 1].
    These vectors serve as the dataset for the nearest neighbor search.

2. Combining and Shuffling Vectors:
  - Combines the generated vectors with additional "needle" vectors clustered around [0.5, 0.5, 0.5], the point later used as the query. The needles are the items the search is expected to find.
  - Shuffles the combined array so that item indices carry no information about which cluster a vector came from.

3. Creating Annoy Index:
  - Creates an Annoy index, a data structure for approximate nearest neighbor search, using the Euclidean metric.
  - Annoy indexes are built from a configurable number of trees, each of which partitions the vector space to speed up the search. More trees generally improve accuracy at the cost of build time and index size.

4. Adding Vectors to the Index:
  - Adds every vector to the index under its row number, so that a neighbor's index can be used to look the vector up in the data array.

5. Building the Index:
  - Builds the trees. After this step the index is ready for queries, and no further items can be added.

6. Executing Query and Retrieving Nearest Neighbors:
  - Queries the index with the vector [0.5, 0.5, 0.5] and retrieves the 10 nearest neighbors.
  - Annoy returns the neighbors' indices and their distances from the query vector; the vectors themselves are then read from the data array.

Requirements:
- Python 3.6 or later: The script uses f-strings.
- NumPy: Used to generate and shuffle the random vectors.
- Annoy: Used to create, build, and query the nearest neighbor index.

Usage:
1. Ensure Python 3 is installed on your system.

2. Install the NumPy and Annoy libraries using pip:
  pip install numpy annoy

3. Run the script with a Python 3 interpreter:
  python3 annoy_the_vectors.py

Expected Output Example:
- The script reports progress as it runs, then prints the query vector and the results of the nearest neighbor search. The vectors are random and unseeded, so the indices, neighbors, and distances differ from run to run.

   Generating 3000000 vectors...
   Combining and shuffling vectors...
   Number of generated vectors: 3000000
   Creating Annoy index...
   Adding 3000000 vectors to the index...
   Building the index...
   Executing the query...
   Query Vector: [0.5 0.5 0.5]
   Nearest Neighbors Indices: [2744143, 2798537, 1748512, 2073859, 1004853, 1099524, 1132982, 1368920, 797050, 892618]
   Nearest Neighbors Distances: [0.00034156814217567444, 0.0008897104999050498, 0.0009991193655878305, 0.0010703052394092083, 0.001080756657756865, 0.001241203281097114, 0.0013004967477172613, 0.001486903172917664, 0.0015034251846373081, 0.0015795603394508362]

   Nearest Neighbors:
   Neighbor: [0.4998232  0.49986833 0.49973908] Distance: 0.00034156814217567444
   Neighbor: [0.5008271  0.50022185 0.4997585 ] Distance: 0.0008897104999050498
   Neighbor: [0.5008262  0.4999582  0.49943972] Distance: 0.0009991193655878305
   Neighbor: [0.49939016 0.49921262 0.49960798] Distance: 0.0010703052394092083
   Neighbor: [0.5006552  0.49915966 0.5001806 ] Distance: 0.001080756657756865
   Neighbor: [0.50122076 0.4999057  0.50020355] Distance: 0.001241203281097114
   Neighbor: [0.4987392  0.49969807 0.4998973 ] Distance: 0.0013004967477172613
   Neighbor: [0.5011176  0.49953958 0.49913403] Distance: 0.001486903172917664
   Neighbor: [0.4997517  0.5014215  0.49957818] Distance: 0.0015034251846373081
   Neighbor: [0.5008414  0.50001013 0.49866322] Distance: 0.0015795603394508362

Note:
- Running this script shows how Annoy can be used for efficient vector search and gives a starting point for applying it to real datasets.
"""

import numpy as np
from annoy import AnnoyIndex

# Step 1: Creating Vectors
num_vectors_total = 3000000
num_needles = 10000

# Size of each of the three background clusters (integer division may leave a remainder)
num_vectors_per_set = (num_vectors_total - num_needles) // 3

print(f"Generating {num_vectors_total} vectors...")
# Three tight clusters near the unit axes, plus "needles" clustered around the query point
v1 = np.random.normal(loc=[1, 0, 0], scale=[0.01, 0.01, 0.01], size=(num_vectors_per_set, 3)).astype(np.float32)
v2 = np.random.normal(loc=[0, 1, 0], scale=[0.01, 0.01, 0.01], size=(num_vectors_per_set, 3)).astype(np.float32)
v3 = np.random.normal(loc=[0, 0, 1], scale=[0.01, 0.01, 0.01], size=(num_vectors_per_set, 3)).astype(np.float32)
needles = np.random.normal(loc=[0.5, 0.5, 0.5], scale=[0.01, 0.01, 0.01], size=(num_needles, 3)).astype(np.float32)

# Integer division can leave the total short of num_vectors_total; pad (or trim) the needles to compensate
diff = num_vectors_total - (len(v1) + len(v2) + len(v3) + len(needles))
if diff > 0:
    needles = np.concatenate((needles, np.random.normal(loc=[0.5, 0.5, 0.5], scale=[0.01, 0.01, 0.01], size=(diff, 3)).astype(np.float32)))
elif diff < 0:
    needles = needles[:diff]

# Step 2: Combining and Shuffling Vectors
print("Combining and shuffling vectors...")
data = np.concatenate((v1, v2, v3, needles))
np.random.shuffle(data)  # Shuffles the rows in place

# Ensure the correct number of vectors was generated
print("Number of generated vectors:", len(data))  # Debug print statement
assert len(data) == num_vectors_total, f"Error: Number of vectors ({len(data)}) does not match expected ({num_vectors_total})"

# Step 3: Creating Annoy Index
print("Creating Annoy index...")
num_dimensions = 3
num_trees = 100  # More trees improve accuracy but increase build time and index size
annoy_index = AnnoyIndex(num_dimensions, metric='euclidean')

# Adding vectors to the index (the row number serves as the item id)
print(f"Adding {len(data)} vectors to the index...")
for i, vector in enumerate(data):
    annoy_index.add_item(i, vector)

# Building the index (no items can be added after this call)
print("Building the index...")
annoy_index.build(num_trees)

# Step 4: Executing the Query
print("Executing the query...")
query = np.array([0.5, 0.5, 0.5], dtype=np.float32)
num_neighbors = 10
neighbors, distances = annoy_index.get_nns_by_vector(query, num_neighbors, include_distances=True)

# Debugging: Print intermediate values
print("Query Vector:", query)
print("Nearest Neighbors Indices:", neighbors)
print("Nearest Neighbors Distances:", distances)

# Step 5: Printing the Result
print("\nNearest Neighbors:")
for neighbor, distance in zip(neighbors, distances):
    print("Neighbor:", data[neighbor], "Distance:", distance)
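
Annoy's results are approximate, so it can be useful to check them against an exact search on a small sample. The following is a minimal sketch of such a baseline in pure Python (it needs Python 3.8+ for math.dist, but neither NumPy nor Annoy). The helper name exact_nearest_neighbors and the tiny dataset are assumptions made for illustration and are not part of the script above.

```python
import heapq
import math


def exact_nearest_neighbors(data, query, k):
    """Return (indices, distances) of the k rows of data closest to query.

    Hypothetical helper: an exact linear scan using Euclidean distance,
    returning the same (indices, distances) shape as the script's Annoy query.
    """
    # Pair every row index with its distance to the query, then keep the k smallest.
    scored = ((math.dist(row, query), i) for i, row in enumerate(data))
    best = heapq.nsmallest(k, scored)
    indices = [i for _, i in best]
    distances = [d for d, _ in best]
    return indices, distances


# Tiny illustrative dataset: three points on the unit axes and two "needles".
sample = [
    (1.0, 0.0, 0.0),
    (0.5, 0.5, 0.6),
    (0.0, 1.0, 0.0),
    (0.5, 0.5, 0.5),
    (0.0, 0.0, 1.0),
]
indices, distances = exact_nearest_neighbors(sample, (0.5, 0.5, 0.5), k=2)
print(indices)  # [3, 1]
```

Comparing these indices with the ones returned by get_nns_by_vector on the same data gives a rough recall estimate. A pure-Python linear scan over all 3,000,000 vectors would be slow, so this check is only practical on a sample.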
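
The docstring notes that Annoy builds trees that partition the vector space. As a rough illustration of that idea, and not Annoy's actual implementation, the sketch below builds a single tree in pure Python by repeatedly picking two random items and splitting the rest by which of the two is closer (that is, by the hyperplane equidistant from them). The names build_tree and query_tree, the leaf size, and the small dataset are all hypothetical.

```python
import math
import random


def build_tree(points, ids, leaf_size, rng):
    """Recursively split ids into leaves of (normally) at most leaf_size items."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    # Pick two distinct items; the split is the hyperplane equidistant from them.
    a, b = rng.sample(ids, 2)
    left, right = [], []
    for i in ids:
        closer_to_a = math.dist(points[i], points[a]) <= math.dist(points[i], points[b])
        (left if closer_to_a else right).append(i)
    if not left or not right:
        # Degenerate split (e.g. duplicate points): stop and keep a larger leaf.
        return {"leaf": ids}
    return {
        "a": points[a],
        "b": points[b],
        "left": build_tree(points, left, leaf_size, rng),
        "right": build_tree(points, right, leaf_size, rng),
    }


def query_tree(tree, query):
    """Descend to the single leaf on the query's side of every split."""
    while "leaf" not in tree:
        if math.dist(query, tree["a"]) <= math.dist(query, tree["b"]):
            tree = tree["left"]
        else:
            tree = tree["right"]
    return tree["leaf"]


rng = random.Random(0)  # Seeded so the sketch is repeatable
centers = ([1, 0, 0], [0, 1, 0], [0, 0, 1], [0.5, 0.5, 0.5])
# 50 points per cluster, mirroring the script's layout at a much smaller scale.
points = [[rng.gauss(c, 0.01) for c in center] for center in centers for _ in range(50)]

tree = build_tree(points, list(range(len(points))), leaf_size=10, rng=rng)
query = [0.5, 0.5, 0.5]
candidates = query_tree(tree, query)
# Only the candidates in one leaf are scanned, instead of all 200 points.
nearest = min(candidates, key=lambda i: math.dist(points[i], query))
print(len(candidates), nearest)
```

A single tree can miss true neighbors that fall on the other side of a split. Annoy builds many trees (100 in the script above) and combines their candidates, which is why more trees tend to give better accuracy.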