- #!/usr/bin/env python3
- # -*- coding: utf-8 -*-
- # Filename: annoy_the_vectors.py
- # Version: 1.0.0
- # Author: Jeoi Reqi
- """
- This script demonstrates the usage of the Annoy library for vector search.
- Vector search is a common task in machine learning and information retrieval, where the goal is to find similar vectors in a large dataset efficiently.
- Annoy is a library designed for approximate nearest neighbor search, which can be useful for tasks such as recommendation systems, clustering, and data exploration.
- Steps:
- 1. Generating Random Vectors:
- - It generates random 3-dimensional vectors in three Gaussian clusters centered on [1, 0, 0], [0, 1, 0], and [0, 0, 1].
- These vectors serve as the dataset for the nearest neighbor search.
- 2. Combining and Shuffling Vectors:
- - It combines the generated vectors with additional "needle" vectors clustered around [0.5, 0.5, 0.5], the point later used as the query.
- - After combining, it shuffles the vectors so that a vector's index carries no information about its cluster.
- 3. Creating Annoy Index:
- - It creates an Annoy index, a data structure optimized for efficient nearest neighbor search in high-dimensional spaces.
- - Annoy indexes are built using a configurable number of trees, which partition the vector space to speed up the search process.
- 4. Adding Vectors to the Index:
- - It adds the generated vectors to the Annoy index, enabling fast lookup of nearest neighbors for any given query vector.
- 5. Building the Index:
- - It builds the Annoy index, which involves finalizing the data structure and making it ready for search operations.
- 6. Executing Query and Retrieving Nearest Neighbors:
- - It queries the index with the vector [0.5, 0.5, 0.5] and retrieves its 10 approximate nearest neighbors.
- - The neighbors are returned as item indices together with their Euclidean distances from the query vector.
- Requirements:
- - Python 3.6 or newer: The script uses f-strings, which require Python 3.6+.
- - NumPy: NumPy is used for generating random vectors and handling numerical operations efficiently.
- - Annoy: Annoy library is required for creating and using the Annoy index for nearest neighbor search.
- Usage:
- 1. Ensure you have Python 3.x installed on your system.
- 2. Install NumPy and Annoy libraries using pip:
- pip install numpy annoy
- 3. Run the script using Python 3.x interpreter.
- Expected Output Example:
- - The script prints progress messages during execution, followed by the query vector and the nearest neighbors found. Because the vectors are random and no seed is set, the indices and distances differ from run to run.
- Generating 3000000 vectors...
- Combining and shuffling vectors...
- Number of generated vectors: 3000000
- Creating Annoy index...
- Adding 3000000 vectors to the index...
- Building the index...
- Executing the query...
- Query Vector: [0.5 0.5 0.5]
- Nearest Neighbors Indices:
- - [2744143, 2798537, 1748512, 2073859, 1004853, 1099524, 1132982, 1368920, 797050, 892618]
- Nearest Neighbors Distances:
- - [0.00034156814217567444, 0.0008897104999050498, 0.0009991193655878305, 0.0010703052394092083, 0.001080756657756865, 0.001241203281097114, 0.0013004967477172613, 0.001486903172917664, 0.0015034251846373081, 0.0015795603394508362]
- Nearest Neighbors:
- - Neighbor: [0.4998232 0.49986833 0.49973908] Distance: 0.00034156814217567444
- - Neighbor: [0.5008271 0.50022185 0.4997585 ] Distance: 0.0008897104999050498
- - Neighbor: [0.5008262 0.4999582 0.49943972] Distance: 0.0009991193655878305
- - Neighbor: [0.49939016 0.49921262 0.49960798] Distance: 0.0010703052394092083
- - Neighbor: [0.5006552 0.49915966 0.5001806 ] Distance: 0.001080756657756865
- - Neighbor: [0.50122076 0.4999057 0.50020355] Distance: 0.001241203281097114
- - Neighbor: [0.4987392 0.49969807 0.4998973 ] Distance: 0.0013004967477172613
- - Neighbor: [0.5011176 0.49953958 0.49913403] Distance: 0.001486903172917664
- - Neighbor: [0.4997517 0.5014215 0.49957818] Distance: 0.0015034251846373081
- - Neighbor: [0.5008414 0.50001013 0.49866322] Distance: 0.0015795603394508362
- Note:
- - Running this script shows how Annoy can be used for efficient approximate vector search on a dataset of a few million points.
- """
- import numpy as np
- from annoy import AnnoyIndex
- # Step 1: Creating Vectors
- num_vectors_total = 3000000
- num_needles = 10000
- num_vectors_per_set = (num_vectors_total - num_needles) // 3
- print(f"Generating {num_vectors_total} vectors...")
- v1 = np.random.normal(loc=[1, 0, 0], scale=[0.01, 0.01, 0.01], size=(num_vectors_per_set, 3)).astype(np.float32)
- v2 = np.random.normal(loc=[0, 1, 0], scale=[0.01, 0.01, 0.01], size=(num_vectors_per_set, 3)).astype(np.float32)
- v3 = np.random.normal(loc=[0, 0, 1], scale=[0.01, 0.01, 0.01], size=(num_vectors_per_set, 3)).astype(np.float32)
- needles = np.random.normal(loc=[0.5, 0.5, 0.5], scale=[0.01, 0.01, 0.01], size=(num_needles, 3)).astype(np.float32)
- # Pad or trim the needles so the four sets add up to num_vectors_total
- # (the integer division above can leave a small remainder)
- while (len(v1) + len(v2) + len(v3) + len(needles)) != num_vectors_total:
-     diff = num_vectors_total - (len(v1) + len(v2) + len(v3) + len(needles))
-     if diff > 0:
-         needles = np.concatenate((needles, np.random.normal(loc=[0.5, 0.5, 0.5], scale=[0.01, 0.01, 0.01], size=(diff, 3)).astype(np.float32)))
-     else:
-         needles = needles[:diff]
- # Step 2: Combining and Shuffling Vectors
- print("Combining and shuffling vectors...")
- data = np.concatenate((v1, v2, v3, needles))
- np.random.shuffle(data)
- # Ensure the correct number of vectors is generated
- print("Number of generated vectors:", len(data)) # Debug print statement
- assert len(data) == num_vectors_total, f"Error: Number of vectors ({len(data)}) does not match expected ({num_vectors_total})"
- # Step 3: Creating Annoy Index
- print("Creating Annoy index...")
- num_dimensions = 3
- num_trees = 100
- annoy_index = AnnoyIndex(num_dimensions, metric='euclidean')
- # Adding vectors to the index
- print(f"Adding {len(data)} vectors to the index...")
- for i, vector in enumerate(data):
-     annoy_index.add_item(i, vector)
- # Building the index
- print("Building the index...")
- annoy_index.build(num_trees)
- # Step 4: Executing the Query
- print("Executing the query...")
- query = np.array([0.5, 0.5, 0.5]).astype(np.float32)
- num_neighbors = 10
- neighbors, distances = annoy_index.get_nns_by_vector(query, num_neighbors, include_distances=True)
- # Debugging: Print intermediate values
- print("Query Vector:", query)
- print("Nearest Neighbors Indices:", neighbors)
- print("Nearest Neighbors Distances:", distances)
- # Step 5: Printing the Result
- print("\nNearest Neighbors:")
- for neighbor, distance in zip(neighbors, distances):
-     print("Neighbor:", data[neighbor], "Distance:", distance)
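Annoy's results are approximate, so it can be useful to compare them against an exhaustive search on a small sample. The following is a minimal sketch of such a check using only the standard library. The helper names `exact_nearest_neighbors` and `recall` are hypothetical and not part of the script above or of Annoy, and the 1000-point sample is an arbitrary choice for illustration.

```python
import heapq
import math
import random


def exact_nearest_neighbors(points, query, k):
    """Return the k (index, distance) pairs closest to query, by exhaustive search."""
    scored = ((i, math.dist(p, query)) for i, p in enumerate(points))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbor ids that also appear in the approximate result."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)


if __name__ == "__main__":
    # Small stand-in dataset: points clustered around the query, like the needles.
    rng = random.Random(0)
    points = [[rng.gauss(0.5, 0.01) for _ in range(3)] for _ in range(1000)]
    query = [0.5, 0.5, 0.5]

    exact = exact_nearest_neighbors(points, query, 10)
    exact_ids = [i for i, _ in exact]

    # In the script above, approx_ids would be the `neighbors` list returned
    # by get_nns_by_vector; here the exact ids stand in for it.
    approx_ids = exact_ids
    print("Recall:", recall(approx_ids, exact_ids))
```

Exhaustive search costs time proportional to the dataset size for every query, so on the full 3,000,000-vector dataset it is best run for a handful of queries only. A recall well below 1.0 would suggest building more trees or passing a larger `search_k` to the query.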