sentencepiece_tokenizer_trainer

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Filename: sentencepiece_tokenizer_trainer.py
# Version: 1.0.0
# Author: Jeoi Reqi

"""
This script trains a SentencePiece tokenizer model on provided or user-selected text data.
It offers the option to use the sample text data included in the script or to select a custom text file.
The trained model is saved with the prefix 'tokenizer', and vocabulary files are generated accordingly.

Requirements:

    - Python 3.x
    - sentencepiece library
    - tkinter

Functions:

    - use_sample_text(): Writes the built-in sample text data to a file and returns its path.
    - use_custom_text(): Lets the user choose a custom text file via a file dialog.
    - main(): Offers the choice of sample or custom text, trains the model, and generates vocabulary files.

Usage:

    1. Run the script with a Python 3.x interpreter.
    2. Choose whether to use the sample text data or to select a custom text file.
    3. The script trains a SentencePiece tokenizer model on the chosen text data.
    4. The tokenizer model and vocabulary files are generated with the prefix 'tokenizer'.

Additional Notes:

    - The SentencePiece model is trained with adjusted parameters such as vocabulary size and character coverage.
    - Diverse text sources are recommended for training so the model captures a wide range of linguistic nuances.
    - Ensure that the sentencepiece library is installed in the Python environment before running the script.
    - This script provides a convenient way to train a tokenizer model for text preprocessing tasks.
"""

import os
import sentencepiece as spm
from tkinter import Tk, filedialog

# Function to write the built-in sample text data to a file and return its path
def use_sample_text():
    # Sample text data
    text_data = """
    Welcome to the sample text corpus! This corpus is designed to showcase the capabilities of the tokenizer model. It consists of several sentences carefully crafted to cover a wide range of linguistic patterns and vocabulary.
    In this corpus, you'll find sentences of varying lengths and complexities. Some sentences are short and simple, while others are longer and more intricate. This diversity helps the tokenizer model learn to handle different types of text inputs effectively.
    The purpose of this sample text is to provide a starting point for training the tokenizer model. However, it's essential to note that real-world text data will vary significantly from this example. Therefore, it's highly recommended to replace this sample text with your own data for more accurate and relevant model training.
    When replacing this sample text with your own data, consider using a diverse set of text sources. Include text from different domains, genres, and languages to ensure that the tokenizer model captures a wide range of linguistic nuances.
    Remember, the quality of the tokenizer model depends largely on the quality and diversity of the training data. So, take your time to gather and curate a comprehensive dataset that reflects the text inputs your model will encounter in real-world applications.
    Thank you for using this sample text corpus. We wish you the best of luck in training your tokenizer model!
    """
    with open("text_data.txt", "w", encoding="utf-8") as f:
        f.write(text_data)

    return "text_data.txt"

# Function to open the file explorer to select a custom text file to use
def use_custom_text():
    root = Tk()
    root.withdraw()  # Hide the root window so only the file dialog appears
    file_path = filedialog.askopenfilename(filetypes=[("Text files", "*.txt")])
    root.destroy()   # Clean up the hidden Tk root window
    return file_path

def main():
    print("Menu Options:\n")
    print("1. Use sample text data")
    print("2. Choose a custom text file")
    choice = input("\nEnter your choice (1 or 2): ")

    if choice == "1":
        print("\n\t\tTraining with sample data...\n")
        text_file_path = use_sample_text()
    elif choice == "2":
        print("\n\t\tTraining with custom data...\n")
        text_file_path = use_custom_text()
    else:
        print("\n\t\tInvalid choice. Please enter '1' or '2'.\n")
        return

    if not text_file_path:
        print("\n\t\tNo text file selected. Exiting...\n")
        return

    # Train the SentencePiece model with adjusted parameters
    spm.SentencePieceTrainer.train(
        input=text_file_path,      # Use the selected text data file as input
        model_prefix="tokenizer",  # Prefix for the model and vocabulary files
        vocab_size=189,            # Vocabulary size, chosen to accommodate all required characters
        character_coverage=0.9995, # Fraction of characters the model must cover
        num_threads=16,            # Use 16 threads for training
    )

    # Load the trained model
    tokenizer = spm.SentencePieceProcessor()
    tokenizer.load("tokenizer.model")

    # Print the vocabulary size
    print(
        f"\n\t\tTRAINING COMPLETE!\n\t\t--------------------------------------\n\t\tVocabulary size: {tokenizer.vocab_size()}"
    )

    # Generate a vocabulary file with one id/piece pair per line
    with open("vocab.txt", "w", encoding="utf-8") as vocab_file:
        for vocab_id in range(tokenizer.vocab_size()):
            vocab_file.write(f"{vocab_id:>30} {tokenizer.id_to_piece(vocab_id)}\n")

    print("\t\t--------------------------------------\n\t\tFiles Created:\n")
    print(f"\t\t- tokenizer.model\n\t\tPath: {os.path.abspath('tokenizer.model')}\n")
    print(f"\t\t- tokenizer.vocab\n\t\tPath: {os.path.abspath('tokenizer.vocab')}\n")
    print(
        f"\t\t- vocab.txt\n\t\tPath: {os.path.abspath('vocab.txt')}\n\t\t--------------------------------------"
    )
    print(
        "\t\tExiting program...\tGoodBye!\n\t\t--------------------------------------"
    )

if __name__ == "__main__":
    main()
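
Once training completes, the generated tokenizer.model can be loaded for encoding and decoding. The snippet below is a minimal round-trip sketch, assuming the script above has already been run so that tokenizer.model exists in the working directory; the sample sentence is purely illustrative.

import sentencepiece as spm

# Load the model produced by the training run above
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

text = "Welcome to the sample text corpus!"

# Encode the text into subword pieces and into integer ids
pieces = sp.encode(text, out_type=str)
ids = sp.encode(text, out_type=int)
print(pieces)  # subword pieces (likely mostly single characters at this small vocabulary size)
print(ids)     # the corresponding integer ids

# Decode the ids back into the original text
print(sp.decode(ids))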
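
The training call above uses SentencePiece's default unigram model. Other model types and options are available through the same train() call; the sketch below shows a hypothetical BPE variant. The model_prefix and user_defined_symbols values here are illustrative assumptions, not part of the original script.

import sentencepiece as spm

# Illustrative alternative: train a BPE model instead of the default unigram model
spm.SentencePieceTrainer.train(
    input="text_data.txt",           # Same sample-data file written by use_sample_text()
    model_prefix="tokenizer_bpe",    # Hypothetical prefix so tokenizer.model is not overwritten
    model_type="bpe",                # One of "unigram" (default), "bpe", "char", "word"
    vocab_size=189,                  # Same size as the script above; must fit the corpus
    character_coverage=0.9995,       # Same coverage as the script above
    user_defined_symbols=["<sep>"],  # Example: reserve a custom symbol as a single piece
)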