AquaBlitz11

tf-idf

Jul 7th, 2021
"""
This is hopefully a light homework set, though I want to introduce a new data structure here.

Other than lists, strings, and dictionaries, there is another compound data type you should know about: sets.
Sets are like dictionaries except they only store keys, not values.

Sets do not store their members in any particular order you can rely on.
(Dictionaries remember insertion order since Python 3.7; sets do not.)
This means you cannot use indexing to get the first or second or third element.

You also should not worry if your output doesn't match the examples here.
All that matters is that you have the same members in the set.
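
Here is a small sketch of those two points; it is not part of the homework. The names `A`, `B`, `same_members`, `indexing_failed`, and `ordered` are made up for this illustration. Two sets with the same members compare equal no matter what order they were written in, and trying to index a set raises a TypeError.

```python
# Two sets written in different orders are still equal,
# because only membership matters.
A = {4, 8, 15}
B = {15, 4, 8}
same_members = (A == B)

# Sets do not support indexing, so A[0] raises TypeError.
try:
    A[0]
    indexing_failed = False
except TypeError:
    indexing_failed = True

# If you want a predictable order (for printing, say), sort the members into a list.
ordered = sorted(A)

print(same_members)     # True
print(indexing_failed)  # True
print(ordered)          # [4, 8, 15]
```

This is also why the test cases at the bottom compare sets with `==`: the order your set prints in does not affect the comparison.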

Sets store things in a way that allows them to handle operations efficiently.
They are best known for their efficient `in` operator.
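
To give a rough feel for that efficiency, here is a sketch that times the `in` operator on a list and on a set holding the same numbers. The sizes and the names `numbers_list`, `numbers_set`, `target`, `list_time`, and `set_time` are arbitrary choices for this example, and the exact timings depend on your machine. You should typically see the set lookup finish much faster, because a list checks its elements one by one while a set uses hashing to jump close to the right spot.

```python
import timeit

numbers_list = list(range(100000))
numbers_set = set(numbers_list)

# Worst case for the list: the value we look for is at the very end.
target = 99999

# Run each membership test 100 times and measure the total time in seconds.
list_time = timeit.timeit(lambda: target in numbers_list, number=100)
set_time = timeit.timeit(lambda: target in numbers_set, number=100)

print('list:', list_time, 'seconds')
print('set: ', set_time, 'seconds')
```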

There are three ways to create a set.

1. Convert an existing list into a set.
>>> L = [4, 8, 15, 16, 23, 42]
>>> S = set(L)
>>> S
{4, 8, 42, 15, 16, 23}
>>> 42 in S
True
>>> 43 in S
False

2. Create a set using set notation (curly braces {} instead of brackets [], and no key: value pairs, unlike dictionaries).
>>> S = {4, 8, 15, 16, 23, 42}

3. Create an empty set with `set()`, then manually add each element to the set.
Beware: do not use empty curly braces {}, because Python interprets them as an empty dictionary.
>>> S = set()
>>> S.add(4)
>>> S.add(8)
>>> S.add(5)
>>> S
{8, 4, 5}
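
The warning about empty curly braces is easy to check for yourself. The sketch below (variable names `empty_braces` and `empty_set` are just for illustration) prints the type of each, and also shows that adding a member that is already in the set changes nothing.

```python
empty_braces = {}
empty_set = set()

print(type(empty_braces))  # <class 'dict'>
print(type(empty_set))     # <class 'set'>

# A set keeps at most one copy of each member,
# so adding 4 a second time has no effect.
S = set()
S.add(4)
S.add(4)
S.add(8)
print(len(S))  # 2
```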

Sets also support common mathematical set operations like union (the `|` operator),
intersection (the `&` operator), and difference (the `-` operator).
>>> A = {4, 8, 15, 16}
>>> B = {8, 15, 23}
>>> A | B
{16, 4, 23, 8, 15}
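
The example above only shows union, so here is a sketch of the other two operators on the same `A` and `B`. The names `common`, `only_in_a`, and `only_in_b` are made up for this illustration, and the results are sorted into lists only so the printed order is predictable.

```python
A = {4, 8, 15, 16}
B = {8, 15, 23}

common = A & B      # members that are in both A and B
only_in_a = A - B   # members of A that are not in B
only_in_b = B - A   # difference depends on which set comes first

print(sorted(common))     # [8, 15]
print(sorted(only_in_a))  # [4, 16]
print(sorted(only_in_b))  # [23]
```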

See https://docs.python.org/3/library/stdtypes.html#set for all available operations.
"""

# ------------------------------
# Homework
# ------------------------------

def get_word_count(text):
    """
    Given a string `text` containing a cleaned document, construct and return a dictionary
    that maps each word to the number of times that word appears in the text.

    You do not need to handle plurals, abbreviations, and all that fun stuff.

    A cleaned document is a string containing only lowercase words, with a single space
    separating each pair of adjacent words. There is no punctuation and there are no newlines to worry about.
    (See example test cases below.)
    """

    # placeholder return value; replace with your implementation
    return {'hi': 1}

def get_set_of_words(text):
    """
    Given a string `text` containing a cleaned document, construct and return the set of words that appear in it.
    """

    # hint: can be done in one line
    # placeholder return value; replace with your implementation
    return set()

def get_all_words(corpus):
    """
    Given a list of cleaned texts, build and return the set of all words that appear in at least one document.
    """

    # hint: this is like a running sum, but for set unions.
    # you might also want to use the `get_set_of_words` function to help simplify things
    # placeholder return value; replace with your implementation
    return set()

def get_document_frequency(corpus):
    """
    Given a list of cleaned texts, build and return a dictionary mapping each word to the number of documents that contain that word.
    """

    # bonus problem.
    # this is relatively difficult, so feel free to skip it if you need to
    # placeholder return value; replace with your implementation
    return {'hi': 3}

# ------------------------------
# Test cases
# ------------------------------

# when you run this program, the variables `example` and `excorpus` will have been defined for you
# so you shouldn't have to waste time typing out the strings
# try running:
# >>> example
# in your REPL to see that the string actually exists

example = 'hi my name is my name is gunn gunn hi gunn yo'
excorpus = [example, 'hi i want pfizer vaccine why cant thai government give me', 'hi i am thai']

assert get_word_count(example) == {'hi': 2, 'my': 2, 'name': 2, 'is': 2, 'gunn': 3, 'yo': 1}
assert get_set_of_words(example) == {'hi', 'is', 'yo', 'my', 'gunn', 'name'}

assert get_all_words(excorpus) == {'vaccine', 'thai', 'name', 'me', 'i', 'cant', 'yo', 'give', 'am', 'hi', 'my', 'gunn', 'why', 'want', 'pfizer', 'is', 'government'}

assert get_document_frequency(excorpus) == {'hi': 3, 'pfizer': 1, 'want': 1, 'name': 1, 'i': 2, 'why': 1, 'give': 1, 'me': 1, 'thai': 2, 'is': 1, 'gunn': 1, 'cant': 1, 'government': 1, 'my': 1, 'am': 1, 'vaccine': 1, 'yo': 1}