# tf-idf

"""
This is hopefully a light homework set, though I want to introduce a new data structure here.

Other than lists, strings, and dictionaries, there is another compound data type you should know about: sets.
Sets are like dictionaries except they only store keys, not values.

Unlike lists (and unlike dictionaries, which keep insertion order since Python 3.7),
sets do not store their elements in any order you can rely on.
This means you cannot use indexing to get the first or second or third element.

You also should not worry if your output doesn't match the examples here.
All that matters is you have the same members in the set.

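Sets store things in a way that allows them to handle operations efficiently.
They are best known for their efficient `in` operator: a membership check on a set takes
roughly constant time on average, while the same check on a list scans the elements one by one.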

There are three ways to create a set.

1. Convert an existing list into a set.
>>> L = [4, 8, 15, 16, 23, 42]
>>> S = set(L)
>>> S
{4, 8, 42, 15, 16, 23}
>>> 42 in S
True
>>> 43 in S
False

2. Create a set using set notation: curly braces {} instead of brackets [], with no key: value pairs, unlike dictionaries.
>>> S = {4, 8, 15, 16, 23, 42}

3. Create an empty set with `set()`, then manually add each element to the set.
Beware: do not use empty curly braces {} for this. Python interprets them as an empty dictionary.
>>> S = set()
>>> S.add(4)
>>> S.add(8)
>>> S.add(5)
>>> S
{8, 4, 5}
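
Two consequences of the rules above are worth checking for yourself. The following is a
minimal sketch (the names `empty_braces`, `empty_set`, and `seen` are made up for
illustration): empty braces build a dictionary, and adding a value that is already a
member leaves the set unchanged.

```python
empty_braces = {}      # empty curly braces give a dict, not a set
empty_set = set()      # this is how to get an empty set

seen = set()
seen.add(4)
seen.add(8)
seen.add(4)            # 4 is already a member, so nothing changes

print(type(empty_braces).__name__)  # dict
print(type(empty_set).__name__)     # set
print(len(seen))                    # 2
```

Because duplicates are dropped, converting a list to a set is a quick way to find its distinct elements.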

Sets also support common mathematical set operations like union (using `|` operator),
intersection (using `&` operator), and difference (`-`).
>>> A = {4, 8, 15, 16}
>>> B = {8, 15, 23}
>>> A | B
{16, 4, 23, 8, 15}
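
Intersection and difference work the same way as union. Here is a small sketch reusing
the same A and B (the names `common` and `only_a` are made up for illustration). The
results are compared with `==` rather than printed directly, because the printed order
of a set may differ on your machine.

```python
A = {4, 8, 15, 16}
B = {8, 15, 23}

common = A & B   # intersection: members of both A and B
only_a = A - B   # difference: members of A that are not in B

print(common == {8, 15})   # True
print(only_a == {4, 16})   # True
print(B - A == {23})       # True (difference depends on the order of the operands)
```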

See https://docs.python.org/3/library/stdtypes.html#set for all available operations.
"""

# ------------------------------
# Homework
# ------------------------------

def get_word_count(text):
    """
    Given a string `text` containing a cleaned document, construct and return a dictionary
    that maps each word to the number of times that word appears in the text.

    You do not need to handle plurals, abbreviations, and all that fun stuff.

    A cleaned document is a string containing only lowercase words, with a single space
    separating adjacent words. There is no punctuation and there are no newlines for you to worry about.
    (See example test cases below.)
    """

    return {'hi': 1}  # placeholder: replace with your implementation

def get_set_of_words(text):
    """
    Given a string `text` containing a cleaned document, construct and return the set of words that appear in it.
    """

    # hint: can be done in one line
    return set()

def get_all_words(corpus):
    """
    Given a list of cleaned texts, build a set of all words that appeared in at least one document.
    """

    # hint: this is like a running sum, but for set unions.
    # might also want to use `get_set_of_words` function to help simplify things
    return set()

def get_document_frequency(corpus):
    """
    Given a list of cleaned texts, build a dictionary mapping each word to the number of documents that contain that word.
    """

    # bonus problem.
    # this is relatively difficult so feel free to skip this if you need to
    return {'hi': 3}  # placeholder: replace with your implementation

# ------------------------------
# Test cases
# ------------------------------

# when you run this program, the variables `example` and `excorpus` will have been defined for you
# so you shouldn't have to waste time typing out the strings
# try running:
# >>> example
# in your REPL to see that the string actually exists

example = 'hi my name is my name is gunn gunn hi gunn yo'
excorpus = [example, 'hi i want pfizer vaccine why cant thai government give me', 'hi i am thai']

assert get_word_count(example) == {'hi': 2, 'my': 2, 'name': 2, 'is': 2, 'gunn': 3, 'yo': 1}
assert get_set_of_words(example) == {'hi', 'is', 'yo', 'my', 'gunn', 'name'}

assert get_all_words(excorpus) == {'vaccine', 'thai', 'name', 'me', 'i', 'cant', 'yo', 'give', 'am', 'hi', 'my', 'gunn', 'why', 'want', 'pfizer', 'is', 'government'}

assert get_document_frequency(excorpus) == {'hi': 3, 'pfizer': 1, 'want': 1, 'name': 1, 'i': 2, 'why': 1, 'give': 1, 'me': 1, 'thai': 2, 'is': 1, 'gunn': 1, 'cant': 1, 'government': 1, 'my': 1, 'am': 1, 'vaccine': 1, 'yo': 1}