- """
- This is hopefully a light homework set, though I want to introduce a new data structure here.
- Other than lists, strings, and dictionaries, there is another compound data type you should know about: sets.
- Sets are like dictionaries except they only store keys, not values.
- Unlike lists, sets do not keep their elements in any order you can rely on. (Dictionaries preserve insertion order since Python 3.7; sets make no such promise.)
- This means you cannot use indexing to get the first or second or third element.
- You also should not worry if your output doesn't match the examples here.
- All that matters is you have the same members in the set.
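- Sets store their elements in a way that allows them to handle certain operations efficiently.
- They are best known for their fast `in` operator: a membership test takes roughly constant time on average, whereas a list has to be scanned element by element.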
- There are three ways to create a set.
- 1. Convert an existing list into a set.
- >>> L = [4, 8, 15, 16, 23, 42]
- >>> S = set(L)
- >>> S
- {4, 8, 42, 15, 16, 23}
- >>> 42 in S
- True
- >>> 43 in S
- False
- 2. Create a set using set notation (Curly braces {} instead of brackets []. No key: value pairs, unlike dictionaries.)
- >>> S = {4, 8, 15, 16, 23, 42}
- 3. Create an empty set with `set()`, then add each element to it manually.
- Beware: do not use empty curly braces {} for this, because Python interprets them as an empty dictionary.
- >>> S = set()
- >>> S.add(4)
- >>> S.add(8)
- >>> S.add(5)
- >>> S
- {8, 4, 5}
- Sets also support common mathematical set operations like union (using `|` operator),
- intersection (using `&` operator), and difference (`-`).
- >>> A = {4, 8, 15, 16}
- >>> B = {8, 15, 23}
- >>> A | B
- {16, 4, 23, 8, 15}
- See https://docs.python.org/3/library/stdtypes.html#set for all available operations.
- """
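The docstring above demonstrates only the union operator. As a small sketch using the same `A` and `B`, the other two operators and the empty-braces pitfall behave roughly as follows. The `sorted()` calls are only there to make the printed order predictable, since sets themselves have no reliable order; the variable names `union`, `intersection`, and `difference` are illustrative and not part of the homework.

```python
A = {4, 8, 15, 16}
B = {8, 15, 23}

union = A | B          # elements that are in A or in B (or in both)
intersection = A & B   # elements that are in both A and B
difference = A - B     # elements that are in A but not in B

print(sorted(union))         # [4, 8, 15, 16, 23]
print(sorted(intersection))  # [8, 15]
print(sorted(difference))    # [4, 16]

# Empty curly braces create a dictionary, not a set.
print(type({}).__name__)     # dict
print(type(set()).__name__)  # set
```

Note that none of these operators modify `A` or `B`; each one builds and returns a new set.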
- # ------------------------------
- # Homework
- # ------------------------------
- def get_word_count(text):
-     """
-     Given a string `text` containing a cleaned document, construct and return a dictionary
-     that maps each word to the number of times that word appears in the text.
-     You do not need to handle plurals, abbreviations, and all that fun stuff.
-     A cleaned document is a string containing only lowercase words, with a single space
-     separating adjacent words. There is no punctuation and there are no newlines to worry about.
-     (See the example test cases below.)
-     """
-     return {'hi': 1}
- def get_set_of_words(text):
-     """
-     Given a string `text` containing a cleaned document, construct and return the set of words that appear in it.
-     """
-     # hint: can be done in one line
-     return set()
- def get_all_words(corpus):
-     """
-     Given a list of cleaned texts, build and return the set of all words that appear in at least one document.
-     """
-     # hint: this is like a running sum, but for set unions.
-     # you might also want to use the `get_set_of_words` function to simplify things
-     return set()
- def get_document_frequency(corpus):
-     """
-     Given a list of cleaned texts, build and return a dictionary mapping each word to the number of documents that contain that word.
-     """
-     # bonus problem.
-     # this is relatively difficult, so feel free to skip it if you need to
-     return {'hi': 3}
- # ------------------------------
- # Test cases
- # ------------------------------
- # when you run this program, the variables `example` and `excorpus` will be defined for you,
- # so you shouldn't have to waste time typing out the strings
- # try running:
- # >>> example
- # in your REPL to see that the string actually exists
- example = 'hi my name is my name is gunn gunn hi gunn yo'
- excorpus = [example, 'hi i want pfizer vaccine why cant thai government give me', 'hi i am thai']
- assert get_word_count(example) == {'hi': 2, 'my': 2, 'name': 2, 'is': 2, 'gunn': 3, 'yo': 1}
- assert get_set_of_words(example) == {'hi', 'is', 'yo', 'my', 'gunn', 'name'}
- assert get_all_words(excorpus) == {'vaccine', 'thai', 'name', 'me', 'i', 'cant', 'yo', 'give', 'am', 'hi', 'my', 'gunn', 'why', 'want', 'pfizer', 'is', 'government'}
- assert get_document_frequency(excorpus) == {'hi': 3, 'pfizer': 1, 'want': 1, 'name': 1, 'i': 2, 'why': 1, 'give': 1, 'me': 1, 'thai': 2, 'is': 1, 'gunn': 1, 'cant': 1, 'government': 1, 'my': 1, 'am': 1, 'vaccine': 1, 'yo': 1}
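The assertions above compare sets and dictionaries with `==`, which checks membership rather than order. That is why your output does not need to list elements in the same order as the expected values. A minimal sketch of this behavior, using made-up values (`words_a`, `words_b`, `same_sets`, and `same_dicts` are illustrative names, not part of the homework):

```python
words_a = {'hi', 'my', 'name'}
words_b = {'name', 'hi', 'my', 'hi'}  # different order, and 'hi' is written twice

# Sets are equal when they contain exactly the same members.
same_sets = (words_a == words_b)

# Dictionaries are equal when they hold the same key: value pairs, in any order.
same_dicts = ({'hi': 2, 'yo': 1} == {'yo': 1, 'hi': 2})

print(same_sets)     # True
print(same_dicts)    # True
print(len(words_b))  # 3, because the duplicate 'hi' is stored only once
```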