In this assignment, you will analyze a number of political tweets to understand how political candidates use Twitter. This assignment is designed to give you practice with dictionaries and files.

Twitter ... and Politics
Unless you've been hiding under a rock, you've probably heard a great deal about the election that just happened to our south. We'd like to focus on one particular aspect: the candidates' use of Twitter. Twitter is a social networking website where users can post very short messages, and all of the candidates have posted thousands of messages. (We'll focus on the four most mainstream candidates of the 1700+ people who have registered as candidates.) We'd like to know more about how each candidate used Twitter.
Definitions
• tweet: A message posted on Twitter. The message text is between 1 and 140 characters long (inclusive).
• hashtag: A word or phrase in a tweet that begins with the hash symbol. Twitter uses the number sign (#) as the hash symbol. #UofT and #csc108 are two examples of hashtags on Twitter. Hashtags are used to label important words or terms in a tweet. For our purposes, a hashtag begins with the hash symbol, and contains all alphanumeric characters up to (but not including) a space character, punctuation, or the end of a tweet. A hashtag is preceded by either a space, or the beginning of a tweet.
• mention: A word or phrase in a tweet that begins with the mention symbol. Twitter uses the at sign (@) as the mention symbol. Mentions are used to direct a message at or about a particular Twitter user, so the word or phrase should be a Twitter username (but for the purposes of this assignment, we won't check if the username is valid; we'll just assume it). For our purposes, the definition of a mention is very similar to that of a hashtag. A mention begins with the mention symbol, and contains all alphanumeric characters up to (but not including) a space character, punctuation, or the end of a tweet. A mention is preceded by either a space or the beginning of a tweet.
• URL: An address to a resource (like a webpage) on the Internet. For example, http://www.twitter.com and https://t.co/LREA7WRmOx are URLs. For our purposes, a URL is preceded either by a space or the beginning of a tweet, starts with the four characters http, and contains all characters up to (but not including) a space character, or the end of a tweet.
For a complete list of Twitter terms, check out the Twitter glossary.
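To make the URL definition concrete, here is a minimal sketch of one way to apply it. The function name extract_urls is hypothetical (it is not one of the required functions), and the sketch takes the definition literally: only space characters separate a URL from its surroundings.

```python
def extract_urls(tweet):
    """(str) -> list of str

    Hypothetical helper, not part of the assignment: return the URLs in
    tweet, in order, using the handout's definition of a URL.
    """
    urls = []
    # Splitting on single spaces means every piece is preceded by a space
    # or by the beginning of the tweet, and runs up to the next space or
    # the end of the tweet.
    for piece in tweet.split(' '):
        if piece.startswith('http'):
            urls.append(piece)
    return urls
```

Under these assumptions, a URL wrapped in other symbols, such as "(http://www.twitter.com)", is not reported, because the piece does not start with http.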

Data Files
We're providing two files: a text file containing all of the tweets from the four major candidates, and a text file containing a much smaller data set to be used for testing. You should not use the larger data set for early testing: how will you know what the right answer is? With the smaller data set, you can calculate what the answer should be.
Both data files have a particular format. Open the short data file and follow along with the description of the format that follows.
At the top level, they contain a sequence of candidate Twitter histories. Each candidate Twitter history has the format:
CANDIDATE NAME:
TWEET #1
TWEET #2
TWEET #3
...

A candidate history ends when a new candidate Twitter history (which begins with a line that ends with a ":") begins. The CANDIDATE NAME in a Twitter history is a string. Each TWEET is a record with multiple fields:
ID,DATE,LOCATION,SOURCE,FAVORITE_COUNT,RETWEET_COUNT
TEXT ...
<<<EOT

The fields in the first line are comma separated. The ID and DATE are integers. (The date is an integer because it is the number of seconds since the Unix Epoch. That makes it easy to compare dates, since you can use numeric comparisons.) The LOCATION and SOURCE are strings describing the location where the tweet was made, and the app or device used to create the tweet. The FAVORITE_COUNT and RETWEET_COUNT are also integers reflecting the number of times the tweet was favorited and retweeted, respectively.

The TEXT of the tweet can span multiple lines, so we have to indicate that it has ended by inserting a sentinel. The sentinel in our data files is "<<<EOT". When a line contains only the sentinel, you know that the TEXT of the tweet has been read. The sentinel is not part of the tweet TEXT.
All tweets will be well-formed; you do not need to handle the case where a tweet does not match the format described above.

What To Do
In this assignment, you will write a set of functions that analyze tweets and features extracted from tweets. In particular, complete the functions listed below in tweets.py and provide a set of tests to evaluate them in the docstrings.

tweets.py

extract_mentions:
(str) -> list of str
The parameter is a tweet. This function should return a list containing all of the mentions in the tweet, in the order they appear in the tweet. Each mention in the returned list should have the initial mention symbol removed, and the list should contain every mention encountered, including repeats if a user is mentioned more than once within a tweet. Note: Our definition of a mention (and similarly, of a hashtag) doesn't allow for mentions embedded in other symbols. For example, "Vote! --@FLOTUS" does not contain a mention, by our definition, since the mention of FLOTUS is preceded by other symbols. We require that the mention start with a "@" symbol and be preceded by a space or the beginning of the tweet.
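One possible reading of this description can be sketched as a character scan. This is an illustration, not a reference solution. It assumes that a lone "@" followed by no alphanumeric characters is not a mention, and that only a space character (not a tab or newline) may precede a mention; the handout does not settle either point.

```python
def extract_mentions(tweet):
    """(str) -> list of str

    Return the mentions in tweet, in order, without the '@' symbol.
    Repeated mentions are kept.
    """
    mentions = []
    for i in range(len(tweet)):
        # A mention starts at '@' preceded by a space or the tweet start.
        if tweet[i] == '@' and (i == 0 or tweet[i - 1] == ' '):
            j = i + 1
            # Consume alphanumeric characters only.
            while j < len(tweet) and tweet[j].isalnum():
                j += 1
            # Assumption: skip an '@' with nothing alphanumeric after it.
            if j > i + 1:
                mentions.append(tweet[i + 1:j])
    return mentions
```

With this sketch, "Vote! --@FLOTUS" yields an empty list, matching the note above.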

extract_hashtags:
(str) -> list of str
The parameter is a tweet. This function should return a list containing all of the hashtags in the tweet, in the order they appear in the tweet. Each hashtag in the returned list should have the initial hash symbol removed, and hashtags should be unique. (If a tweet uses the same hashtag twice, it is included in the list only once. The order of the hashtags should match the order of the first occurrence of each tag in the tweet.)
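The uniqueness rule can be sketched with the same kind of scan plus a membership check. This is a sketch under assumptions: hashtags that differ only in case (for example #UofT and #uoft) are treated as different, and a lone "#" is ignored. Neither choice is stated in the handout.

```python
def extract_hashtags(tweet):
    """(str) -> list of str

    Return the unique hashtags in tweet, ordered by first occurrence,
    without the '#' symbol.
    """
    hashtags = []
    for i in range(len(tweet)):
        if tweet[i] == '#' and (i == 0 or tweet[i - 1] == ' '):
            j = i + 1
            while j < len(tweet) and tweet[j].isalnum():
                j += 1
            tag = tweet[i + 1:j]
            # Keep only the first occurrence of each non-empty hashtag.
            if tag != '' and tag not in hashtags:
                hashtags.append(tag)
    return hashtags
```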

count_words:
(str, dict of {str: int}) -> None
The first parameter is a tweet, and the second is a dictionary containing lowercase words as keys and integer counts as values. The function should update the counts of words in the dictionary. If a word is not in the dictionary yet, it should be added. For the purposes of this function, words are defined by whitespace: every string that occurs between two pieces of whitespace (or between a piece of whitespace and either the beginning or end of the tweet) could be a word. Numeric characters are treated the same way as alphabetic characters. Hashtags, mentions, and URLs are not considered words. The empty string is not considered a word. Words don't contain punctuation, so punctuation should be removed from any candidate words. For example, if we are analyzing the tweet "@utmandrew Don't you wish you could vote? #MakeAmericaGreatAgain", we would increment the count for the word "you" by 2 and the counts for words "dont", "wish", "could", and "vote" by 1.
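The worked example above can be reproduced with a short sketch. It makes one simplifying assumption that you should check against the definitions: any whitespace-separated piece that starts with "#", "@", or "http" is treated as a hashtag, mention, or URL and skipped.

```python
def count_words(tweet, word_counts):
    """(str, dict of {str: int}) -> None

    Update word_counts with the lowercase words found in tweet.
    """
    for piece in tweet.split():
        # Simplification: skip anything that looks like a hashtag,
        # mention, or URL based on how the piece starts.
        if piece.startswith(('#', '@', 'http')):
            continue
        # Drop punctuation by keeping alphanumeric characters only.
        word = ''
        for ch in piece:
            if ch.isalnum():
                word += ch.lower()
        if word != '':
            word_counts[word] = word_counts.get(word, 0) + 1
```

Note that the function returns None and changes the dictionary in place, as the type contract requires.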

common_words:
(dict of {str: int}, int) -> None
The first parameter is the dictionary of word counts as described in count_words and the second is a positive integer N. This function should update the dictionary so that it includes only the most common (highest frequency) words. At most N words should be included in the dictionary. If including all words with some word count would result in a dictionary with more than N words, then none of the words with that word count should be included. (That is, in the case of a tie for the N+1st most common word, omit all of the words in the tie.)
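The tie rule is easier to see in code. The sketch below is one way to express it, not the only one: it looks up the count held by the N+1st most common word and removes every word whose count is not strictly greater than that cutoff.

```python
def common_words(word_counts, n):
    """(dict of {str: int}, int) -> None

    Keep at most n of the most common words in word_counts, dropping
    every word in a tie that would push the total past n.
    """
    counts = sorted(word_counts.values(), reverse=True)
    if len(counts) <= n:
        # Everything already fits.
        return
    # counts[n] is the count of the (n+1)st most common word. Any word
    # with that count or lower cannot be kept without exceeding n.
    cutoff = counts[n]
    for word in list(word_counts):
        if word_counts[word] <= cutoff:
            del word_counts[word]
```

For example, with counts {'a': 5, 'b': 4, 'c': 4, 'd': 3} and N equal to 2, only 'a' survives, because keeping either 'b' or 'c' would require keeping both.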

read_tweets:
(filename) -> dict of {str: list of tweet tuples}
The parameter is the full name of a file. Open the file specified by the parameter, which is formatted as described in the data files section, and read all of the data from it. The keys of the dictionary should be the names of the candidates, and the items in the list associated with each candidate are the tweets they have sent. A tweet tuple should have the form (candidate, tweet text, date, source, favorite count, retweet count). The date, favorite count, and retweet count should be integers, and the rest of the items in the tuple should be strings.
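The reading loop can be sketched as a small state machine. The helper name parse_histories is hypothetical, introduced here so the logic can be tried on a list of lines without a file. The sketch assumes that SOURCE contains no commas (commas in LOCATION are tolerated by indexing from the end of the line) and that blank lines between records can be ignored. Neither assumption comes from the handout.

```python
def parse_histories(lines):
    """Hypothetical helper: build the read_tweets dictionary from an
    iterable of lines, such as an open file."""
    histories = {}
    candidate = None
    header = None       # header fields of the tweet being read, if any
    text_lines = []
    for line in lines:
        line = line.rstrip('\n')
        if header is not None:
            # Inside a tweet: collect TEXT until the sentinel line.
            if line.strip() == '<<<EOT':
                histories[candidate].append(
                    (candidate, '\n'.join(text_lines), int(header[1]),
                     header[-3], int(header[-2]), int(header[-1])))
                header = None
                text_lines = []
            else:
                text_lines.append(line)
        elif line.endswith(':'):
            # A line ending in ':' starts a new candidate history.
            candidate = line[:-1]
            histories[candidate] = []
        elif line.strip() != '':
            # Otherwise this is the comma-separated first line of a tweet.
            header = line.split(',')
    return histories


def read_tweets(filename):
    """(filename) -> dict of {str: list of tweet tuples}"""
    with open(filename) as tweet_file:
        return parse_histories(tweet_file)
```

Keeping the parsing in a helper that accepts any iterable of lines makes it easy to test with a short hand-written list before trying the small data file.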

most_popular:
(dict of {str: list of tweet tuples}, int, int) -> list of str
The first parameter is the dictionary produced by read_tweets. The second and third parameters are dates (expressed in the same int format as the data file) with the second parameter less than or equal to the third parameter. This function should return a list containing the names of the candidates who submitted at least one tweet between the two dates (inclusive of the start and end dates), with candidates ordered from the most to the least popular in that period of time. A candidate's popularity in a time period is the sum of the favorite counts and retweet counts for all tweets issued in that time period.
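A minimal sketch of the popularity calculation follows. It relies on the tuple layout given for read_tweets (date at index 2, favorite count at index 4, retweet count at index 5). How ties in popularity should be ordered is not specified in the handout, so the sketch leaves tied candidates in dictionary order.

```python
def most_popular(histories, start, end):
    """(dict of {str: list of tweet tuples}, int, int) -> list of str

    Return the candidates who tweeted between start and end (inclusive),
    from most to least popular in that period.
    """
    totals = {}
    for candidate in histories:
        for tweet in histories[candidate]:
            if start <= tweet[2] <= end:
                # Popularity is favorites plus retweets.
                totals[candidate] = (totals.get(candidate, 0)
                                     + tweet[4] + tweet[5])
    # Candidates with no tweet in range never enter totals.
    return sorted(totals, key=totals.get, reverse=True)
```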

detect_author:
(dict of {str: list of tweet tuples}, str) -> str
The first parameter is the dictionary produced by read_tweets and the second is a tweet. This function should return the username of the most likely author of that tweet, based on the hashtags they use. If the tweet contains a hashtag that only one of the candidates uses, then the likely author is the candidate that uses that hashtag. If the tweet contains no hashtags or more than one hashtag that are uniquely used by a single candidate, return the string "Unknown." For this function, we suggest creating a helper function that will find all hashtags for the candidates, and another helper function to find all unique hashtags for each candidate.
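The suggested helper structure could look like the sketch below. The helper names are hypothetical, and the sketch commits to one reading of the rules that you should confirm: the result is the string 'Unknown' (without a period) when the tweet's uniquely used hashtags point to no candidate or to more than one candidate. A compact stand-in for extract_hashtags is included only to keep the example self-contained.

```python
def extract_hashtags(tweet):
    """Compact stand-in for the extract_hashtags function described earlier."""
    tags = []
    for i in range(len(tweet)):
        if tweet[i] == '#' and (i == 0 or tweet[i - 1] == ' '):
            j = i + 1
            while j < len(tweet) and tweet[j].isalnum():
                j += 1
            if j > i + 1 and tweet[i + 1:j] not in tags:
                tags.append(tweet[i + 1:j])
    return tags


def hashtags_by_candidate(histories):
    """Hypothetical helper: map each candidate to the set of hashtags used."""
    used = {}
    for candidate in histories:
        used[candidate] = set()
        for tweet in histories[candidate]:
            used[candidate].update(extract_hashtags(tweet[1]))
    return used


def unique_hashtag_owners(used):
    """Hypothetical helper: map each hashtag used by exactly one candidate
    to that candidate."""
    owners = {}
    shared = set()
    for candidate in used:
        for tag in used[candidate]:
            if tag in owners:
                shared.add(tag)
            owners[tag] = candidate
    for tag in shared:
        del owners[tag]
    return owners


def detect_author(histories, tweet):
    """(dict of {str: list of tweet tuples}, str) -> str"""
    owners = unique_hashtag_owners(hashtags_by_candidate(histories))
    authors = set()
    for tag in extract_hashtags(tweet):
        if tag in owners:
            authors.add(owners[tag])
    # Assumed reading: exactly one candidate must be implicated.
    if len(authors) == 1:
        return authors.pop()
    return 'Unknown'
```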

What NOT to Do
Your tweets.py file should not include any calls to print or input. Also, do not include any extra code outside of the function definitions unless it is a constant declaration, helper function, or test code that is guarded by an "if __name__ == '__main__'" block. Do not import any modules other than doctest. Do not call open other than to open the parameter in read_tweets.
