Question

Transcribed Text

Deliverables: You must use functions to modularize your work in a logical way. You should also use exception handling where necessary.

Email Scraper

How do spammers get your email addresses? Marketers use many methods to build their collections of email addresses. Sometimes the websites you sign up at sell your information, including your email. They also have bots that scour the internet and scrape email addresses off of web pages. We are going to write our own simple bot.

Your program will ask the user for a file that contains URLs (web sites). It will load each one and search it for email addresses to scrape.

Python has a module that will help us pull data off of websites. We can pull it down just like text. It obviously won't look like the web page, but it will contain the HTML markup. HTML stands for Hypertext Markup Language. You can view the page source in virtually any browser: right-click on the page whose source you want, and you'll likely see a menu option that says "View Page Source". This is what HTML code looks like, and by scouring the text we can find email addresses in the pages.

urllib module

urllib is the Python library that helps with URLs. You should view the documentation for the module, since you never know when you'll find something useful, but for what we need, it is straightforward.

    import urllib.request  # Import should be done at the top of your program

    request = urllib.request.Request("http://cnn.com")  # First create a request object
    response = urllib.request.urlopen(request)  # Create a response object after we open the request
    page_data = response.read()  # page_data has the text, as bytes
    page_str = page_data.decode('utf-8')  # Convert the byte text to a utf-8 string
    response.close()  # Remember to close the response

You should play with this code and get comfortable using it. Come to think of it, since you'll be calling this with multiple URLs over and over, it might make a great function. HINT. It's a great thing to functionalize. Seriously, you should just write it now. What happens when you pass a bad URL to the request? If it raises an error, you probably want to use our error handling powers to solve that issue.

How can we tell what an email is?

We're going to be looking for portions of text that start with mailto: (including the colon). The email address follows that. How do we know where the email address stops? It stops when you reach any character that is not one of the characters .@&#; or a digit 0-9 or a letter a-z, upper or lower case. This isn't the most resilient way, but it will give you some good practice working with strings.

The strings you are going to get from websites will be extremely large. So making a function that you can pass smaller strings to and experiment with is crucial to debugging and finding errors in a timely and efficient manner.

Encoded Emails

Some email addresses are encoded. If you get an email address that is encoded, then you'll want to parse it and create the real email address.

    webmaster@umk&#099;.edu

This email address is HTML encoded: &#099; stands for the character whose decimal code is 99, which is c. We've already seen that each character corresponds to a decimal number.

    chr(119)  # Returns w
    chr(101)  # Returns e

Our program's goals

We want to write a program that asks the user for a file that has URLs in it, one URL on each line. If the user gives us a file that doesn't exist or can't be opened, then you must be able to handle those errors. Once you have a file, open each URL, get the contents, and find all the email addresses. Once you are done, eliminate the duplicates and ask the user for a file to write the email addresses out to.

Program Specifications

The requirements are below, but an additional requirement has to be observed. There are many tools that can make much of this easier to do. In fact, many of them make it trivially easy. This isn't a course about finding and using libraries and modules, so you'll be stuck using strings and your wits (besides urllib, of course).

However, once you've solved it, you may be interested in looking at 3rd party modules like BeautifulSoup (terrible name). It helps in parsing and working with HTML and XML. Another built-in module that is quite useful is re, or regular expressions. Spending some time learning regular expressions at some point will pay off for you. Regular expressions are extremely powerful, flexible and useful for validating data and finding matching strings. Another function that is useful for unescaping encoded email addresses like the one above is html.unescape, from the built-in html module. Learning to do these things by hand will pay off later when you don't have a tool that can do it for you. These are the skills that will allow you to build your own solutions.

In summary, you are not allowed to use
● Beautiful Soup
● re (regular expressions)
● cgi
● Any imported module other than urllib

Sample Program

    Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:43:06) [MSC v.1600 32 bit (Intel)] on win32
    Type "copyright", "credits" or "license()" for more information.
    >>> ================================ RESTART ================================
    >>>
    Welcome to email scraper!
    Enter the filename containing URLs to read ==> invalid.txt
    Could not open the file invalid.txt. It doesn't exist.
    Enter the filename containing URLs to read ==> subdir
    Could not open the file subdir. There was an IOError
    Enter the filename containing URLs to read ==> emails.txt
    Enter a file to save the emails to ==> output.txt
    Do you want to run this application again? Y/YES/N/NO ==> e
    You must enter only Y/YES/N or NO only.
    Do you want to run this application again? Y/YES/N/NO ==> y
    Welcome to email scraper!
    Enter the filename containing URLs to read ==> emails2.txt
    We did not find any emails in the provided urls to save
    Do you want to run this application again? Y/YES/N/NO ==> y
    Welcome to email scraper!
    Enter the filename containing URLs to read ==> emails3.txt
    sce.umkc.edu does not seem to be a valid url
    invalid_url does not seem to be a valid url
    We did not find any emails in the provided urls to save
    Do you want to run this application again? Y/YES/N/NO ==> n
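
As a rough illustration of two pieces the assignment describes (loading a URL with error handling, and scanning text for mailto: addresses using only string operations), here is a minimal sketch. The names fetch_page, find_emails and EMAIL_CHARS are hypothetical and not part of the assignment, and returning None for a bad URL is just one possible design choice.

```python
import urllib.request

# Characters allowed inside an address, following the rule stated in the
# assignment: . @ & # ; digits, and upper or lower case letters.
EMAIL_CHARS = set(".@&#;0123456789"
                  "abcdefghijklmnopqrstuvwxyz"
                  "ABCDEFGHIJKLMNOPQRSTUVWXYZ")


def fetch_page(url):
    """Return the page at url as a string, or None if it cannot be loaded.

    Hypothetical helper: returning None is an assumption of this sketch.
    """
    try:
        request = urllib.request.Request(url)
        response = urllib.request.urlopen(request)
        try:
            page_data = response.read()
        finally:
            response.close()
        return page_data.decode('utf-8')
    except (ValueError, OSError):
        # ValueError covers malformed URLs and undecodable bytes;
        # urllib.error.URLError is a subclass of OSError.
        return None


def find_emails(text):
    """Return every address that follows 'mailto:' in text, in order found."""
    marker = "mailto:"
    emails = []
    start = text.find(marker)
    while start != -1:
        begin = start + len(marker)
        end = begin
        # Advance until the first character that cannot be part of an address.
        while end < len(text) and text[end] in EMAIL_CHARS:
            end += 1
        if end > begin:
            emails.append(text[begin:end])
        start = text.find(marker, end)
    return emails
```

Because find_emails takes any string, it can be tried on a short hand-written snippet of HTML before it is pointed at a full page, which is the kind of small-string experimenting the assignment recommends. Duplicates could then be removed with something like set(emails) before writing the output file.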

Solution Preview


import urllib.request

# Method to get emails from a URL
def getEmails(url):
    try:
        # Obtain string from HTML page
        request = urllib.request.Request(url)
        response = urllib.request.urlopen(request)
        page_data = response.read()
        page_str = page_data.decode('utf-8')
        response.close()
        # Decode all printable ASCII values (codes 32 to 126)
        for i in range(32, 127):
            # Pad two-digit codes with a leading zero, e.g. 99 -> '099'
            if i < 100:
                istr = '0' + str(i)
            else:
                istr = str(i)...
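
The preview above is cut off, so what follows is not the purchased solution's code. It is a separate, minimal sketch of how the decimal HTML encoding described in the assignment (for example &#099; for c) might be decoded by hand with string operations and chr. The function name decode_entities is hypothetical.

```python
def decode_entities(text):
    """Replace each decimal entity such as &#099; with its character.

    Anything that does not look like '&#' + digits + ';' is left unchanged.
    """
    result = []
    i = 0
    while i < len(text):
        if text.startswith("&#", i):
            end = text.find(";", i + 2)
            digits = text[i + 2:end] if end != -1 else ""
            # Only decode well-formed codes that chr() can accept.
            if digits.isdecimal() and int(digits) < 0x110000:
                result.append(chr(int(digits)))
                i = end + 1
                continue
        result.append(text[i])
        i += 1
    return "".join(result)
```

Called on the example address from the assignment, decode_entities("webmaster@umk&#099;.edu") gives "webmaster@umkc.edu". Unlike the loop in the preview, this version scans the string once instead of trying every printable code in turn; that is a design choice of this sketch, not a requirement.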
