Question

Instructions
We use a script that extracts your answers by looking for cells in between the cells containing the exercise statements. So you MUST add cells in between the exercise statements and add answers within them and MUST NOT modify the existing cells, particularly not the problem statement.

To make markdown, please switch the cell type to markdown (from code) - you can hit 'm' when you are in command mode - and use the markdown language.

Web scraping the Aggie
In this assignment, you'll scrape text from The California Aggie and then analyze the text. The Aggie is organized by category into article lists. For example, there's a Campus News list, Arts & Culture list, and Sports list. Notice that each list has multiple pages, with a maximum of 15 articles per page.

The goal of exercises 1.1 - 1.3 is to scrape articles from the Aggie for analysis in exercise 1.4.

Exercise 1.1. Write a function that extracts all of the links to articles in an Aggie article list. The function should:
    -Have a parameter url for the URL of the article list.
    -Have a parameter page for the number of pages to fetch links from. The default should be 1.
    -Return a list of article URLs (each URL should be a string).

Test your function on 2-3 different categories to make sure it works.

Hints:
-Be polite to The Aggie and save time by setting up requests_cache before you write your function.
-Start by getting your function to work for just 1 page. Once that works, have your function call itself to get additional pages.
-You can use lxml.html or BeautifulSoup to scrape HTML. Choose one and use it throughout the entire assignment.
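
The "call itself" hint can be sketched as follows. This is only a sketch under stated assumptions, not the expected solution. It assumes article links sit inside `<h2>` headings and that page N of a list lives at `<list url>/page/N/`; neither is confirmed by the assignment, so inspect the real pages first. To stay runnable without extra installs it swaps in the standard library's `html.parser` for lxml.html or BeautifulSoup, and the `fetch` parameter and the `fetch_html` and `HeadlineLinkParser` names are hypothetical additions for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class HeadlineLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <h2> elements.

    ASSUMPTION: article links sit inside <h2> headings. Check the real
    page source and adjust the tag name if the markup differs.
    """

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_heading = True
        elif tag == "a" and self.in_heading:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_heading = False


def fetch_html(url):
    """Download a page and return its HTML as text (no caching here)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def get_article_links(url, page=1, fetch=fetch_html):
    """Return article URLs from the first `page` pages of an article list."""
    if page < 1:
        return []  # base case: no pages left to fetch
    # ASSUMPTION: page N > 1 of a list lives at <list url>/page/N/.
    page_url = url if page == 1 else f"{url.rstrip('/')}/page/{page}/"
    parser = HeadlineLinkParser()
    parser.feed(fetch(page_url))
    # Recurse for the earlier pages and put their links first,
    # so the result stays in page order.
    return get_article_links(url, page - 1, fetch) + parser.links
```

Passing `fetch` as a parameter lets the function be tested on canned HTML without touching the network. In the notebook, one would typically call `requests_cache.install_cache(...)` once and fetch pages with `requests.get`, as the first hint suggests.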

Exercise 1.2. Write a function that extracts the title, text, and author of an Aggie article. The function should:
    -Have a parameter url for the URL of the article.
    -For the author, extract the "Written By" line that appears at the end of most articles. You don't have to extract the author's name from this line.
    -Return a dictionary with keys "url", "title", "text", and "author". The values for these should be the article url, title, text, and author, respectively.
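
One possible shape for the parsing step is sketched below, working on HTML that has already been downloaded. It assumes the title is the first `<h1>` on the page and the body is made of `<p>` elements; both are guesses about the markup, not facts from the assignment. `parse_article` and `ArticleParser` are hypothetical names, and the standard library's `html.parser` stands in for lxml.html or BeautifulSoup so the sketch runs as-is.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect the first <h1> as the title and every <p> as a paragraph.

    ASSUMPTION: the article title is the first <h1> and the body text
    lives in <p> elements. Adjust after inspecting a real article.
    """

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.paragraphs = []
        self._tag = None          # "h1" or "p" while inside one, else None
        self._seen_title = False  # only the first <h1> counts as the title

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self._seen_title:
            self._tag = "h1"
        elif tag == "p":
            self._tag = "p"
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._tag:
            if tag == "h1":
                self._seen_title = True
            self._tag = None

    def handle_data(self, data):
        if self._tag == "h1":
            self.title_parts.append(data)
        elif self._tag == "p":
            self.paragraphs[-1] += data


def parse_article(url, html):
    """Build the {"url", "title", "text", "author"} dictionary from HTML."""
    parser = ArticleParser()
    parser.feed(html)
    paragraphs = [p.strip() for p in parser.paragraphs if p.strip()]
    author = None  # stays None when an article has no "Written By" line
    body = []
    for paragraph in paragraphs:
        if paragraph.lower().startswith("written by"):
            author = paragraph  # keep the whole line, as the exercise allows
        else:
            body.append(paragraph)
    return {
        "url": url,
        "title": "".join(parser.title_parts).strip(),
        "text": "\n".join(body),
        "author": author,
    }
```

Leaving the "Written By" line out of `"text"` is a design choice: it keeps author names from showing up as frequent words in the later text analysis. Since only most articles carry the line, `"author"` may be `None`.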

Exercise 1.3. Use your functions from exercises 1.1 and 1.2 to get a data frame of 60 Campus News articles and a data frame of 60 City News articles. Add a column to each that indicates the category, then combine them into one big data frame.

The "text" column of this data frame will be your corpus for natural language processing in exercise 1.4.
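
With at most 15 articles per page, 60 articles per category means fetching at least four pages of each list. The combining step might look like the sketch below, which assumes pandas is available and that each article is a dictionary with the keys from exercise 1.2. `build_corpus` is a hypothetical helper name, and the category labels are whatever strings the caller passes in.

```python
import pandas as pd


def build_corpus(articles_by_category):
    """Combine per-category article dictionaries into one data frame.

    `articles_by_category` maps a category label to a list of
    dictionaries with the keys "url", "title", "text", and "author".
    """
    frames = []
    for category, articles in articles_by_category.items():
        frame = pd.DataFrame(articles, columns=["url", "title", "text", "author"])
        frame["category"] = category  # same label on every row of this frame
        frames.append(frame)
    # ignore_index=True renumbers the rows 0..n-1 in the combined frame.
    return pd.concat(frames, ignore_index=True)
```

In the notebook, the two lists of dictionaries would come from applying the exercise 1.2 function to the first 60 links returned by the exercise 1.1 function for each category.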

Exercise 1.4. Use the Aggie corpus to answer the following questions. Use plots to support your analysis.

    -What topics does the Aggie cover the most? Do city articles typically cover different topics than campus articles?
    -What are the titles of the top 3 pairs of most similar articles? Examine each pair of articles. What words do they have in common?
    -Do you think this corpus is representative of the Aggie? Why or why not? What kinds of inference can this corpus support? Explain your reasoning.
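
For the similarity question, one common approach (not necessarily the one the course expects) is to weight words by TF-IDF and compare articles by cosine similarity. The sketch below does this in plain Python so every step is visible; a library such as scikit-learn's `TfidfVectorizer` would normally replace it. The tokenizer is deliberately crude, no stop words are removed, and all function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(texts):
    """Return one unit-length {word: weight} dictionary per text."""
    counts = [Counter(tokenize(text)) for text in texts]
    n_docs = len(texts)
    # Document frequency: in how many texts each word appears.
    doc_freq = Counter(word for count in counts for word in count)
    vectors = []
    for count in counts:
        # Smoothed IDF, so a word found in every text still gets weight 1.
        vec = {
            word: tf * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, tf in count.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({word: w / norm for word, w in vec.items()})
    return vectors


def most_similar_pairs(texts, top=3):
    """Return the `top` pairs as (score, i, j, shared_words) tuples."""
    vectors = tfidf_vectors(texts)
    scored = []
    for i, j in combinations(range(len(texts)), 2):
        shared = vectors[i].keys() & vectors[j].keys()
        # For unit-length vectors the dot product is the cosine similarity.
        score = sum(vectors[i][w] * vectors[j][w] for w in shared)
        scored.append((score, i, j, sorted(shared)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top]
```

The indices `i` and `j` can be used to look up the matching titles in the data frame, and `shared_words` gives a starting point for the "words in common" part of the question.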
