Question

Transcribed Text

Lab 4 - Data

In [1]: %matplotlib inline

The special command above will make all the matplotlib images appear in the notebook.

Directions

Failure to follow the directions will result in a "0".

This assignment has many files. You should include in your zip file:

1. main.ipynb - this file and your queries/reporting code.
2. hurricanes.py - your preprocessing and data writing file.
3. hurricanes.html - the local copy of the Wikipedia page.
4. hurricanes.db - the SQLite database you create.
5. hurricanes.sql - the DDL commands to create your database.

1. Data Science is as much about what you write (communicating) as the code you execute (researching). In many places, you will be required to execute code and discuss both the purpose and the result. Additionally, Data Science is about reproducibility and transparency. This includes good communication with your team and possibly with yourself. Therefore, you must show all work.
2. Avail yourself of the Markdown/Codecell nature of the notebook. If you don't know about Markdown, look it up. Your notebooks should not look like ransom notes. Don't make everything bold. Clearly indicate what question you are answering.
3. Submit a cleanly executed notebook. The first code cell should say In [1] and each successive code cell should increase by 1 throughout the notebook.

Individual Submission

Getting and Storing Data

This Lab is about acquiring, cleaning and storing data (commonly known as "wrangling") as well as doing a little bit of analysis.

Basic Outline

1. Use curl or wget to obtain a local copy of the following web page: Atlantic Hurricane Season ( https://en.wikipedia.org/wiki/Atlantic_hurricane_season ). You can also just visit the page and save the HTML. Include this in your submission as hurricanes.html. This is important. The page was edited during the module and different people got different answers at different times. You only need to be correct with respect to your hurricanes.html file.
2. Create the SQLite3 database and initialize the tables. Include these files as hurricanes.db and hurricanes.sql. We are using SQLite3 not because it is the best database in the world but because its database files are shareable.
3. Use Beautiful Soup 4 and Python to parse the HTML file into a usable dataset and write the data to the database. This code should be in a file hurricanes.py and included in your submission. Your code should have copious comments.
4. Write this data set to a SQLite3 database called hurricanes.db and include it in your submission.
5. Run the requested queries against the data set (see below). The results should be nicely formatted.

Although Wikipedia has an API, I do not want you to use it for this assignment.

Learning Objectives

This assignment has several learning objectives:

- create a database and tables in one database type, SQLite
- use BeautifulSoup to parse a web page
- use Regular Expressions to parse text to extract numbers
- write data to a database using SQL
- read data from a database using SQL

You can reference the Module links to various tutorials on these topics.

Details

The data is contained in many separate HTML tables. You will need to look at the raw HTML to determine how everything is laid out. Make sure you decompose the problem into nice, neat, focused functions. You should have a function that processes a table, one that processes a row, and one that processes something that may be numeric.

You only need to get the data from the tables starting in the 1850s but note that not all years have the same data. You only need to save the following columns. The name in parentheses is the name the column should have in the database table.

- Year (year)
- Number of tropical storms (tropical_storms)
- Number of hurricanes (hurricanes)
- Number of Major Hurricanes (major_hurricanes)
- Deaths (deaths)
- Damage (damage)
- Notes (notes)

Note that "Damage" doesn't start until 1900s and "Notes" was added in 1880s. "Strongest Storm" should be added to the Notes column (even in years that didn't have Notes) as should "Retired Storms". The name of the database table should be atlantic_hurricanes. The name of the table file (SQLite3 uses a file) should be hurricanes.db (who knows...you might need to add Pacific storms someday).

There are a number of parsing problems which will most likely require regular expressions. First, the Deaths column has numbers that include commas and entries that are not numbers (Unknown and None). How should you code Unknown and None so that answers are not misleading but queries are still fairly straightforward to write?

Similarly, Damages has numbers with commas, currency signs and different amount words (millions, billions). How will you normalize all of these so that a query can compare them? You may need regular expressions.

Additionally, the way that Tropical Storms are accounted for seems to change mysteriously. Looking over the data, it's not immediately apparent when the interpretation should change. 1850s, 1860s definitely but 1870s? Not sure. It could just be a coincidence that there were never more hurricanes than tropical storms which seems to be the norm but see, for example, 1975. Welcome to Data Science!

You should put your parsing code in hurricanes.py. None of it should be in this file. Imagine this file is going to be the report you give to your boss.

Documentation

Any time you run into a problem where you have to decide what to do--how to solve the problem or treat the values--document it here.

Hurricanes.db

What is the function of hurricanes.db in this assignment?

Queries

When you are done, you must write and execute the following queries against your database. Those queries should be run from this notebook. Find the documentation for using SQLite3 from Python (the library is already included). You should never output raw Python data structures; instead, you need to output well-formatted tables. You may need to look around for a library to help you or write your own formatting code. Pandas is one possibility. However, I want you to use raw SQL for your queries, so if you use Pandas, use it only for the formatting of query results (don't load the data into Pandas and use Pandas/Python to query the data). Imagine that you were somewhere you couldn't get Pandas...you'd need to know how to do this without it.

Write the most general query possible. Never assume that you are going to get only one result back (i.e., don't assume there won't be ties). The query result should be in a nicely formatted table; don't show raw data structures to your boss or manager. Additionally, don't just run the query. Having gotten an answer, add a textual description of the result to the question. Data Science is not about running code; code is a means to an end. The end is the communication of results. We never just run code in this class.

In [ ]: # imports

1. For the 1920s, list the years by number of tropical storms, then hurricanes.
In [ ]:
2. What year had the most tropical storms?
In [ ]:
3. What year had the most major hurricanes?
In [ ]:
4. What year had the most deaths?
In [ ]:
5. What year had the most damage (not inflation adjusted)?
In [ ]:

Things to think about

1. What is the granularity of this data? (Are the rows the most specific observation possible?)
2. What if this data were contained in worksheets in an Excel file? Find a Python library or libraries that work with Excel spreadsheets.
3. Each section links to details about each hurricane season. Review each Season's page and discuss strategies for extracting the information for every hurricane.
4. Hurricane tracking maps were recently added. How would you get and store those images?
5. Damages are not inflation adjusted. How would you go about enriching your data with inflation adjusted dollars? Where should this additional data be stored and how would it be used?

Solution Preview


{
"cells": [
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab 4 - Data"
   ]
},
{
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
},
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The special command above will make all the `matplotlib` images appear in the notebook."
   ]
},
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Directions\n",
    "\n",
    "**Failure to follow the directions will result in a \"0\"**\n",
    "\n"
   ]
},
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### General Instructions\n",
    "\n",
    "1. You will be submitting your assignment to Blackboard.\n",
    "    \n",
    "This assignment has many files. You should include in your zip file:\n",
    "\n",
    "    1. main.ipynb - this file and your queries/reporting code.\n",
    "    2. hurricanes.py - your preprocessing and data writing file.\n",
    "    3. hurricanes.html - the local copy of the Wikipedia page.\n",
    "    4. hurricanes.db - the SQLite database you create.\n",
    "    5. hurricanes.sql - the DDL commands to create your database.\n",
    "    \n",
    "Put all of these files in a directory/folder named main and zip it (only use Zip compression, nothing else). Submit your zip file to Blackboard.\n",
    "    \n",
    "2. Data Science is as much about what you write (communicating) as the code you execute (researching). In many places, you will be required to execute code and discuss both the purpose and the result. Additionally, Data Science is about reproducibility and transparency. This includes good communication with your team and possibly with yourself. Therefore, you must show **all** work.\n",
    "\n",
    "3. Avail yourself of the Markdown/Codecell nature of the notebook. If you don't know about Markdown, look it up. Your notebooks should not look like ransom notes. Don't make everything bold. Clearly indicate what question you are answering.\n",
    "\n",
    "4. Submit a cleanly executed notebook. The first code cell should say `In [1]` and each successive code cell should increase by 1 throughout the notebook."
   ]
},
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Individual Submission\n",
    "\n",
    "## Getting and Storing Data\n",
    "\n",
    "This Lab is about acquiring, cleaning and storing data (commonly known as \"wrangling\") as well as doing a little bit of analysis.\n",
    "\n",
    "### Basic Outline\n",
    "\n",
    "1. Use `curl` or `wget` to obtain a local copy of the following web page: Atlantic Hurricane Season ( https://en.wikipedia.org/wiki/Atlantic_hurricane_season ). You can also just visit the page and save the HTML. **Include this in your submission as `hurricanes.html`**. This is important. In Spring 2016, the page was edited during the module and different people got different answers at different times. You only need to be correct with respect to your `hurricanes.html` file.\n",
    "2. Create the SQLite3 database and initialize the tables. **Include these files as `hurricanes.db` and `hurricanes.sql`**. We are using SQLite3 not because it is the best database in the world but because its database files are shareable.\n",
    "3. Use Beautiful Soup 4 and Python to parse the HTML file into a usable dataset and write the data to the database. **This code should be in a file `hurricanes.py` and included in your submission.** Your code should have copious comments.\n",
    "4. Write this data set to a SQLite3 database called `hurricanes.db` and **include it in your submission**.\n",
    "5. Run the requested queries against the data set (**see below**). The results should be **nicely formatted**.\n",
    "\n",
    "Although Wikipedia has an API, I do not want you to use it for this assignment."
   ]
},
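{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hedged illustration of step 2 above, the sketch below shows one way the DDL in `hurricanes.sql` could look and how it might be applied from Python. The column names come from the Details section; the column types, and the idea of storing normalized whole-dollar damage, are assumptions rather than requirements."
   ]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: create hurricanes.db and the atlantic_hurricanes table.\n",
    "# Column names follow the assignment; the types chosen here are an assumption.\n",
    "import sqlite3\n",
    "\n",
    "DDL = \"\"\"\n",
    "CREATE TABLE IF NOT EXISTS atlantic_hurricanes (\n",
    "    year             INTEGER PRIMARY KEY,\n",
    "    tropical_storms  INTEGER,\n",
    "    hurricanes       INTEGER,\n",
    "    major_hurricanes INTEGER,\n",
    "    deaths           INTEGER,  -- NULL where the page says Unknown (an assumption)\n",
    "    damage           INTEGER,  -- whole US dollars after normalization (an assumption)\n",
    "    notes            TEXT\n",
    ");\n",
    "\"\"\"\n",
    "\n",
    "connection = sqlite3.connect(\"hurricanes.db\")\n",
    "connection.executescript(DDL)\n",
    "connection.commit()\n",
    "connection.close()"
   ]
},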
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Learning Objectives\n",
    "\n",
    "This assignment has several learning objectives:\n",
    "\n",
    "* create a database and tables in one database type, SQLite\n",
    "* use BeautifulSoup to parse a web page.\n",
    "* use Regular Expressions to parse text to extract numbers.\n",
    "* write data to a database using SQL.\n",
    "* read data from a database using SQL.\n",
    "\n",
    "You can reference the Module links to various tutorials on these topics."
   ]
},
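{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the BeautifulSoup objective concrete, here is a minimal sketch of how `hurricanes.py` might locate the per-decade tables in the saved `hurricanes.html`. The `wikitable` class is an assumption about the page's markup; check it against your own copy of the HTML before relying on it."
   ]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: find the decade tables in the saved page with BeautifulSoup 4.\n",
    "# The \"wikitable\" class is an assumption about Wikipedia's markup, not a given.\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "with open(\"hurricanes.html\", encoding=\"utf-8\") as handle:\n",
    "    soup = BeautifulSoup(handle, \"html.parser\")\n",
    "\n",
    "tables = soup.find_all(\"table\", class_=\"wikitable\")\n",
    "for table in tables:\n",
    "    headers = [cell.get_text(strip=True) for cell in table.find_all(\"th\")]\n",
    "    print(headers[:7])  # inspect the layout before writing the row processor"
   ]
},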
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Details\n",
    "\n",
    "The data is contained in many separate HTML tables. You will need to look at the raw HTML to determine how everything is laid out. Make sure you decompose the problem into nice, neat, focused functions. You should have a function that processes a table, one that processes a row, and one that processes something that may be numeric.\n",
    "\n",
    "You only need to get the data from the tables starting in the 1850s but note that not all years have the same data. You only need to save the following columns. The name in parentheses is the name the column should have in the database table.\n",
    "\n",
    "- Year (`year`)\n",
    "- Number of tropical storms (`tropical_storms`)\n",
    "- Number of hurricanes (`hurricanes`)\n",
    "- Number of Major Hurricanes (`major_hurricanes`)\n",
    "- Deaths (`deaths`)\n",
    "- Damage (`damage`)\n",
    "- Notes (`notes`)\n",
    "\n",
    "Note that \"Damage\" doesn't start until 1900s and \"Notes\" was added in 1880s. \"Strongest Storm\" should be added to the Notes column (even in years that didn't have Notes) as should \"Retired Storms\". The name of the database table should be `atlantic_hurricanes`. The name of the table file (SQLite3 uses a file) should be `hurricanes.db` (who knows...you might need to add Pacific storms someday).\n",
    "\n",
    "There are a number of parsing problems which will most likely require regular expressions. First, the Deaths column has numbers that include commas and entries that are not numbers (Unknown and None). How should you code Unknown and None so that answers are not misleading but queries are still fairly straightforward to write?\n",
    "\n",
    "Similarly, Damages has numbers with commas, currency signs and different amount words (millions, billions). How will you normalize all of these so that a query can compare them? You may need regular expressions.\n",
    "\n",
    "Additionally, the way that Tropical Storms are accounted for seems to change mysteriously. Looking over the data, it's not immediately apparent when the interpretation should change. 1850s, 1860s definitely but 1870s? Not sure. It could just be a coincidence that there were never more hurricanes than tropical storms which seems to be the norm but see, for example, 1975. Welcome to Data Science!\n",
    "\n",
    "You should put your parsing code in `hurricanes.py`. None of it should be in this file. Imagine this file is going to be the report you give to your boss."
   ]
},
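{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the \"maybe numeric\" helpers described above, assuming one possible set of coding decisions: `Unknown` becomes NULL, `None` becomes 0, and damage amounts are expanded to whole dollars. These are illustrative choices, not the only defensible ones, and whatever you choose should be recorded in the Documentation section."
   ]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: regular-expression helpers for the Deaths and Damage columns.\n",
    "# The Unknown -> NULL and None -> 0 encodings are assumptions to be documented.\n",
    "import re\n",
    "\n",
    "def parse_deaths(text):\n",
    "    \"\"\"Return an int, 0 for 'None', or None (SQL NULL) for 'Unknown'.\"\"\"\n",
    "    text = text.strip()\n",
    "    if re.search(r\"unknown\", text, re.IGNORECASE):\n",
    "        return None\n",
    "    if re.search(r\"none\", text, re.IGNORECASE):\n",
    "        return 0\n",
    "    digits = re.sub(r\"[^\\d]\", \"\", text)  # drop commas, footnote markers, etc.\n",
    "    return int(digits) if digits else None\n",
    "\n",
    "def parse_damage(text):\n",
    "    \"\"\"Normalize strings like '$7.1 billion' or '1,250' to whole dollars.\"\"\"\n",
    "    text = text.strip()\n",
    "    if not text or re.search(r\"unknown|none\", text, re.IGNORECASE):\n",
    "        return None\n",
    "    multiplier = 1\n",
    "    if re.search(r\"billion\", text, re.IGNORECASE):\n",
    "        multiplier = 1_000_000_000\n",
    "    elif re.search(r\"million\", text, re.IGNORECASE):\n",
    "        multiplier = 1_000_000\n",
    "    number = re.sub(r\"[^\\d.]\", \"\", text)  # strip '$', commas and words (assumes one amount per cell)\n",
    "    return round(float(number) * multiplier) if number else None"
   ]
},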
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Documentation\n",
    "\n",
    "Any time you run into a problem where you have to decide what to do--how to solve the problem or treat the values--document it here."
   ]
},
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* I used the following command to execute the DDL script and create the database: `sqlite3 hurricanes.db \".read hurricanes.sql\"`\n",
    "* Problems with parsing Damage: replaced 'million' with six zeros and 'billion' with nine zeros."
   ]
},
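{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Building on the decisions above, here is a hedged sketch of the writing step that `hurricanes.py` would perform once each year has been reduced to the seven required columns. The `rows` list is a hypothetical placeholder for the output of the table/row processors; it is not real data."
   ]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: write parsed rows into atlantic_hurricanes.\n",
    "# `rows` is a hypothetical placeholder; hurricanes.py would fill it with\n",
    "# (year, tropical_storms, hurricanes, major_hurricanes, deaths, damage, notes) tuples.\n",
    "import sqlite3\n",
    "\n",
    "rows = []  # produced by the table/row processing functions in hurricanes.py\n",
    "\n",
    "connection = sqlite3.connect(\"hurricanes.db\")\n",
    "connection.executemany(\n",
    "    \"INSERT OR REPLACE INTO atlantic_hurricanes \"\n",
    "    \"(year, tropical_storms, hurricanes, major_hurricanes, deaths, damage, notes) \"\n",
    "    \"VALUES (?, ?, ?, ?, ?, ?, ?)\",\n",
    "    rows,\n",
    ")\n",
    "connection.commit()\n",
    "connection.close()"
   ]
},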
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hurricanes.db\n",
    "\n",
    "What is the *function* of `hurricanes.db` in this assignment?"
   ]
},
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* We use a SQLite database named `hurricanes.db` to store the data obtained by parsing the Wikipedia page. By storing the information in a database, we can easily access, manage, and update it, and we can run the required SQL queries against it from this notebook."
   ]
},
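{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before the graded queries, a hedged sketch of the pattern used below: run raw SQL through the built-in `sqlite3` module and print the result as a simple aligned text table, without loading the data into Pandas. The query shown is only an illustration of the pattern, not one of the required queries."
   ]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: run a raw SQL query and print the result as a plain text table.\n",
    "import sqlite3\n",
    "\n",
    "def run_query(sql, parameters=()):\n",
    "    \"\"\"Execute a query against hurricanes.db and print a simple aligned table.\"\"\"\n",
    "    with sqlite3.connect(\"hurricanes.db\") as connection:\n",
    "        cursor = connection.execute(sql, parameters)\n",
    "        headers = [description[0] for description in cursor.description]\n",
    "        body = [[str(value) for value in row] for row in cursor.fetchall()]\n",
    "    table = [headers] + body\n",
    "    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]\n",
    "    for row in table:\n",
    "        print(\"  \".join(cell.ljust(width) for cell, width in zip(row, widths)))\n",
    "\n",
    "# Illustrative pattern only; the required queries follow in the cells below.\n",
    "run_query(\"SELECT year, tropical_storms, hurricanes FROM atlantic_hurricanes LIMIT 5\")"
   ]
},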
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Queries\n",
    "\n",
    "When you are done, you must write and execute the following queries against your database. Those queries should be run from this notebook. Find the documentation for using SQLite3 from Python (the library is already included). You should never output raw Python data structures; instead, you need to output well-formatted tables. You may need to look around for a library to help you or write your own formatting code. `Pandas` is one possibility.\n",
    "\n",
    "However, I want you to use raw SQL for your queries so if you use `Pandas` use it only for the formatting of query results (**don't load the data into Pandas and use Pandas/Python to query the data**). Imagine that you were somewhere you couldn't get Pandas...you'd need to know how to do this without it.\n",
    "\n",...