Question

Transcribed Text

Lab 4 - Data

In [1]: %matplotlib inline

The special command above will make all the matplotlib images appear in the notebook.

Directions

Failure to follow the directions will result in a "0".

This assignment has many files. You should include in your zip file:

1. main.ipynb - this file and your queries/reporting code.
2. hurricanes.py - your preprocessing and data writing file.
3. hurricanes.html - the local copy of the Wikipedia page.
4. hurricanes.db - the SQLite database you create.
5. hurricanes.sql - the DDL commands to create your database.

1. Data Science is as much about what you write (communicating) as the code you execute (researching). In many places, you will be required to execute code and discuss both the purpose and the result. Additionally, Data Science is about reproducibility and transparency. This includes good communication with your team and possibly with yourself. Therefore, you must show all work.
2. Avail yourself of the Markdown/Codecell nature of the notebook. If you don't know about Markdown, look it up. Your notebooks should not look like ransom notes. Don't make everything bold. Clearly indicate what question you are answering.
3. Submit a cleanly executed notebook. The first code cell should say In [1] and each successive code cell should increase by 1 throughout the notebook.

Individual Submission

Getting and Storing Data

This Lab is about acquiring, cleaning and storing data (commonly known as "wrangling") as well as doing a little bit of analysis.

Basic Outline

1. Use curl or wget to obtain a local copy of the following web page: Atlantic Hurricane Season ( https://en.wikipedia.org/wiki/Atlantic_hurricane_season ). You can also just visit the page and save the HTML. Include this in your submission as hurricanes.html. This is important. The page was edited during the module and different people got different answers at different times. You only need to be correct with respect to your hurricanes.html file.
2. Create the SQLite3 database and initialize the tables. Include these files as hurricanes.db and hurricanes.sql. We are using SQLite3 not because it is the best database in the world but because its database files are shareable.
3. Use Beautiful Soup 4 and Python to parse the HTML file into a usable dataset and write the data to the database. This code should be in a file hurricanes.py and included in your submission. Your code should have copious comments.
4. Write this data set to a SQLite3 database called hurricanes.db and include it in your submission.
5. Run the requested queries against the data set (see below). The results should be nicely formatted.

Although Wikipedia has an API, I do not want you to use it for this assignment.

Learning Objectives

This assignment has several learning objectives:

- create a database and tables in one database type, SQLite
- use BeautifulSoup to parse a web page
- use Regular Expressions to parse text to extract numbers
- write data to a database using SQL
- read data from a database using SQL

You can reference the Module links to various tutorials on these topics.

Details

The data is contained in many separate HTML tables. You will need to look at the raw HTML to determine how everything is laid out. Make sure you decompose the problem into nice, neat, focused functions. You should have a function that processes a table, one that processes a row, and one that processes something that may be numeric.

You only need to get the data from the tables starting in the 1850s but note that not all years have the same data. You only need to save the following columns. The name in parentheses is the name the column should have in the database table.

- Year (year)
- Number of tropical storms (tropical_storms)
- Number of hurricanes (hurricanes)
- Number of Major Hurricanes (major_hurricanes)
- Deaths (deaths)
- Damage (damage)
- Notes (notes)

Note that "Damage" doesn't start until 1900s and "Notes" was added in 1880s. "Strongest Storm" should be added to the Notes column (even in years that didn't have Notes) as should "Retired Storms". The name of the database table should be atlantic_hurricanes. The name of the table file (SQLite3 uses a file) should be hurricanes.db (who knows...you might need to add Pacific storms someday).

There are a number of parsing problems which will most likely require regular expressions. First, the Deaths column has numbers that include commas and entries that are not numbers (Unknown and None). How should you code Unknown and None so that answers are not misleading but queries are still fairly straightforward to write?

Similarly, Damages has numbers with commas, currency signs and different amount words (millions, billions). How will you normalize all of these so that a query can compare them? You may need regular expressions.

Additionally, the way that Tropical Storms are accounted for seems to change mysteriously. Looking over the data, it's not immediately apparent when the interpretation should change. 1850s, 1860s definitely but 1870s? Not sure. It could just be a coincidence that there were never more hurricanes than tropical storms which seems to be the norm but see, for example, 1975. Welcome to Data Science!

You should put your parsing code in hurricanes.py. None of it should be in this file. Imagine this file is going to be the report you give to your boss.

Documentation

Any time you run into a problem where you have to decide what to do--how to solve the problem or treat the values--document it here.

Hurricanes.db

What is the function of hurricanes.db in this assignment?

Queries

When you are done, you must write and execute the following queries against your database. Those queries should be run from this notebook. Find the documentation for using SQLite3 from Python (the library is already included). You should never output raw Python data structures; instead, you need to output well-formatted tables. You may need to look around for a library to help you or write your own formatting code. Pandas is one possibility. However, I want you to use raw SQL for your queries, so if you use Pandas, use it only for the formatting of query results (don't load the data into Pandas and use Pandas/Python to query the data). Imagine that you were somewhere you couldn't get Pandas...you'd need to know how to do this without it.

Write the most general query possible. Never assume that you are going to get only one result back (i.e., don't assume there won't be ties). The query result should be in a nicely formatted table; don't show raw data structures to your boss or manager. Additionally, don't just run the query. Having gotten an answer, add a textual description of the result to the question. Data Science is not about running code; code is a means to an end. The end is the communication of results. We never just run code in this class.

In [ ]: # imports

1. For the 1920s, list the years by number of tropical storms, then hurricanes.
In [ ]:
2. What year had the most tropical storms?
In [ ]:
3. What year had the most major hurricanes?
In [ ]:
4. What year had the most deaths?
In [ ]:
5. What year had the most damage (not inflation adjusted)?
In [ ]:

Things to think about

1. What is the granularity of this data? (Are the rows the most specific observation possible?)
2. What if this data were contained in worksheets in an Excel file? Find a Python library or libraries that work with Excel spreadsheets.
3. Each section links to details about each hurricane season. Review each Season's page and discuss strategies for extracting the information for every hurricane.
4. Hurricane tracking maps were recently added. How would you get and store those images?
5. Damages are not inflation adjusted. How would you go about enriching your data with inflation adjusted dollars? Where should this additional data be stored and how would it be used?

Solution Preview


{
"cells": [
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab 4 - Data"
   ]
},
{
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
},
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The special command above will make all the `matplotlib` images appear in the notebook."
   ]
},
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Directions\n",
    "\n",
    "**Failure to follow the directions will result in a \"0\"**\n",
    "\n"
   ]
},
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### General Instructions\n",
    "\n",
    "1. You will be submitting your assignment to Blackboard.\n",
    "    \n",
    "This assignment has many files. You should include in your zip file:\n",
    "\n",
    "    1. main.ipynb - this file and your queries/reporting code.\n",
    "    2. hurricanes.py - your preprocessing and data writing file.\n",
    "    3. hurricanes.html - the local copy of the Wikipedia page.\n",
    "    4. hurricanes.db - the SQLite database you create.\n",
    "    5. hurricanes.sql - the DDL commands to create your database.\n",
    "    \n",
    "Put all of these files in a directory/folder named main and zip it (only use Zip compression, nothing else). Submit your zip file to Blackboard.\n",
    "    \n",
    "2. Data Science is as much about what you write (communicating) as the code you execute (researching). In many places, you will be required to execute code and discuss both the purpose and the result. Additionally, Data Science is about reproducibility and transparency. This includes good communication with your team and possibly with yourself. Therefore, you must show **all** work.\n",
    "\n",
    "3. Avail yourself of the Markdown/Codecell nature of the notebook. If you don't know about Markdown, look it up. Your notebooks should not look like ransom notes. Don't make everything bold. Clearly indicate what question you are answering.\n",
    "\n",
    "4. Submit a cleanly executed notebook. The first code cell should say `In [1]` and each successive code cell should increase by 1 throughout the notebook."
   ]
},
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Individual Submission\n",
    "\n",
    "## Getting and Storing Data\n",
    "\n",
    "This Lab is about acquiring, cleaning and storing data (commonly known as \"wrangling\") as well as doing a little bit of analysis.\n",
    "\n",
    "### Basic Outline\n",
    "\n",
    "1. Use `curl` or `wget` to obtain a local copy of the following web page: Atlantic Hurricane Season ( https://en.wikipedia.org/wiki/Atlantic_hurricane_season ). You can also just visit the page and save the HTML. **Include this in your submission as `hurricanes.html`**. This is important. In Spring 2016, the page was edited during the module and different people got different answers at different times. You only need to be correct with respect to your `hurricanes.html` file.\n",
    "2. Create the SQLite3 database and initialize the tables. **Include these files as `hurricanes.db` and `hurricanes.sql`**. We are using SQLite3 not because it is the best database in the world but because its database files are shareable.\n",
    "3. Use Beautiful Soup 4 and Python to parse the HTML file into a usable dataset and write the data to the database. **This code should be in a file `hurricanes.py` and included in your submission.** Your code should have copious comments.\n",
    "4. Write this data set to a SQLite3 database called `hurricanes.db` and **include it in your submission**.\n",
    "5. Run the requested queries against the data set (**see below**). The results should be **nicely formatted**.\n",
    "\n",
    "Although Wikipedia has an API, I do not want you to use it for this assignment."
   ]
},
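{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hedged illustration of step 2 above, the sketch below shows one way the DDL in `hurricanes.sql` could look and how it might be applied from Python. The column names come from the Details section; the column types, and the idea of storing normalized whole-dollar damage, are assumptions rather than requirements."
   ]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: create hurricanes.db and the atlantic_hurricanes table.\n",
    "# Column names follow the assignment; the types chosen here are an assumption.\n",
    "import sqlite3\n",
    "\n",
    "DDL = \"\"\"\n",
    "CREATE TABLE IF NOT EXISTS atlantic_hurricanes (\n",
    "    year             INTEGER PRIMARY KEY,\n",
    "    tropical_storms  INTEGER,\n",
    "    hurricanes       INTEGER,\n",
    "    major_hurricanes INTEGER,\n",
    "    deaths           INTEGER,  -- NULL where the page says Unknown (an assumption)\n",
    "    damage           INTEGER,  -- whole US dollars after normalization (an assumption)\n",
    "    notes            TEXT\n",
    ");\n",
    "\"\"\"\n",
    "\n",
    "connection = sqlite3.connect(\"hurricanes.db\")\n",
    "connection.executescript(DDL)\n",
    "connection.commit()\n",
    "connection.close()"
   ]
},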
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Learning Objectives\n",
    "\n",
    "This assignment has several learning objectives:\n",
    "\n",
    "* create a database and tables in one database type, SQLite\n",
    "* use BeautifulSoup to parse a web page.\n",
    "* use Regular Expressions to parse text to extract numbers.\n",
    "* write data to a database using SQL.\n",
    "* read data from a database using SQL.\n",
    "\n",
    "You can reference the Module links to various tutorials on these topics."
   ]
},
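{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the BeautifulSoup objective concrete, here is a minimal sketch of how `hurricanes.py` might locate the per-decade tables in the saved `hurricanes.html`. The `wikitable` class is an assumption about the page's markup; check it against your own copy of the HTML before relying on it."
   ]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: find the decade tables in the saved page with BeautifulSoup 4.\n",
    "# The \"wikitable\" class is an assumption about Wikipedia's markup, not a given.\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "with open(\"hurricanes.html\", encoding=\"utf-8\") as handle:\n",
    "    soup = BeautifulSoup(handle, \"html.parser\")\n",
    "\n",
    "tables = soup.find_all(\"table\", class_=\"wikitable\")\n",
    "for table in tables:\n",
    "    headers = [cell.get_text(strip=True) for cell in table.find_all(\"th\")]\n",
    "    print(headers[:7])  # inspect the layout before writing the row processor"
   ]
},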
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Details\n",
    "\n",
    "The data is contained in many separate HTML tables. You will need to look at the raw HTML to determine how everything is laid out. Make sure you decompose the problem into nice, neat, focused functions. You should have a function that processes a table, one that processes a row, and one that processes something that may be numeric.\n",
    "\n",
    "You only need to get the data from the tables starting in the 1850s but note that not all years have the same data. You only need to save the following columns. The name in parentheses is the name the column should have in the database table.\n",
    "\n",
    "- Year (`year`)\n",
    "- Number of tropical storms (`tropical_storms`)\n",
    "- Number of hurricanes (`hurricanes`)\n",
    "- Number of Major Hurricanes (`major_hurricanes`)\n",
    "- Deaths (`deaths`)\n",
    "- Damage (`damage`)\n",
    "- Notes (`notes`)\n",
    "\n",
    "Note that \"Damage\" doesn't start until 1900s and \"Notes\" was added in 1880s. \"Strongest Storm\" should be added to the Notes column (even in years that didn't have Notes) as should \"Retired Storms\". The name of the database table should be `atlantic_hurricanes`. The name of the table file (SQLite3 uses a file) should be `hurricanes.db` (who knows...you might need to add Pacific storms someday).\n",
    "\n",
    "There are a number of parsing problems which will most likely require regular expressions. First, the Deaths column has numbers that include commas and entries that are not numbers (Unknown and None). How should you code Unknown and None so that answers are not misleading but queries are still fairly straightforward to write?\n",
    "\n",
    "Similarly, Damages has numbers with commas, currency signs and different amount words (millions, billions). How will you normalize all of these so that a query can compare them? You may need regular expressions.\n",
    "\n",
    "Additionally, the way that Tropical Storms are accounted for seems to change mysteriously. Looking over the data, it's not immediately apparent when the interpretation should change. 1850s, 1860s definitely but 1870s? Not sure. It could just be a coincidence that there were never more hurricanes than tropical storms which seems to be the norm but see, for example, 1975. Welcome to Data Science!\n",
    "\n",
    "You should put your parsing code in `hurricanes.py`. None of it should be in this file. Imagine this file is going to be the report you give to your boss."
   ]
},
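{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the \"maybe numeric\" helpers described above, assuming one possible set of coding decisions: `Unknown` becomes NULL, `None` becomes 0, and damage amounts are expanded to whole dollars. These are illustrative choices, not the only defensible ones, and whatever you choose should be recorded in the Documentation section."
   ]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: regular-expression helpers for the Deaths and Damage columns.\n",
    "# The Unknown -> NULL and None -> 0 encodings are assumptions to be documented.\n",
    "import re\n",
    "\n",
    "def parse_deaths(text):\n",
    "    \"\"\"Return an int, 0 for 'None', or None (SQL NULL) for 'Unknown'.\"\"\"\n",
    "    text = text.strip()\n",
    "    if re.search(r\"unknown\", text, re.IGNORECASE):\n",
    "        return None\n",
    "    if re.search(r\"none\", text, re.IGNORECASE):\n",
    "        return 0\n",
    "    digits = re.sub(r\"[^\\d]\", \"\", text)  # drop commas, footnote markers, etc.\n",
    "    return int(digits) if digits else None\n",
    "\n",
    "def parse_damage(text):\n",
    "    \"\"\"Normalize strings like '$7.1 billion' or '1,250' to whole dollars.\"\"\"\n",
    "    text = text.strip()\n",
    "    if not text or re.search(r\"unknown|none\", text, re.IGNORECASE):\n",
    "        return None\n",
    "    multiplier = 1\n",
    "    if re.search(r\"billion\", text, re.IGNORECASE):\n",
    "        multiplier = 1_000_000_000\n",
    "    elif re.search(r\"million\", text, re.IGNORECASE):\n",
    "        multiplier = 1_000_000\n",
    "    number = re.sub(r\"[^\\d.]\", \"\", text)  # strip '$', commas and words (assumes one amount per cell)\n",
    "    return round(float(number) * multiplier) if number else None"
   ]
},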
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Documentation\n",
    "\n",
    "Any time you run into a problem where you have to decide what to do--how to solve the problem or treat the values--document it here."
   ]
},
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* I used the following command to execute the DDL script and create the database: `sqlite3 hurricanes.db \".read hurricanes.sql\"`\n",
    "* Problems with parsing Damage: replaced 'million' with six zeros and 'billion' with nine zeros."
   ]
},
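{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Building on the decisions above, here is a hedged sketch of the writing step that `hurricanes.py` would perform once each year has been reduced to the seven required columns. The `rows` list is a hypothetical placeholder for the output of the table/row processors; it is not real data."
   ]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: write parsed rows into atlantic_hurricanes.\n",
    "# `rows` is a hypothetical placeholder; hurricanes.py would fill it with\n",
    "# (year, tropical_storms, hurricanes, major_hurricanes, deaths, damage, notes) tuples.\n",
    "import sqlite3\n",
    "\n",
    "rows = []  # produced by the table/row processing functions in hurricanes.py\n",
    "\n",
    "connection = sqlite3.connect(\"hurricanes.db\")\n",
    "connection.executemany(\n",
    "    \"INSERT OR REPLACE INTO atlantic_hurricanes \"\n",
    "    \"(year, tropical_storms, hurricanes, major_hurricanes, deaths, damage, notes) \"\n",
    "    \"VALUES (?, ?, ?, ?, ?, ?, ?)\",\n",
    "    rows,\n",
    ")\n",
    "connection.commit()\n",
    "connection.close()"
   ]
},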
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hurricanes.db\n",
    "\n",
    "What is the *function* of `hurricanes.db` in this assignment?"
   ]
},
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* We use a SQLite database named `hurricanes.db` to store the data obtained by parsing the Wikipedia page. By storing the information in a database, we can easily access, manage, and update it, and we can run the required SQL queries against it from this notebook."
   ]
},
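{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before the graded queries, a hedged sketch of the pattern used below: run raw SQL through the built-in `sqlite3` module and print the result as a simple aligned text table, without loading the data into Pandas. The query shown is only an illustration of the pattern, not one of the required queries."
   ]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: run a raw SQL query and print the result as a plain text table.\n",
    "import sqlite3\n",
    "\n",
    "def run_query(sql, parameters=()):\n",
    "    \"\"\"Execute a query against hurricanes.db and print a simple aligned table.\"\"\"\n",
    "    with sqlite3.connect(\"hurricanes.db\") as connection:\n",
    "        cursor = connection.execute(sql, parameters)\n",
    "        headers = [description[0] for description in cursor.description]\n",
    "        body = [[str(value) for value in row] for row in cursor.fetchall()]\n",
    "    table = [headers] + body\n",
    "    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]\n",
    "    for row in table:\n",
    "        print(\"  \".join(cell.ljust(width) for cell, width in zip(row, widths)))\n",
    "\n",
    "# Illustrative pattern only; the required queries follow in the cells below.\n",
    "run_query(\"SELECT year, tropical_storms, hurricanes FROM atlantic_hurricanes LIMIT 5\")"
   ]
},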
{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Queries\n",
    "\n",
    "When you are done, you must write and execute the following queries against your database. Those queries should be run from this notebook. Find the documentation for using SQLite3 from Python (the library is already included). You should never output raw Python data structures; instead, you need to output well-formatted tables. You may need to look around for a library to help you or write your own formatting code. `Pandas` is one possibility.\n",
    "\n",
    "However, I want you to use raw SQL for your queries so if you use `Pandas` use it only for the formatting of query results (**don't load the data into Pandas and use Pandas/Python to query the data**). Imagine that you were somewhere you couldn't get Pandas...you'd need to know how to do this without it.\n",
    "\n",...