{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"grade_id": "cell-68fd8fe196863861",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"# Part 0 of 2: Simple string processing review"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"text = \"sgtEEEr2020.0\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
   "1. False\n",
   "2. [True, True, True, True, True, True, True, False, False, False, False, False, False]\n"
]
}
],
"source": [
"# Strings have methods for checking \"global\" string properties\n",
"print(\"1.\", text.isalpha())\n",
"\n",
"# These can also be applied per character\n",
"print(\"2.\", [c.isalpha() for c in text])"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
   "BELOW: (global) -> (per character)\n",
   "False --> [False, False, False, False, False, False, False, True, True, True, True, False, True]\n",
   "False --> [False, False, False, False, False, False, False, False, False, False, False, False, False]\n",
   "False --> [True, True, True, False, False, False, True, False, False, False, False, False, False]\n",
   "False --> [False, False, False, True, True, True, False, False, False, False, False, False, False]\n",
   "False --> [False, False, False, False, False, False, False, True, True, True, True, False, True]\n"
]
}
],
"source": [
"# Here are a bunch of additional useful methods\n",
"print(\"BELOW: (global) -> (per character)\")\n",
"print(text.isdigit(), \"-->\", [c.isdigit() for c in text])\n",
"print(text.isspace(), \"-->\", [c.isspace() for c in text])\n",
"print(text.islower(), \"-->\", [c.islower() for c in text])\n",
"print(text.isupper(), \"-->\", [c.isupper() for c in text])\n",
"print(text.isnumeric(), \"-->\", [c.isnumeric() for c in text])"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"grade_id": "cell-0a99319dc0dd5f6c",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"**Exercise 0** (1 point). Create a new function that checks whether a given input string is a properly formatted social security number, i.e., has the pattern, `XXX-XX-XXXX`, _including_ the separator dashes, where each `X` is a digit. It should return `True` if so or `False` otherwise."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "is_ssn",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"def is_ssn(s):\n",
" if (s[3]!='-' or s[6]!='-'):\n",
"    return False\n",
" if (len(s)<7):\n",
"    return False\n",
" \n",
" s1 = s[0:3] + s[4:6] + s[7:]\n",
"    \n",
" for c in s1:\n",
"    if not c.isdigit():\n",
"          return False\n",
" return True\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": true,
"grade_id": "is_ssn_test",
"locked": true,
"points": 1,
"schema_version": 1,
"solution": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
   "\n",
   "(Passed!)\n"
]
}
],
"source": [
"# Test cell: `is_snn_test`\n",
"assert is_ssn('832-38-1847')\n",
"assert not is_ssn('832 -38 - 1847')\n",
"assert not is_ssn('832-bc-3847')\n",
"assert not is_ssn('832381847')\n",
"assert not is_ssn('8323-8-1847')\n",
"assert not is_ssn('abc-de-ghij')\n",
"print(\"\\n(Passed!)\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Regular expressions\n",
"\n",
"Exercise 0 hints at the general problem of finding patterns in text. A handy tool for this problem is Python's Regular Expression module, `re`.\n",
"\n",
"A _regular expression_ is a specially formatted pattern, written as a string. Matching patterns with regular expressions has 3 steps:\n",
"\n",
"1. You come up with a pattern to find.\n",
"2. You compile it into a _pattern object_.\n",
"3. You apply the pattern object to a string to find _matches_, i.e., instances of the pattern within the string.\n",
"\n",
"As you read through the examples below, refer also to the [regular expression HOWTO document](https://docs.python.org/3/howto/regex.html) for many more examples and details."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"import re"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"grade_id": "cell-c953dd7aa8b5f9a5",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Basics\n",
"\n",
"Let's see how this scheme works for the simplest case, in which the pattern is an *exact substring*. In the following example, suppose want to look for the substring `'fox'` within a larger input string."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
   "<re.Match object; span=(16, 19), match='fox'>\n"
]
}
],
"source": [
"pattern = 'fox'\n",
"pattern_matcher = re.compile(pattern)\n",
"\n",
"input_string = 'The quick brown fox jumps over the lazy dog'\n",
"matches = pattern_matcher.search(input_string)\n",
"print(matches)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Observe that the returned object, `matches`, is a special object. Inspecting the printed output, notice that the matching text, `'fox'`, was found and located at positions 16-18 of the `input_string`. Had there been no matches, then `.search()` would have returned `None`, as in this example:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
   "None\n"
]
}
],
"source": [
"print(pattern_matcher.search(\"This input has a FOX, but it's all uppercase and so won't match.\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also write code to query the `matches` object for more information."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
   "fox\n",
   "16\n",
   "19\n",
   "(16, 19)\n"
]
}
],
"source": [
"print(matches.group())\n",
"print(matches.start())\n",
"print(matches.end())\n",
"print(matches.span())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Module-level searching.** For infrequently used patterns, you can also skip creating the pattern object and just call the module-level search function, `re.search()`."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
   "Found jump @ (20, 24)\n"
]
}
],
"source": [
"matches_2 = re.search('jump', input_string)\n",
"assert matches_2 is not None\n",
"print (\"Found\", matches_2.group(), \"@\", matches_2.span())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Other Search Methods.** Besides `search()`, there are several other pattern-matching procedures:\n",
"\n",
"1. `match()` - Determine if the regular expression (RE) matches at the beginning of the string.\n",
"2. `search()` - Scan through a string, looking for any location where this RE matches.\n",
"3. `findall()` - Find all substrings where the RE matches, and returns them as a list.\n",
"4. `finditer()` - Find all substrings where the RE matches, and returns them as an iterator.\n",
"\n",
"We'll use several of these below; again, refer to the [HOWTO](https://docs.python.org/3/howto/regex.html) for more details."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A pattern
