Question

In this assignment, you will create a Naive Bayes classifier for detecting e-mail spam, and you will test your classifier on a publicly available spam dataset using 5-fold cross-validation.
I. Implement Naive Bayes in Python, Java, or C#.

• Step 1: Download the Spambase dataset available from the UCI Machine Learning Repository.
The Spambase data set consists of 4,601 e-mails, of which 1,813 are spam (39.4%). The data set archive contains a processed version of the e-mails wherein 57 real-valued features have been extracted and the spam/non-spam label has been assigned. You should work with this processed version of the data. The data set archive contains a description of the features extracted as well as some simple statistics over those features.
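
As a minimal sketch of reading the processed data, the following assumes the archive's data file is comma-separated, with the 57 feature values followed by a 0/1 label in the last column (1 meaning spam). The function name `load_spambase` and the file name in the usage note are illustrative, not part of the assignment.

```python
import csv

N_FEATURES = 57  # number of extracted features, per the description above


def load_spambase(lines):
    """Parse comma-separated rows into (features, label) pairs.

    `lines` is any iterable of text lines, such as an open file handle.
    ASSUMPTION: each row holds 57 feature values, then a 0/1 label.
    """
    data = []
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        values = [float(v) for v in row]
        features = values[:N_FEATURES]
        label = int(values[N_FEATURES])
        data.append((features, label))
    return data
```

With the data file on disk, this could be called as `load_spambase(open("spambase.data", newline=""))`, assuming that is the name of the file in the archive.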

• Step 2: Partition the data into 5 folds.
To estimate the generalization (testing) error of your classifier, you will perform cross-validation. In k-fold cross-validation, one would ordinarily partition the data set randomly into k groups of roughly equal size and perform k experiments (the "folds") wherein a model is trained on k-1 of the groups and tested on the remaining group, where each group is used for testing exactly once. The generalization error of the classifier is estimated by the average of the performance across all k folds.

While one should perform cross-validation with random partitions, for consistency and comparability of your results, you should partition the data into 5 groups as follows: Consider the 4,601 data points in the order they appear in the processed data file. Finally, Fold k will consist of testing on Group k a model obtained by training on the remaining four groups combined.
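
The fold construction can be sketched as below. The rule that assigns each data point to a group is not reproduced in the text above, so the default used here (point i goes to group i mod k, in file order) is only a hypothetical placeholder; pass the required rule as `group_of`. The name `make_folds` is likewise illustrative.

```python
def make_folds(data, k=5, group_of=None):
    """Return k (train, test) pairs; fold j tests on group j.

    `group_of` maps a 0-based position in `data` to a group index in 0..k-1.
    """
    if group_of is None:
        # ASSUMPTION: round-robin by file order. Replace with the
        # assignment's actual grouping rule.
        group_of = lambda i: i % k

    groups = [[] for _ in range(k)]
    for i, example in enumerate(data):
        groups[group_of(i)].append(example)

    folds = []
    for j in range(k):
        test = groups[j]
        # Training set: every group except the one held out for testing.
        train = [ex for g, grp in enumerate(groups) if g != j for ex in grp]
        folds.append((train, test))
    return folds
```

Each group is used for testing exactly once, and the training set for a fold never contains that fold's test points.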
• Step 3: Create a Naive Bayes classifier by modeling the features in the following way.
The 57 features are real-valued, and one can model the feature distributions in simple and complex ways. For our assignment, model the features as simple Boolean random variables. Consider a threshold using the overall mean value of the feature (available in the Spambase documentation), and simply compute the fraction of the time that the feature value is above or below the overall mean value for each class. In other words, for feature f_i with overall mean value mu_i, estimate
o Pr[f_i <= mu_i | spam]
o Pr[f_i > mu_i | spam]
o Pr[f_i <= mu_i | non-spam]
o Pr[f_i > mu_i | non-spam]

and use these estimated values in your Naive Bayes predictor, as appropriate.
To avoid any issues with zero probabilities, if any of the probability values is 0, simply replace it with the small value 0.0014 to avoid multiplying by 0.

II. Evaluate your results.
1. Error tables: Create a table with one row per fold showing your false positive, false negative, and overall error rates, and add one final row per table corresponding to the average error rates across all folds. For this problem, the false positive rate is the fraction of non-spam testing examples that are misclassified as spam, the false negative rate is the fraction of spam testing examples that are misclassified as non-spam, and the overall error rate is the fraction of overall examples that are misclassified.
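
The three rates defined above can be computed per fold and then averaged, as in this sketch. It assumes labels are encoded as 1 for spam and 0 for non-spam and that each test set contains at least one example of each class; the function names are illustrative.

```python
def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate, overall error rate)."""
    pairs = list(zip(y_true, y_pred))
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-spam flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    n_nonspam = sum(1 for t in y_true if t == 0)
    n_spam = sum(1 for t in y_true if t == 1)
    return fp / n_nonspam, fn / n_spam, (fp + fn) / len(pairs)


def average_rates(per_fold):
    """Average each of the three rates across folds (the table's final row)."""
    k = len(per_fold)
    return tuple(sum(rates[i] for rates in per_fold) / k for i in range(3))
```

Calling `error_rates` once per fold and passing the collected tuples to `average_rates` yields the rows of the error table.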

Solution Preview


import math
import csv

# Functions


def mean(numbers):
    """
    :param numbers: List of floats
    :return: Mean value of floats in numbers
    """
    return sum(numbers) / float(len(numbers))


def stdev(numbers):
    """
    :param numbers: List of floats (at least two values)
    :return: Sample standard deviation of floats in numbers
    """
    avg = mean(numbers)
    # Sample variance: divide by n - 1.
    variance = sum(pow(x - avg, 2) for x in numbers) / float(len(numbers) - 1)
    return math.sqrt(variance)


def calculate_probability(x, mean, stdev):
    """
    :param x: continuous value
    :param mean: Mean value
    :param stdev: Standard Deviation
    :return: Gaussian probability density of x under (mean, stdev)
    """
    # Floor the standard deviation at 0.0014 to avoid dividing by (near) zero.
    if stdev <= 0.0014:
        stdev = 0.0014
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

...
