Transcribed TextTranscribed Text

Spam Classification Consider the email spam data set. This consists of 4601 email messages, from which 57 features have been extracted. These features are described as follows: 48 features giving the percentage of certain words (e.g., "business", "free", "george") in a given message 6 features giving the percentage of certain characters (; ([!$#) feature 55: the average length of an uninterrupted sequence of capital letters feature 56: the length of the longest uninterrupted sequence of capital letters feature 57: the sum of the lengths of uninterrupted sequences of capital letters The data set contains a training set of size 3065, and a test set of size 1536. One can imagine performing several kinds of preprocessing to this data. Try each of the following separately: 1) Standardize the columns so that they all have zero mean and unit variance; 2) Transform the features using log(xij + 1); 3) Discretize each feature using ](xij > 0). (a) For each version of the data, visualize it using the tools introduced in the class. (b) For each version of the data, fit a logistic regression model. Interpret the results, and report the classification errors on both the training and test sets. Do any of the 57 features/ predictors appear to be statistically significant? If so, which ones? (Hint: consider this as a multiple testing problem). (c) Apply both linear and quadratic discriminant analysis methods to the standardized data, and the log transformed data. What are the classification errors (training and test)? (d) Apply linear and nonlinear support vector machine classifiers to each version of the data. What are the classification errors (training and test)? Report classification errors using different methods and different preprocessed data in a table, and comment on the different performances. Finally, use either a single method with properly chosen tuning parameter or a combination of several methods to design a classifier with test error rate as small as possible. Describe your recommended method, and report its performance.

Solution PreviewSolution Preview

This material may consist of step-by-step explanations on how to solve a problem or examples of proper writing, including the use of citations, references, bibliographies, and formatting. This material is made available for the sole purpose of studying and learning - misuse is strictly forbidden.

R-Programming Assignment
    $35.00 for this solution

    PayPal, G Pay, ApplePay, Amazon Pay, and all major credit cards accepted.

    Find A Tutor

    View available Statistics-R Programming Tutors

    Get College Homework Help.

    Are you sure you don't want to upload any files?

    Fast tutor response requires as much info as possible.

    Upload a file
    Continue without uploading

    We couldn't find that subject.
    Please select the best match from the list below.

    We'll send you an email right away. If it's not in your inbox, check your spam folder.

    • 1
    • 2
    • 3
    Live Chats