
Question

The data set comes from the Kaggle Digit Recognizer competition. The goal is to recognize the digits 0 to 9 in images of handwriting. Because the original data set is too large to load in the Weka GUI, I have systematically sampled 10% of the data by selecting every 10th example (the 10th, the 20th, and so on). You will use the sampled data to build prediction models with the naïve Bayes and decision tree algorithms. Tune their parameters to get the best model (as measured by cross-validation) and determine which algorithm provides the better model for this task.
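The systematic sampling described above can be sketched in base R as follows. This is only an illustration, not the procedure actually used to produce the sample; the function name `systematic_sample` and the toy data frame are assumptions made for the example.

```r
# Keep every `step`-th row of a data frame (rows step, 2*step, ...).
# `systematic_sample` is a hypothetical helper, not part of any package.
systematic_sample <- function(df, step = 10) {
  if (nrow(df) < step) {
    return(df[0, , drop = FALSE])  # too few rows: empty sample
  }
  idx <- seq(from = step, to = nrow(df), by = step)
  df[idx, , drop = FALSE]
}

# Toy stand-in for the training data: 95 rows with a label and one pixel column
toy <- data.frame(label = rep(0:9, length.out = 95), pixel0 = 0L)

sampled <- systematic_sample(toy, step = 10)
nrow(sampled)      # 9 rows: the 10th, 20th, ..., 90th
rownames(sampled)  # original row positions are preserved as row names
```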
Because the test data is large, submission to Kaggle is not required for this task. However, 1 extra point will be given for a successful submission. One way to handle the large test set is to split it into several smaller test sets, run prediction on each subset, and merge all the prediction results into one file for submission. You can also try using the entire training data set, or re-sampling a larger sample.
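The split, predict, and merge approach could look roughly like the sketch below. It assumes R with the `rpart` package and uses the built-in `iris` data as a stand-in for the digit data; the helper name `predict_in_chunks`, the chunk size, and the output file name are illustrative assumptions. The `ImageId` and `Label` column names follow the Digit Recognizer submission format.

```r
library(rpart)

# Predict in fixed-size chunks and stack the results into one data frame.
# `predict_in_chunks` is a hypothetical helper written for this example.
predict_in_chunks <- function(model, newdata, chunk_size = 5000) {
  starts <- seq(1, nrow(newdata), by = chunk_size)
  parts <- lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1, nrow(newdata))
    pred <- predict(model, newdata[rows, , drop = FALSE], type = "class")
    data.frame(ImageId = rows, Label = as.character(pred))
  })
  do.call(rbind, parts)  # merge the per-chunk predictions
}

# Toy demonstration: iris stands in for the digit images
fit <- rpart(Species ~ ., data = iris, method = "class")
submission <- predict_in_chunks(fit, iris[, -5], chunk_size = 40)

# Writing the merged file (left commented out so the sketch has no side effects):
# write.csv(submission, "submission.csv", row.names = FALSE, quote = FALSE)
```

Chunking changes only memory use, not the predictions: the merged result matches what a single `predict()` call on the full test set would return.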

Tip: check out the Kaggle forum to see if there are some patterns other people have found that you can use to build better models.
Report structure:
Section 1: Introduction
Briefly describe the classification problem and the general data preprocessing. Note that some data preprocessing steps may be specific to a particular algorithm. Report those steps under the corresponding algorithm section.
Section 2: Decision tree
Build a decision tree model. Tune the parameters, such as the pruning options, and report the 3-fold CV accuracy.
Section 3: Naïve Bayes
Build a naïve Bayes model. Tune the parameters, such as the discretization options, to compare results.
Section 4: Algorithm performance comparison
Compare the results from the two algorithms. Which one reached higher accuracy? Which one runs faster? Can you explain why?
Section 5: Kaggle test result (1 extra point)
Report the test accuracy for the naïve Bayes and decision tree models. Discuss whether overfitting occurs in these models.
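For Sections 2 to 4, one possible way to obtain a 3-fold CV accuracy and a running time for each parameter setting in R is sketched below. It is a minimal illustration under stated assumptions: `iris` stands in for the digit data, the helper names `cv_accuracy` and `fit_tree` are hypothetical, and the three `cp` values are arbitrary examples rather than recommended settings.

```r
library(rpart)

# k-fold cross-validation for any model whose predict() supports type = "class".
# Returns the mean fold accuracy and the total elapsed time in seconds.
cv_accuracy <- function(data, label_col, fit_fun, k = 3, seed = 1) {
  set.seed(seed)
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold ids
  elapsed <- system.time({
    acc <- sapply(seq_len(k), function(i) {
      train <- data[fold != i, , drop = FALSE]
      test  <- data[fold == i, , drop = FALSE]
      model <- fit_fun(train)
      pred  <- predict(model, test, type = "class")
      mean(as.character(pred) == as.character(test[[label_col]]))
    })
  })[["elapsed"]]
  list(accuracy = mean(acc), seconds = elapsed)
}

# Factory that fixes the pruning parameter cp of a classification tree
fit_tree <- function(cp) {
  force(cp)
  function(train) {
    rpart(Species ~ ., data = train, method = "class",
          control = rpart.control(cp = cp))
  }
}

cps <- c(0.001, 0.01, 0.1)
results <- lapply(cps, function(cp) cv_accuracy(iris, "Species", fit_tree(cp)))
names(results) <- paste0("cp=", cps)
```

The same helper can time a naïve Bayes model by passing a different `fit_fun`, provided that model's `predict()` method returns class labels, which keeps the accuracy and speed comparison in Section 4 on equal footing.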

Solution Preview


## Introduction

```{r}
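# Treat the digit label as a categorical target so that models classify rather than regress
train_data$label <- factor(train_data$label)
# Make sure the test set carries no label column
test_data$label <- NULL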
```
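To illustrate why the label is converted to a factor, the sketch below fits `rpart` twice on a small made-up data set: once with a numeric label and once with a factor label. The toy data and variable names are assumptions for the example only; they are not part of the original solution.

```r
library(rpart)

# Made-up data: three "digits" separated by a single pixel intensity
toy <- data.frame(
  label = rep(c(0, 1, 2), each = 20),
  pixel = rep(c(10, 120, 250), each = 20) + rep(-2:2, 12)
)

# Numeric label: rpart defaults to a regression tree
fit_numeric <- rpart(label ~ pixel, data = toy)

# Factor label: rpart defaults to a classification tree
toy$label <- factor(toy$label)
fit_factor <- rpart(label ~ pixel, data = toy)

fit_numeric$method  # "anova"
fit_factor$method   # "class"
```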

## Decision Tree

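```{r}
library(caret)  # provides trainControl()

folds <- 3
repeats <- 10

# Tuning grid for rpart's complexity parameter (a single candidate value here)
rpart.grid <- expand.grid(.cp = 0.01)
# 3-fold cross-validation, repeated 10 times
fitControl <- trainControl(method = "repeatedcv",
                           number = folds, repeats = repeats...
```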
