Question

Background

This is a data analytics programming assignment using the R programming language (in RStudio), designed to prepare you for two larger assignments later in the semester. RStudio is very popular at organizations of all sizes. This assignment is meant to get you used to regular expressions, pattern matching, parsing, and string matching (or other methods if you want) in R. In the coming months you will have two larger assignments: one on time series analysis with lots of data, and a larger machine learning project using supervised and unsupervised learning algorithms.

Description

Write an R script (it needs to run in RStudio) that will preprocess/clean data I provide to you in an Excel document. The data is semi-structured, and I believe this is fairly realistic to how some organizations may provide data for you to work with. To accomplish this, consider using regular expressions, string matching, parsing, and pattern matching. If there are other ways to accomplish this, feel free to use them as well.
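As a rough illustration of the regular-expression approach described above, the following base R sketch pulls an email address and an IP address out of semi-structured text. The sample rows, the field labels, and the helper name `first_match` are all hypothetical, since the actual spreadsheet layout is not shown here; the patterns are deliberately simple and would likely need tuning against the real data.

```r
# Hypothetical sample rows; the real Excel content is not shown in this document.
records <- c(
  "Name: John Smith; Email: john.smith@example.com; Creation IP: 192.168.10.25",
  "Name: Jane Doe, Email: jane_doe@example.org, Creation IP: 10.0.0.7"
)

# Hypothetical helper: return the first match of `pattern` in each string,
# or NA when a string contains no match.
first_match <- function(text, pattern) {
  m <- regexpr(pattern, text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)
  out
}

# Simple (not RFC-complete) patterns for an email address and an IPv4 address.
email_pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
ip_pattern    <- "\\b(?:[0-9]{1,3}\\.){3}[0-9]{1,3}\\b"

emails <- first_match(records, email_pattern)
ips    <- first_match(records, ip_pattern)

# Collect the extracted fields in column format.
extracted <- data.frame(Email = emails, Creation.IP = ips,
                        stringsAsFactors = FALSE)
print(extracted)
```

Because the patterns describe the shape of the value rather than a specific person or account, the same code keeps working when more rows are added.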

Guidance

1. Feel free to use an entire R script or part of an R script that you have used in another class.

2. Remember to avoid duplication of data.

3. There are eight items with the corresponding data that I want to see.

Name, Email, PayPal Account Number, Bank Accounts, Creation IP Address, Total Transaction Amount, Last Login, Total Logins

4. The first five sections should be reasonable enough to isolate but the last three may require you to program some adding and/or subtracting.

5. The R script needs to write the data in column format to an Excel document.

Example)
Name          Account Number    Last Login
John Smith    1234987           15 July 2091

6. When writing the R script don’t just make the script look for a specific name, PayPal account, bank account, or other specified column heading. For example don’t just initially have it look for John Smith or David Smith to get the answer. You will be getting a lot more data later in the semester! In the real world you would receive substantial sized data sets and this would not work if you just looked for a specific name.

7. Try to scrub the USD label out of the monetary values. You want to make sure it is not in there because you don’t need extra non-numeric characters when you eventually get a larger dataset.

8. Please label what each area of the R script does, using comments similar in format to the examples below:
#Reading the file in   
#Email Address (of course there will be eight of these)
#Write file to (where it will be written to)
#Save file (wherever it will be saved)
*If there are other areas you want or think you should add – please feel free to include them!

9. Remember there are a number of fields and/or information in the Excel document that you don’t need! You will absolutely see things like this in the real world that while it may be useful for different types of customer analytics, it is not useful at this time. I want to emphasize there are only eight categories I am looking for…not anything else!

10. Because I will be running the script in RStudio myself to see if it actually works, I need the libraries and packages detailed. Feel free to use various packages. I have just listed some packages and libraries for general reference. Also, if you need to detach a package, detail that command, as well as the command for re-attaching it.
Example) library(dplyr)
library(stringr)
library(lubridate)
library(readr)
library(xlsx)
library(tidyverse)
library(magrittr)
library(tibble)

If you are detaching a package to avoid a conflict/masking then write the command in the correct area of the script!
Example) detach("package:dplyr")

(of course this may be written another way I was just listing this way as an example)

Remember to add the package back if you need it….in the correct spot. I want to be able to run the full script without looking for what to attach and/or detach and then any packages to add back again!

11. Don’t be concerned about dates, numbers, or other items not being long enough or being too long, that is not the point of the exercise. It is to work with the data exactly as it is seen in the Excel document.
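Guidance items 2, 4, and 7 (removing duplicates, scrubbing USD, and adding up totals) can be sketched together in base R. Everything about the input below is an assumption made for illustration: the column names, the sample values, and the idea that each row is one transaction or login event. Treating exact duplicate rows as redundant is also an assumption that should be checked against the real data.

```r
# Hypothetical parsed rows: one row per event. The real layout may differ.
events <- data.frame(
  email  = c("a@example.com", "a@example.com", "b@example.com", "a@example.com"),
  amount = c("USD 120.50", "35.25 USD", "USD 10.00", "USD 120.50"),
  login  = c("15 July 2021", "2 August 2021", "9 June 2021", "15 July 2021"),
  stringsAsFactors = FALSE
)

# Guidance 2: drop exact duplicate rows before aggregating.
events <- unique(events)

# Guidance 7: keep only digits and the decimal point, then convert to numeric.
events$amount <- as.numeric(gsub("[^0-9.]", "", events$amount))

# Convert "day Month year" text to Date without relying on the system locale,
# using the built-in English month names in `month.name`.
parts <- strsplit(events$login, " ")
events$login <- as.Date(vapply(parts, function(p) {
  sprintf("%s-%02d-%02d", p[3], match(p[2], month.name), as.integer(p[1]))
}, character(1)))

# Guidance 4: total amount, number of logins, and latest login per account.
total_amount <- tapply(events$amount, events$email, sum)
total_logins <- tapply(events$login, events$email, length)
last_login   <- tapply(as.numeric(events$login), events$email, max)

# Assemble the result in column format.
result <- data.frame(
  Email = names(total_amount),
  Total.Transaction.Amount = as.numeric(total_amount),
  Last.Login = as.Date(as.numeric(last_login), origin = "1970-01-01"),
  Total.Logins = as.integer(total_logins),
  stringsAsFactors = FALSE
)
print(result)
```

With the openxlsx package attached, a data frame like `result` could then be written out with `write.xlsx(result, "output.xlsx")`, where the file name is only a placeholder.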

Solution Preview


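# Loading the required packages
library("openxlsx")
library("stringr")


# Reading the data
unstruc_data <- read.xlsx("akqG5Wb7nly9NJd5PGH3.xlsx")
names(unstruc_data) <- c("text")

# Defining a function to extract the item that follows a pattern
#   pattrn:  regular expression that marks the label of the item
#   string_: the text to search
#   sep:     if TRUE, keep only the text before the first ";" (or "," if no ";")
extract_item <- function(pattrn, string_, sep = FALSE) {
  # Locate the first match of the label; return NA if it is absent
  name_loc <- str_locate(string_, pattrn)
  if (is.na(name_loc[1, "end"])) {
    return(NA_character_)
  }
  # Keep everything after the label
  str_split_1 <- trimws(substr(string_, name_loc[1, "end"] + 1, nchar(string_)))
  if (!sep) {
    # Strip everything except letters, digits, and spaces
    str_split_1 <- gsub("[^A-Za-z0-9 ]", "", str_split_1)
    return(trimws(str_split_1))
  }
  # Find the first separator: ";" first, then "," as a fallback
  sep_found <- str_locate(str_split_1, ";")[1, "start"]
  if (is.na(sep_found)) {
    sep_found <- str_locate(str_split_1, ",")[1, "start"]
  }
  # Cut at the separator when there is one; otherwise keep the whole remainder
  if (!is.na(sep_found)) {
    str_split_1 <- substr(str_split_1, 1, sep_found - 1)
  }
  str_split_2 <- gsub("[^A-Za-z0-9 ]", "", str_split_1)
  return(trimws(str_split_2))
}
...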