Question

Transcribed Text

Assignment #2 Requirements

The assignment shall comply with the following requirements:

Install and configure MapReduce on YARN in Pseudo-Distributed Mode for development.

Implement and test a MapReduce job called DivAvgByYearJob that takes stock dividends data as input and generates a file with yearly averages for each stock.

o Input to the job is a set of files on HDFS containing dividends data. These files should already reside on HDFS from Assignment #1 under the /data/nasdaq/dividends directory.
  - Dividends data is stored in Comma Separated Values (CSV) format, and each line is a single record. The very first line in each file defines the schema, so it should not be processed.
  - The job must validate that the directory is present and contains at least one CSV file (a file with the .csv extension). If this condition is not met, a descriptive exception must be emitted and the job should fail.
  - This directory may contain other directories and files; your job must only process files in the /data/nasdaq/dividends directory that have the .csv extension.
  - Note that this is only sample data. You may need to modify this dataset or create a new one to force certain situations.
o The job must produce a single file where each line documents the average dividends for a stock for each year. In other words, for each year-stock combination, compute the average dividends. The file must be sorted by year and then by stock symbol.
  Format: <year>:<stock>\t<average>, for example:
    1983:ADP (Automatic Data Processing)    0.00875
    1984:ADP (Automatic Data Processing)    0.0094475
    1985:ADP (Automatic Data Processing)    0.0103925
    1986:ADP (Automatic Data Processing)    0.0115675
    1986:AMFI    0.03333333
    1986:AMSWA (American Software Inc Class A)    0.0237
    1986:ASBC (Associated Banc-Corp)    0.025940001
o The original input data contains stock symbols but does not have the full name of the stock.
  - Download symbol_description.csv from http://goo.gl/D42BQb. This file contains the mapping of each stock symbol to its full name/description.
  - Your job is to enrich the result with the stock's full name/description from symbol_description.csv.
  - If symbol_description.csv does not contain the stock symbol, then the symbol itself shall be used.
  - If the symbol can be matched to its full name, then the final result must be the original symbol followed by the full name in parentheses, for example: AMSWA (American Software Inc. Class A).
  - You must utilize Distributed Cache to implement this feature.
o The job must accept and honor the following properties:
  - Input directory on HDFS
  - Output directory on HDFS
  - Start year: this will give the ability to filter out data for years prior to the provided value
  - End year: this will give the ability to filter out data for years after the provided value
o The job must implement an effective Combiner.
o The job must implement/provide the following counter groups:
  - Counter group named "ENRICHED" for the number of records that were successfully enriched from symbol_description.csv with the stock name, as well as the number of records that could not be enriched
    - Provide a counter for # of enriched records
    - Provide a counter for # of records that could not be enriched
  - Counter group named "STATS" that will display various stats about processed records
    - Provide counters for # of records processed for each year (ex. 2011 - 2148 records, 2012 - 677 records)
    - Provide counters for the number of records that were filtered out by the start and end year properties
o You must utilize the opencsv library to read CSV files. This applies to both the dividends input files and the symbol_description.csv enrichment file. I know that you can very easily parse the lines without involving a library, but the point here is to exercise external library usage.
  - Download location: http://opencsv.sourceforge.net/#where-can-I-get-it
  - Instructions on how to read a CSV file: http://opencsv.sourceforge.net/#how-to-read
  - Maven dependency definition

Your code must follow best programming practices:
o Code re-use
o Validation
o Comments whenever appropriate

Answers to the following questions. Each answer must be under sentences (no run-ons please).
o Discuss your choice of mechanism to configure the job. Was there another option? What was your reason behind the choice?
o Explain your choice for the keys passed between Mapper and Reducers. Was there another option? What was your reason behind the choice?
o Explain the chosen mechanism to include an external library for the MapReduce job. Was there another option?

Running Your Project

You must provide a wrapper script called proj2.sh that can be used to execute the job. The script specifies commands and properties. This script must specify and configure three separate runs of DivAvgByYearJob:

1. Run #1
   a. Process all csv files in the input directory
   b. Configure input directory to '/data/nasdaq/dividends'
   c. Configure output directory to '/proj2/ /run1'
2. Run #2
   a. Configure input directory to '/data/nasdaq/dividends'
   b. Configure output directory to '/proj2/ /run2'
   c. Configure to only process csv files whose file name is between NASDAQ_dividends_C.csv and NASDAQ_dividends_K.csv (inclusive)
      i. Effectively only processing stocks that start with a letter between C and K (inclusive)
3. Run #3
   a. Configure input directory to '/data/nasdaq/dividends'
   b. Configure output directory to '/proj2/ /run3'
   c. Configure 'start year' property to 2000
   d. Configure 'end year' property to 2010
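The Combiner requirement above has a catch worth illustrating: an average of averages is generally not the true average, so a combiner cannot simply emit averages. One common approach is to pass partial (sum, count) pairs and divide only in the reducer. The following is a minimal sketch of that idea in plain Java, without Hadoop types; the class and method names (DivAvgSketch, SumCount, key, accumulate, reduce) and the sample values are hypothetical and chosen only for illustration. In an actual job the pair would presumably be carried in a custom Writable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical names throughout; plain Java collections stand in for Hadoop types.
class DivAvgSketch {

    /** Partial aggregate: running dividend sum and record count. */
    public static final class SumCount {
        public final double sum;
        public final long count;

        public SumCount(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        /** Merging is associative and commutative, so it is safe in a combiner. */
        public SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }

        public double average() {
            return sum / count;
        }
    }

    /** Composite key; with four-digit years, string order is year, then symbol. */
    public static String key(int year, String symbol) {
        return year + ":" + symbol;
    }

    /** Combiner step: fold one partial into the per-key aggregate. */
    public static void accumulate(Map<String, SumCount> acc, String key, SumCount partial) {
        acc.merge(key, partial, SumCount::merge);
    }

    /** Reducer step: merge the combined output of several map tasks. */
    public static Map<String, SumCount> reduce(List<Map<String, SumCount>> parts) {
        Map<String, SumCount> total = new TreeMap<>();
        for (Map<String, SumCount> part : parts) {
            part.forEach((k, v) -> accumulate(total, k, v));
        }
        return total;
    }

    public static void main(String[] args) {
        // Illustrative values only, split across two simulated map tasks.
        Map<String, SumCount> task1 = new TreeMap<>();
        accumulate(task1, key(1986, "ADP"), new SumCount(0.01, 1));
        accumulate(task1, key(1986, "ADP"), new SumCount(0.02, 1));
        Map<String, SumCount> task2 = new TreeMap<>();
        accumulate(task2, key(1986, "ADP"), new SumCount(0.03, 1));
        accumulate(task2, key(1985, "ADP"), new SumCount(0.01, 1));

        List<Map<String, SumCount>> parts = new ArrayList<>();
        parts.add(task1);
        parts.add(task2);
        // Print year:symbol<TAB>average, sorted by key.
        reduce(parts).forEach((k, v) -> System.out.println(k + "\t" + v.average()));
    }
}
```

With these sample values, averaging the two per-task averages for 1986:ADP would weight the single record in the second task as heavily as the two records in the first, whereas merging sums and counts weights every record equally.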

Solution Preview

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;

// Hadoop's Tool interface is paired with the Configured base class,
// which supplies the getConf()/setConf() methods that Tool requires.
public class WordCount extends Configured implements Tool {

    public static class TokenizerMapper...
