Question

Transcribed Text

Project description

In this project you will work with a different database than in Computer Tasks 1 and 2. This database contains measurements of fish from the Togararalli (a stock survey of demersal fish around Iceland) carried out by Marine Resources Inc in 1998. The database contains the following columns:

    recid    Record number
    reit     Field (reporting field)
    smrt     Small field (subfield)
    tog_nr   Number of the tow (within a field)
    dag      Day (within a month)
    man      Month
    dyp_min  The smallest depth in the tow
    dyp_max  The greatest depth in the tow
    vf       Fishing gear (always the same here)
    nr       Number of fish in the sample
    le       Length of the fish
    ky       Sex of the fish (1 = male)
    kt       Maturity (1 = immature, 4 = spawning)
    aldur    Age of the fish
    osl      Ungutted weight (live weight)
    sl       Gutted weight
    li       Weight of the liver

Fields can be converted to latitude and longitude; see https://www.rdocumentation.org/packages/geo/versions/1.4-3/topics/rect2deg

It makes sense to start by looking at each variable individually, e.g. by drawing a bar chart, to get a feel for the data. It is also wise to consider whether it is better to treat age as a categorical or as a continuous variable.

The database that you are to work with is in Ugla (in the R computer project folder) and can also be found here: https://notendur.hi.is/~gunnar/LikTol/data98.csv

You can search Google for the reporting fields, or use the following to see where the fields are:

    library(readr)

    # Convert a field number (reit) to the latitude and longitude of its centre
    r2d <- function(r) {
      lat <- floor(r / 100)
      lon <- (r - lat * 100) %% 50
      halfb <- (r - 100 * lat - lon) / 100
      lon <- -(lon + 0.5)
      lat <- lat + 60 + halfb + 0.25
      data.frame(lat = lat, lon = lon)
    }

    gs <- read_csv("http://hi.is/~gunnar/LikTol/data98.csv")
    reitir <- unique(gs$reit)          # Remove duplicate values so that the picture looks nicer
    x <- r2d(reitir)$lon
    y <- r2d(reitir)$lat
    plot(x, y, type = 'n')             # Draw a picture of the sea areas
    text(x, y, as.character(reitir))

[Figure: the field numbers (from 315 in the south to 721 in the north) plotted at their positions; x-axis: longitude, roughly -25 to -15.]

(Note that this is NOT the best way to draw this picture; see the bonus question.)

a) Read the data file with the read_csv() command and store it in an object (data frame) named after the initials of the project members (if Gréta Halldórsdóttir and Sigurður Jónsson are working together they should name the object gs, but if Atli Pétursson is working alone the name should be ap). (My name is Aevar Andri, so AA.) The data set should not contain any missing values. Note that all values in the table are numbers. Some things are better done with factors (categorical variables), but the conversion can be done at later stages.

Create a new variable that contains the sea area. Name the new variable "hafsvaedi". The new variable should be part of your data table. The hafsvaedi variable is defined by the categories SW, NW, NE and SE. The categories are divided as follows:

The S-N division is at 65 degrees north latitude (i.e. 65, so fields 527 and 561 fall to the north).
The E-W division is at 19 degrees west longitude (i.e. -19, so fields 319 and 569 fall to the west).
Your hafsvaedi variable will therefore have four categories. Tip: use the function r2d() given above to create lat and lon variables in your data frame, and then use them to create the hafsvaedi variable. The case_when() function from the dplyr package is a neat tool for this.

Create a new maturity variable, kt2, which contains only two categories: mature and immature. The new maturity variable, kt2, depends on the old maturity variable, kt, as follows. The maturity variable in the data, kt, contains the categories

    1            immature                      -> immature
    4            spawning (a maturity stage)   -> mature
    2, 3, 5, 22  other maturity stages         -> mature

The lifetime of a fish consists of several different maturity stages. A fish is either immature or at some stage of maturity, and fish at any stage of maturity are classified as mature. Here you can use either the case_when() or the ifelse() function. From here on we will work with the maturity variable with these two classes.

b) Show in a table how many fish of each maturity level were caught in each sea area. Create another table showing the proportion of each maturity level in each sea area (for example, it should be possible to read off what proportion of the fish in the south-west area (SW) is mature and what proportion is immature). Draw a picture that describes the number of fish of each maturity level in each of the four sea areas.

c) Show in a table the number of fish, the average length, the average weight and the standard deviation of length, by age. Briefly describe the results shown in the table. Draw the following two pictures:

A picture showing fish length by age when age is treated as a continuous variable, together with the average fish length for each age group drawn as larger red dots.
A box plot showing the length of fish by age when age is treated as a categorical variable.

Are there any outliers in the age groups? What are the benefits of each presentation?
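As a rough illustration of task a) and the tables in task b), the following base-R sketch derives hafsvaedi and kt2 with ifelse(), which the task allows as an alternative to case_when(). The data frame toy and its values are hypothetical stand-ins for the real file, and the treatment of points exactly on a boundary (lat >= 65 counted as north, lon <= -19 counted as west) is an assumption; field centres never lie exactly on either boundary, so this choice should not matter in practice.

```r
# Field-to-coordinate conversion, same logic as the r2d() helper in the task
r2d <- function(r) {
  lat <- floor(r / 100)
  lon <- (r - lat * 100) %% 50
  halfb <- (r - 100 * lat - lon) / 100
  data.frame(lat = lat + 60 + halfb + 0.25, lon = -(lon + 0.5))
}

# Hypothetical toy data standing in for the real data file
toy <- data.frame(
  reit = c(527, 561, 319, 569, 372, 414),
  kt   = c(1, 4, 2, 1, 22, 5)
)

# Add latitude and longitude of each field centre
coords <- r2d(toy$reit)
toy$lat <- coords$lat
toy$lon <- coords$lon

# North/south split at 65 N, east/west split at 19 W (lon = -19)
ns <- ifelse(toy$lat >= 65, "N", "S")
ew <- ifelse(toy$lon <= -19, "W", "E")
toy$hafsvaedi <- factor(paste0(ns, ew), levels = c("SW", "NW", "NE", "SE"))

# kt = 1 is immature; every other stage counts as mature
toy$kt2 <- factor(ifelse(toy$kt == 1, "immature", "mature"),
                  levels = c("immature", "mature"))

# Counts per sea area and maturity level, and row-wise proportions
counts <- table(toy$hafsvaedi, toy$kt2)
props  <- prop.table(counts, margin = 1)
print(counts)
print(round(props, 2))
```

With real data, a sea area that contains no fish would give NaN in its row of props, so it may be worth checking the row totals of counts first.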
d) Now select two sea areas at random with the sample() function, after running set.seed() with your birthday as in project 2. (My birthday is 9 June 1999.) Randomly select 50 fish from each of the two sea areas. To select the fish, it is good to use the sample_n() function from the dplyr package, because it selects rows from a data frame at random so that you get a new data frame for each sea area (you can also use sample()). Remember to run set.seed() with your birthday before running each sample_n() command. Combine the 100 fish into one data table (data frame). The rbind() function is a good way to merge the data frames.

e) With the appropriate hypothesis test, at the 95% confidence level (significance level 0.05), check for a difference between the average lengths of fish in the two sea areas. Present the hypotheses formally (H0 and H1). Specify the conditions that the data must fulfil for the hypothesis test to give a sound result. Also show a 95% confidence interval for the difference between the means of the two sea areas. Does the confidence interval contain 0? Why or why not? Explain.

f) Draw a histogram of length for each sea area using all the length measurements from the original database. In each histogram, draw in red the density curve of the normal distribution that the data for that sea area would "naturally" be expected to follow. Use binwidth = 3 for the histograms.
It is best to draw the four pictures together with the density functions as follows.

Create a long-form data frame that contains the categorical variable hafsvaedi and the length variable le:

    library(reshape2)
    gs_long = melt(gs, id.vars = 'hafsvaedi', measure.vars = 'le', value.name = 'le')

Simulate data from the "natural" normal distribution for each sea area:

    library(dplyr)

    # Define a function that takes the data vector x and returns data that
    # follow the "natural" normal distribution of the vector x
    get_normal_density <- function(x, binwidth) {
      grid <- seq(min(x), max(x), length = 100)
      data.frame(
        le = grid,
        normal_curve = dnorm(grid, mean(x), sd(x)) * length(x) * binwidth
      )
    }

    # Define a parameter for binwidth
    BW <- 3

    # Generate normally distributed data for each sea area by applying the
    # "get_normal_density" function to the length measurements belonging to
    # each sea area
    normaldens <- gs %>%
      group_by(hafsvaedi) %>%
      do(get_normal_density(x = .$le, binwidth = BW))

You now need to write the code that draws the four pictures. Use faceting as in task 2 (for example facet_wrap()), and use the geom_line() function from the ggplot2 package together with the normally distributed data created above to draw the density curves into the pictures. Interpret what the pictures show. Do the data follow the normal distribution? Draw a conclusion.

g) Remember that one of the prerequisites for using the t-test as done above is that the data are normally distributed. There are numerous tests other than the t-test that do not require a normal distribution. One of them is the randomization test, also called the permutation test.
In this test, all the data (the lengths that went into calculating the test statistic of the t-test) from the two data sets (the two sea areas used in the preceding items) are pooled. The pooled data are then shuffled at random (sampled without replacement) and split into two data sets of the same size (same number of data points) as the original ones, a new t-value is calculated, and you count how often the new t-value is larger in absolute value than the original t-value (which you calculated in the ordinary t-test). The p-value of the test is the proportion of times that the new t-value is larger in absolute value than the original t-value.

Perform the randomization test for your t-test. Let's say you saved the test statistic of the original t-test in the variable t0. Then you can compare the absolute value of t0 with the absolute value of

    t.test(z[sample(1:length(z))] ~ xyind)$statistic

where z holds the pooled data and xyind is a vector of the same length as z. The vector xyind contains the values 1 and 2 and is used to divide the data into two data sets of the same size as the original ones (one for each sea area) used in the preceding items. Repeat this 5000 times using the replicate() function as in previous projects. State the p-value together with the conclusion you draw (in running text). Is the conclusion consistent with the one you drew from the ordinary t-test?

h) Examine, with the appropriate hypothesis test at the 95% confidence level (significance level 0.05), the difference between the proportions of mature fish in the two sea areas. Note that you may need to drop the unused levels of the categorical variable hafsvaedi. You can do this with the droplevels() function. Present a table of the number of mature and immature fish in each sea area. Present the hypotheses formally (H0 and H1). State the value of the test statistic, the p-value and the estimates of the parameters being tested, in running text.
Indicate whether the estimated proportion refers to mature or immature fish. In a few words, say what conclusion you draw. Also show a 95% confidence interval for the difference in proportions. Does the confidence interval contain 0? Why or why not?

i) Choose either of the sea areas that you worked with in the items above (you should have 50 fish). You can directly use one of the data frames you created in item d) or use filter() on your combined data frame. Draw a picture showing the relationship between the length and the weight of the fish (weight is the dependent variable, on the y-axis). Draw a picture showing the relationship between the log of the length and the log of the weight of the fish (weight is the dependent variable, on the y-axis). Build a regression model that can be used to predict weight from length. Store the model in the variable fit. Note that it is most natural here to use the log of both variables (why? - you need to explain this). Present the model together with the estimates of its coefficients. In a few words, say whether you think it is wise to use a model like this to predict weight from the length of the fish.

j) Draw a picture again, as in item i), showing the relationship between the length and the weight of the fish (length is the independent variable and weight is the dependent variable). This time, draw in the picture both the best straight line and the best line given by the model you created in item i). Draw the best line of the model in red. In the ggplot2 package there is a function that automatically draws the best straight line through the data for you: stat_smooth(method='lm', se=FALSE). The simplest way to draw the best line of the model is to create a new data frame containing the values of the x variable in the model and the values predicted by the model. This can be done as follows:

    gogn_likan <- data.frame(
      X = ?(fit$model[["X"]]),
      y = ?(predict(fit))
    )

where you need to replace ? with a function that returns the data to the original scale (the model was created on a log scale) and X with the name of the x variable in the model. You can then use the ggplot2 function geom_line() to draw the model's best line. Note that since the model was created on log-scale data, it is natural that its best line is NOT straight once it has been mapped back to the original scale.

k) Here we continue to work with the sea area that was selected in the previous items. Straight lines are often not the right model. First draw length versus age for your sea area, as a box plot. Create two models with the lm() function: on the one hand a model that fits a straight line through the data, and on the other a free model with age as a categorical variable:

    litid <- lm(le ~ aldur, data = gs)
    stort <- lm(le ~ factor(aldur), data = gs)
    anova(litid, stort)

Remember to use your own data and not data=gs. Both models are called linear models because they are linear in their parameters. However, the latter model does not describe a straight-line relationship between age and length at all. Interpret the result, both the picture and the test that the anova command performed for you, which compares the two models. You can do this as a formal hypothesis test or in words, but you must at least interpret both the picture and the last number in the table.

Bonus question) Find out for yourself how to draw the correct longitude and latitude together with the outline of the country, and delineate the area of your choice, preferably with the reporting fields. Note that this requires the use of a sensible map projection, etc. You can start with this, but it is not enough at all:

[Figure: empty map frame; x-axis: long, -25.0 to -15.0; y-axis: latitude, 64 to 66.]
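The randomization test described in task g) can be sketched in base R as follows. This is only a minimal illustration under stated assumptions: the two length vectors x and y are simulated, hypothetical stand-ins for the 50 fish sampled from each sea area, and the seed is a placeholder rather than a birthday.

```r
set.seed(1)  # placeholder seed; the task asks for your birthday here

# Hypothetical stand-in data: lengths of 50 fish from each of two sea areas
x <- rnorm(50, mean = 45, sd = 10)
y <- rnorm(50, mean = 50, sd = 10)

# Pool the data and build the group indicator (1 = first area, 2 = second)
z <- c(x, y)
xyind <- rep(1:2, times = c(length(x), length(y)))

# Test statistic of the ordinary t-test
t0 <- t.test(z ~ xyind)$statistic

# Shuffle the pooled lengths 5000 times and recompute the t-value each time
t_perm <- replicate(5000, t.test(z[sample(1:length(z))] ~ xyind)$statistic)

# p-value: share of shuffled t-values larger in absolute value than t0
p_value <- mean(abs(t_perm) > abs(t0))
print(p_value)
```

Comparing absolute values makes this a two-sided test, which matches a two-sided alternative hypothesis in the ordinary t-test.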

