Project description
In this project you will work with a different data set than in Computer Tasks
1 and 2. This data set contains measurements of fish from the Togararalli
(the Togararalli is a stock survey of demersal fish around Iceland) carried out
by Marine Resources Inc in 1998. The data set contains the following columns:
recid    Record number
reit     Reporting field
smrt     Subfield (within a reporting field)
tog_nr   Number of the tow (within the field)
dag      Day (within the month)
man      Month
dyp_min  Smallest depth during the tow
dyp_max  Greatest depth during the tow
vf       Fishing gear (always the same here)
nr       Number of fish in the sample
le       Length of the fish
ky       Sex of the fish (1 = male)
kt       Maturity stage (1 = immature, 4 = spawning)
aldur    Age of the fish
osl      Ungutted weight (live weight)
sl       Gutted weight
li       Weight of the liver
Reporting fields (reit) can be converted to latitude and longitude; see
https://www.rdocumentation.org/packages/geo/versions/1.4-3/topics/rect2deg
It makes sense to start by looking at each variable individually, e.g. by
drawing a bar chart, to get a feel for the data. It is also wise to consider
whether it is better to treat age as a categorical or a continuous variable.
The data set that you are to work with is in Ugla (in the R computer project
folder) and can also be found here:
https://notendur.hi.is/~gunnar/LikTol/data98.csv
You can search Google for "reporting fields", or use the following code to
see where the fields are:
library(readr)
# Convert a reporting field number to the latitude and longitude of its centre
r2d <- function(r)
{
  lat <- floor(r/100)
  lon <- (r - lat * 100) %% 50
  halfb <- (r - 100 * lat - lon)/100
  lon <- -(lon + 0.5)
  lat <- lat + 60 + halfb + 0.25
  data.frame(lat = lat, lon = lon)
}
gs <- read_csv("http://hi.is/~gunnar/LikTol/data98.csv")
reitir <- unique(gs$reit) # Remove duplicate values so that the plot looks nicer
x <- r2d(reitir)$lon
y <- r2d(reitir)$lat
plot(x, y, type='n') # Draw a plot of the sea areas
text(x, y, as.character(reitir))
[Figure: output of the code above, with each reporting field number (315 to 721) drawn at its position; horizontal axis x]
(Note that this is NOT the best way to draw this picture; see the bonus
question.)
a)
Read the data file with the read_csv() command and store it in an object (data
frame) named with the initials of the project members (if Gréta Halldórsdóttir
and Sigurður Jónsson are working together they should name the object gs, but
if Atli Pétursson is working alone the name should be ap). (My name is Aevar
Andri, so AA.)
There should be no missing values in the data file. Note that all values in
the table are numbers. Some things are better done with factors (categorical
variables), but the conversion can be done at later stages.
Create a new variable that contains the sea area. Name the new variable
hafsvaedi. The new variable should be part of your data frame.
The hafsvaedi variable is to be defined by the categories SW, NW, NE and SE.
The breakdown of the categories is as follows:
The south-north division is at latitude 65 degrees north (i.e. 65, so
fields 527 and 561 fall to the north).
The east-west division is at longitude 19 degrees west (i.e. -19, so
fields 319 and 569 fall to the west).
Your hafsvaedi variable will be a four-category variable.
Tip: Use the function r2d() given above to create lat and lon variables in your
data frame, and then use them to create the hafsvaedi variable.
The case_when() function from the dplyr package is handy here.
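The tip above can be sketched as follows. This is only an illustration: the data frame demo and its five field numbers are made up, r2d() is written out as the usual reporting-field conversion, and the treatment of points exactly on a boundary (counted as north and east here) is an assumption that never matters in practice because field centres do not lie on the boundaries.

```r
library(dplyr)

# Reporting field -> latitude/longitude of the field centre
r2d <- function(r) {
  lat <- floor(r / 100)
  lon <- (r - lat * 100) %% 50
  halfb <- (r - 100 * lat - lon) / 100
  lon <- -(lon + 0.5)
  lat <- lat + 60 + halfb + 0.25
  data.frame(lat = lat, lon = lon)
}

# Hypothetical stand-in for the real data: only the reit column is needed here
demo <- data.frame(reit = c(527, 561, 319, 569, 414))

demo <- demo %>%
  mutate(
    lat = r2d(reit)$lat,
    lon = r2d(reit)$lon,
    hafsvaedi = case_when(
      lat >= 65 & lon <  -19 ~ "NW",
      lat >= 65 & lon >= -19 ~ "NE",
      lat <  65 & lon <  -19 ~ "SW",
      TRUE                   ~ "SE"
    )
  )
demo
```

Fields 527 and 561 end up north of 65, and fields 319 and 569 end up west of -19, in line with the examples in the task text.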
Create a new maturity variable, kt2, which contains only two categories:
mature and immature.
The new maturity variable, kt2, depends on the old maturity variable, kt, as
follows. The maturity variable in the data, kt, contains the categories
1 = immature
4 = spawning (a stage of maturity): mature
2, 3, 5, 22 = other stages of maturity: mature
The life of a fish consists of several maturity stages. A fish is either
immature or at some stage of maturity, and fish at any stage of maturity are
classified as mature.
Here you can use either the case_when() or the ifelse() function.
From here on we will work with the maturity variable with these two classes.
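A minimal sketch of the recoding with ifelse(), assuming the only codes in kt are the ones listed above. The vector kt here is made up for illustration, and the labels "immature" and "mature" are one possible naming.

```r
# Hypothetical maturity codes covering every value listed above
kt <- c(1, 4, 2, 3, 5, 22, 1)

# Code 1 is immature; every other code is some stage of maturity
kt2 <- factor(ifelse(kt == 1, "immature", "mature"),
              levels = c("immature", "mature"))
table(kt2)
```

Storing kt2 as a factor fixes the order of the two classes in later tables and plots.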
b)
Show in a table how many fish of each maturity level were caught in each
sea area.
Create another table showing the proportion of each maturity level in each sea
area (for example, it should be possible to read off what proportion of the
fish in the southwest area (SW) is mature and what proportion is immature).
Draw a picture that shows the number of fish of each maturity level in each
of the four sea areas.
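One possible way to build the two tables and the picture with base R is sketched here. The data frame demo is a made-up toy example; in the project the hafsvaedi and kt2 columns come from your own data frame.

```r
# Hypothetical toy data
demo <- data.frame(
  hafsvaedi = c("SW", "SW", "SW", "NW", "NW", "NE", "SE", "SE"),
  kt2 = c("mature", "immature", "mature", "immature",
          "immature", "mature", "mature", "immature")
)

counts <- table(demo$hafsvaedi, demo$kt2)   # number of fish per area and maturity
props <- prop.table(counts, margin = 1)     # row proportions: each area sums to 1

counts
round(props, 2)

# Grouped bar chart of the counts, one group of bars per sea area
barplot(t(counts), beside = TRUE, legend.text = TRUE,
        xlab = "Sea area", ylab = "Number of fish")
```

Using margin = 1 makes each row (sea area) sum to 1, which is what lets you read off the share of mature and immature fish within an area.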
c)
Show in a table the number of fish, the average length, the average weight and
the standard deviation of length, by age. Briefly describe the results shown
in the table.
Draw the following two pictures:
A picture showing fish length by age when age is treated as a
continuous variable, along with the average fish length for each age group as
larger red dots
A box plot showing the length of fish by age when age is treated as a
categorical variable
Are there any outliers in the age groups?
What are the benefits of each presentation?
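The summary table and the two pictures could be produced along these lines. The toy data frame is made up, and using osl (ungutted weight) as "the weight" is an assumption; the task text does not say which weight column is meant.

```r
library(dplyr)

# Hypothetical toy data standing in for the real aldur, le and osl columns
demo <- data.frame(
  aldur = c(3, 3, 4, 4, 4, 5),
  le    = c(30, 34, 40, 44, 42, 55),
  osl   = c(250, 350, 600, 800, 700, 1500)
)

by_age <- demo %>%
  group_by(aldur) %>%
  summarise(
    n        = n(),
    mean_le  = mean(le),
    mean_osl = mean(osl),
    sd_le    = sd(le)
  )
by_age

# Length against age as a continuous variable, group means as larger red dots
plot(demo$aldur, demo$le, xlab = "Age", ylab = "Length")
points(by_age$aldur, by_age$mean_le, col = "red", pch = 19, cex = 1.5)

# Box plot of length with age treated as a categorical variable
boxplot(le ~ factor(aldur), data = demo, xlab = "Age", ylab = "Length")
```

An age group with a single fish gets NA as its standard deviation, which is worth mentioning when describing the table.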
d)
Now select two sea areas at random with the sample() function, after running
set.seed() with your birthday as in project 2. (My birthday is 9 June 1999.)
Then randomly select 50 fish from each of the two sea areas. To select the
fish, it is good to use the sample_n() function from the dplyr package, because
it selects rows from a data frame at random, so that you get a new data frame
for each sea area (you can also use sample()). Remember to run set.seed() with
your birthday before running each sample_n() command.
Combine the 100 fish into one data frame. It is good to use the
rbind() function to merge the data frames.
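The sampling steps might look like this. The data frame demo is made up, and the seed value is only a placeholder: the task asks for your birthday in the format used in project 2, which is not specified here.

```r
library(dplyr)

# Hypothetical toy data: four areas with 80 fish each
demo <- data.frame(
  hafsvaedi = rep(c("SW", "NW", "NE", "SE"), each = 80),
  le = rep(seq(20, 99), times = 4)
)

seed <- 1234  # placeholder; replace with your birthday as in project 2

set.seed(seed)
areas <- sample(unique(demo$hafsvaedi), 2)   # pick two areas at random

set.seed(seed)
first <- sample_n(filter(demo, hafsvaedi == areas[1]), 50)
set.seed(seed)
second <- sample_n(filter(demo, hafsvaedi == areas[2]), 50)

both <- rbind(first, second)   # 100 fish in one data frame
table(both$hafsvaedi)
```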
e)
With the appropriate hypothesis test, at the 5% significance level (95%
confidence), check whether the average length of fish differs between the two
sea areas.
State the hypotheses formally (H0 and H1).
Specify the conditions that the data must fulfill for the hypothesis test to
give a sound result.
Also show a 95% confidence interval for the difference between the means of
the two sea areas. Does the confidence interval contain 0? Why or why not?
Explain.
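A two-sample t-test is one natural candidate for this item. The sketch below uses simulated lengths rather than the real samples, and relies on R's default Welch version of the test (no equal-variance assumption); whether that or the pooled version is expected is not stated in the task.

```r
set.seed(1)  # arbitrary seed, only to make this illustration reproducible

# Hypothetical samples: 50 lengths from each of two areas
both <- data.frame(
  hafsvaedi = rep(c("NE", "SW"), each = 50),
  le = c(rnorm(50, mean = 45, sd = 10), rnorm(50, mean = 50, sd = 10))
)

# H0: the mean lengths are equal; H1: they differ (two-sided)
res <- t.test(le ~ hafsvaedi, data = both)   # Welch test by default

res$statistic   # t value
res$p.value
res$conf.int    # 95% confidence interval for the difference in means

# The interval contains 0 exactly when the test does not reject at the 5% level
contains_zero <- res$conf.int[1] <= 0 && res$conf.int[2] >= 0
contains_zero
```

The last two lines make the link between the test and the interval explicit: both are built from the same t statistic, so they always agree.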
f)
Draw a histogram of length for each sea area using all the length
measurements from the original data set.
In each histogram, draw in red the density curve of the normal distribution
that the data for that sea area would be expected to follow if they were
normally distributed.
Use binwidth = 3 for the histograms.
It is best to draw the four pictures along with the density curves as follows:
Create a long-form data frame that contains the categorical variable
hafsvaedi and the length variable le
library(reshape2)
gs_long <- melt(gs, id.vars = 'hafsvaedi', measure.vars = 'le', value.name = 'le')
Compute the density of the matching normal distribution for each sea area
library(dplyr)
# Define a function that takes the data vector x and returns the curve of the
# normal distribution with the same mean and standard deviation as x, scaled
# to the counts of a histogram with the given binwidth
get_normal_density <- function(x, binwidth) {
  grid <- seq(min(x), max(x), length = 100)
  data.frame(
    le = grid,
    normal_curve = dnorm(grid, mean(x), sd(x)) * length(x) * binwidth
  )
}
# Define a parameter for binwidth
BW <- 3
# Generate the normal curve for each sea area by applying the
# "get_normal_density" function to the length measurements of each sea area
normaldens <-
  gs %>%
  group_by(hafsvaedi) %>%
  do(get_normal_density(x = .$le, binwidth = BW))
You now need to write the code to draw the four pictures. Use the facets
function as in task 2, and use the geom_line() function from the
ggplot2 package together with the normal curve data created
above to draw the density curves into the pictures.
Interpret what the pictures show. Do the data follow the normal
distribution? Draw a conclusion.
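A sketch of how the pieces might fit together, using simulated lengths in place of the real data. The choice of facet_wrap() is an assumption (the task only says to use facets as in task 2), and the toy data frame demo is made up.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # arbitrary seed for the toy data
# Hypothetical toy data standing in for the real hafsvaedi and le columns
demo <- data.frame(
  hafsvaedi = rep(c("SW", "NW", "NE", "SE"), each = 200),
  le = rnorm(800, mean = rep(c(40, 45, 50, 55), each = 200), sd = 10)
)

get_normal_density <- function(x, binwidth) {
  grid <- seq(min(x), max(x), length = 100)
  data.frame(
    le = grid,
    normal_curve = dnorm(grid, mean(x), sd(x)) * length(x) * binwidth
  )
}

BW <- 3
normaldens <- demo %>%
  group_by(hafsvaedi) %>%
  do(get_normal_density(x = .$le, binwidth = BW))

# One histogram per sea area, with the scaled normal curve drawn in red
p <- ggplot(demo, aes(x = le)) +
  geom_histogram(binwidth = BW) +
  geom_line(data = normaldens, aes(y = normal_curve), colour = "red") +
  facet_wrap(~ hafsvaedi)
p
```

Multiplying the density by length(x) * binwidth puts the curve on the same scale as the histogram counts, so the two can share one y-axis.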
g)
Remember, one of the prerequisites for using the t-test as done above is that
the data are normally distributed. There are numerous tests other than t-tests
that do not require a normal distribution. One such test is the randomization
test, or permutation test.
In this test, all the data (the lengths that went into calculating the test
statistic for the t-test) from the two data sets (the two sea areas used in the
preceding items) are pooled. The pooled data are shuffled at random (sampled
without replacement) and split into two data sets of the same sizes (same
number of data points) as the original data sets. A new t-value is calculated,
and you then count how often the new t-value is larger in absolute value than
the original t-value (which you calculated in the ordinary t-test).
The p-value of the test is the proportion of times that the new t-value is
larger in absolute value than the original t-value.
Perform the randomization test for your t-test:
Say you saved the test statistic of the original t-test in the variable t0.
Then you can compare the absolute value of t0 with the absolute value of
t.test(z[sample(1:length(z))]~xyind)$statistic, where xyind is a vector of the
same length as the pooled data stored in the variable z.
The vector xyind contains the values 1 and 2 and is used to divide the data into
two data sets of the same sizes as the original data sets (one for each sea
area) used in the preceding items.
Repeat this 5000 times using the replicate() function as in previous projects.
State the p-value together with the conclusion you draw (in running text).
Is the conclusion consistent with the one you drew from the ordinary t-test?
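The procedure described above can be sketched as follows, with simulated lengths standing in for the two real samples. The names z, xyind and t0 follow the task text; x, y, t_perm and p_value are illustrative.

```r
set.seed(1)  # arbitrary seed for this illustration

# Hypothetical lengths for the two samples of 50 fish
x <- rnorm(50, mean = 45, sd = 10)
y <- rnorm(50, mean = 50, sd = 10)

z <- c(x, y)                                        # pooled lengths
xyind <- rep(1:2, times = c(length(x), length(y)))  # group labels 1 and 2

t0 <- t.test(z ~ xyind)$statistic                   # original t value

# Shuffle the pooled lengths 5000 times and recompute the t value each time
t_perm <- replicate(5000, t.test(z[sample(1:length(z))] ~ xyind)$statistic)

# Share of shuffled t values that are larger in absolute value than t0
p_value <- mean(abs(t_perm) > abs(t0))
p_value
```

Shuffling z while keeping xyind fixed is what breaks any link between length and sea area, so the shuffled t values show what the statistic looks like when H0 is true.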
h)
Examine, with the appropriate hypothesis test at the 5% significance level
(95% confidence), the difference between the proportions of mature fish in the
two sea areas. Note that you may need to drop the unused sea areas from the
levels of the categorical variable hafsvaedi. You can do this with the
droplevels() function.
Show a table of the number of mature and immature fish in each sea area.
State the hypotheses formally (H0 and H1). Report the value of the test
statistic, the p-value and the estimates of the parameters being tested in
running text. Indicate whether the parameter estimates are for the proportion
of mature or immature fish.
In a few words, say what conclusion you draw.
Also show a 95% confidence interval for the difference in proportions. Does
the confidence interval contain 0? Why or why not?
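One candidate test here is prop.test() from base R. The counts below are made up; in the project they would come from a table of hafsvaedi against kt2 after droplevels() has removed the two unused areas.

```r
# Hypothetical counts: rows are sea areas, columns are maturity classes
counts <- matrix(c(30, 20,
                   18, 32),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(hafsvaedi = c("NE", "SW"),
                                 kt2 = c("mature", "immature")))

# H0: the proportion of mature fish is the same in both areas; H1: it differs.
# With a two-column matrix, prop.test() treats the first column as successes,
# so the estimates below are proportions of mature fish.
res <- prop.test(counts)

res$statistic   # chi-squared test statistic
res$p.value
res$estimate    # estimated proportion of mature fish in each area
res$conf.int    # 95% confidence interval for the difference in proportions
```

The column order of the table decides which class the estimates refer to, which is why the task asks you to state it explicitly.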
i)
Choose either of the two sea areas that you worked with in the items above (you
should have 50 fish). You can directly use one of the data frames you created
in item d) or use filter() on your combined data frame.
Draw a picture showing the relationship between the length and weight of the
fish (weight is the dependent variable, on the y-axis).
Draw a picture showing the relationship between the log of the length and the
log of the weight of the fish (weight is the dependent variable, on the y-axis).
Build a regression model that can be used to predict weight from length.
Store the model in the variable fit. Note that here it is most natural to use
the log of both variables (why? You need to explain this).
Present the model together with the estimates of its parameters.
In a few words, say whether you think it wise to use a model like this to predict
weight based on the length of the fish.
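A sketch of the two pictures and the log-log model on simulated data. The toy data are generated so that weight grows roughly as length cubed, which is an assumption made only to give the example a realistic shape, and osl is assumed to be the weight column used.

```r
set.seed(1)  # arbitrary seed for the toy data

# Hypothetical sample of 50 fish
le <- runif(50, min = 20, max = 90)
osl <- 0.01 * le^3 * exp(rnorm(50, sd = 0.1))
area <- data.frame(le = le, osl = osl)

plot(osl ~ le, data = area)             # curved on the original scale
plot(log(osl) ~ log(le), data = area)   # close to a straight line on the log scale

fit <- lm(log(osl) ~ log(le), data = area)
summary(fit)$coefficients   # parameter estimates with standard errors
summary(fit)$r.squared
```

If weight is proportional to a power of length, taking logs of both sides turns the relationship into a straight line whose slope is that power, which is why the slope estimate here comes out close to 3.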
j)
Draw a picture again, as in item i), showing the relationship between the
length and weight of the fish (length is the independent variable and weight is
the dependent variable). This time, draw in the picture the best straight line
along with the best line given by the model you created in item i). Draw the
best line of the model in red.
In the ggplot2 package there is a function that automatically draws the best
straight line through the data for you:
stat_smooth(method='lm', se=FALSE)
The simplest way to draw the best line of the model is to create a new data
frame containing the values of the x variable in the model and the model's
predicted values. This can be done as follows:
gogn_likan <-
  data.frame(
    X = ?(fit$model[["X"]]),
    y = ?(predict(fit))
  )
where you need to replace ? with a function that returns the data to the
original scale (the model was created on a log scale) and "X" with the name of
the x variable in the model.
You can then use the ggplot2 function geom_line() to draw the model's best
line. Note that since the model was fitted to log-scale data, it is natural
that its best line is NOT straight once it has been mapped back to the
original scale.
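For a model fitted with natural logs, exp() is the function that maps values back to the original scale. The sketch below assumes the model was fitted as lm(log(osl) ~ log(le)) on a made-up data frame called area, so the x column in fit$model is named "log(le)"; with a different formula the column name will differ.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed for the toy data
# Hypothetical sample of 50 fish
area <- data.frame(le = runif(50, min = 20, max = 90))
area$osl <- 0.01 * area$le^3 * exp(rnorm(50, sd = 0.1))
fit <- lm(log(osl) ~ log(le), data = area)

# exp() undoes the log transform, returning both columns to the original scale
gogn_likan <- data.frame(
  X = exp(fit$model[["log(le)"]]),
  y = exp(predict(fit))
)

p <- ggplot(area, aes(x = le, y = osl)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +                         # best straight line
  geom_line(data = gogn_likan, aes(x = X, y = y), colour = "red")  # log-log model
p
```

The red curve bends upward because a straight line in log(le) and log(osl) corresponds to a power curve on the original scale.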
k)
Here we continue to work with the sea area that was selected in the previous
item.
Straight lines are often not the right model.
First draw length against age for your sea area, as a box plot.
Create two models with the lm() function: on the one hand a model that
fits a straight line through the data, and on the other a free model with age
as a categorical variable:
litid <- lm(le~aldur, data=gs)
stort <- lm(le~factor(aldur), data=gs)
anova(litid, stort)
Remember to use your own data and not data=gs.
Both models are called linear models because they are linear in their
parameters. However, the latter model does not at all describe a straight line
between age and length.
Interpret the results, both the picture and the test that the anova command
performed for you, which compares the two models.
You can do this as a formal hypothesis test or in words, but you must at least
interpret both the picture and the last number in the table.
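The comparison can be tried out on simulated data before running it on your own sample. The toy data below are made up so that growth slows with age, which is an assumption chosen to make the straight-line model fit poorly; litid and stort follow the task text.

```r
set.seed(1)  # arbitrary seed for the toy data

# Hypothetical sample: 8 fish at each age from 2 to 8
aldur <- rep(2:8, each = 8)
le <- 100 * (1 - exp(-0.25 * aldur)) + rnorm(length(aldur), sd = 3)
area <- data.frame(aldur = aldur, le = le)

boxplot(le ~ factor(aldur), data = area, xlab = "Age", ylab = "Length")

litid <- lm(le ~ aldur, data = area)           # straight line in age
stort <- lm(le ~ factor(aldur), data = area)   # one free mean per age
comparison <- anova(litid, stort)
comparison

# The last number in the table is the p-value of the F test comparing the models
p_value <- comparison[["Pr(>F)"]][2]
p_value
```

A small p-value says the free model explains significantly more than the straight line, that is, the mean lengths by age do not lie on a line.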
Bonus question)
Find out for yourself how to draw the correct longitude and latitude along with
the outline of the country, and delineate the area of your choice, preferably
with the reporting fields. Note that this requires the use of a sensible map
projection, etc.
You can start with this, but this is not nearly enough:
[Figure: starting map frame; vertical axis ticks at 64, 65 and 66, horizontal axis "long" with ticks at 25.0, 22.5, 20.0, 17.5 and 15.0]