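# Converting the data to long format
library(tidyr)
# Stack the eight sample columns (columns 2 to 9) into key/value pairs
endotoxin_gene_exp_long <- gather(endotoxin_gene_exp, key = "class", value = "Values", 2:9)
# Samples 1 to 4 are labelled Endotoxin; all remaining samples are labelled Control
endotoxin_gene_exp_long$class_org <- ifelse(endotoxin_gene_exp_long$class %in% c("sample_1", "sample_2", "sample_3", "sample_4"), "Endotoxin", "Control")

# Drop rows with missing values
endotoxin_gene_exp_long <- na.omit(endotoxin_gene_exp_long)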
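
# A minimal sketch, not part of the original analysis: gather() is superseded in
# current tidyr, and pivot_longer() performs the same reshaping. The toy data
# frame below (toy_wide, with made-up values and assumed column names
# sample_1 ... sample_8) is hypothetical and only stands in for endotoxin_gene_exp.
library(tidyr)
set.seed(1)
toy_wide <- data.frame(time = c(2, 4, 6, 9, 24), matrix(rnorm(40), nrow = 5))
names(toy_wide)[2:9] <- paste0("sample_", 1:8)
# One row per (time, sample) pair, with the sample name in "class"
toy_long <- pivot_longer(toy_wide, cols = 2:9, names_to = "class", values_to = "Values")
toy_long$class_org <- ifelse(toy_long$class %in% paste0("sample_", 1:4), "Endotoxin", "Control")
table(toy_long$class_org)  # number of rows per group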

# Problem 2
# We have two different covariates, class and time. class is clearly a factor. What are the advantages and disadvantages of treating time as a continuous variable or as a factor variable?

# The advantage of treating time as a factor is that the data were collected at a small set of fixed hours, so no particular shape (such as a straight line) is imposed on how expression changes between those hours. The disadvantage is that we cannot assess the gene expression profiles of patients at times other than the sampled ones (2, 4, 6, 9 and 24).

# If we treat time as a continuous variable, the main advantage is that we can assess the gene expression profiles of patients at times other than the predefined times in the dataset. The disadvantage is that expression is then assumed to change linearly with time. For every sample, time lies between 0 and 24 hours and takes only a few fixed values, so we treat it as a factor variable.
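
# A rough sketch of the trade-off above, using simulated data (sim is
# hypothetical, not the endotoxin dataset). The linear-time model is nested in
# the factor-time model, so an F-test can compare the two.
set.seed(1)
sim <- expand.grid(time = c(2, 4, 6, 9, 24), subject = 1:8)
sim$class_org <- ifelse(sim$subject <= 4, "Endotoxin", "Control")
sim$Values <- rnorm(nrow(sim), mean = 5)
fit_numeric <- lm(Values ~ time + class_org, data = sim)             # one slope for time
fit_factor  <- lm(Values ~ as.factor(time) + class_org, data = sim)  # four dummy coefficients
time_comparison <- anova(fit_numeric, fit_factor)  # tests departure from linearity
# Predicting at an unsampled hour: possible with numeric time, an error with factor time
new_obs <- data.frame(time = 12, class_org = "Control")
pred_numeric <- predict(fit_numeric, newdata = new_obs)
pred_factor <- tryCatch(predict(fit_factor, newdata = new_obs),
                        error = function(e) conditionMessage(e))  # 12 is a new factor level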

#clopper pearson()

# Part b
# Use linear regression and ANOVA to test the null hypothesis, and report a p-value.

# Let's fit a regression model to see the behaviour of the profiles
FitRegression <- lm(Values ~ as.factor(time) + class_org, data = endotoxin_gene_exp_long)
summary(FitRegression)

# From the results above, the overall F-test is significant (F-statistic 16.46, p-value 2.425e-09), which means that at least one of the predictors is associated with the variable under study. All the individual coefficient p-values are also significant except one, the coefficient for the 24-hour time point.
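
# For reference, a small sketch of where the overall p-value printed by
# summary() comes from: it is the upper tail of the F distribution at the
# model's F-statistic. The data here (sim_f) are simulated and hypothetical.
set.seed(2)
sim_f <- expand.grid(time = c(2, 4, 6, 9, 24), subject = 1:8)
sim_f$class_org <- ifelse(sim_f$subject <= 4, "Endotoxin", "Control")
sim_f$Values <- rnorm(nrow(sim_f)) + (sim_f$class_org == "Endotoxin")
fit_f <- lm(Values ~ as.factor(time) + class_org, data = sim_f)
f_stat <- summary(fit_f)$fstatistic  # named vector: value, numdf, dendf
overall_p <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"], lower.tail = FALSE))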

# Now let's proceed with the ANOVA

FitANOVA <- aov(Values ~ as.factor(time) + class_org, data = endotoxin_gene_exp_long)
summary(FitANOVA)
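
# A sketch of how the p-value for class could be pulled out programmatically,
# on simulated data (sim_a is hypothetical). Because class_org is the last term
# and has one degree of freedom, its sequential ANOVA F-test and its regression
# t-test give the same p-value.
set.seed(3)
sim_a <- expand.grid(time = c(2, 4, 6, 9, 24), subject = 1:8)
sim_a$class_org <- ifelse(sim_a$subject <= 4, "Endotoxin", "Control")
sim_a$Values <- rnorm(nrow(sim_a)) + (sim_a$class_org == "Endotoxin")
fit_lm <- lm(Values ~ as.factor(time) + class_org, data = sim_a)
p_from_anova <- anova(fit_lm)["class_org", "Pr(>F)"]
p_from_ttest <- summary(fit_lm)$coefficients["class_orgEndotoxin", "Pr(>|t|)"]
all.equal(p_from_anova, p_from_ttest)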

# 1. Given the distance matrix below, perform clustering using UPGMA....