I. Answer the following questions with respect to the following papers:
Lu, L. J. et al. Assessing the limits of genomic data integration for predicting protein networks. Genome Res 15, 945-953 (2005) PMID: 15998909
Linding, R. et al. Systematic discovery of in vivo phosphorylation networks. Cell 129, 1415-26 (2007) PMID: 17570479
Pujana, M. A. A. et al. Network modeling links breast cancer susceptibility and centrosome dysfunction. Nature Genetics 39, 1338-49 (2007) PMID: 17922014
1. Catalogue the genomic features of Lu et al. (2005) for inferring protein interaction networks. For each feature, describe the underlying “principle” that makes the feature informative for protein interactions. Which features are likely to be highly correlated? Why? Which features are found to be most significant in contributing to the prediction power of protein networks? Why?
2. Which of the techniques from Lu et al. (2005) are used to infer protein interactions in Pujana et al. (2007) and Linding et al. (2007)? What additional techniques are used?
3. For Pujana et al. (2007), download Supplementary Table S5 from Blackboard. It contains a list of the interacting genes/proteins (nodes) and their associations (edges) from the breast cancer network (BCN):
i. How were the genes/proteins of the BCN determined?
ii. What techniques or evidence were used to infer edges in the BCN?
iii. What proportion of the BCN edges is supported by each technique or type of evidence?
iv. Load the network into Cytoscape and identify and describe a region demonstrating functional modularity.
v. Describe the specific techniques or evidence used to predict the BRCA1 and HMMR interaction. What other proteins and functional modules are involved?
4. For Linding et al. (2007), download the supplementary NetworKIN Predictions file from Blackboard. What evidence was used to predict the substrate specificity of protein kinases ATM (for Rad50) and CDK1/CDC2 (for TP53BP1)? What are the scores for phosphorylation motif prediction and for the genomic evidence (context score), respectively? How do these scores compare with those of known substrates of ATM or CDK1?
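
As a starting point for question 3.iii, the sketch below tallies the share of edges supported by each evidence type in a tab-delimited edge table. It is a minimal illustration only: the column names (source, target, evidence), the gene names, and the evidence labels are hypothetical placeholders, and the actual layout of Supplementary Table S5 may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for an edge table. The real Supplementary Table S5
# may use different column names and evidence labels.
SAMPLE_EDGES = (
    "source\ttarget\tevidence\n"
    "GENE_A\tGENE_B\tcoexpression\n"
    "GENE_A\tGENE_C\tphysical_interaction\n"
    "GENE_B\tGENE_C\tcoexpression\n"
    "GENE_C\tGENE_D\tgenetic_interaction\n"
)


def evidence_proportions(handle, column="evidence"):
    """Return {evidence label: fraction of edges carrying that label}."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter(row[column] for row in reader)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


proportions = evidence_proportions(io.StringIO(SAMPLE_EDGES))
for label, share in sorted(proportions.items()):
    print(f"{label}: {share:.2f}")
```

To run this on the downloaded table, export it as tab-delimited text and pass an open file handle instead of the in-memory sample, adjusting the column argument to match the real header.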

II. Pick one of the papers, Pujana et al. (2007) or Linding et al. (2007), to read in depth and write a comprehensive summary of its hypothesis, methods, and findings. Use the following questions as a guide.
1. List the data-sets used in the study. Thoroughly describe all aspects of each data-set, including, but not limited to, the nature of the data (mRNA expression, protein-protein interaction, journal manuscripts, ontology, functional annotation, etc.); the source (public, private, generated in-lab); the type (experimental, computational prediction, online data resource); any processing and cleanup (manual curation, significance tests, thresholding, signal processing); and the data quality (experimental design, including replicates, experimental or sampling bias, potential for false positives and false negatives).
2. Each study integrates its various data-sets in a sequence of inclusion/exclusion strategies to expand and refine a set of functionally important proteins. Describe this structural aspect of each study, in detail. What proteins might be falsely included or excluded by this approach? How are protein identities mapped between the various data-sets? What thresholds and statistical criteria were used to guide this process?
3. Read the methods sections, supplemental material and follow the links of citations until you find a detailed description of each of the informatics methods used in the study. Give a detailed description of the methods used including details of the online databases, software, and formulae used to carry out each method. Assess the reproducibility of each informatics method. How many methods could you readily reproduce? How many of the data-sets are publicly available online? What format are they in?
4. Describe how graph drawing, visualization, graphics, and plots are used in each study. How are these tools used to guide the study’s investigations, and to communicate biological and statistical conclusions in the manuscript?
5. Describe the network and graph-theoretic properties examined in the study. How do these relate to the biological system in the study? When are these properties a good match for the biology and when might they be misleading?
6. Describe the statistical significance computations used. What is the null-model? Why is the null-model appropriate for the statistic being measured? Is the p-value determined by sampling or with respect to a theoretical probability model? How is multiple test correction done?
7. Identify data-sets used only to “sanity check” the conclusions of the systems biology model. Why are these data-sets appropriate to use in this way? How might these checks give a false sense of confidence in the systems biology model?
8. What experimental follow-up was carried out to test the predictions of the study?
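
For question 6, it may help to keep in mind what a sampling-based p-value and a multiple-test correction look like in practice. The sketch below is a generic illustration, not the procedure of either paper: the null sampler is a toy placeholder (uniform random numbers), and Benjamini-Hochberg is used here only as one common correction. All function names are hypothetical.

```python
import random


def empirical_p_value(observed, null_sampler, n_samples=1000, seed=0):
    """Estimate P(null statistic >= observed) by sampling the null model.

    The +1 terms keep the estimate away from an impossible p-value of 0.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if null_sampler(rng) >= observed)
    return (hits + 1) / (n_samples + 1)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy null model (an assumption): the statistic is uniform on [0, 1).
p = empirical_p_value(observed=2.0, null_sampler=lambda rng: rng.random())
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p, adjusted)
```

When reading the chosen paper, compare its null model against the placeholder here: a realistic null for network statistics usually preserves properties such as node degree, which a uniform sampler does not.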
