Biotechnology
1. Imagine you wish to obtain the complete sequence of the insert for a 150kb recombinant BAC using dideoxy sequencing. Your objective is always to produce a highly accurate sequence for the BAC (as close to 100% accurate as possible).

(a) You first consider using primer walking.
From one sequence run you can read from about 20nt beyond the end of the primer (the first few nts are usually unclear) onwards with decreasing clarity at lengths greater than about 700nt, so that the last 30nt of sequence you write down are not very certain.
(i) If you synthesized the next primer (for primer walking) corresponding to 19nt of the last 30nt read above, would the next sequencing run confirm whether any part of the last 30nt from the first run is correct? Explain.
(ii) Where would you choose to position the second primer (instead of the choice above)? Explain carefully: since you would be doing this repeatedly, you need a defined method to implement each time in order to choose an exact sequence.
(iii) Would you complete the project by making successive primers, "marching down" the BAC from one end all the way, from both ends meeting halfway, or from both ends all the way?

(b) You consider an alternative strategy of shotgun sequencing by making a library of plasmid clones containing roughly 2kb inserts derived from BAC sonication.
(i) What do you think will be the quality of any one sequence from the shotgun sequencing experiment compared to any one sequence from the primer walking experiment? Explain.
(ii) What is the total number of sequencing runs required to finish the project using the shotgun strategy compared to primer walking (estimate the ratio, with explanation)?
(iii) Assuming you make good choices in both approaches, are you likely to be more confident of the final BAC sequence using one approach or the other? Explain (do not just repeat your answer to (i)).
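
The comparison of run counts in (b)(ii) comes down to simple arithmetic once a few numbers are fixed. The sketch below shows one way to set it up. The 150kb insert and roughly 700nt read length come from the questions above; the net advance per walking step (600nt), the choice to walk both strands, and the 8-fold average shotgun coverage are assumptions made only for illustration, and different choices will change the ratio.

```python
import math

def primer_walking_runs(insert_len, advance_per_run, strands=1):
    """Runs needed when each walking step advances by advance_per_run nt."""
    return strands * math.ceil(insert_len / advance_per_run)

def shotgun_runs(insert_len, read_len, coverage):
    """Runs needed to reach a target average coverage with random clones."""
    return math.ceil(coverage * insert_len / read_len)

BAC_LEN = 150_000   # insert size from question 1
READ_LEN = 700      # approximate read length from part (a)
ADVANCE = 600       # ASSUMPTION: net advance per walking step
COVERAGE = 8        # ASSUMPTION: average shotgun coverage

walk = primer_walking_runs(BAC_LEN, ADVANCE, strands=2)  # ASSUMPTION: both strands
shot = shotgun_runs(BAC_LEN, READ_LEN, COVERAGE)
print(walk, shot, round(shot / walk, 1))
```

Changing ADVANCE, COVERAGE or the number of strands shows how sensitive the estimated ratio is to those assumptions.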

2. Consider shotgun sequencing of a whole mammalian genome (of unknown sequence), still using dideoxy sequencing. You make a 2kb insert plasmid library, sequence the products and start to merge overlapping sequences.
(i) In your merged sequence contig you find a region where all surrounding sequences are identical in individual sequence reads but some have a C at a certain position, while others have a T. What are TWO possible underlying reasons?
(ii) Imagine there are some tandem repeats of identical 180bp units in some of your individual sequence reads. What would be a good initial approach in your preparation for merging overlapping sequences to anticipate this kind of repeat and react to the presence of such sequences?
(iii) If there were at most 5 consecutive occurrences of the 180bp repeat (900bp total), would you ultimately expect to derive a definitive merged sequence that includes such regions? Be sure to explain how the total repeat length is critical to the outcome.
(iv) If the tandem repeats had a total length of more than 2kb but less than 10kb, how might you define the total length of a repeat if you had also constructed a library of 10kb plasmid clones and recorded the plasmid source of each sequenced insert (to produce mate-pairs)?
(v) Would the same approach as above (mate-pairs for longer inserts of defined size) also help to define the exact locations of a sequence that contains no internal tandem repeats but is found in many different locations in the genome? Explain whether your answer depends on the length of the sequence that is repeated.
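
As an illustration of what screening reads for tandem repeats (part ii) could look like in practice, the sketch below compares a read with itself shifted by each candidate offset and reports the smallest offset at which nearly all positions match. The function name `tandem_period`, the 95% identity threshold and the synthetic test read are hypothetical choices for this example, not part of any assembler. It also assumes the read consists mostly of repeat; a read with long unique flanks would need a windowed version.

```python
import random

def tandem_period(read, min_period=2, max_period=None, min_identity=0.95):
    """Return the smallest offset p at which the read matches itself
    shifted by p in at least min_identity of positions, or None."""
    n = len(read)
    if max_period is None:
        max_period = n // 2
    for p in range(min_period, max_period + 1):
        matches = sum(1 for i in range(n - p) if read[i] == read[i + p])
        if matches / (n - p) >= min_identity:
            return p
    return None

# Synthetic example: three tandem copies of a made-up 180bp unit.
rng = random.Random(0)
unit = "".join(rng.choice("ACGT") for _ in range(180))
read = unit * 3
print(tandem_period(read))
```

Reads flagged in this way could be set aside or masked before merging, so that overlaps are not called on repeat sequence alone.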

3. Suggest one short question you would like to see on the mid-term exam.

4. Second generation sequencing
Slide 21 in Topic 3 shows one way of preparing fragmented DNA for 454 sequencing by adding specific DNA sequences to each end. In that slide, the top strand runs 5' to 3' from right to left. The final product runs 5' to 3' from left to right, going from A through unknown fragment sequence to B.
Imagine that the exact sequence of the longest strand of the initial adapters for A and B were as below (these sequences do not deliberately have any complementarity or repeats):
A 5' CATATCGCCTCCCTCGCGCCAGATC 3' and B 5' CGATGCGCCTTGCCAGCCCGCTCAG 3'
Hence, the sequence of the A region in the recovered DNA at the bottom of the slide is exactly as written above, and the sequence of the B region on that fragment is the reverse complement of what is written above: 5' … CTGAGCGGGCTGGCAAGGCGCATCG 3'
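
The relationship between adapter B and the B region of the recovered strand can be checked directly: read 5' to 3', the B region should be adapter B reversed and complemented. A minimal sketch, using only the two sequences given above (the helper name `reverse_complement` is just for this example):

```python
def reverse_complement(seq):
    """Reverse a DNA sequence and swap each base for its Watson-Crick partner."""
    partner = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(partner[base] for base in reversed(seq))

ADAPTER_B = "CGATGCGCCTTGCCAGCCCGCTCAG"   # longest strand of adapter B, 5' to 3'
B_REGION = "CTGAGCGGGCTGGCAAGGCGCATCG"    # B region on the recovered strand, 5' to 3'

print(reverse_complement(ADAPTER_B) == B_REGION)
```

The same helper may be useful when working out the oligo and primer sequences asked for in part (a).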

The blue sequence includes one copy of each nucleotide as a sequencing control (to check the sequencing is working in the first four cycles).
(a) Strands like the one shown at the bottom of slide 21 are captured largely as single molecules by hybridizing to an oligonucleotide that is attached to a bead. In the next step (after forming an emulsion) the bead-attached oligo acts as a primer for DNA synthesis guided by the hybridized template.
(i) What should be the sequence of the bead-attached oligo? Explain the reason for any choices you make.
(ii) After PCR amplification and depositing beads with amplified templates into honeycomb wells, the amplified templates will be sequenced. What should be the sequence of the primer? Explain your choice of the exact 3' end of the primer.

(b) A standard sequencing run by a company selling sequencing services would involve taking your DNA sample, preparing a bead library, sequencing all of the bead-associated products simultaneously and returning your results (with optional bioinformatics analyses). The total number of sequencing reads might, for example, be 5 million in a single sequencing run. If you only wanted one million reads but you wanted to do that for five different libraries, could your libraries be constructed (by you or the company) in such a way that you could combine all five in a single sequencing run? Explain.

(c) What would you do if you wanted to sequence only 50kb of DNA? (Please take this question at face value, in isolation, with no assumptions from earlier questions or section titles.)

(d) In dideoxy sequencing the length of a read is limited by resolution of different sized fragments on a capillary polyacrylamide gel. For both 454 and Illumina methods there is also a limit to the length of any single DNA sequence read in a sequencing run.
(i) What is the reason (that applies to both 454 and Illumina) that there is a limit?
(ii) Why do you think this limit is smaller for Illumina (shorter sequencing reads) than for the 454 method? Aim for TWO reasons.

(e) An Illumina spot contains templates in both orientations. Consequently, it is possible to sequence consecutively from each end of an amplified DNA.
(i) Those paired sequences are generally separated by only a short distance and could not, for example, be 10kb apart if a library was prepared just as outlined in the slides. Why?
(ii) A library can be prepared in a different, more complicated way. For example, one could first purify fragments of a chosen, larger size range (say about 10kb), then ligate adapters (L and R) that have biotin modifications, then dilute and allow ligation of L and R to produce circles (don't worry about the tricks that allow these steps to occur efficiently), then sonicate (or nebulize) to fragment the circles, collect DNAs that have biotin and proceed to normal Illumina library construction. What has this elaborate procedure achieved that may be useful in a de novo whole genome sequencing project?

(f) If you were sequencing a genome de novo (for the first time), would this be easier if individual sequence reads were longer (as for 454) or shorter (as for Illumina), assuming the total sequence output were similar? Explain.

(g) If you already had a generic "normal" genome sequence, as for humans, and you were sequencing genomes from individuals in order to identify variants, what would be your choice of sequencing method? Consider looking for point mutations and looking for deletions, inversions or translocations.
