QuestionQuestion

1. Given a collection of strings, a larger string containing every one of the smaller strings as a substring is called a superstring. Assuming parsimony, the shortest possible superstring over a collection of reads serves as a candidate chromosome.

Write a program that inputs at most 50 DNA strings whose length does not exceed 1 kbp in FASTA format (which represent reads deriving from the same strand of a single linear chromosome) and outputs a shortest superstring containing all the given strings. The dataset is guaranteed to satisfy the following condition: there exists a unique way to reconstruct the entire chromosome from these reads by gluing together pairs of reads that overlap by at least half their length.

Sample Dataset:
    >Rosalind_56
    ATTAGACCTG
    >Rosalind_57
    CCTGCCGGAA
    >Rosalind_58
    AGACCTGCCG
    >Rosalind_59
    GCCGGAATAC

Sample Output (first line only, the following four lines are for illustration purposes):

                ATTAGACCTGCCGGAATAC
    Rosalind_56 ----------
    Rosalind_57       ----------
    Rosalind_58    ----------
    Rosalind_59          ----------

2. Write a branch-and-bound version of the MedianStringMotifSearch() algorithm shown below. You should base your alogrithm on the observation that, if the best alignment of a prefix of a candidate motif has a greater Hamming distance than the minimal Hamming distance of the best motif found so far, it could not possibly lead to a better solution.

import itertools

def ScanAndScoreMotif(DNA, motif):
    totalDist = 0
    bestAlignment = []
    k = len(motif)
    for seq in DNA:
       minHammingDist = k+1
       for s in xrange(len(seq)-k+1):
            HammingDist = sum([1 for i in xrange(k) if motif[i] != seq[s+i]])
            if (HammingDist < minHammingDist):
                bestS = s
                minHammingDist = HammingDist
       bestAlignment.append(bestS)
       totalDist += minHammingDist
    return bestAlignment, totalDist

def MedianStringMotifSearch(DNA,k):
    """ Consider all possible 4**k motifs"""
    bestAlignment = []
    minHammingDist = k*len(DNA)
    kmer = ''
    for pattern in itertools.product('acgt', repeat=k):
       motif = ''.join(pattern)
       align, dist = ScanAndScoreMotif(DNA, motif)
       if (dist < minHammingDist):
            bestAlignment = [s for s in align]
            minHammingDist = dist
            kmer = motif
    return bestAlignment, minHammingDist, kmer
Apply your algoritm to the set of sequences used in lecture:
seqApprox = [
    'tagtggtcttttgagtgtagatctgaagggaaagtatttccaccagttcggggtcacccagcagggcagggtgacttaat',
    'cgcgactcggcgctcacagttatcgcacgtttagaccaaaacggagttggatccgaaactggagtttaatcggagtcctt',
    'gttacttgtgagcctggttagacccgaaatataattgttggctgcatagcggagctgacatacgagtaggggaaatgcgt',
    'aacatcaggctttgattaaacaatttaagcacgtaaatccgaattgacctgatgacaatacggaacatgccggctccggg',
    'accaccggataggctgcttattaggtccaaaaggtagtatcgtaataatggctcagccatgtcaatgtgcggcattccac',
    'tagattcgaatcgatcgtgtttctccctctgtgggttaacgaggggtccgaccttgctcgcatgtgccgaacttgtaccc',
    'gaaatggttcggtgcgatatcaggccgttctcttaacttggcggtgcagatccgaacgtctctggaggggtcgtgcgcta',
    'atgtatactagacattctaacgctcgcttattggcggagaccatttgctccactacaagaggctactgtgtagatccgta',
    'ttcttacacccttctttagatccaaacctgttggcgccatcttcttttcgagtccttgtacctccatttgctctgatgac',
    'ctacctatgtaaaacaacatctactaacgtagtccggtctttcctgatctgccctaacctacaggtcgatccgaaattcg']

Use the %time magic function to compare the performance of your method to the one given.

Solution PreviewSolution Preview

These solutions may offer step-by-step problem-solving explanations or good writing examples that include modern styles of formatting and construction of bibliographies out of text citations and references. Students may use these solutions for personal skill-building and practice. Unethical use is strictly forbidden.

import itertools
import time

# Helper function -- Computes best alignment for DNA Sequence and Motif
def ScanAndScoreMotif(DNA, motif):
    totalDist = 0
    bestAlignment = []
    k = len(motif)
    for seq in DNA:
       minHammingDist = k+1
       for s in range(len(seq)-k+1):
            HammingDist = sum([1 for i in range(k) if motif[i] != seq[s+i]])
            if (HammingDist < minHammingDist):
                bestS = s
                minHammingDist = HammingDist
       bestAlignment.append(bestS)
       totalDist += minHammingDist
    return bestAlignment, totalDist

# Main function -- Computes Motif Search using Median String method
def MedianStringMotifSearch(DNA,k):
    """ Consider all possible 4**k motifs"""
    bestAlignment = []
    minHammingDist = k*len(DNA)
    kmer = ''
    bins = len(DNA[0])-k+1
    i = 0
    total = len(list(itertools.product('acgt', repeat=k)))
    increment = int(total/bins)
    for pattern in itertools.product('acgt', repeat=k):
       if (i % increment == 0):
            print(str((int(i/increment))+1)+"/"+str(bins))
       motif = ''.join(pattern)
       align, dist = ScanAndScoreMotif(DNA, motif)
       if (dist < minHammingDist):
            bestAlignment = [s for s in align]
            minHammingDist = dist
            kmer = motif
       i += 1
    return bestAlignment, minHammingDist, kmer


# Helper function -- computes score for DNA Sequence and Motif
def Score(s, DNA, k):
    """
       compute the consensus SCORE of a given k-mer
       alignment given offsets into each DNA string.
            s = list of starting indices, 1-based, 0 means ignore
            DNA = list of nucleotide strings
            k = Target Motif length
    """
    score = 0
    for i in range(k):
       # loop over string positions
       cnt = dict(zip("acgt",(0,0,0,0)))
       for j, sval in enumerate(s):
            # loop over DNA strands...

By purchasing this solution you'll be able to access the following files:
Solution.zip.

$50.00
for this solution

PayPal, G Pay, ApplePay, Amazon Pay, and all major credit cards accepted.

Find A Tutor

View available Python Programming Tutors

Get College Homework Help.

Are you sure you don't want to upload any files?

Fast tutor response requires as much info as possible.

Decision:
Upload a file
Continue without uploading

SUBMIT YOUR HOMEWORK
We couldn't find that subject.
Please select the best match from the list below.

We'll send you an email right away. If it's not in your inbox, check your spam folder.

  • 1
  • 2
  • 3
Live Chats