## Question

• Human Gene

• Mouse Gene

• Yeast Gene

Part 1: Devise a method to find the longest conserved substring shared by all three sequences. Recall that a substring is a contiguous series of bases; here it must be common to all three strings. Explain your approach. (You may write code to solve this part if you wish; if you do, include both your code and your answer.)
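One possible approach, sketched in Python: if the sequences share a substring of length k, they also share one of length k - 1, so the longest shared length can be found by binary search, testing each candidate length by intersecting the sets of length-k substrings. The function names and the toy sequences below are illustrative assumptions; the actual gene sequences are not reproduced here.

```python
def common_substring_of_length(seqs, k):
    """Return one substring of length k present in every sequence, or None."""
    if k == 0:
        return ""
    shared = None
    for s in seqs:
        # All length-k windows of this sequence (empty if k > len(s)).
        kmers = {s[i:i + k] for i in range(len(s) - k + 1)}
        shared = kmers if shared is None else shared & kmers
        if not shared:
            return None
    return min(shared)  # deterministic choice when several substrings tie


def longest_common_substring(seqs):
    """Binary-search the largest k for which a shared substring exists."""
    lo, hi = 0, min(len(s) for s in seqs)
    best = ""
    while lo <= hi:
        mid = (lo + hi) // 2
        hit = common_substring_of_length(seqs, mid)
        if hit is not None:
            best, lo = hit, mid + 1
        else:
            hi = mid - 1
    return best


# Toy stand-ins for the human, mouse, and yeast sequences (assumed data).
print(longest_common_substring(["GATTACAGG", "CCATTACAT", "TTATTACG"]))  # ATTAC
```

Storing every length-k substring costs memory proportional to sequence length times k, so for very long genes a generalized suffix tree or suffix automaton may be the better choice.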

Part 2: Write a function to compute the minimal pairwise edit distance between two gene sequences such as the ones given in Part 1. Assume exact matches are weighted as +1, mismatches as 0, and any introduced gaps as -1.

Longest Common Subsequence:

• A special case of edit distance where no substitutions are allowed

• A subsequence need not be contiguous, but order must be preserved. Example: if v = ATTGCTA, then AGCA and TTTA are subsequences of v, but TGTT and ACGA are not.

• The length of the LCS, s, is related to the strings' edit distance, d, by:

d(v, w) = len(v) + len(w) - 2 s(v, w)
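The subsequence definition and the LCS relation can be checked with a short Python sketch. The helper names (`is_subsequence`, `lcs_length`, `indel_distance`) are assumptions chosen for illustration, and the distance here allows insertions and deletions only, matching the "no substitutions" special case.

```python
def is_subsequence(candidate, v):
    """True if candidate's characters appear in v in the same order."""
    remaining = iter(v)
    # `in` advances the iterator, so each base must be found after the last one.
    return all(base in remaining for base in candidate)


def lcs_length(v, w):
    """Length s(v, w) of the longest common subsequence, by row-wise DP."""
    prev = [0] * (len(w) + 1)
    for a in v:
        curr = [0]
        for j, b in enumerate(w, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(v, w):
    """Edit distance with insertions and deletions only (no substitutions)."""
    return len(v) + len(w) - 2 * lcs_length(v, w)


v = "ATTGCTA"
print([is_subsequence(c, v) for c in ("AGCA", "TTTA", "TGTT", "ACGA")])
# [True, True, False, False]
```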

Difference between Gap and Mismatch

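Consider the two strings GTAGGCTTA and GTAGATA, aligned as follows:

GTAGGCTTA

GTAGA--TA

The G/A pair in the fifth column is a mismatch, and the two dashes are gaps (the original figure marked mismatches in red and gaps in blue). With scores of (1, 0, -1) for match, mismatch, and gap, charging every gap position separately gives 6 + 0 - 1 - 1 = 4. The total of 5 (6 - 1 + 0 + 0) holds only if the two-base gap is charged once, as a single introduced gap.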
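To make the gap/mismatch distinction concrete, here is a minimal sketch that scores an alignment that has already been written out, column by column. It assumes every gap position is charged separately; `score_alignment` and its parameter names are illustrative, not part of the assignment.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap=-1):
    """Sum the column scores of two equal-length aligned rows ('-' is a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap        # a base paired with nothing
        elif a == b:
            total += match      # identical bases
        else:
            total += mismatch   # two different bases in the same column
    return total


print(score_alignment("GTAGGCTTA", "GTAGA--TA"))
```

Under that assumption the alignment above has six matches, one mismatch, and two gap positions, so the call evaluates to 4.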

Example: Consider these 4 sequences

s1: GATTCA

s2: GTCTGA

s3: GATATT

s4: GTCAGC

• with the scoring matrix: {Match = 1, Mismatch = -1, IntroGap = -1}

• There are C(4, 2) = 6 possible pairwise alignments:

| Pair | First sequence (aligned) | Second sequence (aligned) | Score |
| --- | --- | --- | --- |
| s2, s4 | `GTCTGA` | `GTCAGC` | 2 |
| s1, s4 | `GATTCA--` | `G-T-CAGC` | 0 |
| s1, s2 | `GAT-TCA` | `G-TCTGA` | 1 |
| s2, s3 | `G-TCTGA` | `GATAT-T` | -1 |
| s1, s3 | `GAT-TCA` | `GATAT-T` | 1 |
| s3, s4 | `GAT-ATT` | `G-TCAGC` | -1 |

• The best pairwise score, 2, is between s2 and s4
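The all-pairs comparison can be reproduced with a small global-alignment scorer. This is a sketch of the standard Needleman-Wunsch recurrence with a linear gap penalty, not the solution's own code; `global_alignment_score` is an assumed name, and its defaults follow this example's scoring (match 1, mismatch -1, gap -1).

```python
from itertools import combinations


def global_alignment_score(v, w, match=1, mismatch=-1, gap=-1):
    """Best global alignment score (Needleman-Wunsch), linear gap penalty."""
    # Row 0: aligning the empty prefix of v against w[:j] costs j gaps.
    prev = [j * gap for j in range(len(w) + 1)]
    for i, a in enumerate(v, start=1):
        curr = [i * gap]  # column 0: v[:i] against the empty prefix of w
        for j, b in enumerate(w, start=1):
            diagonal = prev[j - 1] + (match if a == b else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


sequences = {"s1": "GATTCA", "s2": "GTCTGA", "s3": "GATATT", "s4": "GTCAGC"}
scores = {
    (a, b): global_alignment_score(sequences[a], sequences[b])
    for a, b in combinations(sequences, 2)
}
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])  # ('s2', 's4') 2
```

Passing `mismatch=0` switches the same function to the (+1, 0, -1) weights from Part 2.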

Part 3: Use the function that you wrote for Part 2 to find the edit distances (total scores) between all pairs of gene sequences given in Part 1.

editDistance (Human, Mouse) =

editDistance (Human, Yeast) =

editDistance (Mouse, Yeast) =

## Solution Preview


    import sys

    from numpy import zeros

    # Fix specifically because provided code hits a recursion limit in giant text files
    sys.setrecursionlimit(10000)

    # (SCORE_MATCH, SCORE_MISMATCH, SCORE_GAP)
    # EDIT THIS TO MODIFY SCORE FORMULA; REST OF CODE WILL USE THIS!
    SCORE_MATRIX = (0, 1, 1)


    def findLCS(v, w):
        # score[i,j]: LCS length of v[:i] and w[:j]
        # backt[i,j]: move that produced it (1 = up, 2 = left, 3 = diagonal)
        score = zeros((len(v)+1, len(w)+1), dtype="int32")
        backt = zeros((len(v)+1, len(w)+1), dtype="int32")
        for i in range(1, len(v)+1):
            for j in range(1, len(w)+1):
                # find best score at each vertex
                if v[i-1] == w[j-1]:
                    score[i,j], backt[i,j] = max((score[i-1,j-1] + 1, 3),
                                                 (score[i-1,j], 1),
                                                 (score[i,j-1], 2))
                else:
                    score[i,j], backt[i,j] = max((score[i-1,j], 1),
                                                 (score[i,j-1], 2))
        return (score, backt)


    def LCS(b,v...
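The preview cuts off before the traceback, and the truncated `LCS` function is not reconstructed here. As a separate illustration, a traceback over pointer codes like those stored in `backt` (1 = up, 2 = left, 3 = diagonal) can be written as a loop, which avoids any dependence on the recursion limit. The function `lcs_string` below is a hypothetical stand-alone sketch using plain lists instead of NumPy arrays.

```python
def lcs_string(v, w):
    """Return one longest common subsequence of v and w without recursion."""
    rows, cols = len(v) + 1, len(w) + 1
    score = [[0] * cols for _ in range(rows)]
    backt = [[0] * cols for _ in range(rows)]  # 1 = up, 2 = left, 3 = diagonal
    for i in range(1, rows):
        for j in range(1, cols):
            options = [(score[i - 1][j], 1), (score[i][j - 1], 2)]
            if v[i - 1] == w[j - 1]:
                options.append((score[i - 1][j - 1] + 1, 3))
            # Tuples compare by score first, then by pointer code.
            score[i][j], backt[i][j] = max(options)

    # Walk back from the bottom-right corner with a loop instead of recursion.
    i, j = len(v), len(w)
    picked = []
    while i > 0 and j > 0:
        move = backt[i][j]
        if move == 3:
            picked.append(v[i - 1])
            i, j = i - 1, j - 1
        elif move == 1:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(picked))


print(lcs_string("GTAGGCTTA", "GTAGATA"))  # GTAGTA
```

Full score and pointer tables take memory proportional to len(v) times len(w), which may matter for very long sequences.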
