Bioinformatics refers the application of statistics and computer science to the management and analysis of data from the biosciences. In common with computational linguistics, the field shares the task of detecting and searching for patterns in gigantic datasets. For this project you search for DNA sequences in the 24 chromosomes of the complete hg19 GRCh37 human genome. These base pairs are represented in text form, so the problem is essentially a text processing problem.
There are 4,965 target sequences. For each of these you will identify the chromosome file and the offset within the file for all matches. This problem was selected because this search cannot be performed efficiently without special attention to the algorithm that is used. Using grep around five thousand times to search 2.8 gigabytes probably wonâ€™t end well, even in this simplified exercise. Instead, you will store the search targets in a prefix trie and make a single pass through the genome corpus. Because a position in the trie keeps track of multiple simultaneous searches, this reduces the problem from ð‘‚(ð‘ð‘€) to ð‘‚(ð‘ log ð‘€). Hashing cannot be used for this purpose, because the target sequences are different lengths, so you donâ€™t know how many characters from the input to use for computing a hash.

Human Genome Corpus
The following directory on patas contains 24 files, corresponding to human chromosomes 1 through 22, plus X and Y. Each file describes an ordered sequence of nucleotide bases A, T, C or G.

Masking
In accordance with convention for DNA data, some of the nucleotides are represented by the corresponding lower-case letter. This indicates that they are â€œmasked,â€ meaning that they are of low complexity and thus believed to be less important. For this project, you will do a caseinsensitive search, so masked areas will be included (the target sequences are specified as all uppercase).
Unknown
There is no data available for some parts of the genome. These areas are indicated by the letter N. None of the target sequences contain N, and your trie does not need to be able to recognize N.
Loading the Target Sequences
You will develop a trie structure. Each trie node makes reference to (up to) four child nodes, one for each nucleotide. Any or all of these references can be null. When all four children of a node are null, the node is a terminal node. Each node (except the root) has exactly one parent, but nodes do not need to store a reference to their parent node, because navigation through a trie is always one-way, starting from the root. The elegance of a trie is that every node implicitly represents a unique string merely by its position.

Some trie implementations allow each node to store a â€œpayload.â€ One purpose of the payload is to signal whether the node represents the last character in a target string. In this case, the payload would be null if the trie node is not a stopping point. This sort of mechanism (for differentiating valid acceptance points) is required if the trie has to store strings that are exact prefixes of other strings that are also stored. For example, {â€œcatâ€ â€œcatamaranâ€}. In our application, none of the target sequences start with any of the others, so we donâ€™t need to use a payload. Instead, we will know if we are at an acceptance point (i.e. that we found a match) if the node has no child nodes. After defining the node structure, you will instantiate a single node to represent the trie root. Next, you will populate the trie with each target sequence.

Add each target sequence to the trie by starting at the trie root and following the sequenceâ€™s path of nucleotides. This is done by advancing through the target string and navigating/building the trie at the same time. For example, if you come to a child node that is missing, add it by creating a new node. When you reach the end of the target string, you should be at a node with no childrenâ€”a terminal node that you just created.
If your trie nodes store a payload (not required here), you would store the payload in this node in order to mark the node as a completed match. In such a design, matching can occur at non-terminal nodes.
The target strings can be discarded now, since they are all stored in the trie implicitly (and explicitly as well, if you chose to store each string as a payload).
Searching
Now that you have a trie that represents the strings that you are searching for, you can scan the corpus in a single pass, finding all the matches. You will still need to check every character position in the corpus, but the trie allows you to check all targets at the same time and abort as soon as any match from that position becomes impossible.
You will need to keep track of two index positions in each genome file ð‘–, ð‘— and one navigation positionâ€”a pointer to a node, ð‘ð‘›â€”within the trie. The â€œouterâ€ index position ð‘– will always advance one character at a time. As you read from the corpus, donâ€™t forget to normalize the letter-case to whatever you stored in the trie, since we do wish to search within masked data.
For each of these outer positions, the navigation position ð‘ð‘› will start at the trie root, and a second â€œinnerâ€ character position ð‘— will be initialized to the value of the outer index position ð‘–.
Now, similar to how the trie was loaded, ð‘ð‘› will attempt to navigate through the trie, according to the characters read from the corpus, beginning from ð‘—. Of course, this time we donâ€™t modify the tree; if further navigation becomes impossible and we are not at a terminal node, there is no match, so increment the overall position ð‘–, set ð‘— = ð‘– and ð‘ð‘› = ð‘Ÿð‘œð‘œð‘¡, and repeat.
If you are able to reach a terminal node by matching the corpus to the trie, then you have found a match. If you didnâ€™t store the target sequence as a payload in the terminal node, itâ€™s easy enough to recover the string that youâ€™ve matched: itâ€™s the substring of the input from ð‘– to ð‘—, inclusive. (Note that you donâ€™t need to build it as you go.) Record each match as described in the Output section below.

Performance
A major challenge will be dealing with the large files in the human genome corpus, since ð‘— does have to continually navigate forward and then reset (seek) backwards within a small area. If you are able to load each file into memory, this isnâ€™t a problem. Otherwise, you might consider using file seeking for all operations on the corpus, reading just one character at a time. This is a great solution if you can count on the operating system to buffer your read operations. Even better would be to memory-map the file youâ€™re reading, an advanced technique which might be available in your programming language or OS.
Your program will certainly be long-running. My single-threaded solution, written in C#, executes in about 4 minutes (4:09) on a 3.17GHz machine. If you are using patas/dryas, please submit your job to condor. You can use the â€˜condor-execâ€™ shortcut for this. If execution times are prohibitive, you can report results for a smaller fraction of the corpus (5 points deduction).
Output Format
Output your results to the console (stdout). Print each filename on a new line (the full path can be included). All matches found in that file will follow, with each match on a single line that begins with a tab. After the tab character, print the zero-based offset of the match within the file (please use hex numbers). Follow this with another tab and the sequence that was found. Print the sequence in all caps.

Bioinformatics refers the application of statistics and computer sc...