This problem will be comparing and measuring the differences between DNA sequences. This type of analysis is common in the Computational Biology field. There are three parts to this assignment which will ask you to implement respectively the following functions:
- findBestMatch(…), findBestGenome(…)
You are free to choose the names of any other supporting functions. The names should be reflective of the function they perform.
In the writeup, when referring to a function we often remove the details of the parameters that the function accepts; i.e. in similarityScore(…) shown above. We do this for brevity. The parameters are described in detail in the problem descriptions listed below.
DNA is the hereditary material in humans and other species. Almost every cell in a person’s body has the same DNA. All information in a DNA is stored as a code in four chemical bases: adenine (A), guanine (G), cytosine (C) and thymine (T). The differences in the order of these bases provide a means of specifying different information.
In the lectures there have been examples of string representation and use in C++. In this assignment we will create strings that represent the possible DNA sequences. DNA sequences use a limited alphabet (A,C,G,T) within a string. Here we will be implementing a number of functions that are used to search for substrings that may represent genes and protein sequences within the DNA.
One of the challenges in computational biology is determining where a DNA binding protein will bind in a genome. Each DNA binding protein has a preference for a specific sequence of nucleotides. This preference sequence is known as a motif. The locations of a motif within a genome are important to understanding the behavior of a cell’s response to a changing environment.
To find each possible location along the DNA, we must consider how well the motif matches the DNA sequence at each possible position. The protein does not require an exact match to bind and can bind when the sequence is similar to the motif.
Another common DNA analysis function is the comparison of newly discovered DNA genomic sequences to the large databases of known DNA sequences and using the similarity of the sequences to help identify the origin of the new sequences.
Hamming distance is one of the most common ways to measure the similarity between two strings of the same length. Hamming distance is a position-by-position comparison that counts the number of positions in which the corresponding characters in the string are different. Two strings with a small Hamming distance are more similar than two strings with a larger Hamming distance.
first string = "ACCT" second string = "ACCG" A C C T | | | * A C C G
In this example, there are three matching characters and one mismatch, so the Hamming distance is one.
The similarity score for two sequences is then calculated as follows:
similarity_score = (string length - hamming distance) / string length similarity_score = (4-1)/4 = 3/4 = 0.75
Two sequences with a high similarity score are more similar than two sequences with a lower similarity score. The two strings are always the same length when calculating a Hamming distance.
In this assignment, you will calculate the similarity of two sequences using Hamming distance between two strings, search a genomic string looking for matches to a subsequence, and calculate the similarity scores for a sample DNA sequences compared to known DNA sequences.
Here we have provided a small portion of a DNA sequence from a human, a mouse, and an unknown species. Smaller DNA sequences will be compared to each of these larger DNA sequences to determine which has the best match. Each of the DNA sequences can be copied from this write-up and stored as a constant or variable in your program.
Define the similarityScore function. This function takes two string arguments, calculates their Hamming distance, and returns their similarity score as a floating-point number. This function should only calculate the similarity if the two strings are the same length, otherwise return 0.
You should implement your main function to repeatedly ask to the user for 2 strings (sequence1 and sequence2) such that you can pass them as parameters to your similarityScore function. Writing your own main function will allow you to test and debug every function you write.
COG will grade Part 1 by generating random sequences to be compared. The results of your function will be compared to our results for the same sequences.
The countMatches function takes three parameters, a string containing the genome to search, a string containing the sequence to find, and a floating point value containing the minimum similarity score that will be considered a match.
The function does the following:
- The function should find all the positions of the genome where the genome’s substring matches sequence1 with a similarity greater than or equal to the min_score.
- The function should return the count of the number of such positions it found.
COG will grade Part 2 by generating random sequences to be found in a given genome. The results of your function will be compared to our results for the same sequences.
We have a random DNA sequence, and we want to find the closest species to it. Is the DNA sequence more similar to human, mouse, or unknown?
When could this kind of comparison be useful? Suppose that the emergency room of some hospital sees a sudden and drastic increase in patients presenting with a particular set of symptoms. Doctors determine the cause to be bacterial, but without knowing the specific species involved they are unable to treat patients effectively. One way of identifying the cause is to obtain a DNA sample and compare it against known bacterial genomes. With a set of similarity scores, doctors can then make more informed decisions regarding treatment, prevention, and tracking of the disease.
The goal of this part of the assignment is to write functions that can be useful to determine the identity of different species of bacteria, animals, etc . By simply using the similarity score routine you implemented you can compare an unknown sequence to different genomes and figure out the identity of the unknown sample.
The findBestMatch function should take two string arguments and return a floating point value of the highest similarity score found for the given sequence at any position within the genome. In other words, this function should traverse the entire genome and find the highest similarity score by using similarityScore() for the comparisons between seq and each sequential substring of genome.
hint: this function is very similar in structure to the countMatches function
- The findBestGenome function should take four string arguments(unknown sequence, mouse_genome, human_genome and unknown_genome).
- Return an integer indicating which genome string, out of the 3 given, had the highest similarity score with the given sequence.
- For each genome, the function will find the highest similarity score of the sequence (at any position) within that genome (call function findBestMatch described above).
- The return value from this function will indicate which genome had the best match, 1, 2, or 3. In the case that two or more of the sequences have the same best similarity score, return 0.
COG will grade Part 3 based on both the value returned from findBestGenome and findBestMatch.
Note: DNA sequences for human, mouse and unknown genomes will be uploaded as a file on Moodle with this assignment for testing purposes.
Let’s talk about implementation and testing your solutions. We should approach the design of algorithms or programs from the top down. Begin writing your algorithms as very abstract descriptions and then repeatedly writing more detailed descriptions of the complex abstractions. Each of your abstractions could be its own function. Once you have reached a level of detail in your descriptions that provides well-understood solutions to the problem, you can begin to implement (write the code) for those functions. Implementation is best done from the bottom up, meaning you implement the base functions (those functions that don’t call other functions) first and test them until you are satisfied they behave correctly and therefore have the correct code for implementing that function.
Once you have your base functions implemented, you can implement the next layer of abstractions that use those base functions in their algorithms. Again, test the new functions until you are satisfied they perform as required. This method of bottom up will progressively add in functionality until the complete algorithm is implemented.
This assignment is written in a manner that will allow you to implement functions each part and use COG to verify that it works before implementing the next part.
Write your similarityScore() function and test it by calling it from main() and passing it two known strings. For example:
cout << "Test1: " << similarityScore("ACGT","ACGG") << endl;
cout << "Test2: " << similarityScore("ATAT","ACAC") << endl;
Once you know that your function works for the specific tests, modify your main to get the input from the user and pass the strings to the function. Once the user input and output are implemented as required, submit to COG to verify your compliance with all the requirements. Now you can implement the code for the next part of the assignment. Repeat the implementation, local testing, and COG testing stages for each of the required functions.
When testing the functions, it may be easier to use smaller genome sequences when debugging your code. Use simple examples to verify that the functions are working before testing it on longer strings. Once you’re confident the function works, call your functions from main() using the following strings as the first input parameter to the function:
DNA is double stranded even though we always write a single strand’s nucleotide sequence. This is because we can infer the other strand’s sequence from the given strand. Every A on one strand has a complementary T on the opposite strand (every C has a G). Therefore we can create the complement strand by swapping the nucleotides with the complementary nucleotide. Create a function to take the sequence and produce the complement sequence.
But, you are not done yet. You now have the complement strand, but the DNA is read in the forward direction for the given strand and in the reverse direction for the complementary strand. Therefore, we must reverse the strand (e.g. first character becomes the last character, second becomes second to last, ) to have the correctly specified reverse complement of the given DNA sequence. The challenge problem is to create a function to produce the reverse complement of a given DNA sequence. Once you have a function to produce the reverse complement of a string, you can search both the DNA and reverse complement looking for the best match.