Write a program that takes a FASTA file of DNA sequences as input. It then counts the frequencies of every 3-mer sequences.
Read FASTA file format from Wikipedia for a history of FASTA. Since there is no official definition of FASTA format, here we use the following customized definition:
- The FASTA file must be an ASCII text file with one or multiple lines.
- The FASTA file may contain one or multiple entries of DNA, RNA, or protein sequences.
- Each entry (except for the first entry) must start with a header line that starts with a greater sign, and followed by one or more sequence lines. The concatenation of the sequence lines provides the sequence of one entry.
- The first entry can either have or have not a header line.
- Whitespace characters can appear anywhere in the file. If the whitespaces are at the beginning of a header line, or anywhere in the sequence lines, they are to be discarded by the reading programs. For example, a line “ A CG T” will be the same as “ACGT”. And a line “ ]Sample sequence” will be the same as “]Sample sequence”.
- Empty lines are to be discarded.
- All lines that start with a semicolon (after removing the whitespaces) are comment lines and are to be discarded.
- For the purpose of this assignment, lower case letters a, c, g, and t are treated as the same as upper case letters A, C, G, and T.
- For the purpose of this assignment, any characters other than the upper and lower case A, C, G, and T in the sequences are not counted towards the 3-mers. For example, “ACCXGGTTT” will have only four 3-mers: ACC, GGT, and GTT, and TTT.
- Sample FASTA files can be downloaded. The yeast.nt.gz there is a good start. Explore the internet to find more data sources.
The input is a FASTA file. The output of your program is to the standard out in the following format:
AAA 0.0510 AAC 0.0111
Each line is two columns separated by a single whitespace. The first column is the three letter, and the second column is the frequency. Use rounding to keep four digits after the decimal points. Lines must be ordered alphabetically.
Things you can and should assume (in developing program and test cases):
- Input file size does not exceed 100M bytes. (This isn’t true in reality, but just assumed for sake of simplicity.)
- The file sizes in the test cases you submit should not exceed 10K bytes. However, the TA will also create test cases with larger files to test your program.
- The file has at least one non-empty sequence.
- If a 3-mer has a frequency of 0, it is still included in the output and the frequency is 0.0000.
You should login to your student.cs account and submit your assignment electronically. You are encouraged to submit multiple draft versions earlier than the deadline to ensure the final submission is smooth. Only your latest submission before the deadline will be marked. Your submission should be in a directory a1/ that contains the following subdirectories and files:
- A subdirectory src/.
All your source codes should be in this directory. Do not include the compiled files with the source code.
- A make file “makefile”.
We will use the command “make makefile” under the directory assn1/ to compile your program. This should compile your program and generate either an executable or a jar file in an out/ directory under assn1/. Note that out directory should not exist in your submission, and is automated generated by make.
- A shell script “run.sh” to run the program.
After make, the “./run.sh inputfile” command should run your program and print results to the standard output.
- A test cases directory test/.
- a. 10 test cases. Each case includes two files x.in and x.out, where x is the number from 0 to 9.
- b. A readme file readme.txt explaining what problem each test case was designed to test against. When your test case is strangely behaved, such as failing everybody’s program, the TA might need to look into the readme file to determine if that is a valid test case.
We will use a script to
- cd assn1/
- ./run.sh test/0.in > /tmp/tmpfile
- Then we’ll check if tmpfile’s result is the same as test/0.out.
- Step 3 and 4 are repeated for each of the test cases, including yours, the TA’s, and other students’ test cases.
The marks will be based on the test cases and the program. Different test cases may demonstrate different patterns in failing programs. The following explains this by showing one example with 6 test cases and 3 programs.
Although there are 6 test cases, there are two pairs that have the same patterns: 2=5 and 1=4. So there are only 4 different test classes, with patterns PPP, PFF, FPP, and FFP, respectively.
Suppose your test cases covers different test classes and fails different students’ programs.
Moreover, your program passes test classes. Additionally, the TA determines that of your test cases are invalid cases.
So, to maximize your marks, try to develop different and valid test cases that fail other students’ programs, meanwhile, try to develop a bug-free program that passes all the test cases.
- There is a tendency that species that are more evolutionarily apart from each other have larger differences in their k-mer frequencies. Try a few species (such as human, mouse, fruit fly, and yeast), and see if this is indeed true.
- Try to do a k-mer frequency for k>3 on a PC. For example, k=20. What problem will this create? How do you solve it?
These additional things will not be marked but just to make you feel good while programming for the assignment.