In this project, you will implement a function that takes as input a sequence of words encoded via an unknown Caesar’s Cipher, and returns a function that decodes words in that cipher back into plain text. You then use this function to write a code-breaker that decodes an entire document back into its plain text.
The cipher translates each letter in a word independently by “shifting” the letter up or down the alphabet by a fixed distance. We assume that we are using the English alphabet with 26 letters in their “usual” order, all lower case: a, b, c, . . . z. The function ctv (“character-to-value”) maps each character to its value, and the function vtc (“value-to-character”) maps each value to its corresponding character. For example, ctv(c) = 2 and vtc(25) = z.
The encryption function has the following form:
Encript_n(x) = vtc((ctv(x) + n) mod 26)
where “n” represents the amount by which the letters are “shifted” up the alphabet. For this project, we assume the values of “n” to be in the range of 0 and 25, i.e., 0 ≤ n ≤ 25.
As part of the project, you will first need to implement the encoding function encode-n which takes as input the shift value n and returns a function that takes as input a word w, and returns its encoded version:
For this project, words are represented as lists of lower case symbols, e.g., the word “class” is represented as
(c l a s s). Paragraphs are lists of words, e.g.,
((c l a s s) (i s) (a) (l o t) (o f) (f u n)). A document is a list of paragraphs.
A document encoded with the Caesar cipher is easy to break since there are only 26 possible “shift values” n. You are asked to implement two different methods to break the cipher, (1) a brute force version that uses a spell checker, and (2) a version that uses knowedge about the distribution of particular letters in the English language (frequency analysis). Both methods try to find the value of n0 such that
(n + n') = 26.
You will need to write two functions Gen-Decoder-A and Gen-Decoder-B that take as input a paragraph of encoded words, and return as output a function that takes as input an encoded word, and returns the plain text of the decoded word. This function can then be used to decode the entire document which may consist of multiple paragraphs. The function Code-Breaker takes an encoded document and a decoding function as input, and returns the entire document in plain text.
The two decoder functions implement different strategies to break the encrypted code. Both have advantages and disadvantages. Your scheme program will provide an infrastructure to assess the effectiveness of these decoding approaches for different document types.
The algorithm for the brute force code breaker is rather simple. The encode input words are encoded for each possible value of n0 . For each value, a spell checker determines whether the resulting words are words in the English language. The value of n0 for which most words are spelled correctly is assumed to be decoding value.
For this method, you will need to implement a spell checker. You can implement your spell checker using the dictionary of words in file dictionary.ss. You will need to implement the spell checker spell-checker that takes as input a word, and returns the truth value #t or #f. A brute force version of the spell checker may just see whether the word occurs in the dictionary or not.
This code breaker is based on the fact, that in the English language some letters occur more often than others. Knowing the distribution of the different letters, this method just counts the number of occurrences of each letter in the encoded words. If there are enough words to be statistically relevant, the letter distribution can be used to identify the most common letters in English, namely ‘e’, ‘t’, and ‘a’. In other words, the assumption is that the most frequent encoded letter has to correspond to the letter ‘e’, etc. This means that the shift between the encoded letter and ‘e’ has to be the value n. The same n has to work for the other frequent letters as well. From that, we can determine the decoding value n0 . Note that this method needs a large enough set of words in order to be successful. In other words, it may not work well for very short paragraphs.
For the implementation of these functions, you can use standard built-in Scheme functions such as append. You should use the reduce function at least once in your implementation. The definition of reduce is given in file include.ss. You must not use assignment (set!) in Scheme.
Copy the 5 files decode.ss, include.ss, test-dictionary.ss, dictionary.ss and document.ss into your own subdirectory. You will implement your code breaker in file decode.ss. File test-dictionary.ss contains a small dictionary consisting of only six words. Use it to debug your program. File dictionary.ss contains a list of over 45,000 words allowing you to generate a realistic spell checker. You must not change this file. Finally, file document.ss contains two sample documents to test your encoder, decoder, etc. You should use your own test cases as well.
You will submit your version of file decode.ss via Sakai.
Your programs will be graded based on programming style and functionality. Functionality will be verified through automatic testing on a set of test cases.
All questions regarding this project should be posted on our piazza web site. Start early since you will run into issues that will need to be resolved on the way. This is a feature of the project, not a bug!