For this assignment you must ensure your work executes correctly on the Linux machines in ELW B238. You are free to do some of your programming on your own computers; if you do this, give yourself a few days before the due date to iron out any bugs in the C program you have uploaded to the lab machines.
This assignment is to be completed by each individual student (i.e., no group work). Naturally you will want to discuss aspects of the problem with fellow students, and such discussion is encouraged. However, sharing of code fragments is strictly forbidden without the express written permission of the course instructor. If you are still unsure regarding what is permitted or have other questions about what constitutes appropriate collaboration, please contact me as soon as possible. (Codesimilarity analysis tools may be used to examine submitted programs.)
- Revisit the C programming language, this time using dynamic memory allocation.
- Learn and implement Move-to-front (MTF) and Run-length encoding (RLE) algorithms that are popularly used in the compression scheme.
- Use git to track changes in your source code and annotate the evolution of your solution with “messages” provided during commits.
- Test your code with the built-in testing functions (No text input files).
- Be prepared to justify your script’s handling of errors for this assignment.
For this assignment you will write a C program that can perform move-to-front encoding of its contents followed by run-length encoding. In essence, you will be implementing run-length encoding using dynamic memory in C, the implementation of MTF encoding algorithm has been given to you in order to help you complete the rest of your C program.
Move-to-Front (MTF) Coding is an adaptive coding scheme where frequentlyappearing symbols - for our assignment “symbols” equal “chars”, including such char codes as \x03 and \x0a - are replaced with numbers that behave as indices. MTF acts in a manner somewhat similar to the way we might use a vertical stack of books. If we need a book from the middle of the pile, we retrieve it and when finished put the book back on top of the pile.
The figure below shows a string of input symbols (“abcabcaaaaaaaab”), the output generated by the encoding, and finally the order in which symbols appear in a list after each input symbol is processed. Notice the cases when a symbol is moved from a position within the word list to the top position of the list.
In this assignment, we will use an initial static large table from ASCII to index up to 256 characters. However, in a more general case, you can improve the given MTF algorithm to be adaptive to any new characters (e.g. characters from Unicode), but this is NOT required in this assignment. For example, when a new character appears in the input we can (a) output the code corresponding to the first unused position in the list and (b) follow this with the character. If a character has already appeared, then we find its position in the list, output that position, and the move the character to the front of the list. A decoder can read the output and not only build the list on the fly, but also use codes (i.e., the “1 2 3 3” sequence above) to retrieve a character already in the list. (Codes represents the position of the char in the list before it is moved to the top.)
You may have noticed above that the first position in the list is numbered 1 instead of 0. There is a reason for this, and it is described in the next step of implementing RLE algorithm.
Here is what it might look like to call
mtf_encode() in a C program:
char s = "abcabcaaaaaaaab";
int len = strlen(s);
int *code = (int *) malloc(sizeof(int) * len);
mtf_encode(s, len, code);
$> gcc mtfencode.c -o mtfencode $> ./mtfencode [1, 2, 3, 1, 2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 3]
The above result is based on the general case which uses a dynamic table to index the characters. This is the reason why the letter ‘a’ has the index 1. However, in the given sengencode.c file, you will see a larger index number for the letter ‘a’ because the static size table was created from ASCII and ‘a’ is in the middle of the table.
You should notice the MTF encoding is reversible. Simply maintain the same initial table and decode by replacing each index number in the encoded stream with the letter at that index in the list.
Assuming we have an initial table as “abcdefghijklmnopqrstuvwxyz”. You take the number “2” of the encoded block and look it up in the list, which results in “b”. Then move the “b” to front which results in (bacdef…). Then take the next “2”, look it up in the list, this results in “a”, move the “a” to front etc.
Although MTF encoding does not reduce the size of data, run-length encoding (RLE) can result in size reduction. In fact, it is one of the simplest forms of data compression. Quite simply, RLE replaces repeatedly-occurring symbols (e.g., chars, or integers) with a count of those symbols. If the symbol + count requires less memory to represent than original sequence of repeated symbols, then the final data size is reduced.
In principle RLE can be used with any long sequence of symbols, but for our assignment we will only deal with long sequences of 1s. (Being able to get long sequences of 1s was another story with applying the famous Burrows-Wheeler transform, you can learn more details by searching online if you have time.)
The figure below shows the MTF encoding from our example with its RLE equivalent.
The sequence of seven 1s in the MTF encoding are replaced in the RLE version with 0 followed by 7. That is, the code 0 is reserved to mean “repeated 1s” with the following number being the length of the repeat. This replacement will only ever be applied to sequences of 1s with a length of three or greater. (Now you should see why 0 is not used as an index in the MTF encoding.)
Going backwards from RLE to MTF is straightforward (i.e., replace all 0 codes with the right numbers of 1s). The three left-blank functions in the C program should be your target in this assignment.
Write your program sengencode.c program in the A4/ directory within your git repository.
The tester function run_length_test() is already given in the C program. You can use it to test and debug your program. Refrain from writing the program all at once, and budget time to anticipate for when “things go wrong”. There are 5 built-in testing cases.
Use git add and git commit appropriately. While you are not required to use git push during the development of your program, you must use git push in order to submit your assignment.
Reasonable runtime performance of your script is expected. None of the test cases should take longer than 15 seconds to complete on a machine in ELW B238.
A C program named sengencode.c should be added and committed within your git repository as it will be your solution to Assignment #4. Ensure your work is committed to your local repository and pushed to the remote before the due date/time.
A text file named “error_handling.txt” which enumerates the errors that are addressed by your submission.