Python Assignment: CSC148 Data Compression With Huffman Codes

Introduction

Write a small Huffman coding program. The tree-building portion differs slightly from a traditional Huffman tree, with some minor refinements.

Overview

Data compression involves reducing the amount of space taken up by files. Anyone who has listened to an MP3 file or extracted a file from a zip archive has used compression. Reasons for compressing a file include saving disk space and reducing the time required to transfer the file to another computer.
There are two broad classes of compression algorithms: lossy compression and lossless compression. Lossy compression means that the file gets smaller, but that some information is lost. MP3 is a form of lossy compression: it saves space by throwing away some audio information, and therefore the original audio file cannot be recovered. Lossless compression means that the file gets smaller and that the original file can be recovered from the compressed file. FLAC is a form of lossless audio compression: it can’t save as much space as MP3, but it does let you perfectly recover the original file.
In this assignment, we’ll be working with a lossless kind of compression called Huffman coding. When you’ve finished the assignment, you’ll be able to compress a file, and then decompress that file to get back the original file.

Background

Fixed- and Variable-Length Codes

Suppose that we had an alphabet of four letters: a, b, c, and d. Computers store only 0s and 1s (each 0 and 1 is known as a bit), not letters directly. So, if we want to store these letters in a computer, it is necessary to choose a mapping from these letters to bits. That is, it’s necessary to agree on a unique code of bits for each letter.
How should this mapping work? One option is to decide on the length of the codes, and then assign letters to codes in whatever way we wish. Choosing a code length of 2, we could assign codes as follows: a gets 00, b gets 01, c gets 10, and d gets 11. (Note that I chose a code length of 2 because it is the minimum that supports an alphabet of four letters. Using just one bit gives you only two choices, 0 and 1, and that isn’t sufficient for our four-letter alphabet.) How many bits are required to encode text using this scheme? If we have a text of 50 letters, for example, then it takes 100 bits (two per letter) to encode the text.
Another option is to drop the requirement that codes be the same length, and assign the shortest possible codes to each letter. This might give us the following: a gets 0, b gets 1, c gets 10, and d gets 11. Notice that the codes for a and b are shorter than before, and that the codes for c and d are the same length as before. Using this scheme, we’ll very likely use fewer bits to encode text. For example, the text aaaaa takes only 5 bits, not 10!
Unfortunately, there’s a catch. Suppose that someone gives us the encoding 0011. What text does this code represent? It could be aabb if you read the code one bit at a time (0 → a, 0 → a, 1 → b, 1 → b). However, it could equally be aad, if you read the final 11 as the single code for d. Our encoding is ambiguous! That’s really too bad, because it seemed that we had a way to reduce the number of bits required to encode text. Is there some way we can still proceed with the idea of using variable-length codes?
To see how we can proceed, consider the following encoding scheme: a gets 1, b gets 00, c gets 010, and d gets 011. The trick here is that no code is a prefix of any other code. This kind of code is called a prefix code, and it leads to unambiguous decoding. For example, 0000 decodes to the text bb; there is no other possibility.
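To see the unambiguity concretely, here is a small sketch of greedy prefix-code decoding. (The code table is the example above; `decode` is a hypothetical helper written for illustration, not part of the assignment.)

```python
# Greedy decoding with a prefix code: because no code is a prefix of
# another, the first complete match is always the right one.
CODES = {"a": "1", "b": "00", "c": "010", "d": "011"}
DECODE = {bits: sym for sym, bits in CODES.items()}

def decode(bits: str) -> str:
    result, current = [], ""
    for bit in bits:
        current += bit
        if current in DECODE:          # a complete code has been read
            result.append(DECODE[current])
            current = ""
    return "".join(result)

print(decode("0000"))   # "bb" -- the only possible reading
```

With the earlier non-prefix code, the same loop could not know where one code ends and the next begins; the prefix property is exactly what makes this greedy scan safe.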
What we have here is seemingly the best of both worlds: variable-length codes (not fixed-length codes), and unambiguous decoding. However, note that some of our codes are now longer than before. For example, the code for c is now 010. That’s a three-bit code, even longer than the 2-bit codes we were getting with the fixed-length encoding scheme. This would be OK if c letters didn’t show up very often. If c letters were rare, then we don’t care much that they cause us to use 3 bits. If c letters did show up often, then we’d worry because we’d be spending 3 bits for every occurrence of c.

Goal of Algorithm

In general, as hinted above, we want to have short codes for frequently-occurring letters and longer codes for letters that show up less often. For example, if the frequency of a in our text is 10000 and the frequency of b in our text is 50, we’d hope that the code for a would be much shorter than the code for b.
Our goal is to generate the best prefix code for a given text. The best prefix code depends on the frequencies in the text. What do we mean by best prefix code? It’s the one that minimizes the average number of bits per symbol or, alternatively, the one that maximizes the amount of compression that we get. If we let C be our alphabet, f(x) denote the frequency of symbol x, and d(x) denote the number of bits used to encode x, then we’re seeking to minimize the sum Σ_{x ∈ C} f(x) · d(x). That is, we want to minimize the total number of bits over all symbols, where each symbol contributes its frequency times its code length.
Let’s go back to our four-letter alphabet: a, b, c, and d. Suppose that the frequencies of the letters in our text are as follows: a occurs 600 times, b occurs 200 times, c occurs 100 times, and d occurs 100 times. The best prefix code for these frequencies turns out to be the one we gave earlier: a gets 1, b gets 00, c gets 010, and d gets 011. Calculate the above sum and you’ll get a value of 1600. There are other prefix codes that are worse; for example: a gets 110, b gets 0, c gets 10, and d gets 111. Calculate the sum for this and convince yourself that it is worse than 1600!
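As a quick sanity check, both sums can be computed directly (a throwaway sketch, not assignment code):

```python
# Cost of a prefix code: sum of frequency * code length over the alphabet.
freq = {"a": 600, "b": 200, "c": 100, "d": 100}

def total_bits(code: dict) -> int:
    return sum(freq[sym] * len(bits) for sym, bits in code.items())

best = {"a": "1", "b": "00", "c": "010", "d": "011"}
worse = {"a": "110", "b": "0", "c": "10", "d": "111"}
print(total_bits(best))    # 1600
print(total_bits(worse))   # 2500 -- worse, because the frequent a costs 3 bits
```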
Huffman’s algorithm makes use of binary trees to represent codes. Each leaf is labeled with a symbol from the alphabet. To determine the code for such a symbol, trace a path from the root to the symbol’s leaf. Whenever you take a left branch, append a 0 to the code; whenever you take a right branch, append a 1 to the code. The Huffman tree corresponding to the best prefix code above has the leaf a as the root’s right child, and an internal node as the root’s left child; that internal node has the leaf b on its left and, on its right, another internal node whose children are the leaves c and d.
Huffman’s algorithm generates a tree corresponding to the best prefix code. You’ll see proofs of this fact in future courses; for now, you’ll implement the algorithm.
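For a conceptual picture only (this is not the assignment’s HuffmanNode-based implementation), Huffman’s algorithm can be sketched with a priority queue: repeatedly merge the two lowest-frequency subtrees until one tree remains. The sketch below skips tree nodes entirely and tracks code dictionaries directly:

```python
import heapq
from itertools import count

def huffman_codes(freq: dict) -> dict:
    """Repeatedly merge the two lowest-frequency groups of symbols."""
    tie = count()   # tie-breaker so heapq never has to compare dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}         # left branch: 0
        merged.update({s: "1" + c for s, c in right.items()})  # right branch: 1
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

codes = huffman_codes({"a": 600, "b": 200, "c": 100, "d": 100})
```

For these frequencies the code lengths come out as 1, 2, 3, 3 bits for a, b, c, d, matching the cost of 1600 computed above. (Which specific bit patterns you get can depend on how ties between equal frequencies are broken.)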

Preliminaries

For this assignment, I’m asking you to do some background reading on several topics that are not explicitly taught in the course. (I don’t think that we give students sufficient opportunity to research on their own and build on the available knowledge. This assignment is an attempt to remedy this.) Here’s what you’ll want to learn:

  • Huffman’s algorithm. You’ll want to understand how Huffman’s algorithm works on a conceptual level before working on the implementation.
  • Python bytes objects. We’ll be reading and writing binary files, so we’ll be using sequences of bytes (not sequences of characters). bytes objects are like strings, except that each element of a bytes object holds an integer between 0 and 255 (the minimum and maximum value of a byte, respectively). We will consider each byte in an uncompressed file as a symbol.
  • Binary numbers. Background on binary numbers will be useful, especially when debugging your code.
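A few bytes-object basics worth trying in a Python shell before you start (all standard Python):

```python
# bytes objects are immutable sequences of integers in the range 0..255.
data = bytes([65, 66, 200])
print(data[0])        # 65 -- indexing yields an int, not a one-character string
print(len(data))      # 3
print(bin(200))       # 0b11001000 -- the binary form, handy when debugging
print(bytes([65, 66]) == b"AB")   # True: ASCII-range bytes display as characters
```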

You’re strongly encouraged to find and use online resources to learn this background material. Be sure to cite your sources in your assignment submission; include Python comments giving the locations of resources that you found useful.

Your Task

Your task is to complete the functions in huffman.py. Ultimately, you will be able to call the compress and uncompress functions to compress and uncompress a file, respectively.
The code in huffman.py uses two types of nodes from nodes.py. The first, HuffmanNode, is a node in a Huffman tree. Huffman trees are used to compress and uncompress data. The second, ReadNode, is a node as read from a compressed file. It will be necessary to reconstruct the root HuffmanNode of a Huffman tree from the ReadNodes stored in the file.
We now give a suggested order for writing the functions in huffman.py.

Building the Tree

  • Implement make_freq_dict. This function produces a frequency dictionary.
  • Implement huffman_tree. This function takes a frequency dictionary and returns the root HuffmanNode of the Huffman tree corresponding to a best prefix code. (There’s a tricky case to watch for here: you have to think of what to do when the frequency dictionary has only one symbol!)
  • Implement get_codes. This function takes a Huffman tree and returns a dictionary that maps each byte in the tree to its code. The dictionary returned by get_codes is very useful for compressing data, as it gives the mapping between a symbol and the encoding for that symbol.
  • Implement number_nodes. This assigns a number to each internal node. These numbers will be written when compressing a file to help represent relationships between nodes.
  • Now is a good time to implement avg_length. This function returns the average number of bits required to compress a symbol of text.
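The traversal behind get_codes can be sketched like this, using a minimal stand-in node class (the real HuffmanNode in nodes.py may have different attributes; treat this only as the shape of the recursion):

```python
class Node:
    """Minimal stand-in for HuffmanNode (assumed attributes)."""
    def __init__(self, symbol=None, left=None, right=None):
        self.symbol, self.left, self.right = symbol, left, right

    def is_leaf(self):
        return self.left is None and self.right is None

def get_codes(tree: Node) -> dict:
    """Map each leaf symbol to its code: left appends 0, right appends 1."""
    codes = {}
    def walk(node, path):
        if node.is_leaf():
            codes[node.symbol] = path
        else:
            walk(node.left, path + "0")
            walk(node.right, path + "1")
    walk(tree, "")
    return codes

# The best-prefix-code tree from the Background section:
tree = Node(left=Node(left=Node("b"),
                      right=Node(left=Node("c"), right=Node("d"))),
            right=Node("a"))
print(get_codes(tree))   # {'b': '00', 'c': '010', 'd': '011', 'a': '1'}
```

Note what happens if the whole tree is a single leaf: the symbol gets the empty code. That is exactly why the one-symbol frequency dictionary mentioned above needs special handling.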

Compressing Data

Now we know how to generate a Huffman tree, and how to generate codes from that tree. Function generate_compressed uses the given frequency dictionary to generate the compressed form of the provided text.
Every byte stores 8 bits. For example, 00001010 is a byte. Recall that each symbol in the text is mapped to a sequence of bits, such as 00 or 111. generate_compressed takes each symbol in turn from text, looks up its code, and puts the code’s bits into a byte. Bytes are filled in from bit 7 (leftmost bit) to bit 0 (rightmost bit). Helper function bits_to_byte will be useful here. You might also find it helpful to use byte_to_bits for debugging; it gives you the bit representation of a byte.
Note that a code may end up spanning multiple bytes. For example, suppose that you’ve generated six bits of the current byte, and that your next code is 001. There’s room for only two more bits in the current byte, so only the 00 can go there. Then, a new byte is required and its first bit will be 1.
Implement generate_compressed.
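The packing step can be sketched as follows. (The bits_to_byte here is a hypothetical stand-in written for illustration; the starter code’s helper may have a different signature.)

```python
def bits_to_byte(bits: str) -> int:
    """Pack up to 8 bits into one byte, filling from bit 7 down to bit 0;
    missing low-order bits are padded with zeros on the right."""
    return int(bits.ljust(8, "0"), 2)

def pack_bits(bit_string: str) -> bytes:
    """Chop a long bit string into bytes, 8 bits at a time."""
    return bytes(bits_to_byte(bit_string[i:i + 8])
                 for i in range(0, len(bit_string), 8))

# Nine bits span two bytes: 00001010 -> 10, then "1" padded out -> 10000000 -> 128.
print(list(pack_bits("000010101")))   # [10, 128]
```

The padding in the last byte is also why the uncompressed size must be stored in the file: the decompressor needs to know when to stop rather than decoding the pad bits.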

Writing a Compressed File

generate_compressed gives us the compressed version of the text. But, consider how someone would later decompress that result to get back the original file. They won’t have the Huffman tree or list of codes, so they won’t be able to decompress the file.
This means that some representation of the tree must also be stored in the file. There are many ways to do this: I have chosen one that is suboptimal but hopefully instructive (I learned the technique from old PC games that used Huffman coding to compress assets in order to fit the game on a single disc).
Follow along in function compress throughout this discussion.
Here is what we generate so that someone can later recover the original file from the compressed version:

  • (1) The number of internal nodes in the tree
  • (2) A representation of the Huffman tree
  • (3) The size of the uncompressed file
  • (4) The compressed version of the file

Functions are already available (in starter code or written by you) for all steps except (2). Here is what we do for step 2, in function tree_to_bytes:

  • Each internal node of the tree takes up four bytes. Leaf nodes are not output, but instead piggyback on the internal nodes.
  • Internal nodes are written out in postorder.
  • For a given internal node n, the four bytes are as follows. The first byte is a 0 if the left subtree of n is a leaf; otherwise, it is 1. The second byte is the symbol of the left child if the left subtree of n is a leaf; otherwise, it is the node number of the left child. The third and fourth bytes are analogous for the right subtree of n. That is, the third byte is a 0 if the right subtree is a leaf and 1 otherwise; and the fourth byte is a leaf symbol or node number as appropriate.

Implement tree_to_bytes.
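A sketch of the postorder byte layout, again with a minimal stand-in node class (the starter code’s number_nodes and tree_to_bytes may differ in details; this only illustrates the format described above):

```python
class Node:
    """Minimal stand-in for HuffmanNode (assumed attributes)."""
    def __init__(self, symbol=None, left=None, right=None):
        self.symbol, self.left, self.right = symbol, left, right
        self.number = None

    def is_leaf(self):
        return self.left is None and self.right is None

def number_nodes(tree: Node) -> None:
    """Assign 0, 1, 2, ... to internal nodes in postorder."""
    counter = [0]
    def walk(node):
        if node.is_leaf():
            return
        walk(node.left)
        walk(node.right)
        node.number = counter[0]
        counter[0] += 1
    walk(tree)

def tree_to_bytes(tree: Node) -> bytes:
    """Emit four bytes per internal node, internal nodes in postorder:
    (left-is-internal?, left symbol-or-number, right-is-internal?, ...)."""
    out = []
    def walk(node):
        if node.is_leaf():
            return
        walk(node.left)
        walk(node.right)
        for child in (node.left, node.right):
            if child.is_leaf():
                out.extend([0, child.symbol])
            else:
                out.extend([1, child.number])
    walk(tree)
    return bytes(out)

# Example: the four-symbol tree, with byte symbols 97-100 standing in for a-d.
tree = Node(left=Node(left=Node(98),
                      right=Node(left=Node(99), right=Node(100))),
            right=Node(97))
number_nodes(tree)
print(list(tree_to_bytes(tree)))
# [0, 99, 0, 100, 0, 98, 1, 0, 1, 1, 0, 97]
```

The c/d node is written first (two leaf children), then its parent (leaf b, then internal node 0), then the root (internal node 1, then leaf a).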

Decompressing Text

There are two functions that work together to decompress text: a generate_tree_xxx function to generate a tree from its representation in a file, and a generate_uncompressed function to generate the text itself. Note that you are being asked to write two different generate_tree_xxx functions. Once you get one of them working, you can successfully uncompress text. But we’re asking for both, so don’t forget to do the other one!

  • generate_tree_general takes a list of nodes representing a tree in the form that it is stored in a file. It reconstructs the Huffman tree and returns its root HuffmanNode. This function assumes nothing about the order in which the nodes are given in the list. As long as it is given the index of the root node, it will be able to reconstruct the tree. This means that the tree nodes could have been written in postorder (as your assignment does), preorder, inorder, level-order, whatever.
  • generate_tree_postorder takes a list of nodes representing a tree in the form that it is stored in a file. It reconstructs the Huffman tree and returns its root HuffmanNode. This function assumes that the tree is given in the list in postorder (i.e. in the form generated by your compression code). Unlike generate_tree_general, it is not allowed to use the node numbers stored in the ReadNodes to reconstruct the tree.
  • generate_uncompressed takes a Huffman tree, compressed text, and number of bytes to produce, and returns the decompressed text. There are two possible approaches: one involves generating a dictionary that maps codes to symbols, and the other involves using the compressed text to traverse the tree and discovering a symbol whenever a leaf is reached.
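The tree-traversal approach to decompression can be sketched like this (minimal stand-in node class again; the size parameter is what lets the decompressor ignore the pad bits in the final byte):

```python
class Node:
    """Minimal stand-in for HuffmanNode (assumed attributes)."""
    def __init__(self, symbol=None, left=None, right=None):
        self.symbol, self.left, self.right = symbol, left, right

    def is_leaf(self):
        return self.left is None and self.right is None

def decode_bits(root: Node, bits: str, size: int) -> list:
    """Walk the tree bit by bit, emitting a symbol at each leaf; stop
    after `size` symbols so trailing pad bits are ignored."""
    out, node = [], root
    for bit in bits:
        node = node.left if bit == "0" else node.right
        if node.is_leaf():
            out.append(node.symbol)
            node = root          # restart at the root for the next symbol
            if len(out) == size:
                break
    return out

tree = Node(left=Node(left=Node("b"),
                      right=Node(left=Node("c"), right=Node("d"))),
            right=Node("a"))
print(decode_bits(tree, "10100110", 3))   # ['a', 'c', 'd'] -- final 0 is padding
```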

Another Function

For more practice, I am asking you to implement improve_tree. This has nothing to do with the rest of the program; it is just an extra function.
Suppose that you are given a Huffman tree and a frequency dictionary. The Huffman tree is suboptimal, for the given shape of the tree and the given frequencies, if it is possible to swap a higher-frequency symbol lower in the tree with a lower-frequency symbol higher in the tree.
improve_tree performs all such swaps, improving the tree as much as possible without changing the tree’s shape. Note, of course, that Huffman’s algorithm will never give you a suboptimal tree, so you can’t hope to improve those trees. But you can imagine that someone gives you a suboptimal tree, and you want to improve it as much as possible without changing its shape.
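One way to think about it (a sketch of the idea, not necessarily the required implementation): collect the leaves with their depths, then hand the most frequent symbols to the shallowest leaves.

```python
class Node:
    """Minimal stand-in for HuffmanNode (assumed attributes)."""
    def __init__(self, symbol=None, left=None, right=None):
        self.symbol, self.left, self.right = symbol, left, right

    def is_leaf(self):
        return self.left is None and self.right is None

def improve_tree(tree: Node, freq: dict) -> None:
    """Reassign leaf symbols so frequent symbols sit higher; shape unchanged."""
    leaves = []
    def walk(node, depth):
        if node.is_leaf():
            leaves.append((depth, node))
        else:
            walk(node.left, depth + 1)
            walk(node.right, depth + 1)
    walk(tree, 0)
    leaves.sort(key=lambda pair: pair[0])   # shallowest leaves first
    by_freq = sorted(freq, key=freq.get, reverse=True)   # frequent symbols first
    for (_, leaf), sym in zip(leaves, by_freq):
        leaf.symbol = sym

# A deliberately bad assignment: "a" (most frequent) buried at depth 3.
tree = Node(left=Node(left=Node(left=Node("a"), right=Node("d")),
                      right=Node("b")),
            right=Node("c"))
improve_tree(tree, {"a": 600, "b": 200, "c": 100, "d": 100})
print(tree.right.symbol)   # 'a' now occupies the depth-1 leaf
```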

Testing Your Work

In addition to the doctests provided in huffman.py, you are encouraged to test your functions on small sample inputs. Make sure that your functions are doing the right thing in these small cases. Finding a bug on a huge example tells you that you have a bug but doesn’t give you much information about what is causing the bug.
Once you’ve suitably tested your functions, you can try the compressor and decompressor on arbitrary files. You’ll be happy to know that you can compress and decompress arbitrary files of any type, although the amount of compression that you get will vary widely!
When you run huffman.py, you have a choice of compressing or uncompressing a file. Choose an option and then type a filename to compress or decompress that file. When you compress file a, a compressed version will be written as a.huf. Similarly, when you decompress file a.huf, a decompressed version will be written as a.huf.orig.
I’ve provided some sample files to get you started:

  • book.txt is a public-domain book from Project Gutenberg. Text files like this compress extremely well using Huffman compression.
  • dan.bmp is a picture of me in bitmap (bmp) format. This is an uncompressed image format and so compresses very well.
  • music.wav is a few seconds of a song from a game called Mystic Ark. The song is in wav format; this file compresses terribly with Huffman compression.

Throw other files at your compressor and see what kind of compression you can get!

Property Tests

In file test_huffman_properties.py, I have provided some property tests that you can run on your code as you start completing functions. Please install the hypothesis Python package to use these property tests at home (hypothesis is already installed for you in the labs).
What are property tests? Good question! Regular unit tests work like this: you provide parameter values and an expected return value for each case, and unittest determines whether your function works appropriately in those cases. Property tests, by contrast, work like this: you tell hypothesis about relationships between parameter values and return values, and hypothesis automatically generates tons of examples that try to falsify the correctness of your function. So, with hypothesis, you focus on the general behaviour of your functions rather than specific behaviour on prespecified examples.
For a little on the syntax that hypothesis uses, take a look at the first class in test_huffman_properties.py. The @given line specifies that integers between 0 and 255 should be passed to the test_byte_to_bits function as parameter b. We then have two tests (assertXxx statements) that must pass, no matter the value of b: byte_to_bits must return an object where everything in it is a 0 or 1, and it must return something of length 8. Of course, this particular property test is for a starter code function, not one that you write; but later in the file there are some property tests for the functions that you’re writing.
The property tests that we’ve given you do not fully characterize what it means for a function to be correct. They’re just a starting point for testing. You’re encouraged to write further property tests to help increase your confidence in your functions, though any tests that you write are not being marked.

pyta

My colleague David Liu at the St George campus is developing a utility called pyta to flag certain warnings and errors in students’ Python code. It’s under heavy development these days, but David kindly offered to let us use a prerelease version this semester at UTM. We’re making it available in the DH labs.
For use on your own computer: you can also download it from the link under Assignments on the course website. If you have a directory called e.g. a1, and in there you have both your assignment files and a pyta subdirectory with the files extracted from pyta.zip, then you’ll be able to use pyta.
To use it, just import pyta, and then do: pyta.check('huffman.py')
It will give you warnings and errors in your code. You can ignore some of them, especially indentation style errors, but the others may help you improve the design and style of your code.
Another cool feature gives you documentation on particular warnings and errors. For example, if you want to know more about the W0612 warning, try: pyta.doc('W0612')
If you give this a shot, please let me know what you think! We want to make this as useful as we can for students. Thanks!