This assignment is to write a program that does Huffman compression / decompression of a file. For example:
huffmann 1 file.txt file.cmp huffmann 2 file.cmp file.txt.2The arguements to the program should be:
From the user’s standpoint, running huffman 1 infile.txt outfile.cmp
at the command line reads in infile.txt
and produces a compressed file in outfile.cmp
. The contents of infile.txt
are compressed making the number of bytes written to stdout smaller than that of infile.txt
. To decompress, run huffmann 2 outfile.cmp decomp.txt
, which reads in outfile.cmp
and creates decomp.txt
, which should have the same exact contents as the original infile.txt
.
These programs are to work by using Huffman compression. The followiing is a summary of the Huffmann Compression algorithm, with a view to writing the program
Calculate the frequencies of all the characters in the input file. For example, if the file contains
abracadabrathen you will have 5 letters with frequencies given in the following table
Character | Frequency |
---|---|
a | 5 |
b | 2 |
c | 1 |
d | 1 |
r | 2 |
For each character with a non-zero frequency, create a "tree" with exactly one structure containing the character and its frequency. These trees (each initially containing one node) are collectively known as the forest.
Let x
be the minimum tree in the forest and y
be the next-minimum tree in the forest. (Comparing trees is done by the frequency; to break ties use the character value. Hence, for abracadabra, the order of the letters (from smallest to largest) is [c,d,b,r,a]. Remove x
and y
from the forest. Create a new node (call it n
), whose left child is x
and whose right child is y
. (So your structure will need to have spaces for frequency, the character -- as an integer, and a left and right pointer to other instances of the structure.) Choose a fresh negative number; that is, a negative number that hasn’t been used yet. A convenient way to do this is to count down from -1
. Let n
’s ch
field be this negative number, and let n
’s freq
field be the sum of x
’s freq
and y
’s freq
. Put n
into the forest.
If the forest has more than one element, repeat step 3. Otherwise, we have arrived at the final Huffman tree.
The Huffman tree tells you how to encode every character. To encode a character (all of which are stored in leaves of the tree), consider the path from the root to the leaf containing that character. For every left child traversed, add a 0
to the encoding. For every right child, add a 1
. So, if the path from the root to the leaf containing r
requires going left, then left, then right, the encoding for r
is 001
. Using this notion of encoding, build a Huffman table that allows you to look up a character and find its encoding.
There are 128 ascii characters. For each, write its frequency to the output file; the nthint
is the frequency for the nth ASCII character. If the text to be encoded is abracadabra
, int
number 97 will have the value 5, whereas all int
s before that will be 0.
For each character in the input file, write its encoding in the output file as a sequence of bits. Do not use bytes.
We’re done. Close the files and free your memory.
As you can see from these instructions, a compressed file has a 512-byte header: 128 4-byte int
s giving the frequencies (in the compressed text) of all ASCII characters (for abracadra there are a lot of zeroes). This header can make a compressed file larger than the original, for small enough originals.
To decompress a file:
Read in the table of character frequencies in the first 128 int
s in the input file.
Build the Huffman tree (as in steps 2-4 above).
For every bit read in from the input file, traverse down the Huffman tree (going left for 0
bits and right for 1
bits). When you reach a leaf, there will be a character stored in the leaf node. Write this character to the output file.
Repeat step 3 for every character in the input file (as determined by looking at the frequency of the root of the Huffman tree, which is necessarily the total number of characters in the input).
We’re done. Close the files and free your memory.
huff textfile -1 > /dev/nullyou will get, to stderr, a list of all 128 ascii characters (by number), their frequency in the entire file, the number of bits in their encoding and the encoding. Possibly more helpful still, change the -1 to a positive number (N). This gives you the same information, but for only the first N characters in the input file. A file compressed by your code should be decompressable by mine, and the converse. All tests will be on files that contain only characters in the ASCII set.
Some hints:
Writing and reading compressed files means that you have to work with bits not bytes. This is awkward with the functions in stdio.h
, all of which work with bytes. Consider writing bits. You will need a structure that stores the bits to write, as well as the number of bits that have been written. When this number goes above 8 (the number of bits in a byte), write out the byte and clear the bits from your buffer. When reading, you will have to do the opposite, reading in a byte and storing it, while offering up one bit at a time via a function.
When writing a compressed file, you might read the last input character and still have some unwritten bits in your buffer. Make sure to flush the buffer before close the file.
When reading a compressed file, there may be some leftover bits at the end of reading. (That’s because the writer might have written a number of bits that is not divisible by 8.) That’s OK—just don’t read them.
You have choices when it comes to implementing your forest. Best would be using a min-heap. But it’s OK if you use some other structure. You can get 10 bonus, extra-credit points for using a heap, though.
Include in your submission a reflection. This file should contain reflections on this assignment (what was hard? what wasn’t so bad? how much time did it take?) as well as any general reflections you wish to share about the course.