CS 246 Assignment 8


This assignment is to write a program that does Huffman compression / decompression of a file. For example:

    huffmann 1 file.txt  file.cmp
    huffmann 2 file.cmp  file.txt.2
The arguements to the program should be:
  1. 1 or 2. 1 indicates compression. 2 decompression.
  2. If compression, it is the file to compress. If decompression, the file to decompress.
  3. The file to put the results of the compressions / decompression.

From the user’s standpoint, running huffman 1 infile.txt outfile.cmp at the command line reads in infile.txt and produces a compressed file in outfile.cmp. The contents of infile.txt are compressed making the number of bytes written to stdout smaller than that of infile.txt. To decompress, run huffmann 2 outfile.cmp decomp.txt, which reads in outfile.cmp and creates decomp.txt, which should have the same exact contents as the original infile.txt.

These programs are to work by using Huffman compression. The followiing is a summary of the Huffmann Compression algorithm, with a view to writing the program

  1. Calculate the frequencies of all the characters in the input file. For example, if the file contains

    then you will have 5 letters with frequencies given in the following table
  2. For each character with a non-zero frequency, create a "tree" with exactly one structure containing the character and its frequency. These trees (each initially containing one node) are collectively known as the forest.

  3. Let x be the minimum tree in the forest and y be the next-minimum tree in the forest. (Comparing trees is done by the frequency; to break ties use the character value. Hence, for abracadabra, the order of the letters (from smallest to largest) is [c,d,b,r,a]. Remove x and y from the forest. Create a new node (call it n), whose left child is x and whose right child is y. (So your structure will need to have spaces for frequency, the character -- as an integer, and a left and right pointer to other instances of the structure.) Choose a fresh negative number; that is, a negative number that hasn’t been used yet. A convenient way to do this is to count down from -1. Let n’s ch field be this negative number, and let n’s freq field be the sum of x’s freq and y’s freq. Put n into the forest.

  4. If the forest has more than one element, repeat step 3. Otherwise, we have arrived at the final Huffman tree.

  5. The Huffman tree tells you how to encode every character. To encode a character (all of which are stored in leaves of the tree), consider the path from the root to the leaf containing that character. For every left child traversed, add a 0 to the encoding. For every right child, add a 1. So, if the path from the root to the leaf containing r requires going left, then left, then right, the encoding for r is 001. Using this notion of encoding, build a Huffman table that allows you to look up a character and find its encoding.

  6. There are 128 ascii characters. For each, write its frequency to the output file; the nthint is the frequency for the nth ASCII character. If the text to be encoded is abracadabra, int number 97 will have the value 5, whereas all ints before that will be 0.

  7. For each character in the input file, write its encoding in the output file as a sequence of bits. Do not use bytes.

  8. We’re done. Close the files and free your memory.

As you can see from these instructions, a compressed file has a 512-byte header: 128 4-byte ints giving the frequencies (in the compressed text) of all ASCII characters (for abracadra there are a lot of zeroes). This header can make a compressed file larger than the original, for small enough originals.

To decompress a file:

  1. Read in the table of character frequencies in the first 128 ints in the input file.

  2. Build the Huffman tree (as in steps 2-4 above).

  3. For every bit read in from the input file, traverse down the Huffman tree (going left for 0 bits and right for 1 bits). When you reach a leaf, there will be a character stored in the leaf node. Write this character to the output file.

  4. Repeat step 3 for every character in the input file (as determined by looking at the frequency of the root of the Huffman tree, which is necessarily the total number of characters in the input).

  5. We’re done. Close the files and free your memory.

You may, and indeed should test your code against the code of your classmates. You may also test against mine which is available on the lab computers (e.g. in the directory /home/gtowell/Public/246/HW8. To aid in your testing and debugging, if you execute
	huff textfile -1 > /dev/null
you will get, to stderr, a list of all 128 ascii characters (by number), their frequency in the entire file, the number of bits in their encoding and the encoding. Possibly more helpful still, change the -1 to a positive number (N). This gives you the same information, but for only the first N characters in the input file.

A file compressed by your code should be decompressable by mine, and the converse.

All tests will be on files that contain only characters in the ASCII set.

Some hints:

Finally, code organization matters. As usual, try to be inspired by OO ideas. While it might be convenient to put all of your code into one file; don't. Insteadbreak it up into at least two .c files and one .h. (You could certainly use more.)

What to submit:

How to submit

  1. Do not use the submit script
  2. Use tar to create an compressed archive of your submission. Name the tar file something clever like YOURLOGIN_qwe_8.tar.gz where "qwe" is a set of three "secret characters".
  3. Copy your tar file to /home/gtowell/submissions/spring2021/cmsc246/project8. Only I can get a directory listing of this directory so the three secret characters keeps other students from accidentally overwriting (before you get permission set as below), or reading, your work.
  4. Set permissions on your submission so that anyone can read but only you can write to the file. That way no one can accidentally overwrite your submission. If you do not allow anyone to read then I will not be able to read. This will result in a poor grade for the assignment (think zero).