CS 246 Assignment 8

Due

This assignment is to write a program that does Huffman compression / decompression of a file. For example:

    huffmann 1 file.txt  file.cmp
    huffmann 2 file.cmp  file.txt.2

The arguements to the program should be:

1 or 2. 1 indicates compression. 2 decompression.
If compression, it is the file to compress. If decompression, the file to decompress.
The file to put the results of the compressions / decompression.

From the user’s standpoint, running huffman 1 infile.txt outfile.cmp at the command line reads in infile.txt and produces a compressed file in outfile.cmp. The contents of infile.txt are compressed making the number of bytes written to stdout smaller than that of infile.txt. To decompress, run huffmann 2 outfile.cmp decomp.txt, which reads in outfile.cmp and creates decomp.txt, which should have the same exact contents as the original infile.txt.

These programs are to work by using Huffman compression. The followiing is a summary of the Huffmann Compression algorithm, with a view to writing the program

Calculate the frequencies of all the characters in the input file. For example, if the file contains
```
abracadabra
```
then you will have 5 letters with frequencies given in the following table

Character Frequency

a 5
b 2
c 1
d 1
r 2
For each character with a non-zero frequency, create a "tree" with exactly one structure containing the character and its frequency. These trees (each initially containing one node) are collectively known as the forest.
Let x be the minimum tree in the forest and y be the next-minimum tree in the forest. (Comparing trees is done by the frequency; to break ties use the character value. Hence, for abracadabra, the order of the letters (from smallest to largest) is [c,d,b,r,a]. Remove x and y from the forest. Create a new node (call it n), whose left child is x and whose right child is y. (So your structure will need to have spaces for frequency, the character -- as an integer, and a left and right pointer to other instances of the structure.) Choose a fresh negative number; that is, a negative number that hasn’t been used yet. A convenient way to do this is to count down from -1. Let n’s ch field be this negative number, and let n’s freq field be the sum of x’s freq and y’s freq. Put n into the forest.
If the forest has more than one element, repeat step 3. Otherwise, we have arrived at the final Huffman tree.
The Huffman tree tells you how to encode every character. To encode a character (all of which are stored in leaves of the tree), consider the path from the root to the leaf containing that character. For every left child traversed, add a 0 to the encoding. For every right child, add a 1. So, if the path from the root to the leaf containing r requires going left, then left, then right, the encoding for r is 001. Using this notion of encoding, build a Huffman table that allows you to look up a character and find its encoding.
There are 128 ascii characters. For each, write its frequency to the output file; the nthint is the frequency for the nth ASCII character. If the text to be encoded is abracadabra, int number 97 will have the value 5, whereas all ints before that will be 0.
For each character in the input file, write its encoding in the output file as a sequence of bits. Do not use bytes.
We’re done. Close the files and free your memory.

Character	Frequency
a	5
b	2
c	1
d	1
r	2

As you can see from these instructions, a compressed file has a 512-byte header: 128 4-byte ints giving the frequencies (in the compressed text) of all ASCII characters (for abracadra there are a lot of zeroes). This header can make a compressed file larger than the original, for small enough originals.

To decompress a file:

Read in the table of character frequencies in the first 128 ints in the input file.
Build the Huffman tree (as in steps 2-4 above).
For every bit read in from the input file, traverse down the Huffman tree (going left for 0 bits and right for 1 bits). When you reach a leaf, there will be a character stored in the leaf node. Write this character to the output file.
Repeat step 3 for every character in the input file (as determined by looking at the frequency of the root of the Huffman tree, which is necessarily the total number of characters in the input).
We’re done. Close the files and free your memory.

You may, and indeed should test your code against the code of your classmates. You may also test against mine which is available on the lab computers (e.g. 165.106.10.186) in the directory /home/gtowell/Public/246/HW8. To aid in your testing and debugging, if you execute

	huff textfile -1 > /dev/null

you will get, to stderr, a list of all 128 ascii characters (by number), their frequency in the entire file, the number of bits in their encoding and the encoding. Possibly more helpful still, change the -1 to a positive number (N). This gives you the same information, but for only the first N characters in the input file.

A file compressed by your code should be decompressable by mine, and the converse.

All tests will be on files that contain only characters in the ASCII set.

Some hints:

The method described above works ONLY on ASCII characters in the range 0-127. It does not handle anything else -- and will break with anything else. To simplify the problem (a little) you may assume that the files to be compressed contain only ASCII characters.
Writing and reading compressed files means that you have to work with bits not bytes. This is awkward with the functions in stdio.h, all of which work with bytes. Consider writing bits. You will need a structure that stores the bits to write, as well as the number of bits that have been written. When this number goes above 8 (the number of bits in a byte), write out the byte and clear the bits from your buffer. When reading, you will have to do the opposite, reading in a byte and storing it, while offering up one bit at a time via a function.
When writing a compressed file, you might read the last input character and still have some unwritten bits in your buffer. Make sure to flush the buffer before close the file.
When reading a compressed file, there may be some leftover bits at the end of reading. (That’s because the writer might have written a number of bits that is not divisible by 8.) That’s OK—just don’t read them.
You have choices when it comes to implementing your forest. Best would be using a min-heap. But it’s OK if you use some other structure. You can get 10 bonus, extra-credit points for using a heap, though.

Finally, code organization matters. As usual, try to be inspired by OO ideas. While it might be convenient to put all of your code into one file; don't. Insteadbreak it up into at least two .c files and one .h. (You could certainly use more.)

What to submit:

README file. This is a file literally named README. This file should follow the format of https://cs.brynmawr.edu/cs246/README.txt Your readme file should clearly state how to compile and run you program.
All program files for this task.
A script showing the use of your code This should illustrate all of the apps functions (There are really only 2).
Your makefile. Yes, you must have one.
Include in your submission a reflection. This file should contain reflections on this assignment (what was hard? what wasn’t so bad? how much time did it take?) as well as any general reflections you wish to share about the course.
Do NOT include emacs temp files, or the temp files of any other program. For emacs these are of the form xxxx~ and #xxxx#. Also do not include an executable files or intermediate compilation results (.o files).

How to submit

Do not use the submit script
Use tar to create an compressed archive of your submission. Name the tar file something clever like YOURLOGIN_qwe_8.tar.gz where "qwe" is a set of three "secret characters".
Copy your tar file to /home/gtowell/submissions/spring2021/cmsc246/project8. Only I can get a directory listing of this directory so the three secret characters keeps other students from accidentally overwriting (before you get permission set as below), or reading, your work.
Set permissions on your submission so that anyone can read but only you can write to the file. That way no one can accidentally overwrite your submission. If you do not allow anyone to read then I will not be able to read. This will result in a poor grade for the assignment (think zero).