**CMSC 380: Science of Information
Assignment#1**

**Due on Monday: September 30, 2019.**

**Part I: **Write a program to process a large text as if it were an information source. Treating the letters of alphabet as symbols (ignore all other characters, convert all to upper/lowercase), compute the following:

- A table and plot of the letter frequencies.
- Use above to compute the entropy of the text.

Based on the results you obtain from above, answer the following:

1. If the probabilities of all letters were equal (i.e. uniform), what would be the entropy of english texts?

2. Is the computed entropy of the above text greater than, equal to, or less than your answer to (1)? Why?

**Extra Credit: **Do all of the above for text(s) in one other language (German, French, Italian, Spanish, ...)

**Part II**: [This is a pen and paper exercise]

Suppose an information source, S has two symbols: {A, B} with p(A) = 0.75, p(B) = 0.25.

First, compute the entropy of this source. Also, assign a Huffman code to the source, S. Compute L, the average code length.

Next, consider a source that has sequences of two symbold from the source, S. I.e. {AA, AB, BB, BA}. Construct a Huffman code for this source. Compute L, the average code length.

Finally, consider a source that has sequences of three symbold from the source, S. I.e. {AAA, AAB, ..., BBB}. Construct a Huffman code for this source. Compute L, the average code length.

Examine the values of L for the three sources above, and see if you can say something about their relationship to entropy.

**Hand in a written report containing the following:**

For Part I: Details of the text: Title, Author, size, etc. Identify the source of the text (where you obtained the data file). And all of the above data obtained by your program. Include the documented program source at the end of your report. Include your work and answers for Part II.