CS383: Lab 10 -- Heaps Law

In this lab you will be building a concordance (it need not be alphabetical) over a set of files. The goal is to test how well Heap's law holds.

To do this do the following.

  1. Start a global concordance that is initially empty. (If you are using Java, you might make this a HashSet.)
  2. For each file:
    1. Construct a "file" concordance (a list of the unique words in the file).
    2. Determine and record the percentage of words in the file concordance that are ALREADY in the global concordance.
    3. Merge the file concordance into the global concordance

Data

The data you will be using for this lab are the full text of chapters from six novels by Charles Dickens. These files are available on the UNIX system at /home/gtowell/Public/CS383/Dickens. If you are so motivated, there is also a tar file of the complete file set that would be easy to copy to your own machine at /home/gtowell/Public/CS383/split.tar. Each of the files in the Dickens directory is formatted to make the task as easy as possible. Specifically, one word per line, all lower case, all punctuation removed. So all you need to do is read the file and build the concordance. You do not need to do anything to modify the text after reading.

There are 402 files in the dataset.

Please process the files in alphabetical order. It makes it easier for me to detect if you are doing everything correctly.

How is this related to Zips' Law

If you plot the data with percentage on the Y axis and the file "number" on the X-axis you should get a curve that looks similar to one for Zips' Law. That is, the percentage of words you have already seen should be always rising, but never reaching 100%. It would be possible to create a Zips law curve from this data but the counting process would be a little different. .

What to hand in

Send to gtowell@brynmawr.edu the data you recorded. The data should have two items per line: the name of the file and the percentage of previously seen words. The first lines of your data should be:
    Bleak1_0.txt 0.000
    Bleak1_1.txt XXX.XXX
which says that every word is the first file is composed only of words you have never seen before. XXX.XXX should be a positive number between less than 100.

Also attach any code you wrote for this lab to your email.