CS383: Lab 10 -- Heaps' Law

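In this lab you will build a concordance across a set of files. The goal is to test how well Heaps' law holds. Heaps' law predicts that the number of distinct words in a collection grows more and more slowly as the amount of text increases.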

To do this, do the following:

  1. Start a global concordance that is initially empty.
  2. For each file, construct a concordance (a list of the unique words in the file).
  3. Determine and record the percentage of words in the file concordance that are NOT in the global concordance.
  4. Merge the file concordance into the global concordance.
  5. Repeat steps 2 through 4 until there are no more files.
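
The steps above can be sketched as follows. This is only an illustration in Python, not a required design: the function name new_word_percentages and the choice of sets for both concordances are assumptions, and the lab does not prescribe a language. The input is assumed to be a sequence of (file name, list of words) pairs already in processing order.

```python
def new_word_percentages(files):
    """Return (name, percent_new) for each (name, words) pair, in order.

    `files` is a hypothetical input shape: an iterable of
    (file name, list of words) pairs in the order they should be processed.
    """
    global_concordance = set()                 # step 1: starts empty
    results = []
    for name, words in files:
        file_concordance = set(words)          # step 2: unique words in this file
        new_words = file_concordance - global_concordance
        if file_concordance:
            # step 3: share of this file's unique words not seen before
            percent_new = 100.0 * len(new_words) / len(file_concordance)
        else:
            percent_new = 0.0                  # assumption: empty file counts as 0%
        results.append((name, percent_new))
        global_concordance |= file_concordance  # step 4: merge
    return results                             # step 5: loop ends with the last file


example = [
    ("a.txt", ["the", "cat", "the"]),
    ("b.txt", ["the", "dog"]),
    ("c.txt", ["cat", "dog"]),
]
for name, percent in new_word_percentages(example):
    print(name, percent)
```

Note that the percentage is taken over the unique words in the file concordance, not over the total word count of the file, which matches step 3 as written.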

Data

The data you will be using for this lab are the full text of chapters from six novels by Charles Dickens. These files are available on the UNIX system at /home/gtowell/Public/383/Dickens. If you prefer to work on your own machine, a tar file of the complete file set is available at /home/gtowell/Public/383/split.tar and is easy to copy. Each of the files in the Dickens directory is formatted to make the task as easy as possible. Specifically, each file has one word per line, all in lower case, with all punctuation removed. So all you need to do is read the file and build the concordance. You do not need to modify the text in any way after reading it.
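
Because each file is already one word per line, building a file concordance can be very short. The following Python sketch shows one way to do it; the helper name concordance_from_lines is hypothetical, and skipping blank lines is a defensive assumption rather than something the file format is known to require.

```python
def concordance_from_lines(lines):
    """Build a file concordance (set of unique words) from one-word-per-line text.

    `lines` can be any iterable of strings, such as an open file.
    Surrounding whitespace and newlines are stripped; blank lines are skipped.
    """
    concordance = set()
    for line in lines:
        word = line.strip()
        if word:
            concordance.add(word)
    return concordance


# Small in-memory demonstration; the duplicate "the" is stored only once.
sample = ["the\n", "cat\n", "the\n", "\n", "sat\n"]
print(sorted(concordance_from_lines(sample)))
```

With a real file, the same helper could be called as concordance_from_lines(open(path)), where path is the location of one of the chapter files.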

There are 402 files in the dataset.

Please process the files in alphabetical order. It makes it easier for me to detect if you are doing everything correctly.

What to hand in

Send to gtowell@brynmawr.edu the data you recorded. The data should have two items per line: the name of the file and the percentage of new words. The first line of your data should be:
    Bleak1_0.txt 100.000
which says that every word in the first file has never been seen before.
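
One way to produce lines in this shape is sketched below in Python. The helper name format_result is hypothetical, and the three decimal places are inferred from the sample line above rather than stated as a requirement.

```python
def format_result(filename, percent_new):
    """Format one output line: file name, a space, percentage with 3 decimals."""
    return f"{filename} {percent_new:.3f}"


print(format_result("Bleak1_0.txt", 100.0))
```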

Also attach any code you wrote for this lab to your email.