CS 206 -- Assignment 6

Trees and Hashtables

In this assignment you will read a dataset of the following form:
assignment
sir
sirs
signer
sign
assign
assignments
signer
assigns
So, the dataset consists of words, 1 word per line. You may assume that every word is in lower case and that every line has a word.

The first task is to get and use the code Stemmer.java from /home/gtowell/cs206/a6/. Figure out how to use this code to get the "stemmed" stemmed form of each word. (The stemmed form is the word with any suffixes removed.) You might discover that the documentation of this code, while long is woefully insufficient to make the code easily used. (So all of you who are handing in poorly documented assignments ....) Note that this stemmer is very aggressive and occasionally wrong so it will remove some things that you would not call suffixes. Do not worry about this, you task is only to use this code, not to correct it.

Once you are able to stem words, build a data structure that holds, for each stemmed word: the stem, all of the unstemmed variants, and the number of times some unstemmed variant of the word appears in the dataset.

After reading in all of the words, you should print out an alphabetized list of the stems, the number of times that occurred and the morphological variants that appeared. For example, for the above mini dataset you might print out:

assign 4 assign assigns assignment
sign 3 sign signer
sir 2 sir sirs
For the morphological variants, the order is not important.

As always, there are some constraints on your solution:

  1. The time to insert a new root form (e.g., sir, line or assign from above) must be O(log(n)) where n is the number of root forms
  2. The time to update the information about a know root form must be O(1).
  3. The time to make the final printout must be O(m) where m is the number of morphological variants seen.
  4. The program must take one command line argument, the name of the file containing the words to be processed.
  5. There should be no constraints on the system other than the amount of memory available

Test data

There are a couple of test datasets in /home/gtowell/cs206/a6. Your life will probably be easier if you test your code using these sets.

Due:

Part 1: 40 points April 15: A 1 page textual description (email is the preferred way of giving this to me) of the algorithms and data structures you intend to use. This plan needs to be plausible, thatis it must be able do the assignment within the time constraints. If I do not receive a description by April 15, the maximum number of points you can receive on this assignment is 80.

Part 2: 60 points April 22 The completed program. The program need not follow that plan handed in previously, but if it does not, you should explain why not.

Hand in:

No paper is required is required for the final program. An electronic submission is sufficient.

Hints:

You might end up using a lot of data structures in this assignment. You might also find it useful to share some information between some of your data structures.

I strongly recommend that you think hard about this assignment before writ