CS 206 - Introduction to Data Structures

Word Counts

Mar 14, prior to 11:59pm

Jane Austen wrote books. Fairly long books. Putting 6 of them together, you get a over 700,000 words. But how many distinct words did she use? This assignment is to write a java program to do the following:
1. Determine the number of unique words she used (in a provided collection of her works).
2. Compare the time required to determine this number across several different algorithms
3. Determine the three most common words in this collection and how many times each occurs.
Get the files: ProbeHTInc.java, Map206.java, Map206Interface.java, SepChainHT.java and janeausten.txt from /home/gtowell/Public/206/a4. The last is the full text (plus some headers and footers) 6 novels by Jane Austen (Pride and Prejudice, Emma, Sense and Sensibility, Northanger Abbey and Mansfield Park). Also in this directory is ham.txt, the text of "Green Eggs and Ham". This might be a good thing to develop using as it is much shorter than janeausten.txt.

Requirements

1. Write a class -- perhaps named MyReader with the following suggested properties (innovation is encouraged):
1. It takes a single arguement in its constructor. That arguement is the name of the a file (probably either "ham.txt" or "janeausten.txt").
2. It has a single public method nextWord() which returns the next word from the file. When all words have been read, the nextWord should return null. So, the use of this class would be something like:
```            MyReader myreader = new MyReader("ham.txt");
while (true) {
if (line==null)
break;
// do something with the word.
}
```
As an alternative, you could give MyReader a public method nextWords() which returns an array of strings (possibly consisting of all of the words in one line of the file). When the entire file has been read, the function would return null. In this case, the code above would need to be slightly modified.
2. Your MyReader class should do something to parse out common punctuation and also change everything to lower case. For instance, you might use the following:
```              BufferedReader br; // set up as usual
// other code here
String betterline = line.trim().toLowerCase().replaceAll("[\\.\\'?!,\\\"]", "");
```
You are not required to use the above code (or even understand it), this is just an example of code that would meet this requirement.
3. You must implement the get and put methods in ProbeHTInc. ("Inc" in the class name is short for incomplete.) Your implementation should use either quadratic probing. Your implementation need not be concerned with deletions and therefore does not need to use tombstones.
4. Your program(s) should, for every unique word in the provided file, compute the number of times that word appears. For instance, here is the complete table for Dr Seuss' "Green Eggs and Ham".
```a          59
am          3
and        25
anywhere    8
are         2
be          4
boat        3
box         7
car         7
could      14
dark        7
do         37
eat        25
eggs       11
fox         7
goat        4
good        2
green      11
ham        11
here       11
house       8
i          72
if          1
in         40
let         4
like       44
may         4
me          4
mouse       8
not        84
on          7
or          8
rain        4
sam         6
sam-i-am   13
say         5
see         4
so          5
thank       2
that        3
the        11
them       61
there       9
they        2
train       9
tree        6
try         4
will       21
with       19
would      26
you        34
```
You are not required to produce such a table, I provide it here simply as an example of the data you are expected to collect.
5. You must test the speed of the following algorithms on janeausten.txt
1. 206Map
2. SeparateChainHT
3. ProbeHTInc -- with your get and put methods
4. HashMap -- from java.util
206Map on janeausten.txt can be expected to take several minutes.
6. All programs must be runnable from the command line.
7. Each of the different algorithms to be tested may be run from a separate command line arguement via multiple main functions For example
```			java program1 janeausten.txt
java program2 janeausten.txt
etc
```
Alternately you may create a Main.java file that starts the test you want. For instance, this might look like
```			java main 1 janeausten.txt
java main 2 janeausten.txt
etc
```
Other alternative exist. However you arrange this, be sure it is well documented in the "how to run" section of the Readme file.
8. Your program(s) must allow the textfile to be read to be specified from the command line. E.g., java program ham.txt
9. From each of these algorithms other than java.util.HashMap devise a way to show the three most common words. You approach can be extremely ugly. For instance, for "Green Eggs and Ham" this would be:
```			not        84
i          72
them       61
```
10. Age old requirements about comments and exception handling still apply. Similarly, everything should be properly encapsulated.
11. In your readme file you must include exact times required for each algorithm.
12. Your code may be tested using a different file (given from the command line) so you should not do anything to "hard code" aspects to a particular text file.

Electronic Submissions

Your program will be graded based on how it runs on the department’s Linux server, not how it runs on your computer. The submission should include the following items: