CS 151 - Introduction to Data Structures
Word Counts
Jane Austen wrote books. Fairly long books. Putting 6 of them together, you get over 700,000 words. But how many
distinct words did she use? This assignment is to write a java program to do the following:
- Determine the number of unique words Austen used (in a provided collection of her works).
- Compare the time required to determine this number across several different algorithms
Get the files: ProbeHTInc.java, Map151.java, Map151Interface.java, SepChainHT.java and janeoneword.txt from
/home/gtowell/Public/151/A04. The last is the full text (plus some headers and footers) of 6 novels by Jane Austen
(Pride and Prejudice, Emma, Sense and Sensibility, Northanger Abbey and Mansfield Park). Also in this directory is
hamoneword.txt, the text of "Green Eggs and Ham". This might be a good thing to develop using as it is much shorter
than
janeoneword.txt.
Requirements
- Write a class -- perhaps named MyReader with the following suggested properties (innovation is
encouraged):
- It takes a single parameter in its constructor. That parameter is the name of the a file to be used
as input (probably
either "hamoneword.txt" or "janeoneword.txt").
- It has a single public method nextWord() which returns the next word from the file. When
all words have been read, the nextWord should return null. So, the use of this class would be
something like:
MyReader myreader = new MyReader("ham.txt");
while (true) {
String line = myreader.nextWord();
if (line==null)
break;
// do something with the word.
}
janeoneword.txt and hamoneword.txt are formatted to make this task easier. In particular, all
punctuation
has been removed and there is at most one word per line. (Some line have no words.) You should write
the nextWord() method to return the next word rather than the next line. Hence, if the file contains
word1
word2
(i.e. a blank line between word1 and word2)
then MyReader consecutive calls to nextWord would return "word1" followed by "word2" rather than
"word1" followed by a blank.
- You must implement the get and put methods in ProbeHTInc. ("Inc" in the class name is short for incomplete.)
Your implementation should use linear probing. Your implementation need not be concerned with
deletions and therefore does not need to use tombstones. If we did not discuss tombstones in class, then you
do not know and do not care.
- Your program(s) should, for every unique word in the provided file, compute the number of times that word
appears. For instance, here is the complete table for Dr Seuss' "Green Eggs and Ham".
[i:85]
[am:16]
[sam:19]
[that:3]
[do:37]
[not:84]
[like:44]
[you:34]
[green:11]
[eggs:11]
[and:25]
[ham:11]
[them:61]
[would:26]
[here:11]
[or:8]
[there:9]
[anywhere:8]
[in:40]
[a:59]
[house:8]
[with:19]
[mouse:8]
[eat:25]
[box:7]
[fox:7]
[could:14]
[car:7]
[they:2]
[are:2]
[may:4]
[will:21]
[see:4]
[tree:6]
[let:4]
[me:4]
[be:4]
[train:9]
[on:7]
[say:5]
[the:11]
[dark:7]
[rain:4]
[goat:4]
[boat:3]
[so:5]
[try:4]
[if:1]
[good:2]
[thank:2]
You are not required to produce such a table, I provide it here simply as an example of the data you are
expected to collect. On the other hand, this table should be very easy to create using Map151 (see the
toString method of Map151).
- You must test the speed of the following algorithms on janeoneword.txt
- Map151
- SeparateChainHT
- ProbeHTInc -- with your get and put methods
- HashMap -- from java.util
Map151 on janeoneword.txt can be expected to take a while.
- All programs must be runnable from the command line. Note that the programs for running each of these tests
will be highly repetitive. So get one working perfectly, then cut and paste.
- Each of the different algorithms to be tested should be runnable from separate command line arguments. For
example
java SepCh janeoneword.txt
java Probe janeoneword.txt
etc
(This assumes you have a main method in a file SepCh.java and a main method in another file Probe.java.
Where reasonable, you may put main maethods into existing files.)
Alternately you may create a Main.java file that starts the test you want. For instance, this might look
like
java Main 1 janeoneword.txt
java Main 2 janeoneword.txt
etc
where 1 indicates Map151, 2 indicates SeparateChain, etc.
Other alternative exist.
However you arrange this, be sure it is well documented in the "how to run" section
of the Readme file.
- Your program(s) must allow the textfile and the type of test to be specified from the command line.
- Age old requirements about comments and exception handling still apply. Similarly, everything should be
properly encapsulated.
- In your readme file you must include a table of the times you recorded for each algorithm. For instance, you
might
provide a table similar to the following
Map SepChain Probe java.util.hashmap
average 2 3 4 5
std dev 1 2 2 1
where the units in the table are well defined (by you) and the data is real (unlike here). You should do
multiple
runs of each test and report average and standard deviations of those runs.
- In your readme file, you must write a paragraph (or so) analysing the times reported in your table. The
should comment on what you believe the times are as they are (which algorith is fastest, why, which is
second fastest, why, etc. Linking this analysis to big-O analysis would be a good idea.
- Your code may be tested using a different text file (given from the command line) but that file will follow
the one word per line format of janeoneword.txt.
Electronic Submissions
Your program will be graded based on how it runs on the
department’s Linux server, not how it runs on your computer.
The submission should include the following items:
- README: This file should follow the format of this sample README
(https://cs.brynmawr.edu/cs151/README.txt)
- Within the README file, include a brief (max 1 paragraph) reflection on what went well and/or poorly
and why
- Also within Readme, your analysis paragraph and data table.
- Source files: Every .java file used in the final version of every part of your project (including the
imported files)
- Unique Data files used: If any
The following steps for submission assume you are using VSC, and that you created a project
named AssignmentN in the directory /home/YOU/cs151/
- For this assignment N=4
- Put the README file into the project directory
- Go to the directory /home/YOU/cs151
- Enter /home/gtowell/bin/submit -c 151 -p N -d AssignmentN
For more on using the submit script click here