CS 206 - Introduction to Data Structures
Word Counts
Due Oct 15, prior to 11:59pm
In a collection of Dickens novels there are 82136 unique words. This assignment is to write a java program to do the following:
- Confirm that that there are indeed 82136 unique words using several different algorithms
- Compare the time required to determine this number between the algorithms
- Determine the two most common words in this collection and how many times each occurs.
Get the files BookReader.java, ProbeHTInc.java, Map206.java, Map206Interface.java, SepChainHT.java and dickens.txt from /home/gtowell/Public/206/a4. The last is the collection of several of Dickens' works. I claim that this file has 82136 unique words (more properly that my BookReader class delivers 82136 unique strings). The BookReader class reads the file and gives you one word at a time -- via the "nextWord()" method. You might complain that BookReader is not perfect at its job. That is OK. It does not need to be perfect, it is good enough and performs its jobs exactly the same way every time. Also in this directory is ham.txt, the text of "Green Eggs and Ham". This might be a good thing to develop using as it is much shorter than dickens.txt.
Requirements
- You must use my BookReader class to read the text file.
- You must implement the get and put methods in ProbeHTInc. You implementation should use either quadratic probing or double hashing. You implementation need not be concerned with deletions and therefore does not need to use tombstones.
- Your program(s) must allow the textfile to be read to be specified from the command line. E.g., java program dickens.txt Command line arguements are the topic of Lab 5.
- Your program(s) should, for every unique word in the provided file, compute the number of times that word appears. For instance, here is the complete table for Dr Seuss' "Green Eggs and Ham".
a 59
am 3
and 25
anywhere 8
are 2
be 4
boat 3
box 7
car 7
could 14
dark 7
do 37
eat 25
eggs 11
fox 7
goat 4
good 2
green 11
ham 11
here 11
house 8
i 72
if 1
in 40
let 4
like 44
may 4
me 4
mouse 8
not 84
on 7
or 8
rain 4
sam 6
sam-i-am 13
say 5
see 4
so 5
thank 2
that 3
the 11
them 61
there 9
they 2
train 9
tree 6
try 4
will 21
with 19
would 26
you 34
You are not required to produce such a table, I provide it here simply as an example of the data you are expected to collect.
- You must test the speed of the following algorithms on dickens.txt
- 206Map
- SeparateChainHT
- ProbeHTInc -- with your get and put methods
- HashMap -- from the java.util
Note that 206Map on dickens.txt can be expected to take several minutes.
- All programs must be runnable from the command line.
- Each of the different algorithms to be tested may be run from a separate command line arguement via multiple main functions For example
java program1 dickens.txt
java program2 dickens.txt
etc
Alternately you may create a Main.java file that starts the test you want. For instance, this might look like
java main 1 dickens.txt
java main 2 dickens.txt
etc
Other alternative exist. However you arrange this, be sure it is well documented in the "how to run" section of the Readme file.
- From each of these algorithms other than java.util.HashMap devise a way to show the three most common words. You approach can be extremely ugly. For instance, for "Green Eggs and Ham" this would be:
not 84
i 72
them 61
- Age old requirements about comments and exception handling still apply. Similarly, everything should be properly encapsulated.
- In your readme file you must include exact times required. (Depending on the speed of your computer the Map206 implementation could take in excess of 10 minutes
- I may test your code using a different file (given from the command line) so you should not do anything to "hard code" aspects to dickens.txt
Electronic Submissions
Your program will be graded based on how it runs on the
department’s Linux server, not how it runs on your computer.
The submission should include the following items:
- README: This file should follow the format of this sample README (https://cs.brynmawr.edu/cs206/README.txt)
- Within the README file, include a brief (max 1 paragraph) reflection on what went well and/or poorly and why
- Also within Readme, include a comparison of the times required for the task by your hashtable and ArrayList implementations. Also, some analysis of these times would be good.
- Source files: Every .java file used in the final version of every part of your project (including the imported files)
- Unique Data files used: If any
The following steps for submission assume you are using VSC, and that you created a project named AssignmentN in the directory /home/YOU/cs206/
- For this assignment N=11
- Put the README file into the project directory
- Go to the directory /home/YOU/cs206
- Enter submit -c 206 -p N -d AssignmentN
For more on using the submit script click here