CS 151 - Introduction to Data Structures
Assignment 5 -- Word Counts
Usual Style comments
As usual, read the program design principles and code formatting
standards carefully. You are expected to adhere to all stated standards.
Overview
Pre-victorian english authors wrote a lot. In this assignment you will explore some aspects of that premise. I have picked two authors for you to work with (choose one). Edward Gibbon who wrote "The Decline and Fall of the Roman Empire" (six volumes) and Sir Walter Scott who wrote a bunch of long novels (I have picked 7 of his novels including "Ivanhoe" and "Rob Roy").
This assignment is to write a java program to do the following:
- Determine the number of unique words used by your chosen author (in my data file).
- Determine the most common word in the collection and how often it occurs.
- Compare the time required to determine this number across several different algorithms. Timing was discussed in class in conjunction with asymptotic analysis. The code is https://cs.brynmawr.edu/cs151/Data/Hw5/Timer.java.txt
Setup
Get the files: ProbeHTInc.java, Map151.java, Map151Interface.java, SepChainHT.java and ReadCSV.java (ReadCSV is identical to the code you have used in the previous assignments). Note that this code is very similar to code discussed in class, although possibly with slightly different names. This code is available by scp at
/home/gtowell/Public/151/Hw5
Use scp to get the code. For example:
scp YOURUNIXLOGIN@goldengate.cs.brynmawr.edu:/home/gtowell/Public/151/Data/Hw5/SepChainHT.java SepChainHT.java
or from the web at (for example):
https://cs.brynmawr.edu/cs151/Data/Hw5/SepChainHT.java.txt
Data
Available from the same location are the text files for you to work with. ScottOne.txt for Sir Walter Scott and GibbonOne.txt for Edward Gibbon. Also at this location is
hamoneword.txt, the text of "Green Eggs and Ham". This might be a good thing on which to develop your code as it is much shorter
than either Gibbon or Scott.
These files all have had all of their punctuation removed and have been formatted with one word on each line. You may write your own reader, or you can use ReadCSV (even though these are not CSV files). If you use ReadCSV, do so exactly as in the previous assignments. The only difference is that you should read only the item in position 0 in the string array from the iterator for each line (indeed the array should have only one item in it).
Requirements
-
In the file ProbeHTInc, add line-by-line comments to the methods get, put and find. These comments should described exactly what each line in these methods is doing, and why. I would expect that the comments will have more characters than the functions themselves. Also, complete the top-level documentation of find.
- Your program(s) should, for every unique word in the provided file, compute the number of times that word
appears. For instance, here is the complete table for Dr Seuss' "Green Eggs and Ham".
[i:85]
[am:16]
[sam:19]
[that:3]
[do:37]
[not:84]
[like:44]
[you:34]
[green:11]
[eggs:11]
[and:25]
[ham:11]
[them:61]
[would:26]
[here:11]
[or:8]
[there:9]
[anywhere:8]
[in:40]
[a:59]
[house:8]
[with:19]
[mouse:8]
[eat:25]
[box:7]
[fox:7]
[could:14]
[car:7]
[they:2]
[are:2]
[may:4]
[will:21]
[see:4]
[tree:6]
[let:4]
[me:4]
[be:4]
[train:9]
[on:7]
[say:5]
[the:11]
[dark:7]
[rain:4]
[goat:4]
[boat:3]
[so:5]
[try:4]
[if:1]
[good:2]
[thank:2]
You are not required to produce such a table, I provide it here simply as an example of the data you are
expected to collect. This table could also be used to check the data collection in your program. On the other hand, this table should be very easy to create using Map151 (see the
toString method of Map151).
- You must test the speed of the following algorithms on either the Gibbon or Scott file
- Map151
- SeparateChainHT
- ProbeHTInc -- with your fully documented get and put methods (see below)
Map151 can be expected to take a while, perhaps even quite a while. (On my computer, "quite a while" was somewhere in the neighborhood of 1 minute. Your speed will vary.)
- All programs must be runnable from the command line. Note that the programs for running each of these tests
will be highly repetitive. So get one working perfectly, then cut and paste.
- Each of the different algorithms to be tested should be runnable from separate command line arguments. For
example
java SepCh https://cs.brynmawr.edu/cs151/Data/Hw5/GibbonOne.txt
java Probe https://cs.brynmawr.edu/cs151/Data/Hw5/GibbonOne.txt
(This assumes you have a main method in a file SepCh.java and a main method in another file Probe.java and those main methods do the data collection and timing discussed above.
Where reasonable, you may put main methods into existing files.)
Alternately you may create a Main.java file that starts the test you want. For instance, this might look
like
java Main 1 https://cs.brynmawr.edu/cs151/Data/Hw5/GibbonOne.txt
java Main 2 https://cs.brynmawr.edu/cs151/Data/Hw5/GibbonOne.txt
etc
where 1 indicates Map151, 2 indicates SeparateChain, etc.
Other alternative exist.
However you arrange this, be sure it is well documented in the "how to run" section
of the Readme file.
- Your program(s) must allow the text and the type of test to be specified from the command line.
- Age old requirements about comments and exception handling still apply. Similarly, everything should be
properly encapsulated.
- In your readme file you must include a table of the times you recorded for each algorithm. For instance, you
might
provide a table similar to the following
Map SepChain Probe
average 2 3 4
std dev 1 2 2
where the units in the table are well defined (by you) and the data is real (unlike here). You should do
multiple
runs of each test and report average and standard deviations of those runs.
- In addition to the timing table, in your readme file include two pieces of information about your chosen collection: the number of words, and the number of unique words
- In your readme, add a statement of what is the most common word, and how many times does the most common word occur. To figure this out, use the keySet method that is provided with Map151Impl, SepChainHT and ProbeHTInc. For instance, here is an example of the use of keySet
Map151Impl<String, String> s = new Map151Impl<>();
// do something to fill the map
for (String k : mm.keySet()) {
System.out.println(k);
}
- Your code may be tested using a different text file (given from the command line) but that file will follow
the one word per line format of the Gibbon and Scott files
Electronic Submissions
Your program will be graded based on how it runs on the
department’s Linux server, not how it runs on your computer.
Be sure to include with your submission:
1. README:
The usual plain text file README
2. Source files:
All .java files that are not part of the standard Java libraries, regardless of whether you wrote them
or simply used ones supplied with the assignment.
4. Output file
None required.
DO NOT INCLUDE:
Data files that are read from the class site.
The following steps for submission assume that you created a project directory named
Assignment5 in the directory /home/YOU/cs151/ on the UNIX servers
- Make sure all of your code files and the requested output file are in that directory. (Use scp as needed to copy files from your local machine to that directory on UNIX
- Similarly put the README file into the project directory on UNIX
- on a UNIX machine, go to the directory /home/YOU/cs151
- Enter /home/gtowell/bin/submit -c 151 -p 5 -d Assignment5
For a longer version of these instructions see earlier homeworks.