CS 151 - Introduction to Data Structures

Word Counts

Jane Austen wrote books. Fairly long books. Putting 6 of them together, you get over 700,000 words. But how many distinct words did she use? This assignment is to write a java program to do the following:
  1. Determine the number of unique words Austen used (in a provided collection of her works).
  2. Compare the time required to determine this number across several different algorithms
Get the files: ProbeHTInc.java, Map151.java, Map151Interface.java, SepChainHT.java and janeoneword.txt from /home/gtowell/Public/151/A04. The last is the full text (plus some headers and footers) of 6 novels by Jane Austen (Pride and Prejudice, Emma, Sense and Sensibility, Northanger Abbey and Mansfield Park). Also in this directory is hamoneword.txt, the text of "Green Eggs and Ham". This might be a good thing to develop using as it is much shorter than janeoneword.txt.

Requirements

  1. Write a class -- perhaps named MyReader with the following suggested properties (innovation is encouraged):
    1. It takes a single parameter in its constructor. That parameter is the name of the a file to be used as input (probably either "hamoneword.txt" or "janeoneword.txt").
    2. It has a single public method nextWord() which returns the next word from the file. When all words have been read, the nextWord should return null. So, the use of this class would be something like:
                  MyReader myreader = new MyReader("ham.txt");
                  while (true) {
                      String line = myreader.nextWord();
                      if (line==null) 
                          break;
                      // do something with the word.
                  }
              
      janeoneword.txt and hamoneword.txt are formatted to make this task easier. In particular, all punctuation has been removed and there is at most one word per line. (Some line have no words.) You should write the nextWord() method to return the next word rather than the next line. Hence, if the file contains
      word1
      
      word2
      					
      (i.e. a blank line between word1 and word2) then MyReader consecutive calls to nextWord would return "word1" followed by "word2" rather than "word1" followed by a blank.
  2. You must implement the get and put methods in ProbeHTInc. ("Inc" in the class name is short for incomplete.) Your implementation should use linear probing. Your implementation need not be concerned with deletions and therefore does not need to use tombstones. If we did not discuss tombstones in class, then you do not know and do not care.
  3. Your program(s) should, for every unique word in the provided file, compute the number of times that word appears. For instance, here is the complete table for Dr Seuss' "Green Eggs and Ham".
    				[i:85]
    				[am:16]
    				[sam:19]
    				[that:3]
    				[do:37]
    				[not:84]
    				[like:44]
    				[you:34]
    				[green:11]
    				[eggs:11]
    				[and:25]
    				[ham:11]
    				[them:61]
    				[would:26]
    				[here:11]
    				[or:8]
    				[there:9]
    				[anywhere:8]
    				[in:40]
    				[a:59]
    				[house:8]
    				[with:19]
    				[mouse:8]
    				[eat:25]
    				[box:7]
    				[fox:7]
    				[could:14]
    				[car:7]
    				[they:2]
    				[are:2]
    				[may:4]
    				[will:21]
    				[see:4]
    				[tree:6]
    				[let:4]
    				[me:4]
    				[be:4]
    				[train:9]
    				[on:7]
    				[say:5]
    				[the:11]
    				[dark:7]
    				[rain:4]
    				[goat:4]
    				[boat:3]
    				[so:5]
    				[try:4]
    				[if:1]
    				[good:2]
    				[thank:2]					
    
    You are not required to produce such a table, I provide it here simply as an example of the data you are expected to collect. On the other hand, this table should be very easy to create using Map151 (see the toString method of Map151).
  4. You must test the speed of the following algorithms on janeoneword.txt
    1. Map151
    2. SeparateChainHT
    3. ProbeHTInc -- with your get and put methods
    4. HashMap -- from java.util
    Map151 on janeoneword.txt can be expected to take a while.
  5. All programs must be runnable from the command line. Note that the programs for running each of these tests will be highly repetitive. So get one working perfectly, then cut and paste.
  6. Each of the different algorithms to be tested should be runnable from separate command line arguments. For example
    			java SepCh janeoneword.txt 
    			java Probe janeoneword.txt 
    			etc
    		
    (This assumes you have a main method in a file SepCh.java and a main method in another file Probe.java. Where reasonable, you may put main maethods into existing files.)

    Alternately you may create a Main.java file that starts the test you want. For instance, this might look like
    			java Main 1 janeoneword.txt 
    			java Main 2 janeoneword.txt 
    			etc
    		
    where 1 indicates Map151, 2 indicates SeparateChain, etc. Other alternative exist.

    However you arrange this, be sure it is well documented in the "how to run" section of the Readme file.
  7. Your program(s) must allow the textfile and the type of test to be specified from the command line.
  8. Age old requirements about comments and exception handling still apply. Similarly, everything should be properly encapsulated.
  9. In your readme file you must include a table of the times you recorded for each algorithm. For instance, you might provide a table similar to the following
    			     Map     SepChain     Probe      java.util.hashmap
    		average    2            3         4                      5
    		std dev    1            2         2                      1
    		
    where the units in the table are well defined (by you) and the data is real (unlike here). You should do multiple runs of each test and report average and standard deviations of those runs.
  10. In your readme file, you must write a paragraph (or so) analysing the times reported in your table. The should comment on what you believe the times are as they are (which algorith is fastest, why, which is second fastest, why, etc. Linking this analysis to big-O analysis would be a good idea.
  11. Your code may be tested using a different text file (given from the command line) but that file will follow the one word per line format of janeoneword.txt.

Electronic Submissions

Your program will be graded based on how it runs on the department’s Linux server, not how it runs on your computer. The submission should include the following items:

The following steps for submission assume you are using VSC, and that you created a project named AssignmentN in the directory /home/YOU/cs151/

  1. For this assignment N=4
  2. Put the README file into the project directory
  3. Go to the directory /home/YOU/cs151
  4. Enter /home/gtowell/bin/submit -c 151 -p N -d AssignmentN

For more on using the submit script click here