CS 206 - Introduction to Data Structures

Word Counts

Due Oct 15, prior to 11:59pm

In a collection of Dickens novels there are 82136 unique words. This assignment is to write a java program to do the following:
  1. Confirm that that there are indeed 82136 unique words using several different algorithms
  2. Compare the time required to determine this number between the algorithms
  3. Determine the two most common words in this collection and how many times each occurs.
Get the files BookReader.java, ProbeHTInc.java, Map206.java, Map206Interface.java, SepChainHT.java and dickens.txt from /home/gtowell/Public/206/a4. The last is the collection of several of Dickens' works. I claim that this file has 82136 unique words (more properly that my BookReader class delivers 82136 unique strings). The BookReader class reads the file and gives you one word at a time -- via the "nextWord()" method. You might complain that BookReader is not perfect at its job. That is OK. It does not need to be perfect, it is good enough and performs its jobs exactly the same way every time. Also in this directory is ham.txt, the text of "Green Eggs and Ham". This might be a good thing to develop using as it is much shorter than dickens.txt.

Requirements

  1. You must use my BookReader class to read the text file.
  2. You must implement the get and put methods in ProbeHTInc. You implementation should use either quadratic probing or double hashing. You implementation need not be concerned with deletions and therefore does not need to use tombstones.
  3. Your program(s) must allow the textfile to be read to be specified from the command line. E.g., java program dickens.txt Command line arguements are the topic of Lab 5.
  4. Your program(s) should, for every unique word in the provided file, compute the number of times that word appears. For instance, here is the complete table for Dr Seuss' "Green Eggs and Ham".
    a          59
    am          3
    and        25
    anywhere    8
    are         2
    be          4
    boat        3
    box         7
    car         7
    could      14
    dark        7
    do         37
    eat        25
    eggs       11
    fox         7
    goat        4
    good        2
    green      11
    ham        11
    here       11
    house       8
    i          72
    if          1
    in         40
    let         4
    like       44
    may         4
    me          4
    mouse       8
    not        84
    on          7
    or          8
    rain        4
    sam         6
    sam-i-am   13
    say         5
    see         4
    so          5
    thank       2
    that        3
    the        11
    them       61
    there       9
    they        2
    train       9
    tree        6
    try         4
    will       21
    with       19
    would      26
    you        34
    
    You are not required to produce such a table, I provide it here simply as an example of the data you are expected to collect.
  5. You must test the speed of the following algorithms on dickens.txt
    1. 206Map
    2. SeparateChainHT
    3. ProbeHTInc -- with your get and put methods
    4. HashMap -- from the java.util
    Note that 206Map on dickens.txt can be expected to take several minutes.
  6. All programs must be runnable from the command line.
  7. Each of the different algorithms to be tested may be run from a separate command line arguement via multiple main functions For example
    			java program1 dickens.txt 
    			java program2 dickens.txt 
    			etc
    		
    Alternately you may create a Main.java file that starts the test you want. For instance, this might look like
    			java main 1 dickens.txt 
    			java main 2 dickens.txt 
    			etc
    		
    Other alternative exist. However you arrange this, be sure it is well documented in the "how to run" section of the Readme file.
  8. From each of these algorithms other than java.util.HashMap devise a way to show the three most common words. You approach can be extremely ugly. For instance, for "Green Eggs and Ham" this would be:
    			not        84
    			i          72
    			them       61			
    		
  9. Age old requirements about comments and exception handling still apply. Similarly, everything should be properly encapsulated.
  10. In your readme file you must include exact times required. (Depending on the speed of your computer the Map206 implementation could take in excess of 10 minutes
  11. I may test your code using a different file (given from the command line) so you should not do anything to "hard code" aspects to dickens.txt

Electronic Submissions

Your program will be graded based on how it runs on the department’s Linux server, not how it runs on your computer. The submission should include the following items:

The following steps for submission assume you are using VSC, and that you created a project named AssignmentN in the directory /home/YOU/cs206/

  1. For this assignment N=11
  2. Put the README file into the project directory
  3. Go to the directory /home/YOU/cs206
  4. Enter submit -c 206 -p N -d AssignmentN

For more on using the submit script click here