CS 206 - Introduction to Data Structures

Word Counts

Mar 14, prior to 11:59pm

Jane Austen wrote books. Fairly long books. Putting 6 of them together, you get a over 700,000 words. But how many distinct words did she use? This assignment is to write a java program to do the following:
  1. Determine the number of unique words she used (in a provided collection of her works).
  2. Compare the time required to determine this number across several different algorithms
  3. Determine the three most common words in this collection and how many times each occurs.
Get the files: ProbeHTInc.java, Map206.java, Map206Interface.java, SepChainHT.java and janeausten.txt from /home/gtowell/Public/206/a4. The last is the full text (plus some headers and footers) 6 novels by Jane Austen (Pride and Prejudice, Emma, Sense and Sensibility, Northanger Abbey and Mansfield Park). Also in this directory is ham.txt, the text of "Green Eggs and Ham". This might be a good thing to develop using as it is much shorter than janeausten.txt.

Requirements

  1. Write a class -- perhaps named MyReader with the following suggested properties (innovation is encouraged):
    1. It takes a single arguement in its constructor. That arguement is the name of the a file (probably either "ham.txt" or "janeausten.txt").
    2. It has a single public method nextWord() which returns the next word from the file. When all words have been read, the nextWord should return null. So, the use of this class would be something like:
                  MyReader myreader = new MyReader("ham.txt");
                  while (true) {
                      String line = myreader.nextWord();
                      if (line==null) 
                          break;
                      // do something with the word.
                  }
              
      As an alternative, you could give MyReader a public method nextWords() which returns an array of strings (possibly consisting of all of the words in one line of the file). When the entire file has been read, the function would return null. In this case, the code above would need to be slightly modified.
  2. Your MyReader class should do something to parse out common punctuation and also change everything to lower case. For instance, you might use the following:
                  BufferedReader br; // set up as usual
                  // other code here
                  String line = br.readLine();
                  String betterline = line.trim().toLowerCase().replaceAll("[\\.\\'?!,\\\"]", "");
              
    You are not required to use the above code (or even understand it), this is just an example of code that would meet this requirement.
  3. You must implement the get and put methods in ProbeHTInc. ("Inc" in the class name is short for incomplete.) Your implementation should use either quadratic probing. Your implementation need not be concerned with deletions and therefore does not need to use tombstones.
  4. Your program(s) should, for every unique word in the provided file, compute the number of times that word appears. For instance, here is the complete table for Dr Seuss' "Green Eggs and Ham".
    a          59
    am          3
    and        25
    anywhere    8
    are         2
    be          4
    boat        3
    box         7
    car         7
    could      14
    dark        7
    do         37
    eat        25
    eggs       11
    fox         7
    goat        4
    good        2
    green      11
    ham        11
    here       11
    house       8
    i          72
    if          1
    in         40
    let         4
    like       44
    may         4
    me          4
    mouse       8
    not        84
    on          7
    or          8
    rain        4
    sam         6
    sam-i-am   13
    say         5
    see         4
    so          5
    thank       2
    that        3
    the        11
    them       61
    there       9
    they        2
    train       9
    tree        6
    try         4
    will       21
    with       19
    would      26
    you        34
    
    You are not required to produce such a table, I provide it here simply as an example of the data you are expected to collect.
  5. You must test the speed of the following algorithms on janeausten.txt
    1. 206Map
    2. SeparateChainHT
    3. ProbeHTInc -- with your get and put methods
    4. HashMap -- from java.util
    206Map on janeausten.txt can be expected to take several minutes.
  6. All programs must be runnable from the command line.
  7. Each of the different algorithms to be tested may be run from a separate command line arguement via multiple main functions For example
    			java program1 janeausten.txt 
    			java program2 janeausten.txt 
    			etc
    		
    Alternately you may create a Main.java file that starts the test you want. For instance, this might look like
    			java main 1 janeausten.txt 
    			java main 2 janeausten.txt 
    			etc
    		
    Other alternative exist. However you arrange this, be sure it is well documented in the "how to run" section of the Readme file.
  8. Your program(s) must allow the textfile to be read to be specified from the command line. E.g., java program ham.txt
  9. From each of these algorithms other than java.util.HashMap devise a way to show the three most common words. You approach can be extremely ugly. For instance, for "Green Eggs and Ham" this would be:
    			not        84
    			i          72
    			them       61			
    		
  10. Age old requirements about comments and exception handling still apply. Similarly, everything should be properly encapsulated.
  11. In your readme file you must include exact times required for each algorithm.
  12. Your code may be tested using a different file (given from the command line) so you should not do anything to "hard code" aspects to a particular text file.

Electronic Submissions

Your program will be graded based on how it runs on the department’s Linux server, not how it runs on your computer. The submission should include the following items:

The following steps for submission assume you are using VSC, and that you created a project named AssignmentN in the directory /home/YOU/cs206/

  1. For this assignment N=4
  2. Put the README file into the project directory
  3. Go to the directory /home/YOU/cs206
  4. Enter /home/gtowell/bin/submit -c 206 -p N -d AssignmentN

For more on using the submit script click here