Homework 5

More with Computational Linguistics

Due: Oct 26, prior to 11:59:59 PM

Overview

From last week: Computational Linguistics is an interdisciplinary field that applies concepts from computer science, math, and logic to solving problems in, and gaining an understanding of, linguistics and natural (or "human") languages. One way to study texts is to compare the relative frequency of letters in two (or more) of them. In this assignment you will do just that (and practice using arrays).

Specifically, you will write a Java program that reads three books and stores a count of the number of uses of each letter in each one: Green Eggs and Ham by Dr. Seuss, Pride and Prejudice by Jane Austen, and Rob Roy by Sir Walter Scott. The full text of the books is available on Unix at:

    /home/gtowell/Public/CS113/HW5/ham.txt
    /home/gtowell/Public/CS113/HW5/PP.txt
    /home/gtowell/Public/CS113/HW5/RobRoy.txt
The file ham.txt is identical to the one used in HW4.

You can re-use the code from the previous assignment for reading and storing character counts almost exactly. The only adjustment needed is that the texts of Pride and Prejudice and Rob Roy both contain a few non-ASCII characters. Hence, instead of lines like:

    // fileReader is my FileReader and counts is an array of int 
    int c = fileReader.read();
    counts[c]++;
your file-reading code will need:
    int c = fileReader.read();
    if (c<128)
        counts[c]++;
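
For context, here is a minimal sketch of how that check might sit inside a complete reading loop. It is one possible arrangement, not the required solution: the class name AsciiCountSketch and the method name countAscii are hypothetical, and the sketch reads from a StringReader so it runs without any of the course files. A FileReader can be passed in the same way.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical class name, used only for this illustration.
class AsciiCountSketch {

    // Counts characters whose codes are 0..127; all other characters are skipped.
    static int[] countAscii(Reader reader) throws IOException {
        int[] counts = new int[128];
        int c;
        // read() returns -1 once the end of the input is reached.
        while ((c = reader.read()) != -1) {
            if (c < 128) {
                counts[c]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // The accented 'é' has a code above 127, so it is not counted.
        int[] counts = countAscii(new StringReader("Café, café"));
        System.out.println("a: " + counts['a']);
        System.out.println("f: " + counts['f']);
    }
}
```

Without the c<128 test, the first non-ASCII character would index past the end of the 128-element array and the program would stop with an ArrayIndexOutOfBoundsException.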

Collecting char counts

Create two arrays as follows:
  1. a one-dimensional array of length three of type String. This array should contain the names of the files to be read.
  2. a two-dimensional array of size 3x128 of type int. This will contain character counts. (Using a 2-D array will require a further (minor) adjustment to your code for collecting char counts.)
With these two arrays created, collect character counts for each file within a loop (which goes from 0 to less than 3). A loop over just 3 items may seem a little silly, but imagine that you have to read 20 (or 2000) files rather than 3. Your code should be able to deal with more files without a significant re-write. Inside this outer loop you will have the file-reading loop from Homework 4.
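
The arrangement described above could be sketched roughly as follows. This is an illustration under assumptions rather than the expected answer: the class name MultiFileCountSketch and the method name countAll are hypothetical, the inner loop is a guess at what your Homework 4 loop looks like, and main catches IOException only so the sketch prints a message instead of crashing on a machine that lacks the course files.

```java
import java.io.FileReader;
import java.io.IOException;

// Hypothetical class name, used only for this illustration.
class MultiFileCountSketch {

    // Returns one row of 128 counts per file name.
    static int[][] countAll(String[] fileNames) throws IOException {
        int[][] counts = new int[fileNames.length][128];
        // Outer loop: one pass per file. Using fileNames.length rather than a
        // literal 3 means more files only require a longer array.
        for (int f = 0; f < fileNames.length; f++) {
            try (FileReader fileReader = new FileReader(fileNames[f])) {
                int c;
                // Inner loop: the per-character reading loop.
                while ((c = fileReader.read()) != -1) {
                    if (c < 128) {
                        counts[f][c]++;   // row f belongs to file f
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] fileNames = {
            "/home/gtowell/Public/CS113/HW5/ham.txt",
            "/home/gtowell/Public/CS113/HW5/PP.txt",
            "/home/gtowell/Public/CS113/HW5/RobRoy.txt"
        };
        try {
            int[][] counts = countAll(fileNames);
            for (int f = 0; f < fileNames.length; f++) {
                System.out.println(fileNames[f] + " e: " + counts[f]['e']);
            }
        } catch (IOException e) {
            System.out.println("Could not read a file: " + e.getMessage());
        }
    }
}
```

The only change from the one-dimensional version is the extra index: counts[c] becomes counts[f][c], where f is the position of the current file in the array of file names.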

Now that you have the data collected, do the following for each file (i.e., rows 0..2 of your 2-D array):

Requirements:

Suggestions:

Submitting

Create a readme

Use VSC to create another file in your HW5 directory. This file should be named "Readme". The contents of this file should follow this sample.

Before submitting you should have at least 2 files in your HW5 directory: ZZZZZ.java, and Readme. (You might also have .class files and .txt files.)

Submit

If you did this work on your own computer

You will first need to copy the files from your own computer to a lab computer. To do so, you can go into the lab as with HW2 or use ssh. Either way, you will need to create an HW5 directory within your CS113 directory on the Unix machines. Recall that this can be done with the following commands:
    cd
    cd CS113 
    mkdir HW5 
Once you have made the HW5 directory in Unix, open a terminal on your own computer and in that terminal use "cd" to navigate to the directory containing your work for this assignment. Assuming you use the same directory structure on your own computer and in the lab, this can be accomplished with the following commands:
    cd
    cd CS113
    cd HW5
Then use the scp command to copy each of the files you want to submit from your computer to the lab, one scp command per file. For example:
    scp Readme UNIX_NAME@goldengate.cs.brynmawr.edu:CS113/HW5/Readme
As always, when you read "UNIX_NAME" put in your UNIX user name. Also, with each scp command you will need to enter your UNIX password.

Actually submit

Open a terminal in UNIX (again, you can use ssh to do so from your laptop) and execute the following Unix commands (assuming you put your HW5 directory into a CS113 directory in your home directory).
    cd
    cd CS113
    /home/gtowell/bin/submit -c 113 -d HW5 -p 5
In response to the submit command you should see a series of messages ending with:
    
    Submitting archive...
    Submission complete! Submission timestamp is 2023-08-08-15-30-28-EDT.