Homework 5
More with Computational Linguistics
Due: Oct 26, prior to 11:59:59 PM
Overview
Recall from last week: computational linguistics is an interdisciplinary field that applies concepts from computer science, math, and logic to solving problems in, and gaining an understanding of, linguistics and natural (or "human") languages.
One of the ways to study a text is to compare the relative frequency of letters in two (or more) texts. In this assignment you will do just that (and practice using arrays).
Specifically, you will write a Java program that reads three books and stores a count of how many times each letter is used in each one: Green Eggs and Ham by Dr. Seuss, Pride and Prejudice by Jane Austen, and Rob Roy by Sir Walter Scott. The full text of the books is available on Unix at:
/home/gtowell/Public/CS113/HW5/ham.txt
/home/gtowell/Public/CS113/HW5/PP.txt
/home/gtowell/Public/CS113/HW5/RobRoy.txt
The file ham.txt is identical to the one used in HW4.
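You can re-use the code from the previous assignment for reading and storing character counts almost exactly. The only adjustment you need arises because the texts of Pride and Prejudice and Rob Roy both contain a few non-ASCII characters. With a counts array of length 128, using one of those characters as an index would throw an ArrayIndexOutOfBoundsException. Hence, instead of lines like: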
// fileReader is my FileReader and counts is an array of int
int c = fileReader.read();
counts[c]++;
in your file-reading loop you will need:
int c = fileReader.read();
if (c<128)
counts[c]++;
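Combined with an end-of-file check, the guarded read might look like the sketch below. The class and method names (AsciiCountSketch, countAscii) are made up for illustration, and the method takes any Reader so that it can be tried on a String as well as on a FileReader. The assignment does not prescribe this layout; adapt it to your own HW4 code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical names throughout; not part of the assignment.
class AsciiCountSketch {
    // Counts each ASCII character (codes 0..127) delivered by the reader.
    static int[] countAscii(Reader reader) throws IOException {
        int[] counts = new int[128];
        int c = reader.read();          // read() returns -1 at end of input
        while (c != -1) {
            if (c < 128) {              // skip non-ASCII characters
                counts[c]++;
            }
            c = reader.read();
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // \u00e9 (e with an acute accent) has code 233, so it is skipped.
        int[] counts = countAscii(new StringReader("Sam-I-am, caf\u00e9"));
        System.out.println("a: " + counts['a'] + ", m: " + counts['m']);
    }
}
```

Passing a FileReader instead of the StringReader gives the same behavior on a file.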
Collecting char counts
Create two arrays as follows:
- a one dimensional array of length three of type String. This array should contain the names of the files to be read.
- a two dimensional array of size 3x128 of type int. This will contain character counts. (Using a 2-d array will require a further (minor) adjustment to your code for collecting char counts.)
With these two arrays created, collect character counts for each file within a loop (which goes from 0 to less than 3). A loop over just 3 items may seem a little silly; imagine that you have to read 20 (or 2000) files rather than 3. Your code should be able to deal with more files without a significant re-write. Within the loop to 3 you will have the file reading loop from homework 4.
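One possible shape for the two arrays and the outer loop is sketched here. The names MultiFileCountSketch and collectCounts are hypothetical, and using fileNames.length rather than a literal 3 is just one way to keep the code unchanged if more files are added.

```java
import java.io.FileReader;
import java.io.IOException;

// Hypothetical names; a sketch, not the required solution.
class MultiFileCountSketch {
    // Fills one 128-entry row of the table per file name.
    static int[][] collectCounts(String[] fileNames) throws IOException {
        int[][] counts = new int[fileNames.length][128];
        for (int f = 0; f < fileNames.length; f++) {
            // try-with-resources closes the file even if read() fails
            try (FileReader fileReader = new FileReader(fileNames[f])) {
                int c = fileReader.read();
                while (c != -1) {           // -1 signals end of file
                    if (c < 128) {
                        counts[f][c]++;     // row f belongs to file f
                    }
                    c = fileReader.read();
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        String[] fileNames = {
            "/home/gtowell/Public/CS113/HW5/ham.txt",
            "/home/gtowell/Public/CS113/HW5/PP.txt",
            "/home/gtowell/Public/CS113/HW5/RobRoy.txt"
        };
        int[][] counts = collectCounts(fileNames);
        System.out.println("Files read: " + counts.length);
    }
}
```

Because the row index f is the same as the index into fileNames, counts[f] always holds the counts for fileNames[f].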
Now that you have the data collected, do the following for each file (i.e., for each row 0..3 of your 2-D array):
Requirements:
- Read the data into a 2-d array within a loop from 0 to less than 3. (As above, pretend you are doing everything for 2000 files rather than 3.)
- Read and collect into a 2-d array all of the char counts in the three files prior to computing any of the statistics. (Yes, I know you could compute things one file at a time.)
- You must do the stats calculations within a loop from 0 to less than 3. You can/should have other loops within the 0..3 loop.
- Your program must print results in some readable manner akin to the one above.
- Your code should follow Java coding conventions.
- In your readme, include a description of your statistic and why you chose it.
Suggestions:
- Compute the total number of characters and total number of letters first. The correct number of characters (in the ASCII character set) in each file is 3457, 738659, and 1134449 for ham.txt, PP.txt, and RobRoy.txt, respectively.
- Once you are confident you have those counts correct, compute the most common letter and the stat of your own devising.
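A sketch of the first few statistics, computed per row inside a loop over the files, might look like this. All names here (LetterStatsSketch, totalChars, totalLetters, mostCommonLetter) are hypothetical. Counting upper and lower case together as one letter is an assumption, not something the assignment states, and the counts in main are stand-in values rather than real data.

```java
// Hypothetical names; the case-folding rule is an assumption.
class LetterStatsSketch {
    // Sum of all 128 entries in one file's row.
    static int totalChars(int[] row) {
        int total = 0;
        for (int i = 0; i < row.length; i++) {
            total += row[i];
        }
        return total;
    }

    // Sum of the entries for 'A'..'Z' and 'a'..'z'.
    static int totalLetters(int[] row) {
        int total = 0;
        for (char ch = 'a'; ch <= 'z'; ch++) {
            total += row[ch] + row[Character.toUpperCase(ch)];
        }
        return total;
    }

    // Most frequent letter, treating upper and lower case as the same letter.
    // Ties go to the letter that comes first in the alphabet;
    // a row with no letters at all yields 'a'.
    static char mostCommonLetter(int[] row) {
        char best = 'a';
        int bestCount = -1;
        for (char ch = 'a'; ch <= 'z'; ch++) {
            int count = row[ch] + row[Character.toUpperCase(ch)];
            if (count > bestCount) {
                best = ch;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[][] counts = new int[3][128];  // stand-in for the table filled while reading
        counts[0]['I'] = 2;
        counts[0]['a'] = 3;
        counts[0][' '] = 4;
        for (int f = 0; f < counts.length; f++) {
            System.out.println("file " + f + ": " + totalChars(counts[f]) + " chars, "
                    + totalLetters(counts[f]) + " letters, most common '"
                    + mostCommonLetter(counts[f]) + "'");
        }
    }
}
```

Checking totalChars against the three totals given in the suggestions is a quick way to confirm the reading loop before moving on to the other statistics.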
Submitting
Create a readme
Use VSC to create another file in your HW5 directory. This file should be named "Readme". The contents of this file should follow this sample.
Before submitting you should have at least 2 files in your HW5 directory: ZZZZZ.java, and Readme. (You might also have .class files and .txt files.)
Submit
If you did this work on your own computer
You will first need to copy the files from your own computer to a lab computer. To do so, you can go into the lab as with HW2 or use ssh. Either way, you will need to create an HW5 directory within your CS113 directory on the Unix machines. Recall that this can be done with the following commands:
cd
cd CS113
mkdir HW5
Once you have made the HW5 directory in Unix, open a terminal on your own computer and in that terminal use "cd" to navigate to the directory containing your work for this assignment. Assuming you use the same directory structure on your own computer and in the lab, this can be accomplished with the following commands:
cd
cd CS113
cd HW5
Then use the scp command to copy each of the files you want to submit from your computer to the lab. For example:
scp Readme UNIX_NAME@goldengate.cs.brynmawr.edu:CS113/HW5/Readme
As always, when you read "UNIX_NAME" put in your UNIX user name. Also, with each scp command you will need to enter your UNIX password.
Actually submit
Open a terminal in UNIX (again, you can use SSH to do so from your laptop) and execute the following Unix commands (assuming you put your HW5 directory into a CS113 directory in your home directory):
cd
cd CS113
/home/gtowell/bin/submit -c 113 -d HW5 -p 5
In response to the submit command you should see a series of messages ending with:
Submitting archive...
Submission complete! Submission timestamp is 2023-08-08-15-30-28-EDT.