Homework 4

Due: Oct 12, prior to 11:59:59 PM

Overview

Computational Linguistics is an interdisciplinary field that applies concepts from computer science, math, and logic to solving problems in, and gaining an understanding of, linguistics and natural (or “human”) languages.
This field includes such tasks as speech recognition, machine translation, text summarization, and social media mining, among many others.
In many cases, problems in computational linguistics involve taking a large input, e.g. a text document or audio clip, and breaking it into smaller, individual components. These individual components can then be visualized using a histogram, which is a graphical display of data using vertical bars of different heights, typically to show the distribution of values in some range. For instance, a histogram can be used to visualize the frequency or number of occurrences of certain words, individual letters, or combinations of letters within a certain document.
In this assignment, you will write a Java program that displays a histogram of the number of occurrences of letters in the Dr. Seuss classic "Green Eggs and Ham". Here is a portion of the output you will be creating in this assignment:
63 ? 16 xxxxxxxxxxxxxxxx
64 @ 0 
65 A 18 xxxxxxxxxxxxxxxxxx
66 B 0 
67 C 3 xxx
68 D 1 x
69 E 6 xxxxxx
70 F 0 
71 G 0 
72 H 4 xxxx
73 I 88 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
In the above, each row refers to a single letter. The first thing in each row -- a number -- is the ASCII value of the letter. The second thing is the letter itself. The third is the number of times that letter appears in "Green Eggs and Ham" and the fourth is one 'x' for each time the letter appears. ('I' appears frequently thanks to the name "Sam-I-Am".)
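To make the row format concrete, here is a minimal sketch of how one such row could be assembled from an ASCII value and a count. The class name RowDemo and the method name row are made up for this illustration; they are not part of the assignment, and other approaches work equally well.

```java
public class RowDemo {
    // Hypothetical helper: builds one histogram row in the format
    // "<ASCII value> <character> <count> <one x per occurrence>".
    static String row(int code, int count) {
        StringBuilder sb = new StringBuilder();
        sb.append(code).append(' ');         // ASCII value
        sb.append((char) code).append(' ');  // the character itself
        sb.append(count).append(' ');        // number of occurrences
        for (int i = 0; i < count; i++) {
            sb.append('x');                  // one 'x' per occurrence
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(row(67, 3));  // prints "67 C 3 xxx"
        System.out.println(row(68, 1));  // prints "68 D 1 x"
    }
}
```

The two rows printed here match the rows for 'C' and 'D' in the sample output above.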

Setup

Start by making a new directory for homework 4.

Next, copy a file containing "Green Eggs and Ham" into the directory you just created. The file is available on Unix in the file

            /home/gtowell/Public/CS113/HW4/ham.txt
        
Note that the "/" at the beginning of this file name is important. Be sure to get everything in this file name or the copy will fail. Use cp or scp to copy this file to your Unix account or laptop, respectively.

Reading a File

There are many ways to read from files in Java. As you will be reading character by character, you will need a fairly low-level reader. The Java "FileReader" class will work just fine. Here is an example of using FileReader to read a single character from the file "ham.txt". (Feel free to copy this program into your HW4 directory and try it out.)
1     import java.io.FileReader;
2     import java.io.IOException;
3    
4     public class ReadOne {
5         public static void main(String[] args) {
6             try {
7                 FileReader r = new FileReader("ham.txt");
8                 if (r.ready()) {
9                     int int_read = r.read();
10                    char char_read = (char) int_read;
11                    System.out.println(int_read + "  " + char_read);
12                }
13            }
14            catch (IOException e) {
15                System.out.println("Problem: " + e);
16            }
17        }    
18    }
(Line numbers are included for the discussion below; they are not part of the code.)

The Java FileReader class does exactly what you might expect: it reads files. To get it started, as on line 7, we give it the name of the file to be read. Then, just before actually reading anything, we ask the file reader, on line 8, whether there is anything to read. If there is something to read, on line 9 we actually read it. The read returns the ASCII value of the character, so on line 10 we convert that number into an actual character so that on line 11 we can print both the ASCII value and the actual character.
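The conversion on line 10 can be tried on its own, without any file. The following is a small sketch; the class name CharCodes and its two helper methods are made up for this illustration.

```java
public class CharCodes {
    // Convert a numeric character code (like the int returned by read())
    // into the character it stands for.
    static char toChar(int code) {
        return (char) code;
    }

    // Convert a character back into its numeric code.
    static int toCode(char c) {
        return (int) c;
    }

    public static void main(String[] args) {
        int code = 65;
        char letter = toChar(code);
        System.out.println(code + "  " + letter);  // prints "65  A"
        System.out.println(toCode('I'));           // prints "73"
    }
}
```

The cast works in both directions, which is why a character can later be used to find its place in an array of counts.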

Java requires that anything that might have problems of a particular form be between "try" and "catch". For example, line 7 might have a problem if the file being opened does not exist. Similarly, line 9 might have a problem if the file disappeared between opening and line 9. If one of these bad things happens, the program immediately goes down to the catch (line 14) and executes whatever code is in the catch block (line 15). You should almost always have a print statement within the catch block. Sometimes it makes sense to have other things as well.

So, all of the code that does the actions you actually want is in a block between the try (line 6) and the catch (line 14). We will discuss "try" and "catch" in class. The easiest thing to do will be to follow the pattern in the ReadOne program. That is, make "try {" the first thing in the main method and
    catch (IOException e) {
        System.out.println("Problem: " + e);
        return;
    }    
the last.
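To see the catch block in action, you can deliberately ask for a file that does not exist. The sketch below follows the same try/catch pattern as ReadOne; the class name MissingFile, the method tryOpen, and the file name are made up for this illustration, and the call to close() is an extra not shown in ReadOne.

```java
import java.io.FileReader;
import java.io.IOException;

public class MissingFile {
    // Try to read the first character of a file and report what happened.
    static String tryOpen(String fileName) {
        try {
            FileReader r = new FileReader(fileName);  // fails if the file is missing
            int first = r.read();
            r.close();                                // release the file when done
            return "Read: " + first;
        }
        catch (IOException e) {
            // Reached immediately when opening or reading fails.
            return "Problem: " + e;
        }
    }

    public static void main(String[] args) {
        // Assumes no file with this name exists in the current directory.
        System.out.println(tryOpen("no_such_file_for_demo.txt"));
    }
}
```

When the file is missing, the line that reads the first character never runs; the program jumps straight from the failed open to the catch block and the message begins with "Problem: ".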

Collecting char counts

Create an array of type int and length 128 that will hold the number of times each character is seen. (In the file "ham.txt", the int returned by r.read() is always less than or equal to 127; hence an array of length 128 allows you to collect the count for each character.) Adjust the ReadOne program so that it reads every character from "ham.txt" rather than just the first. To do so, all you need to do is replace the "if" in line 8 with something else. (Hint: you need a loop.) You will eventually want to delete line 11 also. Each time you read a character, increment the number in the corresponding place in the array you just created.
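The idea of using a character's ASCII value as a position in the array can be sketched on a short string rather than a file. This is only an illustration of the indexing; the class name CountDemo and the method countChars are made up, and reading from "ham.txt" with a loop is still up to you.

```java
public class CountDemo {
    // Count how often each character occurs in text.
    // Assumes every character in text has a code of 127 or less.
    static int[] countChars(String text) {
        int[] counts = new int[128];       // every entry starts at 0
        for (int i = 0; i < text.length(); i++) {
            int code = text.charAt(i);     // the character's ASCII value
            counts[code]++;                // one more occurrence of that character
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countChars("Sam-I-Am");
        System.out.println(counts['m']);   // prints "2"
        System.out.println(counts['-']);   // prints "2"
        System.out.println(counts['I']);   // prints "1"
    }
}
```

Note that 'A' and 'a' land in different positions (65 and 97), so upper and lower case letters are counted separately.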

With the counts collected, all you have to do is create the output. Some suggestions about that.

Submitting

Create a readme

Use VSC to create another file in your HW4 directory. This file should be named "Readme". The contents of this file should follow this sample.

You should have at least 3 files in your HW4 directory: LetterCount.java, ham.txt and Readme. (You might also have .class files.)

Submit

If you did this work on your own computer

You will first need to copy the files from your own computer to a lab computer. To do so, you can go into the lab as with HW2 or use ssh. Either way, you will need to create an HW4 directory within your CS113 directory on the Unix machines. Recall that this can be done with the following commands:
        cd
        cd CS113 
        mkdir HW4 
    
Once you have made the HW4 directory in Unix, open a terminal on your own computer and in that terminal use "cd" to navigate to the directory containing your work for this assignment. Assuming you use the same directory structure on your own computer and in the lab, this process can be accomplished with the following commands:
        cd
        cd CS113
        cd HW4
    
Then use the scp command to copy each of the files you want to submit from your computer to the lab. For example:
        scp Readme UNIX_NAME@goldengate.cs.brynmawr.edu:CS113/HW4/Readme
    
As always, when you read "UNIX_NAME" put in your UNIX user name. Also, with each scp command you will need to enter your UNIX password.

Actually submit

Open a terminal in UNIX (again, you can use SSH to do so from your laptop) and execute the following Unix commands (assuming you put your HW4 directory into a CS113 directory in your home directory).
        cd
        cd CS113
        /home/gtowell/bin/submit -c 113 -d HW4 -p 4
    
In response to the submit command you should see a series of messages ending with:
        
        Submitting archive...
        Submission complete! Submission timestamp is 2023-08-08-15-30-28-EDT.