It is strongly recommended that you
work with one partner on this assignment.
I strongly encourage you to work in pairs on the project assignments.
The first step to building a search engine is to be able to analyze web pages based on their content. This will eventually allow us to rate how relevant a particular page is for a given user request. The content of a page is obviously correlated with the words it contains. For this assignment you will use a binary search tree to help you calculate and store the frequencies of words found in a given web page. Then you will print out all words (in alphabetical order) that occur at least the minimum frequency number of times and print each of these word's frequency count.
WORD FREQUENCY ---- --------- artifical 5 cat 3 dog 7 ...
Begin by experimenting with the Scanner class and its
associated test class provided below. The Scanner class is
written to scan either a file or a string. For this assignment
you will use the Scanner class for scanning a file. The
constructor for the Scanner class requires a filename. It then
opens the given file and will return the next token that it
finds when the method getNextToken()
is called.
For our purposes a token will be a single non-alphabetic and
non-numeric character, such as punctuation, or a string of
contiguous alphabetic and numeric characters up until white
space is encountered or the end of file.
Once you understand how the Scanner class works, modify the TestScanner class to read in a filename from the command line, and test that it works (see the section on Command Line Arguments below).
Next, you need to implement the insertElement
,
findMinimumNode
, and findMaximumNode
methods of the LinkedBinarySearchTree class. Make sure that
your insertElement
does not allow duplicate
elements to be inserted in the tree. Once you have tested
these changes to the LinkedBinaryTree class, you can start
implementing the main part of this assignment.
Your program will take two or three command line arguments:
% java cs206proj.part1.WordFrequencyTree input_file_name min_freq_num [ignore_file_name]
min_freq_num
, is used
to determine which elements in your WordFrequency tree your
program prints out (only words with a frequency count >=
min_freq_num). WordFrequency
object and add
it to a binary search tree sorted alphabetically. Only
unique words should be added to the tree. When a duplicate
word is encountered, just update the frequency count of its
WordFrequency
object that is already in the
tree. After processing the input file. Your program should print
out all words in the WordFrequencyTree in alphabetical
order that occur at least min_freq_num
times. With each word you should list its frequency count.
Command Line Arguments
Recall that command line arguments are passed into a main
method as an array of Strings:
public static void main(String args[]) {To get the number of command line args, check value of
args.length
and to get a particular command line argument access the
corresponding args array entry (arg[0]
is the first
command line argument as a String). To convert a command line
argument from its String form to another type, use the valueOf
methods of the Integer, Float, etc. classes. For example:
# if this is the command line # % java MatrixMult 10 15 matrix_contents # # This is what the code in the main method might look like that # converts the command line args 10 and 15 to int values # public static void main(String args[]) { if(args.length < 3) { System.out.println( "Usage: MatrixMult num_rows num_cols" + " matrix_contents_file"); } // arg[0] is the string "10" in the example command line above int m = (Integer.valueOf(args[0])).intValue(); // arg[1] is the string "15" int n = (Integer.valueOf(args[1])).intValue(); // arg[2] is the string "matrix_contents" String matrixFile = args[2];
WordFrequencyTree Class
You should create a WordFrequencyTree
class
that implements the BinarySearchTree
interface.
This class will have a LinkedBinaryTree
data
member (it may need additional data members too), and its
implementation of the BinarySearchTree
interface
methods just invoke the corresponding method on its LinkedBinaryTree
data member. In addition it should have at least three
constructors: a default constructor, a constructor that takes
an input file argument and a constructor that takes an input
file and an ignore file argument. The second two constructors
will process the file(s) and create the initial word frequency
tree as specified above.
The following are examples of what your command line should look like, and what your program should do for different calls
# program will add WordFrequency objects for each token # in the index.html input file that is not on the html_ignore_list # and will print out in alphabetical order all words that occur # at least 3 times in the tree # % java cs206proj.part1.WordFrequencyTree index.html 3 html_ignore_list # # program will add WordFrequency objects for each token # in the index.html input file and will print out in alphabetical # order all words that occur at least 4 times in the tree # % java cs206proj.part1.WordFrequencyTree index.html 4 # # program will print a usage message to stdout and exit # (ex) usage: WordFrequencyTree inputfile min_freq_num # or: WordFrequencyTree inputfile min_freq_num ignorelist # % java cs206proj.part1.WordFrequencyTree index.html
Part of this assignment includes developing a good
ignorelist file for html files. As you test your program for
different .html input files, develop a list of words that
should be ignored from html files, and use this list as the
optional third command line argument to your program. You will
submit your html_ignore
file as part of your
assignment solution.
testscannerfile
You can find starting code for this project at /home/eeaton/public/cs206/Proj1.zip
on the
Bryn Mawr CS linux systems. Just copy it to your own
folder and unzip it:
cp /home/eeaton/public/cs206/Proj1.zip ~
If you are working remotely on
your own computer, try:
unzip ~/Proj1.zip
scp USERNAME@powerpuff.brynmawr.edu:/
home/eeaton/public/cs206/Proj1.zip .
unzip Proj1.zip
html_ignore
file, containing tokens that
should be ignored from an html input file. If you develop your project in Eclipse, compress the entire project directory into a single zip or tar file and submit this via dropbox. For example, if you called your project CS206Project, then the project directory should look something like the following:
CS206Project/bin/cs206proj/part1/
and the .java files would all be under CS206Project/src/cs206proj/part1/
.
The html_ignore
and README
files
would go under CS206Project/
.Lastname1Lastname2-Proj1.zip
(or .tar
). The
person who does NOT submit the project should include a simple
text file in their dropbox folder called README that states
the name of the person they teamed with for this assignment.