You may work with one partner on this assignment. That person must be different from your partner for the previous assignment.
I strongly encourage you to work in pairs on the project assignments.
The first step to building a search engine is to be able to analyze web pages based on their content. This will eventually allow us to rate how relevant a particular page is for a given user request. The content of a page is obviously correlated with the words it contains. For this assignment you will use a binary search tree to help you calculate and store the frequencies of words found in a given web page. Then you will print out all words (in alphabetical order) that occur at least the minimum frequency number of times and print each of these word's frequency count.
WORD FREQUENCY ---- --------- artifical 5 cat 3 dog 7 ...
Begin by experimenting with the Scanner class and its
associated test
class provided below. The Scanner class is written to scan
either a file
or a string. For this assignment you will use the Scanner
class for scanning
a file. The constructor for the Scanner class requires a
filename. It then opens the given file and will return the
next token that it finds when the method getNextToken()
is called. For our purposes a token will be a single
non-alphabetic and non-numeric character, such as punctuation,
or a string of contiguous alphabetic and numeric characters up
until white space is encountered or the end of file.
Once you understand how the Scanner class works, modify the TestScanner class to read in a filename from the command line, and test that it works (see the section on Command Line Arguments below).
Next, you need to implement the insertElement
,
findMinimumNode
, and findMaximumNode
methods of the LinkedBinarySearchTree class. Make sure that
your insertElement
does not allow
duplicate elements to be inserted in the tree. Once you have
tested these
changes to the LinkedBinaryTree class, you can start
implementing the main
part of this assignment.
Your program will take two or three command line arguments:
% java cs206proj.part1.WordFrequencyTree input_file_name min_freq_num [ignore_file_name]
min_freq_num
, is used to determine which
elements in your WordFrequency
tree your program prints out (only words with a frequency
count >= min_freq_num). WordFrequency
object and add
it to a binary search tree sorted alphabetically. Only
unique words should be added to the tree. When a duplicate
word is encountered, just update the frequency count of its
WordFrequency
object that is already
in the tree.
After processing the input file. Your program should print out
all words
in the WordFrequencyTree in alphabetical order that
occur at least min_freq_num
times. With each
word you should list
its frequency count.
Command Line Arguments
Recall that command line arguments are passed into a main
method as an array of Strings:
public static void main(String args[]) {To get the number of command line args, check value of
args.length
and to get a particular command line argument access the
corresponding args array entry (arg[0]
is the first
command line argument as a String). To convert a command line
argument from its String form to
another type, use the valueOf
methods of the
Integer, Float, etc. classes. For example:
# if this is the command line # % java MatrixMult 10 15 matrix_contents # # This is what the code in the main method might look like that # converts the command line args 10 and 15 to int values # public static void main(String args[]) { if(args.length < 3) { System.out.println( "Usage: MatrixMult num_rows num_cols" + " matrix_contents_file"); } // arg[0] is the string "10" in the example command line above int m = (Integer.valueOf(args[0])).intValue(); // arg[1] is the string "15" int n = (Integer.valueOf(args[1])).intValue(); // arg[2] is the string "matrix_contents" String matrixFile = args[2];
WordFrequencyTree Class
You should create a WordFrequencyTree
class that
implements the BinarySearchTree
interface. This
class will have a LinkedBinaryTree
data member
(it may need additional data members too), and its
implementation of
the BinarySearchTree
interface methods just
invoke the corresponding method on its LinkedBinaryTree
data member. In addition it should have at least three
constructors: a default constructor, a constructor that takes
an input file argument and a constructor that takes an input
file and an ignore file argument.
The second two constructors will process the file(s) and
create the initial word frequency tree as specified above.
The following are examples of what your command line should look like, and what your program should do for different calls
# program will add WordFrequency objects for each token # in the index.html input file that is not on the html_ignore_list # and will print out in alphabetical order all words that occur # at least 3 times in the tree # % java cs206proj.part1.WordFrequencyTree index.html 3 html_ignore_list # # program will add WordFrequency objects for each token # in the index.html input file and will print out in alphabetical # order all words that occur at least 4 times in the tree # % java cs206proj.part1.WordFrequencyTree index.html 4 # # program will print a usage message to stdout and exit # (ex) usage: WordFrequencyTree inputfile min_freq_num # or: WordFrequencyTree inputfile min_freq_num ignorelist # % java cs206proj.part1.WordFrequencyTree index.html
Part of this assignment includes developing a good ignorelist
file for html files. As you test your program for different
.html input files, develop a list of words that should be
ignored from html files, and use this list as the optional
third command line argument to your program. You will submit
your html_ignore
file as part of your assignment
solution.
testscannerfile
html_ignore
file, containing tokens that
should be ignored from
an html input file.