You may work with one partner on this assignment. That person must be different from your partner for part 1 of the project.
I strongly encourage you to work in pairs on the project assignments.
For this program you will implement part of a web search engine that orders web pages based on how well they match a search query. The best match is the web page with the highest word frequency counts for the words in the query string. Your main class for this assignment should be called ProcessQueries and will be called as follows:
The urlListFile should contain a list of URLs, one per line. An example urlListFile might contain:
http://cs.brynmawr.edu/~eeaton
http://cs.brynmawr.edu/~dkumar
http://cs.brynmawr.edu/~dblank
http://cs.brynmawr.edu/~dxu
http://cs.brynmawr.edu/Courses/cs206/fall2011/index.html
http://cs.brynmawr.edu/Courses/cs206/fall2011/assignment/proj1.html
http://cs.brynmawr.edu/Courses/cs206/fall2011/assignment/proj2.html
The ignoreFile should contain a list of words that you would like to ignore as you count word frequencies in html files (just as you did in the last assignment).
In order to process queries from a user, you'll need to create a new class that joins together a URL string with a WordFrequencyTree representing that web page's content. Call this class URLContent. Your program should create a list of URLContent objects, one for each URL that appears in the urlListFile.
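As a rough illustration of the pairing described above, here is a minimal sketch of what URLContent might look like. It assumes a Map<String, Integer> as a stand-in for WordFrequencyTree, since the real interface of that class comes from part 1 of the project. The method names getUrl and countOf are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Pairs a URL string with the word-frequency data for that page.
// ASSUMPTION: a Map<String, Integer> stands in for WordFrequencyTree;
// swap in your own tree class from part 1.
class URLContent {
    private final String url;
    private final Map<String, Integer> frequencies;

    URLContent(String url, Map<String, Integer> frequencies) {
        this.url = url;
        // Copy the map so later changes by the caller do not affect this object.
        this.frequencies = new HashMap<>(frequencies);
    }

    String getUrl() {
        return url;
    }

    // Returns how often the word occurred on the page, or 0 if it never did.
    int countOf(String word) {
        return frequencies.getOrDefault(word, 0);
    }
}
```

Keeping the URL and its frequency data in one object means the search step only has to walk a single list.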
Once you have processed all the URLs in the list (you should gracefully handle invalid URLs), your program will enter a loop as shown below, which prompts the user to enter a search query (or Q to quit), and then lists all URLs that match the query, with the best match first and the worst match last. Include each result URL's priority in parentheses after each result. URLs of web pages that do not contain any of the words in the query should not appear in the result list.
Enter a query or Q to quit.
Search for: neural networks
Relevant pages:
http://cs.brynmawr.edu/~dblank (priority = x)
http://cs.brynmawr.edu/~dkumar (priority = y)
Search for: artificial intelligence
Relevant pages:
http://cs.brynmawr.edu/~eeaton (priority = x)
http://cs.brynmawr.edu/~dkumar (priority = y)
http://cs.brynmawr.edu/~dblank (priority = z)
Search for: Q
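The prompt loop in the sample session could be structured roughly as follows. This is only a sketch: the class name QueryLoopSketch and the search function passed in are hypothetical placeholders for whatever your own program uses to produce the result lines.

```java
import java.io.PrintStream;
import java.util.List;
import java.util.Scanner;
import java.util.function.Function;

class QueryLoopSketch {
    // Reads queries from 'in' until the user types Q (or input ends) and
    // prints whatever result lines the supplied search function returns.
    // ASSUMPTION: 'search' is a placeholder for your own ranking code.
    static void run(Scanner in, PrintStream out, Function<String, List<String>> search) {
        out.println("Enter a query or Q to quit.");
        while (true) {
            out.print("Search for: ");
            if (!in.hasNextLine()) {
                break;                      // input ended without a Q
            }
            String query = in.nextLine().trim();
            if (query.equals("Q")) {
                break;
            }
            out.println("Relevant pages:");
            for (String line : search.apply(query)) {
                out.println(line);
            }
        }
    }
}
```

Passing the Scanner and PrintStream in as parameters, rather than using System.in and System.out directly, makes the loop easy to test with canned input.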
To find the results of the query in order, you will process each WordFrequencyTree in the list of URLContent objects, create a priority queue element for it, and add it to a priority queue for the search. Then use the priority queue to print out the matching URLs in order. The priority value is based on how well the web page matches the words in the query. Remember that in a priority queue, low values correspond to high priority.
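One way to make "low value means high priority" work with frequency counts is to negate the count before using it as the key. The sketch below shows that idea under two assumptions: java.util.PriorityQueue stands in for the HeapPriorityQueue class you are given, and the per-URL match counts are already computed. The names RankingSketch, Entry, and rank are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class RankingSketch {
    // One queue element: a URL plus its key (lower key = better match).
    static final class Entry {
        final String url;
        final int key;

        Entry(String url, int key) {
            this.url = url;
            this.key = key;
        }
    }

    // matchCounts maps each URL to its total frequency count for the query.
    // ASSUMPTION: the key is the negated count, so the largest count gets
    // the smallest key and leaves the queue first.
    static List<String> rank(Map<String, Integer> matchCounts) {
        PriorityQueue<Entry> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.key, b.key));
        for (Map.Entry<String, Integer> e : matchCounts.entrySet()) {
            if (e.getValue() > 0) {         // pages with no matching words are left out
                queue.add(new Entry(e.getKey(), -e.getValue()));
            }
        }
        List<String> lines = new ArrayList<>();
        while (!queue.isEmpty()) {
            Entry best = queue.poll();      // smallest key comes out first
            lines.add(best.url + " (priority = " + best.key + ")");
        }
        return lines;
    }
}
```

Other key choices are possible; what matters is that a better match always produces a smaller key.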
The first part of this assignment will be figuring out how to use the classes given to you. Once you have run the test programs for these classes, and understand how they work, then you can start implementing code.
Start by implementing the insert method in the HeapPriorityQueue class. Test that this works before moving on to the next part.
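For reference, the usual way to insert into an array-based min-heap is to append the new key at the end and bubble it up while it is smaller than its parent. The sketch below shows that technique on a standalone heap of int keys. MinHeapSketch is a hypothetical class, not the HeapPriorityQueue you are given, whose fields and method signatures may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration of up-heap bubbling on a min-heap of int keys.
// ASSUMPTION: the heap is stored in a list with the root at index 0.
class MinHeapSketch {
    private final List<Integer> heap = new ArrayList<>();

    void insert(int key) {
        heap.add(key);                          // place at the next free leaf
        int child = heap.size() - 1;
        while (child > 0) {
            int parent = (child - 1) / 2;
            if (heap.get(parent) <= heap.get(child)) {
                break;                          // heap order restored
            }
            int tmp = heap.get(parent);         // swap child with its parent
            heap.set(parent, heap.get(child));
            heap.set(child, tmp);
            child = parent;
        }
    }

    // Returns the smallest key without removing it.
    int min() {
        return heap.get(0);
    }

    int size() {
        return heap.size();
    }
}
```

A quick test is to insert keys in several different orders and check that min() always reports the smallest one so far.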
Next, implement the part of your program that processes the urlListFile. Use the Scanner class to read in the webpage associated with each URL. For each webpage, calculate its word frequencies. Determine a scoring method based on these word frequencies that ranks the relative importance of a webpage to the query.
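Reading a page with Scanner and skipping bad URLs might look something like the sketch below. It assumes that words are runs of letters, that the ignore list is already loaded into a Set, and that a Map stands in for WordFrequencyTree. The names PageReaderSketch, countWords, and countWordsAt are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

class PageReaderSketch {
    // Counts lower-cased words from the stream, skipping ignored tokens.
    // ASSUMPTION: a "word" is any run of ASCII letters.
    static Map<String, Integer> countWords(InputStream in, Set<String> ignore) {
        Map<String, Integer> counts = new HashMap<>();
        try (Scanner scanner = new Scanner(in)) {
            scanner.useDelimiter("[^A-Za-z]+");
            while (scanner.hasNext()) {
                String word = scanner.next().toLowerCase();
                if (!word.isEmpty() && !ignore.contains(word)) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns null when the URL is malformed or the page cannot be read,
    // so the caller can skip it and carry on with the rest of the list.
    static Map<String, Integer> countWordsAt(String address, Set<String> ignore) {
        try (InputStream in = URI.create(address).toURL().openStream()) {
            return countWords(in, ignore);
        } catch (IOException | IllegalArgumentException e) {
            System.err.println("Skipping " + address + ": " + e.getMessage());
            return null;
        }
    }
}
```

Separating the counting from the network access lets you test the counting on a small in-memory string before pointing the program at real pages.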
Next, implement the part that reads in a search query and builds a priority queue by inserting (URLContent, key) pairs, where the key is the priority of the URL's WordFrequencyTree based on how well it matches the query string. Then print out the matching URLs in order from best to worst match.
Your program should handle multi-word queries and return the best matches based on all words in the query. For example, the query "computer science department" should search each URL's WordFrequencyTree for all three words to determine the URL's priority. You should gradually expand your urlListFile until it contains 100 - 200 URLs.
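One simple scoring method for multi-word queries is to add up the frequency of every query word on the page. The sketch below shows that approach. It is only one possible choice, and it again assumes a Map in place of WordFrequencyTree. The names QueryScoreSketch and score are hypothetical.

```java
import java.util.Map;

class QueryScoreSketch {
    // Sums the page's frequency count for each word in the query.
    // ASSUMPTION: frequencies are stored under lower-cased words.
    static int score(Map<String, Integer> frequencies, String query) {
        int total = 0;
        for (String word : query.trim().toLowerCase().split("\\s+")) {
            total += frequencies.getOrDefault(word, 0);
        }
        return total;
    }
}
```

A score of 0 means the page contains none of the query words, so it should be left out of the result list.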
Classes you'll need for this assignment include the classes you developed for part 1 of the project plus the following:
You can find starting code for this project at /home/eeaton/public/cs206/Proj2.zip on the Bryn Mawr CS Linux systems. Just copy it to your own folder and unzip it:
cp /home/eeaton/public/cs206/Proj2.zip ~
unzip ~/Proj2.zip
If you are working remotely on your own computer, try:
scp USERNAME@powerpuff.brynmawr.edu:/home/eeaton/public/cs206/Proj2.zip .
unzip Proj2.zip
html_ignore: file containing tokens that should be ignored from an HTML input file.
urlListFile: file of URLs on which you tested your program, containing 100 - 200 URLs.
Lastname1Lastname2-Proj1.zip (or .tar).
The person who does NOT submit the project should include a simple text file in their dropbox folder called README that states the name of the person they teamed with for this assignment.