CS 206: Project Part 2

Project Part 2: Finding the Most Relevant Webpages

Due: November 28, 2012 by 11:59:59pm

You may work with one partner on this assignment. That person must be different from your partner for part 1 of the project.

COURSE PROJECT DESCRIPTION

This assignment is part of a series of related assignments about the World Wide Web. Our ultimate goal is to build a Web browser with a search engine for a limited portion of the Web. Your search engine will have some of the features of common search engines such as Yahoo, Bing, or Google. By the end of the semester you will have implemented a Web browser with a fairly sophisticated search engine. Your Web browser will have the following capabilities:

- It will display a Web page given a URL.
- It will display connectivity information of local Web pages.
- It will answer questions about the connectivity of local Web pages.
- Its search engine will search for good matching Web pages given a query string and display the resulting URLs in order of best match to worst (we will re-define a "good match" as we add more features to the search engine).
- The Web browser will automatically display the best matching URL result of a search.

Most of the work involves implementing the search engine part of the Web browser, and this is the part that you will begin implementing in the first part of the project.

I strongly encourage you to work in pairs on the project assignments.

PROBLEM DESCRIPTION

For this program you will implement part of a web search engine that orders web pages based on how well they match a search query. The best match is the web page with the highest word frequency counts for the words in the query string. Your main class for this assignment should be called ProcessQueries and will be called as follows:
The urlListFile should contain a list of URLs, one per line. An example urlListFile might contain:
The ignoreFile should contain a list of words that you would like to ignore as you count word frequencies in html files (just as you did in the last assignment).

In order to process queries from a user, you'll need to create a new class that joins together a URL string with a WordFrequencyTree representing that web page's content. Call this class URLContent. Your program should create a list of URLContent objects, one for each URL that appears in the urlListFile.
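
As a rough illustration of the pairing described above, here is a minimal sketch of what URLContent could look like. The WordFrequencyTree inside it is only a stand-in so the example runs on its own: the add and getCount method names are assumptions, and your part 1 class may expose a different interface. The accessor names getURL and getTree and the example URL are also hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class URLContentSketch {

    // Stand-in for the part 1 WordFrequencyTree. The method names add() and
    // getCount() are assumptions made for this sketch only.
    static class WordFrequencyTree {
        private final TreeMap<String, Integer> counts = new TreeMap<>();

        void add(String word) {
            counts.merge(word, 1, Integer::sum);
        }

        int getCount(String word) {
            return counts.getOrDefault(word, 0);
        }
    }

    // Pairs a URL string with the word frequencies of the page at that URL.
    static class URLContent {
        private final String url;
        private final WordFrequencyTree tree;

        URLContent(String url, WordFrequencyTree tree) {
            this.url = url;
            this.tree = tree;
        }

        String getURL() {
            return url;
        }

        WordFrequencyTree getTree() {
            return tree;
        }
    }

    public static void main(String[] args) {
        WordFrequencyTree tree = new WordFrequencyTree();
        tree.add("computer");
        tree.add("science");
        tree.add("computer");

        // One URLContent per URL in the urlListFile; the URL here is a placeholder.
        List<URLContent> pages = new ArrayList<>();
        pages.add(new URLContent("http://example.com/cs", tree));

        for (URLContent page : pages) {
            System.out.println(page.getURL() + " computer=" + page.getTree().getCount("computer"));
        }
    }
}
```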

Once you have processed all the URLs in the list (you should gracefully handle invalid URLs), your program will enter a loop as shown below, which prompts the user to enter a search query (or Q to quit), and then lists all URLs that match the query in order of the best match first and the worst match last. Include each result URL's priority in parentheses after each result. URLs of web pages that do not contain any of the words in the query should not appear in the result list.

To find the results of the query in order, you will process each WordFrequencyTree in the list of URLContent objects, create a priority queue element for it, and add it to a priority queue for the search. Then use the priority queue to print out the matching URLs in order. The priority value is based on how well the web page matches the words in the query. Remember that in a priority queue, lower values mean higher priority.
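
One possible way to turn match quality into a priority is sketched below. It is only an illustration under several assumptions: java.util.PriorityQueue stands in for the HeapPriorityQueue you are given, plain maps of word counts stand in for WordFrequencyTree objects, the score is simply the sum of the query words' frequencies, and the key is the negated score so that the best match has the lowest value. The names QueueItem, score, and rank are hypothetical, and printing the raw key in parentheses is just one choice.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class QueryRankingSketch {

    // A (url, key) pair; items with smaller keys leave the queue first.
    static class QueueItem implements Comparable<QueueItem> {
        final String url;
        final int key;

        QueueItem(String url, int key) {
            this.url = url;
            this.key = key;
        }

        @Override
        public int compareTo(QueueItem other) {
            return Integer.compare(key, other.key);
        }
    }

    // Sums the frequency of every query word on one page.
    static int score(Map<String, Integer> counts, String[] queryWords) {
        int total = 0;
        for (String word : queryWords) {
            total += counts.getOrDefault(word, 0);
        }
        return total;
    }

    // Returns result lines of the form "url (key)", best match first.
    static List<String> rank(Map<String, Map<String, Integer>> pages, String query) {
        String[] words = query.trim().toLowerCase().split("\\s+");
        PriorityQueue<QueueItem> queue = new PriorityQueue<>();
        for (Map.Entry<String, Map<String, Integer>> page : pages.entrySet()) {
            int s = score(page.getValue(), words);
            if (s > 0) {                                      // skip pages with no query words
                queue.add(new QueueItem(page.getKey(), -s));  // negate: lowest key is the best match
            }
        }
        List<String> results = new ArrayList<>();
        while (!queue.isEmpty()) {
            QueueItem item = queue.poll();
            results.add(item.url + " (" + item.key + ")");
        }
        return results;
    }

    public static void main(String[] args) {
        // Placeholder URLs and hand-made word counts for illustration.
        Map<String, Map<String, Integer>> pages = new LinkedHashMap<>();
        pages.put("http://example.com/a", Map.of("computer", 3, "science", 1));
        pages.put("http://example.com/b", Map.of("computer", 1));
        pages.put("http://example.com/c", Map.of("art", 5));

        for (String line : rank(pages, "computer science")) {
            System.out.println(line);
        }
    }
}
```

With these sample counts, page a scores 4 and page b scores 1, so a is printed first, and page c is left out because it contains neither query word.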

GETTING STARTED

The first part of this assignment will be figuring out how to use the classes given to you. Once you have run the test programs for these classes, and understand how they work, then you can start implementing code.

Start by implementing the insert method in the HeapPriorityQueue class. Test that this works before moving on to the next part.
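
For reference, the general idea behind inserting into a binary min-heap is sketched below: place the new key in the last position, then swap it upward while it is smaller than its parent. This is a generic illustration on integer keys stored in a list. The provided HeapPriorityQueue has its own fields and element types, so the class name MinHeapSketch and the helpers min, size, and isHeap are assumptions and will not match the starter code.

```java
import java.util.ArrayList;
import java.util.List;

class MinHeapSketch {
    // The heap is stored level by level; the parent of index i is (i - 1) / 2.
    private final List<Integer> keys = new ArrayList<>();

    // Adds the key at the end, then bubbles it up toward the root.
    void insert(int key) {
        keys.add(key);
        int child = keys.size() - 1;
        while (child > 0) {
            int parent = (child - 1) / 2;
            if (keys.get(child) >= keys.get(parent)) {
                break;                      // heap order restored
            }
            int tmp = keys.get(parent);     // swap child with its parent
            keys.set(parent, keys.get(child));
            keys.set(child, tmp);
            child = parent;
        }
    }

    // Smallest key, which is the highest-priority element.
    int min() {
        return keys.get(0);
    }

    int size() {
        return keys.size();
    }

    // Checks that no key is smaller than its parent.
    boolean isHeap() {
        for (int i = 1; i < keys.size(); i++) {
            if (keys.get(i) < keys.get((i - 1) / 2)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        MinHeapSketch heap = new MinHeapSketch();
        for (int key : new int[] {5, 3, 8, 1, 4}) {
            heap.insert(key);
        }
        System.out.println("min=" + heap.min() + " valid=" + heap.isHeap());
    }
}
```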

Next, implement the part of your program that processes the urlListFile. Use the Scanner class to read in the webpage associated with each URL. For each webpage, calculate the word frequencies for the corresponding web page. Determine a scoring method based on these word frequencies that ranks the relative importance of a webpage to the query.
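
The counting step might look roughly like the sketch below. It uses java.util.Scanner and a HashMap only so the example is self-contained. In your program the provided Scanner class and your WordFrequencyTree would take their place, and their method names may differ. The names countWords and countWordsAtUrl, the letter-only tokenization, and the choice to return null for a bad URL are all assumptions.

```java
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

class WordCountSketch {

    // Counts lowercase words from the scanner, skipping anything in ignore.
    static Map<String, Integer> countWords(Scanner in, Set<String> ignore) {
        Map<String, Integer> counts = new HashMap<>();
        in.useDelimiter("[^A-Za-z]+");      // treat every run of non-letters as a separator
        while (in.hasNext()) {
            String word = in.next().toLowerCase();
            if (word.isEmpty() || ignore.contains(word)) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Returns null instead of crashing when the URL is malformed or unreachable.
    static Map<String, Integer> countWordsAtUrl(String url, Set<String> ignore) {
        try (Scanner in = new Scanner(URI.create(url).toURL().openStream())) {
            return countWords(in, ignore);
        } catch (IOException | IllegalArgumentException e) {
            System.err.println("Skipping " + url + ": " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        Set<String> ignore = Set.of("p");   // e.g. HTML tag names from the ignore file
        Scanner page = new Scanner("<p>Hello world, hello CS!</p>");
        System.out.println(countWords(page, ignore));

        // An invalid URL is reported and skipped rather than ending the program.
        System.out.println(countWordsAtUrl("not a url", ignore));
    }
}
```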

Next, implement the part that reads in a search query and builds a priority queue by inserting (URLContent, key) pairs, where the key is the priority of the URL's WordFrequencyTree based on how well it matches the query string. Then print out the matching URLs in order of best to worst match.
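
The outer prompt loop could be structured along these lines. This sketch only collects the queries it reads. A real program would run the search and print results at the marked line. It assumes java.util.Scanner for input, and the prompt wording and the readQueries name are hypothetical. The demo feeds the loop from a string so it runs without a keyboard, whereas your program would read from System.in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class QueryLoopSketch {

    // Reads one query per line until the user enters Q or input ends.
    static List<String> readQueries(Scanner in) {
        List<String> queries = new ArrayList<>();
        while (true) {
            System.out.print("Enter a search query (or Q to quit): ");
            if (!in.hasNextLine()) {
                break;                      // end of input
            }
            String line = in.nextLine().trim();
            if (line.equals("Q")) {
                break;
            }
            if (line.isEmpty()) {
                continue;                   // ignore blank lines and prompt again
            }
            queries.add(line);              // a real program would search and print results here
        }
        return queries;
    }

    public static void main(String[] args) {
        // Simulated user input; replace with new Scanner(System.in) for interactive use.
        Scanner input = new Scanner("computer science department\nswimming\nQ\n");
        List<String> queries = readQueries(input);
        System.out.println();
        System.out.println("Processed " + queries.size() + " queries");
    }
}
```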

Your program should handle multiple-word queries, and return the best matches based on all words in the query. For example, the query "computer science department" should search each URL's WordFrequencyTree for all three words to determine the URL's priority. You should gradually expand your urlListFile until it contains 100 - 200 URLs.

CLASSES

Classes you'll need for this assignment include the classes you developed for part 1 of the project plus the following:
- PriorityQueue interface
- HeapPriorityQueue class
- Scanner class for scanning either an input string or a file or a URL (the ability to read URLs is new; it works the same as for files).
- TryHeap class: a simple program that tests the HeapPriorityQueue class.
- TryScanner class: a simple program that tests the Scanner class. It demonstrates how to use the Scanner class to parse a webpage.

You can find starting code for this project at /home/eeaton/public/cs206/Proj2.zip on the Bryn Mawr CS Linux systems. Just copy it to your own folder and unzip it:

    cp /home/eeaton/public/cs206/Proj2.zip ~
    unzip ~/Proj2.zip

If you are working remotely on your own computer, try:

    scp USERNAME@powerpuff.brynmawr.edu:/home/eeaton/public/cs206/Proj2.zip .
    unzip Proj2.zip

ECLIPSE PROJECT ORGANIZATION -- Important!

You should work within the same Eclipse project that you did for part 1. That project should contain two packages: cs206proj.part1 and cs206proj.part2. The part1 package will contain all of your code from part 1 of the project. All new code should be placed within the cs206proj.part2 package. This will enable you to build off part 1 of the project easily.

When you submit your code, you must be certain to submit all of the code from part 1 that you used in this project as well.

SUBMITTING THE ASSIGNMENT

Submit the following via dropbox:
1. All .java files necessary for compiling your code (including the necessary ones from part 1 of the project).
2. All .class files of your code located in the proper package directories (including the necessary ones from part 1 of the project).
3. An html_ignore file containing tokens that should be ignored in an HTML input file.
4. A urlListFile of URLs on which you tested your program containing 100 - 200 URLs.
5. A README file with your name and your partner's name (if you had one).
If you developed in Eclipse, simply compress the entire project directory into a single zip or tar file and submit this via dropbox, as detailed in the assignment submission instructions.

Once you have submitted your code, I strongly encourage you to test it, either from within dropbox using the command line, or by copying it to a clean location and then attempting to run it.

If you work with a partner, only one of you should submit your joint solution via dropbox. Name your submission Lastname1Lastname2-Proj1.zip (or .tar). The person who does NOT submit the project should include a simple text file in their dropbox folder called README that states the name of the person they teamed with for this assignment.

This assignment is based on a course project created by Tia Newhall and Lisa Meeden.
