CS 206: Project Part 3

Project Part 3: Caching Search Results and Creating a User Interface

Due: Monday, November 28, 2011 by 11:59:59pm

You may work with one partner on this assignment. Anyone from the class is fine.

COURSE PROJECT DESCRIPTION

This assignment is the third part in a series of related assignments about the World Wide Web. Our ultimate goal is to build a Web browser with a search engine for a limited portion of the Web. Your search engine will have some of the features of common search engines such as Yahoo, Lycos, or Google. By the end of the semester you will have implemented a Web browser with a fairly sophisticated search engine. Your Web browser will have the following capabilities:

It will display a Web page given a URL
It will display connectivity information of local Web pages
It will answer questions about the connectivity of local Web pages
Its search engine will search for good matching Web pages given a query string and display the resulting URLs in order of best match to worst (we will re-define a "good match" as we add more features to the search engine).
The Web browser will automatically display the best matching URL result of a search. Most of the work involves implementing the search engine part of the Web browser, and this is the part that you will begin implementing in the first part of the project.

I strongly encourage you to work in pairs on the project assignments.

PROBLEM DESCRIPTION

This program consists of two main parts:

Part 1 of the assignment is building a GUI front-end to your search engine that also fetches and displays web pages.
Part 2 is adding caching of query results to your search engine, and speeding up the execution of queries by using the cached results of previous queries.

The solution that you submit should link the two parts together. However, you can start working on each part independently. When you add the second part to the GUI Web browser, print the caching information to the search results window as well as to stdout.

Part 1: GUI

Build a GUI front-end to your search engine. It should have the following 6 GUI components:

a URL text box (where the user can type in a url)
a fetch URL button
a display area for fetched urls
a query text box (where the user can type in a query)
a search button
a display area for search results

Your WebBrowser's main method will read in the url_list and ignore_list command line options, create a ProcessQueries object (from Project Part 2), and then create the GUI front-end.

The WebBrowser GUI will work like the following:

When the get URL button is selected, the url entered in the URL text box is evaluated, the url's webpage is fetched and its contents are displayed in the display box for fetched urls.
When the search button is selected, the query entered in the query box is evaluated and the search results box displays the ordered list of matching webpages (use your Project Part 2 solution). In addition, the webpage of the best match is fetched and displayed in the display box for fetched urls.

Your main class must be called WebBrowser.java and take in two command line arguments as in Part 2:
java cs206proj.part3.WebBrowser urlListFile ignoreFile

Part 2: Caching

Below is a demonstration of how your more efficient query processing program should behave. You will save all previous query results in a data structure called the cache. When the user types in a query, you will first check the cache to see if you have processed a similar request in the past.

#####################################

      Enter next Query String or -1 to quit

      #####################################

      artificial

   Results for your Query "artificial":

        -----------------------------------------------

        Query: artificial

              Query Result NOT Found
      in cache        <-------- print
      out whether or not query is in cache

              0 words from query
      found in cache    <-------- if not, print out
      how many words of the query are in the cache

   Matching URLs:

        cs.brynmawr.edu/~eeaton   priority: -8

      

      #####################################



      Enter next Query String or -1 to quit

      #####################################

      artificial intelligence

   Results for your Query "artificial
      intelligence":

        -----------------------------------------------

        Query: artificial intelligence

              Query Result NOT Found
      in cache

              1 words from query
      found in cache

   Matching URLs:

        cs.brynmawr.edu/~eeaton   priority: -17

      

      #####################################



      Enter next Query String or -1 to quit

      #####################################

      intelligence

   Results for your Query "intelligence":

        -----------------------------------------------

        Query: intelligence

              Query Result Found in
      cache

   Matching URLs:

        cs.brynmawr.edu/~eeaton   priority: -9

Notice that the first time the user asks about "artificial" we cannot find anything in the cache. But the next time the user asks about "artificial intelligence" we can grab the stored information about "artificial" and perform a search of WordFrequencyTrees for just the "intelligence" part. When the query "intelligence" is entered, its result simply can be obtained from the stored informatation in the cache. Also note that we want to be able to recognize that a query is the same regardless of the word order, so "artificial intelligence" is the same as "intelligence artificial".

GETTING STARTED

Part 1: GUI

Start by reading the Tutorial example about the JEditorPane class, and try out the example (TextSamplerDemoHelp.java) to get an idea of how to display .html files and search results (look at the JEditorPane's setPage and setText methods), and how to associate scroll bars with panes:

http://download.oracle.com/javase/tutorial/uiswing/components/editorpane.html

Getting your GUI to display url files and to display search result strings is easy if you use the JEditorPane's setPage and setText methods. Remember to import javax.swing.text.* and java.net.URL along with the standard Swing imports.

You will probably want to use the GridBagLayout to layout the buttons and the text display in the same panel. This will likely give the best results when the Window is re-sized (you want the display parts of the window to re-size vertically, but probably not the buttons and the search text and url text boxes).

To get the query result to print out nicely in the display, you need to convert the query results to a string (either in html or ASCII). One way to do this is to add a queryToString method (or a queryToHTMLString method) of the ProcessQuery class that takes as input the result returned by the performQuery method and converts the results to an string generating "\n" for new lines or < br > for new lines. You can then call setText on the JEditorPane to display the string.

Note: the links on the displayed webpage don't work. You do not have to add support to make links work. However, there is a way to add this functionality, and you can try it out if you'd like.

The name of your GUI class should be WebBrowser.java. This is also the main class that you will run to launch the WebBrowser.

You might also find the Java Trail on Swing helpful: http://download.oracle.com/javase/tutorial/ui/index.html

Part 2: Caching

The cache is represented as a hashtable where the key is the query string. The element stored with the query should be an instance of a new class called SavedResult. A SavedResult contains another hash table storing URLs with their count and a String answer.

hashtable cache
-------------------------------------------------------------------------
|key:		|	|	|	|	|	|	|	|
|  String  	| 	|	|	|	|	|	|	|
|element: 	|  	|	|	|	|	|	|	|
|  SavedResult	|	|	|	|	|	|	|       |
|		| 	|	|	|	|	|	|	|
|		|  |	|	|	|	|	|	|	|
-------------------|-----------------------------------------------------
		   |
                   \/
		SavedResult   
	        ---------     ----------------------------------------
		| table |---->| URL, count  |    |    |    |    |    |	
	        ---------     ----------------------------------------
		| answer|----> String containing query result, or null if  
		---------      this entry was created to satisfy a larger  
                | best  |      query of which this is a part
                | match |
                | URL   |----> The best matching URL as a string (this is for 
                ---------      automatically displaying the best matching URL
                               in your Web browser gui)


1. For each word in query, create an alphabetically ordered query
   string that is all one case and the same case as you stored tokens
   into your WordFrequencyTrees

2. Search for the query in the cache
        (a) if a match is found
            check answer field of entry
                if null, then create answer based on table
                return result 
                and add best matching URL string to this entry as well
 	(b) if a match is not found
	     (A) create a hashtable for the query string
	         for each word in the query
                 (1) if it is in the cache then get its hashtable and 
		     merge its contents with the hashtable for the query
		 (2) otherwise
		    (a) create an entry for the word by processing the 
		        WordFrequencyTrees for the word and creating a 
		        hashtable of (URL, count) entries hashed on URL
			(the result field is null)
		    (b) merge this hashtable with the hashtable for the query
	     (B) create a result priority queue from the entries in the query's hashtable
		 (priority is related to count field of URL), then use
		 this priority queue to create a String representation
	         of the query answer
	     (C) add the (query, (hashtable, answer)) to the cache
	     (D) return result to caller 


3. After each query print out:
	(a) whether or not the query was found in the cache 
	(b) if not, the number of query words that were found in the cache

The best place to implement caching is in your ProcessQueries class. However, you may also implement another class to perform the caching and make calls to the ProcessQueries class methods.

Think carefully about which implementation of the HashTable you plan to use for your cache: does it make a difference?

CLASSES

Classes you'll need for this assignment include the classes you developed for part 2 of the project, plus the following (available in your dropbox folder):

Dictionary.java -- an interface for the Dictionary ADT
DictionaryIterator.java -- an iterator for a Dictionary
ChainingHashtable.java -- a completed hashtable implementation that uses separate chaining for collisions
OpenAddressingHashtable.java -- a completed hashtable implementation that uses linear probing for collisions
TryHash.java -- test program for the hashtables
TryHashMyString.java -- test program for the hashtables that includes an example of an object with overridden hashCode() method

SUBMITTING THE ASSIGNMENT

Submit the following via dropbox:

All .java files necessary for compiling your code (including the necessary ones from part 2 of the project).
All .class files of your code located in the proper package directories (including the necessary ones from part 2 of the project).
An html_ignore File, containing tokens that should be ignored from an html input file.
A urlListFile of URLs on which you tested your program containing 100 - 200 URLs.
A README file with your name and your partner's name (if you had one).

If you work with a partner, please only one of you submit your joint solution via dropbox.

This assignment is based on a course project created by Tia Newhall and Lisa Meeden.