Project Part 3:  Caching Search Results and Creating a User Interface

Due: Monday, November 28, 2011 by 11:59:59pm

You may work with one partner on this assignment.  Anyone from the class is fine.


COURSE PROJECT DESCRIPTION

This assignment is the third part in a series of related assignments about the World Wide Web. Our ultimate goal is to build a Web browser with a search engine for a limited portion of the Web. Your search engine will have some of the features of common search engines such as Yahoo, Lycos, or Google. By the end of the semester you will have implemented a Web browser with a fairly sophisticated search engine. Your Web browser will have the following capabilities:

I strongly encourage you to work in pairs on the project assignments.


PROBLEM DESCRIPTION

This program consists of two main parts:

The solution that you submit should link the two parts together. However, you can start working on each part independently. When you add the second part to the GUI Web browser, print the caching information to the search results window as well as to stdout.
Part 1: GUI

Build a GUI front-end to your search engine. It should have the following 6 GUI components:
  1. a URL text box (where the user can type in a url)
  2. a fetch URL button
  3. a display area for fetched urls
  4. a query text box (where the user can type in a query)
  5. a search button
  6. a display area for search results

Your WebBrowser's main method will read in the url_list and ignore_list command line options, create a ProcessQueries object (from Project Part 2), and then create the GUI front-end.

The WebBrowser GUI will work like the following:


Your main class must be called WebBrowser.java and take in two command line arguments as in Part 2:  
java cs206proj.part3.WebBrowser urlListFile ignoreFile
Part 2: Caching

Below is a demonstration of how your more efficient query processing program should behave. You will save all previous query results in a data structure called the cache. When the user types in a query, you will first check the cache to see if you have processed a similar request in the past.

#####################################
Enter next Query String or -1 to quit
#####################################
artificial

  Results for your Query "artificial":
  -----------------------------------------------
  Query: artificial
        Query Result NOT Found in cache        <-------- print out whether or not query is in cache
        0 words from query found in cache    <-------- if not, print out how many words of the query are in the cache

  Matching URLs:
  cs.brynmawr.edu/~eeaton   priority: -8

#####################################

Enter next Query String or -1 to quit
#####################################
artificial intelligence

  Results for your Query "artificial intelligence":
  -----------------------------------------------
  Query: artificial intelligence
        Query Result NOT Found in cache
        1 words from query found in cache

  Matching URLs:
  cs.brynmawr.edu/~eeaton   priority: -17

#####################################

Enter next Query String or -1 to quit
#####################################
intelligence

  Results for your Query "intelligence":
  -----------------------------------------------
  Query: intelligence
        Query Result Found in cache

  Matching URLs:
  cs.brynmawr.edu/~eeaton   priority: -9


Notice that the first time the user asks about "artificial" we cannot find anything in the cache. But the next time the user asks about "artificial intelligence" we can grab the stored information about "artificial" and perform a search of WordFrequencyTrees for just the "intelligence" part. When the query "intelligence" is entered, its result simply can be obtained from the stored informatation in the cache. Also note that we want to be able to recognize that a query is the same regardless of the word order, so "artificial intelligence" is the same as "intelligence artificial".



GETTING STARTED

Part 1: GUI

Start by reading the Tutorial example about the JEditorPane class, and try out the example (TextSamplerDemoHelp.java) to get an idea of how to display .html files and search results (look at the JEditorPane's setPage and setText methods), and how to associate scroll bars with panes:
http://download.oracle.com/javase/tutorial/uiswing/components/editorpane.html
Getting your GUI to display url files and to display search result strings is easy if you use the JEditorPane's setPage and setText methods. Remember to import javax.swing.text.* and java.net.URL along with the standard Swing imports.

You will probably want to use the GridBagLayout to layout the buttons and the text display in the same panel. This will likely give the best results when the Window is re-sized (you want the display parts of the window to re-size vertically, but probably not the buttons and the search text and url text boxes).

To get the query result to print out nicely in the display, you need to convert the query results to a string (either in html or ASCII). One way to do this is to add a queryToString method (or a queryToHTMLString method) of the ProcessQuery class that takes as input the result returned by the performQuery method and converts the results to an string generating "\n" for new lines or < br > for new lines. You can then call setText on the JEditorPane to display the string.

Note: the links on the displayed webpage don't work. You do not have to add support to make links work. However, there is a way to add this functionality, and you can try it out if you'd like.

The name of your GUI class should be WebBrowser.java.  This is also the main class that you will run to launch the WebBrowser.

You might also find the Java Trail on Swing helpful:  http://download.oracle.com/javase/tutorial/ui/index.html

Part 2: Caching

The cache is represented as a hashtable where the key is the query string. The element stored with the query should be an instance of a new class called SavedResult. A SavedResult contains another hash table storing URLs with their count and a String answer.

hashtable cache
-------------------------------------------------------------------------
|key:		|	|	|	|	|	|	|	|
|  String  	| 	|	|	|	|	|	|	|
|element: 	|  	|	|	|	|	|	|	|
|  SavedResult	|	|	|	|	|	|	|       |
|		| 	|	|	|	|	|	|	|
|		|  |	|	|	|	|	|	|	|
-------------------|-----------------------------------------------------
		   |
                   \/
		SavedResult   
	        ---------     ----------------------------------------
		| table |---->| URL, count  |    |    |    |    |    |	
	        ---------     ----------------------------------------
		| answer|----> String containing query result, or null if  
		---------      this entry was created to satisfy a larger  
                | best  |      query of which this is a part
                | match |
                | URL   |----> The best matching URL as a string (this is for 
                ---------      automatically displaying the best matching URL
                               in your Web browser gui)


1. For each word in query, create an alphabetically ordered query
   string that is all one case and the same case as you stored tokens
   into your WordFrequencyTrees

2. Search for the query in the cache
        (a) if a match is found
            check answer field of entry
                if null, then create answer based on table
                return result 
                and add best matching URL string to this entry as well
 	(b) if a match is not found
	     (A) create a hashtable for the query string
	         for each word in the query
                 (1) if it is in the cache then get its hashtable and 
		     merge its contents with the hashtable for the query
		 (2) otherwise
		    (a) create an entry for the word by processing the 
		        WordFrequencyTrees for the word and creating a 
		        hashtable of (URL, count) entries hashed on URL
			(the result field is null)
		    (b) merge this hashtable with the hashtable for the query
	     (B) create a result priority queue from the entries in the query's hashtable
		 (priority is related to count field of URL), then use
		 this priority queue to create a String representation
	         of the query answer
	     (C) add the (query, (hashtable, answer)) to the cache
	     (D) return result to caller 


3. After each query print out:
	(a) whether or not the query was found in the cache 
	(b) if not, the number of query words that were found in the cache
The best place to implement caching is in your ProcessQueries class. However, you may also implement another class to perform the caching and make calls to the ProcessQueries class methods.

Think carefully about which implementation of the HashTable you plan to use for your cache: does it make a difference?


CLASSES

Classes you'll need for this assignment include the classes you developed for part 2 of the project, plus the following (available in your dropbox folder):


SUBMITTING THE ASSIGNMENT

Submit the following via dropbox:
  1. All .java files necessary for compiling your code (including the necessary ones from part 2 of the project).
  2. All .class files of your code located in the proper package directories (including the necessary ones from part 2 of the project).
  3. An html_ignore File, containing tokens that should be ignored from an html input file.
  4. A urlListFile of URLs on which you tested your program containing 100 - 200 URLs.
  5. A README file with your name and your partner's name (if you had one).

If you work with a partner, please only one of you submit your joint solution via dropbox.
This assignment is based on a course project created by Tia Newhall and Lisa Meeden.