Project Part 3:  Web Crawling, Caching Search Results, and (optional) GUI

Due: Wednesday, December 12, 2012 by 11:59:59pm

You may work with one partner on this assignment.  Anyone from the class is fine.
If you were unable to complete Project 2, then I strongly recommend you work with a partner who did complete it and start from their code.

You can download all starting code for this project from here.


COURSE PROJECT DESCRIPTION

This assignment is the third and final part in a series of related assignments about the World Wide Web. Our ultimate goal is to build a web crawler and search engine for a limited portion of the Web. Your search engine will have some of the features of common search engines such as Yahoo, Lycos, or Google. By the end of the semester you will have implemented a Web browser with a fairly sophisticated search engine, capable of crawling a portion of the Web, caching query results, and (optionally) presenting a GUI front-end for fetching pages and running searches.

I strongly encourage you to work in pairs on the project assignments.


PROBLEM DESCRIPTION

This program consists of three main (and separable) parts:
  Part 1: Web Crawling
  Part 2: Caching
  Part 3 (optional): GUI

The solution that you submit should link the three parts together. However, you can start working on each part independently. Parts 1 and 3 will not take you that long; Part 2 will take longer since you must implement a hash table. When you add Part 2 to the GUI Web browser, print the caching information to the search results window as well as to stdout.

Part 1: Web Crawling

For this part, you will have your web browser automatically discover new links given a starting point. You will discover these links by parsing the starting url's file, finding the href links to other local webpages that it contains, parsing those pages in turn, and so on. Specialized programs, called Web crawlers, do this on the web to discover new pages for search engines.

Your Web browser will start by taking two or three command line arguments: the first is a start_url, the second is link_depth (the depth to which you will follow links from the start_url), and the third is an optional html_ignore file. For example, you may run your program like this:

	% java WebBrowser www.brynmawr.edu 4 html_ignore_file
Starting at the initial URL, it will crawl the web pages, following href links to a depth of 4 and yielding a list of URLs.  This list of URLs will then be used in place of the urls contained in the urlListFile from previous parts. For each url, you will create a URLContent object by parsing its file and creating its WordFrequency tree. You should be able to just make a few modifications to your ProcessQueries class to get this to work.

You have been given a HREFScanner class that will take care of all the ugly parsing of url links. You initialize it with a url; it then opens the url's file, scans the page for href tags, and returns the next valid url link from the scanned page each time its getNextToken() method is called.

Getting Started

Here is a breakdown of the specific changes you will need to make to your existing code (again, the exact changes you may need to make will vary based on your specific implementation):
Change ProcessQueries
---------------------
(*) add a constructor that takes a start url and link_depth 
    limit (and optionally an html_ignore_list) as input, and 
    (1) recursively discovers the URLs up to the specified depth
    (2) creates the URLContent list and cache as before

Here is pseudocode for how to recursively discover the URLs:

List<String> urls; // class-level variable

void crawlWeb(String url, int depth)
    base case:  return if depth == 0
   
    parse the url using the HREFScanner

    for each link given by the HREFScanner, do

        if the link is not a valid html file (e.g., it does not end in .html), skip it

        if ( ! urls.contains(link))
            urls.add(link)
            crawlWeb(link, depth-1)
       

The data structure used for urls in the pseudocode is a List.  You are encouraged to choose a different data structure for urls, picking one which is more efficient.  Which one should you use?
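
If it helps to see the pseudocode fleshed out, here is one possible Java sketch of the new constructor and crawling method inside ProcessQueries. It keeps the List from the pseudocode (swap in your more efficient choice), and it assumes that the HREFScanner is constructed from a url String and that its getNextToken() method returns null once a page has no more links; check the provided class for the actual details. The ignore-file handling is omitted.

import java.util.ArrayList;
import java.util.List;

// Inside your ProcessQueries class (Project 2 fields and methods omitted).
private List<String> urls = new ArrayList<String>();   // discovered urls

public ProcessQueries(String startURL, int depth, String ignoreFile) {
    urls.add(startURL);                  // the start url is part of the collection
    crawlWeb(startURL, depth);           // (1) recursively discover URLs up to the depth limit
    // (2) build the URLContent list and the WordFrequency trees from urls, as in Project 2
}

private void crawlWeb(String url, int depth) {
    if (depth == 0) {                    // base case: depth limit reached
        return;
    }
    // ASSUMPTION: HREFScanner takes the url in its constructor and getNextToken()
    // returns null when the scanned page has no more href links.
    HREFScanner scanner = new HREFScanner(url);
    String link;
    while ((link = scanner.getNextToken()) != null) {
        if (!link.endsWith(".html")) {   // skip links that are not valid html files
            continue;
        }
        if (!urls.contains(link)) {      // visit each page only once
            urls.add(link);
            crawlWeb(link, depth - 1);   // follow the link one level deeper
        }
    }
}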



Part 2: Caching

Below is a demonstration of how your more efficient query processing program should behave. You will save all previous query results in a data structure called the cache. When the user types in a query, you will first check the cache to see if you have processed a similar request in the past.

#####################################
Enter next Query String or -1 to quit
#####################################
artificial

  Results for your Query "artificial":
  -----------------------------------------------
  Query: artificial
        Query Result NOT Found in cache        <-------- print out whether or not query is in cache

  Matching URLs:
  cs.brynmawr.edu/~eeaton   priority: -8

#####################################

Enter next Query String or -1 to quit
#####################################
artificial intelligence

  Results for your Query "artificial intelligence":
  -----------------------------------------------
  Query: artificial intelligence
        Query Result NOT Found in cache

  Matching URLs:
  cs.brynmawr.edu/~eeaton   priority: -17

#####################################

Enter next Query String or -1 to quit
#####################################
intelligence artificial

  Results for your Query "intelligence artificial":
  -----------------------------------------------
  Query: artificial intelligence
        Query Result Found in cache

  Matching URLs:
  cs.brynmawr.edu/~eeaton   priority: -17


Notice that the first time the user asks about "artificial intelligence" we cannot find anything in the cache. But the next time the user asks about "intelligence artificial" we can grab the stored information that we previously displayed for "artificial intelligence" and return it immediately.  Also note that we want to be able to recognize that a query is the same regardless of the word order, so "artificial intelligence" is the same as "intelligence artificial".
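
One straightforward way to make word order irrelevant is to normalize each query before using it as a cache key: split it into words, convert them to one case (the sketch below assumes you stored your WordFrequencyTree tokens in lower case), sort them alphabetically, and rejoin them. The method name here is just a suggestion.

import java.util.Arrays;

// Inside ProcessQueries: turns "Intelligence  ARTIFICIAL" into "artificial intelligence"
// so that queries containing the same words always map to the same cache key.
private String normalizeQuery(String query) {
    String[] words = query.trim().toLowerCase().split("\\s+");  // split on whitespace
    Arrays.sort(words);                                         // alphabetical order
    StringBuilder key = new StringBuilder();
    for (int i = 0; i < words.length; i++) {
        if (i > 0) {
            key.append(" ");
        }
        key.append(words[i]);
    }
    return key.toString();
}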

To get the query result to print out nicely in the display, you need to convert the query results to a string (either in HTML or plain ASCII). One way to do this is to add a queryToString method (or a queryToHTMLString method) to the ProcessQueries class that takes as input the result returned by the performQuery method and converts it to a String, using "\n" (or <br> in the HTML version) for new lines. You can then call setText on the JEditorPane to display the string.
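
For example, if your performQuery method returns something you can iterate over to get a url and a priority for each match, queryToString might look roughly like the sketch below. The Result type is only a stand-in for whatever your Project 2 code actually returns, so adapt the loop body to your own class.

import java.util.List;

// "Result" is a placeholder for whatever performQuery returns for each matching url.
class Result {
    String url;
    int priority;
}

// Inside ProcessQueries: turn a query's results into one display string.
public String queryToString(String query, List<Result> results) {
    StringBuilder sb = new StringBuilder();
    sb.append("Results for your Query \"").append(query).append("\":\n");
    sb.append("Matching URLs:\n");
    for (Result r : results) {
        sb.append("  ").append(r.url)
          .append("   priority: ").append(r.priority).append("\n");
    }
    return sb.toString();   // a queryToHTMLString version would append "<br>" instead of "\n"
}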

Getting Started

Your first step should be to implement either a ChainingHashTable or an OpenAddressingHashTable (i.e., one that uses probing).  Your hash table implementation should use the provided dictionary interface in Dictionary.java.   Think carefully about which implementation of the HashTable you plan to use for your cache: does it make a difference?

Once your HashTable implementation is complete and working, use it to implement the cache.

The cache is represented as a hashtable where the key is the query string (with the words reordered to be in alphabetical order). The element stored with the query should be the String answer for that query.

hashtable cache
--------------------------------------------------------------
| key:      |      |      |      |      |      |      |      |
|   String  |      |      |      |      |      |      |      |
| element:  |      |      |      |      |      |      |      |
|   String  |      |      |      |      |      |      |      |
--------------------------------------------------------------
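
As the diagram shows, both the key and the element in the cache are Strings. If you take the chaining approach, a minimal sketch might look like the following. Note that the provided Dictionary.java defines the actual interface (and method names) your table must implement; the insert and find names below are only placeholders for illustration, so match the real interface when you write your own.

import java.util.LinkedList;

// A minimal chaining hash table sketch mapping String keys to String elements.
public class ChainingHashTable {
    private static class Entry {
        String key;
        String element;
        Entry(String key, String element) { this.key = key; this.element = element; }
    }

    private LinkedList<Entry>[] table;   // one linked list ("chain") per bucket

    @SuppressWarnings("unchecked")
    public ChainingHashTable(int numBuckets) {
        table = new LinkedList[numBuckets];
        for (int i = 0; i < numBuckets; i++) {
            table[i] = new LinkedList<Entry>();
        }
    }

    // Map a key to a bucket index using Java's built-in String hashCode.
    private int bucketFor(String key) {
        return Math.abs(key.hashCode() % table.length);
    }

    public void insert(String key, String element) {
        table[bucketFor(key)].add(new Entry(key, element));
    }

    // Returns the element stored with key, or null if the key is not present.
    public String find(String key) {
        for (Entry e : table[bucketFor(key)]) {
            if (e.key.equals(key)) {
                return e.element;
            }
        }
        return null;
    }
}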



1. From the words in the query, create an alphabetically ordered query
   string whose words are all in one case (the same case you used when
   storing tokens into your WordFrequencyTrees)

2. Search for the query in the cache
        (A) if a match is found,
            immediately return and display the corresponding answer
        (B) if a match is not found,
            (a) create a result priority queue, just as you did in
                Project 2, to yield the answer
            (b) add the (query, answer) pair to the cache
            (c) return the answer to the caller

3. After each query, print out:
        (a) whether or not the query was found in the cache
        (b) the answer for that query

The best place to implement caching is in your ProcessQueries class. However, you may also implement another class to perform the caching and make calls to the ProcessQueries class methods.
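
Putting the steps together, the query-handling code in ProcessQueries might look roughly like the sketch below. It reuses normalizeQuery from the earlier sketch, assumes cache is whichever hash table implementation you built (with the placeholder insert/find names used above), and treats performQuery and queryToString as the methods from your own code.

// Inside ProcessQueries: answer a query, using the cache whenever possible.
public String processQuery(String query) {
    String key = normalizeQuery(query);        // step 1: one case, alphabetical order
    String answer = cache.find(key);           // step 2: look for the query in the cache
    boolean foundInCache = (answer != null);
    if (!foundInCache) {
        // build the answer with a result priority queue, as in Project 2
        answer = queryToString(query, performQuery(query));
        cache.insert(key, answer);             // remember it for next time
    }
    // step 3: report whether the query was cached, then the answer itself
    System.out.println(foundInCache ? "\tQuery Result Found in cache"
                                    : "\tQuery Result NOT Found in cache");
    System.out.println(answer);
    return answer;
}
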
Optional Part 3: GUI  [5 pts Extra Credit]

Build a GUI front-end to your search engine. It should have the following 6 GUI components:
  1. a URL text box (where the user can type in a url)
  2. a fetch URL button
  3. a display area for fetched urls
  4. a query text box (where the user can type in a query)
  5. a search button
  6. a display area for search results

Your WebBrowser's main method will read in the starting url and depth (for part 1) and ignore_list command line options, create a ProcessQueries object (from Project 2), and then create the GUI front-end.
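
A rough sketch of that main method is below. The ProcessQueries constructor is the one described in Part 1 (start url, depth, and an optional ignore file); how you hand the ProcessQueries object to the provided GUI code depends on that code, so that step is left as a comment.

// In cs206proj.part3.WebBrowser
public static void main(String[] args) {
    if (args.length < 2 || args.length > 3) {
        System.out.println("usage: java cs206proj.part3.WebBrowser startingURL depth [ignoreFile]");
        return;
    }
    String startURL   = args[0];
    int depth         = Integer.parseInt(args[1]);
    String ignoreFile = (args.length == 3) ? args[2] : null;   // the ignore file is optional

    // Crawl, build the URLContent list, and set up the cache (see Parts 1 and 2).
    ProcessQueries processQueries = new ProcessQueries(startURL, depth, ignoreFile);

    // Then construct the provided GUI front-end, handing it the ProcessQueries
    // object so that the search button can call its query methods.
}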

The WebBrowser GUI ties the components above together. Most of the GUI has already been implemented for you; your effort will be connecting what you've already written to this code.

Your main class must be called WebBrowser.java and must take in the same command line arguments as Part 1:
java cs206proj.part3.WebBrowser startingURL depth ignoreFile

See part 1 for a more detailed description of these command line arguments.

Getting Started

Start by reading the Tutorial example about the JEditorPane class, and try out the example (TextSamplerDemoHelp.java) to get an idea of how to display .html files and search results (look at the JEditorPane's setPage and setText methods), and how to associate scroll bars with panes:
http://download.oracle.com/javase/tutorial/uiswing/components/editorpane.html
Getting your GUI to display url files and to display search result strings is easy if you use the JEditorPane's setPage and setText methods. Remember to import javax.swing.text.* and java.net.URL along with the standard Swing imports.
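
For example, the two display areas can each be a JEditorPane wrapped in a JScrollPane, roughly as in the sketch below (the url shown is just a placeholder):

import java.awt.BorderLayout;
import java.io.IOException;
import java.net.URL;
import javax.swing.JEditorPane;
import javax.swing.JFrame;
import javax.swing.JScrollPane;

public class EditorPaneSketch {
    public static void main(String[] args) {
        JEditorPane pagePane = new JEditorPane();      // displays fetched url files
        pagePane.setEditable(false);
        JEditorPane resultsPane = new JEditorPane();   // displays search-result strings
        resultsPane.setEditable(false);

        // setPage fetches and renders a url's file; it throws IOException on failure.
        try {
            pagePane.setPage(new URL("http://www.example.com/index.html"));  // placeholder url
        } catch (IOException e) {
            pagePane.setText("Could not load page: " + e.getMessage());
        }

        // setText displays a plain string; set the content type to "text/html" instead
        // if your queryToHTMLString uses <br> for line breaks.
        resultsPane.setText("Results for your Query \"artificial\":\n"
                + "cs.brynmawr.edu/~eeaton   priority: -8");

        // Wrapping each pane in a JScrollPane provides the scroll bars.
        JFrame frame = new JFrame("JEditorPane demo");
        frame.add(new JScrollPane(pagePane), BorderLayout.CENTER);
        frame.add(new JScrollPane(resultsPane), BorderLayout.SOUTH);
        frame.setSize(600, 400);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setVisible(true);
    }
}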

Note: the links on the displayed webpage don't work. You do not have to add support to make links work. However, there is a way to add this functionality, and you can try it out if you'd like.

The name of your GUI class should be WebBrowser.java.  This is also the main class that you will run to launch the WebBrowser.

You might also find the Java Trail on Swing helpful:  http://download.oracle.com/javase/tutorial/ui/index.html

You are welcome to modify the provided sample GUI however you like.  If you do, you will probably want to use the GridBagLayout to lay out the buttons and the text display in the same panel. This will likely give the best results when the window is resized (you want the display parts of the window to resize vertically, but probably not the buttons or the search text and url text boxes).
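
For example, giving only the display rows a non-zero weighty makes them absorb extra vertical space when the window grows, while the text boxes and buttons keep their natural height. The sketch below shows the idea for two rows; adapt the grid positions to your own layout.

import java.awt.GridBagConstraints;
import java.awt.GridBagLayout;
import javax.swing.JEditorPane;
import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.swing.JScrollPane;
import javax.swing.JTextField;

public class LayoutSketch {
    public static void main(String[] args) {
        JPanel panel = new JPanel(new GridBagLayout());
        GridBagConstraints c = new GridBagConstraints();
        c.fill = GridBagConstraints.BOTH;   // let components stretch to fill their cells
        c.gridx = 0;
        c.weightx = 1.0;                    // everything stretches horizontally

        // Row 0: the url text box keeps its natural height (weighty = 0).
        c.gridy = 0; c.weighty = 0.0;
        panel.add(new JTextField(40), c);

        // Row 1: the display area absorbs any extra vertical space (weighty = 1).
        c.gridy = 1; c.weighty = 1.0;
        panel.add(new JScrollPane(new JEditorPane()), c);

        JFrame frame = new JFrame("GridBagLayout sketch");
        frame.add(panel);
        frame.setSize(500, 400);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setVisible(true);
    }
}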


Extra Credit

Do not start on the extra credit until you have completed the regular part of the assignment.
You can receive up to 10 points worth of extra credit on this assignment by implementing one or more of the following extensions. Extra credit is graded on an all-or-nothing basis: either the feature completely works or it doesn't.  No extra credit will be given for partially implemented extra credit features.  Also, I will not provide you with assistance on completing the extra credit; you must figure out how to do it yourselves.

If you implement one or more extra credit features, be sure to include a description of the feature and how to test it in the README file.



CLASSES

Classes you'll need for this assignment include the classes you developed for part 2 of the project, plus the provided starting code, which includes the HREFScanner class, the Dictionary.java interface, and the sample GUI code.

You can download all code from here.


SUBMITTING THE ASSIGNMENT

Submit the following via dropbox:
  1. All .java files necessary for compiling your code (including the necessary ones from parts 1&2 of the project).
  2. All .class files of your code located in the proper package directories (including the necessary ones from parts 1&2 of the project).
  3. An html_ignore file, containing tokens that should be ignored when parsing an html input file.
  4. A README file with your name and your partner's name (if you had one).
If you developed in Eclipse, simply compress your entire Eclipse project folder into a single .tar or .zip archive, and then copy this archive into dropbox to submit all of these files.

Once you have submitted your code, I strongly encourage you to test it, either from within dropbox using the command line, or by copying it to a clean location and then attempting to run it.

If you work with a partner, please have only one of you submit your joint solution via dropbox.


ECLIPSE PROJECT ORGANIZATION -- Important!

You should work within the same Eclipse project that you did for parts 1 and 2.  That project should contain three packages:  cs206proj.part1, cs206proj.part2, and cs206proj.part3.  All new code should be placed within the cs206proj.part3 package.  This will enable you to build off parts 1 and 2 of the project easily.

When you submit your code, you must be certain to submit all of the code from parts 1 and 2 that you used in this project as well.
This assignment is based on a course project created by Tia Newhall and Lisa Meeden.