Project Part 4:  Adding a Hyperlink Graph to Your WebBrowser

Due: Wednesday, December 7, 2011 by the start of class
Friday, December 9, 2011 at 11:59pm


You may work with one partner on this assignment.  Anyone from the class is fine.


PROBLEM DESCRIPTION

For this program you will add a graph of URL links to your Web browser. You will create the graph by parsing a starting url's file and finding href links of other local webpages, and parsing them, and so on. Specialized programs, called WebCrawlers, do this on the web to discover new pages for search engines.  The graph will contain a vertex for each url, and an edge (u, v) if there is a link from url u to url v. The graph can be added as a data member to your ProcessQueries or WebBrowser class depending on how you implemented your solution to part 3.

Your Web browser will start by taking two or three command line arguments: the first is a start_url, the second is link_depth (the depth you will follow any link from the start_url), and the third is an optional html_ignore file. For example, you may run your program like this:

	% java WebBrowser www.brynmawr.edu 5 html_ignore_file
Starting at the initial URL, it will crawl the websites following href links to a depth of 5, yielding a set of vertices for the hyperlink graph and connections between them.  The list of vertices will then be used to create all the data structures needed by your part 3 solution, taking the place of the urlListFile command line argument from previous parts. For each url, you will create a URLContent object by parsing its file and create its WordFrequency tree. You should be able to just make a few modifications to your ProcessQueries class to get this to work.

You will use the graph in two ways:

  1. You will create a "Graph Window" button and add it to your Web browser. When this button is selected it will bring up a new window with the following options:
    • A display window to print results of button actions
    • A "Reachable From" button and text window: when a url is entered in the text window, and the button is selected, it calls the reachableFrom method of your graph and prints out the results to the display window.
    • A "Shortest Path" button and text window: when a url is entered in the text window, and the button is selected, it calls the shorestPath method of your graph and prints out the results of the shortest path from the url to all other urls reachable from it. You are required to implement the shortestPath method of the Graph class.  (this is now an extra credit option)
    • A "Print Graph" button: when selected, the graph is printed to the display window.

    The Graph Window GUI has already been implemented for you, you just need to add a button to your WebBrowser to pop-up the Graph Window.

  2. You will add linkage information to compute a good search result: If a page that matches a query string is linked-to by many other pages its priority should be increased some amount based on its linked-to degree. You should use this new criterion combined with word frequency count information to order query results.

You have been given a HREFScanner class that will take care of all the ugly parsing of url links. It works by first initializing it with a url, then it will open the url's file, scan its webpage for href tags and return the next valid url link from the scanned page when its getNextToken() method is called.


GETTING STARTED

Here is a breakdown of the specific changes you will need to make to your existing code (again, the exact changes you may need to make will vary based on your specific implementation):
Change ProcessQueries
---------------------
(*) add a graph data member
(*) add a constructor that takes a start url and link_depth 
    limit (and optionally an html_ignore_list) as input, and 
    (1) creates a link graph (following links only link_depth deep from the 
	start url)
    (2) creates URLContent list and cache as before
    (3) incorporates linked-to information into determining a URL's priority
        when ordering query results

Graph
-----
(*) implement the shortestPath method  (This is now an extra-credit option)
    you can test your shortestPath method before you have other parts of the
    program working.   Just create a weighted directed graph in the main 
    method of TryGraph.java and call your shortestPath method on different
    start vertices.  Even though all the edges in the link-graph will have a 
    weight value of one, your implemenation should be more general and should
    work correctly on any weighted directed graph.

WebBrowser
---------
(*) add Graphics button that pops up a graphics window
    (makes a GraphGui object visible)  


Java Classes

Classes you'll need for this assignment include the classes you developed for part 3 of the project, plus the following (available in your dropbox folder):
  1. HREFScanner.java: a scanner class whose constructor takes a url, and parses the url's file returning href links to possibly valid webpages as the next token.  The main() method demonstrates its use.
  2. GraphGui.java and TryGraphGui.java to test it: a Swing Gui for displaying information about your link graph. TryGraphGui, shows you how you can add a button to your Web browser to pop-up the GraphGui window.
  3. Graph.java, Vertex.java, Edge.java, Weighted.java, WeightedInteger.java, Locator.java, HeapLocatorPriorityQueue.java, LocatorPriorityQueue.java, InvalidVertexLabelException.java: Classes necessary to run the incomplete implementation of the Graph class. You will need to implement the shortestPath method.

Extra Credit

Do not start on extra credit parts until you have completed the regular part of the assignment 
You can receive up to 17 points worth of extra credit on this assignment by implementing one or more of the following extensions: Extra credit is graded on an all or nothing basis: either the feature completely works or it doesn't.  No extra credit will be given for partially implemented extra credit features.  Also, I will not provide you with assistance on completing the extra credit; you must figure out how to do it yourselves.

If you implement one or more extra credit features, be sure to include a description of the feature and how to test it in the README file.


SUBMITTING THE ASSIGNMENT

Submit the following via dropbox:
  1. All .java files necessary for compiling your code (including the necessary ones from earlier parts of the project).
  2. All .class files of your code located in the proper package directories (including the necessary ones for this part of the project).
  3. An html_ignore File, containing tokens that should be ignored from an html input file.
  4. A README file with your name and your partner's name (if you had one).

If you work with a partner, please only one of you submit your joint solution via dropbox.
This assignment is based on a course project created by Tia Newhall and Lisa Meeden.